[om] A Proposal for extending OpenMath with XML annotations

Andreas Franke afranke at ags.uni-sb.de
Sun Aug 18 14:38:28 CEST 2002


> We need to be able to include non OpenMath XML in an OpenMath object.

I think the same need arose in the design of OMDoc, and there the 
answer was the <data> element (which can there occur inside <private> 
or <code> elements) with an ANY content model; a new <OM...> element 
seems to be the analogous thing for the OMOBJ language, be it OMXML 
or something else.

[The relevant definition from OMDoc 1.1:
 <!ELEMENT data ANY>
 <!ATTLIST data %midmatter;
                format CDATA #IMPLIED
                href CDATA #IMPLIED
                size CDATA #IMPLIED>
]

> (2) the "backward-compatible" possibility would be to expand the
> definition of <OMATTP> attributions to allow arbitrary xml data as
> the attribute values.

I vote against the second alternative because it would mean that there 
is no strict encapsulation of the custom xml, the "arbitrariness" of 
the embedded xml would spread out to the other children of the OMATP 
element. One symptom is that you'd have to change the OMATP element 
to ANY content model in the DTD, whereas the current (OMS, (%omel;))+ 
is much nicer and more precise. I guess that tools would have some
problems with this approach, too, for the same reason.
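
To illustrate why the precise content model is useful to tools, here is a
minimal Python sketch of a checker for (OMS, (%omel;))+. The set OMEL is my
assumed expansion of %omel; (the DTD itself is authoritative), and the
function name omatp_is_valid is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed expansion of the %omel; parameter entity; check the real DTD.
OMEL = {"OMS", "OMV", "OMI", "OMB", "OMSTR", "OMF",
        "OMA", "OMBIND", "OME", "OMATTR"}

def omatp_is_valid(omatp):
    """Check the (OMS, (%omel;))+ content model: one or more key/value pairs."""
    children = list(omatp)
    if not children or len(children) % 2 != 0:
        return False
    keys, values = children[0::2], children[1::2]
    return (all(k.tag == "OMS" for k in keys)
            and all(v.tag in OMEL for v in values))

# A pair that fits the current content model.
strict = ET.fromstring(
    '<OMATP><OMS cd="mathml99" name="presentation-xml"/>'
    '<OMSTR>x</OMSTR></OMATP>'
)
# A pair whose value is arbitrary foreign xml.
loose = ET.fromstring(
    '<OMATP><OMS cd="xml" name="whatever"/>'
    '<mrow><mo>sin</mo></mrow></OMATP>'
)

print(omatp_is_valid(strict))  # True
print(omatp_is_valid(loose))   # False
```

With an ANY content model there is nothing left for such a check to verify,
which is the loss of precision I am worried about.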

> We would propose that:
>
>     If an OMATTP has an OMS key S_i with cd="xml", 
>     then the value O_i can be an arbitrary XML object.
>
>     In this case the name="...." field of the OMS can be used to
>     capture other information.

I hate the idea of introducing "fake" symbols (i.e. abusing the 
cd and name attribute of the OMS element for other purposes than 
referring to a symbol that is defined somewhere). This breaks my 
application (or at least requires very ugly hacks in the code), 
and it is not necessary: If you use a container element for your 
xml data, then the cd="xml" part becomes obsolete (since this 
information is supplied by the container element itself), and you 
can just use a real <OMS cd="mathml99" name="presentation-xml"/> for
the attribution. (If the set of strings is dynamic so that you can't
have a symbol for each of them, ordinary attribution might be all you
need to associate that string as an OMSTR with the xml container 
element. Alternatively, if the string is something that makes sense
to have for _all_ xml, then you could make the xml container element
have an attribute to hold it.)

> So if this is really needed I would propose to let 
> <OMSTR> ... </OMSTR> be able to hold any data.

Please don't do this. IMO there is a difference between text (similar
to #PCDATA content) and arbitrary xml (similar to ANY content),
and it would seem to be a mistake to make the OMSTR element more fuzzy 
by using it for more than text, especially given the mention of "string"
in its name. If there is a more general element like OMXML and someone
uses this element to contain plain strings, then it's just their problem
to deal with the unnecessary complexity overhead (they probably could 
have used OMSTR instead). On the other hand, if arbitrary xml is allowed
in OMSTR, then *everyone* is forced to deal with this additional 
complexity, even if they only have a need for plain text. OMSTR is 
currently very well defined, and I don't see a reason to change that.

> This would not break backwards-compatibility and the libraries
> could just add a method to set/get the data to/from the OMString 
> as XML.

I don't fully understand how this is supposed to work, but I fear
following this route would lead to a lot of problems: You would
have to use an additional flag in the internal representation of 
your <OMSTR>s that states whether the content was imported from
xml or not (otherwise you wouldn't be able to decide whether
to output some content as &lt;foo/&gt; or as <foo/>, just using
the heuristic of producing xml "if it is wellformed xml" would
be wrong because then you need to be able to preserve the
escaped form as well). If you have this bit internally, then
you need to make sure that you modify it when you change the
content dynamically, so what do you do if you have xml content
but you append some string which is not? and then append some
other string which makes it well-formed again? So wouldn't you
have to add a new parameter to the DOM's append method to indicate
whether to preserve this xml flag or reset it? and so on...
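
A small Python sketch of the problem, under my own assumptions about how
such a library might behave (looks_like_xml is a made-up name for the
"if it is wellformed xml" heuristic):

```python
import xml.etree.ElementTree as ET

def looks_like_xml(text):
    """The heuristic under discussion: treat content as xml if it parses."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Two different inputs: escaped text that merely mentions <foo/>, and
# (hypothetically allowed) real xml content inside an OMSTR.
as_text = ET.fromstring("<OMSTR>&lt;foo/&gt;</OMSTR>")
as_xml = ET.fromstring("<OMSTR><foo/></OMSTR>")

text_payload = as_text.text
xml_payload = ET.tostring(as_xml[0], encoding="unicode")

# Both payloads pass the heuristic, so without an extra flag the writer
# cannot tell which one has to be escaped on output.
print(looks_like_xml(text_payload), looks_like_xml(xml_payload))  # True True

# Appending strings flips the heuristic back and forth.
states = ["<a/>", "<a/> <!--", "<a/> <!-- note -->"]
flags = [looks_like_xml(s) for s in states]
print(flags)  # [True, False, True]
```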

> Suppose there is a (real!) CD "xml".  It would have names like "entity",
> "CDATA", "element", "attribute-name", "attribute-value"... all the names
> of concepts in the XML standard.  Perhaps even things like "namespace",
> but let's not get carried away.

This is an interesting topic. If you create such an "xml" CD,
please let me know. 

However, there is a fundamental difference between
- embedding xml in the document on the same level as the 
  surrounding xml, so that (existing) tools can operate on the 
  overall document, without having to resolve any indirection, 
  and
- including a _description of_ the embedded xml in the document,
  so that the embedded xml is on a different level. 

The current "escaped xml in an OMSTR" approach belongs to the
latter category, and so does an approach like <OMSTR><![CDATA[ 
<mrow>...</mrow> ]]></OMSTR>, I guess (if this is legal at all). 
The structural encoding based on a tree of OMA, OMS, and OMSTR elements 
using a new "xml" CD belongs there, too; thus it is not a solution 
if the goal is to get rid of the indirection completely.
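
The difference between the two levels can be sketched in Python as follows.
OMXML is a hypothetical container element here (it is not part of the
standard), and operators is just a helper name for this example.

```python
import xml.etree.ElementTree as ET

# Direct embedding in a hypothetical container element.
direct = ET.fromstring(
    "<OMOBJ><OMXML><mrow><mo>sin</mo><mn>1</mn></mrow></OMXML></OMOBJ>"
)
# Indirect embedding: the same xml, escaped inside an OMSTR.
indirect = ET.fromstring(
    "<OMOBJ><OMSTR>&lt;mrow&gt;&lt;mo&gt;sin&lt;/mo&gt;"
    "&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;</OMSTR></OMOBJ>"
)

def operators(root):
    """Collect the text of all <mo> elements reachable from root."""
    return [mo.text for mo in root.iter("mo")]

print(operators(direct))    # ['sin']
print(operators(indirect))  # []

# Resolving the indirection needs a second, explicit parsing step.
inner = ET.fromstring(indirect.find("OMSTR").text)
print(operators(inner))     # ['sin']
```

A generic tool sees the embedded elements only in the first case; in the
second case it sees one opaque string.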

In fact, this approach with an "indirection" allows for a variety
of encodings with different degrees of "structure":
 
- On one end of the spectrum, there is the "flat" representation 
  as a "black box", using an <OMSTR> with the escaped xml, 
  possibly attributed with an indication that the OMSTR contains 
  xml, e.g. 

<OMATTR>
  <OMATP>
    <OMS cd="openmath" name="OMSTR-content-type"/>
    <OMS cd="content-types" name="whatever...mathml"/>
    <!-- or maybe <OMSTR>application/xml</OMSTR>, or some other string -->
  </OMATP>
  <OMSTR> &lt;mrow&gt; &lt;mo&gt;sin&lt;/mo&gt; ...</OMSTR>
</OMATTR>

  This works for arbitrary content types and doesn't require 
  any knowledge about the dtd.

- The next step in the parsing process could be to consider
  the xml string as a sequence of characters and entity 
  references, like this:
  
<OMA>
  <OMS cd="xml" name="char-sequence"/>
  <OMA>
    <OMS cd="ascii" name="char"/>
    <OMI>32</OMI> <!-- space -->
  </OMA>
  <OMA>
    <OMS cd="ascii" name="char"/>
    <OMI>60</OMI> <!-- &lt; -->
  </OMA>
  ...
  <OMA>
    <OMS cd="ascii" name="char"/>
    <OMI>38</OMI> <!-- &amp; -->
  </OMA>
  <OMA>
    <OMS cd="ascii" name="char"/>
    <OMI>65</OMI> <!-- A -->
  </OMA>
  ...
</OMA>

- Next, tokenization gives you a sequence of SAX events:

<OMA>
  <OMS cd="xml" name="token-sequence"/>
  <OMA>
    <OMS cd="xml" name="characters"/>
    <OMSTR> </OMSTR>
  </OMA>
  <OMA>
    <OMS cd="xml" name="start-tag"/>
    <OMSTR>mrow</OMSTR>
  </OMA>
  ...
  <OMA>
    <OMS cd="xml" name="entity-ref"/>
    <OMSTR>ApplyFunction</OMSTR>
  </OMA>
  ...
</OMA>

- Then, you can transform the list into a tree structure by matching
  start- and end-tags, etc. 

<OMA>
  <OMS cd="xml" name="content"/>
  <OMSTR> </OMSTR>
  <OMA>
    <OMS cd="xml" name="element"/>
    <OMSTR>mrow</OMSTR>
    <OMA>
      <OMS cd="xml" name="content"/>
      <OMSTR> </OMSTR>
      ...
      <OMA>
        <OMS cd="xml" name="entity-ref"/>
        <OMSTR>ApplyFunction</OMSTR>
      </OMA>
      ...
    </OMA>
    ...
  </OMA>
  ...
</OMA>

I think this is what you propose. But if you only want to
_embed_ some external xml, not _understand_ or further
process it, then compared to the "flat" <OMSTR>...</OMSTR> 
approach this "structural" approach requires a significant 
transformation overhead without enough gain to justify it,
IMO.
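
To give an idea of the transformation involved, the tokenization and
tree-building steps from the list above can be sketched in Python. This is
a toy under stated assumptions: the regular expression handles only plain
start-tags, end-tags, named entity references and character data (no
attributes, comments, or empty-element tags), and the tuple shapes and the
names tokenize and build_tree are my own invention.

```python
import re

TOKEN = re.compile(
    r"<(/?)([A-Za-z][\w.-]*)\s*>"   # start-tag or end-tag without attributes
    r"|&([A-Za-z][\w.-]*);"         # named entity reference
    r"|([^<&]+)"                    # character data
)

def tokenize(xml):
    """Turn a string into (kind, value) tokens, similar to SAX events."""
    tokens, pos = [], 0
    for m in TOKEN.finditer(xml):
        if m.start() != pos:
            raise ValueError("unsupported markup at offset %d" % pos)
        pos = m.end()
        closing, tag, entity, chars = m.groups()
        if tag:
            tokens.append(("end-tag" if closing else "start-tag", tag))
        elif entity:
            tokens.append(("entity-ref", entity))
        else:
            tokens.append(("characters", chars))
    if pos != len(xml):
        raise ValueError("unsupported markup at offset %d" % pos)
    return tokens

def build_tree(tokens):
    """Match start- and end-tags with a stack to get nested content."""
    root = ("content", [])
    stack = [root]
    for kind, value in tokens:
        if kind == "start-tag":
            node = ("element", value, [])
            stack[-1][-1].append(node)
            stack.append(node)
        elif kind == "end-tag":
            node = stack.pop()
            if node[0] != "element" or node[1] != value:
                raise ValueError("mismatched end-tag: " + value)
        else:
            stack[-1][-1].append((kind, value))
    if len(stack) != 1:
        raise ValueError("unclosed element")
    return root

sample = " <mrow> <mo>sin</mo> &ApplyFunction; <mn>1</mn> </mrow>"
tokens = tokenize(sample)
tree = build_tree(tokens)
print(tokens[:3])
print(tree[1][1][1])  # mrow
```

Each tuple would then still have to be written out as an OMA with the
corresponding symbol from the "xml" CD.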

However if you do want to further process it, then you
will need to do this specifically for your document type,
because e.g. you will need to distinguish between significant
and insignificant whitespace, and possibly make use of
other information (like the elements' content models)
provided by the dtd.

If you accept that you need a specialized encoding for
each document type, then you can get quite far towards
a more "semantic" representation, assuming that there 
exists a CD with symbol definitions for all elements, 
attributes, entities, and possibly other things that 
are used in your document:
  
<OMA>
  <OMS cd="mathml" name="mrow"/>
  <OMSTR> </OMSTR>
  <OMA>
    <OMS cd="mathml" name="mo"/>
    <OMSTR>sin</OMSTR>
  </OMA>
  <OMSTR> </OMSTR>
  <OMS cd="mathml" name="ApplyFunction"/>
  <OMSTR> </OMSTR>
  <OMA>
    <OMS cd="mathml" name="mn"/>
    <OMSTR>1</OMSTR>
  </OMA>
  <OMSTR> </OMSTR> 
</OMA>
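
A rough Python sketch of such a document-type-specific encoder, assuming
that the entity &ApplyFunction; has already been resolved to the character
U+2061 and that a "mathml" CD with these symbol names exists. The table
CHAR_SYMBOLS and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: characters that the assumed "mathml" CD has symbols for.
CHAR_SYMBOLS = {"\u2061": "ApplyFunction"}

def om_str(text):
    node = ET.Element("OMSTR")
    node.text = text
    return node

def text_nodes(text):
    """Split text into OMSTR runs and OMS symbols for known characters."""
    nodes, run = [], ""
    for ch in text or "":
        if ch in CHAR_SYMBOLS:
            if run:
                nodes.append(om_str(run))
                run = ""
            nodes.append(
                ET.Element("OMS", {"cd": "mathml", "name": CHAR_SYMBOLS[ch]})
            )
        else:
            run += ch
    if run:
        nodes.append(om_str(run))
    return nodes

def encode(elem):
    """Encode one element as an OMA headed by a symbol named after its tag."""
    oma = ET.Element("OMA")
    oma.append(ET.Element("OMS", {"cd": "mathml", "name": elem.tag}))
    oma.extend(text_nodes(elem.text))
    for child in elem:
        oma.append(encode(child))
        oma.extend(text_nodes(child.tail))
    return oma

source = ET.fromstring("<mrow> <mo>sin</mo> &#x2061; <mn>1</mn> </mrow>")
result = encode(source)
print([child.tag for child in result])
```

Note that this sketch keeps every whitespace run as an OMSTR; deciding
which of them are insignificant is exactly the part that needs knowledge
of the document type.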

You can even try to move the purely presentational parts
into attributions, but then it's probably better to go 
directly to OpenMath annotated with MathML-Presentation.

The variety of indirect representations should offer a 
suitable solution for each application, assuming that
they can deal with the indirection.
 
But if you decide to embed foreign xml in an OMOBJ without
any indirection, then please do it cleanly by introducing 
a new container element with content model ANY, and neither
by overloading the existing OMSTR element nor by diluting 
OMATP. :-)

... now that I'm about to send this mail, I realize that
the three mails starting with "Juergen Zimmer: Re: [om] " 
in the subject belong to this thread, too...

>  <OMSTR> <mrow> <mo>sin</mo> &ApplyFunction; <mn>1</mn> </mrow>

If you do want to use it just as a string, then please treat it
as such, either by escaping the tag markers or by using <![CDATA[.

If none of this is a viable solution, then I strongly prefer
extending the standard with a new element over changing (and 
unnecessarily complicating) the semantics of the existing standard.

> a) Can this be solved by a sensible use of namespaces?
> <om:OMATTR> 
> 	<mml:mrow> 
> 	<mml:mo>sin</mml:mo> &ApplyFunction; <mml:mn>1</mml:mn> 
> 	</mml:mrow>
> </om:OMATTR>

If you want to make this change, then I suggest you be consistent and 
- require that everyone always use namespaces, and
- liberalize all places where %omel; is used in the dtd and allow
  arbitrary foreign xml basically everywhere.

But I don't particularly like this proposal because 
- it would basically destroy the usefulness of the dtd (since almost 
  all elements would need to have an ANY content model), 
- namespaces are complicated, and it would be very rude to those
  people who don't use them yet, and
- my feeling is that this is against the general design of the
  OpenMath standard where you have very few well-defined elements
  and where you can be sure not to find any surprises.

If you don't generalize it and introduce a special case only for 
the OMATTR element, and only if namespaces are used, which however 
are still optional, then I call this inconsistent and nothing but 
an ugly hack that just does not belong in a standard.

(This may be a bit emotional, but please understand that I'm 
already having a hard time fighting my own bad hacks in my code,
and by all means I'd like to avoid being forced to introduce even 
more hacks because they are mandated by a standard.)

> b) Alternatively, can OM Objects refer to names of things outside 
> themselves, but in the same XML document?

> <om:OMATTR>
>   <om:OMA>
>     <om:OMS cd="extrefs" name="ref">
>     <om:OMSTR>x</om:OMSTR>
>   </om:OMA>
> </om:OMATTR>

This looks better to me. It probably could already be used
this way now, without any change, but if the idea of 
external references proves useful, then it may be worth
thinking about introducing an OMEXTREF element that points
to some arbitrary xml fragments, a bit like the <ref> element 
of OMDoc 1.1, but different in that it allows the targets 
to be non-OM and non-omdoc.
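
To make the idea a bit more concrete, here is a Python sketch of how a tool
might resolve such a reference inside the same document. Everything about
OMEXTREF is hypothetical, including the href attribute and the "#id"
fragment syntax, and so is the surrounding <doc> element.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<doc>
  <mrow id="x"><mo>sin</mo><mn>1</mn></mrow>
  <OMOBJ>
    <OMATTR>
      <OMATP>
        <OMS cd="mathml99" name="presentation-xml"/>
        <OMEXTREF href="#x"/>
      </OMATP>
      <OMV name="y"/>
    </OMATTR>
  </OMOBJ>
</doc>
""")

def resolve(root, ref):
    """Find the element in the same document that an OMEXTREF points to."""
    href = ref.get("href", "")
    if not href.startswith("#"):
        raise ValueError("this sketch handles same-document references only")
    target_id = href[1:]
    for elem in root.iter():
        if elem.get("id") == target_id:
            return elem
    return None

ref = doc.find(".//OMEXTREF")
target = resolve(doc, ref)
print(target.tag)  # mrow
```

The OMOBJ itself stays free of foreign xml; only the reference element
would be new.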

Cheers,
Andreas (Franke)




