XML topics

Some XML topics, and essentials

There is so much to read about XML on the web. But here I've put up a few XML topics which I found interesting.

1. XML namespaces

The following question was asked on the xml-dev list.

What is the real use of XML namespaces? I know that XML namespace is useful in XSL. But is there any real use of namespace to represent one xml document. Can somebody provide some example?

There were interesting replies.

James Fuller wrote:

XML namespaces exist to avoid name collisions.

It allows independent developers to create XML vocabularies which may use the same local name e.g. <first/> <last/> of an address book markup maybe different to <first/> <last/> element in a purchase order which requires a name for shipping...in any event we can ensure differentiation by associating a unique identifier to a namespace prefix e.g. <address:first/> and <purchase:first/>.

The unique identifier and prefix is associated using xmlns attribute...the URI (commonly urn or HTTP URI) is just an identifier. (check out http://www.rpbourret.com/xml/NamespacesFAQ.htm#q12_3).

if u mean if you have only one XML vocabulary and do not need to disambiguate between multiple vocabs...then admittedly this type of local processing probably doesnt really need namespaces.

though...as time goes on and with increased XML usage you will want to be able to work with composite documents which contain a number of XML vocabularies (dense and layered information=compelling)...and depending on some sort of physicality, like a seperate file in a file system demarcating differences in vocabulary just wont work.

G. Ken Holman wrote:

The use of namespaces in any XML document is the same use as is used in an XSL document: global vocabulary distinction and identification.
Because all XML is a labeled hierarchy of information and nothing more, using namespaces provides a rich method of writing the labels. This method distinguishes labels as being from different "owners" and, for each "owner", distinguishes labels from each other in a collection of labels.

I put the word owner in quotes because depending on the URI method you use for the namespace URI it may actually be owned through public registration (such as domain names in a URI) or it may not actually be owned but just private use (such as private-use conventions allow in a URN).
I published a new formatting semantic called the Page Sequence Master Interleave (PSMI) and I labeled it as a one-element vocabulary using a URI that I "own" as "http://www.CraneSoftwrights.com/resources/psmi". I also published a stylesheet that reads any instance of XSL-FO+PSMI and transforms it into a pure XSL-FO instance for use with any XSL-FO engine (since the engines don't recognize my label as representing my semantic).

Users add the element from this vocabulary to their XSL-FO whenever they want to take advantage of this new formatting concept. Because the element is labeled with a globally-unique label, no other stylesheet could mistakenly process the information labeled this way as some other semantic (though of course they could deliberately choose to process the information labeled this way as some other semantic).

For one of my customers working with a published vocabulary deployed across many (many!) installations, each installation can mix into instances of the published vocabulary their own labels in their own namespaces that represent their own semantic interpretations of the information they label that way. Which is all that I did when I added my own information with my own labels to an instance using the standard labels to represent the standardized semantics for the labeled information.

Steven J. DeRose wrote:

James Fuller > XML namespaces exist to avoid name collisions.

That's the key idea (IMHO).

If you have your schema, and I have mine, and we never have to share data, then namespaces may not be hugely useful. But if we eventually have to share our data, somehow we have to resolve cases where we use the same names to mean different things.

In that case, one of us can rename the elements in our data -- but that means *touching* the data, which risks corrupting the data no matter how careful we are. Instead (and somewhat simpler), we can assign our elements different namespaces:

mine:title may be somebody's job title on their application form

your:title may be a book title in a bibliography database.

If you use inheritance as much as possible for the namespace prefixes, this greatly reduces how much you have to touch the data. If you set up default attributes in your DTD you may not have to touch the data at all, which is ideal.

As James pointed out, namespaces also make it feasible for someone to create a nice schema for one particular thing, and let other people use it as a part of their schemas, without having to worry about name conflicts.

For example, a number of other schemas have adopted HTML's markup for tables (and another bunch use "CALS" tables, but that's another story I'll tell you over some beers sometime). If you code:

<!ATTLIST table xmlns CDATA "http://ns.example.org/html-tables">

Assuming you are using a parser that deals with the ATTLIST declaration, this will cause all <table> elements in the document to default to the meaning called "http://ns.example.org/html-tables". And then if you don't put namespace prefixes on any elements *within* the tables, they'll inherit the same namespace.
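The defaulting behaviour described here can be checked with a namespace-aware DOM parser. The following is only a minimal sketch: it assumes the default is declared in the form `<!ATTLIST table xmlns CDATA "...">` inside an internal DTD subset, and that the parser reads the DTD and applies attribute defaults (the JAXP parser bundled with the JDK is expected to do so). The class name, the method names and the `doc` wrapper element are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class DefaultedNamespaceDemo {
    static final String TABLE_NS = "http://ns.example.org/html-tables";

    // The internal DTD subset supplies a default xmlns attribute for <table>;
    // the table markup itself carries no namespace declaration at all.
    static final String XML =
        "<!DOCTYPE doc [\n"
      + "<!ATTLIST table xmlns CDATA \"" + TABLE_NS + "\">\n"
      + "]>\n"
      + "<doc><table><tr><td>cell</td></tr></table></doc>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Namespace URI of the first element with the given local name (null if it has none).
    static String namespaceOf(Document doc, String localName) {
        Element e = (Element) doc.getElementsByTagNameNS("*", localName).item(0);
        return e.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println("doc   : " + namespaceOf(doc, "doc"));
        System.out.println("table : " + namespaceOf(doc, "table"));
        System.out.println("td    : " + namespaceOf(doc, "td"));
    }
}
```

The unprefixed `tr` and `td` elements pick up the namespace from the defaulted declaration on their `table` ancestor, while the surrounding `doc` element stays outside it.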

The end result is that you can paste in your HTML tables without having to tweak them at all. This makes it much clearer what's going on, *and* it enables processors to detect what you're doing. For example, a tool that can extract information from HTML tables (or format them, or encrypt them, or sign them, or whatever), can reliably find them, despite their being embedded in documents with lots of other stuff going on.

This is useful. It would be more immensely useful if we had some nice standardized schema-bits for common elements such as addresses, bibliography entries, lists, appointment item, itinerary, product-catalog entry, and so on.
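To make the name-collision point from the replies above concrete, here is a small sketch in Java. It assumes two made-up namespace URIs for the address and purchase vocabularies (they are only identifiers, nothing is fetched from them), and the class and method names are hypothetical. Both vocabularies use the local name `first`, and a namespace-aware parser still tells them apart.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceCollisionDemo {
    // Hypothetical namespace URIs, used purely as identifiers.
    static final String ADDRESS_NS = "http://example.org/address";
    static final String PURCHASE_NS = "http://example.org/purchase";

    // One composite document mixing two vocabularies that share the local name "first".
    static final String XML =
        "<order xmlns:address='" + ADDRESS_NS + "' xmlns:purchase='" + PURCHASE_NS + "'>"
      + "<address:first>Ann</address:first>"
      + "<purchase:first>Bob</purchase:first>"
      + "</order>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // namespace processing is off by default in JAXP
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Text of the first <first> element that belongs to the given namespace.
    static String firstIn(Document doc, String namespaceUri) {
        return doc.getElementsByTagNameNS(namespaceUri, "first").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(XML);
        System.out.println("address:first  = " + firstIn(doc, ADDRESS_NS));
        System.out.println("purchase:first = " + firstIn(doc, PURCHASE_NS));
    }
}
```

The lookup is done by namespace URI and local name, so the prefixes chosen in the document do not matter to the application.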

2. Constructing XML from flat data file

The following question was asked on the xml-dev list.

What are the ways to construct a xml document from a data file using Java or C or Perl etc.
Performance wise which one to choose, when creating a xml document file out of a flat data file?

The following article compares different approaches:
http://www.tbray.org/ongoing/When/200x/2004/01/16/XML-Writing
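As one illustration in Java, a streaming writer keeps memory use flat and takes care of escaping. This is only a sketch and not the approach of the article above: it assumes a hypothetical flat file with one `name,quantity` record per line (every line must have both fields), and the element names `items`, `item`, `name` and `quantity` are made up. It uses the standard StAX `XMLStreamWriter`.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

class FlatFileToXml {
    // Converts "name,quantity" lines into <items><item>...</item></items>.
    // Assumes each line has at least two comma-separated fields.
    static String convert(List<String> lines) throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("items");
        for (String line : lines) {
            String[] fields = line.split(",", -1);
            xml.writeStartElement("item");
            xml.writeStartElement("name");
            xml.writeCharacters(fields[0]); // the writer escapes &, < and > for us
            xml.writeEndElement();
            xml.writeStartElement("quantity");
            xml.writeCharacters(fields[1]);
            xml.writeEndElement();
            xml.writeEndElement(); // </item>
        }
        xml.writeEndElement(); // </items>
        xml.flush();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(convert(Arrays.asList("Pens & pencils,3", "Paper,10")));
    }
}
```

For a real data file the `StringWriter` would be replaced by a file writer, so that records are written out as they are read instead of being collected in memory.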

3. Order of attributes

There was an interesting discussion on the xml-dev list about ordering of attributes in XML.

Richard Tobin wrote:
>I think that if anything, all discussion of processing, including attribute order, should be removed from the spec.
>XML is a syntax, and what applications do with it is entirely up to them.

I disagree. To invent a document format in which the order of attributes is significant and then claim that it conforms to XML would be misleading to say the least, since it would not be interoperable with the majority of XML tools (which don't preserve attribute order).

Vladimir Gapeyev wrote:

The question was probably in a different plane: not inventing a new format, but judging whether XML spec was true to its principle of separating the responsibilities of a parser and an application when it made attributes unordered. Attributes, as elements, _are_ ordered in the document, so wouldn't it be more appropriate to require a parser to report them in order and let the application to impose an unordered semantics, if it wishes? This would be similar to elements and the Schema's "interleave" operator that manifests "don't care" attitude of some applications to the order of elements.

More practically, if a parser guarantees reporting attributes to the application in their order of occurrence in a document, would this be a reason to declare the parser incompliant with XML 1.0? Is there a scenario where such a parser can create interoperability problems with existing XML tools? (There are, of course, problems for applications that use an order-preserving parser: they cannot reliably rely on order if they consume XML from applications that are incapable of ensuring an attribute order.)

Btw. the Rec (Section 3.1) says that "order of attribute specifications in a start-tag or empty-element tag is not significant", without apparently explaining what "being significant" means. Moreover, this phrase appears starting from 2nd edition only!

To this, Richard replied:
>Attributes, as elements, _are_ ordered in the document, so wouldn't it be more appropriate to require a parser to
>report them in order and let the application to impose an unordered semantics, if it wishes?
No, because that's not what attributes are for. The attributes of an element are uniquely identified by their names, and the children are not. The semantics of an attribute depends on its name, not its position within the start tag.

You can write down a matrix of the properties of attributes and child elements: whether they are named, whether their order is significant, whether they have recursive structure (it helps to have several dimensions to write in :-). The three I listed give you 8 possible combinations; XML (and SGML) only provide two of them, but they are two that cover a lot of useful cases.

Gavin Thomas Nicol wrote:
>More practically, if a parser guarantees reporting attributes to the application in their order of occurrence in a document,
>would this be a reason to declare the parser incompliant with XML 1.0?

>No, of course not. An XML editor is the obvious example of an application that can benefit from that information.
>But an XML editor does not use an XML document for its semantics.

That's not true... some/many editors are built directly on top of DOM implementations, or something akin to the DOM.

I reasoned:
I agree with Mr. Tobin. I would imagine elements as objects, and attributes as the object's properties. I think an object's properties are not supposed to be ordered (and neither are attributes). There was an interesting discussion on this topic.

Richard Tobin wrote further:
>I'm probably being thick.. but I don't understand this. That's probably because I think *semantics* are entirely in the eye of the beholder.

If they were *entirely* in the eye of the beholder, then XML wouldn't help interoperability at all. For example, as I suggested before, you could encode all your information in the spacing between attributes, but then what good would it be to use XML? No other XML tool would understand it.

Long ago ASCII, and later Unicode, saved us from having to decide how to encode characters for each application. Apart from saving us the effort of choosing, this had an obvious advantage: we could write all kinds of tool that were useful for many different file formats. We could edit Fortran programs and invoices with the same editor. We could grep for strings or count lines in any kind of text file.

XML does the same at the next level up: it saves us from choosing a format for simple nested structures and named attributes. And it lets us build generic tools that can do useful things with all kinds of XML document - consider XSLT for example.

We have gained that at the expense of removing some semantics from the eye of the beholder. We accept start and end tags as meaning some kind of nesting. We accept attributes as things that are identified by name, not position. And we accept white space inside tags as just being for formatting and readability.

A binary file could be considered to conform to the "grammar" of ISO Latin-1, because it consists of a sequence of 8-bit bytes. But it isn't Latin-1, because the bytes aren't interpreted in the way that Latin-1 specifies. Likewise a file that looks like XML but assigns significance to attribute order may conform to the grammar of XML, but it isn't XML.

And this is a good thing, because imposing such (minimal) semantics allows us to write a range of generic XML tools.

>It seems to be that *some* editors might understand *all* of the semantics of XML (as I believe you mean it),
>but augment them with constraints/semantics above and beyond those?

Yes, an XML editor could do that. But when it preserved the order of attributes, it would be operating merely on the syntax.
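The "attributes are identified by name, not position" view is also how the DOM exposes them. As a small sketch (the `book` element and its attributes are made up, and the class name is hypothetical), the two start tags below differ only in attribute order, and the standard DOM Level 3 method `isEqualNode` treats the resulting elements as equal:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeOrderDemo {
    // Parses a small document from a string and returns its root element.
    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // isEqualNode compares attributes as a name-keyed map, so the order
    // in which they were written in the start tag plays no part.
    static boolean sameElement(String xmlA, String xmlB) throws Exception {
        return parseRoot(xmlA).isEqualNode(parseRoot(xmlB));
    }

    public static void main(String[] args) throws Exception {
        String a = "<book id='b1' lang='en'/>";
        String b = "<book lang='en' id='b1'/>";
        // Attributes are looked up by name, never by position.
        System.out.println(parseRoot(b).getAttribute("id"));
        System.out.println(sameElement(a, b));
    }
}
```

An application that needs a fixed order (for example for canonical output) has to impose it itself, typically by sorting on the attribute names.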

4. When should I use elements, and when should I use attributes

The following information is useful:

http://xml.coverpages.org/elementsAndAttrs.html

5. Validation of XML document using JAXP 1.3 APIs

One of the samples provided with Xerces-J is jaxp.SourceValidator, which can validate multiple source XML documents against multiple XML Schemas. We can use this out-of-the-box utility to validate XML documents against XML Schemas.

Out of interest, I also implemented my own little validator, which uses the JAXP 1.3 APIs provided with Xerces-J.

I am providing both the DOM version and the SAX version. The sample XML document and the Schema I used can be downloaded from here: XML, Schema.

There is nothing complicated about these utilities. With the Xerces-J jars on the classpath, these programs can be compiled and run.

Example command lines follow:

java SchmValidateDOM books.xml
Line: 24, Column: 9, Message: cvc-complex-type.2.4.a: Invalid content was found
starting with element 'ISBN'. One of '{QUANTITY}' is expected.

java SchmValidateSAX books.xml
Line: 24, Column: 9, Message: cvc-complex-type.2.4.a: Invalid content was found
starting with element 'ISBN'. One of '{QUANTITY}' is expected.

The validator retrieves the Schema from the hint provided in an instance document.
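For readers who want to see the general shape of such a validator, here is a minimal sketch using the JAXP validation API. It is not the source of the SchmValidateDOM or SchmValidateSAX utilities; the class and method names are hypothetical. It relies on `SchemaFactory.newSchema()` with no arguments, which makes the validator locate schemas from the location hints (such as `xsi:noNamespaceSchemaLocation`) in the instance document.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class SchemaHintValidator {
    // Validates the file and returns one message per warning or error found.
    static List<String> validate(File xmlFile) throws Exception {
        final List<String> problems = new ArrayList<String>();
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // No schema is passed in: it is found through the hint in the instance document.
        Validator validator = factory.newSchema().newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(format(e)); }
            public void error(SAXParseException e) { problems.add(format(e)); }
            // Fatal errors (e.g. a document that is not well-formed) stop the run.
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        // A file-based source gives the parser a base URI for resolving relative hints.
        validator.validate(new StreamSource(xmlFile));
        return problems;
    }

    static String format(SAXParseException e) {
        return "Line: " + e.getLineNumber() + ", Column: " + e.getColumnNumber()
                + ", Message: " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        for (String problem : validate(new File(args[0]))) {
            System.out.println(problem);
        }
    }
}
```

Collecting the messages in the error handler, instead of throwing on the first error, lets a single run report every validity problem in the document.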

I hope that this is useful.

Notes: These utilities only demonstrate the JAXP validation APIs. If you are looking for a robust, production-quality command line XML Schema validator, I recommend the Xerces utility, jaxp.SourceValidator.

6. Creating an XML document with entity references, using DOM APIs

Let's say that we need to create the following XML documents:

[1]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY x 'hello'>
]>
<root>
<a>&x; world</a>
</root>

[2]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "sample.dtd">
<root>
<a>&x; world</a>
</root>

Document [1] declares the entity in an internal DTD subset.
Document [2] references an external DTD file through a system identifier.

Both documents reference an entity 'x', which is defined in the DTD. When an XML document is parsed, the parser expands each entity reference, replacing it with the value of the entity.

Creating or modifying an internal DTD subset is not possible with the standard DOM APIs. But Xerces-J provides an extension class, DocumentTypeImpl, which allows an internal DTD subset to be created or modified in the XML document.

I'm providing two Java programs below, which generate outputs [1] and [2] respectively.

CreateXMLdocDOM.java, CreateXMLdocDOM1.java
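As a rough idea of what building document [2] involves, here is a minimal sketch that uses only standard DOM calls (`createDocumentType`, `createEntityReference`). It is not the code of the programs listed above, and the class name is hypothetical. Serialization is deliberately left out, since how a serializer writes out entity reference nodes may vary between implementations.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class EntityRefDomDemo {
    // Builds the in-memory tree for document [2]:
    // a doctype pointing at sample.dtd, and <a> holding an entity reference plus text.
    static Document build() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
        // Corresponds to <!DOCTYPE root SYSTEM "sample.dtd"> (no public identifier).
        DocumentType doctype = impl.createDocumentType("root", null, "sample.dtd");
        Document doc = impl.createDocument(null, "root", doctype);

        Element a = doc.createElement("a");
        a.appendChild(doc.createEntityReference("x")); // the &x; reference
        a.appendChild(doc.createTextNode(" world"));
        doc.getDocumentElement().appendChild(a);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = build();
        Node first = doc.getDocumentElement().getFirstChild().getFirstChild();
        System.out.println("system id: " + doc.getDoctype().getSystemId());
        System.out.println("entity reference node: "
                + (first.getNodeType() == Node.ENTITY_REFERENCE_NODE) + ", name " + first.getNodeName());
    }
}
```

The DTD file is not read while the tree is built, so the sketch works even when sample.dtd does not exist; the entity only needs to be declared by the time the serialized document is parsed again.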

I hope that this is useful.

7. Adding external DTD subset (and entity definitions) programmatically

Below is a finding from a discussion I took part in on the xml-dev list.

Let's say we have the following XML file (named test.xml) [1]:

<?xml version="1.0" encoding="UTF-8" ?>
<x>
&message;
&copyright;
</x>

This file is not well-formed, which means it is not a legal XML document, because the entities message and copyright are referenced but not declared.

When an attempt is made to parse this XML file (using APIs like DOM or SAX from a Java application), we get an error like the following:

org.xml.sax.SAXParseException: The entity "message" was referenced, but not declared.

And the parsing process fails, and we cannot do anything more with this XML file.

To successfully parse an XML document, the XML parser must have access to the entity definitions at run time (so that the entity references can be resolved and replaced with their values).

The entity definitions are typically provided to the parser by an internal or an external DTD subset.

For example [2]:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE x SYSTEM "test.dtd">
<x>
&message;
&copyright;
</x>

The DTD test.dtd would provide the definitions of the referenced entities, for example:

<!ENTITY message 'hello'>
<!ENTITY copyright 'ABC inc'>

If the XML document [2] is provided to the XML parser, and the DTD test.dtd (with correct entity definitions) is accessible to the parser, parsing succeeds.

The purpose of this note is to demonstrate a technique for getting the parser to successfully parse a document like [1] above, which has no DOCTYPE declaration.

The solution is to use the SAX2 extension interface EntityResolver2 (or to use it indirectly through the SAX2 extension class DefaultHandler2).

The following Java code fragments solve this problem:

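import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

public class EntityTest extends DefaultHandler2 {

   public static void main(String[] args) throws Exception {
      String xmlFile = args[0];   // e.g. test.xml

      XMLReader reader = XMLReaderFactory.createXMLReader();
      // DefaultHandler2 implements EntityResolver2, so the parser can ask it
      // for an external subset when the document has no DOCTYPE declaration
      reader.setEntityResolver(new EntityTest());
      reader.parse(new InputSource(xmlFile));
   }

   // Supplies the entity definitions as the external DTD subset
   @Override
   public InputSource getExternalSubset(String name, String baseURI) {
        StringReader strReader = new StringReader("<!ENTITY message 'hello'> <!ENTITY copyright 'ABC inc'>");
        return new InputSource(strReader);
   }

   // some code here
}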

The key elements of the solution are the call to setEntityResolver and the getExternalSubset method.

The entity definitions are provided to the XML parser at run time by implementing the EntityResolver2 method getExternalSubset.

When the above Java class is run with the XML file [1] above as its argument, the file is parsed successfully.

The complete source code for the above Java class is available here.

The Java class can be run as follows:

java EntityTest test.xml

Thanks to Michael Glavassevich for suggesting an answer to this.

Notes:
I tested the example in this note, using Xerces-J 2.9.1 XML parser.

I hope that this is useful.

Acknowledgements:
1. Michael Glavassevich
2. Björn Höhrmann
3. David Carlisle
4. Andrew Welch

Useful references

* http://www.w3.org/standards/xml/ (W3C XML Technology home page)

* http://www.xml.org/ (An online community gathering place for those interested and involved in XML related standards and specifications - hosted by OASIS)

* http://www.ibm.com/developerworks/xml (IBM developerWorks - XML tutorials, code and forums)

* http://www.xml.com/ (O'Reilly's XML portal)

* http://xerces.apache.org/xerces2-j/ (Xerces Java XML parsers)

* http://cmsmcq.com/doclist.html (C. M. Sperberg-McQueen's XML pages)

* http://www.xfront.com/ (Roger L. Costello's XML pages)

* http://www.cafeconleche.org/ (Cafe con Leche XML News and Resources, Elliotte Rusty Harold)

* http://xml.silmaril.ie/ (Peter Flynn's XML FAQ)

Interesting articles / papers related to XML

* A technical introduction to XML (Norman Walsh)

* Collection of XML technology articles (Norman Walsh)

* Building workflow applications with XML and XQuery (Michael Kay)

* XML namespaces FAQ (Ronald Bourret)

* XML and Databases (Ronald Bourret)


Last updated: Aug 22, 2011