Some XML topics and essentials
There is a great deal to read about XML on the web. Here I have collected a few XML topics that I found interesting.
1. XML namespaces
The following question was asked on the xml-dev list.
What is the real use of XML namespaces? I know that XML namespace is useful in XSL. But is there any real use of namespace to represent one xml document. Can somebody provide some example?
There were interesting replies.
James Fuller wrote:
XML namespaces exist to avoid name collisions.
It allows independent developers to create XML vocabularies which may use the
same local name e.g. <first/> <last/> of an address book markup may be different
to <first/> <last/> element in a purchase order which requires a name for
shipping...in any event we can ensure differentiation by associating a unique
identifier to a namespace prefix
e.g. <address:first/> and <purchase:first/>.
The unique identifier and prefix is associated using xmlns attribute...the URI
(commonly urn or HTTP URI) is just an identifier. (check out
http://www.rpbourret.com/xml/NamespacesFAQ.htm#q12_3).
if u mean if you have only one XML vocabulary and do not need to disambiguate
between multiple vocabs...then admittedly this type of local processing probably
doesn't really need namespaces.
though...as time goes on and with increased XML usage you will want to be able
to work with composite documents which contain a number of XML vocabularies
(dense and layered information=compelling)...and depending on some sort of
physicality, like a separate file in a file system demarcating differences in
vocabulary just won't work.
G. Ken Holman wrote:
The use of namespaces in any XML document is the same use as is used in an
XSL document: global vocabulary distinction and identification.
Because all XML is a labeled hierarchy of information and nothing more, using
namespaces provides a rich method of writing the labels. This method
distinguishes labels as being from different "owners" and, for each "owner",
distinguishes labels from each other in a collection of labels.
I put the word owner in quotes because depending on the URI method you use for
the namespace URI it may actually be owned through public registration (such as
domain names in a URI) or it may not actually be owned but just private use
(such as private-use conventions allow in a URN).
I published a new formatting semantic called the Page Sequence Master Interleave
(PSMI) and I labeled it as a one-element vocabulary using a URI that I "own" as
"http://www.CraneSoftwrights.com/resources/psmi". I also published a stylesheet
that reads any instance of XSL-FO+PSMI and transforms it into a pure XSL-FO
instance for use with any XSL-FO engine (since the engines don't recognize my
label as representing my semantic).
Users add the element from this vocabulary to their XSL-FO whenever they want to
take advantage of this new formatting concept. Because the element is labeled
with a globally-unique label, no other stylesheet could mistakenly process the
information labeled this way as some other semantic (though of course they
could deliberately choose to process the information labeled this way as some
other semantic).
For one of my customers working with a published vocabulary deployed across many
(many!) installations, each installation can mix into instances of the published
vocabulary their own labels in their own namespaces that represent their own
semantic interpretations of the information they label that way. Which is all
that I did when I added my own information with my own labels to an instance
using the standard labels to represent the standardized semantics for the
labeled information.
Steven J. DeRose wrote:
James Fuller > XML namespaces exist to avoid name collisions.
That's the key idea (IMHO).
If you have your schema, and I have mine, and we never have to share data, then
namespaces may not be hugely useful. But if we eventually have to share our
data, somehow we have to resolve cases where we use the same names to mean
different things.
In that case, one of us can rename the elements in our data -- but that means
*touching* the data, which risks corrupting the data no matter how careful we
are. Instead (and somewhat simpler), we can assign our elements different
namespaces:
mine:title may be somebody's job title on their application form
your:title may be a book title in a bibliography database.
If you use inheritance as much as possible for the namespace prefixes, this
greatly reduces how much you have to touch the data. If you set up default
attributes in your DTD you may not have to touch the data at all, which is
ideal.
As James pointed out, namespaces also make it feasible for someone to create a
nice schema for one particular thing, and let other people use it as a part of
their schemas, without having to worry about name conflicts.
For example, a number of other schemas have adopted HTML's markup for tables
(and another bunch use "CALS" tables, but that's another story I'll tell you
over some beers sometime). If you code:
<!ATTLIST table xmlns CDATA "http://ns.example.org/html-tables">
Assuming you are using a parser that deals with the ATTLIST declaration, this
will cause all <table> elements in the document to default to the meaning called
"http://ns.example.org/html-tables". And then if you don't put namespace
prefixes on any elements *within* the tables, they'll inherit the same
namespace.
The end result is that you can paste in your HTML tables without having to tweak
them at all. This makes it much clearer what's going on, *and* it enables
processors to detect what you're doing. For example, a tool that can extract
information from HTML tables (or format them, or encrypt them, or sign them, or
whatever), can reliably find them, despite their being embedded in documents
with lots of other stuff going on.
This is useful. It would be more immensely useful if we had some nice
standardized schema-bits for common elements such as addresses, bibliography
entries, lists, appointment item, itinerary, product-catalog entry, and so on.
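The name-collision point made in the replies above can be sketched in Java. This is only an illustration, not code from the thread: the two namespace URIs, the sample document and the class name NamespaceDemo are all hypothetical. It parses one composite document with a namespace-aware JAXP DOM parser and counts the <first/> elements of each vocabulary separately.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceDemo {
    // Hypothetical namespace URIs; they serve only as identifiers.
    static final String ADDRESS_NS = "http://example.org/ns/address";
    static final String PURCHASE_NS = "http://example.org/ns/purchase";

    // A composite document mixing two vocabularies that share a local name.
    static final String SAMPLE =
        "<order xmlns:address='" + ADDRESS_NS + "' xmlns:purchase='" + PURCHASE_NS + "'>"
      + "<address:first>Jane</address:first>"
      + "<purchase:first>Widget</purchase:first>"
      + "<purchase:first>Gadget</purchase:first>"
      + "</order>";

    // Counts the elements with the given local name in the given namespace.
    static int count(String xml, String namespaceUri, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers ignore namespaces by default
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagNameNS(namespaceUri, localName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("address:first  = " + count(SAMPLE, ADDRESS_NS, "first"));
        System.out.println("purchase:first = " + count(SAMPLE, PURCHASE_NS, "first"));
    }
}
```

The lookup uses the namespace URI and the local name, never the prefix, so the same code keeps working if a document binds the same URIs to different prefixes.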
2. Constructing XML from a flat data file
The following question was asked on the xml-dev list.
What are the ways to construct a xml document from a data file using Java or C
or Perl etc.
Performance wise which one to choose, when creating a xml document file out of a
flat data file?
The following article compares different approaches:
http://www.tbray.org/ongoing/When/200x/2004/01/16/XML-Writing
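As one possible approach in Java (not necessarily the fastest, and not taken from the article), a streaming writer such as the StAX XMLStreamWriter can produce the XML while the flat file is read line by line. The sketch below assumes a pipe-delimited record format and hypothetical element names (books, book, title, author, price); the class name FlatFileToXml is also made up for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

class FlatFileToXml {
    // Assumed layout of each flat-file line: title|author|price
    static final String[] FIELD_NAMES = { "title", "author", "price" };

    // Converts the flat data to an XML string, one <book> per non-empty line.
    static String convert(String flatData) {
        try {
            StringWriter out = new StringWriter();
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("books");
            for (String line : flatData.split("\\R")) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] fields = line.split("\\|");
                xml.writeStartElement("book");
                // Write only the fields that are actually present on the line.
                for (int i = 0; i < FIELD_NAMES.length && i < fields.length; i++) {
                    xml.writeStartElement(FIELD_NAMES[i]);
                    xml.writeCharacters(fields[i]); // the writer escapes & and <
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("Tom & Jerry|A. Writer|9.99\nXML Basics|B. Author|19.50"));
    }
}
```

Letting the writer do the escaping is the main reason to prefer an XML API over plain string concatenation here: a field such as "Tom & Jerry" would otherwise yield a document that is not well-formed.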
3. Order of attributes
There was an interesting discussion on the xml-dev list about ordering of attributes in XML.
Richard Tobin wrote:
>I think that if anything, all discussion of processing, including attribute
>order, should be removed from the spec. XML is a syntax, and what applications
>do with it is entirely up to them.
I disagree. To invent a document format in which the order of attributes is
significant and then claim that it conforms to XML would be misleading to say
the least, since it would not be interoperable with the majority of XML tools
(which don't preserve attribute order).
Vladimir Gapeyev wrote:
The question was probably in a different plane: not inventing a new format, but
judging whether XML spec was true to its principle of separating the
responsibilities of a parser and an application when it made attributes
unordered. Attributes, as elements, _are_ ordered in the document, so
wouldn't it be more appropriate to require a parser to report them in order and
let the application to impose an unordered semantics, if it wishes? This
would be similar to elements and the Schema's "interleave" operator that
manifests "don't care" attitude of some applications to the order of elements.
More practically, if a parser guarantees reporting attributes to the
application in their order of occurrence in a document, would this be a reason
to declare the parser incompliant with XML 1.0? Is there a scenario where
such a parser can create interoperability problems with existing XML tools?
(There are, of course, problems for applications that use an order-preserving
parser: they cannot reliably rely on order if they consume XML from applications
that are incapable of ensuring an attribute order.)
Btw. the Rec (Section 3.1) says that "order of attribute specifications in
a start-tag or empty-element tag is not significant", without apparently
explaining what "being significant" means. Moreover, this phrase
appears starting from 2nd edition only!
To this, Richard replied:
>Attributes, as elements, _are_ ordered in the document, so wouldn't it be more
>appropriate to require a parser to report them in order and let the application
>to impose an unordered semantics, if it wishes?
No, because that's not what attributes are for. The attributes of an
element are uniquely identified by their names, and the children are not.
The semantics of an attribute depends on its name, not its position within the
start tag.
You can write down a matrix of the properties of attributes and child
elements: whether they are named, whether their order is significant, whether
they have recursive structure (it helps to have several dimensions to write in
:-). The three I listed give you 8 possible combinations; XML (and SGML)
only provide two of them, but they are two that cover a lot of useful cases.
Gavin Thomas Nicol wrote:
>More practically, if a parser guarantees reporting attributes to the
>application in their order of occurrence in a document, would this be a reason
>to declare the parser incompliant with XML 1.0?
>No, of course not. An XML editor is the obvious example of an application
>that can benefit from that information. But an XML editor does not use an XML
>document for its semantics.
That's not true... some/many editors are built directly on top of DOM
implementations, or something akin to the DOM.
I reasoned:
I agree with Mr. Tobin. I think of elements as objects, and of attributes as
the object's properties. An object's properties are not supposed to be
ordered, and neither should attributes be. There was an interesting
discussion on this topic.
Richard Tobin wrote further:
>I'm probably being thick.. but I don't understand this. That's probably because
>I think *semantics* are entirely in the eye of the beholder.
If they were *entirely* in the eye of the beholder, then XML wouldn't help
interoperability at all. For example, as I suggested before, you could
encode all your information in the spacing between attributes, but then what
good would it be to use XML? No other XML tool would understand it.
Long ago ASCII, and later Unicode, saved us from having to decide how to
encode characters for each application. Apart from saving us the effort of
choosing, this had an obvious advantage: we could write all kinds of tool that
were useful for many different file formats. We could edit Fortran programs and
invoices with the same editor. We could grep for strings or count lines in
any kind of text file.
XML does the same at the next level up: it saves us from choosing a format for
simple nested structures and named attributes. And it lets us build
generic tools that can do useful things with all kinds of XML document -
consider XSLT for example.
We have gained that at the expense of removing some semantics from the eye of
the beholder. We accept start and end tags as meaning some kind of
nesting. We accept attributes as things that are identified by name, not
position. And we accept white space inside tags as just being for
formatting and readability.
A binary file could be considered to conform to the "grammar" of ISO Latin-1,
because it consists of a sequence of 8-bit bytes. But it isn't Latin-1,
because the bytes aren't interpreted in the way that Latin-1 specifies.
Likewise a file that looks like XML but assigns significance to attribute order
may conform to the grammar of XML, but it isn't XML.
And this is a good thing, because imposing such (minimal) semantics allows us to
write a range of generic XML tools.
>It seems to be that *some* editors might understand *all* of the semantics
>of XML (as I believe you mean it), but augment them with constraints/semantics
>above and beyond those?
Yes, an XML editor could do that. But when it preserved the order of
attributes, it would be operating merely on the syntax.
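The point that attributes are identified by name rather than by position can be seen with the standard DOM API. The following is a minimal sketch with made-up sample documents and a hypothetical class name (AttributeOrderDemo). It compares two elements that differ only in the order of their attributes, using the DOM Level 3 method Node.isEqualNode, which compares attribute sets without regard to index.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttributeOrderDemo {
    // Parses the XML text and returns its root element.
    static Element parseRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)))
                          .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the two root elements are equal under DOM node equality.
    static boolean sameContent(String xmlA, String xmlB) {
        return parseRoot(xmlA).isEqualNode(parseRoot(xmlB));
    }

    public static void main(String[] args) {
        String a = "<book id='b1' lang='en'/>";
        String b = "<book lang='en' id='b1'/>"; // same attributes, different order
        String c = "<book id='b2' lang='en'/>"; // different attribute value
        System.out.println("a equals b: " + sameContent(a, b));
        System.out.println("a equals c: " + sameContent(a, c));
        // Attributes are looked up by name, never by position.
        System.out.println("id of b: " + parseRoot(b).getAttribute("id"));
    }
}
```

An application written this way gives the same result whatever order a producer happens to write the attributes in, which is the interoperability argument made in the discussion above.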
4. When should I use elements, and when should I use attributes?
The following information is useful:
http://xml.coverpages.org/elementsAndAttrs.html
5. Validation of XML document using JAXP 1.3 APIs
One of the samples provided with Xerces-J is jaxp.SourceValidator, which can validate multiple source XML documents against multiple XML Schemas. We can use this out-of-the-box utility to validate XML documents with XML Schemas.
Out of interest, I also implemented my own small validator, which likewise uses the JAXP 1.3 APIs provided with Xerces-J.
I am providing both a DOM version and a SAX version. The sample XML document and the Schema I used can be downloaded from here: XML, Schema.
There is nothing complicated about these utilities. Once the Xerces-J jars are on the Java classpath, the programs can be compiled and run.
The following are example command lines, with their output:
java SchmValidateDOM books.xml
Line: 24, Column: 9, Message:
cvc-complex-type.2.4.a: Invalid content was found
starting with element 'ISBN'. One of '{QUANTITY}' is expected.
java SchmValidateSAX books.xml
Line: 24, Column: 9, Message:
cvc-complex-type.2.4.a: Invalid content was found
starting with element 'ISBN'. One of '{QUANTITY}' is expected.
The validator retrieves the Schema from the hint provided in an instance document.
I hope that this is useful.
Notes: These utilities are only meant to demonstrate the JAXP validation APIs.
If you are looking for a robust, production-quality command-line XML Schema
validator, I recommend the Xerces utility jaxp.SourceValidator.
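For readers who want to see the shape of the JAXP 1.3 validation API, here is a minimal sketch. It is not the source of SchmValidateDOM or SchmValidateSAX: the class name SchemaValidateSketch and the tiny inline schema are assumptions made so that the example is self-contained, and it loads the schema explicitly instead of following the hint in the instance document. (To rely on the hints instead, JAXP offers the no-argument SchemaFactory.newSchema().) Errors are collected in the same "Line, Column, Message" form as the output shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class SchemaValidateSketch {
    // Hypothetical schema: a BOOK must contain TITLE followed by QUANTITY.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='BOOK'><xs:complexType><xs:sequence>"
      + "<xs:element name='TITLE' type='xs:string'/>"
      + "<xs:element name='QUANTITY' type='xs:int'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Validates the XML text against XSD and returns the reported problems.
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { record(e); }
                @Override public void error(SAXParseException e) { record(e); }
                @Override public void fatalError(SAXParseException e)
                        throws SAXParseException {
                    record(e);
                    throw e; // stop after a fatal (well-formedness) error
                }
                private void record(SAXParseException e) {
                    problems.add("Line: " + e.getLineNumber()
                        + ", Column: " + e.getColumnNumber()
                        + ", Message: " + e.getMessage());
                }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError above.
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String invalid = "<BOOK><TITLE>XML</TITLE><ISBN>123</ISBN></BOOK>";
        for (String problem : validate(invalid)) {
            System.out.println(problem);
        }
    }
}
```

Because the error handler records ordinary validation errors without rethrowing them, the validator continues and can report more than one problem per document.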
6. Creating an XML document with entity references, using DOM APIs
Let's say that we need to create the following XML documents:
[1]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root [
<!ENTITY x 'hello'>
]>
<root>
<a>&x; world</a>
</root>
OR
[2]
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "sample.dtd">
<root>
<a>&x; world</a>
</root>
Document [1] declares its DTD using the mechanism known as the "internal DTD
subset".
Document [2] references a DTD using a system identifier, which refers to an
external DTD file.
These XML documents contain a reference to an entity 'x', which is defined in the DTD. When an XML document is parsed, the XML parser expands each entity reference, replacing it with the value of the entity.
Creating or modifying an internal DTD subset is not possible with the standard
DOM APIs. But Xerces-J provides an extension class named DocumentTypeImpl, which
allows an internal DTD subset to be created or modified in the XML document.
I'm providing two Java programs below, which generate the outputs [1] and [2]
respectively, as specified above.
CreateXMLdocDOM.java, CreateXMLdocDOM1.java
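As a rough sketch of case [2], the tree can be built with standard DOM calls only. This is not the content of the two programs listed above: the class name EntityRefDomSketch and the helper describeContent are hypothetical, the Xerces-J DocumentTypeImpl extension is not used, and serialization is left out because the way an entity reference node is written depends on the serializer chosen.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class EntityRefDomSketch {
    // Builds the tree for document [2]: a DOCTYPE with a system identifier
    // and an <a> element containing the reference &x; followed by " world".
    static Document build() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            DocumentType doctype = impl.createDocumentType("root", null, "sample.dtd");
            Document doc = impl.createDocument(null, "root", doctype);
            Element a = doc.createElement("a");
            a.appendChild(doc.createEntityReference("x")); // unexpanded reference node
            a.appendChild(doc.createTextNode(" world"));
            doc.getDocumentElement().appendChild(a);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Renders the children of <a>, showing entity reference nodes as &name;.
    static String describeContent(Document doc) {
        Node a = doc.getDocumentElement().getFirstChild();
        StringBuilder sb = new StringBuilder();
        for (Node n = a.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ENTITY_REFERENCE_NODE) {
                sb.append('&').append(n.getNodeName()).append(';');
            } else {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println("system id: " + doc.getDoctype().getSystemId());
        System.out.println("content of <a>: " + describeContent(doc));
    }
}
```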
I hope that this is useful.
7. Adding external DTD subset (and entity definitions) programmatically
Below is a finding from a discussion I took part in on the xml-dev list.
Let's say we have the following XML file (named test.xml) [1]:
<?xml version="1.0" encoding="UTF-8" ?>
<x>
&message;
&copyright;
</x>
This file is not well-formed, which means it is not a legal XML document,
because there are no definitions for the entities message and copyright.
When an attempt is made to parse this XML file (using APIs such as DOM or SAX
from a Java application), we get errors like the following:
org.xml.sax.SAXParseException: The entity "message" was referenced, but
not declared.
Parsing fails, and we cannot do anything more with this XML file.
To parse an XML document successfully, the XML parser must have access to the
entity definitions at run time, so that the entity references can be resolved
and replaced with their values.
The entity definitions are typically provided to the parser by an internal or
an external DTD subset, for example as follows [2]:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE x SYSTEM "test.dtd">
<x>
&message;
&copyright;
</x>
The DTD test.dtd would provide the definitions of the referenced entities,
for example:
<!ENTITY message 'hello'>
<!ENTITY copyright 'ABC inc'>
If the XML document [2] is given to the XML parser, and the DTD test.dtd (with the correct entity definitions) is accessible to the parser, parsing succeeds.
The purpose of this note is to demonstrate a technique that lets the parser successfully parse a document like [1] above, which has no DOCTYPE declaration.
The solution is to use the SAX2 extension interface EntityResolver2 (or to use it indirectly via the SAX2 extension class DefaultHandler2, which implements it).
The following Java code solves this problem:
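import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;
import org.xml.sax.helpers.XMLReaderFactory;

public class EntityTest extends DefaultHandler2 {
    public static void main(String[] args) throws Exception {
        String xmlFile = args[0];
        XMLReader reader = XMLReaderFactory.createXMLReader();
        // DefaultHandler2 implements EntityResolver2, so the handler can
        // be registered as the entity resolver.
        reader.setEntityResolver(new EntityTest());
        reader.parse(new InputSource(xmlFile));
    }

    // Called for documents that do not declare an external DTD subset;
    // the returned source is read as the external subset.
    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        StringReader strReader = new StringReader(
            "<!ENTITY message 'hello'> <!ENTITY copyright 'ABC inc'>");
        return new InputSource(strReader);
    }
}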
The key elements of the solution are the call to setEntityResolver and the getExternalSubset method.
The entity definitions are provided to the XML parser at run time by implementing the EntityResolver2 method getExternalSubset.
When the above Java class is run with the XML file [1] above as input, the XML file is parsed successfully.
The complete source code for the above Java class is available here.
The Java class needs to be run as follows:
java EntityTest test.xml
Thanks to Michael Glavassevich for suggesting an answer to this.
Notes: I tested the example in this note using the Xerces-J 2.9.1 XML parser.
I hope that this is useful.
Acknowledgements:
1. Michael Glavassevich
2. Björn Höhrmann
3. David Carlisle
4. Andrew Welch
Useful references
* http://www.w3.org/standards/xml/ (W3C XML Technology home page)
* http://www.xml.org/ (An online community gathering place for those interested and involved in XML related standards and specifications - hosted by OASIS)
* http://www.ibm.com/developerworks/xml (IBM developerWorks - XML tutorials, code and forums)
* http://www.xml.com/ (O'Reilly's XML portal)
* http://xerces.apache.org/xerces2-j/ (Xerces Java XML parsers)
* http://cmsmcq.com/doclist.html (C. M. Sperberg-McQueen's XML pages)
* http://www.xfront.com/ (Roger L. Costello's XML pages)
* http://www.cafeconleche.org/ (Cafe con Leche XML News and Resources, Elliotte Rusty Harold)
* http://xml.silmaril.ie/ (Peter Flynn's XML FAQ)
Interesting articles / papers related to XML
* A technical introduction to XML, by Norman Walsh
* Collection of XML technology articles, by Norman Walsh
* Building workflow applications with XML and XQuery, by Michael Kay
* XML namespaces FAQ, by Ronald Bourret
* XML and Databases, by Ronald Bourret