XSLT Limitations
On this page, I maintain a collection of problems that cannot be solved using XSLT.
Please refer to Dave Pawson's site for a more complete list: http://www.dpawson.co.uk/xsl/sect2/nono.html.
1. Suppressing entity expansion during XSLT transform
I asked the following question on XSL-List.
I have the following XML document:
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY x "hello">
]>
<root>
<a>&x; world</a>
</root>
(here I am using an internal DTD subset to define custom entities)
[1] I want to write a transform which prevents entity expansion. For example, the
output of the transform should be (this is an identity transform):
<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY x "hello">
]>
<root>
<a>&x; world</a>
</root>
(I think the output must include the DOCTYPE declaration, so that the entity 'x'
is declared.)
If I write the identity transform as follows:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output method="xml" indent="yes" />
<xsl:template match="node() | @*">
<xsl:copy>
<xsl:apply-templates select="node() | @*" />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
I get the output:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<a>hello world</a>
</root>
But I want the output as specified in [1] above.
What is the solution to this problem? I can use XSLT 2.0.
There were interesting replies ...
G. Ken Holman
>[1] I want to write a transform which prevents entity expansion.
The XPath data model on which XSLT is built does not represent entity references
so that any present in the input source tree cannot be preserved for
serialization in the output tree. Since the result tree is based on the XPath
data model, one cannot represent an entity reference in the result tree as a
data model abstraction.
If you want to create entity references in the result file using XSLT 2.0, you
can play with serialization using xsl:output-character and compose entity
references, but since you don't know where the entity references were in the
source tree, you won't know where to compose them in the output serialization.
>For
>e.g. the output of transform should be (this is an identity
>transform):
>
><?xml version="1.0"?>
><!DOCTYPE root [
> <!ENTITY x "hello">
>]>
><root>
> <a>&x; world</a>
></root>
>...
>I get the output:
>
><?xml version="1.0" encoding="UTF-8"?>
><root>
> <a>hello world</a>
></root>
That is because that is the information found in the source tree. The XML
processor in the XSLT processor has already resolved all aspects of syntax and
has built the XPath node tree with the information found in the source file (but
not the syntax found in the source file).
Constructs such as entity references and CDATA sections are merely aspects of
syntax, not of information.
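The point that the parser resolves entities before any tree-based tool sees the data can be checked outside XSLT. The following is a minimal sketch in Python (not part of the original thread), using the standard library's xml.etree.ElementTree as a stand-in for the XML parser that sits in front of an XSLT processor. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The same document as in the question, with an internal DTD subset.
SOURCE = """<?xml version="1.0"?>
<!DOCTYPE root [
<!ENTITY x "hello">
]>
<root>
<a>&x; world</a>
</root>"""

# Parsing builds a tree of information items; the entity reference
# is replaced by its replacement text while the tree is built.
root = ET.fromstring(SOURCE)
a_text = root.find("a").text

print(a_text)  # the tree holds "hello world"; no trace of &x; remains
```

Once the tree exists, nothing in it records that "hello" came from an entity reference, which is the same situation an XSLT stylesheet is in.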
>But I want the output as specified in [1] above.
>
>What is the solution to this problem?
Using a language other than XSLT. I believe the DOM supports entity references.
Mukul: After this discussion, I explored the DOM APIs for this requirement.
Please refer to my findings here.
>I can use XSLT 2.0.
Not for this problem you can't.
David Carlisle
> I am asking this question assuming that saxon:character-representation
> can solve this problem. Is my assumption correct?
This extension has no effect on general entities such as the one in your post;
it just affects how characters are output: as character data, as hex numeric
references, as decimal numeric references, or as well-known (HTML) entity
names.
As Ken said, your problem isn't soluble in XSLT, as XSLT has no information
that the entity reference was ever there. You need to do a non-XML pre-pass,
something like
sed -i -e "s/&/[[[amp]]]/g" file.xml
This converts &x; to [[[amp]]]x;. Then do the XSLT, then do a reverse text
replace to put the entities back.
Michael Kay
Since entity expansion is done by the XML parser before the
XSLT processor gets any sight of the data, the only possible way you could stop
it happening would be at the XML parser level.
In practice the pragmatic solution is to preprocess entity references (at the
level of a text file, e.g. using Perl) to change &abc; to something like #abc;
where # is some sufficiently-rare character, and then to postprocess the
transformation result to change it back.
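The pre-pass and post-pass idea described in the replies above can be sketched as follows. This is a minimal illustration in Python rather than sed or Perl, under several assumptions: the names MARKER, protect and restore are hypothetical, the marker character U+E000 is assumed never to occur in the input, the five predefined entities are left for the parser to handle, and an ElementTree parse and serialize round trip stands in for the XSLT step (the Python standard library has no XSLT processor).

```python
import re
import xml.etree.ElementTree as ET

MARKER = "\ue000"  # assumption: this private-use character never appears in the data
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9._-]*);")
PROTECTED = re.compile(MARKER + r"([A-Za-z_][A-Za-z0-9._-]*);")


def protect(text):
    """Pre-pass: turn &name; into MARKER + name; so the parser ignores it."""
    def swap(match):
        name = match.group(1)
        # Leave &amp;, &lt; etc. alone; the parser handles those itself.
        return match.group(0) if name in PREDEFINED else MARKER + name + ";"
    return ENTITY_REF.sub(swap, text)


def restore(text):
    """Post-pass: turn MARKER + name; back into &name;."""
    return PROTECTED.sub(r"&\1;", text)


source = "<root><a>&x; world</a></root>"

protected = protect(source)
# Stand-in for the XSLT identity transform: parse, then serialize again.
tree = ET.fromstring(protected)
result = restore(ET.tostring(tree, encoding="unicode"))

print(result)
```

The sketch works on the document element only. The DOCTYPE declaration is not carried through an XSLT transform either, so a real post-pass would presumably also have to put the internal subset back; that step is not shown here.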
During this discussion, it was also suggested that I look at Jesper Tverskov's improved identity transformation (written in XSLT 2.0), which I will.