XSLT Limitations

On this page, I maintain a collection of problems that cannot be solved using XSLT.

Please refer to Dave Pawson's site for a more complete list: http://www.dpawson.co.uk/xsl/sect2/nono.html.

1. Suppressing entity expansion during XSLT transform

I asked the following question on XSL-List.

I have the following XML document:

<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY x "hello">
]>
<root>
    <a>&x; world</a>
</root>

(Here I am using an internal DTD subset to define a custom entity.)

[1] I want to write a transform which prevents entity expansion. For example, the output of the transform should be (this is an identity transform):

<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY x "hello">
]>
<root>
    <a>&x; world</a>
</root>

(I think the DOCTYPE declaration, with its entity declaration, must also be generated in the output so that the entity 'x' is declared there.)

If I write the identity transform as follows:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:output method="xml" indent="yes" />

<xsl:template match="node() | @*">
    <xsl:copy>
        <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>

I get the output:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <a>hello world</a>
</root>

But I want the output as specified in [1] above.

What is the solution to this problem? I can use XSLT 2.0.

There were interesting replies ...

G. Ken Holman

>[1] I want to write a transform which prevents entity expansion.

The XPath data model on which XSLT is built does not represent entity references so that any present in the input source tree cannot be preserved for serialization in the output tree. Since the result tree is based on the XPath data model, one cannot represent an entity reference in the result tree as a data model abstraction.

If you want to create entity references in the result file using XSLT 2.0, you can play with serialization using xsl:output-character and compose entity references, but since you don't know where the entity references were in the source tree, you won't know where to compose them in the output serialization.

>For
>e.g. the output of transform should be (this is an identity
>transform):
>
><?xml version="1.0"?>
><!DOCTYPE root [
> <!ENTITY x "hello">
>]>
><root>
> <a>&x; world</a>
></root>
>...

>I get the output:
>
><?xml version="1.0" encoding="UTF-8"?>
><root>
> <a>hello world</a>
></root>

That is because that is the information found in the source tree. The XML processor in the XSLT processor has already resolved all aspects of syntax and has built the XPath node tree with the information found in the source file (but not the syntax found in the source file).

Constructs such as entity references and CDATA sections are merely aspects of syntax, not of information.
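The point that the parser resolves entity references before any tree exists can be checked outside XSLT. The following is a minimal sketch, using Python's standard-library xml.etree.ElementTree parser purely as a stand-in for the XML parser inside an XSLT processor (an assumption for illustration; the thread itself does not mention Python). It parses the document from the question and inspects the text of the a element.

```python
import xml.etree.ElementTree as ET

# The source document from the question, with the internal DTD subset.
doc = """<?xml version="1.0"?>
<!DOCTYPE root [
    <!ENTITY x "hello">
]>
<root>
    <a>&x; world</a>
</root>"""

# Parsing expands &x; immediately; the tree only ever holds the result.
root = ET.fromstring(doc)
text = root.find("a").text

print(repr(text))  # 'hello world'
```

Nothing in the parsed tree records that "hello" came from an entity reference, which is the same situation an XSLT stylesheet finds itself in.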

>But I want the output as specified in [1] above.
>
>What is the solution to this problem?

Using a language other than XSLT. I believe the DOM supports entity references.

Mukul: After this discussion, I explored the DOM APIs for this requirement. Please refer to my findings here.

>I can use XSLT 2.0.

Not for this problem you can't.

David Carlisle

> I am asking this question assuming that saxon:character-representation
> can solve this problem. Is my assumption correct?

This extension has no effect on general entities such as the one in your post; it just affects how characters are output: as character data, as hex numeric references, as decimal numeric references, or as well-known (HTML) entity names.

As Ken said, your problem isn't soluble in XSLT, as XSLT has no information that the entity reference was ever there. You need to do a non-XML pre-pass, something like:

sed -i -e "s/&/[[[amp]]]/g" file.xml

This converts &x; to [[[amp]]]x;. Then do the XSLT, and afterwards do a reverse text replace to put the entity references back.

Michael Kay

Since entity expansion is done by the XML parser before the XSLT processor gets any sight of the data, the only possible way you could stop it happening would be at the XML parser level.

In practice the pragmatic solution is to preprocess entity references (at the level of a text file, e.g. using Perl) to change &abc; to something like #abc; where # is some sufficiently-rare character, and then to postprocess the transformation result to change it back.
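The pre- and post-processing round trip suggested in the replies can be sketched in a few lines. This is only an illustration, written in Python rather than sed or Perl; the placeholder delimiters and the function names protect and restore are hypothetical, and the sketch assumes the placeholder text never occurs in the real data. Unlike the sed one-liner, it leaves the five predefined entities and numeric character references alone so that they still behave normally during the transform.

```python
import re

# Matches a general entity reference such as &x; but not
# numeric character references such as &#160; or &#xA0;.
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.:-]*);")

# Entities every XML parser knows; these are left untouched.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Hypothetical placeholder delimiters, assumed absent from the data.
OPEN, CLOSE = "[[[ent:", "]]]"
PLACEHOLDER = re.compile(
    re.escape(OPEN) + r"([A-Za-z_:][\w.:-]*)" + re.escape(CLOSE)
)


def protect(xml_text):
    """Pre-pass: hide custom entity references as plain text."""
    def hide(match):
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)
        return OPEN + name + CLOSE
    return ENTITY_REF.sub(hide, xml_text)


def restore(text):
    """Post-pass: turn placeholders back into entity references."""
    return PLACEHOLDER.sub(lambda m: "&" + m.group(1) + ";", text)


source = "<root>\n    <a>&x; world</a>\n</root>"
protected = protect(source)   # feed this to the XSLT processor
restored = restore(protected) # apply this to the transform's output
```

Here restore is applied directly to the protected text only to show that the round trip is lossless; in practice the XSLT transform runs between the two calls. The internal DTD subset is not carried through the transform either, so it would have to be re-attached to the output as a separate text-level step.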

During this discussion, it was also suggested that I have a look at Jesper Tverskov's improved identity transformation (written using XSLT 2.0), which I will.