Friday, September 30, 2011

XSLT, entities, Java, Xalan...

Java
The Apache XML Project / Xalan-J
As already mentioned, I have been using XSLT and Xalan-J for quite a long while. It was recently that I was dealing with an HTML-related transform that used a lot of special characters.

Previously I blindly used to assume that named character references are a part of HTML, not defined for generic XML, so not available in XSLT files too. No big problem if numeric character references are available. But that time I had a good incentive to give it a better thought . After all, the very name of XML states extensibility. So what prevents us from getting named character references being defined for XSLT files too?

In fact, nothing prevents. An absolutely legal way to get them defined is to extend the document type declaration for XSLT, like this:
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE transform [
  <!ENTITY % w3centities-f PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
      "http://www.w3.org/2003/entities/2007/w3centities-f.ent">
  %w3centities-f;
]>
<xsl:transform version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
...

With this small modification the miracle happened, and named character references stopped complaining . Everything was fine, but, as one of the Murphy's laws states, "If everything seems to be going well, you have obviously overlooked something" .

This time the pitfall was in the reference to the entities definition,
"http://www.w3.org/2003/entities/2007/w3centities-f.ent". If we tell the XSLT engine to use some external resource, it naturally has to go for it . In case we do not take necessary care, the only place to go is the Internet. Everything is going to work, but maybe slower than we might expect, and crashing if Internet connection is not available.

The standard approach to deal with this is using a catalog-based entity/URL resolver. This time unexpectedly it did not help. Nothing was wrong with the resolver and the catalog, and still the XSLT engine persistently went fetching the entities from the Internet.

The cause of the issue was found in the Xalan-J sources. Perhaps nobody before considered seriously using external entities in an XSLT file, thus no traces of using an entity resolver for an XSLT file in the code. It may be worth mentioning that those of us who might be not familiar with Xalan-J as such and just plainly use JAXP, are likely to tread on the same issue, as Xalan-J is a part of the reference JAXP implementation that is endorsed into standard JDK/JRE implementations like Oracle J2SE and OpenJDK.

The Fix


The last Xalan-J release 2.7.1 was taken as the basis for amendments. For those who need just things working, a patched Xalan-J binary build is available as xalan-2.7.1-usn-20110928.jar. Those interested in what is under the hood are welcome to have a look at the modifications.

The fix was submitted to the Xalan-J project as an attachment to issue XALANJ-2544.

Usage – HOW-TO


In case you feel like using some named character references in your XSLT file:
  1. Add appropriate entities to your XSLT transform/stylesheet, either one by one, like this:
    <?xml version="1.0" standalone="yes" ?>
    <!DOCTYPE transform [
      <!ENTITY nbsp  "&#x000A0;">
      <!ENTITY ndash "&#x02013;">
    ]>
    <xsl:transform version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    ...
    
    or all at once:
    ...
    <!DOCTYPE transform [
      <!ENTITY % w3centities-f PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
          "http://www.w3.org/2003/entities/2007/w3centities-f.ent">
      %w3centities-f;
    ]>
    ...
    
  2. If you could limit yourself to a definite set of entities, or your XSLT engine is not Xalan-based, have fun
  3. If not, go ahead and download the patched Xalan-J binary as xalan-2.7.1-usn-20110928.jar.
  4. Arrange your JVM to use the patched .jar, keeping in mind that Xalan-J is an endorsed technology. One of possible approaches is to use -Djava.endorsed.dirs=path_to_the_folder_containing_the_new_jar JVM launcher option. See more on the Oracle Java Endorsed Standards Override Mechanism page.
  5. Make sure you use some XML Entity Resolver. You may wish to have a look at Apache XML Commons Resolver as a good candidate.
  6. Get yourself a catalog for your resolver, if not yet. If starting from scratch, that may be a plain text file in OASIS TR9401 format:
    PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
           "file:///path_to_your_local_copy/w3centities-f.ent"
    SYSTEM "http://www.w3.org/2003/entities/2007/w3centities-f.ent"
           "file:///path_to_your_local_copy/w3centities-f.ent"
    
  7. Get your catalog known to the resolver, say, via the xml.catalog.files system property supplied to the JVM launcher: -Dxml.catalog.files=your_catalog_file_name_with_path.
  8. Enjoy .

Sunday, September 4, 2011

Java, Xalan, Unicode command line arguments...

Java
I have always believed on bare word that Java is Unicode-friendly by design. And I have been using Xalan-J happily for quite a long while already.

It was just recently that I ran into Unicode-related problem with Java, and that was occasionally related to Xalan-J. Strictly speaking, Xalan-J was not to blame. But that was Xalan-J Command-Line Utility who refused to do its job when handling file names with characters missing from my local system code page.

It did not take a long search to discover that the problem is known and platform-specific, at least specific for Windows. The bug is known at least since the early days when the yet-to-come Java 6 was called Mustang ("Support for Unicode in Mustang ?" , 2005). It was discussed later on JavaRanch - Unicode: cmd parameters (main args); exec parameters; filenames and on OpenJDK 'core-libs-dev' mailing list - RFE 4519026: (process) Process should support Unicode on Win NT, request for review and Unicode support in Java JRE on Windows.

The origin of the bug looks like dating back to the days when it was necessary to have Java running on both Unicode and non-Unicode (e.g. Windows 9x) platforms, so using non-Unicode system calls in Java launcher code was somehow justified. A lot of water has flown under the bridge since then, and Windows 9x is no more supported by new JVM versions... And still my quick experiment revealed that the new Java 7 comes with the same bug in its place...

The right approach would be to get the bug fixed in the Java launcher. But unfortunately it is already long since I wrote something in C/C++... So the quick-and-dirty solution happened to be a Java wrapper.

Wrapper Implementation


The first idea was to get a wrapper just for Xalan-J. But then I remembered of the other command-line utilities written in Java and came to an idea of a general-purpose command line wrapper.

It was necessary to provide possibility of handling Unicode parameters at command shell level (e.g. file names, etc). And then supply all the command line arguments to Java, bypassing the command line . The solution was found as passing the command line arguments via a temporary file, one argument per line, to be written by command shell in some flavor of Unicode, to be further read in by the Java wrapper. And there was no rich variety of convenient Unicode flavors under Windows, as the cmd tool only allows easy output in UTF-16LE using the /u switch.

The resulting wrapper is available for download and free use under Apache 2.0 license, as both source and a compiled .jar.

Other command-line tools coming to my mind besides Xalan-J, that might benefit from using with this wrapper, include: Batik, FOP, CSSToXSLFO, ...

Usage – HOW-TO


  1. Download the .jar file and save it to a location of your choice.
  2. Prepare / modify the script you use to launch your command-line tool. Say, for Xalan-J your script might contain something like:
    @cmd /u /c echo -IN >%TMP%\xalan-args.tmp
    @cmd /u /c echo %~f1 >>%TMP%\xalan-args.tmp
    @cmd /u /c echo -XSL >>%TMP%\xalan-args.tmp
    @cmd /u /c echo %~f2 >>%TMP%\xalan-args.tmp
    @cmd /u /c echo -OUT >>%TMP%\xalan-args.tmp
    @cmd /u /c echo %~f3 >>%TMP%\xalan-args.tmp
    @if not %4. == .  cmd /u /c echo %~4 >>%TMP%\xalan-args.tmp
    @if not %5. == .  cmd /u /c echo %~5 >>%TMP%\xalan-args.tmp
    ...
    java [...] usn.unicode.UnicodeLauncher org.apache.xalan.xslt.Process %TMP%\xalan-args.tmp
    
    Ensure that usn-unicode-20110903.jar is on your class path. Note that all right angle redirection characters are doubled, except the first line... Refer to the "Using batch parameters" Microsoft page for syntax like %~f1. Note also %~4 instead of %4 to remove surrounding quotes if any.
  3. Have fun