Saxon 9.8: Which patterns are supported in EXPath File Module function file:list?

Good afternoon,
I am working with Java Saxon 9.8.0.4. I would like to use the EXPath File Module function "file:list" with its third "pattern" parameter, but I am unsure which style of pattern is supported.
I read both the Saxon documentation and the EXPath documentation, but I do not know which patterns are supported in Saxon 9.8.0.4. It would be great to have regular expression support, but I understand that is overkill for most users. I tried several blind tests, but only the * and ? wildcards work for me, as defined in the EXPath documentation.
Yes, I can quite easily do regexp post-processing in a for-each, but knowing more about the list function would help.
Thank you in advance for your help, Stepan
P.S.: My use case is to get all files without an extension ("test" but not "test.txt") recursively from a large and deep directory structure and process all matching files with XSLT 3.0. Most of these files have identical file names, so I cannot do a "copy to one folder" pre-processing step for a single Saxon -s:directory -o:directory invocation, and invoking Java (Saxon) once per file has, of course, a terrible time overhead. So I would like to read all matching files into a sequence and process each item of that sequence using for-each (the files are text files and I read them using unparsed-text). And no, GAWK is not a solution, as I already have the whole XML-to-SQL transformation infrastructure in XSLT, because 95 % of the files are XML.
--ADDED code and explanation below:
Example of my test files.
XML file "a.xml":
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="a.xsl"?>
<root/>
XSL-T file "a.xsl":
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
xmlns:expathFile="http://expath.org/ns/file"
exclude-result-prefixes="xs saxon"
version="3.0">
<xsl:output method="text" />
<xsl:template match="/root">
<xsl:variable name="list" select="expathFile:list('C:\temp\temp\test\', false(), '^.*$')"/>
<xsl:for-each select="$list">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
My folder "C:\temp\temp\test\" contains 6 test files: "a.txt", "b.txt", "c.txt", "e", "f", "g".
But after testing with the online Java RegExp tester at "http://www.regexplanet.com/advanced/java/index.html" I have found that the problem is solely on my side, because Java regular expressions behave a little differently from PCRE (Perl), sed and gawk regular expressions. So it is my fault and I need to learn Java regular expressions.
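For completeness, here is a minimal sketch of the regexp post-processing I mentioned above: list everything recursively and keep only the entries whose last path segment contains no dot. The directory and the expathFile prefix are the ones from my test stylesheet; file:is-file (from the same EXPath File Module) is used to skip directory entries:
<xsl:variable name="dir" select="'C:\temp\temp\test\'"/>
<xsl:for-each select="expathFile:list($dir, true())
    [matches(tokenize(., '[/\\]')[last()], '^[^.]+$')]
    [expathFile:is-file(concat($dir, .))]">
  <!-- relative path of a file without an extension -->
  <xsl:value-of select="."/>
  <xsl:text>&#10;</xsl:text>
</xsl:for-each>
Each selected path can then be turned into a file: URI and read with unparsed-text().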

Saxon uses the same code for this pattern as for the filter in select="pattern" in collection URIs, which is described at http://www.saxonica.com/documentation/index.html#!sourcedocs/collections
Extracting the relevant details:
The pattern used in the select parameter can use glob-like syntax, for example *.xml selects all files with extension "xml". More generally, the pattern is converted to a regular expression by prepending "^", appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?", and it is then used to match the file names appearing in the directory using the Java regular expression rules. So, for example, you can write ?select=*.(xml|xhtml) to match files with either of these two file extensions. Note however, that special characters used in the URL (that is, characters such as backslash and curly braces that are not allowed in the query part of a URI) must be escaped using the %HH convention. For example, vertical bar needs to be written as %7C. This escaping can be achieved using the encode-for-uri() function.
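As a rough illustration only (this is not Saxon's actual code), the conversion described above could be written in XPath as follows, assuming a variable $select holds the glob pattern; for example, *.xml becomes ^.*\.xml$:
<xsl:variable name="regex"
    select="concat('^',
                   replace(replace(replace($select, '\.', '\\.'), '\*', '.*'), '\?', '.?'),
                   '$')"/>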
Note that Saxon's collection() function now also supports match=pattern in the URI, where the pattern is a standard XPath 3.1 regular expression.
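For instance, a sketch of how this looks in a stylesheet, assuming the same query-parameter syntax applies to uri-collection() and the directory URI is purely illustrative:
<!-- glob filter on the collection URI; recurse=yes walks subdirectories -->
<xsl:for-each select="uri-collection('file:///C:/temp/temp/test/?select=*.xml;recurse=yes')">
  <xsl:value-of select="."/>
  <xsl:text>&#10;</xsl:text>
</xsl:for-each>
<!-- in releases that support it, match=regex can be used instead of select=glob,
     with special characters %HH-escaped as described above -->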

Related

How to get over this SXXP0003 parser error?

I have an XML document whose prolog looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
...
This XML document is valid against the external DTD, which has the exact same prolog:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE root [
...
]>
When I transform using Saxon (latest release):
$:/opt/tomcat/webapps/ROOT/$ java net.sf.saxon.Transform -s:pandora.xml -xsl:pandora.xsl -o:pandora.html
Error on line 1 column 53 of pandora.dtd:
SXXP0003 Error reported by XML parser: No more pseudo attributes are allowed.: No more
pseudo attributes are allowed.
org.xml.sax.SAXParseException; systemId: file:/opt/tomcat/webapps/ROOT/fred/pandora/dtd/pandora.dtd; lineNumber: 1; columnNumber: 53; No more pseudo attributes are allowed.
I am a newbie, and my research on this has only led me to advice about listing the pseudo-attributes in the correct order, which they already are. If anybody has a clue, I would appreciate it.
Edit
I have made other transformations using the same process in other projects without any problem. The only difference is that in this problematic application I make use of another namespace, exsl, to use a function not provided in version 1.0 (node-set). Everything else is similar.
For an external subset of the DTD, the specification defines the format in https://www.w3.org/TR/xml/#NT-extSubset as
extSubset ::= TextDecl? extSubsetDecl
extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep )*
and the "Text Declaration" in https://www.w3.org/TR/xml/#sec-TextDecl as
TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
so a standalone "pseudo-attribute" is indeed not allowed there.
So make sure that your external DTD file does not repeat <!DOCTYPE root; it is only meant to contain markup declarations, e.g. for elements and attributes.
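For example, a valid external DTD file could start with a text declaration that omits standalone, followed only by markup declarations (the element declaration here is just a placeholder):
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT root (#PCDATA)>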
In any case, the error message you get comes from the XML parser itself and is not transformation/XSLT related.

Notepad++ deleting string in multiple files

I'm trying to remove a specific line from many files I'm working on with Notepad++.
For example, I have these lines:
File 1:
<mana now="110" max="110" manaGain="6" manaTicks="500" type="3"/>
File 2:
<mana now="100" max="100" manaGain="11" manaTicks="500"/>
As you can see, the values differ. I'd like to remove this line from all files. Can I do it with Notepad++, even though each file has different values?
You can do it with Notepad++ and a RegEx. But be warned - please make a backup copy of all files first.
I assume your files all have the extension *.xml, reside in a folder such as D:\_working, and look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<mana now="110" max="110" manaGain="6" manaTicks="500" type="3"/>
</bookstore>
First, open one of the files in your working directory in Notepad++, then press
Ctrl+H
Go to the Find in Files tab
Find what: <mana now="[0-9]{1,}" max="[0-9]{1,}" manaGain="[0-9]{1,}" manaTicks="[0-9]{1,}".+
Replace with: NOTHING
Filters: *.xml
Directory: e.g. D:\_working
Search mode: Regular expression
Click on Replace in Files
Click on OK when you're really sure.
You may want to refine the RegEx for your needs. Short explanation:
[0-9] matches a single character in the range between 0 (ASCII 48) and 9 (ASCII 57) (case sensitive)
{1,} is a quantifier: it matches between one and unlimited times, as many times as possible
.+ matches one or more of any character (except line terminators)
This results in:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
</bookstore>

Is this substring-after and concat breaking my encoding?

Our GSA uses a FileConnector to index different shares which are targets of DFS links. I am trying to rewrite file://filesrv01.example.com/share$/dir/file.ext to file://R:/share/dir/file.ext in the frontend XSL.
There is an xsl:choose element which tests for different protocols but not file://, so I assume the default handling for my source links would be this node:
<xsl:otherwise>
<xsl:value-of disable-output-escaping='yes' select="U"/>
</xsl:otherwise>
We created a new xsl:when node like this:
<xsl:when test="starts-with(U, 'file://server.example.com/share$>
<xsl:value-of disable-output-escaping='yes'
select="concat('file://R:/share/',
substring-after(U,'file://server.example.com/share$/') )"/>
</xsl:when>
This works for almost all entries in our index, but it fails when the path contains a German umlaut. Here are the input, the actual output, and the expected output; in the actual output the umlaut's encoding is broken:
file://server/share$/dir/FileWithUmläut.txt
file://R:/share/dir/FileWithUmläut.txt
file://R:/share/dir/FileWithUmläut.txt
Why does the default xsl:otherwise work without changing the umlauts, but our concat + substring-after does not? Is there anything I could check or change?
Edit #1
There is only one output element in the XSL file: <xsl:output method="html"/>. The XSL itself is recognised as ANSI in Notepad++, with some umlauts in UI texts. The output to the browser is UTF-8 XHTML.
Edit #2
When I replace the xsl:when with the following block, the encoding is not broken and the link can be opened (not via the DFS root but directly via the UNC path). Because of this I believe it is not the encoding of the XML or the XSL; thanks for your input nevertheless, @MathiasMüller.
<xsl:when test="starts-with(U, 'file://server.example.com/share$/')">
<xsl:value-of disable-output-escaping='yes' select="U"/>
</xsl:when>
My specific problem vanished as soon as I used file:///R:/ instead of file://R:/ (an additional forward slash), but I am still trying to figure out why that helped. In the GSA XSL there is a JavaScript snippet to "fix" encoding issues in IE, but it does not care whether the protocol has two or three slashes.
Although Firefox does not allow the file protocol out of the box, neither syntax works when copied from there. This leads me to believe that my currently installed IE 9 fixes some encoding issues on its own when using the correct file:/// prefix and Firefox does not.
As we would like the links to work in Firefox too, I will continue my quest for glory in the land of unicode, plagued by the ancient dragon of file:/// and home to the houses of IE and FF.

How to provide an empty Source in xslTransformer.transform() method?

I have an XSLT 2.0 file which is used to transform a CSV file to an XML file. The XSL has been taken from here:
http://p2p.wrox.com/xslt/40898-transform-csv-file-xml.html#post164344
Now I am trying to execute this through a Java Transformer (using the Saxon 9 XSL transformer factory). Since the CSV file is passed into the XSL as a parameter, there is no need for me to pass anything in the Source parameter of the transform method. The javadocs for the Transformer.transform method state the following:
"An empty Source is represented as an empty document as constructed by DocumentBuilder.newDocument(). The result of transforming an empty Source depends on the transformation behavior; it is not always an empty Result."
So I created an empty document and tried the transformation as seen below:
TransformerFactory transformerFactory = TransformerFactory.newInstance("net.sf.saxon.TransformerFactoryImpl",null);
Source xsltSource = new StreamSource("file:///C:/my.xsl");
Transformer xsltTransformer = transformerFactory.newTransformer(xsltSource);
xsltTransformer.setParameter("pathToCSV", "'file:///C:/input.csv'");
StringWriter writer = new StringWriter();
xsltTransformer.transform(new DOMSource(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()), new StreamResult(writer));
The above piece of code does not output anything and does not work as expected; I think the empty document given as input is being processed instead of the CSV file read via the following lines in the XSL:
<xsl:param name="pathToCSV" />
<xsl:variable name="input" select="unparsed-text($pathToCSV)"/>
Could anyone give me pointers on how to accomplish what I am trying to achieve?
Consider using the Saxon s9api API (http://saxonica.com/documentation/html/using-xsl/embedding/s9api-transformation.html) rather than the JAXP API if you want to use XSLT 2.0 features such as starting the transformation at a named template, as the XSLT you linked to requires. Or, if you want to stay with JAXP and an empty dummy document, you at least need to add a template along the lines of
<xsl:template match="/">
<xsl:call-template name="main"/>
</xsl:template>

UniVerse XDOM Trouble with Default Namespaces

I'm trying to use the XDOM functions of UniVerse to parse an XML file, but I can't get it to correctly parse XML that uses a default namespace. It can correctly handle XML without namespaces, or with named namespaces, but if there is a default namespace, all XPaths fail to locate the nodes they are supposed to match.
To give a simple example, I'm trying to parse this XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore xmlns="http://www.example.com">
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
With this code:
PROGRAM XDOM.TEST
$INCLUDE SYSCOM XML.H
OPEN "XML" TO F.XML ELSE STOP "OPEN FAILED"
READ XML FROM F.XML, 'TEST.xml' ELSE STOP "READ FAILED"
EXIT.PROG = #FALSE
CONVERT #FM TO CHAR(10) IN XML
IF NOT(EXIT.PROG) AND XDOMOpen(XML, XML.FROM.STRING, XDOM) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMLocate(XDOM, '/bookstore/book[#category="CHILDREN"]', 'xmlns=http://www.example.com', XNODE) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMEvaluate(XNODE, './author', 'xmlns=http://www.example.com', AUTHOR) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) then PRINT AUTHOR
STOP
XML.ERR:
XML.CODE = ''
XML.ERR = ''
EXIT.PROG = #TRUE
IF XMLGetError(XML.CODE, XML.ERR) = XML.SUCCESS THEN
PRINT XML.CODE
PRINT XML.ERR
END
RETURN
END
When I run this code as is, I get the output:
10
The location path '/bookstore/book[#category="CHILDREN"]' was not found.
However, if I remove the xmlns="http://www.example.com" namespace declaration from the XML, it works fine.
After searching on my own, I found out that this is actually a bug in UniVerse's XDOM parser itself. Someone has kindly documented a work-around here: you can "fool" the parser into working by giving a name to the default namespace. They also note that you can't use double quotes in the namespace map either.
Personally, I prefer the simpler solution of just manually stripping out the offending namespace before parsing it. Adding this line to the above program fixes the issue, albeit in a very hacky way:
XML = CHANGE(XML, ' xmlns="http://www.example.com"', '')
This way you don't have to put unnecessary prefixes on all your XPath steps. This may not always be an option though, depending on how you are getting the XDOM handle.
