UniVerse XDOM Trouble with Default Namespaces

I'm trying to use the XDOM functions of UniVerse to parse an XML file, but I can't get it to correctly parse XML that uses a default namespace. It correctly handles XML without namespaces, or with named namespaces, but if there is a default namespace, every XPath expression fails to locate the nodes it is supposed to match.
To give a simple example, I'm trying to parse this XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore xmlns="http://www.example.com">
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
With this code:
PROGRAM XDOM.TEST
$INCLUDE SYSCOM XML.H
OPEN "XML" TO F.XML ELSE STOP "OPEN FAILED"
READ XML FROM F.XML, 'TEST.xml' ELSE STOP "READ FAILED"
EXIT.PROG = @FALSE
CONVERT @FM TO CHAR(10) IN XML
IF NOT(EXIT.PROG) AND XDOMOpen(XML, XML.FROM.STRING, XDOM) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMLocate(XDOM, '/bookstore/book[@category="CHILDREN"]', 'xmlns=http://www.example.com', XNODE) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMEvaluate(XNODE, './author', 'xmlns=http://www.example.com', AUTHOR) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) then PRINT AUTHOR
STOP
XML.ERR:
XML.CODE = ''
XML.ERR = ''
EXIT.PROG = @TRUE
IF XMLGetError(XML.CODE, XML.ERR) = XML.SUCCESS THEN
PRINT XML.CODE
PRINT XML.ERR
END
RETURN
END
When I run this code as is, I get the output:
10
The location path '/bookstore/book[@category="CHILDREN"]' was not found.
However, if I remove the "xmlns=http://www.example.com" namespace, it works fine.

After searching on my own, I found out that this is actually a bug in UniVerse's XDOM parser itself. Someone has kindly documented a workaround here. You can "fool" the parser into working by giving a name to the default namespace. They also note that you can't use double quotes in the namespace map.
Personally, I prefer the simpler solution of just manually stripping out the offending namespace before parsing it. Adding this line to the above program fixes the issue, albeit in a very hacky way:
XML = CHANGE(XML, ' xmlns="http://www.example.com"', '')
This way you don't have to put unnecessary prefixes on all your xPath nodes. This may not always be an option though, depending on how you are getting the XDOM handle.
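For comparison outside UniVerse, here is a minimal Python sketch of the two options using the standard library's xml.etree.ElementTree. It is only an illustration of the idea, not UniVerse code: the prefix "ex" is an arbitrary name chosen for the example, and the XML is a shortened copy of the document in the question. In XPath 1.0 an unprefixed name in a path refers to an element in no namespace, which is why either naming the default namespace or stripping it makes the lookup succeed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookstore document from the question.
XML = """<bookstore xmlns="http://www.example.com">
  <book category="COOKING"><author>Giada De Laurentiis</author></book>
  <book category="CHILDREN"><author>J K. Rowling</author></book>
</bookstore>"""

# "ex" is an arbitrary prefix picked for this sketch.
NS = {"ex": "http://www.example.com"}
PATH_PREFIXED = 'ex:book[@category="CHILDREN"]/ex:author'
PATH_PLAIN = 'book[@category="CHILDREN"]/author'


def _text(node):
    return None if node is None else node.text


def author_with_prefix(xml_text):
    """Option 1: bind the default namespace to a prefix and use it in the path."""
    root = ET.fromstring(xml_text)
    return _text(root.find(PATH_PREFIXED, NS))


def author_after_stripping(xml_text):
    """Option 2: remove the xmlns declaration before parsing (the hacky way)."""
    stripped = xml_text.replace(' xmlns="http://www.example.com"', "")
    root = ET.fromstring(stripped)
    return _text(root.find(PATH_PLAIN))


def author_unprefixed(xml_text):
    """Unprefixed path against the namespaced document: nothing is found."""
    root = ET.fromstring(xml_text)
    return _text(root.find(PATH_PLAIN))


print(author_with_prefix(XML))
print(author_after_stripping(XML))
print(author_unprefixed(XML))
```

The stripping variant only works when the declaration text is known exactly, which matches the caveat above about it not always being an option.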

Related

How to get over this SXXP0003 parser error?

I have an XML document whose prolog looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
...
This XML document is valid against the external DTD with the exact same prolog :
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE root [
...
]>
When I transform using Saxon (latest release):
$:/opt/tomcat/webapps/ROOT/$ java net.sf.saxon.Transform -s:pandora.xml -xsl:pandora.xsl -o:pandora.html
Error on line 1 column 53 of pandora.dtd:
SXXP0003 Error reported by XML parser: No more pseudo attributes are allowed.: No more
pseudo attributes are allowed.
org.xml.sax.SAXParseException; systemId: file:/opt/tomcat/webapps/ROOT/fred/pandora/dtd/pandora.dtd; lineNumber: 1; columnNumber: 53; No more pseudo attributes are allowed.
I am a newbie, and my research on this has only led me to list the pseudo-attributes in the order they are already in. If anybody has a clue, I would appreciate it.
Edit
I have made other transformations using the same process with other projects without any problem. The only difference is in this problematic application, I make use of another namespace exsl to use a function not provided with version 1.0 (node-set). Everything else is similar.
For an external subset of the DTD the specification defines the format in https://www.w3.org/TR/xml/#NT-extSubset as
extSubset ::= TextDecl? extSubsetDecl
extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep)*
and it defines the "Text Declaration" in https://www.w3.org/TR/xml/#sec-TextDecl as TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>', so a standalone "pseudo" attribute is indeed not allowed there.
So make sure that your external DTD file does not repeat <!DOCTYPE root, it is just meant to contain declaration of markup, e.g. elements, attributes.
The error message you get comes anyway just from the XML parser and is not transformation/XSLT related.
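To make the TextDecl rule concrete, here is a rough Python sketch that checks whether the first line of an external DTD file is an acceptable text declaration. The function name is_valid_text_decl is made up for this example, and the regular expression is a simplified reading of the production quoted above, not a full XML parser.

```python
import re

# TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
# EncName follows the XML spec: a letter, then letters, digits, '.', '_' or '-'.
ENC_NAME = r"[A-Za-z][A-Za-z0-9._-]*"
VERSION_INFO = r"""\s+version\s*=\s*(?:"1\.[0-9]+"|'1\.[0-9]+')"""
ENCODING_DECL = r"""\s+encoding\s*=\s*(?:"%s"|'%s')""" % (ENC_NAME, ENC_NAME)
TEXT_DECL = re.compile(r"<\?xml(?:%s)?%s\s*\?>" % (VERSION_INFO, ENCODING_DECL))


def is_valid_text_decl(first_line):
    """Return True if the line matches the TextDecl production (hypothetical helper)."""
    return TEXT_DECL.fullmatch(first_line.strip()) is not None


# The prolog from the question is fine for a document but not for an external DTD.
print(is_valid_text_decl('<?xml version="1.0" encoding="utf-8" standalone="no"?>'))
print(is_valid_text_decl('<?xml version="1.0" encoding="utf-8"?>'))
```

The first call reports that the declaration is rejected because of the trailing standalone pseudo-attribute; removing it from the DTD file gives the second form.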

Saxon 9.8: Which patterns are supported in EXPath File Module function file:list?

Good afternoon,
I am working with Java Saxon 9.8.0.4. I would like to use the EXPath File Module function "file:list" with its third "pattern" parameter, but I am not sure which style of pattern is supported.
I have read both the Saxon documentation and the EXPath documentation, but I still do not know which patterns are supported in Saxon 9.8.0.4. Regular expression support would be great, though I understand it is overkill for most users. I tried several blind tests, but only the * and ? wildcards work for me, as defined in the EXPath documentation.
Yes, I can quite easily do regexp post-processing in a for-each, but knowing more about the list function would help.
Thank you in advance for your help, Stepan
P.S.: My use case is to get all files without an extension ("test" and not "test.txt") recursively from a large and deep directory structure and process all matching files with XSL-T 3.0. Most of these files have identical file names, so I cannot do a "copy to one folder" pre-processing step for a one-time Saxon -s:directory -o:directory invocation, and invoking Java (Saxon) for each file is of course a terrible time overhead. So I would like to read all matching files into a sequence and process each item of that sequence using for-each (the files are text files and I read them using unparsed-text). And no, GAWK is not a solution, as I already have all the transformation infrastructure from XML to SQL in XSL-T, because 95 % of the files are XML.
--ADDED code and explanation below:
Example of my test files.
XML file "a.xml":
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="a.xsl"?>
<root/>
XSL-T file "a.xsl":
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
xmlns:expathFile="http://expath.org/ns/file"
exclude-result-prefixes="xs saxon"
version="3.0">
<xsl:output method="text" />
<xsl:template match="/root">
<xsl:variable name="list" select="expathFile:list('C:\temp\temp\test\', false(), '^.*$')"/>
<xsl:for-each select="$list">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
My folder "C:\temp\temp\test\" contains 6 test files: "a.txt", "b.txt", "c.txt", "e", "f", "g".
But after testing with the online Java regexp tester at "http://www.regexplanet.com/advanced/java/index.html" I found that the problem is solely on my side, because Java regular expressions behave a little differently from PCRE (Perl), sed, and gawk regular expressions. So it is my fault and I need to learn Java regular expressions.
Saxon uses the same code for this pattern as for the filter in select="pattern" in collection URIs, which is described at http://www.saxonica.com/documentation/index.html#!sourcedocs/collections
Extracting the relevant details:
The pattern used in the select parameter can use glob-like syntax, for
example *.xml selects all files with extension "xml". More generally,
the pattern is converted to a regular expression by prepending "^",
appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?",
and it is then used to match the file names appearing in the directory
using the Java regular expression rules. So, for example, you can
write ?select=*.(xml|xhtml) to match files with either of these two
file extensions. Note however, that special characters used in the URL
(that is, characters such as backslash and curly braces that are not
allowed in the query part of a URI) must be escaped using the %HH
convention. For example, vertical bar needs to be written as %7C. This
escaping can be achieved using the encode-for-uri() function.
Note that Saxon's collection() function now also supports match=pattern in the URI, where the pattern is a standard XPath 3.1 regular expression.
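The conversion described in the quoted documentation can be sketched in a few lines of Python. This is only an approximation for illustration: the function names are invented for the example, the order of the replacements is an assumption (dots are escaped first so that the dots introduced for "*" and "?" are not escaped again), and Python's re module stands in for the Java regular expression engine that Saxon actually uses.

```python
import re


def saxon_style_pattern_to_regex(pattern):
    """Approximate the documented glob-to-regex conversion (hypothetical helper)."""
    # Escape literal dots first, then expand the wildcards.
    body = pattern.replace(".", r"\.").replace("*", ".*").replace("?", ".?")
    return "^" + body + "$"


def select(names, pattern):
    """Return the names that match the converted pattern."""
    regex = re.compile(saxon_style_pattern_to_regex(pattern))
    return [name for name in names if regex.search(name)]


# The six test files from the question.
FILES = ["a.txt", "b.txt", "c.txt", "e", "f", "g"]

print(saxon_style_pattern_to_regex("*.txt"))
print(select(FILES, "*.txt"))
print(select(FILES, "?"))
print(select(["a.xml", "b.xhtml", "c.txt"], "*.(xml|xhtml)"))
```

Because the characters that are not rewritten pass straight through to the regular expression, constructs such as (xml|xhtml) keep their regex meaning, while every "." in the pattern is treated as a literal dot.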

How to replace tag with some value in XML using Nokogiri

I have a predefined XML template with some tags that need to be replaced. The tag values come dynamically from the front-end.
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>AUTHOR1</author>
<title>TITLE1</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>AUTHOR2</author>
<title>TITLE2</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
</catalog>
In the above example I need to replace TITLE1, TITLE2, AUTHOR1, AUTHOR2 with the actual values dynamically.
What is the best way to do this? I am using Nokogiri in some Ruby code but have had no luck.
The basic idea is you need to search the XML for the <book> tags. For each book found, retrieve the block of values that apply to it. Find the <author> tag and replace its text. Find the <title> tag, and replace its text also. Then go to the next book.
However, in your example, writing code to do that is overkill when a simple gsub will do it in one pass:
xml = '<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>AUTHOR1</author>
<title>TITLE1</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>AUTHOR2</author>
<title>TITLE2</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
</catalog>
'
values = {
'TITLE1' => 'Moby Dick',
'AUTHOR1' => 'Herman Melville',
'TITLE2' => 'Tom Sawyer',
'AUTHOR2' => 'Mark Twain',
}
puts xml.gsub(Regexp.union(values.keys), values)
# >> <?xml version="1.0"?>
# >> <catalog>
# >> <book id="bk101">
# >> <author>Herman Melville</author>
# >> <title>Moby Dick</title>
# >> <genre>Computer</genre>
# >> <price>44.95</price>
# >> <publish_date>2000-10-01</publish_date>
# >> <description>An in-depth look at creating applications
# >> with XML.</description>
# >> </book>
# >> <book id="bk102">
# >> <author>Mark Twain</author>
# >> <title>Tom Sawyer</title>
# >> <genre>Fantasy</genre>
# >> <price>5.95</price>
# >> <publish_date>2000-12-16</publish_date>
# >> <description>A former architect battles corporate zombies,
# >> an evil sorceress, and her own childhood to become queen
# >> of the world.</description>
# >> </book>
# >> </catalog>
This form of gsub isn't used often, but I've used it many times when substituting values into templates. Using tags/keys that are guaranteed to be unique in the document is essential, so I often flag them using leading and trailing double underscores. In other words __TITLE1__, __AUTHOR1__, etc.
Doing this you can easily replace the content of the other fields, such as <genre>, <price>, etc.
Name the variables in the form the same as the keys/tags, and the task becomes even easier because you should receive a hash of field-names and field-values, which becomes the source for your hash used in the gsub.
Be sure to verify/sanitize the values before substituting. Users mistype and malicious ones can deliberately enter data in an attempt to break your code, or worse, whatever the XML is fed into.
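As a rough illustration of the sanitizing step, here is the same single-pass substitution sketched in Python rather than Ruby, with each value XML-escaped before it is inserted. The template, the __KEY__ placeholders, and the fill_template name are made up for the example; escaping covers only &, < and >, so it is a starting point and not a complete validation scheme.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical one-book template using double-underscore placeholders.
TEMPLATE = "<book><author>__AUTHOR1__</author><title>__TITLE1__</title></book>"


def fill_template(template, values):
    """Replace every placeholder in one pass, escaping XML special characters."""
    if not values:
        return template
    # Build one alternation of all keys, like Regexp.union in the Ruby version.
    pattern = re.compile("|".join(re.escape(key) for key in values))
    return pattern.sub(lambda m: escape(str(values[m.group(0)])), template)


values = {
    "__AUTHOR1__": "Herman Melville",
    "__TITLE1__": "Moby Dick <abridged> & annotated",
}
print(fill_template(TEMPLATE, values))
```

With the escaping in place, a value containing markup characters ends up as text inside the element instead of changing the structure of the document.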

Validating XML with an in-memory DTD in C using libxml2

I need to validate XML using a DTD stored in memory, i.e. something like the following:
static const char *dtd_str = "<!ELEMENT ...>";
xmlDtdPtr dtd;
dtd = xmlParseMemoryDtd(dtd_str);
The XML_PARSE_DTDVALID parser option validates against a DTD embedded in the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE some_tag[
<!ELEMENT some_tag ...>
...
]>
<some_tag>...</some_tag>
So one workaround is to modify the in-memory XML. Things become more complicated with
a parser used in "push mode". In push mode we have to detect the end of the XML
declaration (<?xml ...?>) or the start of the root element, then put our inline
DTD between them.
Could you suggest a better solution?
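For the non-push case, the string-level workaround described above could look roughly like the following Python sketch. It is not libxml2 code: embed_dtd is a hypothetical helper, it assumes the whole document is available as one string, and it assumes the document has no DOCTYPE of its own.

```python
import re

# Matches an XML declaration at the very start of the document.
XML_DECL = re.compile(r"<\?xml\s[^?]*\?>")


def embed_dtd(xml_text, root_name, declarations):
    """Insert an internal DTD subset between the XML declaration and the root element."""
    doctype = "<!DOCTYPE %s [\n%s\n]>\n" % (root_name, declarations)
    match = XML_DECL.match(xml_text)
    if match is None:
        # No XML declaration: the DOCTYPE simply goes first.
        return doctype + xml_text
    head = xml_text[:match.end()]
    rest = xml_text[match.end():].lstrip()
    return head + "\n" + doctype + rest


xml = '<?xml version="1.0" encoding="UTF-8"?>\n<some_tag>text</some_tag>'
print(embed_dtd(xml, "some_tag", "<!ELEMENT some_tag (#PCDATA)>"))
```

In push mode the same decision has to be made on the first chunk before it is handed to the parser, which is the complication mentioned above.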
EDIT
A workaround is to validate the parsed XML a posteriori, as Daniel(_DV) suggested below.
Example: main.c, response.xml.
But I was searching for a way to "embed" a DTD and validate the XML "on-the-fly" while libxml2 parses it chunk by chunk.
The following approach doesn't work for me:
xmlCtxtUseOptions(ctxt, XML_PARSE_NOENT | XML_PARSE_NOWARNING | XML_PARSE_DTDVALID);
ctxt->sax->internalSubset = ngx_http_file_chunks_sax_internal_subset;
ctxt->sax->externalSubset = NULL;
$ ./parsexml
validity error : Validation failed: no DTD found !
<response>
^
Document is not valid
xmlValidateDtd lets you do DTD validation a posteriori on an already parsed XML document
to make sure it validates against the DTD. This will not use the internal subset...
http://xmlsoft.org/html/libxml-valid.html#xmlValidateDtd
See xmllint.c code in libxml2 for a full example of how to use it,
Daniel

(Kiss)XML xpath and default namespace

I am working on an iPhone project that needs to parse some XML. The XML may or may not include a default namespace. I need to know how to parse the XML when it uses a default namespace. As I need to both read and write XML, I'm leaning towards using KissXML, but I'm open to suggestions.
This is my code:
NSString *content = [NSString stringWithContentsOfFile:[[NSBundle mainBundle]
pathForResource:@"bookstore" ofType:@"xml"] encoding:NSUTF8StringEncoding error:nil];
DDXMLDocument *theDocument = [[DDXMLDocument alloc] initWithXMLString:content options:0 error:nil];
NSArray *results = [theDocument nodesForXPath:@"//book" error:nil];
NSLog(@"%d", [results count]);
It works as expected on this xml:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
</book>
</bookstore>
But when the xml has a namespace, like this, it stops working:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore xmlns="[INSERT RANDOM NAMESPACE]">
<book category="COOKING">
<title lang="en">Everyday Italian</title>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
</book>
</bookstore>
Of course, I could just preprocess the string and remove the xmlns, though that feels like a sort of ugly hack. What is the proper way to handle this?
The Clean Way: Querying for the Namespace
You can use two XPath queries: the first fetches the namespace so that you can register it, and the second is the query you already have, with the namespace prefix added. I can only help you with the query, but it seems you're quite familiar with namespaces and how to register them in the KissXML framework:
namespace-uri(/*)
This expression selects the child elements of the document root, which by the XML definition is a single root element, and returns its namespace URI.
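The two-step idea (find out the root's namespace first, then query with it) can be sketched in Python with the standard library's xml.etree.ElementTree. This is an illustration only and not KissXML code: ElementTree does not evaluate namespace-uri(/*), so reading the namespace from the root tag stands in for that query, and the prefix "ns", the function name, and the sample namespace URI are arbitrary choices for the example.

```python
import xml.etree.ElementTree as ET


def find_books(xml_text):
    """Find all book elements whether or not the document uses a default namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{uri}local"; no brace means no namespace.
    if root.tag.startswith("{"):
        uri = root.tag[1:].split("}", 1)[0]
        # Step 2: bind the discovered namespace to a prefix and use it in the path.
        return root.findall(".//ns:book", {"ns": uri})
    return root.findall(".//book")


PLAIN = "<bookstore><book/><book/></bookstore>"
NAMESPACED = '<bookstore xmlns="urn:example:books"><book/><book/></bookstore>'

print(len(find_books(PLAIN)), len(find_books(NAMESPACED)))  # prints "2 2"
```

The same function handles both variants of the document, which is the situation described in the question.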
The Ugly Way: Only Testing for Local Name
It seems KissXML only supports XPath 1.0. With this less-capable language version, you need to use wildcard selectors at each axis step and compare the local name (without namespace prefix) inside the predicate:
//*[local-name(.) = 'book']
Starting from XPath 2.0, you could query using the namespace wildcard, which is much shorter:
//*:book
According to this comment KissXML implements "correct" behaviour while NSXML doesn't. Which doesn't exactly help. There is a proposed fix for this waiting to be merged. [edit] 11/2021 - still waiting to be merged!
Expanding on the accepted answer's first proposed solution the workaround I found was to rename the default namespace and then use that prefix in my XPath queries. Something like:
DDXMLNode *defaultNamespace = [document.rootElement namespaceForPrefix:@""];
defaultNamespace.name = @"default";
NSArray *xmlNodes = [[document rootElement] nodesForXPath:@"//default:foo/default:bar" error:nil];
This seems cleaner to me than textual processing of the file. You could of course check and handle namespace collisions but the above should work in most simple cases.
