How to get over this SXXP0003 parser error? - xml-parsing

I have an XML document whose prolog looks like this:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
...
This XML document is valid against an external DTD, and that DTD file starts with the exact same prolog:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<!DOCTYPE root [
...
]>
When I transform using Saxon (latest release):
$:/opt/tomcat/webapps/ROOT/$ java net.sf.saxon.Transform -s:pandora.xml -xsl:pandora.xsl -o:pandora.html
Error on line 1 column 53 of pandora.dtd:
SXXP0003 Error reported by XML parser: No more pseudo attributes are allowed.: No more
pseudo attributes are allowed.
org.xml.sax.SAXParseException; systemId: file:/opt/tomcat/webapps/ROOT/fred/pandora/dtd/pandora.dtd; lineNumber: 1; columnNumber: 53; No more pseudo attributes are allowed.
I am a newbie, and my research so far has only turned up advice about listing the pseudo-attributes in the right order, which they already are. If anybody has a clue, I would appreciate it.
Edit
I have run other transformations through the same process in other projects without any problem. The only difference is that in this problematic application I use the additional exsl namespace to get a function not provided in XSLT 1.0 (node-set). Everything else is similar.

For the external subset of a DTD, the specification defines the format in https://www.w3.org/TR/xml/#NT-extSubset as
extSubset ::= TextDecl? extSubsetDecl
extSubsetDecl ::= ( markupdecl | conditionalSect | DeclSep)*
, for the "Text Declaration" in https://www.w3.org/TR/xml/#sec-TextDecl as TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>' so a standalone "pseudo" attribute is indeed not allowed there.
So make sure that your external DTD file does not start with the full document prolog (the standalone pseudo-attribute is what the parser is complaining about) and does not repeat the <!DOCTYPE root [...]> wrapper; it is just meant to contain markup declarations, e.g. for elements and attributes.
In any case, the error message comes from the XML parser itself and is not transformation/XSLT related.
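As a sketch of what that means in practice (the declarations below are placeholders, not the real content of pandora.dtd), the external DTD file starts either with no text declaration at all, or with one that carries only version and encoding, followed directly by the markup declarations:
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT root (item*)>
<!ELEMENT item (#PCDATA)>
The <!DOCTYPE root ...> declaration that references this file belongs in the XML document, not in the .dtd file.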

Related

SAXParseException when using restassured

I am trying to verify a XML response with rest-assured like this:
.then().body("some.xml.path", is("abc"));
However, what I get is a SAXParseException:
DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.]
Response starts like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE cXML SYSTEM "http://xml.cXML.org/schemas/cXML/1.2.021/cXML.dtd">
<cXML ...
Why am I getting this exception? What should I change?
I am using version 3.2.0 of rest-assured.
A similar question has been answered here. In short, that answer describes using disableLoadingOfExternalDtd() to have REST Assured ignore the Document Type Definition in your XML.
Normally, the DTD would describe (via the external definition) the structural layout of the cXML element.
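A minimal sketch of how that could be wired in, assuming REST Assured 3.x (io.restassured packages); the test class and endpoint below are hypothetical:
import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.is;

import io.restassured.RestAssured;
import io.restassured.config.XmlConfig;

public class CxmlResponseTest {
    public void verifiesCxmlBody() {
        given()
            // Ask the XML parser not to load the external DTD referenced by the DOCTYPE.
            .config(RestAssured.config()
                .xmlConfig(XmlConfig.xmlConfig().disableLoadingOfExternalDtd()))
        .when()
            .get("/cxml-endpoint")               // hypothetical endpoint
        .then()
            .body("some.xml.path", is("abc"));
    }
}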

Saxon 9.8: Which patterns are supported in EXPath File Module function file:list?

Good afternoon,
I am working with Java Saxon 9.8.0.4. I would like to use the EXPath File Module function "file:list" with its third "pattern" parameter, but I am not sure which style of pattern is supported.
I have read both the Saxon documentation and the EXPath documentation, but I still do not know which patterns are supported in Saxon 9.8.0.4. It would be great if regular expressions were supported, but I understand that is overkill for most users. I tried several blind tests, but only the * and ? wildcards work for me, as defined in the EXPath documentation.
Yes, I can quite easily do regexp post-processing in a for-each, but knowing more about the list function would help.
Thank You in advance for Your help, Stepan
P.S.: My use case is to get all files without an extension ("test" and not "test.txt") recursively from a large and deep directory structure and process all matching files with XSLT 3.0. Many of those files have identical file names, so I cannot do a "copy to one folder" pre-processing step for Saxon's one-time -s:directory -o:directory invocation, and invoking Java (Saxon) once per file is of course a terrible time overhead. So I would like to read all matching files into a sequence and process each item of that sequence using for-each (the files are text files and I read them with unparsed-text). And no, GAWK is not a solution, as I already have the whole XML-to-SQL transformation infrastructure in XSLT, because 95 % of the files are XMLs.
--ADDED code and explanation below:
Example of my test files.
XML file "a.xml":
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="a.xsl"?>
<root/>
XSL-T file "a.xsl":
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:saxon="http://saxon.sf.net/"
xmlns:expathFile="http://expath.org/ns/file"
exclude-result-prefixes="xs saxon"
version="3.0">
<xsl:output method="text" />
<xsl:template match="/root">
<xsl:variable name="list" select="expathFile:list('C:\temp\temp\test\', false(), '^.*$')"/>
<xsl:for-each select="$list">
<xsl:value-of select="."/>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
My folder "C:\temp\temp\test\" contains 6 test files: "a.txt", "b.txt", "c.txt", "e", "f", "g".
But after testing with the online Java RegExp tester at http://www.regexplanet.com/advanced/java/index.html I have found that the problem is solely on my side, because Java regular expressions behave a little differently from PCRE (Perl), sed, and gawk regular expressions. So it is my fault and I need to learn Java regular expressions.
Saxon uses the same code for this pattern as for the filter in select="pattern" in collection URIs, which is described at http://www.saxonica.com/documentation/index.html#!sourcedocs/collections
Extracting the relevant details:
The pattern used in the select parameter can use glob-like syntax, for
example *.xml selects all files with extension "xml". More generally,
the pattern is converted to a regular expression by prepending "^",
appending "$", replacing "." by "\.", "*" by ".*", and "?" by ".?",
and it is then used to match the file names appearing in the directory
using the Java regular expression rules. So, for example, you can
write ?select=*.(xml|xhtml) to match files with either of these two
file extensions. Note however, that special characters used in the URL
(that is, characters such as backslash and curly braces that are not
allowed in the query part of a URI) must be escaped using the %HH
convention. For example, vertical bar needs to be written as %7C. This
escaping can be achieved using the encode-for-uri() function.
Note that Saxon's collection() function now also supports match=pattern in the URI, where the pattern is a standard XPath 3.1 regular expression.
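As a small illustration of the documented rewriting (a re-statement of the rule above in plain Java, not Saxon's actual implementation):
import java.util.regex.Pattern;

public class GlobToRegexDemo {
    // Documented rule: prepend "^", append "$", replace "." with "\.",
    // "*" with ".*", and "?" with ".?".
    static String globToRegex(String glob) {
        String body = glob.replace(".", "\\.")
                          .replace("*", ".*")
                          .replace("?", ".?");
        return "^" + body + "$";
    }

    public static void main(String[] args) {
        String regex = globToRegex("*.xml");                  // ^.*\.xml$
        System.out.println(Pattern.matches(regex, "a.xml"));  // true
        System.out.println(Pattern.matches(regex, "a"));      // false
    }
}
Because the pattern goes through this rewriting, a "." in a hand-written regular expression is escaped before the Java regex engine ever sees it, which is consistent with only the * and ? wildcards behaving as expected in the tests described in the question.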

Allure reporting: if suite name contains dot, suite name in reporting is truncated

If the suite name contains a dot, the suite name in the report is truncated.
The XML seems to have the proper name, as shown below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns2:test-suite xmlns:ns2="urn:model.allure.qatools.yandex.ru" start="1424101948829" stop="1424102027462" version="1.4.4">
<name>SampleSuite.xml</name>
<test-cases>
<test-case start="1424101949080" stop="1424101952336" status="passed">
....
...
But the HTML report (generated using allure.bat generate allure-results -v 1.4.0) shows the name truncated, as below:
This does not look like a bug. Allure extracts the simple class name out of the fully qualified one. Suite names usually look like "my.company.SimpleTest", and in that case the current behavior is preferable.
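To illustrate the effect (this is only a guess at the mechanism, not Allure's actual code): extracting the "simple class name" keeps whatever follows the last dot, so a suite named "SampleSuite.xml" ends up displayed as just "xml":
String suiteName = "SampleSuite.xml";
String displayed = suiteName.substring(suiteName.lastIndexOf('.') + 1);  // yields "xml"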

Validating XML with an in-memory DTD in C using libxml2

I need to validate XML using DTD stored in memory, i.e. something like the following:
static const char *dtd_str = "<!ELEMENT ...>";
xmlDtdPtr dtd;
dtd = xmlParseMemoryDtd(dtd_str);
The XML_PARSE_DTDVALID parser option allows validating against a DTD embedded in the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE some_tag[
<!ELEMENT some_tag ...>
...
]>
<some_tag>...</some_tag>
So a workaround is to modify the in-memory XML. Things become more complicated with a parser used in "push mode": there we have to detect the end of the XML declaration (<?xml ...?>) or the start of the root element, and then insert our inline DTD between them.
Could you suggest better solution?
EDIT
A workaround is to validate the parsed XML a posteriori, as Daniel (_DV) suggested below.
Example: main.c, response.xml.
But I was searching for a way to "embed" a DTD and validate the XML "on the fly" while libxml2 parses the XML chunk by chunk.
The following approach doesn't work for me:
xmlCtxtUseOptions(ctxt, XML_PARSE_NOENT | XML_PARSE_NOWARNING | XML_PARSE_DTDVALID);
ctxt->sax->internalSubset = ngx_http_file_chunks_sax_internal_subset;
ctxt->sax->externalSubset = NULL;
$ ./parsexml
validity error : Validation failed: no DTD found !
<response>
^
Document is not valid
xmlValidateDtd allows doing DTD validation a posteriori on an already parsed XML document, to make sure it validates against the DTD. This will not use the internal subset...
http://xmlsoft.org/html/libxml-valid.html#xmlValidateDtd
See the xmllint.c code in libxml2 for a full example of how to use it.
Daniel

UniVerse XDOM Trouble with Default Namespaces

I'm trying to use the XDOM functions of UniVerse to parse an XML file, but I can't get it to correctly parse XML that uses a default namespace. It can correctly handle XML without namespaces, or with named namespaces, but if there is a default namespace, all XPaths fail to locate the nodes they are supposed to match.
To give a simple example, I'm trying to parse this XML:
<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore xmlns="http://www.example.com">
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="WEB">
<title lang="en">Learning XML</title>
<author>Erik T. Ray</author>
<year>2003</year>
<price>39.95</price>
</book>
</bookstore>
With this code:
PROGRAM XDOM.TEST
$INCLUDE SYSCOM XML.H
OPEN "XML" TO F.XML ELSE STOP "OPEN FAILED"
READ XML FROM F.XML, 'TEST.xml' ELSE STOP "READ FAILED"
EXIT.PROG = #FALSE
CONVERT #FM TO CHAR(10) IN XML
IF NOT(EXIT.PROG) AND XDOMOpen(XML, XML.FROM.STRING, XDOM) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMLocate(XDOM, '/bookstore/book[#category="CHILDREN"]', 'xmlns=http://www.example.com', XNODE) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) AND XDOMEvaluate(XNODE, './author', 'xmlns=http://www.example.com', AUTHOR) # XML.SUCCESS THEN GOSUB XML.ERR
IF NOT(EXIT.PROG) then PRINT AUTHOR
STOP
XML.ERR:
XML.CODE = ''
XML.ERR = ''
EXIT.PROG = #TRUE
IF XMLGetError(XML.CODE, XML.ERR) = XML.SUCCESS THEN
PRINT XML.CODE
PRINT XML.ERR
END
RETURN
END
When I run this code as is, I get the output:
10
The location path '/bookstore/book[#category="CHILDREN"]' was not found.
However, if I remove the "xmlns=http://www.example.com" namespace, it works fine.
After searching on my own, I found out that this is actually a bug in UniVerse's XDOM parser itself. Someone has kindly documented a work-around here: you can "fool" the parser into working by giving a name to the default namespace. They also note that you can't use double quotes in the namespace map either.
Personally, I prefer the simpler solution of just manually stripping out the offending namespace before parsing it. Adding this line to the above program fixes the issue, albeit in a very hacky way:
XML = CHANGE(XML, ' xmlns="http://www.example.com"', '')
This way you don't have to put unnecessary prefixes on all your XPath steps. This may not always be an option, though, depending on how you are getting the XDOM handle.
