Handling Invalid XML with Nokogiri::XML::Reader - xml-parsing

I have found the Nokogiri XML reader to be strict about XML syntax: if it encounters an invalid character in the XML, such as an unescaped ampersand (e.g. <tag> Garage & Driveway </tag>), it throws an error.
So when I use the reader as follows:
Nokogiri::XML::Reader(infile).each do |node|
# does stuff with node
end
it throws the error:
Entity: line 1056614: parser error : xmlParseEntityRef: no name
<tag>The & is invalid</tag>
^
transmogrifier/gems/nokogiri-1.5.5/lib/nokogiri/xml/reader.rb:106:in `each'
With XML such as this:
<root>
<items>
<tag>The & is invalid</tag>
</items>
<items> ... </items>
</root>
This happens midway through parsing a large document. I've noticed that Nokogiri::XML::Parser handles this (more) gracefully and removes all invalid characters, which gives me hope for a more graceful solution.
Ideally, I would love to be able to catch the error and continue with the each loop (as very few items have invalid characters). Any suggestions on how to handle this gracefully?
I've noticed you can pass in ParseOptions, but I haven't had any luck using those.
Thanks in advance!

Switching from Nokogiri::XML to Nokogiri::HTML, which is much more forgiving of XML errors, will probably help.

Related

Xcode Info.plist error: Consecutive statements on a line must be separated by ';' but looks like valid XML

Why is Xcode giving me errors on my Info.plist file?
The first error is on the first line (which I had nothing to do with writing.)
The first line is:
<?xml version="1.0" encoding="UTF-8"?>
and the error given is:
Consecutive statements on a line must be separated by ';'
If I hit the "Fix" button on the error message it inserts a ; right after "1.0" but I don't think this is right because the way I have it is how I see it in every online example. Plus it goes on to complain about '$' is not an identifier; use backticks to escape it. And fixing those causes more problems. Hopefully this makes sense to someone with more knowledge than me.

SearchIO.parse xml blast and ampersands cElementTree.ParseError: not well-formed (invalid token) error

I would like some advice to work around an xml parsing error. In my BLAST xml output, I have a description that has an '&' character which is throwing off the SearchIO.parse function.
If I run
from Bio import SearchIO

qresults = SearchIO.parse(PLAST_output, "blast-xml")
for record in qresults:
    pass  # do some stuff
I get the following error:
cElementTree.ParseError: not well-formed (invalid token): line 13701986, column 30
Which directs me to this line:
<Hit_def>Lysosomal & prostatic acid phosphatases [Xanthophyllomyces dendrorhous</Hit_def>
Is there a way to override this in biopython so I do not have to change my xml file? Right now, I'm just doing a 'Try/Except' loop, but that is not optimal!
Thanks for your help!
Courtney
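One possible workaround, sketched here under assumptions rather than taken from Biopython itself, is to write a cleaned copy of the file in which stray ampersands are escaped, and to parse that copy instead. The names `BARE_AMP`, `escape_bare_ampersands` and `cleaned_copy` are hypothetical helpers, and the sample data is made up for illustration. The sketch assumes the file contains no CDATA sections, where a literal `&` is legal and must not be escaped.

```python
import io
import re
import xml.etree.ElementTree as ET

# An ampersand that is NOT the start of a named, decimal, or hex reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#\d+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace stray '&' with '&amp;', leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


def cleaned_copy(src, dst):
    """Copy text lines from src to dst, escaping stray ampersands.

    Works line by line, so a very large file never has to fit in memory.
    """
    for line in src:
        dst.write(escape_bare_ampersands(line))


# Made-up sample standing in for a fragment of BLAST XML output.
raw = io.StringIO(
    "<Hit>\n"
    "<Hit_def>Lysosomal & prostatic acid phosphatases</Hit_def>\n"
    "<Hit_id>a&amp;b</Hit_id>\n"
    "</Hit>\n"
)
fixed = io.StringIO()
cleaned_copy(raw, fixed)

# The cleaned text is now well-formed and parses without error.
root = ET.fromstring(fixed.getvalue())
hit_def = root.find("Hit_def").text
hit_id = root.find("Hit_id").text
```

With real data, `src` and `dst` would be file handles opened in text mode, and the cleaned file could then be given to SearchIO.parse in place of the original. This leaves the original output untouched, at the cost of one extra pass over the file.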

Validate XML against schematron using SAXON EE edition

I am evaluating SAXON EE edition to validate XML against xsd and schematron.
Can someone help me in resolving the following queries:
1. While validating an XML document against an XSD, can we also get the XPath of the error node along with the errors in plain text? Currently I am getting the error only.
2. Can we validate XML against Schematron using the Saxon EE version? Any code sample would be a great help.
Thanks.
1. While validating xml document against xsd, can we also get xpath of that error node.
Yes, the error information includes an XPath reference to the invalid node (in most cases: there are some cases such as duplicate IDs where there isn't one specific node in error).
If you generate an XML validity report using SchemaValidator.SetValidityReporting() then the resulting report will include the path information. Here's an example:
<?xml version="1.0" encoding="UTF-8"?>
<validation-report xmlns="http://saxon.sf.net/ns/validation"
system-id="file:/Users/mike/repo2/samples/data/books-invalid.xml">
<error line="3"
column="17"
path="/Q{}BOOKLIST[1]/Q{}BOOKS[1]/@x"
xsd-part="1"
constraint="cvc-complex-type.3">Attribute @x is not allowed on element &lt;BOOKS&gt;</error>
<error line="10"
column="17"
path="/Q{}BOOKLIST[1]/Q{}BOOKS[1]/Q{}ITEM[1]/Q{}PRICE[1]"
xsd-part="2"
constraint="cvc-datatype-valid.1">The content "$0.2" of element &lt;PRICE&gt; does not match the required simple type. Cannot convert string to decimal: $0.2</error>
<error line="21"
column="20"
path="/Q{}BOOKLIST[1]/Q{}BOOKS[1]/Q{}ITEM[2]/Q{}PUB-DATE[1]"
xsd-part="2"
constraint="cvc-datatype-valid.1">The content "2002-02-31" of element &lt;PUB-DATE&gt; does not match the required simple type. Invalid date "2002-02-31" (Non-existent date)</error>
<error line="42"
column="22"
path="/Q{}BOOKLIST[1]/Q{}BOOKS[1]/Q{}ITEM[3]/Q{}REPUTATION[1]"
xsd-part="1"
constraint="cvc-complex-type.2.4">In content of element &lt;ITEM&gt;: The content model does not allow element &lt;REPUTATION&gt; to appear immediately after element &lt;WEIGHT&gt;. No further elements are allowed at this point. </error>
<meta-data>
<validator name="SAXON-EE" version="9.8.0.9"/>
<results errors="4" warnings="0"/>
<schema file="books.xsd" xsd-version="1.1"/>
<run at="2018-03-07T15:22:04.847Z"/>
</meta-data>
</validation-report>
You can also get the information if you supply an IInvalidityHandler as a callback to the SchemaValidator, though this requires a bit more digging. Saxon calls your IInvalidityHandler supplying a StaticError object (which is a bit of a misnomer). The StaticError object doesn't have the path information directly available, but it contains a reference to an XPathException object which can be cast to a ValidationException, and ValidationException has a method getPath() which returns this information if available.
2. Can we validate xml against schematron.
Saxon doesn't include a schematron validator per se, though many of the third-party tools that do schematron validation make use of Saxon "under the hood". I'm not up-to-date with the situation on .NET - but essentially there are two kinds of Schematron processor: those that generate XSLT code from the schematron schema (which typically use Saxon both to generate the XSLT and to execute it), and "native" processors. Searching for "schematron on .NET" gives you quite a number of projects, but I have no idea of their current status or quality.

bad argument in call to crypto:aes_cfb_128_crypt

This is the code snippet at line 461, which is giving a badarg error. Please help me solve it.
ejabberd_odbc:escape(base64:encode(crypto:aes_cfb_128_encrypt(<<"abcdefghabcdefgh">>, <<"12345678abcdefgh">>, xml:element_to_binary(NewPacket)))),
Log:
bad argument in call to crypto:aes_cfb_128_crypt(<<"abcdefghabcdefgh">>, <<"12345678abcdefgh">>, <<">, true) in mod_offline:'-store_offline_msg/6-fun-2-'/2 line 225
One of the things I like about functional languages is that you generally have an easier time reproducing errors in a controlled environment. In your case, it seems like
base64:decode(XML)
is the call that's failing, so you should write
io:format("XML=~p~n", [XML]),
base64:decode(XML)
The first line will print out the contents of XML in Erlang syntax, and the second line will fail when it gets to the bad input.
Once you see the string you're trying to decode, the problem will probably be obvious (it's not a string or it's not a base64 string). If it is a correctly-encoded base64 string, then you can post that problem as a StackOverflow question and get a more useful response.

NSXMLParser fails with NSXMLParserErrorDomain error 111; error 111 isn't defined?

A few people seem to have run into NSXMLParser error 111 before, but it's not defined in the constants. This answer seems to have mistaken 111 for 11: NSXMLParserErrorDomain 111
As far as I can tell, I have no illegal characters in my final xml:
<?xml version="1.0" encoding="utf-16"?><wsse:BinarySecurityToken wsu:Id="uuid:383b6148-1c27-45ab-963b-30e14af8154e" ValueType="http://schemas.xmlsoap.org/ws/2009/11/swt-token-profile-1.0" EncodingType="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-soap-message-security-1.0#Base64Binary" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">aHR0cCUzYSUyZiUyZnNjaGVtYXMueG1sc29hcC5vcmclMmZ3cyUyZjIwMDUlMmYwNSUyZmlkZW50aXR5JTJmY2xhaW1zJTJmbmFtZWlkZW50aWZpZXI9V3dEM2ozRzBobjE0MWFndkNWJTJmWERadmgwJTJiQ0xHV1hBblRLTmM4Qjc3N1UlM2QmaHR0cCUzYSUyZiUyZnNjaGVtYXMubWljcm9zb2Z0LmNvbSUyZmFjY2Vzc2NvbnRyb2xzZXJ2aWNlJTJmMjAxMCUyZjA3JTJmY2xhaW1zJTJmaWRlbnRpdHlwcm92aWRlcj11cmklM2FXaW5kb3dzTGl2ZUlEJkF1ZGllbmNlPWh0dHBzJTNhJTJmJTJma21haW4ta2RzLWV1czItMC5jbG91ZGFwcC5uZXQlMmYmRXhwaXJlc09uPTEzOTQ3NjExODEmSXNzdWVyPWh0dHBzJTNhJTJmJTJmdG9sZWRvLmFjY2Vzc2NvbnRyb2wud2luZG93cy5uZXQlMmYmSE1BQ1NIQTI1Nj1iVTg4cWs2OFc3bmFxOEZFam1EVUFWSlQySzZ5cCUyYkxmdGR4SlFlWDhsYXclM2Q=</wsse:BinarySecurityToken>
I've also tried changing the encoding to utf-8, but it made no difference. What causes a parser to fail with error 111? Is the parser not set up correctly, or is the XML killing it?
In NSXMLParserError docs, it says:
The following error codes are defined by NSXMLParser. For error codes not listed here, see the <libxml/xmlerror.h> header file.
The number 111 isn't mentioned in this list, so we go to /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml/xmlerror.h, and find the value:
XML_ERR_USER_STOP, /* 111 */
There isn't a lot of documentation on XML_ERR_USER_STOP in libxml2, but from reading the changeset, it looks like it's a fast-fail when the parser sees an unexpected EOF.
Turns out I was simply passing in an entirely wrong string. The XML chunk came from a larger JSON structure, which I then processed down to get only the XML part; yet when I initialized the parser, I used the wrong string to create the NSData. So make sure you aren't mixing variables up. I'm still not sure why error 111 isn't defined in the documentation, though.
