SearchIO.parse xml blast and ampersands cElementTree.ParseError: not well-formed (invalid token) error - biopython

I would like some advice on working around an XML parsing error. In my BLAST XML output, I have a description that contains an '&' character, which is throwing off the SearchIO.parse function.
If I run
from Bio import SearchIO

qresults = SearchIO.parse(PLAST_output, "blast-xml")
for record in qresults:
    pass  # do some stuff
I get the following error:
cElementTree.ParseError: not well-formed (invalid token): line 13701986, column 30
Which directs me to this line:
<Hit_def>Lysosomal & prostatic acid phosphatases [Xanthophyllomyces dendrorhous</Hit_def>
Is there a way to override this in Biopython so I do not have to change my XML file? Right now, I'm just wrapping the loop in a try/except block, but that is not optimal!
Thanks for your help!
Courtney
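One possible workaround, sketched here under assumptions: an unescaped '&' is not well-formed XML, so any strict parser will reject it, and the usual fix is to escape the stray ampersands before parsing. The helper name `escape_bare_ampersands` and the regular expression below are hypothetical, not part of Biopython. The sketch uses only the standard library and checks the result with `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT begin a named, decimal, or hexadecimal
# entity reference (e.g. '&amp;', '&#38;', '&#x26;' are left alone).
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace every stray '&' with '&amp;', keeping existing entities intact."""
    return BARE_AMP.sub("&amp;", text)


# Shortened version of the failing line, for illustration only.
line = "<Hit_def>Lysosomal & prostatic acid phosphatases</Hit_def>"
fixed = escape_bare_ampersands(line)

# The escaped line now parses, and the text comes back with a plain '&'.
element = ET.fromstring(fixed)
print(element.text)
```

For a file with millions of lines, the same function could be applied line by line while writing a cleaned copy, and the copy's path passed to SearchIO.parse. That leaves the original file untouched, although the parser still reads modified data.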

Related

bad argument in call to crypto:aes_cfb_128_crypt

This is the code snippet at line 461 that raises the badarg error. Please help me solve it.
ejabberd_odbc:escape(base64:encode(crypto:aes_cfb_128_encrypt(<<"abcdefghabcdefgh">>, <<"12345678abcdefgh">>, xml:element_to_binary(NewPacket)))),
Log:
bad argument in call to crypto:aes_cfb_128_crypt(<<"abcdefghabcdefgh">>, <<"12345678abcdefgh">>, <<">, true) in mod_offline:'-store_offline_msg/6-fun-2-'/2 line 225
One of the things I like about functional languages is that you generally have an easier time reproducing errors in a controlled environment. In your case, it seems like
base64:decode(XML)
is the call that's failing, so you should write
io:format("XML=~p~n", [XML]),
base64:decode(XML)
The first line will print out the contents of XML in Erlang syntax, and the second line will fail when you get to the bad input.
Once you see the string you're trying to decode, the problem will probably be obvious (it's not a string or it's not a base64 string). If it is a correctly-encoded base64 string, then you can post that problem as a StackOverflow question and get a more useful response.

XMLStreamException on import-graphml

I exported the neo4j-database in graphml using neo4j-shell-tools format but while importing back the database at the production server I am getting the following error.
XMLStreamException: ParseError at [row,col]:[2542885,95] Message: An
invalid XML character (Unicode: 0x8) was found in the element content
of the document.
But there is no such character on line number 2542885.
I even deleted this line using sed -i (2542885d) but I am still getting the same error at the same line while importing. Strange.
It seems the line number that sed refers to is not the same as the line at which the error is being thrown.
Please help; I have spent a day trying to resolve this error, without success.
Thank you. Error resolved.
I used xmllint, which gave the same error at another line number, and replacing that unicode character resolves the issue.
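As a rough illustration of that fix, here is a minimal Python sketch that locates and strips characters not permitted by XML 1.0, such as the 0x8 in the error above. The function names are hypothetical, and lines are counted by splitting on "\n" only, which is an assumption about how the reporting tool counts them.

```python
import re

# XML 1.0 allows tab (0x9), LF (0xA), CR (0xD) and characters from 0x20 up,
# excluding surrogates and U+FFFE/U+FFFF. Everything matched here is invalid.
INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def find_invalid_chars(text):
    """Return (line, column, code point) for each invalid character, 1-based."""
    hits = []
    # Split on "\n" only: str.splitlines() would also break on \x0b and \x0c,
    # which are among the characters being searched for.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            hits.append((line_no, match.start() + 1, hex(ord(match.group()))))
    return hits


def strip_invalid_chars(text):
    """Remove every character that XML 1.0 does not allow."""
    return INVALID_XML_CHARS.sub("", text)


sample = "<data>ok</data>\n<data>bad\x08value</data>\n"
print(find_invalid_chars(sample))   # [(2, 10, '0x8')]
print(strip_invalid_chars(sample))
```

Reporting the position first makes it easier to compare against the row and column in the parser's message before anything is deleted.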

NSXMLParser fails with NSXMLParserErrorDomain error 111; error 111 isn't defined?

A few people seem to have run into NSXMLParser error 111 before, but it's not defined in the constants. This answer seems to have mistaken 111 with 11: NSXMLParserErrorDomain 111
As far as I can tell, I have no illegal characters in my final xml:
<?xml version="1.0" encoding="utf-16"?><wsse:BinarySecurityToken wsu:Id="uuid:383b6148-1c27-45ab-963b-30e14af8154e" ValueType="http://schemas.xmlsoap.org/ws/2009/11/swt-token-profile-1.0" EncodingType="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-soap-message-security-1.0#Base64Binary" xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" xmlns:wsse="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd">aHR0cCUzYSUyZiUyZnNjaGVtYXMueG1sc29hcC5vcmclMmZ3cyUyZjIwMDUlMmYwNSUyZmlkZW50aXR5JTJmY2xhaW1zJTJmbmFtZWlkZW50aWZpZXI9V3dEM2ozRzBobjE0MWFndkNWJTJmWERadmgwJTJiQ0xHV1hBblRLTmM4Qjc3N1UlM2QmaHR0cCUzYSUyZiUyZnNjaGVtYXMubWljcm9zb2Z0LmNvbSUyZmFjY2Vzc2NvbnRyb2xzZXJ2aWNlJTJmMjAxMCUyZjA3JTJmY2xhaW1zJTJmaWRlbnRpdHlwcm92aWRlcj11cmklM2FXaW5kb3dzTGl2ZUlEJkF1ZGllbmNlPWh0dHBzJTNhJTJmJTJma21haW4ta2RzLWV1czItMC5jbG91ZGFwcC5uZXQlMmYmRXhwaXJlc09uPTEzOTQ3NjExODEmSXNzdWVyPWh0dHBzJTNhJTJmJTJmdG9sZWRvLmFjY2Vzc2NvbnRyb2wud2luZG93cy5uZXQlMmYmSE1BQ1NIQTI1Nj1iVTg4cWs2OFc3bmFxOEZFam1EVUFWSlQySzZ5cCUyYkxmdGR4SlFlWDhsYXclM2Q=</wsse:BinarySecurityToken>
I've also tried changing the encoding to utf-8, but it made no difference. What causes a parser to fail with error 111? Is the parser not set up correctly, or is the XML killing it?
In NSXMLParserError docs, it says:
The following error codes are defined by NSXMLParser. For error codes not listed here, see the <libxml/xmlerror.h> header file.
The number 111 isn't mentioned in this list, so we go to /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.11.sdk/usr/include/libxml2/libxml/xmlerror.h, and find the value:
XML_ERR_USER_STOP, /* 111 */
There isn't a lot of documentation on XML_ERR_USER_STOP in libxml2, but from reading the changeset, it looks like it's a fast-fail when the parser sees an unexpected EOF.
Referred DebugCN and Internet.
Turns out I was simply passing in an entirely wrong string. The XML chunk came from a larger JSON structure, which I then processed down to get only the XML part; yet when I initialized the parser, I used the wrong string to create the NSData. So make sure you aren't mixing variables up. I'm still not sure why error 111 isn't defined in the documentation, though.

Handling Invalid XML with Nokogiri::XML::Reader

I have found the Nokogiri XML reader to be strict about XML syntax: if it encounters an invalid character within the XML, such as a non-escaped ampersand (e.g. <tag> Garage & Driveway </tag>), it throws an error.
So when I use the reader as follows:
Nokogiri::XML::Reader(infile).each do |node|
  # does stuff with node
end
it throws this error:
Entity: line 1056614: parser error : xmlParseEntityRef: no name
<tag>The & is invalid</tag>
^
transmogrifier/gems/nokogiri-1.5.5/lib/nokogiri/xml/reader.rb:106:in `each'
With XML such as this:
<root>
  <items>
    <tag>The & is invalid</tag>
  </items>
  <items> ... </items>
</root>
The error occurs midway through parsing a large document. I've noticed that Nokogiri::XML::Parser handles this (more) gracefully and removes all invalid characters, which gives me hope for a more graceful solution.
Ideally, I would love to be able to catch the error and continue with the each parsing (as very few items have invalid characters). Any suggestions on how to handle this gracefully?
I've noticed you can pass in ParseOptions, but I haven't had any luck using those.
Thanks in advance!
Switching from Nokogiri::XML to Nokogiri::HTML, which is much more forgiving of XML errors, will probably help.

PGError: ERROR: invalid byte sequence for encoding "UTF8"

I'm getting the following PGError while ingesting Rails emails from Cloudmailin:
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xbb HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". : INSERT INTO "comments" ("content") VALUES ('Reply with blah blah ����������������������������������������������������� .....
So it seems pretty clear I have some invalid UTF8 characters getting into the email, right? I tried to clean that up, but something is still sneaking through. Here's what I have so far:
message_all_clean = params[:message]
Iconv.conv('UTF-8//IGNORE', 'UTF-8', message_all_clean)
message_plain_clean = params[:plain]
Iconv.conv('UTF-8//IGNORE', 'UTF-8', message_plain_clean)
#incoming_mail = IncomingMail.create(:message_all => Base64.encode64(message_all_clean), :message_plain => Base64.encode64(message_plain_clean))
Any ideas, thoughts or suggestions? Thanks
When we ran into this issue on Heroku, we sanitized incoming data (e.g. text pasted from Word) by treating it as US-ASCII:
Iconv.conv("UTF-8//IGNORE", "US-ASCII", content)
With this, we had no more issues with character encoding.
Also, double-check that there are no other fields that need the same conversion, as it could affect anything that passes a block of text to the database.
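For comparison, the same sanitizing idea can be sketched in Python rather than the Ruby used in this thread. This is an illustration under assumptions, not a drop-in replacement: `clean_utf8` and `clean_ascii` are hypothetical names. The first drops only the bytes that are not valid UTF-8, while the second mirrors what the US-ASCII conversion above appears to do, dropping every non-ASCII byte.

```python
def clean_utf8(raw: bytes) -> str:
    """Decode as UTF-8, silently dropping any bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


def clean_ascii(raw: bytes) -> str:
    """Decode as ASCII, dropping every byte above 0x7F (accents are lost too)."""
    return raw.decode("ascii", errors="ignore")


# 0xbb on its own is a continuation byte, so it is invalid as the start
# of a UTF-8 sequence; this is the byte named in the PGError above.
raw = b"Reply with caf\xc3\xa9 \xbb\xbb blah"

print(clean_utf8(raw))   # keeps the valid two-byte sequence for the accented e
print(clean_ascii(raw))  # drops the accented e along with the stray bytes
```

The UTF-8 variant keeps legitimate accented text, so it loses less data. The ASCII variant is blunter but guarantees the database only ever sees 7-bit characters.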

Resources