Reading malformed XML with Nokogiri: Unescaped Ampersands in URL field - ruby-on-rails

I am trying to read an XML file from a third party with Nokogiri in my Rails project.
One of the nodes I have to parse contains a URL with unescaped ampersands (like foo.com/index.html?page=1&query=bar).
I understand that this is considered malformed XML, and Nokogiri just tries to parse it anyway, resulting in foo.com/index.html?page=1=bar.
How can I obtain the full URL? Can I tweak Nokogiri? Would you do a search-and-replace pre-run, or what would be the best practice?

I had the same issue parsing SVGs with image links containing ampersands.
Parsing the SVG as HTML first seems to handle the links correctly, escaping the bare & characters as &amp;.
# let Nokogiri's forgiving HTML parser fix the entities first
fixed_svg = Nokogiri::HTML.fragment(raw_svg).to_html
# then proceed with strict XML parsing
svg = Nokogiri::XML(fixed_svg)
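Alternatively, the search-and-replace pre-run mentioned in the question also works: escape every ampersand that does not already start a character reference before handing the string to the XML parser. A minimal sketch, assuming the feed sits in a local file (the filename, regular expression, and node name are illustrative, not from the original answer):

require 'nokogiri'

raw_xml = File.read('feed.xml')
# escape bare ampersands, leaving entities like &amp;, &#38; or &#x26; intact
fixed_xml = raw_xml.gsub(/&(?!(?:[a-zA-Z]+|#\d+|#x\h+);)/, '&amp;')
doc = Nokogiri::XML(fixed_xml)
puts doc.at('link').text  # the full URL, with &query=bar preserved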

Related

Why does Network.URI (parseURI) not like the pipe character?

I'm using the parseURI function from the network-uri package to parse some URLs. Some of these URLs contain a pipe character, and parsing fails for them. For example:
Network.URI> parseURI "http://something.com/foo|bar"
Nothing
However, these URLs are obtained from a real website and they load correctly in a web browser, so there must be some correct way of dealing with them.
Why does parsing fail on URLs with a pipe character, and what can I do to make them parse correctly?
You need to use escapeURIString before parsing. isUnescapedInURI will tell you whether a character is allowed to appear unescaped in a URI component, as mentioned in the documentation.
λ> isUnescapedInURI '|'
False
So, to properly encode and parse it:
λ> parseURI $ escapeURIString isUnescapedInURI "http://something.com/foo|bar"
Just http://something.com/foo%7Cbar
In fact, this specific corner case is well explained in the Hackage docs.

Can I leave some sections unparsed using NSXMLParser?

I have an XML document which I want to parse using NSXMLParser. One of the tags it can contain is <html>, and in my parsed representation I want the contents of that tag, verbatim. However, when I parse the document, my delegate methods are called for the start, end and contents of each tag inside the html tag.
I can't get the provider of the document to add CDATA tags; nor can I use something other than NSXMLParser to parse the document.
Is there a way for me to tell the parser to treat the contents of HTML tags as CDATA and to leave them unparsed, even if they contain other tags?
It's too bad that the owner of the XML feed won't fix it because, depending on the HTML, you may end up with a malformed XML feed. If it really is an XML document, they definitely should wrap the HTML in a CDATA section, or replace all the < characters with &lt; and all the > characters with &gt;.
Frankly, if all you need is the HTML, and all you have is an XML tag that contains the HTML without the CDATA or the appropriate character replacement, I might not be inclined to run it through NSXMLParser at all (because successful parsing is contingent on the nature of the included HTML). I'd use an NSScanner or NSRegularExpression to extract all of the text between the XML opening and closing tags that wrap your HTML.
Or, if you really want to use NSXMLParser (because there's other stuff in addition to the HTML that you need), then manually alter the NSData, wrapping the HTML in a CDATA section yourself.
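As a rough sketch of that CDATA-wrapping pre-pass (shown in Ruby, this digest's main language, since it is just a string substitution; the same idea ports directly to NSRegularExpression over the NSData's string contents):

# wrap everything between <html> and </html> in a CDATA section so an XML
# parser passes it through verbatim; assumes the HTML contains no "]]>"
xml = raw_xml.gsub(%r{<html>(.*?)</html>}m) { "<html><![CDATA[#{$1}]]></html>" }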
If, on the other hand, the document you're trying to parse really isn't XML, but rather is just HTML, then of course you shouldn't be parsing it with an XML parser. You should use an HTML parser, such as Hpple, as described in Galloway's article How to Parse HTML on iOS on the Ray Wenderlich site.

Lua: reading Chinese characters

I have the following XML feeds that I would like to read:
Chinese XML - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
Korean XML - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently, I am trying to use LuaXML to parse the XML that contains the Chinese characters. However, when I print the result to the console, the Chinese characters are not printed correctly and show up as garbage.
I would like to ask if there is any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert "中美" into Chinese characters.
I needed one additional step: converting the numeric character references in the string using the method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving into the XML format. Note that string.char only handles byte values up to 255, so CJK code points need a UTF-8-aware variant such as utf8.char (Lua 5.3+):
-- decode numeric character references like &#20013; into UTF-8 text
string.gsub(l, "&#([0-9]+);", function(c) return utf8.char(tonumber(c)) end)
Regarding LuaXML, I have come across the method xml.registerCode(decoded, encoded).
The documentation for that method says that it
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters, and how do I use it?

HTML DOM parsing and character encoding with XMLHttpRequest in a Firefox extension

I am writing a Firefox 4 bootstrapped extension.
The following is my story:
When I use @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the content of the target URL loads successfully via req.responseText.
I parse the responseText into a DOM by creating a BODY element and assigning the text to its innerHTML property.
Everything seems to work.
However, there is a problem with character encoding (charset).
Since I need the extension to detect the charset of the target documents, overriding the MIME type of the request with text/html; charset=blahblah.. does not meet my need.
I've tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but it seems that XMLHttpRequest exposes no ScriptableInputStream, or indeed any InputStream or readable stream.
I have no idea how to read the target document's content in the correct, automatically detected charset, whether by something like the browser's Auto-Detect Character Encoding feature or by the charset declared in the document's head meta tag.
EDIT: Would it be practical to parse the whole document, including the HTML, HEAD and BODY tags, into a DOM object, but without loading external resources such as JS, CSS, and ICO files?
EDIT: The method in the MDC article titled "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is inappropriate: it parses with many errors and mistakes, the baseURI cannot be set for type text/html, and all HREF attributes on A elements are lost when they contain a relative path.
EDIT #2: It would still be nice if there were any method that can convert the incoming responseText into readable UTF-8 strings. :-)
Any ideas or work on solving this encoding problem are appreciated. :-)
PS: the target documents can be anything, so there is no specific (or previously known) charset, and they are certainly not all UTF-8, which is the assumed default.
SUPP:
So far, I have two main ideas for solving this problem. Can anybody help me work out the XPCOM module and method names?
1. Specify the charset while parsing the content into a DOM.
We first need to know the charset of the document (by extracting the head meta tag, or the header). Then:
find a method that can specify the charset when parsing the body content, or
find a method that can parse both head and body.
2. Convert the incoming responseText into UTF-8, so that parsing into a DOM element with the default UTF-8 charset still works.
X (not practical or sensible): overriding the MIME type with a charset would implement this idea, but we cannot know the charset before initiating the request.
It seems that there are no other answers.
After a day of tests, I've found a way (although it is clumsy) to solve my problem:
xhr.overrideMimeType('text/plain; charset=x-user-defined'); where xhr stands for the XMLHttpRequest handler.
To force Firefox to treat it as plain text, using a user-defined character set. This tells Firefox not to parse it, and to let the bytes pass through unprocessed.
See the MDC document Using_XMLHttpRequest#Receiving_binary_data.
Then use the scriptable Unicode converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).
The charset variable can be extracted from the head meta tags, for example with a regexp over req.responseText (whose charset is still unknown at that point), or by some other method.
// create the converter and give it the charset extracted from the document
var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
    .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
unicodeConverter.charset = charset;
str = unicodeConverter.ConvertToUnicode(str);
A Unicode string, encoded as UTF-8, is finally produced. :-)
Then I simply parse it into a body element, and my need is met.
Other brilliant ideas are still welcome. Feel free to object to my answer, with good reason. :-)

Screen scraper that follows redirects and encodes to UTF-8

I'm looking for a gem (or a combination of gems) that, given a URL, can return the page content as UTF-8. It should also follow redirects if the URL has changed.
Does anyone know of such?
Thanks!
Have you looked at Nokogiri? It seems to do what you are looking for in terms of encoding:
ENCODING: Strings are always stored as UTF-8 internally. Methods that return text values will always return UTF-8 encoded strings. Methods that return XML (like to_xml, to_html and inner_html) will return a string encoded like the source document.
You can also automate some of your screen scraping with Mechanize (clicking links, submitting forms, etc.). Mechanize builds on Nokogiri, so it's a nice complement to it.
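A minimal sketch of that combination, assuming current Mechanize and Nokogiri APIs (the URL and selector are placeholders):

require 'mechanize'

agent = Mechanize.new                 # follows redirects by default
page  = agent.get('http://example.com/some/page')
doc   = page.parser                   # the underlying Nokogiri::HTML::Document
puts doc.at('title').text             # text values come back as UTF-8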
Some screencasts you may want to look at:
Nokogiri:
http://railscasts.com/episodes/190-screen-scraping-with-nokogiri
Mechanize:
http://railscasts.com/episodes/191-mechanize
