How to beautify xml code in rails application - ruby-on-rails

Is there a simple way to print an unformatted XML string to the screen in a Ruby on Rails application? Something like an XML beautifier?

REXML::Document, from Ruby's standard library (a bundled gem since Ruby 3.0), has pretty printing:
REXML::Document#write( output=$stdout, indent=-1, transitive=false, ie_hack=false )
indent: An integer. If -1, no indenting will be used; otherwise, the indentation will be twice this number of spaces, and children will be indented an additional amount. For a value of 3, every item will be indented 3 more levels, or 6 more spaces (2 * 3). Defaults to -1
An example:
require "rexml/document"
doc = REXML::Document.new "<a><b><c>TExt</c><d /></b><b><d/></b></a>"
out = ""
doc.write(out, 1)
puts out
Produces:
<a>
 <b>
  <c>
   TExt
  </c>
  <d/>
 </b>
 <b>
  <d/>
 </b>
</a>
EDIT: Rails already has REXML loaded, so you only have to create a new document and write the pretty-printed XML to a string, which can then be embedded in a <pre> tag.
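As a rough sketch of that last step, the helper below re-indents an XML string and HTML-escapes it so it can sit inside a <pre> tag. The helper name pretty_xml_for_pre is made up for illustration, and it assumes plain Ruby with REXML and ERB available; in a Rails view the template's own escaping would normally do the escaping step for you.

```ruby
require "rexml/document"
require "rexml/formatters/pretty"
require "erb"

# Hypothetical helper (not part of Rails or REXML): re-indent an XML
# string and HTML-escape the result for display inside <pre>.
def pretty_xml_for_pre(xml, indent = 2)
  doc = REXML::Document.new(xml)
  formatter = REXML::Formatters::Pretty.new(indent)
  formatter.compact = true   # ask the formatter to use as little space as possible
  out = +""                  # unfrozen String used as the output buffer
  formatter.write(doc, out)
  ERB::Util.html_escape(out) # escape <, >, & and quotes
end

escaped = pretty_xml_for_pre("<a><b><c>TExt</c><d /></b><b><d/></b></a>")
puts "<pre>#{escaped}</pre>"
```

Passing a String as the output works because the formatter only appends with <<, which is the same thing the original example relies on.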

What about the Nokogiri gem? Here is an example use.

Related

Nokogiri results different from browser inspect

I am trying to scrape a site but the results returned for just the links is different from when I inspect it with the browser.
In my browser I get normal links, but with Nokogiri all the a href values become javascript:void(0);.
Here is the site:
https://www.ctgoodjobs.hk/jobs/part-time
Here is my code:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
response = open(url) rescue nil
next unless response
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').text
It is not that easy: the URLs are "obscured" by a JS function, which is why you get javascript:void(0) when asking for the hrefs. Looking at the HTML, there are some hidden inputs for each link, and there is a preview URL that you can use to build the job preview URL (if that's what you're looking for). So you have this:
<div class="result-list-job current-view">
<input type="hidden" name="job_id" value="04375145">
<input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
<h2 class="job-title">Barista/ Senior Barista 咖 啡 調 配 員</h2>
<h3 class="job-company">PACIFIC COFFEE CO. LTD.</h3>
<div class="job-description">
<ul class="job-desc-list clearfix">
<li class="job-desc-loc job-desc-small-icon">-</li>
<li class="job-desc-work-exp">0-1 yr(s)</li>
<li class="job-desc-salary job-desc-small-icon">-</li>
<li class="job-desc-post-date">09/11/16</li>
</ul>
</div>
<a class="job-save-btn" title="save this job" style="display: inline;"> </a>
<div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
<div class="job-cat job-cat-de"></div>
</div>
then, you can retrieve each job_id from those inputs, like:
inputs = doc.search('//input[@name="job_id"]')
and then build the URLs (I found the base URL in joblist_preview.js):
urls = inputs.map do |input|
"https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
end
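If the ids ever contain characters that need escaping, the query string can be assembled with the standard library instead of plain interpolation. This is only a sketch: preview_url is a hypothetical helper name, and the parameter names are copied from the URL above rather than from any documented API of the site.

```ruby
require "uri"

# Hypothetical helper: build the preview URL for one job id, letting
# URI.encode_www_form take care of escaping the parameter values.
def preview_url(job_id)
  query = URI.encode_www_form(
    m_jobid: job_id,
    joblistmode: "previewlist",
    ga_channel: "ct"
  )
  "https://www.ctgoodjobs.hk/english/jobdetails/details.asp?#{query}"
end

url_for_sample = preview_url("04375145")
puts url_for_sample
```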
Take the output of a browser and that of a tool like wget, curl or nokogiri and you will find the HTML the browser presents can differ drastically from the raw HTML.
Browsers these days process DHTML; Nokogiri doesn't. To see the raw HTML you need something that fetches the content without the browser, like the tools mentioned above; then compare that with what you see in a text editor, or with what Nokogiri shows you. Don't trust the browser - they're known to lie because they want to make you happy.
Here's a quick glimpse into what the raw HTML contains, generated using:
$ nokogiri "https://www.ctgoodjobs.hk/jobs/part-time"
Nokogiri dropped me into IRB:
Your document is stored in @doc...
Welcome to NOKOGIRI. You are using ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]. Have fun ;)
Counting the hits found by the selector returns:
>> @doc.search('.job-title > a').size
30
Displaying the text found shows:
>> @doc.search('.job-title > a').map(&:text)
[
[ 0] "嬰 兒 奶 粉 沖 調 機 - 兼 職 產 品 推 廣 員 Part Time Promoter (時 薪 高 達 HK$90, 另 設 銷 售 佣 金 )",
...
[29] "Customer Services Representative (Part-time)"
]
Looking at the actual href:
>> @doc.search('.job-title > a').map{ |n| n['href'] }
[
[ 0] "javascript:void(0);",
...
[29] "javascript:void(0);"
]
You can tell the raw HTML contains nothing but what Nokogiri is telling you, so the browser is post-processing it: it runs the page's JavaScript and modifies the document, and that modified page is what you see when you inspect it in the browser. So the short fix is: don't trust the browser if you want to know what the server actually sends you.
This is why scraping isn't very reliable and you should use an API if at all possible. If you can't, then you're going to have to roll up your sleeves and dig into the JavaScript and manually interpret what it's doing, then retrieve the data and parse it into something useful.
Your code can be cleaned up and simplified. I'd write it much more simply as:
require 'nokogiri'
require 'open-uri'
url = "https://www.ctgoodjobs.hk/jobs/part-time"
doc = Nokogiri::HTML(URI.open(url)) # URI.open needs Ruby 2.5+; older versions use open(url)
links = doc.search('.job-title > a').map(&:text)
The use of search(...).text is a big mistake. text, when applied to a NodeSet, will concatenate the text of each contained node, making it extremely difficult to retrieve the individual text. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
The first result foobar would require being split apart to be useful, and unless you have special knowledge of the content, trying to figure out how to do it will be a major pain.
Instead, use map to iterate through the elements and apply &:text to each one, returning an array of each element's text.
See "How to avoid joining all text from Nodes when scraping" and "Taking apart a DHTML page" also.

Google Translate messes up the code in my file

I am trying to use Google Translate to localize an XML file. It has nearly 350K lines, and some of them contain markup for in-game font size and color, like so:
<replacement><p horizontalalignment="center"><br/><image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/><br/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Six_Superior" scalerate="1.5"/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/><br/><image enablescale="true" imagesetpath="00009499.Field_Boss" scalerate="1.4"/>Хмельной лик<br/><br/></p>Уничтожить зараженных насекомых<br/>возле мест обитания их королевы。<br/></replacement>
Now, for god knows what reason, Google Translate alters that markup during translation into something unacceptable, like so:
<replacement> <p horizontalalignment="center"> <br/> <image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/> <br/> <image enablescale = "true "imagesetpath =" 00015590.Tag_Dungeon_Six_Superior "scalerate =" 1.5 "/> <image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/> <br/> <image enablescale = "true" imagesetpath = "00009499.Field_Boss" scalerate = "1.4" /> Intoxicated face <br/> <br/> </ p> Destroy infected insects <br/> habitats near their queen. <br/> </ replacement>
Is there any way to avoid that, and why exactly is it happening? Any help on this is appreciated, thanks.
EDIT: I am also looking for a way to input my text and get it back in the exact same language with only the markup mishaps changed, so I can isolate those, build a comparison table, and then use that to fix the errors after the actual translation is done. But I don't see a way to select the same language as input AND output in Google Translate; it always forces me to choose a different one for input or output. That kind of makes sense, but if there is a way to do it, I might be able to work around the problem.
Do not feed Google translate with your Xml file, as far as I know it doesn't understand Xml.
Extract the text from the Xml file.
Feed the text to translate.
Transform the text back to Xml.
You could simply transform the Xml to a text document with a single line per Xml element so it would be easier to turn it back into Xml.
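A minimal Ruby sketch of those steps, using REXML from the standard library: it collects the text nodes in document order (one line per node), runs them through a translation step, and writes the results back into the untouched markup. The sample XML is a shortened, made-up fragment, and upcase is only a stand-in for the real translation call, which is not shown here.

```ruby
require "rexml/document"

# Collect non-blank text nodes in document order (depth-first walk).
def text_nodes(parent, found = [])
  parent.each do |child|
    case child
    when REXML::Element then text_nodes(child, found)
    when REXML::Text    then found << child unless child.value.strip.empty?
    end
  end
  found
end

# Shortened, hypothetical sample in the style of the file from the question.
xml = '<replacement><p horizontalalignment="center"><br/>Intoxicated face<br/></p>' \
      'Destroy infected insects<br/></replacement>'

doc   = REXML::Document.new(xml)
nodes = text_nodes(doc.root)

# Step 1: extract the text, one line per text node.
lines = nodes.map(&:value)

# Step 2: translate the lines. upcase is a placeholder for the real
# translation service, which must return one line per input line.
translated = lines.map(&:upcase)

# Step 3: put the translated text back into the original structure.
nodes.zip(translated) { |node, text| node.value = text }

result = doc.root.to_s
puts result
```

Because the tags and attributes never leave the program, the translator has no chance to touch them; the only contract is that line N of the output corresponds to line N of the input.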
More detail
According to the Toolkit you can upload:
HTML (.HTML)
Microsoft Word (.DOC/.DOCX)
OpenDocument Text (.ODT)
Plain Text (.TXT)
Rich Text (.RTF)
Wikipedia URLs
And a couple of extras such as JSON. So no Xml.
The best way I see is to transform your Xml document into one of these types (I would probably use JSON), in such a way that it can easily be transformed back again, either by position (line 1 in the text file is the first element in the Xml document) or by an id (add the id or position of the element in the Xml hierarchy to the JSON element).
My guess is that the toolkit recognizes the html tags in the xml and escapes them. So another option might be to un-escape &gt; to > and &lt; to <.

xpath with contains throws error if string starts with a number

I'm running into a strange problem with nokogiri and xpath. I want to parse a HTML document and get all links by href value and the anchor text they contain.
Here's my xpath so far:
xpath = "//a[contains(text(), #{link['anchor_text']}) and @href='#{link['target_url']}']"
a = doc.search(xpath)
This works fine so far as long as link['anchor_text'] is a string without numbers.
If I'm trying to get a link with the anchor text "11example" it throws the following error:
Invalid expression: //a[contains(text(), 11example) and @href='http://www.example.com/']
Maybe it's just a stupid mistake, but I'm not seeing why this error occurs. If I put some quotes around the #{link['anchor_text']} in the xpath, nothing is working.
Edit: Here's the sample HTML:
<!DOCTYPE html>
<head>
<title>Example.com</title>
</head>
<body>
<p>
<strong>Here is some text</strong><br />
11exampleSome text here and there
</p>
<p>
<strong>Another text</strong><br />
example.comSome text here and there
</p>
</body>
Edit2: If I run these queries manually in irb console everything works as expected, but only if I put the text in quotes.
Thanks in advance!
Kind regards,
madhippie
The simple answer is that you are missing quotes around #{link['anchor_text']}, like you have around #{link['target_url']}. The full XPath should be
xpath = "//a[contains(text(), '#{link['anchor_text']}') and @href='#{link['target_url']}']"
The reason it appears to work (at least not produce an error) when you don’t start with a number is that the string is being interpreted as a node query. For example Nokogiri is looking for a tag named <example.com> inside the <a> tag, then converting it to a string and seeing if the text nodes of the <a> tag contain that string. If the tag isn’t there (as in this case) then the result of contains is always true.
As a demonstration, with the HTML:
<a><q>foo</q>example</a>
<a><q>foo</q>foo</a>
<a>foo</a>
Then the query
doc.search("//a[contains(text(), q)]")
doesn’t match the first <a> tag, but does match the second and third.
When the string starts with a number, it can’t be parsed into a node query since names starting with digits aren’t valid XML (or HTML) element names, so you get an error.
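Since the values are interpolated into the expression, a small quoting helper avoids both the missing-quotes problem and breakage when the text itself contains a quote. The sketch below is plain Ruby; xpath_literal is a hypothetical helper name, not part of Nokogiri. XPath 1.0 has no escape character inside string literals, so a value containing both quote types has to be assembled with concat().

```ruby
# Hypothetical helper: turn a Ruby string into an XPath 1.0 string literal.
def xpath_literal(value)
  return "'#{value}'" unless value.include?("'")
  return "\"#{value}\"" unless value.include?('"')

  # Both quote types present: split on single quotes and glue the
  # pieces back together with concat('...', "'", '...').
  parts = value.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end

link = { "anchor_text" => "11example", "target_url" => "http://www.example.com/" }

xpath = "//a[contains(text(), #{xpath_literal(link['anchor_text'])}) " \
        "and @href=#{xpath_literal(link['target_url'])}]"
puts xpath
```

The resulting string can be passed to doc.search exactly as in the question.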

Simple NSData's category to parse XML with cyrillic

I have to parse NSData containing an XML string. Does somebody know a simple category to do it? I have one for JSON, but I am forced to use XML. I tried XMLReader; its interface looks clean, but I found some issues:
Mysterious new line characters and spaces everywhere:
"comment_count" = {text = "\n \n 21";};
My Cyrillic symbols look like this:
"description_text" = {text = "\n \U041f\U0438\U043a\U0430\U0431\U0443\U0448};
Example:
<?xml version="1.0" encoding="UTF-8" ?>
<news>
<xml_count>43</xml_count>
<hot_count>449</hot_count>
<item type="text">
<id>1469845</id>
<rating>147</rating>
<pluses>171</pluses>
<minuses>24</minuses>
<title>
<![CDATA[Обновление огромного архива Пикабу!]]>
</title>
<comment_count>26</comment_count>
<comment_link>http://pikabu.ru/story/obnovlenie_ogromnogo_arkhiva_pikabu_1469845</comment_link>
<author>icq677555</author>
<description_text>
<![CDATA[Пикабушники, я обновил свой огромный архив текстовых постов из горячего!]]>
</description_text>
</item>
</news>
I just realized what's going on. Your data samples are obviously NSDictionary instances printed in the debugger. So the issues you found are:
As XML was originally designed as an annotated text format, the whitespace (spaces, newlines) handling doesn't perfectly fit for data only usage. You can either trim all resulting strings ([stringVar stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]), adapt XMLReader to do it or use the XML parser at http://ios.biomsoft.com/2011/09/11/simple-xml-to-nsdictionary-converter/ (which does it by default).
The funny output you get for Cyrillic characters is the proper escaping for non-ASCII characters in the debugger output (which uses the old-style property list format). It's an artifact of the debugger output. Your variables contain the proper characters.
BTW: While JSON contains implicit type information (strings are always quoted, numbers are never quoted etc.), XML without a schema file does not. So all the parsed simple values will be strings even if they originally were numbers.
Update:
The XML parser you're using still contains the old whitespace handling code described in "Pesky new lines and whitespace in XML reader class" (though the comment says otherwise). Apply the fix mentioned at the bottom of that answer, namely change the line:
[dictInProgress setObject:textInProgress forKey:kXMLReaderTextNodeKey];
to:
[dictInProgress setObject:[textInProgress stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] forKey:kXMLReaderTextNodeKey];

RegEx Not working in Ruby!

I am using the following regex
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))
to match the name [ Burkhart, Peterson & Company ] in this
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Generally parsing (X)HTML using Regular Expressions is bad practice. Ruby has the fantastic Nokogiri Library which uses libxml2 for parsing XHTML efficiently.
With that being said, your . does not match newlines. Use the m modifier for your regexp, which tells . to match newlines, or pass the Regexp::MULTILINE constant.
Your regular expression is also capturing the HTML before the text you require.
Using nokogiri and XPath would mean you could grab the content of this table cell by referring to its CSS class. Like this:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML DATA.read
p doc.at("td[@class='generalinfo_right']").text
__END__
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Which will return "Burkhart, Peterson & Company"
/m makes the dot match newlines
You'll want to use /m for multiline mode:
str.scan(/Name:<\/td>(.*?)<\/td>/m)
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s)) doesn't match the newline characters (in Ruby it is /m, not /s, that makes the dot match newlines); even if it did match them, the (.*?) part would grab everything after </td>, including <td class="generalinfo_right">.
To make the regular expression more generic, and to match exactly the text you want, you should change the code to
html.scan(/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m)
The regular expression could be better written, though.
I would also not suggest to parse HTML/XHTML content with regular expression.
You can verify that all the answers suggesting you add /m or Regexp::MULTILINE are correct by going to rubular.com.
I also verified the solution in the console, and also modified the regex so that it returns only the name instead of all the extra junk.
Loading development environment (Rails 2.3.8)
ree-1.8.7-2010.02 > html = '<td class="generalinfo_left" align="right">Name:</td>
ree-1.8.7-2010.02'> <td class="generalinfo_right">Burkhart, Peterson & Company</td>
ree-1.8.7-2010.02'> '
=> "<td class="generalinfo_left" align="right">Name:</td>\n<td class="generalinfo_right">Burkhart, Peterson & Company</td>\n"
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/m))
=> [["\n<td class="generalinfo_right">Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m))
=> [["Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 >
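The same comparison as a small standalone script, in case it helps to see the three variants side by side. It is only a sketch over the two-line sample from the question; for real pages an HTML parser remains the more robust choice.

```ruby
html = <<~HTML
  <td class="generalinfo_left" align="right">Name:</td>
  <td class="generalinfo_right">Burkhart, Peterson & Company</td>
HTML

# Without /m the dot cannot cross the newline between the two cells,
# so nothing matches.
without_m = html.scan(/Name:<\/td>(.*?)<\/td>/)

# With /m the dot matches newlines, but the capture also swallows
# the opening <td ...> tag of the second cell.
greedy_cell = html.scan(/Name:<\/td>(.*?)<\/td>/m)

# Skipping whitespace and the opening tag explicitly leaves only the name.
name = html[/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m, 1]

p without_m
p greedy_cell
p name
```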
