How to parse a Nokogiri XML Element? - ruby-on-rails

I'm able to narrow in on the area of an HTML document using Nokogiri. I need to extract the href from the Nokogiri object, but I can't figure out how to do this for the life of me. Calling row.css('td > b').to_html gives me the HTML representation as a string, but I need to parse this using Nokogiri.
"<b>\ntour companies for botswana</b>"
The Nokogiri equivalent that I'm unable to extract the URL from is below:
[#<Nokogiri::XML::Element:0x3fe972a9deb8 name="b" children=[#<Nokogiri::XML::Element:0x3fe972ad90a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fe972ad8ff4 name="href" value="/ShowTopic-g317055-i11941-k10224606-United_Expeditions_tour_company_Maun-Maun_North_West_District.html">, #<Nokogiri::XML::Attr:0x3fe972ad8fe0 name="onclick" value="setPID(34603)">] children=[#<Nokogiri::XML::Text:0x3fe972ad8900 "\nUnited Expeditions tour company, Maun">]>]>]
The snippet above is a confusing bit of Nokogiri XML object, I guess. But I just want to get the href. How the heck do I do this?

row.css('td > b a').attr('href')
This should do the job. Called on a node set, attr('href') returns the Nokogiri::XML::Attr of the first matching node; call .value on it to get the URL as a plain string.

Related

Nokogiri returning variable name instead of actual data on website?

I am fetching data from a website. I need to fetch the text inside an h1 tag. When I inspect the element, there is text inside that h1 tag. But when I fetch the page using Nokogiri, the h1 tag contains a variable name instead.
require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri (Ruby 2.5+); a bare open() no longer fetches URLs on Ruby 3.
content = URI.open('https://example.com').read
html = Nokogiri::HTML(content)
html.css('h1#egift-refresh-online-number-desktop').text
When I inspect the element in Chrome, I see the actual value.
But when I view the source of that page, I see the variable name.
I need to extract the actual value, not the variable name. How can I do that with Nokogiri? Is there any method for doing this?
Nokogiri is just a simple XML/HTML parser and is not the right tool for this job.
What you have fetched looks like a Handlebars template (or one of its many offshoots), and {{ ecardDetails.cardCardnumber }} is just a placeholder in the HTML file that is replaced with actual data by JavaScript, possibly after an AJAX request.
Nokogiri does not execute JavaScript, as it's not a browser.
Capybara is a DSL mostly used for acceptance testing. When used with a suitable driver (like Selenium or WebKit), it can automate a browser and thus scrape pages that rely on JavaScript.

How do I fix this Nokogiri document result to make it legible?

I'm trying to scrape kickass.to and I'm having difficulty getting back a legible document.
Here's my code:
require 'nokogiri'
require 'open-uri'
url = "http://kickass.to/usearch/Mobile%20Suit%20Gundam:%20Char%27s%20Counterattack%201988category:movies/"
doc = Nokogiri::HTML(open(url))
result:
#<Nokogiri::HTML::Document:0x3ffb45c23ab4 name="document" children=[#<Nokogiri::XML::DTD:0x3ffb45c23744 name="html">, #<Nokogiri::XML::Element:0x3ffb45c26fc0 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26db8 name="body" children=[#<Nokogiri::XML::Element:0x3ffb45c26bb0 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c269a8 "\u008B å}ùvÛF²÷ßñSt8Ç\u009142H,Y\u0092©Åñ\u008Cíx,%\u0099\\_],\r\tÐX$Ñ\u0093y¢ï¾ÿî\u0093Ý_u ¸\u0088\"eÑ\u008E3>>\"6º««ªkëBõþ÷Ç?\u009Dÿöæ\u0084õ\u0093áàðÑ>}°\u009Bá \u0088*ý$íÕj×××Õk£F½\u009AÖn·k7Ô¦Â\\?:¨\u0092¨BOqË=|Äðo\u007FÈ\u009D%#\u007FLý«\u0083ÊQ$">, #<Nokogiri::XML::Element:0x3ffb45c268cc name="h">]>]>, #<Nokogiri::XML::Element:0x3ffb45c26480 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26278 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c26070 "T~\u0093Ô¨§§Ìé[QÌ\u0093\u00834ñ\u0094V¥vWGgÉxÀvçÄñôã\u00815ä\u0097ÇNä\u008F?J CάÀenxBËeÃÐö\u009CÅ©\u009F°^¸ÖpOÀ¶ì³\u0088¬$±\u009CKfÙq8H>3/\u008C\u0098q^e§V\u009C}ÅUvìGÜ\u0099ÜaW¾Å~\u007Fì+ËXö\u0080/\u00825\ní0\u0089K`¡¸ü¦Â\">8¨¤1·\"§_¯=\u0083ó0\u008A#\u0094\u00981ýÝw.­8Îoí×d§\u0092\u009C?¸\u0094CÇ\u0084ö¸ÏyRa\th\u0099\u0091\u0090pÎú÷*µúI¬ÄwªN8¬Y\u0083\u0081¢µ\u009Aå\u0094.\u008DÑ£ÄIæ\u0083OnéÖZ=×Uñ§\u0092÷ôhfk4«$aêô\u0095»»\u009Cm]=Ñ·ìö{Eyç{l\u0090°'¬ù>cSüÂùcÎ5\u009F7¦q ¨¸\u00959N¾\u007FÇ×÷Þ+Êa6«løuÆn>üØ­UçÝ\u00924ÿìùJt·óaåJfqäÌñÛ\u0087Xȳ:ô\u0083bâÀ\u009D%ný\u0080Å'»¨î×äUFÈ[1ÞK8Q¼ á.\u008A·\u008BÁ×ßB\u0092\u0096¡£WVÄ.­\u0084°\u007F\t\u0086¤{ôp+澻Ƕ²·õdª\u0089ËÈ¢\u008B\u0081ôö\u0098:ý
You get the picture. It's illegible and I can't seem to figure out where particular elements are. Any ideas where to go from here?
Works fine for me on MRI Ruby 2.1.1. You can either try to re-install/update Nokogiri and/or do the same with Ruby.
I think you misunderstand how Nokogiri works. Nokogiri does not return the raw HTML of the requested page; it parses the page into a document object in which each DOM element is wrapped in a Nokogiri object, and searches on that document return enumerable sets of those elements.
It is difficult to help you, as it's unclear whether you want to extract all of the HTML or specific parts of the page. Nokogiri works by using CSS-style selectors to 'query' the Nokogiri object and extract the elements you want.
Referring to the Nokogiri docs will help, but using their example...
doc.css('h3.r a').each do |link|
  puts link.content
end
This assumes you have a variable containing the result of a Nokogiri parse (in your case you've also called it 'doc').
It then searches for all nodes that are links (a tags) contained within an h3 tag with the class 'r'.
In this case they loop through the elements that match these criteria (the .css method returns an enumerable, as there could be multiple matching elements) and print the content of each to the console.

Library for JSP parsing and manipulation

I am trying to parse a bunch of JSP files and find places with hardcoded strings. E.g.
<h:outputText value="I am hardcoded" styleClass="someClass" />
<my:customTag value="I am hardcoded too" />
Currently I am using jsoup to do so. It seems great as an HTML parser, however if I make changes to the document and write it out to a file all of the case sensitive JSF tags and attributes are changed to lowercase. Are there any Java libraries that can parse a JSP file, let me modify some attribute values, and let me write out the JSP?
Or better yet is there a way to tell jsoup not to change the casing of my elements and attributes?
It may not be the case for other people, but it turns out replacing all of the toLowerCase() calls in jsoup worked well enough for me.

RABL and XML formatting

I'm using RABL to format the output of a Rails API I'm creating.
Is there any way to customize the shape of the XML being produced? For instance, I need to produce output that uses XML attributes instead of elements. In other words, this...
<auth status="FAILED" errorcode="UNKNOWN_LOGIN" errormessage="Error Message" />
Instead of this...
<auth>
<status type="symbol">failed</status>
<errorcode>UNKNOWN_LOGIN</errorcode>
<errormessage>Invalid credentials.</errormessage>
</auth>
Any help will be highly appreciated.
I don't think RABL can support XML attributes, but the gem's author is very helpful. I would pose this question on the gem's GitHub issues page.
https://github.com/nesquena/rabl/issues

rails, given a HTML string from a WYSIWYG - how to get just text

I have a large HTML string from a WYSIWYG editor and want to show a truncated string of just the text, with no HTML tags. Is there any way to do this built into Rails, or do I need a gsub to get rid of all the HTML brackets?
Thanks
Rails already includes some powerful sanitization helpers.
string = '<span id="span_is"><br><br><u><i>Hi</i></u></span>'
strip_tags(string)
It depends upon how complex your HTML is, but you could certainly use Nokogiri and XPath to query the text that you want from the HTML. It depends upon how much you want to parse, and whether it justifies an extra library to do it.
A parser can do it but would be overkill if you only have simple HTML to present. Something like Loofah or Sanitize could do the job: both use Nokogiri to parse the HTML and then strip out the tags, leaving you with the text.
require 'sanitize'
html = '<html><body>Jackdaws love my giant sphinx of quartz.</body></html>'
# Sanitize.clean is the older API; Sanitize 3.0 and later provide Sanitize.fragment(html) for this.
puts Sanitize.clean(html)
# >> Jackdaws love my giant sphinx of quartz.
I think Loofah is more capable than Sanitize, but if all you want to do is toss tags away, Sanitize might be the way to go.
