I'm trying to scrape kickass.to and I'm having difficulty getting a legible document back.
Here's my code:
require 'nokogiri'
require 'open-uri'
url = "http://kickass.to/usearch/Mobile%20Suit%20Gundam:%20Char%27s%20Counterattack%201988category:movies/"
doc = Nokogiri::HTML(open(url))
result:
#<Nokogiri::HTML::Document:0x3ffb45c23ab4 name="document" children=[#<Nokogiri::XML::DTD:0x3ffb45c23744 name="html">, #<Nokogiri::XML::Element:0x3ffb45c26fc0 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26db8 name="body" children=[#<Nokogiri::XML::Element:0x3ffb45c26bb0 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c269a8 "\u008B å}ùvÛF²÷ßñSt8Ç\u009142H,Y\u0092©Åñ\u008Cíx,%\u0099\\_],\r\tÐX$Ñ\u0093y¢ï¾ÿî\u0093Ý_u ¸\u0088\"eÑ\u008E3>>\"6º««ªkëBõþ÷Ç?\u009Dÿöæ\u0084õ\u0093áàðÑ>}°\u009Bá \u0088*ý$íÕj×××Õk£F½\u009AÖn·k7Ô¦Â\\?:¨\u0092¨BOqË=|Äðo\u007FÈ\u009D%#\u007FLý«\u0083ÊQ$">, #<Nokogiri::XML::Element:0x3ffb45c268cc name="h">]>]>, #<Nokogiri::XML::Element:0x3ffb45c26480 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26278 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c26070 "T~\u0093Ô¨§§Ìé[QÌ\u0093\u00834ñ\u0094V¥vWGgÉxÀvçÄñôã\u00815ä\u0097ÇNä\u008F?J CάÀenxBËeÃÐö\u009CÅ©\u009F°^¸ÖpOÀ¶ì³\u0088¬$±\u009CKfÙq8H>3/\u008C\u0098q^e§V\u009C}ÅUvìGÜ\u0099ÜaW¾Å~\u007Fì+ËXö\u0080/\u00825\ní0\u0089K`¡¸ü¦Â\">8¨¤1·\"§_¯=\u0083ó0\u008A#\u0094\u00981ýÝw.8Îoí×d§\u0092\u009C?¸\u0094CÇ\u0084ö¸ÏyRa\th\u0099\u0091\u0090pÎú÷*µúI¬ÄwªN8¬Y\u0083\u0081¢µ\u009Aå\u0094.\u008DÑ£ÄIæ\u0083OnéÖZ=×Uñ§\u0092÷ôhfk4«$aêô\u0095»»\u009Cm]=Ñ·ìö{Eyç{l\u0090°'¬ù>cSüÂùcÎ5\u009F7¦q ¨¸\u00959N¾\u007FÇ×÷Þ+Êa6«løuÆn>üØUçÝ\u00924ÿìùJt·óaåJfqäÌñÛ\u0087Xȳ:ô\u0083bâÀ\u009D%ný\u0080Å'»¨î×äUFÈ[1ÞK8Q¼ á.\u008A·\u008BÁ×ßB\u0092\u0096¡£WVÄ.\u0084°\u007F\t\u0086¤{ôp+澻Ƕ²·õdª\u0089ËÈ¢\u008B\u0081ôö\u0098:ý
You get the picture. It's illegible and I can't seem to figure out where particular elements are. Any ideas where to go from here?
Works fine for me on MRI Ruby 2.1.1. You can either try to re-install/update Nokogiri and/or do the same with Ruby.
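One other thing worth ruling out, offered as a guess rather than a diagnosis: the text in the question starts with \u008B, which resembles the second byte of the gzip magic number (0x1F 0x8B), so the server may have sent a compressed body that was parsed without being decompressed. Here is a minimal sketch using only the standard library; `maybe_gunzip` is a hypothetical helper name, and the compressed string is simulated rather than fetched from the site:

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: decompress the body if it starts with the gzip
# magic bytes (0x1F 0x8B); otherwise return it unchanged.
def maybe_gunzip(body)
  bytes = body.b
  return body unless bytes.start_with?("\x1F\x8B".b)

  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Simulate a gzip-compressed response instead of hitting the network.
original   = '<html><body><p>hello</p></body></html>'
compressed = Zlib.gzip(original)

decoded = maybe_gunzip(compressed)
plain   = maybe_gunzip('<p>plain</p>')

puts decoded
puts plain
```

If the decompressed text turns out to be readable HTML, it can then be handed to Nokogiri::HTML as usual.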
I think you misunderstand how Nokogiri works. Nokogiri does not return the raw HTML of the requested page; it wraps each DOM element in a Nokogiri object and returns a document object that contains all of these elements.
It is difficult to help you because it's unclear whether you want to extract all of the HTML or specific parts of the page. Nokogiri works by using CSS-style selectors to 'query' the Nokogiri object and extract the elements you want.
The Nokogiri docs will help here, but using their example...
doc.css('h3.r a').each do |link|
  puts link.content
end
This assumes you have a variable containing results of a Nokogiri scrape (in your case you've also used 'doc').
This then performs a search for all nodes that are links (a tags) that are contained within an h3 tag with the class of 'r'.
In this case they loop through the elements that match these criteria (the .css method returns an enumerable, since multiple elements could match) and print each one to the console.
Related
I am fetching data from a website and need the text inside an h1 tag. When I inspect the element in the browser, the h1 tag contains the text. But when I fetch the page using Nokogiri, the h1 tag contains a variable name instead.
content = open('https://example.com').read
html = Nokogiri::HTML(content)
html.css('h1#egift-refresh-online-number-desktop').text
When I inspect in Chrome I found
But when I view the source of that page, I saw
I need to extract the actual value, not the variable name. How can I do that with Nokogiri? Is there any method for doing this?
Nokogiri is just a simple XML/HTML parser and is not the right tool for this job.
What you have fetched looks like a Handlebars template (or one of its many offshoots), and {{ ecardDetails.cardCardnumber }} is just a placeholder in the HTML file that JavaScript replaces with actual data, possibly after doing an AJAX request.
Nokogiri does not execute JavaScript, as it's not a browser.
Capybara is a DSL that is mostly used for acceptance testing. When used with the correct driver (like selenium or webkit), it can automate a browser and thus scrape pages that rely on JavaScript.
I'm able to narrow in on the area of an HTML document using Nokogiri. I need to extract the href from the Nokogiri object, but I can't figure out how to do this for the life of me. Calling row.css('td > b').to_html gives me the pretty HTML representation in string form, but I need to parse this using Nokogiri.
"<b>\ntour companies for botswana</b>"
The nokogiri equivalent that I'm unable to extract the url from is below:
[#<Nokogiri::XML::Element:0x3fe972a9deb8 name="b" children=[#<Nokogiri::XML::Element:0x3fe972ad90a8 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fe972ad8ff4 name="href" value="/ShowTopic-g317055-i11941-k10224606-United_Expeditions_tour_company_Maun-Maun_North_West_District.html">, #<Nokogiri::XML::Attr:0x3fe972ad8fe0 name="onclick" value="setPID(34603)">] children=[#<Nokogiri::XML::Text:0x3fe972ad8900 "\nUnited Expeditions tour company, Maun">]>]>]
The snippet above is a confusing bit of nokogiri xml object I guess. But I just want to get the href. How the heck do I do this?
row.css('td > b a').attr('href')
This should do the work. Read more about How to access attributes using Nokogiri.
I'm not sure how I'd select a title with regex. I've tried
match(/<title>(.*) .*<\/title>/)[1]
but that doesn't match anything.
This is the response body I'm trying to select from.
Trying to select "title I need to select."
The reason it doesn't work is because of the itemprop=\"name\" property. To fix this, you can match it as well:
# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'
html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."
.*? basically means "match as few characters as needed, and no more" (a non-greedy match).
However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose - Nokogiri:
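A quick way to see the difference between the greedy and non-greedy forms is to run both against a string with two title tags. The input string here is made up for illustration:

```ruby
# Made-up input with two <title> elements on one line.
html = '<title itemprop="name">first</title><title>second</title>'

# Greedy: .* grabs as much as it can, so the match runs through to the
# last <title> and the capture ends up being its content.
greedy = html[/<title.*>(.*)<\/title>/, 1]

# Non-greedy: .*? stops at the first place the rest of the pattern fits.
lazy = html[/<title.*?>(.*?)<\/title>/, 1]

puts greedy
puts lazy
```

With a single title on the page both forms capture the same text, which is why the greedy version often appears to work.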
require 'nokogiri'
page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."
Note that it can handle even malformed HTML, as is the case here.
If you're looking for a much more robust XML/HTML parser, try using Nokogiri which supports XPath.
This post explains why
Use xPath or Regex?
require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text
Here's the regexp that will give you what you want:
<title.*>(.*)<\/title>
As was mentioned, there are better ways to parse HTML. You might want to check out something like Nokogiri.
When I have to get elements from XML I like to convert it to a hash
from_xml(xml, disallowed_types = nil) public
Returns a Hash containing a collection of pairs when the key is the
node name and the value is its content
# http://apidock.com/rails/Hash/from_xml/class
Now you can do something like this (Hash.from_xml comes from ActiveSupport, and the resulting keys are strings):
hash = Hash.from_xml('XML')
hash['title'] # my favorite book
One solution would be to use the following pattern:
<title.*?>(.*?)<\/title>
https://regex101.com/r/piwm5H/1
Use an HTML/XML parser when dealing with XML or HTML data, except for extremely simple cases. HTML and XML are too complicated for normal regular expressions.
Using Nokogiri I'd do:
require 'nokogiri'
some_html = '
<html>
<head>
<title>the title</title>
</head>
</html>
'
doc = Nokogiri::HTML(some_html)
doc.title # => "the title"
Nokogiri already has a method to return the title so you can take advantage of that. Or, you can do it the normal way:
doc.at('title').text # => "the title"
The problem with a regular expression is that HTML could be written in many ways:
<title>foo</title>
or:
<title>
foo
</title>
or even:
<title>foo
</head>
which, while not correct, will be accepted by browsers and fixed up by Nokogiri which will then still work. Writing a pattern to handle those variants is a pain and error-prone. It only gets worse as the HTML gets more complex, especially when you don't control the generation of the content.
Specifically, I would like to import the first block of text before the table of contents from a Wikipedia page (which is public domain).
Let's say I have a Model "Resource", with an attribute x, and x is a string that is a Wikipedia link (eg. x: "http://en.wikipedia.org/wiki/Lanny_McDonald"). The first block of text on every Wikipedia page is the group of <p>...</p>'s before <div id="toc" class="toc">...</div>.
Can I write code that copies the content of these <p>...</p>'s and writes it onto my website?
This is known as Web Scraping.
Ironically, follow this Wikipedia link and consider the legal ramifications, etc.
Nokogiri is boss for this..
Install:
sudo gem install nokogiri -- --with-xml2-include=/usr/local/include/libxml2 --with-xml2-lib=/usr/local/lib
Usage:
There are methods to search using xpath or css which makes things simple.
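# wiki_scraper.rb
require 'open-uri'
require 'nokogiri'
# Load in the url (URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open).
doc = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Branch_predictor"))
# Print the first <p> directly under <body>; adjust the XPath to the page's actual structure.
puts doc.xpath("/html/body/p[1]")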
You could use an HttpWebRequest to retrieve the entire page, and then parse the HTML. There are tools available to convert HTML to XHTML, at which point you could use XML libraries to parse the XHTML.
I'd like to include some HTML element names, like <label>, in my ruby class documentation generated by Yard. But it is not working. For example, the sentence
# Returns a <label> field...
Becomes, after processing by Yard
Return a field...
The <label> element is actually passing verbatim through Yard making it to the browser as raw HTML.
I tried using &lt;label&gt; instead, and that got escaped so I ended up with a literal &lt;label&gt; in the resulting documentation.
Thanks!
Yard uses normal RDoc markup by default:
By default, YARD is compatible with the same RDoc syntax most Ruby developers are already familiar with.
And RDoc has this to say:
Putting a backslash before inline markup stops it being interpreted, which is how I created the table above:
_italic_:: \_word_ or \<em>text</em>
*bold*:: \*word* or \<b>text</b>
+typewriter+:: \+word+ or \<tt>text</tt>
That suggests that this:
# Returns a \<label> field...
should work but that did nothing useful for me, just more of the same "pass it through to the HTML" nonsense. However, wrapping the <label> in RDoc's typewriter markup did produce something useful so try this:
# Returns a +<label>+ field...