I'm exploring Nokogiri and have come across a perplexing issue that I would appreciate someone's views on. NB: I'm also fairly new to Ruby, so I'm expecting to have done something really daft. Apologies if that is the case.
I have a simple test that is comparing the results of an XPath query and a CSS query on an XML document. The CSS query works but the XPath doesn't and I'm at a loss as to explain why.
should "get same result from Nokogiri using XPath or CSS syntax" do
  xml_source = "<?xml version=\"1.0\" encoding=\"utf-8\"?><accounts xmlns=\"http://api.esendex.com/ns/\"><account id=\"2b4a326c-41de-4a57-a577-c7d742dc145c\" uri=\"http://api.esendex.com/v1.0/accounts/2b4a326c-41de-4a57-a577-c7d742dc145c\"><messagesremaining>100</messagesremaining></account></accounts>"
  ndoc = Nokogiri::XML(xml_source)
  node_value = ndoc.css("accounts account messagesremaining").count
  assert_equal 1, node_value
  node_value = ndoc.xpath("//accounts//account//messagesremaining").count
  assert_equal 1, node_value
end
The second assert fails with node_value equal to zero.
Thanks in advance.
You have two issues.
First the xpath should be "//accounts/account/messagesremaining".
Second you have a default namespace "http://api...". You need to specify the namespace of each element when doing an xpath query (css queries ignore the namespace).
Sorry, I don't know Nokogiri, but I'm sure it has some documentation on how to use namespaces in XPath queries.
Related
I'm trying to add an RSpec test for an XML file I am building. In the actual XML, I have:
<g:shipping_weight>5 lb</g:shipping_weight>
I just want to make sure that the value is what I am expecting, but RSpec just can't match the tag if it has special characters in it. A snippet from my spec:
context 'verify weight' do
  subject { response.body }
  it { is_expected.to have_css('g:shipping_weight', text: '12.34 lb') }
end
have_selector and have_tag do not match the selector either, so I am relying on match, which works, but I'm sure there's a better way?
I hit this problem a few months ago and spent hours Googling before I found the solution. Assuming you are parsing with Nokogiri here, you should be able to access the text of this element as follows:
Nokogiri::XML(response.body).at_css("g|shipping_weight").text
That's right, you use a pipe, |, in place of the colon. I don't think I ever found documentation of this, and I can't now find the example that led me to try this. But it works!
(Note that you will have had to define the g namespace at the start of the XML document.)
I am trying to get auto-corrected spelling from Google's home page using Nokogiri.
For example, if I am typing "hw did" and the correct spelling is "how did", I have to get the correct spelling.
I tried with the xpath and css methods, but in both cases, I get the same empty array.
I got the XPath and CSS paths using FireBug.
Here is my Nokogiri code:
@requ = params[:search]
@requ_url = @requ.gsub(" ", "+") # encode the URL (spaces in the user's input become +)
@doc = Nokogiri::HTML(open("https://www.google.co.in/search?q=#{@requ_url}"))
binding.pry
Here are my XPath and CSS selectors:
Using XPath:
pry(#<SearchController>)> @doc.xpath("/html/body/div[5]/div[2]/div[6]/div/div[4]/div/div/div[2]/div/p/a").inspect
=> "[]"
Using CSS:
pry(#<SearchController>)> @doc.css('html body#gsr.srp div#main div#cnt.mdm div.mw div#rcnt div.col div#center_col div#taw div div.med p.ssp a.spell').inner_text()
=> ""
First, use the right tools to manipulate URLs; they'll save you headaches.
Here's how I'd find the right spelling:
require 'nokogiri'
require 'uri'
require 'open-uri'
requ = 'hw did'
uri = URI.parse('https://www.google.co.in/search')
uri.query = URI.encode_www_form({'q' => requ})
doc = Nokogiri::HTML(URI.open(uri.to_s)) # URI.open: plain open no longer fetches URLs on Ruby 3.0+
doc.at('a.spell').text # => "how did"
It works fine with "how did", but try it with "bnglore" or any other single-word string and it raises an error, the same one I was facing in my previous code: undefined method `text'.
It's not that hard to figure out. They're changing the HTML so you have to change your selector. "Inspect" the suggested word "bangalore" and see where it exists in relation to the previous path. Once you know that, it's easy to find a way to access the word:
doc.at('span.spell').next_element.text # => "bangalore"
Don't trust Google to do things the easy way, or even the best way, or be consistent. Just because they return HTML one way for words with spaces, doesn't mean they're going to do it the same way for a single word. I would do it consistently, but they might be trying to discourage you from mining their pages so don't be surprised if you see variations.
Now, you need to figure out how to write code that knows when to use one selector/method or the other. That's for you to do.
So say I have an array that looks like this:
links = [['May 1', 'Link A', 'www.linka.com'], ['May 2', 'Link B', 'www.linkb.com']]
What I would like to do with Nokogiri is go to each link and return specific text (per an xpath I have) on each page.
I know I can do something like:
links.each do |x|
  doc = Nokogiri::HTML(open(x[2]))
end
Then traverse each doc within that each loop. But, given that my array might have 700 items in it...this seems like it will be very inefficient. With all sorts of nested loops and such.
Is there a more efficient way to attack this problem?
Thanks.
You might want to look at something like Typhoeus or EM-HTTP-Request to parallelize your HTTP requests.
Processing the pages themselves with Nokogiri will be a CPU-bound task, so multithreading that task won't give you much (or any) speedup.
Your biggest slowdown will come from the HTTP requests, so making those execute in parallel will provide the biggest speedup.
Typhoeus:
https://github.com/typhoeus/typhoeus
EM-HTTP-Request:
https://github.com/igrigorik/em-http-request
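To illustrate the idea without pulling in either gem, here is a sketch that swaps in plain Ruby threads and Net::HTTP from the standard library. It is not the Typhoeus or EM-HTTP-Request API; fetch_all, its concurrency parameter, and the injectable fetcher are all hypothetical names for this example. It also assumes each URL includes a scheme (for example http://www.linka.com rather than www.linka.com).

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: download every URL using a small pool of threads.
# Results come back in the same order as the input; a failed download is
# returned as the exception object instead of aborting the whole batch.
def fetch_all(urls, concurrency: 8, fetcher: ->(url) { Net::HTTP.get(URI(url)) })
  queue = Queue.new
  urls.each_with_index { |url, index| queue << [index, url] }
  results = Array.new(urls.size)

  workers = Array.new([concurrency, urls.size].min) do
    Thread.new do
      loop do
        begin
          index, url = queue.pop(true) # non-blocking; raises when empty
        rescue ThreadError
          break
        end
        results[index] = begin
          fetcher.call(url)
        rescue StandardError => e
          e
        end
      end
    end
  end

  workers.each(&:join)
  results
end
```

With the links array from the question, something like bodies = fetch_all(links.map { |link| link[2] }) would download the pages concurrently, after which each body can be parsed with Nokogiri in an ordinary loop.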
Using nokogiri I need to search through some HTML for something like:
new GLatLng(-14.468352,132.270434)
and then assign the latitude and longitude values in that code to two variables.
You haven't shown us any example HTML. Nokogiri seems to be the wrong tool for this job if you're just searching for plain text. You could simply do:
require 'open-uri'
html = URI.open('http://stackoverflow.com/questions/6739202/find-google-map-line-w-nokogiri').read
match = /new GLatLng\((?<lat>.+?),(?<long>.+?)\)/.match html
p match[:lat].to_f
#=> -14.468352
Or, if you need an array of all such matches, say the page also has new GLatLng(17.3,42.1) on it:
matches = html.scan /new GLatLng\((.+?),(.+?)\)/
p matches
#=> [["-14.468352", "132.270434"],["17.3", "42.1"]]
The only reason you might want to use Nokogiri would be to limit your searching to a particular HTML element (e.g. some <script> block).
Greetings everyone:
I would love to get some information from a huge collection of Google Search Result pages.
The only thing I need is the URLs inside a bunch of <cite></cite> HTML tags.
I could not find a proper solution any other way, so I am now turning to Ruby.
This is so far what I have written:
require 'net/http'
require 'uri'
url = URI.parse('http://www.google.com.au')
res = Net::HTTP.start(url.host, url.port) { |http|
  http.get('/#hl=en&q=helloworld')
}
puts res.body
Unfortunately I cannot use the recommended hpricot ruby gem (because it misses a make command or something?)
So I would like to stick with this approach.
Now that I can get the response body as a string, the only thing I need is to retrieve whatever is inside the cite HTML tags.
How should I do that? Using a regular expression? Can anyone give me an example?
Here's one way to do it using Nokogiri:
Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}
I think this will solve it:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten
# This one ignores empty tags:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.reject { |x| x.empty? }
If you're having problems with hpricot, you could also try nokogiri which is very similar, and allows you to do the same things.
Split the string on the tag you want. Assuming only one instance of tag (or specify only one split) you'll have two pieces I'll call head and tail. Take tail and split it on the closing tag (once), so you'll now have two pieces in your new array. The new head is what was between your tags, and the new tail is the remainder of the string, which you may process again if the tag could appear more than once.
An example that may not be exactly correct but you get the idea:
head1, tail1 = str.split('<tag>', 2) # splits once, at the first opening tag
head2, tail2 = tail1.split('</tag>', 2) # splits once, at the first closing tag
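The repeated-split idea described above can be wrapped in a loop so that it handles a tag that appears more than once. This is only a sketch: between_tags is a hypothetical helper name, and it assumes the opening tag carries no attributes (it would not match something like a cite tag with a class attribute).

```ruby
# Hypothetical helper: collect the text between every <tag>...</tag> pair.
def between_tags(str, tag)
  open_tag = "<#{tag}>"
  close_tag = "</#{tag}>"
  found = []
  rest = str

  loop do
    # Split once at the next opening tag; tail is nil when there is none.
    _head, tail = rest.split(open_tag, 2)
    break if tail.nil?

    # Split once at the matching closing tag; rest is nil when it is missing.
    inner, rest = tail.split(close_tag, 2)
    break if rest.nil?

    found << inner
  end

  found
end

p between_tags('a<cite>x.com</cite>b<cite>y.org</cite>c', 'cite')
```

The limit of 2 passed to split is what makes each call split only once, leaving the remainder of the string intact for the next pass.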