Nokogiri- Parsing HTML <a href> and displaying only part of the URL - ruby-on-rails

So basically I am scraping a website, and I want to display only part of the address. For instance, if the URL is www.yadaya.com/nyc/sales/manhattan, I want to put only "sales" in a hash or an array.
{
:listing_class => listings.css('a').text
}
That will give me the whole URL. Would I want to gsub to get the partial output?
Thanks!

When you are dealing with URLs, you should start with URI, then, to mess with the path, switch to using File.dirname and/or File.basename:
require 'uri'
uri = URI.parse('http://www.yadaya.com/nyc/sales/manhattan')
dir = File.dirname(uri.path).split('/').last
which sets dir to "sales".
No regex is needed, except what parse and split do internally.
Using that in your code's context:
File.dirname(URI.parse(listings.css('a').text).path).split('/').last
but, personally, I'd break that into two lines for clarity and readability, which translate into easier maintenance.
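Broken into steps, the one-liner above could be sketched as follows. The helper name `parent_segment` is hypothetical, and the sketch assumes the path has at least two segments; it uses File.basename instead of split to pick the last directory name.

```ruby
require 'uri'

# Hypothetical helper: returns the name of the directory that contains the
# last segment of a URL's path.
def parent_segment(url)
  path = URI.parse(url).path          # "/nyc/sales/manhattan"
  File.basename(File.dirname(path))   # dirname is "/nyc/sales"
end

puts parent_segment('http://www.yadaya.com/nyc/sales/manhattan')
```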
A warning though:
listings.css('a')
returns a NodeSet, which is akin to an Array. If the DOM you are searching has multiple <a> tags, you will get more than one Node being passed to text, which will then be concatenated into the text you are treating as a URL. That's a bug in waiting:
require 'nokogiri'
html = '<div><a>foo</a><a>bar</a></div>'
doc = Nokogiri::HTML(html)
doc.at('div').css('a').text
Which results in:
"foobar"
Instead, your code needs to be:
listings.at('a')
or
listings.at_css('a')
so only one node is returned. In the context of my sample code:
doc.at('div').at('a').text
# => "foo"
Even if the code that sets up listings only results in a single <a> node being visible, use at or at_css for correctness.

Since you have the full URL using listings.css('a').text, you could parse out a section of the path using a combination of the URI class and a regular expression, using something like the following:
require 'uri'
uri = URI.parse(listings.css('a').text)
=> #<URI::HTTP:0x007f91a39255b8 URL:http://www.yadaya.com/nyc/sales/manhattan>
match = %r{^/nyc/([^/]+)/}.match(uri.path)
=> #<MatchData "/nyc/sales/" 1:"sales">
match[1]
=> "sales"
You may need to tweak the regular expression to meet your needs, but that's the gist of it.
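One possible tweak, as a rough sketch: `match[1]` raises NoMethodError when the pattern does not match, because `match` is nil. String#[] with a regex and a capture index returns nil instead. The helper name `second_segment` is made up for illustration, and the pattern here accepts any first segment rather than only "nyc".

```ruby
require 'uri'

# Hypothetical helper: returns the second path segment, or nil when the
# path has fewer than two segments.
def second_segment(url)
  URI.parse(url).path[%r{\A/[^/]+/([^/]+)}, 1]
end

p second_segment('http://www.yadaya.com/nyc/sales/manhattan')
p second_segment('http://www.yadaya.com/nyc')
```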

Related

Regex markdown string for key-value pairs

My Rails app is retrieving data from a third-party service that doesn't allow me to attach arbitrary data to records, but does have a description area that supports Markdown. I am attempting to pass data for each record over to my Rails app within the description content, via Markdown comments:
[//]: # (POST:28|USERS:102,78,90)
... additional Markdown content.
I found the [//]: # (...) syntax in this answer for embedding comments in Markdown, and my idea was to then pass pipe-separated key-value pairs in as the comment content.
Using my example above, I would like to be able to parse the description content string to interpret the key-value pairs. In this case, POST=28 and USERS=102,78,90. If it helps, this comment will always appear at the first line of the Markdown content.
I imagine Regex is the way to go here? I would really appreciate any help!
You can use \G:
(?:^\[//\]:[^(]+\(\K # match your token [//]: at the beginning
|
\G(?!\A)\| # or right after the previous match
)
(\w+):([\w,]+) # capture word chars (=key)
# followed by :
# followed by word chars and comma (=val)
See a demo on regex101.com.
You'll need 2 steps to properly parse this:
First find the comment: ^\[\/\/\]: # \(([^)]*)\)
This captures the comment's content.
Then parse the content: (\w+):([^|]+) (with the global flag)
This captures the key and value separately.
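In Ruby, those two steps might look like the sketch below. The constant and variable names are assumptions for illustration; the two patterns are the ones given above, and `scan` plays the role of the global flag.

```ruby
# Step 1 pattern: capture the text inside the parentheses of the comment.
COMMENT = %r{^\[//\]: # \(([^)]*)\)}

description = <<~MARKDOWN
  [//]: # (POST:28|USERS:102,78,90)
  ... additional Markdown content.
MARKDOWN

# Step 1: pull out the comment content (nil if the comment is absent).
content = description[COMMENT, 1]

# Step 2: scan returns every key/value pair; to_h turns them into a Hash.
pairs = content.to_s.scan(/(\w+):([^|]+)/).to_h

# Values arrive as strings, so convert them as needed.
post_id  = pairs['POST'].to_i
user_ids = pairs['USERS'].split(',').map(&:to_i)

p pairs
```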
As I mentioned in my comment above, you could simplify things a lot by using a standard data serialization format like JSON after the #. For example:
require "json"
MATCH_DATA_COMMENT = %r{(?<=^\[//\]: # ).*}
markdown = <<END
[//]: # {"POST":28,"USERS":[102,78,90]}
... additional Markdown content.
END
p JSON.parse(markdown[MATCH_DATA_COMMENT])
# => { "POST" => 28, "USERS" => [102, 78, 90] }
The regular expression %r{(?<=^\[//\]: # ).*} uses a positive lookbehind to match anything that follows "[//]: # " at the start of a line.
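If the comment line might be missing or might not contain valid JSON, a guarded version could look like this sketch. The helper name `embedded_data` and the choice to fall back to an empty Hash are assumptions, and it uses a capture group in place of the lookbehind.

```ruby
require 'json'

# Captures everything after "[//]: # " on a line.
DATA_COMMENT = %r{^\[//\]: # (.*)}

# Hypothetical helper: returns the parsed data, or {} when the comment is
# missing or does not contain valid JSON.
def embedded_data(markdown)
  json = markdown[DATA_COMMENT, 1]
  return {} if json.nil?

  JSON.parse(json)
rescue JSON::ParserError
  {}
end

markdown = <<~MARKDOWN
  [//]: # {"POST":28,"USERS":[102,78,90]}
  ... additional Markdown content.
MARKDOWN

p embedded_data(markdown)
```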

Why it is returning an empty array while it has content?

I am trying to get auto-corrected spelling from Google's home page using Nokogiri.
For example, if I am typing "hw did" and the correct spelling is "how did", I have to get the correct spelling.
I tried with the xpath and css methods, but in both cases, I get the same empty array.
I got the XPath and CSS paths using FireBug.
Here is my Nokogiri code:
@requ=params[:search]
@requ_url=@requ.gsub(" ","+") # to encode the URL (if the user enters a space, it should be converted into +)
@doc=Nokogiri::HTML(open("https://www.google.co.in/search?q=#{@requ_url}"))
binding.pry
Here are my XPath and CSS selectors:
Using XPath:
pry(#<SearchController>)> @doc.xpath("/html/body/div[5]/div[2]/div[6]/div/div[4]/div/div/div[2]/div/p/a").inspect
=> "[]"
Using CSS:
pry(#<SearchController>)> @doc.css('html body#gsr.srp div#main div#cnt.mdm div.mw div#rcnt div.col div#center_col div#taw div div.med p.ssp a.spell').inner_text()
=> ""
First, use the right tools to manipulate URLs; they'll save you headaches.
Here's how I'd find the right spelling:
require 'nokogiri'
require 'uri'
require 'open-uri'
requ = 'hw did'
uri = URI.parse('https://www.google.co.in/search')
uri.query = URI.encode_www_form({'q' => requ})
doc = Nokogiri::HTML(URI.open(uri.to_s)) # Kernel#open no longer accepts URLs in Ruby 3.0+
doc.at('a.spell').text # => "how did"
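To illustrate why the URI tools are worth using, here is a small comparison, assuming a hypothetical search term that contains a reserved character. Replacing spaces by hand leaves the "&" unescaped, while URI.encode_www_form escapes it.

```ruby
require 'uri'

term = 'r&b music' # hypothetical search term, chosen for illustration

# Hand-rolled encoding: only spaces are handled, so the "&" would end the
# q parameter early.
naive = "https://www.google.co.in/search?q=#{term.gsub(' ', '+')}"

# URI.encode_www_form escapes reserved characters as well as spaces.
uri = URI.parse('https://www.google.co.in/search')
uri.query = URI.encode_www_form('q' => term)
safe = uri.to_s

puts naive # https://www.google.co.in/search?q=r&b+music
puts safe  # https://www.google.co.in/search?q=r%26b+music
```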
It works fine with "how did", but try it with "bnglore" or any other single-word string and it raises an error, the same one I was getting in my previous code: undefined method `text'.
It's not that hard to figure out. They're changing the HTML so you have to change your selector. "Inspect" the suggested word "bangalore" and see where it exists in relation to the previous path. Once you know that, it's easy to find a way to access the word:
doc.at('span.spell').next_element.text # => "bangalore"
Don't trust Google to do things the easy way, or even the best way, or be consistent. Just because they return HTML one way for words with spaces, doesn't mean they're going to do it the same way for a single word. I would do it consistently, but they might be trying to discourage you from mining their pages so don't be surprised if you see variations.
Now, you need to figure out how to write code that knows when to use one selector/method or the other. That's for you to do.

Find Google Map Line w/ Nokogiri

Using nokogiri I need to search through some HTML for something like:
new GLatLng(-14.468352,132.270434)
and then assign the latitude and longitude values in that code to two variables.
You haven't shown us any example HTML. Nokogiri seems to be the wrong tool for this job if you're just searching for plain text. You could simply do:
require 'open-uri'
html = URI.open('http://stackoverflow.com/questions/6739202/find-google-map-line-w-nokogiri').read # Kernel#open no longer accepts URLs in Ruby 3.0+
match = /new GLatLng\((?<lat>.+?),(?<long>.+?)\)/.match html
p match[:lat].to_f
#=> -14.468352
Or, if you need an array of all such matches, say the page also has new GLatLng(17.3,42.1) on it:
matches = html.scan /new GLatLng\((.+?),(.+?)\)/
p matches
#=> [["-14.468352", "132.270434"],["17.3", "42.1"]]
The only reason you might want to use Nokogiri would be to limit your searching to a particular HTML element (e.g. some <script> block).
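To end up with the two values in two variables, as the question asks, a sketch along these lines may help. The stricter numeric pattern and the sample string are assumptions; the safe navigation operator leaves both variables nil when nothing matches.

```ruby
# Matches "new GLatLng(<number>,<number>)" with optional whitespace.
COORD = /new GLatLng\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)/

# Sample input assumed for illustration.
html = '<script>var p = new GLatLng(-14.468352,132.270434);</script>'

# captures returns both groups; both variables stay nil without a match.
lat, lng = html.match(COORD)&.captures&.map(&:to_f)

p lat
p lng
```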

I am creating a Twitter clone in Ruby on Rails; how do I code it so that the '@...'s in the 'tweets' turn into links?

I am somewhat of a Rails newbie, so bear with me. I have most of the application figured out except for this one part.
def linkup_mentions_and_hashtags(text)
  text.gsub!(/@([\w]+)(\W)?/, '<a href="http://twitter.com/\1">@\1</a>\2')
  text.gsub!(/#([\w]+)(\W)?/, '<a href="http://twitter.com/search?q=%23\1">#\1</a>\2')
  text
end
I found this example here: http://github.com/jnunemaker/twitter-app
The link to the helper method: http://github.com/jnunemaker/twitter-app/blob/master/app/helpers/statuses_helper.rb
Perhaps you could use regular expressions to look for "@..." and then replace the matches with the corresponding link?
You could use a regular expression to search for @sometext{whitespace_or_endofstring}
You can use regular expressions. I don't know Ruby, but the code should be almost exactly like my example:
Regex.Replace("this is an example @AlbertEin",
"(?<type>[@#])(?<nick>\\w{1,}[^ ])",
"<a href=\"http://twitter.com/${nick}\">${type}${nick}</a>");
This example would return
this is an example <a href="http://twitter.com/AlbertEin">@AlbertEin</a>
if you run it on .NET.
The regex (?<type>[@#])(?<nick>\\w{1,}[^ ]) means: capture, and name "type", the character @ or #; then capture, and name "nick", the text that follows, which must contain at least one word character and runs until you find a white space.
Perhaps you can use a regular expression to parse out the words starting with @, then update the string at that location with the proper link.
This regular expression will give you words starting with @ symbols, but you might have to tweak it:
\@[\S]+\
You would use a regular expression to search for @username and then turn that into the corresponding link.
I use the following for the @ in PHP:
$ret = preg_replace("#(^|[\n ])@([^ \"\t\n\r<]*)#ise",
"'\\1<a href=\"http://www.twitter.com/\\2\" >@\\2</a>'",
$ret);
I've also been working on this, I'm not sure that it's 100% perfect, but it seems to work:
def auto_link_twitter(txt, options = {:target => "_blank"})
  txt.scan(/(^|\W|\s+)(#|@)(\w{1,25})/).each do |match|
    if match[1] == "#"
      txt.gsub!(/##{match.last}/, link_to("##{match.last}", "http://twitter.com/search/?q=##{match.last}", options))
    elsif match[1] == "@"
      txt.gsub!(/@#{match.last}/, link_to("@#{match.last}", "http://twitter.com/#{match.last}", options))
    end
  end
  txt
end
I pieced it together with some google searching and some reading up on String.scan in the api docs.
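For comparison, here is a single-pass sketch in plain Ruby that does not depend on Rails' link_to. The function name and the URL formats are assumptions modelled on the answers above. It escapes the text first, and it only links a name that follows whitespace or the start of a line, so that an escaped entity such as `&#39;` is not mistaken for a hashtag; the trade-off is that a mention directly after punctuation, like "(@bob)", is left unlinked.

```ruby
require 'cgi'

# Hypothetical helper: escapes the text, then links @mentions and
# #hashtags in one pass.
def link_mentions_and_hashtags(text)
  CGI.escapeHTML(text).gsub(/(^|\s)([@#])(\w{1,25})/) do
    prefix, sigil, name = $1, $2, $3
    href = if sigil == '@'
             "http://twitter.com/#{name}"
           else
             "http://twitter.com/search?q=%23#{name}"
           end
    %(#{prefix}<a href="#{href}">#{sigil}#{name}</a>)
  end
end

puts link_mentions_and_hashtags('hi @bob see #ruby')
```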

How to use ruby to get string between HTML <cite> tags?

Greetings everyone:
I would love to get some information from a huge collection of Google Search Result pages.
The only thing I need is the URLs inside a bunch of <cite></cite> HTML tags.
I could not find a proper solution any other way, so now I am turning to Ruby.
This is so far what I have written:
require 'net/http'
require 'uri'
url=URI.parse('http://www.google.com.au')
res= Net::HTTP.start(url.host, url.port){|http|
http.get('/#hl=en&q=helloworld')}
puts res.body
Unfortunately I cannot use the recommended hpricot ruby gem (because it misses a make command or something?)
So I would like to stick with this approach.
Now that I can get the response body as a string, all I need is to retrieve whatever is inside the <cite> HTML tags.
How should I do that? using regular expression? Can anyone give me an example?
Here's one way to do it using Nokogiri:
Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}
I think this will solve it:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten
# This one ignores empty tags:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.reject(&:empty?)
If you're having problems with hpricot, you could also try nokogiri which is very similar, and allows you to do the same things.
Split the string on the tag you want. Assuming only one instance of tag (or specify only one split) you'll have two pieces I'll call head and tail. Take tail and split it on the closing tag (once), so you'll now have two pieces in your new array. The new head is what was between your tags, and the new tail is the remainder of the string, which you may process again if the tag could appear more than once.
An example that may not be exactly correct but you get the idea:
head1, tail1 = str.split('<tag>', 2) # splits once, at the opening tag
head2, tail2 = tail1.split('</tag>', 2) # splits once, at the closing tag; head2 is the text between the tags
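Repeating that idea until no opening tag is left could be sketched as below. The helper name `between_tags` is made up, it uses String#partition rather than split, and it assumes the opening tag carries no attributes.

```ruby
# Hypothetical helper: collects the text between every <tag>...</tag> pair.
def between_tags(str, tag)
  open_tag  = "<#{tag}>"
  close_tag = "</#{tag}>"
  results = []
  rest = str

  loop do
    # partition returns [before, separator, after]; the separator is ""
    # when it was not found.
    _, found, rest = rest.partition(open_tag)
    break if found.empty?

    inner, found, rest = rest.partition(close_tag)
    break if found.empty?

    results << inner
  end

  results
end

p between_tags('a<cite>x.com</cite>b<cite>y.org</cite>', 'cite')
```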

Resources