How to scrape a span name in Nokogiri in Ruby? - ruby-on-rails
I want to scrape data off a website. The data is in the text of a span.
The HTML looks like this:
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1564808">1,564,808</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="107,928,762">$107.93M</span>
</p>
I want to search the whole page and get the data-value of each votes span (1564808 here, displayed as 1,564,808), not the gross value ($107.93M).
I tried various ways to get the data, for instance:
@votes = []
html_content = open("https://www.imdb.com/list/ls057823854/sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.css(".text-muted['span name=nv']").each do |i|
  @votes << i.text.strip
end
Try this code:
doc.css('div.lister-item-content > p.text-muted > span[name = nv]:nth-child(2)').map(&:text)
Which results in:
["1,564,941", "373,745", "2,004,624", "1,077,404", "887,189", "305,554", "207,904", "1,074,609", "748,393", "789,255", "1,224,753", "754,008", "634,752", "1,056,328", "1,604,158", "1,438,194", "629,504", "1,158,452", "517,609", "539,263", "1,443,979", "1,290,159", "161,981", "830,992", "1,427,193", "299,532", "289,184", "705,138", "615,264", "1,147,650", "1,030,826", "1,018,932", "921,730", "524,568", "557,482", "1,973,773", "813,743", "367,587", "342,800", "188,210", "649,467", "1,068,455", "547,990", "527,123", "805,964", "420,447", "441,780", "318,295", "1,004,742", "446,096", "203,977", "581,108", "1,754,019", "616,804", "484,534", "265,048", "958,244", "289,190", "651,605", "503,185", "320,564", "660,685", "476,016", "432,155", "588,572", "374,705", "378,561", "337,801", "463,467", "508,822", "187,810", "1,128,184", "221,361", "261,529", "322,314", "324,435", "116,258", "318,628", "1,334,595", "222,651", "1,155,754", "228,713", "205,956", "271,162", "293,774", "33,136", "80,385", "703,048", "195,712", "274,244", "233,133", "121,874", "208,462", "513,797", "485,112", "120,750", "135,232", "57,411", "125,431", "297,193"]
Related
How to replace last free space to nbsp in ruby
How can I replace the last space with &nbsp; in Ruby? I have this in the database:
<h1>Hello dear friend!</h1>
<p>How are you?</p>
<figure><img src="..." alt="..." /></figure>
<p>Bye!</p>
And I need this output:
<h1>Hello dear&nbsp;friend!</h1>
<p>How are&nbsp;you?</p>
<figure><img src="..." alt="..." /></figure>
<p>Bye!</p>
I tried to play with Nokogiri:
text = Nokogiri::HTML::DocumentFragment.parse(...)
text.css('h1, h2, h3, h4, h5, h6, p, li').each do |tag|
  tag_arr = tag.content.split(' ')
  tag_last_words = tag_arr[tag_arr.length-2..tag_arr.length]
  tag_return = tag_arr[0..-2].push(tag_last_words.join(' '))
  tag_return = tag_return.join(' ')
  tag.content = tag_return
end
But I can't get rid of some bugs:
all attributes and inner tags (HTML) are deleted
instead of I have
Why? To avoid a single word wrapping onto a new line on mobile devices. (JS is not an option in my case.)
ng-options for ActiveSupport::TimeZone returns Unexpected end of expression: [
I'm trying to list timezone options in an erb file. I have the following code to do so: <select class="pull-left" ng-model="schedule.deliver_timezone" ng-options="zone for zone in <%= ActiveSupport::TimeZone.zones_map.map { |zone_name, zone_desc| zone_name.to_s } %>" name="deliver_at_tz"/> If I just run ActiveSupport::TimeZone.zones_map.map { |zone_name, zone_desc| zone_name.to_s }, I get an array of timezone names as expected. However, when actually hitting this template, I get: Unexpected end of expression: [ This dumps out the following mess into the console: <select class="pull-left ng-pristine ng-valid" ng-model="schedule.deliver_timezone" ng-options="zone for zone in [" utc",="" "eastern="" time="" (us="" &="" canada)",="" "international="" date="" line="" west",="" "midway="" island",="" "american="" samoa",="" "hawaii",="" "alaska",="" "pacific="" "tijuana",="" "mountain="" "arizona",="" "chihuahua",="" "mazatlan",="" "central="" "saskatchewan",="" "guadalajara",="" "mexico="" city",="" "monterrey",="" america",="" "indiana="" (east)",="" "bogota",="" "lima",="" "quito",="" "atlantic="" (canada)",="" "caracas",="" "la="" paz",="" "santiago",="" "newfoundland",="" "brasilia",="" "buenos="" aires",="" "georgetown",="" "greenland",="" "mid-atlantic",="" "azores",="" "cape="" verde="" is.",="" "dublin",="" "edinburgh",="" "lisbon",="" "london",="" "casablanca",="" "monrovia",="" "belgrade",="" "bratislava",="" "budapest",="" "ljubljana",="" "prague",="" "sarajevo",="" "skopje",="" "warsaw",="" "zagreb",="" "brussels",="" "copenhagen",="" "madrid",="" "paris",="" "amsterdam",="" "berlin",="" "bern",="" "rome",="" "stockholm",="" "vienna",="" "west="" central="" africa",="" "bucharest",="" "cairo",="" "helsinki",="" "kyiv",="" "riga",="" "sofia",="" "tallinn",="" "vilnius",="" "athens",="" "istanbul",="" "minsk",="" "jerusalem",="" "harare",="" "pretoria",="" "moscow",="" "st.="" petersburg",="" "volgograd",="" "kuwait",="" "riyadh",="" "nairobi",="" 
"baghdad",="" "tehran",="" "abu="" dhabi",="" "muscat",="" "baku",="" "tbilisi",="" "yerevan",="" "kabul",="" "ekaterinburg",="" "islamabad",="" "karachi",="" "tashkent",="" "chennai",="" "kolkata",="" "mumbai",="" "new="" delhi",="" "kathmandu",="" "astana",="" "dhaka",="" "sri="" jayawardenepura",="" "almaty",="" "novosibirsk",="" "rangoon",="" "bangkok",="" "hanoi",="" "jakarta",="" "krasnoyarsk",="" "beijing",="" "chongqing",="" "hong="" kong",="" "urumqi",="" "kuala="" lumpur",="" "singapore",="" "taipei",="" "perth",="" "irkutsk",="" "ulaan="" bataar",="" "seoul",="" "osaka",="" "sapporo",="" "tokyo",="" "yakutsk",="" "darwin",="" "adelaide",="" "canberra",="" "melbourne",="" "sydney",="" "brisbane",="" "hobart",="" "vladivostok",="" "guam",="" "port="" moresby",="" "magadan",="" "solomon="" caledonia",="" "fiji",="" "kamchatka",="" "marshall="" "auckland",="" "wellington",="" "nuku'alofa",="" "tokelau="" "samoa"]"="" name="deliver_at_tz"> I'm not really sure exactly how this format is supposed to look, but it's clearly wrong. What exactly is wrong with this logic? 
Full trace (sanitized for sensitive info):
Error: [$parse:ueoe] Unexpected end of expression: [
http://errors.angularjs.org/1.3.0-beta.10/$parse/ueoe?p0=%5B
at http://127.0.0.1:3000/application.js
followed by the same <select> element dump as shown above.
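In the dump, the ng-options attribute ends at the first double quote of the array, because the attribute itself is delimited with double quotes. Try delimiting the attribute with single quotes and building the array string yourself; this should work:
<select class="pull-left" ng-model="schedule.deliver_timezone" ng-options='zone for zone in <%= "[\"#{ActiveSupport::TimeZone.zones_map.values.collect{|z| z.name }.join('","')}\"]" %>' name="deliver_at_tz" />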
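Another way to keep the quotes from ending the attribute is to serialize the names to JSON and HTML-escape the result. The sketch below uses only the standard library (ERB and JSON) rather than Rails, so the escaping is done by hand; zone_names is a hypothetical stand-in for the names returned by ActiveSupport::TimeZone, and the select element is cut down to the relevant attribute.

```ruby
require "erb"
require "json"

# Hypothetical stand-in for the zone names from ActiveSupport::TimeZone.
zone_names = ["UTC", "Eastern Time (US & Canada)", "Nuku'alofa"]

# Serialize to a JSON array, then HTML-escape it so that neither double
# nor single quotes can terminate the surrounding attribute.
options_expr = ERB::Util.html_escape(zone_names.to_json)

template = ERB.new(
  %q(<select ng-options="zone for zone in <%= options_expr %>"></select>)
)
html = template.result(binding)

puts html
```

The browser decodes the entities when it parses the attribute, so Angular should see a plain array literal. In a Rails view, <%= %> already escapes its output, so passing the JSON string directly would normally be enough.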
if else if else in angular js
I have to incorporate Angular into my Slim file, and I'm not sure how to translate an if/else block into Angular. I know Angular doesn't have an ng-if-else directive, but is there a way to convert the code below to Angular?
- if @cart['empty']
  cart is empty
- elsif @cart['invalid']
  can't proceed
- else
  - @cart['Items'].each do |item|
    #{item['description']}
I want to achieve something like this:
ng-if="cart.empty"
  cart is empty
ng-else-if cart.invalid
  can't proceed
ng-else
  ng-repeat="cart.Items as item"
    item.description
What you're looking for is the ng-switch directive. First, define a function that returns the switch condition:
$scope.getStatus = function(cart) {
  if (cart.empty) return 'empty';
  if (cart.invalid) return 'invalid';
}
Then use the directive:
<div ng-switch="getStatus(cart)">
  <div ng-switch-when="empty">Cart is empty</div>
  <div ng-switch-when="invalid">Can't proceed</div>
  <div ng-switch-default>.... ok ....</div>
</div>
Nokogiri parsing missing element create issue
I have a plain HTML document with no CSS, and I need to export some of its content to an Excel sheet. I tried Nokogiri, but it works on a CSS basis. Has anybody done this?
<html>
<head></head>
<body>
***NOTE*** <br>
Items <br> <br>
Invoice Number : [78945824] PO Number : [4587958] <br>
Track It : 12345 <br> <br>
Items <br> <br>
Invoice Number : [79546828] PO Number : [4567892] <br> <br> <br>
Items <br> <br>
Invoice Number : [78976824] PO Number : [897569] <br>
Track It : 12345 <br>
</body>
</html>
I am able to retrieve the PO numbers and tracking numbers:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
tracking_numbers = page.css("a").text.split
[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers
=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]
When we zip those together, we get:
=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", "nil"]]
What we want is:
=> [["4587958", "12543"], ["4567892", "nil"], ["4587958", "12356"]]
Try this:
data = page.css("body").text
data = data.gsub(" ", "").split(/\n/)
po = []
track = []
data.each do |i|
  if i.include? "PONumber"
    po << i.split("PONumber:").last.scan(/\d+/)[0]
  end
  if i.include? "TrackIt"
    track << i.split("TrackIt:").last
  end
end
po.zip(track)
If you can use a regex to scan for all the PO numbers (po_numbers), you can do the same for the tracking numbers (tracking_numbers):
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
Because of \d*, a tracking line with no number still matches and yields an empty string, so the array keeps one entry per tracking line. You can then walk through both arrays by index:
po_numbers.each_with_index do |elm, index|
  p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
Update
This regex matches the updated HTML:
/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/
It matches both an empty tracking number (the capture is then nil) and one with a link.
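When an item has no tracking line at all, two separate scans can still drift out of step. One way around that, sketched here in plain Ruby, is to split the text into one chunk per "Items" heading and read both fields from the same chunk. The sample text is a hypothetical stand-in for the body text of the page in the question, and pairs is an illustrative name.

```ruby
# Hypothetical body text, shaped like the document in the question.
text = <<~TEXT
  Items
  Invoice Number : [78945824] PO Number : [4587958]
  Track It : 12345
  Items
  Invoice Number : [79546828] PO Number : [4567892]
  Items
  Invoice Number : [78976824] PO Number : [897569]
  Track It : 12345
TEXT

# One chunk per "Items" heading; drop the empty chunk before the first one.
chunks = text.split(/^Items$/).map(&:strip).reject(&:empty?)

# Read both fields from the same chunk so they cannot get out of step.
pairs = chunks.map do |chunk|
  po       = chunk[/PO Number : \[(\d+)\]/, 1]
  tracking = chunk[/Track It : (\d+)/, 1] # nil when there is no tracking line
  [po, tracking]
end

p pairs
```

Each pair keeps a real nil for a missing tracking number, which can be turned into an empty cell when writing the spreadsheet.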
How to split value from a string in ruby
My example string is listed below. I want to split out every value into an array or hash so that I can process each element.
<div id="test">
accno: 123232323 <br>
id: 5443534534534 <br>
name: test_name <br>
url: www.google.com <br>
</div>
How can I fetch each value into a hash or array?
With a regex it's easy:
s = '<div id="test"> accno: 123232323 <br> id: 5443534534534 <br> name: test_name <br> url: www.google.com <br> </div>'
p s.scan(/\s+(.*?)\:\s+(.*?)<br>/).map.with_object({}) { |i, h| h[i[0].to_sym] = i[1].strip }
Or you can make the keys (accno, id, name, url) more precise with ([a-z]+) if they contain only lowercase letters. On this single-line string that also keeps the id="test" attribute text out of the first key, which the looser (.*?) lets in:
p s.scan(/\s+([a-z]+)\:\s+(.*?)<br>/).map.with_object({}) { |i, h| h[i[0].to_sym] = i[1].strip }
Result:
{:accno=>"123232323", :id=>"5443534534534", :name=>"test_name", :url=>"www.google.com"}
Update
In case of:
<div id="test"> accno: 123232323 id: 5443534534534 name: test_name url: www.google.com </div>
the regex will be:
/([a-z]+)\:\s*(.*?)\s+/
([a-z]+) is the hash key. If it can contain - or _, add them to the character class: ([a-z\-_]+). This scheme presumes that the key is followed by : (perhaps with a space) and then some text up to the next whitespace. Use (\s+|<) at the end if the line can end without a space:
url: www.google.com</div>
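The last variant, where a value may end at whitespace or directly at the closing tag, can be sketched as follows. The sample string and the character class [a-z_-] are assumptions for illustration, and record is a hypothetical name.

```ruby
# Sample string without <br> tags; the last value runs straight into </div>.
s = '<div id="test"> accno: 123232323 id: 5443534534534 ' \
    'name: test_name url: www.google.com</div>'

# Key: lowercase letters, "_" or "-", followed by a colon.
# Value: shortest run of characters up to whitespace or "<".
pairs = s.scan(/([a-z_-]+):\s*(.*?)(?:\s+|<)/)

# Build a hash with symbol keys from the [key, value] pairs.
record = pairs.to_h { |key, value| [key.to_sym, value] }

p record
```

The id="test" attribute is not picked up because its name is followed by = rather than a colon. Values that themselves contain spaces would need a different terminator.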
If you are processing HTML, use an HTML/XML parser like Nokogiri to pull out the text content of the required <div> tag with a CSS selector, then parse the text into fields.
To install Nokogiri:
gem install nokogiri
Then process the page and text:
require "nokogiri"
require "open-uri"
# re matches: spaces (word) colon spaces (anything) space
re_fields = /\s+(?<field>\w+):\s+(?<data>.*?)\s/
# Somewhere to store the results
record = {}
# URI.open comes from open-uri; a bare open() no longer fetches URLs on Ruby 3.0+
page = Nokogiri::HTML( URI.open("http://example.com/divtest.html") )
# Select the text from <div id=test> and scan into fields with the regex
page.css( "div#test" ).text.scan( re_fields ){ |field, data| record[ field ] = data }
p record
Results in:
{"accno"=>"123232323", "id"=>"5443534534534", "name"=>"test_name", "url"=>"www.google.com"}
The result of page.css( "blah" ) can also be treated as an array if you are processing multiple elements, and looped through with .each:
# Somewhere to store the results
records = []
# Select each <div id=test> and scan its text into fields with the regex
page.css( "div#test" ).each{ |div|
  record = {}
  div.text.scan( re_fields ){ |field, data| record[field] = data }
  records.push record
}
p records