I want to scrape data off a website. The data is in the text of a span.
The HTML looks like this:
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1564808">1,564,808</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="107,928,762">$107.93M</span>
</p>
I want to search the whole page and get the data-value of each votes span (1,564,808 here), not the gross value ($107.93M).
I tried various ways to get the data, for instance:
@votes = []
html_content = open("https://www.imdb.com/list/ls057823854/sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.css(".text-muted['span name=nv']").each do |i|
  @votes << i.text.strip
end
Try this code:
doc.css('div.lister-item-content > p.text-muted > span[name = nv]:nth-child(2)').map(&:text)
Which results in:
["1,564,941", "373,745", "2,004,624", "1,077,404", "887,189", "305,554", "207,904", "1,074,609", "748,393", "789,255", "1,224,753", "754,008", "634,752", "1,056,328", "1,604,158", "1,438,194", "629,504", "1,158,452", "517,609", "539,263", "1,443,979", "1,290,159", "161,981", "830,992", "1,427,193", "299,532", "289,184", "705,138", "615,264", "1,147,650", "1,030,826", "1,018,932", "921,730", "524,568", "557,482", "1,973,773", "813,743", "367,587", "342,800", "188,210", "649,467", "1,068,455", "547,990", "527,123", "805,964", "420,447", "441,780", "318,295", "1,004,742", "446,096", "203,977", "581,108", "1,754,019", "616,804", "484,534", "265,048", "958,244", "289,190", "651,605", "503,185", "320,564", "660,685", "476,016", "432,155", "588,572", "374,705", "378,561", "337,801", "463,467", "508,822", "187,810", "1,128,184", "221,361", "261,529", "322,314", "324,435", "116,258", "318,628", "1,334,595", "222,651", "1,155,754", "228,713", "205,956", "271,162", "293,774", "33,136", "80,385", "703,048", "195,712", "274,244", "233,133", "121,874", "208,462", "513,797", "485,112", "120,750", "135,232", "57,411", "125,431", "297,193"]
I have a plain HTML document with no CSS classes or ids, and I need to export some of its content to an Excel sheet. I tried Nokogiri, but it selects elements with CSS selectors.
Has anybody tried this?
<html>
<head></head>
<body>
***NOTE***
<br>
Items
<br>
<br>
Invoice Number : [78945824] PO Number : [4587958]
<br>
Track It : 12345
<br>
<br>
Items
<br>
<br>
Invoice Number : [79546828] PO Number : [4567892]
<br>
<br>
<br>
Items
<br>
<br>
Invoice Number : [78976824] PO Number : [897569]
<br>
Track It : 12345
<br>
</body>
</html>
I am able to retrieve the PO numbers and tracking numbers:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
tracking_numbers = page.css("a").text.split
[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers
=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]
When we zip those together, we get:
=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", nil]]
What we want is:
=> [["4587958", "12543"], ["4567892", nil], ["4587958", "12356"]]
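Try this:
data = page.css("body").text
lines = data.gsub(" ", "").split(/\n/)
pairs = []
lines.each do |line|
  if line.include?("PONumber")
    # Start a new pair; the tracking number stays nil unless a TrackIt line follows
    pairs << [line.split("PONumber:").last[/\d+/], nil]
  elsif line.include?("TrackIt") && pairs.any?
    # Attach the tracking number to the most recent PO number
    pairs.last[1] = line.split("TrackIt:").last[/\d+/]
  end
end
pairs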
If you can use a regex to scan for all the PO numbers (po_numbers), you can do the same for the tracking numbers (tracking_numbers):
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
The returned array has one entry per "Tracking no" line, including a blank entry (an empty string or nil) where the number is missing, so the two arrays line up by index and you can walk through both together:
po_numbers.each_with_index do |elm, index|
p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
Update
This regex matches the updated HTML:
/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/
It matches both an empty tracking number and one wrapped in a link.
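Another way to keep PO numbers and tracking numbers aligned is to split the text into one chunk per "Items" block and look for both values inside each chunk. The following is a sketch only: the sample text is assumed to be what page.css("body").text returns for the HTML in the question, and the variable names are illustrative.

```ruby
# Assumed body text, based on the sample HTML in the question.
text = <<~TEXT
  ***NOTE***
  Items

  Invoice Number : [78945824] PO Number : [4587958]
  Track It : 12345

  Items

  Invoice Number : [79546828] PO Number : [4567892]


  Items

  Invoice Number : [78976824] PO Number : [897569]
  Track It : 12345
TEXT

# Split on lines that contain only "Items"; drop the text before the first block.
blocks = text.split(/^Items[ \t]*$/).drop(1)

pairs = blocks.map do |block|
  po    = block[/PO Number : \[(\d+)\]/, 1]
  track = block[/Track It : (\d+)/, 1] # nil when the block has no tracking line
  [po, track]
end

p pairs
# => [["4587958", "12345"], ["4567892", nil], ["897569", "12345"]]
```

Because each pair is built from a single block, a missing tracking number shows up as nil in the right position instead of shifting the later entries.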
My example string is shown below. I want to split it into key/value pairs in an array or hash so that I can process each value.
<div id="test">
accno: 123232323 <br>
id: 5443534534534 <br>
name: test_name <br>
url: www.google.com <br>
</div>
How can I fetch each value into a hash or array?
With regex it's easy:
s = '<div id="test">
accno: 123232323 <br>
id: 5443534534534 <br>
name: test_name <br>
url: www.google.com <br>
</div>'
p s.scan(/\s+(.*?)\:\s+(.*?)<br>/).map.with_object({}) { |i, h| h[i[0].to_sym] = i[1].strip }
Or you can make the key pattern more precise (accno, id, name, url), for example ([a-z]+) if the keys contain only lowercase letters:
p s.scan(/\s+([a-z]+)\:\s+(.*?)<br>/).map.with_object({}) { |i, h| h[i[0].to_sym] = i[1].strip }
Result:
{:accno=>"123232323", :id=>"5443534534534", :name=>"test_name", :url=>"www.google.com"}
Update
If all the fields are on one line, as in:
<div id="test"> accno: 123232323 id: 5443534534534 name: test_name url: www.google.com </div>
the regex becomes:
/([a-z]+)\:\s*(.*?)\s+/
([a-z]+) captures the hash key. If keys can contain - or _, add them to the character class: ([a-z_-]+). This pattern assumes that the key is followed by a colon (optionally with whitespace) and then a value that runs up to the next whitespace. If the line can end without a space, as in url: www.google.com</div>, end the pattern with (\s+|<) instead of \s+.
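As a rough illustration of the single-line case, here is a sketch that combines the wider key class with the (\s+|<) ending and builds the hash with to_h. The sample string is a variation of the one above with no space before </div>; the variable names are illustrative only.

```ruby
# Single-line variant; note there is no space before </div>.
s = '<div id="test"> accno: 123232323 id: 5443534534534 name: test_name url: www.google.com</div>'

# Key: lowercase letters, "_" or "-", followed by a colon.
# Value: everything up to the next whitespace or the start of a tag.
pattern = /([a-z_-]+):\s*(.*?)(?:\s+|<)/

# to_h with a block needs Ruby 2.6 or later.
fields = s.scan(pattern).to_h { |key, value| [key.to_sym, value] }

p fields
```

The id="test" attribute is not picked up because "id" there is followed by "=" rather than a colon.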
If you are processing HTML, use an HTML/XML parser such as Nokogiri to pull out the text content of the required <div> tag with a CSS selector. Then parse that text into fields.
To install nokogiri:
gem install nokogiri
Then process the page and text:
require "nokogiri"
require "open-uri"
# re matches: spaces (word) colon spaces (anything) space
re_fields = /\s+(?<field>\w+):\s+(?<data>.*?)\s/
# Somewhere to store the results
record = {}
# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs
page = Nokogiri::HTML( URI.open("http://example.com/divtest.html") )
# Select the text from <div id=test> and scan into fields with the regex
page.css( "div#test" ).text.scan( re_fields ){ |field, data|
record[ field ] = data
}
p record
Results in:
{"accno"=>"123232323", "id"=>"5443534534534", "name"=>"test_name", "url"=>"www.google.com"}
The result of page.css( "blah" ) behaves like an array, so if the selector matches multiple elements you can loop through them with .each:
# Somewhere to store the results
records = []
# Select the text from <div id=test> and scan into fields with the regex
page.css( "div#test" ).each{ |div|
record = {}
div.text.scan( re_fields ){ |field, data|
record[field] = data
}
records.push record
}
p records
I have a page that is formatted like so:
<h1>Header</h1>
<h2>Subheader</h2>
<h3>Subsubheader</h3>
<h1>Another header</h1>
Is it possible to server-side generate a table of contents / outline at the start of the page, like Wikipedia does in its articles? I use Ruby on Rails.
EDIT: WITHOUT JavaScript!
I created a class for this purpose today. It depends on Nokogiri (http://www.nokogiri.org/), but that gem already comes with Rails.
Put this in app/models/toc.rb:
class Toc
  attr_accessor :html

  TOC_CLASS = "toc".freeze
  TOC_ELEMENT = "p".freeze
  TOC_ITEMS = "h1 | h2 | h3 | h4 | h5".freeze
  UNIQUEABLE_ELEMENTS = "h1 | h2 | h3 | h4 | h5 | p".freeze

  def initialize(content)
    # Parse as a fragment so no <html>/<body> wrapper is added
    @html = Nokogiri::HTML.fragment content
  end

  # Returns the content with a fresh TOC container prepended
  def generate
    clear
    set_uniq_ids
    toc = create_container
    html.xpath(TOC_ITEMS).each { |node| toc << toc_item_tag(node) }
    html.prepend_child toc
    html.to_s
  end

  private

  # Remove any TOC left over from a previous run
  def clear
    html.search(".#{TOC_CLASS}").remove
  end

  # Give every heading and paragraph a random id to link to
  def set_uniq_ids
    html.xpath(UNIQUEABLE_ELEMENTS).each { |node| node["id"] = rand_id }
  end

  # Eight random lowercase letters
  def rand_id
    (0...8).map { ('a'..'z').to_a[rand(26)] }.join
  end

  def create_container
    toc = Nokogiri::XML::Node.new TOC_ELEMENT, html
    toc["class"] = TOC_CLASS
    toc
  end

  def toc_item_tag(node)
    "<a data-turbolinks='false' class=\"toc-link toc-link-#{node.name}\" href=\"##{node["id"]}\">#{node.text}</a>"
  end
end
Use it like this:
toc = Toc.new article.body
body_with_toc = toc.generate
article.update body: body_with_toc
You need to generate a data source from your heading hierarchy, something like this:
@toc = [ ['header', 0], ['subheader', 1], ['subsubheader', 2],
['header2', 0], ['header3', 0], ['subheader2', 1]
]
Then it is easy to render in a template, for example:
<%- @toc.each do |item, distance| %>
<%= ('&nbsp;' * distance * 5).html_safe %>
<%= item %>
<br/>
<%- end %>
Would give you:
header
     subheader
          subsubheader
header2
header3
     subheader2
Of course, you can also use 'distance' to pick a style or font size for each level instead of only indenting, but I hope you get the main idea.
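One way to produce such a list of [title, depth] pairs from the page itself is sketched below. It assumes simple, well-formed heading markup like the sample in the question and uses a plain regex; for arbitrary HTML a real parser such as Nokogiri would be more robust. The variable names are illustrative.

```ruby
# Sample page body from the question.
html = <<~HTML
  <h1>Header</h1>
  <h2>Subheader</h2>
  <h3>Subsubheader</h3>
  <h1>Another header</h1>
HTML

# Each match yields the heading level digit and the inner text;
# \1 makes sure the closing tag has the same level as the opening tag.
toc = html.scan(%r{<h([1-6])[^>]*>(.*?)</h\1>}m).map do |level, text|
  [text.strip, level.to_i - 1] # h1 => depth 0, h2 => depth 1, ...
end

p toc
# => [["Header", 0], ["Subheader", 1], ["Subsubheader", 2], ["Another header", 0]]

# Plain-text preview of the outline, two spaces per level.
toc.each { |title, depth| puts "#{'  ' * depth}#{title}" }
```

The resulting array can be assigned to an instance variable in the controller and rendered by a template like the one above.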
Yes, it is possible. You don't really need Rails for this; you can also use JavaScript to generate a table of contents.
Here is an example library that you can use:
http://www.kryogenix.org/code/browser/generated-toc/
Alternatively, you could create the anchor links as you loop through the elements in your Rails ERB/Haml views.