I want to scrape data off a website. The data is in the text of a span.
The HTML looks like this:
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1564808">1,564,808</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="107,928,762">$107.93M</span>
</p>
I want to search the whole page and get the votes value for each entry, i.e. the data-value of the first span (shown as 1,564,808), not the $107.93M gross value.
I tried various ways to get the data, for instance:
@votes = []
html_content = open("https://www.imdb.com/list/ls057823854/sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.css(".text-muted['span name=nv']").each do |i|
  @votes << i.text.strip
end
Try this code:
doc.css('div.lister-item-content > p.text-muted > span[name = nv]:nth-child(2)').map(&:text)
Which results in:
["1,564,941", "373,745", "2,004,624", "1,077,404", "887,189", "305,554", "207,904", "1,074,609", "748,393", "789,255", "1,224,753", "754,008", "634,752", "1,056,328", "1,604,158", "1,438,194", "629,504", "1,158,452", "517,609", "539,263", "1,443,979", "1,290,159", "161,981", "830,992", "1,427,193", "299,532", "289,184", "705,138", "615,264", "1,147,650", "1,030,826", "1,018,932", "921,730", "524,568", "557,482", "1,973,773", "813,743", "367,587", "342,800", "188,210", "649,467", "1,068,455", "547,990", "527,123", "805,964", "420,447", "441,780", "318,295", "1,004,742", "446,096", "203,977", "581,108", "1,754,019", "616,804", "484,534", "265,048", "958,244", "289,190", "651,605", "503,185", "320,564", "660,685", "476,016", "432,155", "588,572", "374,705", "378,561", "337,801", "463,467", "508,822", "187,810", "1,128,184", "221,361", "261,529", "322,314", "324,435", "116,258", "318,628", "1,334,595", "222,651", "1,155,754", "228,713", "205,956", "271,162", "293,774", "33,136", "80,385", "703,048", "195,712", "274,244", "233,133", "121,874", "208,462", "513,797", "485,112", "120,750", "135,232", "57,411", "125,431", "297,193"]
I'm creating custom tags that I'll use later to filter some data. However, when I add the tags to an array, I get the following:
"[\"witcher 3\", \"badass\", \"epic\"]"
@tags = []
params[:tags].split(', ').map do |tag|
  @tags.push(tag.strip)
end
# About 5 lines under
FileDetail.create!(path: path, creation_date: date, tags: @tags)
Why do these \ show up, and why doesn't .strip work?
Thank you in advance
You are storing an array of strings in @tags, and \" is an escaped character, in this case the " that Ruby uses to delimit String objects.
Consider the following code (and try it in IRB):
foo = ["bar", "baz"]
#=> ["bar", "baz"]
foo.inspect
#=> "[\"bar\", \"baz\"]"
foo.each { |f| puts "tag: #{f}" }
# tag: bar
# tag: baz
As you can see, there is really no \ character to strip from the string; it's just how Ruby outputs a String representation. So your code doesn't need the .strip method:
@tags = []
params[:tags].split(', ').map do |tag|
  @tags.push(tag)
end
Not related to your question, but still relevant: the split method returns an array, so there is no need to create one beforehand and push items to it; just assign the returned array to @tags.
For example:
params[:tags] = "witcher 3, badass, epic"
#=> "witcher 3, badass, epic"
@tags = params[:tags].split(', ')
#=> ["witcher 3", "badass", "epic"]
If you want, you can still use map and strip to remove leading and trailing spaces:
params[:tags] = "witcher 3, badass , epic "
#=> "witcher 3, badass , epic "
params[:tags].split(",").map(&:strip)
#=> ["witcher 3", "badass", "epic"]
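Putting the pieces together, here is a small sketch of a helper that does the split and strip in one place. The name parse_tags is made up for illustration, and dropping blank entries is an extra assumption that the answer above does not require.

```ruby
# parse_tags is a hypothetical helper name, not part of Rails.
def parse_tags(raw)
  raw.to_s              # tolerate a missing (nil) parameter
     .split(",")        # split on commas only
     .map(&:strip)      # trim spaces around each tag
     .reject(&:empty?)  # assumption: blank tags are not wanted
end

tags = parse_tags("witcher 3, badass , epic ")

# The backslashes only exist in the inspected (printed) form of the array.
puts tags.inspect
tags.each { |tag| puts "tag: #{tag}" }
```

The elements themselves contain no backslash characters; only the output of inspect, once it is itself shown as a string, displays them.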
I have a plain HTML document with no CSS classes or ids, and I need to export some of its content to an Excel sheet. I tried Nokogiri, but it selects elements by CSS.
Has anybody done this?
<html>
<head></head>
<body>
***NOTE***
<br>
Items
<br>
<br>
Invoice Number : [78945824] PO Number : [4587958]
<br>
Track It : 12345
<br>
<br>
Items
<br>
<br>
Invoice Number : [79546828] PO Number : [4567892]
<br>
<br>
<br>
Items
<br>
<br>
Invoice Number : [78976824] PO Number : [897569]
<br>
Track It : 12345
<br>
</body>
</html>
I am able to retrieve the PO numbers and the tracking numbers:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
tracking_numbers = page.css("a").text.split
[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers
=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]
When we zip those together, we get:
=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", "nil"]]
What we want is:
=> [["4587958", "12543"], ["4567892", "nil"], ["4587958", "12356"] ]
Try this
data = page.css("body").text
data = data.gsub(" ","").split(/\n/)
po = []
track = []
data.each do |i|
  if i.include? "PONumber"
    po << i.split("PONumber:").last.scan(/\d+/)[0]
  end
  if i.include? "TrackIt"
    track << i.split("TrackIt:").last
  end
end
po.zip(track)
If you can use a regex to scan for all the PO numbers (po_numbers), you can do the same for the tracking numbers (tracking_numbers):
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
The returned array keeps an empty or nil entry where a tracking number is missing, so you can walk through both arrays by index to pair each PO number with its tracking number:
po_numbers.each_with_index do |elm, index|
  p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
Update
This regex matches the updated HTML:
/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/
It matches both empty track number and one with a link.
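Another way to keep each PO number lined up with its own tracking number is to split the document into one block per "Items" heading and search each block separately. The sketch below works on the raw HTML string from the question and assumes the tracking number appears as plain text after "Track It : "; pair_po_with_tracking is a made-up helper name, and the comma-joined lines at the end are only a naive stand-in for a real spreadsheet export.

```ruby
# Raw HTML as in the question; in practice it might come from File.read("a.html").
html = <<~HTML
  ***NOTE***
  <br>
  Items
  <br>
  <br>
  Invoice Number : [78945824] PO Number : [4587958]
  <br>
  Track It : 12345
  <br>
  <br>
  Items
  <br>
  <br>
  Invoice Number : [79546828] PO Number : [4567892]
  <br>
  <br>
  <br>
  Items
  <br>
  <br>
  Invoice Number : [78976824] PO Number : [897569]
  <br>
  Track It : 12345
  <br>
HTML

# Hypothetical helper: one [po, tracking] pair per "Items" block.
def pair_po_with_tracking(text)
  pairs = text.split(/^[ \t]*Items[ \t]*$/).map do |block|
    po = block[/PO Number : \[(\d+)\]/, 1]    # nil when the block has no PO number
    tracking = block[/Track It : (\d+)/, 1]   # nil when the block has no tracking number
    [po, tracking]
  end
  # Drop anything before the first "Items" heading (it has no PO number).
  pairs.reject { |po, _tracking| po.nil? }
end

rows = pair_po_with_tracking(html)

# Naive comma-separated lines with a header row; a nil tracking number becomes an empty cell.
csv_lines = [["PO Number", "Tracking Number"], *rows].map { |row| row.join(",") }
puts csv_lines
```

Because each block is searched on its own, a missing tracking number stays a nil in the right row instead of shifting the later values up.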
So I found myself needing to remove <br /> tags from the beginning and end of strings in a project I'm working on. I made a quick little method that does what I need it to do but I'm not convinced it's the best way to go about doing this sort of thing. I suspect there's probably a handy regular expression I can use to do it in only a couple of lines. Here's what I got:
def remove_breaks(text)
  if text != nil and text != ""
    text.strip!
    index = text.rindex("<br />")
    while index != nil and index == text.length - 6
      text = text[0, text.length - 6]
      text.strip!
      index = text.rindex("<br />")
    end
    text.strip!
    index = text.index("<br />")
    while index != nil and index == 0
      text = text[6, text.length]
      text.strip!
      index = text.index("<br />")
    end
  end
  return text
end
Now the "<br />" could really be anything, and it'd probably be more useful to make a general use function that takes as an argument the string that needs to be stripped from the beginning and end.
I'm open to any suggestions on how to make this cleaner because this just seems like it can be improved.
gsub can take a regular expression:
text.gsub!(/(<br \/>\s*)*$/, '')
text.gsub!(/^(\s*<br \/>)*/, '')
text.strip!
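One caveat with the patterns above: in Ruby, ^ and $ match at every line boundary, not only at the ends of the string, so a <br /> that happens to end a line in the middle of the text would be removed as well. If that matters, a sketch using the \A and \z anchors (start and end of the whole string) could look like this. It returns a new string rather than modifying the argument, which is a design choice of this example and not something the question requires.

```ruby
def remove_breaks(text)
  return text if text.nil? || text.empty?

  text
    .sub(/\A(?:\s*<br \/>)+/, "")   # leading breaks, with any whitespace before each
    .sub(/(?:<br \/>\s*)+\z/, "")   # trailing breaks, with any whitespace after each
    .strip                          # leftover whitespace at both ends
end

sample = "<br /> <br />foo<br />\nbar<br /> <br /> "
cleaned = remove_breaks(sample)
p cleaned
```

The <br /> between foo and the newline is kept, because only the very start and very end of the string are anchored.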
class String
  def strip_this!(t)
    # Removes leading and trailing occurrences of t
    # from the string, plus surrounding whitespace.
    t = Regexp.escape(t)
    sub!(/^(\s* #{t} \s*)+ /x, '')
    sub!(/ (\s* #{t} \s*)+ $/x, '')
  end
end
# For example.
str = ' <br /> <br /><br /> foo bar <br /> <br /> '
str.strip_this!('<br />')
p str # => 'foo bar'
You can use chomp! and slice! methods. See:
http://ruby-doc.org/core-1.8.7/String.html
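As a rough illustration of that suggestion, here is one way chomp! and slice! could be combined without any regular expression. The helper name strip_breaks and the default token are assumptions made for this sketch.

```ruby
# strip_breaks is a hypothetical helper; token is the string to remove
# from both ends (it must not be empty).
def strip_breaks(text, token = "<br />")
  result = text.strip  # work on a copy so the caller's string is untouched

  # chomp!(suffix) returns nil when there was nothing to remove,
  # which ends the loop.
  while result.chomp!(token)
    result.rstrip!
  end

  # slice!(0, n) deletes the first n characters in place.
  while result.start_with?(token)
    result.slice!(0, token.length)
    result.lstrip!
  end

  result
end

stripped = strip_breaks(" <br /> <br /><br /> foo bar <br /> <br /> ")
p stripped
```

Since the token is passed in, the same helper works for any other string that needs trimming from both ends.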
def remove_breaks(text)
  text.gsub(%r{^\s*<br />|<br />\s*$}, '')
end
%r{...} is another way to specify a regular expression. The advantage of %r is that you can pick your own delimiter. Using {} for the delimiters means not having to escape the /'s.
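Use gsub instead (Ruby's String#replace takes a single argument and replaces the entire string). Note that this removes every <br />, not only the leading and trailing ones:
str.gsub("<br />", "")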