I want to scrape data off a website. The data is in the text of a span.
The HTML looks like this:
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1564808">1,564,808</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="107,928,762">$107.93M</span>
</p>
I want to search the whole page and get the votes value, 1,564,808 (data-value="1564808"), not the $107.93M gross value.
I tried various ways to get the data, for instance:
#votes = []
html_content = open("https://www.imdb.com/list/ls057823854/sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.css(".text-muted['span name=nv']").each do |i|
  #votes << i.text.strip
end
Try this code:
doc.css('div.lister-item-content > p.text-muted > span[name = nv]:nth-child(2)').map(&:text)
Which results in:
["1,564,941", "373,745", "2,004,624", "1,077,404", "887,189", "305,554", "207,904", "1,074,609", "748,393", "789,255", "1,224,753", "754,008", "634,752", "1,056,328", "1,604,158", "1,438,194", "629,504", "1,158,452", "517,609", "539,263", "1,443,979", "1,290,159", "161,981", "830,992", "1,427,193", "299,532", "289,184", "705,138", "615,264", "1,147,650", "1,030,826", "1,018,932", "921,730", "524,568", "557,482", "1,973,773", "813,743", "367,587", "342,800", "188,210", "649,467", "1,068,455", "547,990", "527,123", "805,964", "420,447", "441,780", "318,295", "1,004,742", "446,096", "203,977", "581,108", "1,754,019", "616,804", "484,534", "265,048", "958,244", "289,190", "651,605", "503,185", "320,564", "660,685", "476,016", "432,155", "588,572", "374,705", "378,561", "337,801", "463,467", "508,822", "187,810", "1,128,184", "221,361", "261,529", "322,314", "324,435", "116,258", "318,628", "1,334,595", "222,651", "1,155,754", "228,713", "205,956", "271,162", "293,774", "33,136", "80,385", "703,048", "195,712", "274,244", "233,133", "121,874", "208,462", "513,797", "485,112", "120,750", "135,232", "57,411", "125,431", "297,193"]
I'm trying to parse some text and perform actions based on its tags.
The text is:
<window>
<caption>My window
</window>
<panel>
<label>
<caption>
<position>50,50
<color>255,255,255
</label>
</panel>
Code:
function parse_tag(chunck)
  for start_tag,tag_name in string.gfind(chunck,"(<(.-)>)") do
    if (child_obj[tag_name]) then
      print(start_tag)
      for data,end_tag in string.gfind(chunck,"<" .. tag_name ..">(.-)(</" .. tag_name ..">)") do
        for object_prop,value in string.gfind(data,"<(.-)>(.-)") do
          print("setting property = \"" .. object_prop .. "\", value of" .. value);
        end
      end
      print("</" .. tag_name ..">");
    elseif(findInArray(main_obj,tag_name)) then
      print("Invalid data");
      stop();
    end
  end
end

for key,tag in ipairs(main_obj) do
  for start_tag,tag_name,chunck,end_tag in string.gfind(data,"(<(" .. tag.name .. ")>)(.-)(</" .. tag.name .. ">)") do --> searching for window/panel start and end tags
    if (findInArray(main_obj,tag_name)) then
      print(start_tag)
      parse_tag(chunck); --> parses the tag with child tag
      print(end_tag)
    end
  end
end
It seems to fail getting the value, as I get the following output:
<window>
</window>
<panel>
<label>
setting property = "caption", value of
setting property = "position", value of
setting property = "color", value of
</label>
</panel>
How can I match the string after the first <%tag%> up to the next <%tag%> or the end of the chunk?
string.gfind(data,"<(.-)>(.-)")
Here, you try to match the value with .-. However, - is lazy, i.e., .- matches as little as possible. Since nothing follows it in the pattern, the shortest possible match is the empty string.
Try telling it to match until the next <:
string.gfind(data,"<(.-)>(.-)<")
I tried different kinds of captures. This one seems to work:
string.gfind(data,"<(.-)>([^%<+.-%>+]+)")
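The difference between a lazy capture and an "everything except <" capture is not specific to Lua. As a rough analogue only, here is a sketch using Ruby regular expressions rather than Lua patterns: `.*?` stands in for Lua's `.-`, and the sample string is assumed from the label block in the question.

```ruby
# Property lines from the question's <label> block (assumed sample data).
data = "<caption>\n<position>50,50\n<color>255,255,255\n"

# Lazy value capture: `.*?` is satisfied by the empty string,
# so every value comes back empty, as in the failing output.
lazy = data.scan(/<(.*?)>(.*?)/m)

# Negated character class: consume everything up to the next `<`
# or the end of the input, then trim the trailing newline.
until_next_tag = data.scan(/<(.*?)>([^<]*)/).map { |name, value| [name, value.strip] }

puts lazy.inspect
puts until_next_tag.inspect
```

The negated class does not require a following `<`, so it also captures the last property in the chunk.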