How to split value from a string in ruby - ruby-on-rails

My example string is listed here. I want to split every value into an array or hash so that I can process each element.
<div id="test">
accno: 123232323 <br>
id: 5443534534534 <br>
name: test_name <br>
url: www.google.com <br>
</div>
How can I fetch each value into a hash or array?

With regex it's easy:
s = '<div id="test">
accno: 123232323 <br>
id: 5443534534534 <br>
name: test_name <br>
url: www.google.com <br>
</div>'
p s.scan(/\s+(.*?)\:\s+(.*?)<br>/).map.with_object({}) { |i, h| h[i[0].to_sym] = i[1].strip }
Or you can restrict the keys (accno, id, name, url) with ([a-z]+) if they contain only lowercase letters:
p s.scan(/\s+([a-z]+)\:\s+(.*?)<br>/).map.with_object({}) { |i, h| h[i[0].to_sym] = i[1].strip }
Result:
{:accno=>"123232323", :id=>"5443534534534", :name=>"test_name", :url=>"www.google.com"}
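A regex is not the only option. As a minimal sketch, assuming every field ends with <br> and the first colon separates key from value, the same hash can be built with split. The variable names are illustrative only, and filter_map needs Ruby 2.7 or newer.

```ruby
s = <<~HTML
  <div id="test">
  accno: 123232323 <br>
  id: 5443534534534 <br>
  name: test_name <br>
  url: www.google.com <br>
  </div>
HTML

pairs = s.split(/<br>/).filter_map do |chunk|
  # Drop any remaining tags, then surrounding whitespace.
  line = chunk.gsub(/<[^>]+>/, "").strip
  next if line.empty?

  # Limit of 2 keeps any later colons inside the value.
  key, value = line.split(":", 2)
  [key.strip.to_sym, value.strip]
end

result = pairs.to_h
p result
```

Splitting with a limit of 2 means a value that itself contains a colon is not cut short.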
Update
If the input looks like this:
<div id="test"> accno: 123232323 id: 5443534534534 name: test_name url: www.google.com </div>
the regex becomes:
/([a-z]+)\:\s*(.*?)\s+/
([a-z]+) is the hash key. If keys can also contain - or _, add them to the character class: ([a-z_-]+). This pattern assumes that each key is followed by : (optionally followed by whitespace) and then by text that runs up to the next whitespace. If the last value can be followed directly by a tag, as in url: www.google.com</div>, end the pattern with (\s+|<) instead of \s+.
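As a quick check of the single-line variant, here is a sketch that combines the wider key class with the alternative ending. The sample string (with no space before </div>) and the names re and record are assumptions for illustration; the non-capturing group (?:...) is used so that scan still returns two captures per match, and Array#to_h with a block needs Ruby 2.6 or newer.

```ruby
s = '<div id="test"> accno: 123232323 id: 5443534534534 name: test_name url: www.google.com</div>'

# key: lowercase letters, "_" or "-"; value: everything up to whitespace or "<"
re = /([a-z_-]+):\s*(.*?)(?:\s+|<)/

record = s.scan(re).to_h { |key, value| [key.to_sym, value] }
p record
```

Note that this pattern would misread a value such as http://example.com, because http: looks like a key.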

If you are processing HTML, use an HTML/XML parser like Nokogiri to pull out the text content of the required <div> tag using a CSS selector. Then parse the text into fields.
To install nokogiri:
gem install nokogiri
Then process the page and text:
require "nokogiri"
require "open-uri"
# re matches: spaces (word) colon spaces (anything) space
re_fields = /\s+(?<field>\w+):\s+(?<data>.*?)\s/
# Somewhere to store the results
record = {}
# URI.open comes from open-uri; plain open no longer accepts URLs since Ruby 3.0
page = Nokogiri::HTML( URI.open("http://example.com/divtest.html") )
# Select the text from <div id=test> and scan into fields with the regex
page.css( "div#test" ).text.scan( re_fields ){ |field, data|
record[ field ] = data
}
p record
Results in:
{"accno"=>"123232323", "id"=>"5443534534534", "name"=>"test_name", "url"=>"www.google.com"}
The result of page.css( "blah" ) can also be treated as an array if you are processing multiple elements, and looped through with .each:
# Somewhere to store the results
records = []
# Select the text from <div id=test> and scan into fields with the regex
page.css( "div#test" ).each{ |div|
record = {}
div.text.scan( re_fields ){ |field, data|
record[field] = data
}
records.push record
}
p records

Related

How to scrape a span name in Nokogiri in Ruby?

I want to scrape data off a website. The data is in the text of a span.
The HTML looks like this:
<p class="text-muted text-small">
<span class="text-muted">Votes:</span>
<span name="nv" data-value="1564808">1,564,808</span>
<span class="ghost">|</span>
<span class="text-muted">Gross:</span>
<span name="nv" data-value="107,928,762">$107.93M</span>
</p>
I want to search the whole page and get the data-value of the votes span, which is 1,564,808, not the 107.93M value.
I tried various ways to get the data, for instance:
@votes = []
html_content = open("https://www.imdb.com/list/ls057823854/sort=list_order,asc&st_dt=&mode=detail&page=1").read
doc = Nokogiri::HTML(html_content)
doc.css(".text-muted['span name=nv']").each do |i|
  @votes << i.text.strip
end
Try this code:
doc.css('div.lister-item-content > p.text-muted > span[name = nv]:nth-child(2)').map(&:text)
Which results in:
["1,564,941", "373,745", "2,004,624", "1,077,404", "887,189", "305,554", "207,904", "1,074,609", "748,393", "789,255", "1,224,753", "754,008", "634,752", "1,056,328", "1,604,158", "1,438,194", "629,504", "1,158,452", "517,609", "539,263", "1,443,979", "1,290,159", "161,981", "830,992", "1,427,193", "299,532", "289,184", "705,138", "615,264", "1,147,650", "1,030,826", "1,018,932", "921,730", "524,568", "557,482", "1,973,773", "813,743", "367,587", "342,800", "188,210", "649,467", "1,068,455", "547,990", "527,123", "805,964", "420,447", "441,780", "318,295", "1,004,742", "446,096", "203,977", "581,108", "1,754,019", "616,804", "484,534", "265,048", "958,244", "289,190", "651,605", "503,185", "320,564", "660,685", "476,016", "432,155", "588,572", "374,705", "378,561", "337,801", "463,467", "508,822", "187,810", "1,128,184", "221,361", "261,529", "322,314", "324,435", "116,258", "318,628", "1,334,595", "222,651", "1,155,754", "228,713", "205,956", "271,162", "293,774", "33,136", "80,385", "703,048", "195,712", "274,244", "233,133", "121,874", "208,462", "513,797", "485,112", "120,750", "135,232", "57,411", "125,431", "297,193"]

Can't get rid of some characters when pushing string to array

I'm creating some kind of custom tags that I'll use later to filter some data. However, when I add the tags to an array, I get the following:
"[\"witcher 3\", \"badass\", \"epic\"]"
@tags = []
params[:tags].split(', ').map do |tag|
  @tags.push(tag.strip)
end
# About 5 lines under
FileDetail.create!(path: path, creation_date: date, tags: @tags)
Why do these \ show up, and why doesn't .strip work?
Thank you in advance
You are storing an array of strings in @tags, and \" is an escaped character, in this case the " that Ruby uses to delimit String objects when it prints them.
Consider the following code (and try it in IRB):
foo = ["bar", "baz"]
#=> ["bar", "baz"]
foo.inspect
#=> "[\"bar\", \"baz\"]"
foo.each { |f| puts "tag: #{f}" }
# tag: bar
# tag: baz
As you can see, there is really no \ character to strip from the string; it's just how Ruby outputs a String representation. So your code doesn't need the .strip method:
@tags = []
params[:tags].split(', ').map do |tag|
  @tags.push(tag)
end
Not related to your question, but still relevant: the split method returns an array, so there is no need to create one beforehand and then push items to it; just assign the returned array to @tags.
For example:
params[:tags] = "witcher 3, badass, epic"
#=> "witcher 3, badass, epic"
@tags = params[:tags].split(', ')
#=> ["witcher 3", "badass", "epic"]
If you want, you can still use map and strip to remove leading and trailing spaces:
params[:tags] = "witcher 3, badass , epic "
#=> "witcher 3, badass , epic "
params[:tags].split(",").map(&:strip)
#=> ["witcher 3", "badass", "epic"]
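To see where the backslashes come from, here is a small sketch using only core String and Array behaviour. The names raw, tags and as_text are made up for illustration, and how FileDetail actually stores tags is not shown in the original post, so that part is left out.

```ruby
raw = " witcher 3, badass , epic "

# Split on commas, trim each piece, and drop empty leftovers.
tags = raw.split(",").map(&:strip).reject(&:empty?)

# Array#to_s produces the same text as Array#inspect, quotes included.
as_text = tags.to_s

puts as_text # prints the string itself, which contains no backslashes
p as_text    # prints the inspected form, where the inner quotes are escaped
```

The backslashes only appear when a string that already contains quotes is inspected again, which suggests the array was turned into a single string somewhere along the way.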

How can I read a file in Ruby on Rails

I'm new to Rails and I am trying to read a .txt file that looks like this:
ThomasLinde ; PeterParker ; Monday
JulkoAndrovic ; KeludowigFrau ; Tuesday
JohannesWoellenstein ; SiegmundoKrugmando ; Wednesday
Now I want to read each "column" of the .txt file and display it on a page of my application.
My idea for the code looks like this:
if File.exist?("Zuordnung_x.txt")
  fi = File.open("Zuordnung_x.txt", "r")
  fi.each { |line|
    sa = line.split(";")
    @nanny_name = sa[0]
    @customer_name = sa[1]
    @period_name = sa[2]
  }
  fi.close
else
  @nanny_name = nil
  @customer_name = nil
  @period_name = nil
  flash.now[:not_available] = "Nothing happened!"
end
That is my idea, but it gives me only one line. Any ideas? Or am I only able to read one line if I use @nanny_name?
You only need one variable that holds an array, and push every line to it:
@result = []
if File.exist?("Zuordnung_x.txt")
  fi = File.open("Zuordnung_x.txt", "r")
  fi.each do |line|
    sa = line.split(";")
    @result << { nanny_name: sa[0], customer_name: sa[1], period_name: sa[2] }
  end
  fi.close
else
  flash.now[:not_available] = "Nothing happened!"
end
In the view template, iterate over @result, for example:
<% @result.each do |row| %>
<p><%= "#{row[:nanny_name]} serves the customer #{row[:customer_name]} on #{row[:period_name]}" %></p>
<% end %>
Optional:
If you just use split, you will probably get strings with whitespace at the beginning or end:
"ThomasLinde ; PeterParker ; Monday".split(';')
=> ["ThomasLinde ", " PeterParker ", " Monday"]
To handle it, strip every value of the array like this:
"ThomasLinde ; PeterParker ; Monday".split(';').map(&:strip)
=> ["ThomasLinde", "PeterParker", "Monday"]
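Putting the pieces together, the reading and stripping could be wrapped in one helper. This is only a sketch: read_assignments is a hypothetical name, the temporary file stands in for Zuordnung_x.txt, and filter_map needs Ruby 2.7 or newer.

```ruby
require "tempfile"

# Hypothetical helper: returns one hash per non-empty line of the file.
def read_assignments(path)
  return [] unless File.exist?(path)

  File.foreach(path).filter_map do |line|
    nanny, customer, period = line.split(";").map(&:strip)
    next if nanny.nil? || nanny.empty? # skip blank lines

    { nanny_name: nanny, customer_name: customer, period_name: period }
  end
end

# Demo with a temporary file standing in for Zuordnung_x.txt.
rows = Tempfile.create("zuordnung") do |file|
  file.write("ThomasLinde ; PeterParker ; Monday\nJulkoAndrovic ; KeludowigFrau ; Tuesday\n\n")
  file.flush
  read_assignments(file.path)
end

rows.each do |row|
  puts "#{row[:nanny_name]} serves #{row[:customer_name]} on #{row[:period_name]}"
end
```

File.foreach closes the file on its own, so no explicit close call is needed.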

Nokogiri parsing missing element create issue

I have a plain HTML document with no CSS, and I need to export some of its content to an Excel sheet. I tried Nokogiri, but it works on a CSS basis.
Has anybody tried this?
<html>
<head></head>
<body>
***NOTE***
<br>
Items
<br>
<br>
Invoice Number : [78945824] PO Number : [4587958]
<br>
Track It : 12345
<br>
<br>
Items
<br>
<br>
Invoice Number : [79546828] PO Number : [4567892]
<br>
<br>
<br>
Items
<br>
<br>
Invoice Number : [78976824] PO Number : [897569]
<br>
Track It : 12345
<br>
</body>
</html>
I am able to retrieve the PO numbers and the tracking numbers:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "a.html"
page = Nokogiri::HTML(open(PAGE_URL))
data = page.css("body").text
po_numbers = data.scan(/Invoice Number : \[\d+\] PO Number : \[(\d+)\]/).flatten
tracking_numbers = page.css("a").text.split
[["PO Number", "Tracking Number"]].concat(po_numbers.zip(tracking_numbers))
puts po_numbers
puts tracking_numbers
=> po_numbers = ["4587958", "4567892", "4587958"]
=> tracking_numbers = ["12543", "12356"]
When we zip those together, we get:
=> po_numbers.zip(tracking_numbers)
=> [["4587958", "12543"], ["4567892", "12356"], ["4587958", "nil"]]
What we want is:
=> [["4587958", "12543"], ["4567892", "nil"], ["4587958", "12356"] ]
Try this:
data = page.css("body").text
data = data.gsub(" ","").split(/\n/)
po = []
track = []
data.each do |i|
  if i.include? "PONumber"
    po << i.split("PONumber:").last.scan(/\d+/)[0]
  end
  if i.include? "TrackIt"
    track << i.split("TrackIt:").last
  end
end
po.zip(track)
If you can use a regex to scan for all invoice numbers (po_numbers), you can do the same for the tracking numbers (tracking_numbers):
tracking_numbers = data.scan(/Tracking no : (\d*)/).flatten
Provided the returned array has an entry (possibly empty or nil) for every item, you can walk through both arrays by index:
po_numbers.each_with_index do |elm, index|
p "PO Number: #{elm}, Tracking Number: #{tracking_numbers[index]}"
end
Update
This regex matches the updated HTML:
/Track It :\s*(?:<a href=".*">\s*(\d+)\s*<\/a>|$)/
It matches both an empty tracking number and one with a link.
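Another way to keep each PO number paired with its own tracking number is to cut the text into one block per "Items" heading first. The sketch below assumes text holds roughly what page.css("body").text would return for the original plain HTML (without links); the variable names are illustrative and filter_map needs Ruby 2.7 or newer.

```ruby
text = <<~TEXT
  Items
  Invoice Number : [78945824] PO Number : [4587958]
  Track It : 12345
  Items
  Invoice Number : [79546828] PO Number : [4567892]
  Items
  Invoice Number : [78976824] PO Number : [897569]
  Track It : 12345
TEXT

rows = text.split(/^\s*Items\s*$/).filter_map do |block|
  po = block[/PO Number : \[(\d+)\]/, 1]
  next unless po # skip anything before the first "Items" line

  # String#[] with a regex returns nil when the block has no tracking line.
  [po, block[/Track It : (\d+)/, 1]]
end

p rows
```

Because each block is searched separately, a missing tracking number becomes nil in the right position instead of shifting the later values.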

Removing a pattern from the beginning and end of a string in ruby

So I found myself needing to remove <br /> tags from the beginning and end of strings in a project I'm working on. I made a quick little method that does what I need it to do but I'm not convinced it's the best way to go about doing this sort of thing. I suspect there's probably a handy regular expression I can use to do it in only a couple of lines. Here's what I got:
def remove_breaks(text)
  if text != nil and text != ""
    text.strip!
    index = text.rindex("<br />")
    while index != nil and index == text.length - 6
      text = text[0, text.length - 6]
      text.strip!
      index = text.rindex("<br />")
    end
    text.strip!
    index = text.index("<br />")
    while index != nil and index == 0
      text = text[6, text.length]
      text.strip!
      index = text.index("<br />")
    end
  end
  return text
end
Now the "<br />" could really be anything, and it'd probably be more useful to make a general use function that takes as an argument the string that needs to be stripped from the beginning and end.
I'm open to any suggestions on how to make this cleaner because this just seems like it can be improved.
gsub can take a regular expression:
text.gsub!(/(<br \/>\s*)*$/, '')
text.gsub!(/^(\s*<br \/>)*/, '')
text.strip!
class String
def strip_this!(t)
# Removes leading and trailing occurrences of t
# from the string, plus surrounding whitespace.
t = Regexp.escape(t)
sub!(/^(\s* #{t} \s*)+ /x, '')
sub!(/ (\s* #{t} \s*)+ $/x, '')
end
end
# For example.
str = ' <br /> <br /><br /> foo bar <br /> <br /> '
str.strip_this!('<br />')
p str # => 'foo bar'
You can use chomp! and slice! methods. See:
http://ruby-doc.org/core-1.8.7/String.html
def remove_breaks(text)
  text.gsub(%r{^\s*<br />|<br />\s*$}, '')
end
%r{...} is another way to specify a regular expression. The advantage of %r is that you can pick your own delimiter. Using {} for the delimiters means not having to escape the /'s.
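String#replace takes a single argument and swaps out the whole string, so use gsub instead (note that this removes every occurrence, not only those at the ends):
str.gsub("<br/>", "")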
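For the general-purpose version the question asks about, a non-mutating helper could look like the sketch below. strip_edges is a hypothetical name; it uses \A and \z so that only the very start and end of the string are touched, even when the text spans several lines.

```ruby
# Hypothetical helper: removes repeated occurrences of token (plus
# surrounding whitespace) from both ends of text, leaving the middle alone.
def strip_edges(text, token)
  return text if text.nil? || text.empty?

  t = Regexp.escape(token)
  text.sub(/\A(?:\s*#{t})+\s*/, "")
      .sub(/\s*(?:#{t}\s*)+\z/, "")
      .strip
end

cleaned = strip_edges(" <br /> <br /><br /> foo bar <br /> <br /> ", "<br />")
inner   = strip_edges("<br />a<br />b<br />", "<br />")

p cleaned
p inner
```

Returning a new string rather than modifying the argument keeps the helper safe to call on frozen strings.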
