How to fix slow Nokogiri parsing - ruby-on-rails

I have a Rake task in my Rails app which looks into a folder for an XML file, parses it, and saves it to a database. The code works OK, but I have about 2100 files totaling 1.5GB, and processing is very slow: about 400 files in 7 hours. Each XML file contains approximately 600-650 contracts, and each contract can have 0 to n attachments. I did not paste all the values, but each contract has 25 of them.
To speed up the process I use the activerecord-import gem, so I build an array per file and, once the whole file is parsed, do a mass import to Postgres. Only if a record is already found is it updated directly and/or a new attachment inserted, but that happens for maybe 1 out of 100000 records. This helps a little compared with creating a new record per contract, but now I see that the slow part is the XML parsing. Can you please check whether I am doing something wrong in my parsing?
When I tried to print the arrays I am building, the slow part lasted until the whole file was loaded/parsed, and only then did it start printing array by array. That's why I assume the speed problem is in parsing, since Nokogiri loads the whole XML before it starts.
require 'nokogiri'
require 'pp'
require "activerecord-import/base"
ActiveRecord::Import.require_adapter('postgresql')

namespace :loadcrz2 do
  desc "this task load contracts from crz xml files to DB"
  task contracts: :environment do
    actual_dir = File.dirname(__FILE__).to_s
    Dir.foreach(actual_dir+'/../../crzfiles') do |xmlfile|
      next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
      page = Nokogiri::XML(open(actual_dir+"/../../crzfiles/"+xmlfile))
      puts xmlfile
      cons = page.xpath('//contracts/*')
      contractsarr = []
      #c =[]
      cons.each do |contract|
        name = contract.xpath("name").text
        crzid = contract.xpath("ID").text
        procname = contract.xpath("procname").text
        conname = contract.xpath("contractorname").text
        subject = contract.xpath("subject").text
        dateeff = contract.xpath("dateefficient").text
        valuecontract = contract.xpath("value").text
        attachments = contract.xpath('attachments/*')
        attacharray = []
        attachments.each do |attachment|
          attachid = attachment.xpath("ID").text
          attachname = attachment.xpath("name").text
          doc = attachment.xpath("document").text
          size = attachment.xpath("size").text
          arr = [attachid, attachname, doc, size]
          attacharray.push arr
        end
        @con = Crzcontract.find_by_crzid(crzid)
        if @con.nil?
          @c = Crzcontract.new(:crzname => name, :crzid => crzid, :crzprocname => procname, :crzconname => conname, :crzsubject => subject, :dateeff => dateeff, :valuecontract => valuecontract)
        else
          @con.crzname = name
          @con.crzid = crzid
          @con.crzprocname = procname
          @con.crzconname = conname
          @con.crzsubject = subject
          @con.dateeff = dateeff
          @con.valuecontract = valuecontract
          @con.save!
        end
        attacharray.each do |attar|
          attachid = attar[0]
          attachname = attar[1]
          doc = attar[2]
          size = attar[3]
          @at = Crzattachment.find_by_attachid(attachid)
          if @at.nil?
            if @con.nil?
              @c.crzattachments.build(:attachid => attachid, :attachname => attachname, :doc => doc, :size => size)
            else
              @a = Crzattachment.new
              @a.attachid = attachid
              @a.attachname = attachname
              @a.doc = doc
              @a.size = size
              @a.crzcontract_id = @con.id
              @a.save!
            end
          end
        end
        if @c.present?
          contractsarr.push @c
        end
        #p @c
      end
      #p contractsarr
      puts "done"
      if contractsarr.present?
        Crzcontract.import contractsarr, recursive: true
      end
      FileUtils.mv(actual_dir+"/../../crzfiles/"+xmlfile, actual_dir+"/../../crzfiles/archive/"+xmlfile)
    end
  end
end

There are a number of problems with the code. Here are some ways to improve it:
actual_dir = File.dirname(__FILE__).to_s
Don't use to_s; dirname already returns a string.
actual_dir+'/../../crzfiles', with and without a trailing path delimiter, is used repeatedly. Don't make Ruby rebuild the concatenated string over and over; instead define it once, and take advantage of Ruby's ability to build the full path:
File.absolute_path('../../bar', '/path/to/foo') # => "/path/bar"
So use:
actual_dir = File.absolute_path('../../crzfiles', __FILE__)
and then refer to actual_dir only:
Dir.foreach(actual_dir)
This is unwieldy:
next if xmlfile == '.' or xmlfile == '..' or xmlfile == 'archive'
I'd do:
next if (xmlfile[0] == '.' || xmlfile == 'archive')
or even:
next if xmlfile[/^(?:\.|archive)/]
Compare these:
'.hidden'[/^(?:\.|archive)/] # => "."
'.'[/^(?:\.|archive)/] # => "."
'..'[/^(?:\.|archive)/] # => "."
'archive'[/^(?:\.|archive)/] # => "archive"
'notarchive'[/^(?:\.|archive)/] # => nil
'foo.xml'[/^(?:\.|archive)/] # => nil
The pattern will return a truthy value if it starts with '.' or is equal to 'archive'. It's not as readable but it's compact. I'd recommend the compound conditional test though.
In some places, you're concatenating xmlfile, so again let Ruby do it once:
xml_filepath = File.join(actual_dir, xmlfile)
which will honor the file path delimiter for whatever OS you're running on. Then use xml_filepath instead of concatenating the name:
xml_filepath = File.join(actual_dir, xmlfile)
page = Nokogiri::XML(open(xml_filepath))
[...]
FileUtils.mv(xml_filepath, File.join(actual_dir, "archive", xmlfile))
join is a good tool so take advantage of it. It's not just another name for concatenating strings, because it's also aware of the correct delimiter to use for the OS the code is running on.
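A quick illustration of the difference (the paths here are made up):

```ruby
# File.join inserts the separator between segments and collapses any
# duplicates, so trailing slashes on the pieces don't matter.
base = '/tmp/crzfiles/'                   # trailing slash on purpose
puts File.join(base, 'archive', 'a.xml')  # => "/tmp/crzfiles/archive/a.xml"
```

With plain + concatenation you would have had to track that trailing slash yourself.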
You use a lot of instances of:
xpath("some_selector").text
Don't do that. xpath, along with css and search return a NodeSet, and text when used on a NodeSet can be evil in a way that'll hurtle you down a very steep and slippery slope. Consider this:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<root>
  <node>
    <data>foo</data>
  </node>
  <node>
    <data>bar</data>
  </node>
</root>
EOT
doc.search('//node/data').class # => Nokogiri::XML::NodeSet
doc.search('//node/data').text # => "foobar"
The concatenation of the text into 'foobar' can't be split easily and it's a problem we see here in questions too often.
Do this if you expect getting a NodeSet back because of using search, xpath or css:
doc.search('//node/data').map(&:text) # => ["foo", "bar"]
It's better to use at, at_xpath or at_css if you're after a specific node because then text will work as you'd expect.
See "How to avoid joining all text from Nodes when scraping" also.
There's a lot of replication that could be DRY'd. Instead of this:
name = contract.xpath("name").text
crzid = contract.xpath("ID").text
procname = contract.xpath("procname").text
You could do something like:
name, crzid, procname = [
  'name', 'ID', 'procname'
].map { |s| contract.at(s).text }

Related

CSV::MalformedCSVError: Unquoted fields do not allow \r or \n in Ruby

I have CSVs which I am trying to import into my Oracle database, but unfortunately I keep getting the same error:
> CSV::MalformedCSVError: Unquoted fields do not allow \r or \n (line 1).
I know there are tons of similar questions which have been asked but none relate specifically to my issue other than this one, but unfortunately it didn't help.
To explain my scenario:
I have CSVs in which the rows don't always end with a value, but rather just with a comma, because the last value is null and hence stays blank.
I would like to import the CSVs regardless of whether they end with a comma or not.
Here are the first 5 lines of my CSV, with values changed for privacy reasons:
id,customer_id,provider_id,name,username,password,salt,email,description,blocked,created_at,updated_at,deleted_at
1,,1,Default Administrator,admin,1,1," ",Initial default user.,f,2019-10-04 14:28:38.492000,2019-10-04 14:29:34.224000,
2,,2,Default Administrator,admin,2,1,,Initial default user.,,2019-10-04 14:28:38.633000,2019-10-04 14:28:38.633000,
3,1,,Default Administrator,admin,3,1," ",Initial default user.,f,2019-10-04 14:41:38.030000,2019-11-27 10:23:03.329000,
4,1,,admin,admin,4,1," ",,,2019-10-28 12:21:23.338000,2019-10-28 12:21:23.338000,
5,2,,Default Administrator,admin,5,1," ",Initial default user.,f,2019-11-12 09:00:49.430000,2020-02-04 08:20:06.601000,2020-02-04 08:20:06.601000
As you can see the ending is sometimes with or without a comma and this structure repeats quite often.
This is my code with which I have been playing around with:
def csv_replace_empty_string
  Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
    next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename == 'import_csv.rb'
    read_file = File.read(Rails.root.join('db', 'csv_export', filename))
    replace_empty_string = read_file.gsub(/(?<![^,])""(?![^,])/, '" "')
    format_csv = replace_empty_string.gsub(/\r\r?\n?/, "\n")
    # format_csv = remove_empty_lines.sub!(/(?:\r?\n)+\z/, "")
    File.open(Rails.root.join('db', 'csv_export', filename), "w") { |file| file.puts format_csv }
  end
end
I have tried using many different kinds of gsubs found online in similar forums, but it didn't help.
Here is my function for importing the CSV in the db:
def import_csv_into_db
  Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
    next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename == 'import_csv.rb'
    filename_renamed = File.basename(filename, File.extname(filename)).classify
    CSV.foreach(Rails.root.join('db', 'csv_export', filename), :headers => true, :skip_blanks => true) do |row|
      class_name = filename_renamed.constantize
      class_name.create!(row.to_hash)
      puts "Insert on table #{filename_renamed} complete"
    end
  end
end
I have also tried the options provided by CSV such as :row_sep => :"\n" or :row_sep => "\r" but keep on getting the same error.
I am pretty sure I have some sort of thinking error, but I can't seem to figure it out.
I fixed the issue by using the following:
format_csv = replace_empty_string.gsub(/\r\r?\n?/, "\n")
This was originally @mgrims's answer, but I had to adjust my code by further removing the :skip_blanks and :row_sep options.
It is importing successfully now!
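For anyone hitting the same error, the effect of that gsub can be demonstrated on a small string (the data here is made up):

```ruby
require 'csv'

# Stray carriage returns ("\r\r\n") are what trip CSV's default row
# separator; normalizing every line ending to "\n" makes the data parseable.
raw   = "id,name\r\r\n1,admin\r\r\n2,\r\n"
clean = raw.gsub(/\r\r?\n?/, "\n")
rows  = CSV.parse(clean, headers: true)
rows.map { |r| r['id'] }  # => ["1", "2"]
```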

Move data into 3 Separate Hashes inside loop in ruby

It's only my second post and I'm still learning Ruby.
I'm trying to figure this out based on my Java knowledge, but I can't seem to get it right.
What I need to do is:
I have a function that reads a file line by line and extract different car features from each line, for example:
def convertListings2Catalogue(fileName)
  f = File.open(fileName, "r")
  f.each_line do |line|
    km = line[/[0-9]+km/]
    t = line[(Regexp.union(/sedan/i, /coupe/i, /hatchback/i, /station/i, /suv/i))]
    trans = ....
  end
end
Now for each line I need to store the extracted features into separate
hashes that I can access later in my program.
The issues I'm facing:
1) I'm overwriting the features in the same hash
2) Can't access the hash outside my function
That what's in my file:
65101km,Sedan,Manual,2010,18131A,FWD,Used,5.5L/100km,Toyota,camry,SE,{AC,
Heated Seats, Heated Mirrors, Keyless Entry}
coupe,1100km,auto,RWD, Mercedec,CLK,LX ,18FO724A,2017,{AC, Heated
Seats, Heated Mirrors, Keyless Entry, Power seats},6L/100km,Used
AWD,SUV,0km,auto,new,Honda,CRV,8L/100km,{Heated Seats, Heated Mirrors,
Keyless Entry},19BF723A,2018,LE
Now my function extracts the features of each car model, but I need to store these features in 3 different hashes with the same keys but different values.
listing = Hash.new(0)
listing = { kilometers: km, type: t, transmission: trans, drivetrain: dt, status: status, car_maker: car_maker }
I tried moving the data from one hash to another, I even tried storing the data in an array first and then moving it to the hash but I still can't figure out how to create separate hashes inside a loop.
Thanks
I don't fully understand the question but I thought it was important to suggest how you might deal with a more fundamental issue: extracting the desired information from each line of the file in an effective and Ruby-like manner. Once you have that information, in the form of an array of hashes, one hash per line, you can do with it what you want. Alternatively, you could loop through the lines in the file, constructing a hash for each line and performing any desired operations before going on to the next line.
Being new to Ruby you will undoubtedly find some of the code below difficult to understand. If you persevere, however, I think you will be able to understand all of it, and in the process learn a lot about Ruby. I've made some suggestions in the last section of my answer to help you decipher the code.
Code
words_by_key = {
  type: %w| sedan coupe hatchback station suv |,
  transmission: %w| auto manual steptronic |,
  drivetrain: %w| fwd rwd awd |,
  status: %w| used new |,
  car_maker: %w| honda toyota mercedes bmw lexus |,
  model: %w| camry clk crv |
}
#=> {:type=>["sedan", "coupe", "hatchback", "station", "suv"],
# :transmission=>["auto", "manual", "steptronic"],
# :drivetrain=>["fwd", "rwd", "awd"],
# :status=>["used", "new"],
# :car_maker=>["honda", "toyota", "mercedes", "bmw", "lexus"],
# :model=>["camry", "clk", "crv"]}
WORDS_TO_KEYS = words_by_key.each_with_object({}) { |(k,v),h| v.each { |s| h[s] = k } }
#=> {"sedan"=>:type, "coupe"=>:type, "hatchback"=>:type, "station"=>:type, "suv"=>:type,
# "auto"=>:transmission, "manual"=>:transmission, "steptronic"=>:transmission,
# "fwd"=>:drivetrain, "rwd"=>:drivetrain, "awd"=>:drivetrain,
# "used"=>:status, "new"=>:status,
# "honda"=>:car_maker, "toyota"=>:car_maker, "mercedes"=>:car_maker,
# "bmw"=>:car_maker, "lexus"=>:car_maker,
# "camry"=>:model, "clk"=>:model, "crv"=>:model}
module ExtractionMethods
  def km(str)
    str[/\A\d+(?=km\z)/]
  end

  def year(str)
    str[/\A\d{4}\z/]
  end

  def stock(str)
    return nil if str.end_with?('km')
    str[/\A\d+\p{Alpha}\p{Alnum}*\z/]
  end

  def trim(str)
    str[/\A\p{Alpha}{2}\z/]
  end

  def fuel_consumption(str)
    str.to_f if str[/\A\d+(?:\.\d+)?(?=l\/100km\z)/]
  end
end
class K
  include ExtractionMethods

  def extract_hashes(fname)
    File.foreach(fname).with_object([]) do |line, arr|
      line = line.downcase
      idx_left = line.index('{')
      idx_right = line.index('}')
      if idx_left && idx_right
        g = { set_of_features: line[idx_left..idx_right] }
        line[idx_left..idx_right] = ''
        line.squeeze!(',')
      else
        g = {}
      end
      arr << line.split(',').each_with_object(g) do |word, h|
        word.strip!
        if WORDS_TO_KEYS.key?(word)
          h[WORDS_TO_KEYS[word]] = word
        else
          ExtractionMethods.instance_methods.find do |m|
            v = public_send(m, word)
            (h[m] = v) unless v.nil?
            v
          end
        end
      end
    end
  end
end
Example
data =<<BITTER_END
65101km,Sedan,Manual,2010,18131A,FWD,Used,5.5L/100km,Toyota,camry,SE,{AC, Heated Seats, Heated Mirrors, Keyless Entry}
coupe,1100km,auto,RWD, Mercedec,CLK,LX ,18FO724A,2017,{AC, Heated Seats, Heated Mirrors, Keyless Entry, Power seats},6L/100km,Used
AWD,SUV,0km,auto,new,Honda,CRV,8L/100km,{Heated Seats, Heated Mirrors, Keyless Entry},19BF723A,2018,LE
BITTER_END
FILE_NAME = 'temp'
File.write(FILE_NAME, data)
#=> 353 (characters written to file)
k = K.new
#=> #<K:0x00000001c257d348>
k.extract_hashes(FILE_NAME)
#=> [{:set_of_features=>"{ac, heated seats, heated mirrors, keyless entry}",
# :km=>"65101", :type=>"sedan", :transmission=>"manual", :year=>"2010",
# :stock=>"18131a", :drivetrain=>"fwd", :status=>"used", :fuel_consumption=>5.5,
# :car_maker=>"toyota", :model=>"camry", :trim=>"se"},
# {:set_of_features=>"{ac, heated seats, heated mirrors, keyless entry, power seats}",
# :type=>"coupe", :km=>"1100", :transmission=>"auto", :drivetrain=>"rwd",
# :model=>"clk", :trim=>"lx", :stock=>"18fo724a", :year=>"2017",
# :fuel_consumption=>6.0, :status=>"used"},
# {:set_of_features=>"{heated seats, heated mirrors, keyless entry}",
# :drivetrain=>"awd", :type=>"suv", :km=>"0", :transmission=>"auto",
# :status=>"new", :car_maker=>"honda", :model=>"crv", :fuel_consumption=>8.0,
# :stock=>"19bf723a", :year=>"2018", :trim=>"le"}]
Explanation
Firstly, note that the HEREDOC needs to be un-indented before being executed.
You will see that the instance method K#extract_hashes uses IO#foreach to read the file line-by-line.1
The first step in processing each line of the file is to downcase it. You will then want to split the string on commas to form an array of words. There is a problem, however: you don't want to split on commas that are between a left and right brace ({ and }), which corresponds to the key :set_of_features. I decided to deal with that by determining the indices of the two braces, creating a hash with the single key :set_of_features, deleting that substring from the line, and lastly replacing the resulting pair of adjacent commas with a single comma:
idx_left = line.index('{')
idx_right = line.index('}')
if idx_left && idx_right
  g = { set_of_features: line[idx_left..idx_right] }
  line[idx_left..idx_right] = ''
  line.squeeze!(',')
else
  g = {}
end
See String for the documentation of the String methods used here and elsewhere.
We can now convert the resulting line to an array of words by splitting on the commas. If any capitalization is desired in the output this should be done after the hashes have been constructed.
We will build on the hash { set_of_features: line[idx_left..idx_right] } just created. When complete, it will be appended to the array being returned.
Each element (word) in the array, is then processed. If it is a key of the hash WORDS_TO_KEYS we set
h[WORDS_TO_KEYS[word]] = word
and are finished with that word. If not, we execute each of the instance methods m in the module ExtractionMethods until one is found for which m[word] is not nil. When that is found another key-value pair is added to the hash h:
h[m] = word
Notice that the name of each instance method in ExtractionMethods, which is a symbol (e.g., :km), is a key in the hash h. Having separate methods facilitates debugging and testing.
I could have written:
if (s = km(word))
  s
elsif (s = year(word))
  s
elsif (s = stock(word))
  s
elsif (s = trim(word))
  s
elsif (s = fuel_consumption(word))
  s
end
but since all these methods take the same argument, word, we can instead use Object#public_send:
a = [:km, :year, :stock, :trim, :fuel_consumption]
a.find do |m|
  v = public_send(m, word)
  (h[m] = v) unless v.nil?
  v
end
A final tweak is to put all the methods in the array a in a module ExtractionMethods and include that module in the class K. We can then replace a in the find expression above with ExtractionMethods.instance_methods. (See Module#instance_methods.)
Suppose now that the data are changed so that additional fields are added (e.g., for "colour" or "price"). Then the only modifications to the code required are changes to words_by_key and/or the addition of methods to ExtractionMethods.
Understanding the code
It may be helpful to run the code with puts statements inserted. For example,
idx_left = line.index('{')
idx_right = line.index('}')
puts "idx_left=#{idx_left}, idx_right=#{idx_right}"
Where code is chained it may be helpful to break it up with temporary variables and insert puts statements. For example, change
arr << line.split(',').each_with_object(g) do |word, h|
...
to
a = line.split(',')
puts "line.split(',')=#{a}"
enum = a.each_with_object(g)
puts "enum.to_a=#{enum.to_a}"
arr << enum.each do |word, h|
...
The second puts here is merely to see what elements the enumerator enum will generate and pass to the block.
Another way of doing that is to use the handy method Object#tap, which is inserted between two methods:
arr << line.split(',').tap { |a| puts "line.split(',')=#{a}"}.
each_with_object(g) do |word, h|
...
tap (great name, eh?), as used here, simply returns its receiver after displaying its value.
Lastly, I've used the method Enumerable#each_with_object in a couple of places. It may seem complex but it's actually quite simple. For example,
arr << line.split(',').each_with_object(g) do |word, h|
...
end
is effectively equivalent to:
h = g
arr << line.split(',').each do |word|
...
end
h
1 Many IO methods are typically invoked on File. This is acceptable because File.superclass #=> IO.
You could leverage the fact that your file instance is an enumerable. This allows you to leverage the inject method, and you can seed that with an empty hash. collector in this case is the hash that gets passed along as the iteration continues. Be sure to (implicitly, by having collector be the last line of the block) return the value of collector as the inject method will use this to feed into the next iteration. It's some pretty powerful stuff!
I think this is roughly what you're going for. I used model as the key in the hash, and set_of_features as your data.
def convertListings2Catalogue(fileName)
  f = File.open(fileName, "r")
  my_hash = f.inject({}) do |collector, line|
    km = line[/[0-9]+km/]
    t = line[(Regexp.union(/sedan/i, /coupe/i, /hatchback/i, /station/i, /suv/i))]
    trans = line[(Regexp.union(/auto/i, /manual/i, /steptronic/i))]
    dt = line[(Regexp.union(/fwd/i, /rwd/i, /awd/i))]
    status = line[(Regexp.union(/used/i, /new/i))]
    car_maker = line[(Regexp.union(/honda/i, /toyota/i, /mercedes/i, /bmw/i, /lexus/i))]
    stock = line.scan(/(\d+[a-z0-9]+[a-z](?<!km\b))(?:,|$)/i).first
    year = line.scan(/(\d{4}(?<!km\b))(?:,|$)/).first
    trim = line.scan(/\b[a-zA-Z]{2}\b/).first
    fuel = line.scan(/[\d.]+L\/\d*km/).first
    set_of_features = line.scan(/\{(.*?)\}/).first
    model = line[(Regexp.union(/camry/i, /clk/i, /crv/i))]
    collector[model] = set_of_features
    collector
  end
end
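The collector mechanics described in that answer, reduced to a toy example (the data here is invented):

```ruby
lines = ["camry,{AC}", "clk,{Heated Seats}"]

# inject threads `collector` (the seeded hash) through every iteration;
# whatever the block returns becomes the collector for the next line.
catalogue = lines.inject({}) do |collector, line|
  model, features = line.split(',', 2)
  collector[model] = features
  collector  # return the hash so the next iteration receives it
end

catalogue # => {"camry"=>"{AC}", "clk"=>"{Heated Seats}"}
```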

How to get the following result by truncating data?

I have a CoffeeScript file in the following format, and I have to generate all the possible combinations from it. I had already written code for the combinations and it was working fine, but the configuration file has changed, so I have to modify that code. Could anyone please help me solve this problem?
abTests:
  productRanking:
    version: 4
    groups: [
      ratio:
        default: 1
        us: 0.90
        me: 0.0
      value: "LessPopularityEPC"
    ,
      ratio:
        default: 0
        us: 0.1
      value: "CtrEpcJob"
    ,
      ratio:
        default: 0
        me: 1.0
      value: "RandomPerVisitor"
    ]
I would like to have the data formatted in the following format:
productRanking:
"LessPopularityEPC"
"CtrEpcJob"
"RandomPerVisitor"
]
I am using the following code here :
START_REGEXP = /# AB Tests/
END_REGEXP = /# Routes/
COMMENT_EXP = /#/
COMMA_REGEXP = /,/
START_BLOCK = /\[/
END_BLOCK = /]/
def Automate_AB_tests.abTestParser(input_file, output_file)
  raise "Source File doesn't exist at provided path" unless File.exists?(input_file)
  flag = false # setting default value of flag=FALSE to parse the data between two REGEX
  File.open(output_file, "w") do |ofile| # opening destination file in WRITE mode
    File.foreach(input_file) do |iline| # reading each line of the source file
      flag = true if iline =~ START_REGEXP
      ofile.puts(iline.sub(" ", '').sub("value:", '')) if flag && (iline =~ /value/ || iline =~ /,/ || iline =~ /]/) unless (iline =~ COMMENT_EXP or iline =~ COMMA_REGEXP)
      flag = false if iline =~ END_REGEXP
    end
  end
end
Assuming that what you want is to take each key under abTests, e.g. productRanking, and return a hash where each of those keys maps to the value of the value key in its first group, then like so:
data['abTests'].each_with_object({}) do |(key, testData), resultingHash|
  resultingHash[key] = testData['groups'].first['value']
end
However, if that's not what you want, then you need to be a bit more clear. Try working through the operations you want to achieve on paper, and break down your thought process step-by-step. The two operations that tend to be most useful when processing list or hash data are map and reduce (each_with_object being a form of reduction). Look through the documentation of the Enumerable module of Ruby for more info.
Assume that you have a Ruby hash, which can be written as
data = {:abTests=>{:productRanking=>{:version=>4, :groups=>[{:ratio=>{:default=>1, :us=>0.9, :me=>0.0}, :value=>"LessPopularityEPC"}]}}}
you can loop over this data hash to get your desired result
For example:
result_hash = {}
data[:abTests].each do |key, value|
  result_hash[key] = value[:groups][0][:value]
  puts result_hash
end
puts result_hash[:productRanking] # outputs "LessPopularityEPC"

Unicodes in file get escaped

I have a model Person. When I save a new Person through the console with a unicode escape, e.g.:
p = Person.new
p.name = "Hans"
p.street = "Jo\u00DFstreet"
p.save
p.street returns Joßstreet, which is correct. But when I try to build from a file in my seeds:
file.txt
Hans;Jo\u00DFstreet
Jospeh;Baiuvarenstreet
and run this in my seed:
File.readlines('file.txt').each do |line|
  f = line.split(';')
  p = Person.new
  p.name = f[0]
  p.street = f[1]
  p.save
end
Now when I call e.g. p = Person.last, I get p.street => "Jo\\u00DFstreet"
I don't understand why the \u00DF gets escaped! What can I do to fix this problem? Thanks
It's because escape sequences such as \u00DF are handled only in source-code string literals, as part of Ruby's syntax.
When you read a file (or receive data from somewhere else), Ruby doesn't try to handle escape sequences; you have to do that in your own code.
To unescape the string you may use the approaches described here or there, but it might be better to store the text unescaped in your file in the first place.
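For illustration, one common way to decode literal \uXXXX sequences after reading them from a file (a sketch; the linked answers show variations):

```ruby
# The file contains the six characters \u00DF, not the single character ß,
# so convert each \uXXXX escape into its codepoint by hand.
line = 'Jo\u00DFstreet'   # single quotes: the backslash is literal, as when read from a file
street = line.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack('U') }
street # => "Joßstreet"
```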

ROR/Hpricot: parsing a site and searching/comparing strings with regex

I just started with Ruby On Rails, and want to create a simple web site crawler which:
Goes through all the Sherdog fighters' profiles.
Gets the Referees' names.
Compares names with the old ones (both during the site parsing and from the file).
Prints and saves all the unique names to the file.
An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500
I am searching for tag entries like <span class="sub_line">Dan Miragliotta</span>; unfortunately, in addition to the referee names I need, the same class is used for:
The date.
"N/A" when the referee name is not known.
I need to discard all the results with an "N/A" string, as well as any string which contains numbers. I managed to do the first part but couldn't figure out how to do the second. I tried searching, thinking and experimenting, but after all the experimenting and rewriting I managed to break the whole program, and I don't know how to fix it properly:
require 'rubygems'
require 'hpricot'
require 'simplecrawler'

# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]

# The crawler yields a Document object for each visited page.
sc.crawl { |document|
  # Parse page title with Hpricot and print it
  hdoc = Hpricot(document.data)
  (hdoc/"td/span[@class='sub_line']").each do |span|
    if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
      # puts "Test"
    else
      puts span.inner_html
      #File.open("File_name.txt", 'a') {|f| f.puts(hdoc.span.inner_html) }
    end
  end
}
I would also appreciate help with ideas on the rest of the program: How do I properly read the current names from the file, if the program is run multiple times, and how do I make the comparisons for the unique names?
Edit:
After some proposed improvements, here is what I got:
require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
#require 'open-uri'
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
  puts names
}
Unfortunately, the code still doesn't work - it returns a blank.
If instead of doc = Nokogiri::HTML(document.data), I write doc = Nokogiri::HTML(open(document.data)), then it gives me the whole page, but, parsing still doesn't work.
hpricot isn't maintained anymore. How about using nokogiri instead?
names = document.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
=> ["Yuji Shimada", "Herb Dean", "Dan Miragliotta", "John McCarthy"]
A breakdown of the different parts:
document.css('td:nth-child(4) .sub-line')
This returns an array of HTML elements with the class name sub-line that are in the fourth table column.
.map(&:content)
For each element in the previous array, return element.content (the inner HTML). This is equivalent to map { |element| element.content }.
.uniq
Remove duplicate values from the array.
.reject { |c| c == 'N/A' }
Remove elements whose value is "N/A"
You would use array math (-) to compare them:
get referees from the current page
current_referees = doc.search('td[4] .sub_line').map(&:inner_text).uniq - ['N/A']
read old referees from the file
old_referees = File.read('old_referees.txt').split("\n")
use Array#- to compare them
new_referees = current_referees - old_referees
write the new file
File.open('new_referees.txt','w'){|f| f << new_referees * "\n"}
This will return all the names, ignoring dates and "N/A":
puts doc.css('td span.sub_line').map(&:content).reject{ |s| s['/'] }.uniq
It results in:
Yuji Shimada
Herb Dean
Dan Miragliotta
John McCarthy
Adding these to a file and removing duplicates is left as an exercise for you, but I'd use some magical combination of File.readlines, sort and uniq followed by a bit of File.open to write the results.
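That combination might look like this (file name and names are made up):

```ruby
# Merge freshly scraped names with any already on disk, keeping the
# file duplicate-free and sorted.
existing = File.exist?('referees.txt') ? File.readlines('referees.txt', chomp: true) : []
scraped  = ['Herb Dean', 'Dan Miragliotta', 'Herb Dean']
merged   = (existing + scraped).uniq.sort
File.open('referees.txt', 'w') { |f| f.puts merged }
```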
Here is the final answer:
require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
require 'open-uri'

# Mute log messages
module SimpleCrawler
  class Crawler
    def log(message)
    end
  end
end

n = 0 # Counts how many pages/profiles have been processed
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 150000
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]
old_referees = File.read('referees.txt').split("\n")

sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  current_referees = doc.search('td[4] .sub_line').map(&:text).uniq - ['N/A']
  new_referees = current_referees - old_referees
  n += 1
  # If new referees were found, print statistics
  if !new_referees.empty? then
    puts n.to_s + ". " + new_referees.length.to_s + " new : " + new_referees.to_s + "\n"
  end
  new_referees = new_referees + old_referees
  old_referees = new_referees.uniq
  old_referees.reject!(&:empty?)
  # Performance optimization: save only every 10th profile.
  if n % 10 == 0 then
    File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }
  end
}
File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }
