I have CSVs which I am trying to import into my Oracle database, but unfortunately I keep getting the same error:
> CSV::MalformedCSVError: Unquoted fields do not allow \r or \n (line 1).
I know there are tons of similar questions, but none relate specifically to my issue apart from this one, which unfortunately didn't help.
To explain my scenario:
I have CSVs in which the rows don't always end with a value; sometimes they end with just a comma, because the last value is null and therefore stays blank. I would like to import the CSVs regardless of whether a row ends with a comma or not.
Here are the first 5 lines of my CSV, with values changed for privacy reasons:
id,customer_id,provider_id,name,username,password,salt,email,description,blocked,created_at,updated_at,deleted_at
1,,1,Default Administrator,admin,1,1," ",Initial default user.,f,2019-10-04 14:28:38.492000,2019-10-04 14:29:34.224000,
2,,2,Default Administrator,admin,2,1,,Initial default user.,,2019-10-04 14:28:38.633000,2019-10-04 14:28:38.633000,
3,1,,Default Administrator,admin,3,1," ",Initial default user.,f,2019-10-04 14:41:38.030000,2019-11-27 10:23:03.329000,
4,1,,admin,admin,4,1," ",,,2019-10-28 12:21:23.338000,2019-10-28 12:21:23.338000,
5,2,,Default Administrator,admin,5,1," ",Initial default user.,f,2019-11-12 09:00:49.430000,2020-02-04 08:20:06.601000,2020-02-04 08:20:06.601000
As you can see, a row sometimes ends with a comma and sometimes without one, and this structure repeats quite often.
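For reference, parsing one of these rows directly with Ruby's CSV shows that a trailing comma by itself is not the problem; it just becomes a trailing nil field (this uses the fourth data row above):

```ruby
require 'csv'

# One of the rows above that ends with a comma (the header has 13 columns)
line = '4,1,,admin,admin,4,1," ",,,2019-10-28 12:21:23.338000,2019-10-28 12:21:23.338000,'

row = CSV.parse_line(line)
row.size      #=> 13 -- the trailing comma yields a final nil field
row.last.nil? #=> true
```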
This is the code I have been playing around with:
def csv_replace_empty_string
  Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
    next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename == 'import_csv.rb'
    read_file = File.read(Rails.root.join('db', 'csv_export', filename))
    replace_empty_string = read_file.gsub(/(?<![^,])""(?![^,])/, '" "')
    format_csv = replace_empty_string.gsub(/\r\r?\n?/, "\n")
    # format_csv = remove_empty_lines.sub!(/(?:\r?\n)+\z/, "")
    File.open(Rails.root.join('db', 'csv_export', filename), "w") { |file| file.puts format_csv }
  end
end
I have tried many different kinds of gsubs found in similar forums, but none of them helped.
Here is my function for importing the CSV in the db:
def import_csv_into_db
  Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
    next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename == 'import_csv.rb'
    filename_renamed = File.basename(filename, File.extname(filename)).classify
    CSV.foreach(Rails.root.join('db', 'csv_export', filename), :headers => true, :skip_blanks => true) do |row|
      class_name = filename_renamed.constantize
      class_name.create!(row.to_hash)
      puts "Insert on table #{filename_renamed} complete"
    end
  end
end
I have also tried the options provided by CSV, such as :row_sep => "\n" or :row_sep => "\r", but I keep getting the same error.
I am pretty sure there is a flaw in my thinking somewhere, but I can't seem to figure it out.
I fixed the issue by using the following:
format_csv = replace_empty_string.gsub(/\r\r?\n?/, "\n")
This was originally @mgrims's answer, but I had to adjust my code by additionally removing the :skip_blanks and :row_sep options.
It is importing successfully now!
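For anyone else hitting this, here is a minimal sketch of the normalization-then-parse step in isolation (the data is made up, but the gsub is the one from the fix above):

```ruby
require 'csv'

# A CSV payload with \r\n line endings and trailing commas, as exported from the database
raw = "id,name,deleted_at\r\n1,admin,\r\n2,admin,\r\n"

# Collapse \r\n / \r\r\n / bare \r into plain \n
format_csv = raw.gsub(/\r\r?\n?/, "\n")

rows = CSV.parse(format_csv, headers: true)
rows.size                #=> 2
rows.first['deleted_at'] #=> nil -- the trailing comma becomes a nil field
```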
This is only my second post and I'm still learning Ruby. I'm trying to work this out from my Java knowledge, but I can't seem to get it right.
What I need to do is:
I have a function that reads a file line by line and extracts different car features from each line, for example:
def convertListings2Catalogue(fileName)
  f = File.open(fileName, "r")
  f.each_line do |line|
    km = line[/[0-9]+km/]
    t = line[Regexp.union(/sedan/i, /coupe/i, /hatchback/i, /station/i, /suv/i)]
    trans = ....
  end
end
Now for each line I need to store the extracted features into separate hashes that I can access later in my program.
The issues I'm facing:
1) I'm overwriting the features in the same hash.
2) I can't access the hash outside my function.
This is what's in my file:
65101km,Sedan,Manual,2010,18131A,FWD,Used,5.5L/100km,Toyota,camry,SE,{AC, Heated Seats, Heated Mirrors, Keyless Entry}
coupe,1100km,auto,RWD, Mercedec,CLK,LX ,18FO724A,2017,{AC, Heated Seats, Heated Mirrors, Keyless Entry, Power seats},6L/100km,Used
AWD,SUV,0km,auto,new,Honda,CRV,8L/100km,{Heated Seats, Heated Mirrors, Keyless Entry},19BF723A,2018,LE
Now my function extracts the features of each car model, but I need to store these features in 3 different hashes with the same keys but different values.
listing = Hash.new(0)
listing = { kilometers: km, type: t, transmission: trans, drivetrain: dt, status: status, car_maker: car_maker }
I tried moving the data from one hash to another; I even tried storing the data in an array first and then moving it to the hash, but I still can't figure out how to create separate hashes inside a loop.
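For concreteness, this is the shape I'm after, sketched with just two of the features and a made-up file name: one fresh hash per line, pushed onto an array that the function returns, so nothing is overwritten and the result is usable outside the function.

```ruby
# Build one hash per line and collect them all in an array
def convert_listings_to_catalogue(file_name)
  listings = []
  File.foreach(file_name) do |line|
    km = line[/[0-9]+km/]
    t  = line[Regexp.union(/sedan/i, /coupe/i, /hatchback/i, /station/i, /suv/i)]
    listings << { kilometers: km, type: t } # a brand-new hash each time through
  end
  listings # returned to the caller, so it is accessible outside the function
end

File.write('listings.txt', "65101km,Sedan,Manual\ncoupe,1100km,auto\n")
catalogue = convert_listings_to_catalogue('listings.txt')
catalogue.size            #=> 2
catalogue[0][:kilometers] #=> "65101km"
catalogue[1][:type]       #=> "coupe"
```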
Thanks
I don't fully understand the question but I thought it was important to suggest how you might deal with a more fundamental issue: extracting the desired information from each line of the file in an effective and Ruby-like manner. Once you have that information, in the form of an array of hashes, one hash per line, you can do with it what you want. Alternatively, you could loop through the lines in the file, constructing a hash for each line and performing any desired operations before going on to the next line.
Being new to Ruby you will undoubtedly find some of the code below difficult to understand. If you persevere, however, I think you will be able to understand all of it, and in the process learn a lot about Ruby. I've made some suggestions in the last section of my answer to help you decipher the code.
Code
words_by_key = {
  type: %w| sedan coupe hatchback station suv |,
  transmission: %w| auto manual steptronic |,
  drivetrain: %w| fwd rwd awd |,
  status: %w| used new |,
  car_maker: %w| honda toyota mercedes bmw lexus |,
  model: %w| camry clk crv |
}
#=> {:type=>["sedan", "coupe", "hatchback", "station", "suv"],
# :transmission=>["auto", "manual", "steptronic"],
# :drivetrain=>["fwd", "rwd", "awd"],
# :status=>["used", "new"],
# :car_maker=>["honda", "toyota", "mercedes", "bmw", "lexus"],
# :model=>["camry", "clk", "crv"]}
WORDS_TO_KEYS = words_by_key.each_with_object({}) { |(k,v),h| v.each { |s| h[s] = k } }
#=> {"sedan"=>:type, "coupe"=>:type, "hatchback"=>:type, "station"=>:type, "suv"=>:type,
# "auto"=>:transmission, "manual"=>:transmission, "steptronic"=>:transmission,
# "fwd"=>:drivetrain, "rwd"=>:drivetrain, "awd"=>:drivetrain,
# "used"=>:status, "new"=>:status,
# "honda"=>:car_maker, "toyota"=>:car_maker, "mercedes"=>:car_maker,
# "bmw"=>:car_maker, "lexus"=>:car_maker,
# "camry"=>:model, "clk"=>:model, "crv"=>:model}
module ExtractionMethods
  def km(str)
    str[/\A\d+(?=km\z)/]
  end

  def year(str)
    str[/\A\d{4}\z/]
  end

  def stock(str)
    return nil if str.end_with?('km')
    str[/\A\d+\p{Alpha}\p{Alnum}*\z/]
  end

  def trim(str)
    str[/\A\p{Alpha}{2}\z/]
  end

  def fuel_consumption(str)
    str.to_f if str[/\A\d+(?:\.\d+)?(?=l\/100km\z)/]
  end
end
class K
  include ExtractionMethods

  def extract_hashes(fname)
    File.foreach(fname).with_object([]) do |line, arr|
      line = line.downcase
      idx_left = line.index('{')
      idx_right = line.index('}')
      if idx_left && idx_right
        g = { set_of_features: line[idx_left..idx_right] }
        line[idx_left..idx_right] = ''
        line.squeeze!(',')
      else
        g = {}
      end
      arr << line.split(',').each_with_object(g) do |word, h|
        word.strip!
        if WORDS_TO_KEYS.key?(word)
          h[WORDS_TO_KEYS[word]] = word
        else
          ExtractionMethods.instance_methods.find do |m|
            v = public_send(m, word)
            (h[m] = v) unless v.nil?
            v
          end
        end
      end
    end
  end
end
Example
data = <<BITTER_END
65101km,Sedan,Manual,2010,18131A,FWD,Used,5.5L/100km,Toyota,camry,SE,{AC, Heated Seats, Heated Mirrors, Keyless Entry}
coupe,1100km,auto,RWD, Mercedec,CLK,LX ,18FO724A,2017,{AC, Heated Seats, Heated Mirrors, Keyless Entry, Power seats},6L/100km,Used
AWD,SUV,0km,auto,new,Honda,CRV,8L/100km,{Heated Seats, Heated Mirrors, Keyless Entry},19BF723A,2018,LE
BITTER_END
FILE_NAME = 'temp'
File.write(FILE_NAME, data)
#=> 353 (characters written to file)
k = K.new
#=> #<K:0x00000001c257d348>
k.extract_hashes(FILE_NAME)
#=> [{:set_of_features=>"{ac, heated seats, heated mirrors, keyless entry}",
# :km=>"65101", :type=>"sedan", :transmission=>"manual", :year=>"2010",
# :stock=>"18131a", :drivetrain=>"fwd", :status=>"used", :fuel_consumption=>5.5,
# :car_maker=>"toyota", :model=>"camry", :trim=>"se"},
# {:set_of_features=>"{ac, heated seats, heated mirrors, keyless entry, power seats}",
# :type=>"coupe", :km=>"1100", :transmission=>"auto", :drivetrain=>"rwd",
# :model=>"clk", :trim=>"lx", :stock=>"18fo724a", :year=>"2017",
# :fuel_consumption=>6.0, :status=>"used"},
# {:set_of_features=>"{heated seats, heated mirrors, keyless entry}",
# :drivetrain=>"awd", :type=>"suv", :km=>"0", :transmission=>"auto",
# :status=>"new", :car_maker=>"honda", :model=>"crv", :fuel_consumption=>8.0,
# :stock=>"19bf723a", :year=>"2018", :trim=>"le"}]
Explanation
Firstly, note that the HEREDOC needs to be un-indented before being executed.
You will see that the instance method K#extract_hashes uses IO#foreach to read the file line-by-line.¹
The first step in processing each line of the file is to downcase it. You will then want to split the string on commas to form an array of words. There is a problem, however, in that you don't want to split on commas that lie between a left and right brace ({ and }), which corresponds to the key :set_of_features. I decided to deal with that by determining the indices of the two braces, creating a hash with the single key :set_of_features, deleting that substring from the line, and lastly replacing the resulting pair of adjacent commas with a single comma:
idx_left = line.index('{')
idx_right = line.index('}')
if idx_left && idx_right
  g = { set_of_features: line[idx_left..idx_right] }
  line[idx_left..idx_right] = ''
  line.squeeze!(',')
else
  g = {}
end
See String for the documentation of the String methods used here and elsewhere.
We can now convert the resulting line to an array of words by splitting on the commas. If any capitalization is desired in the output this should be done after the hashes have been constructed.
We will build on the hash { set_of_features: line[idx_left..idx_right] } just created. When complete, it will be appended to the array being returned.
Each element (word) in the array is then processed. If it is a key of the hash WORDS_TO_KEYS we set
h[WORDS_TO_KEYS[word]] = word
and are finished with that word. If not, we execute each of the instance methods m in the module ExtractionMethods until one is found for which public_send(m, word) is not nil. When one is found, another key-value pair is added to the hash h:
h[m] = word
Notice that the name of each instance method in ExtractionMethods, which is a symbol (e.g., :km), is a key in the hash h. Having separate methods facilitates debugging and testing.
I could have written:
if (s = km(word))
  s
elsif (s = year(word))
  s
elsif (s = stock(word))
  s
elsif (s = trim(word))
  s
elsif (s = fuel_consumption(word))
  s
end
but since all these methods take the same argument, word, we can instead use Object#public_send:
a = [:km, :year, :stock, :trim, :fuel_consumption]
a.find do |m|
  v = public_send(m, word)
  (h[m] = v) unless v.nil?
  v
end
A final tweak is to put all the methods in the array a in a module ExtractionMethods and include that module in the class K. We can then replace a in the find expression above with ExtractionMethods.instance_methods. (See Module#instance_methods.)
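To see the public_send dispatch on its own, here is a stripped-down toy class with just two of the extraction methods:

```ruby
class Toy
  def km(str)
    str[/\A\d+(?=km\z)/]
  end

  def year(str)
    str[/\A\d{4}\z/]
  end

  # Try each extractor in turn until one returns non-nil, recording the hit in a hash
  def classify(word)
    h = {}
    [:km, :year].find do |m|
      v = public_send(m, word)
      (h[m] = v) unless v.nil?
      v
    end
    h
  end
end

Toy.new.classify("1100km") #=> {:km=>"1100"}
Toy.new.classify("2017")   #=> {:year=>"2017"}
```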
Suppose now that the data are changed so that additional fields are added (e.g., for "colour" or "price"). Then the only modifications to the code required are changes to words_by_key and/or the addition of methods to ExtractionMethods.
Understanding the code
It may be helpful to run the code with puts statements inserted. For example,
idx_left = line.index('{')
idx_right = line.index('}')
puts "idx_left=#{idx_left}, idx_right=#{idx_right}"
Where code is chained it may be helpful to break it up with temporary variables and insert puts statements. For example, change
arr << line.split(',').each_with_object(g) do |word, h|
...
to
a = line.split(',')
puts "line.split(',')=#{a}"
enum = a.each_with_object(g)
puts "enum.to_a=#{enum.to_a}"
arr << enum.each do |word, h|
  ...
The second puts here is merely to see what elements the enumerator enum will generate and pass to the block.
Another way of doing that is to use the handy method Object#tap, which is inserted between two methods:
arr << line.split(',').tap { |a| puts "line.split(',')=#{a}" }.
         each_with_object(g) do |word, h|
  ...
tap (great name, eh?), as used here, simply returns its receiver after displaying its value.
Lastly, I've used the method Enumerable#each_with_object in a couple of places. It may seem complex but it's actually quite simple. For example,
arr << line.split(',').each_with_object(g) do |word, h|
  ...
end

is effectively equivalent to:

h = g
arr << line.split(',').each do |word|
  ...
end
h
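A self-contained example of each_with_object, unrelated to the car data, may make the threading of the memo object clearer:

```ruby
# The second block argument (h) is the same object on every iteration,
# and each_with_object returns that object at the end
result = %w[a b c].each_with_object({}) { |w, h| h[w] = w.upcase }
result #=> {"a"=>"A", "b"=>"B", "c"=>"C"}
```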
¹ Many IO methods are typically invoked on File. This is acceptable because File.superclass #=> IO.
You could leverage the fact that your file instance is an Enumerable. This allows you to use the inject method, which you can seed with an empty hash. collector in this case is the hash that gets passed along as the iteration continues. Be sure to return the value of collector (implicitly, by having collector be the last line of the block), as inject will feed it into the next iteration. It's some pretty powerful stuff!
I think this is roughly what you're going for. I used model as the key in the hash, and set_of_features as your data.
def convertListings2Catalogue(fileName)
  f = File.open(fileName, "r")
  my_hash = f.inject({}) do |collector, line|
    km = line[/[0-9]+km/]
    t = line[Regexp.union(/sedan/i, /coupe/i, /hatchback/i, /station/i, /suv/i)]
    trans = line[Regexp.union(/auto/i, /manual/i, /steptronic/i)]
    dt = line[Regexp.union(/fwd/i, /rwd/i, /awd/i)]
    status = line[Regexp.union(/used/i, /new/i)]
    car_maker = line[Regexp.union(/honda/i, /toyota/i, /mercedes/i, /bmw/i, /lexus/i)]
    stock = line.scan(/(\d+[a-z0-9]+[a-z](?<!km\b))(?:,|$)/i).first
    year = line.scan(/(\d{4}(?<!km\b))(?:,|$)/).first
    trim = line.scan(/\b[a-zA-Z]{2}\b/).first
    fuel = line.scan(/[\d.]+L\/\d*km/).first
    set_of_features = line.scan(/\{(.*?)\}/).first
    model = line[Regexp.union(/camry/i, /clk/i, /crv/i)]
    collector[model] = set_of_features
    collector
  end
end
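To see just the inject part in isolation, here is a sketch with made-up two-line data and only the model and feature-set extractions:

```ruby
lines = [
  "65101km,Toyota,camry,{AC, Heated Seats}",
  "1100km,Mercedec,CLK,{AC, Power seats}"
]

catalogue = lines.inject({}) do |collector, line|
  model = line[Regexp.union(/camry/i, /clk/i, /crv/i)]
  set_of_features = line[/\{(.*?)\}/, 1]
  collector[model] = set_of_features
  collector # returned so inject passes it into the next iteration
end

catalogue #=> {"camry"=>"AC, Heated Seats", "CLK"=>"AC, Power seats"}
```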
I have a CoffeeScript file in the following format, and I have to generate all the possible combinations using the : :. I had already written the code for the combinations and it was working fine, but the configuration file has changed and I have to modify that code. Could anyone please help me solve this problem?
abTests:
  productRanking:
    version: 4
    groups: [
      ratio:
        default: 1
        us: 0.90
        me: 0.0
      value: "LessPopularityEPC"
    ,
      ratio:
        default: 0
        us: 0.1
      value: "CtrEpcJob"
    ,
      ratio:
        default: 0
        me: 1.0
      value: "RandomPerVisitor"
    ]
I would like to have the data formatted in the following format:
productRanking:
  "LessPopularityEPC"
  "CtrEpcJob"
  "RandomPerVisitor"
I am using the following code here:
START_REGEXP = /# AB Tests/
END_REGEXP = /# Routes/
COMMENT_EXP = /#/
COMMA_REGEXP = /,/
START_BLOCK = /\[/
END_BLOCK = /]/

def Automate_AB_tests.abTestParser(input_file, output_file)
  raise "Source File doesn't exist at provided path" unless File.exists?(input_file)
  flag = false # setting default value of flag=FALSE to parse the data between two REGEX
  File.open(output_file, "w") do |ofile| # opening destination file in WRITE mode
    File.foreach(input_file) do |iline| # reading each line of the source file
      flag = true if iline =~ START_REGEXP
      ofile.puts(iline.sub(" ", '').sub("value:", '')) if flag && (iline =~ /value/ || iline =~ /,/ || iline =~ /]/) unless (iline =~ COMMENT_EXP or iline =~ COMMA_REGEXP)
      flag = false if iline =~ END_REGEXP
    end
  end
end
Assuming that what you want is to take each key under abTests, e.g. productRanking, and return a hash with those keys having the value of the value key in their first group, then like so:
data['abTests'].each_with_object({}) do |(key, testData), resultingHash|
  resultingHash[key] = testData['groups'].first['value']
end
However, if that's not what you want, then you need to be a bit more clear. Try working through the operations you want to achieve on paper, and break down your thought process step-by-step. The two operations that tend to be most useful when processing list or hash data are map and reduce (each_with_object being a form of reduction). Look through the documentation of the Enumerable module of Ruby for more info.
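Here is a runnable version of that snippet, assuming the CoffeeScript config has already been loaded into a Ruby hash with string keys:

```ruby
data = {
  'abTests' => {
    'productRanking' => {
      'version' => 4,
      'groups'  => [
        { 'ratio' => { 'default' => 1, 'us' => 0.90, 'me' => 0.0 }, 'value' => 'LessPopularityEPC' },
        { 'ratio' => { 'default' => 0, 'us' => 0.1 }, 'value' => 'CtrEpcJob' }
      ]
    }
  }
}

result = data['abTests'].each_with_object({}) do |(key, testData), resultingHash|
  resultingHash[key] = testData['groups'].first['value']
end

result #=> {"productRanking"=>"LessPopularityEPC"}
```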
Assume that you have a Ruby hash, which can be rewritten as:
data = {:abTests=>{:productRanking=>{:version=>4, :groups=>[{:ratio=>{:default=>1, :us=>0.9, :me=>0.0}, :value=>"LessPopularityEPC"}]}}}
you can loop over this data hash to get your desired result
for eg:
result_hash = {}
data[:abTests].each do |key, value|
  result_hash[key] = value[:groups][0][:value]
  puts result_hash
end
puts result_hash[:productRanking] #=> outputs "LessPopularityEPC"
I just started with Ruby on Rails, and want to create a simple web site crawler which:
Goes through all the Sherdog fighters' profiles.
Gets the Referees' names.
Compares names with the old ones (both during the site parsing and from the file).
Prints and saves all the unique names to the file.
An example URL is: http://www.sherdog.com/fighter/Fedor-Emelianenko-1500
I am searching for tag entries like <span class="sub_line">Dan Miragliotta</span>. Unfortunately, in addition to the proper referee names I need, the same class is used with:
The date.
"N/A" when the referee name is not known.
I need to discard all the results containing an "N/A" string, as well as any string which contains numbers. I managed to do the first part but couldn't figure out how to do the second. I tried searching, thinking and experimenting but, after experimenting and rewriting, I managed to break the whole program and don't know how to (properly) fix it:
require 'rubygems'
require 'hpricot'
require 'simplecrawler'

# Set up a new crawler
sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]

# The crawler yields a Document object for each visited page.
sc.crawl { |document|
  # Parse the page with Hpricot
  hdoc = Hpricot(document.data)
  (hdoc/"td/span[@class='sub_line']").each do |span|
    if span.inner_html == 'N/A' || Regexp.new(".*/\d\.*$").match(span.inner_html)
      # puts "Test"
    else
      puts span.inner_html
      # File.open("File_name.txt", 'a') { |f| f.puts(hdoc.span.inner_html) }
    end
  end
}
I would also appreciate help with ideas on the rest of the program: How do I properly read the current names from the file, if the program is run multiple times, and how do I make the comparisons for the unique names?
Edit:
After some proposed improvements, here is what I got:
require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
# require 'open-uri'

sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 1

sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  names = doc.css('td:nth-child(4) .sub-line').map(&:content).uniq.reject { |c| c == 'N/A' }
  puts names
}
Unfortunately, the code still doesn't work - it returns a blank.
If instead of doc = Nokogiri::HTML(document.data), I write doc = Nokogiri::HTML(open(document.data)), then it gives me the whole page, but, parsing still doesn't work.
hpricot isn't maintained anymore. How about using nokogiri instead?
names = document.css('td:nth-child(4) .sub_line').map(&:content).uniq.reject { |c| c == 'N/A' }
=> ["Yuji Shimada", "Herb Dean", "Dan Miragliotta", "John McCarthy"]
A breakdown of the different parts:
document.css('td:nth-child(4) .sub_line')
This returns an array of HTML elements with the class name sub_line that are in the fourth table column.
.map(&:content)
For each element in the previous array, return element.content (the inner html). This is equivalent to map({ |element| element.content }).
.uniq
Remove duplicate values from the array.
.reject { |c| c == 'N/A' }
Remove elements whose value is "N/A"
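The uniq/reject tail of the chain can be tried on plain strings, without any HTML involved:

```ruby
values = ["Herb Dean", "N/A", "Herb Dean", "Dan Miragliotta", "N/A"]
names = values.uniq.reject { |c| c == 'N/A' }
names #=> ["Herb Dean", "Dan Miragliotta"]
```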
You would use array math (-) to compare them:
Get the referees from the current page:
current_referees = doc.search('td[4] .sub_line').map(&:inner_text).uniq - ['N/A']
Read the old referees from the file:
old_referees = File.read('old_referees.txt').split("\n")
Use Array#- to compare them:
new_referees = current_referees - old_referees
Write the new file:
File.open('new_referees.txt', 'w') { |f| f << new_referees * "\n" }
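Those steps can be exercised end to end without the crawler (the referee names and file names here are just stand-ins):

```ruby
# Pretend these came from a previous run and from the current page
File.write('old_referees.txt', "Herb Dean\n")
current_referees = ['Herb Dean', 'Dan Miragliotta', 'Yuji Shimada']

old_referees = File.read('old_referees.txt').split("\n")
new_referees = current_referees - old_referees # Array#- removes already-known names

File.open('new_referees.txt', 'w') { |f| f << new_referees * "\n" }

new_referees                  #=> ["Dan Miragliotta", "Yuji Shimada"]
File.read('new_referees.txt') #=> "Dan Miragliotta\nYuji Shimada"
```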
This will return all the names, ignoring dates and "N/A":
puts doc.css('td span.sub_line').map(&:content).reject{ |s| s['/'] }.uniq
It results in:
Yuji Shimada
Herb Dean
Dan Miragliotta
John McCarthy
Adding these to a file and removing duplicates is left as an exercise for you, but I'd use some magical combination of File.readlines, sort and uniq followed by a bit of File.open to write the results.
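One way to sketch that File.readlines / sort / uniq combination (the file name and names are arbitrary):

```ruby
# Seed a file containing a duplicate name
File.write('referee_names.txt', "Herb Dean\nDan Miragliotta\nHerb Dean\n")

# Read, sort, dedupe, then write the cleaned list back
names = File.readlines('referee_names.txt', chomp: true).sort.uniq
File.open('referee_names.txt', 'w') { |f| f.puts(names) }

File.read('referee_names.txt') #=> "Dan Miragliotta\nHerb Dean\n"
```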
Here is the final answer:

require 'rubygems'
require 'simplecrawler'
require 'nokogiri'
require 'open-uri'

# Mute log messages
module SimpleCrawler
  class Crawler
    def log(message)
    end
  end
end

n = 0 # Counts how many pages/profiles have been processed

sc = SimpleCrawler::Crawler.new("http://www.sherdog.com/fighter/Fedor-Emelianenko-1500")
sc.maxcount = 150000
sc.include_patterns = [".*/fighter/.*$", ".*/events/.*$", ".*/organizations/.*$", ".*/stats/fightfinder\?association/.*$"]

old_referees = File.read('referees.txt').split("\n")

sc.crawl { |document|
  doc = Nokogiri::HTML(document.data)
  current_referees = doc.search('td[4] .sub_line').map(&:text).uniq - ['N/A']
  new_referees = current_referees - old_referees
  n += 1

  # If new referees were found, print statistics
  if !new_referees.empty? then
    puts n.to_s + ". " + new_referees.length.to_s + " new : " + new_referees.to_s + "\n"
  end

  new_referees = new_referees + old_referees
  old_referees = new_referees.uniq
  old_referees.reject!(&:empty?)

  # Performance optimization: save only every 10th profile.
  if n % 10 == 0 then
    File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }
  end
}

File.open('referees.txt', 'w') { |f| f << old_referees * "\n" }