How to parse a CSV file whose data contains commas in Ruby - ruby-on-rails

I am parsing a CSV file with Ruby and am having trouble because the delimiter is a comma and my data also contains commas.
Fields that contain commas are surrounded by double quotes (""), but I am not sure how to make CSV ignore commas that are inside the quotes.
Example CSV data (File.csv):
NCB 14591 BLK 13 LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR
Example Code:
require 'csv'
CSV.foreach("File.csv", encoding:'iso-8859-1:utf-8', :quote_char => "\x00").each do |x|
puts x[1]
end
Current Output: " 84.07 FT OF 25
Expected Output: 84.07 FT OF 25, ALL OF 26,
Link to the gist to view the example file and code.
https://gist.github.com/markscoin/0d6c2d346d70fd627203317c5fe3097c

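Use the default quote character (") instead of "\x00", so that commas inside quoted fields are not treated as delimiters. The force_quotes option only affects writing CSV, so it is not needed for parsing:
require 'csv'
CSV.foreach("data.csv", encoding: 'iso-8859-1:utf-8', quote_char: '"') do |x|
  puts x[1]
end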
Result:
84.07 FT OF 25, ALL OF 26,
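To see the difference the quote character makes, here is a minimal sketch that parses the sample line from the question twice with CSV.parse_line. It assumes a reasonably recent version of Ruby's csv library; the variable names are only illustrative.

```ruby
require 'csv'

# The sample row from the question, as a string.
line = 'NCB 14591 BLK 13 LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR'

# Default quote character ("): the quoted field stays in one piece.
default_row = CSV.parse_line(line)
puts default_row[1]

# NUL as the quote character: the double quotes become ordinary data,
# so every comma splits the row.
nul_row = CSV.parse_line(line, quote_char: "\x00")
puts nul_row[1]
```

The second call reproduces the "Current Output" from the question, because the row is split into five fields instead of three.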

The "Illegal quoting" error is raised when a line contains quotes that do not wrap the entire field. For instance, suppose you had a CSV that looks like:
NCB 14591 BLK 13 LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR
NCB 14592 BLK 14 LOT W IRR,84.07 FT OF "25",TWENTY-FOUR SAC HOLDING COR
You could parse each line individually and change the quote character only for the lines that use bad quoting:
require 'csv'

def parse_file(file_name)
  File.foreach(file_name) do |line|
    parse_line(line) do |x|
      puts x.inspect
    end
  end
end

def parse_line(line)
  options = { encoding: 'iso-8859-1:utf-8' }
  begin
    # pass the options as keywords; Ruby 3 no longer accepts a positional hash here
    yield CSV.parse_line(line, **options)
  rescue CSV::MalformedCSVError
    # this line is misusing quotes, change the quote character and try again
    options.merge! quote_char: "\x00"
    retry
  end
end

parse_file('./File.csv')
Running this gives you:
["NCB 14591 BLK 13 LOT W IRR", " 84.07 FT OF 25, ALL OF 26,", "TWENTY-THREE SAC HOLDING COR"]
["NCB 14592 BLK 14 LOT W IRR", "84.07 FT OF \"25\"", "TWENTY-FOUR SAC HOLDING COR"]
However, if a single row mixes bad quoting and good quoting, this falls apart again. Ideally, you should clean up the CSV so that it is valid.
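Another option, which is not part of the answer above, is the liberal_parsing option of the csv library. The sketch below assumes a version of csv that supports it and uses the two sample lines inline instead of reading File.csv.

```ruby
require 'csv'

lines = [
  'NCB 14591 BLK 13 LOT W IRR," 84.07 FT OF 25, ALL OF 26,",TWENTY-THREE SAC HOLDING COR',
  'NCB 14592 BLK 14 LOT W IRR,84.07 FT OF "25",TWENTY-FOUR SAC HOLDING COR'
]

# liberal_parsing asks the parser to tolerate quotes inside unquoted fields
# instead of raising CSV::MalformedCSVError.
rows = lines.map { |line| CSV.parse_line(line, liberal_parsing: true) }
rows.each { |row| puts row.inspect }
```

This is still a best-effort approach: a comma inside a badly quoted field would still split that field, so cleaning the data remains the more reliable fix.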

Related

Rails: parse a CSV with ¦ as the separator

I have a CSV file with a strange format:
2783¦Larson and Sons
967¦Becker Group
333¦Rolfson LLC
I have tried to do this
CSV.foreach("#{Rails.root}/csv_files/suppliers.csv") do |supplier|
p supplier[0]
end
but I get a single string, "2783¦Larson and Sons".
How can I separate the values?
For example, I would like to get:
supplier[0] #=> "2783"
supplier[1] #=> "Larson and Sons"
Why would you expect CSV to know how to handle this weird input? You should explicitly specify the encoding and the column separator.
# CSV.read returns all rows as an array of arrays; it does not yield to a block
rows = CSV.read("#{Rails.root}/csv_files/suppliers.csv",
                encoding: Encoding::ISO_8859_1,
                col_sep: "\xC2\xA6".force_encoding(Encoding::ISO_8859_1))
puts rows.inspect
#⇒ [["2783", "Larson and Sons"],
#   ["967", "Becker Group"],
#   ["333", "Rolfson LLC"]]
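The bytes \xC2\xA6 in the answer above are the UTF-8 encoding of ¦ (U+00A6). If the file really is UTF-8, a simpler variant is to pass that character directly as col_sep. A minimal sketch, assuming UTF-8 input and using inline data in place of suppliers.csv:

```ruby
require 'csv'

# U+00A6 is the broken bar character used as the separator in the question.
sep = "\u00A6"

# Sample rows from the question, built inline as a UTF-8 string.
data = "2783#{sep}Larson and Sons\n967#{sep}Becker Group\n333#{sep}Rolfson LLC\n"

suppliers = CSV.parse(data, col_sep: sep)
suppliers.each do |id, name|
  puts "#{id}: #{name}"
end
```

When reading from disk, the same col_sep can be passed to CSV.foreach together with the file's actual encoding.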

Nokogiri: Clean up data output

I am trying to scrape player information from MLS sites to create a map of where the players come from, as well as other information. I am as new to this as it gets.
So far I have used this code:
require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'

page = HTTParty.get('https://www.atlutd.com/players')
parse_page = Nokogiri::HTML(page)
players_array = []
# collect the text of every player block
parse_page.css('.player_list.list-reset').css('.row').css('.player_info').each do |a|
  player_info = a.text
  players_array.push(player_info)
end
#CSV.open('atlantaplayers.csv', 'w') do |csv|
# csv << players_array
#end
Pry.start(binding)
The output of the pry function is:
"Miguel Almirón10\nMidfielder\n-\nAsunción, ParaguayAge:\n23\nHT:\n5' 9\"\nWT:\n140\n"
Which when put into the csv creates this in a single cell:
"Miguel Almirón10
Midfielder
-
Asunción, ParaguayAge:
23
HT:
5' 9""
WT:
140
"
I've looked into things and think it is the newline characters (\n) that are throwing off the formatting.
My desired outcome here is to figure out how to get the pry output into an array as follows:
Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140
Bonus points if you can help with the accent marks on names. Also if there is going to be an issue with height, is there a way to convert it to metric?
Thank you in advance!
I've looked into things and think it is the newline characters (\n) that are throwing off the formatting.
Yes, the newlines are why it shows up in this odd format. You can strip the rendered text to remove leading and trailing whitespace and newlines:
player_info = a.text.strip
[1] pry(main)> "Miguel Almirón10\n".strip
=> "Miguel Almirón10"
Note that strip only removes the \n at the start and end of the string, not the ones in the middle. If you wish to store the values in a CSV in this order
Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140
then you need to split the text into separate values and build an array for each row, so that the line pushed to the CSV file looks like this:
csv << ["Miguel", "Almiron", 10, "Midfielder", "Asuncion", "Paraguay", 23, "5'9\"", 140]
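One way to build such a row is to split the scraped text on newlines and pick the pieces apart by position. The sketch below is only an illustration: parse_player and height_to_cm are hypothetical helpers, and it assumes every player block has the same layout as the sample string from the question.

```ruby
raw = "Miguel Almirón10\nMidfielder\n-\nAsunción, ParaguayAge:\n23\nHT:\n5' 9\"\nWT:\n140\n"

# Hypothetical helper: relies on the fields always appearing in this order.
def parse_player(text)
  lines = text.split("\n").map(&:strip).reject(&:empty?)
  # The first line is the name immediately followed by the jersey number.
  name, number = lines[0].match(/\A(.*?)(\d+)\z/).captures
  first, last = name.strip.split(' ', 2)
  # The fourth line is "City, Country" with a trailing "Age:" label.
  city, country = lines[3].sub(/Age:\z/, '').split(', ', 2)
  [first, last, number.to_i, lines[1], city, country, lines[4].to_i, lines[6], lines[8].to_i]
end

# Hypothetical helper: converts a height such as 5' 9" to whole centimetres.
def height_to_cm(height)
  feet, inches = height.scan(/\d+/).map(&:to_i)
  ((feet * 12 + inches) * 2.54).round
end

row = parse_player(raw)
p row
puts height_to_cm(row[7])
```

Positional parsing like this breaks as soon as a player is missing a field, so selecting the individual elements with Nokogiri is more robust when the markup allows it.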
As for the accent marks on names, you can use the transliterate method, which replaces accented characters with their ASCII equivalents:
[8] pry(main)> ActiveSupport::Inflector.transliterate("Miguel Almirón10")
=> "Miguel Almiron10"
See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate. Outside of Rails you will need to load Active Support first (for example, require 'active_support/inflector').
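If pulling in Active Support is not desirable, a rough alternative (a different technique from transliterate) is Ruby's built-in String#unicode_normalize. The sketch below uses a hypothetical strip_accents helper and assumes UTF-8 strings.

```ruby
# Hypothetical helper: decompose accented characters (NFD) and then
# remove the combining marks that the decomposition produces.
def strip_accents(text)
  text.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_accents("Miguel Almirón")
puts strip_accents("Asunción")
```

This only handles characters that decompose into a base letter plus a combining mark; letters such as ø or ß are left unchanged, which transliterate handles better.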
Here's what I would use, with the i18n and people gems:
require 'people'
require "i18n"
I18n.available_locales = [:en]
@np = People::NameParser.new
players_array = []
parse_page.css('.player_info').each do |div|
  # strip the accents, then split the name into its parts
  name = @np.parse I18n.transliterate(div.at('.name a').text)
  players_array << [
    name[:first],
    name[:last],
    div.at('.jersey').text,
    div.at('.position').text,
  ]
end
# => [["Miguel", "Almiron", "10", "Midfielder"],
# ["Mikey", "Ambrose", "22", "Defender"],
# ["Yamil", "Asad", "11", "Forward"],
# ...
That should get you started.

Ruby csv import trouble

I tried to import data from a CSV file in my Rails app, but something went wrong:
CSV::MalformedCSVError in ArticlesController#index
Unclosed quoted field on line 1.
My csv looks like this:
"Код";"№ по каталогу (артикул)";"Наименование товара";"Ед. изм.";"Цена опт.";"Доп.";"Остатки";"Класс";"Группа";"Бренд";"Блок."
2223;15-562-44;15-562-44 (27-B07-F) VW Polo 95-R ;шт ;37,430;;;Амортизаторы ;Амортизаторы BOGE ;;
10327;24-052-1;24-052-1(46-A27-0) LAND ROVER 84- F ;шт ;68,750;;;Амортизаторы ;Амортизаторы BOGE ;;
10328;24-053-1;24-053-1(46-A28-0) LAND ROVER 84- R ;шт ;68,750;;;Амортизаторы ;Амортизаторы BOGE ;;
Maybe this is because of the first line (which has no ;; at the end).
My code looks like this:
def csv_import
require 'csv'
file = File.open("/#{Rails.public_path}/uploads/smallcsv.csv")
#csv = CSV.parse(file)
csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => ";;", :headers => :first_row})
file_path = "/#{Rails.public_path}/uploads/smallcsv.csv"
##parsed_file=CSV::Reader.parse(file_path)
csv.each do |row|
ename = row[2]
eprice = row[5]
eqnt = row[7]
esupp = row[10]
logger.warn(ename)
end
end
I'm running ruby 1.9+ with fastercsv gem
I figured this out myself with the help of "CSV - Unquoted fields do not allow \r or \n (line 2)".
The problem was with the first line, so setting :row_sep => :auto helped me.
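For reference, a minimal sketch of reading semicolon-separated data with automatic row separator detection. It uses the modern csv library rather than FasterCSV, and the inline data is a shortened, hypothetical sample in the same shape as the file above.

```ruby
require 'csv'

# A shortened sample with three of the original columns.
data = <<~DATA
  "Код";"Наименование товара";"Цена опт."
  2223;15-562-44 (27-B07-F) VW Polo 95-R ;37,430
  10327;24-052-1(46-A27-0) LAND ROVER 84- F ;68,750
DATA

# col_sep tells the parser about the semicolons; row_sep: :auto (the default)
# lets it detect the line ending by itself.
table = CSV.parse(data, col_sep: ';', row_sep: :auto, headers: true)
table.each do |row|
  puts "#{row['Наименование товара'].strip}: #{row['Цена опт.']}"
end
```

With headers enabled, fields can be looked up by column name, which avoids counting positions such as row[2] or row[5].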

Parse a malformed CSV line

I am parsing the following CSV lines. I need to rescue malformed lines that look like "Malformed" below. What is a regular expression I can use to do this? What considerations do I need to make?
body = %(
"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,7-24-11,1800,0600,"R2","12323","",""
"Sensitive",2742,107,"Test",2791,112,7-24-11,1800,0600,"R1","","",""
"Sensitive",2700,135,"Test",2792,113,7-24-11,1800,0600,"R1","12110","","")
rows = []
body.each_line do |line|
begin
rows << FasterCSV.parse_line(line)
rescue FasterCSV::MalformedCSVError => e
rows << line if rescue_from_malformed_line(line)
rescue => e
Rails.logger.error(e.to_s)
Rails.logger.info(line)
end
end
I am not sure how malformed your data is, but here is one approach for that line.
> puts line
"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,7-24-11,1800,0600,"R2","12323","",""
>
> puts line.scan /[\d.-]+|(?:"[^"]*"[^",]*)+/
"Sensitive"
2416
159
"Test "Malformed" Failure"
2789
111
7-24-11
1800
0600
"R2"
"12323"
""
""
Note: Tested on ruby 1.9.2p290
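The regular expression above could be used as a fallback when the real parser gives up. The sketch below uses the modern CSV class instead of FasterCSV; parse_leniently and FIELD_PATTERN are hypothetical names.

```ruby
require 'csv'

# The pattern from the answer above: runs of digits, dots and dashes,
# or one or more quoted chunks.
FIELD_PATTERN = /[\d.-]+|(?:"[^"]*"[^",]*)+/

# Hypothetical helper: try the real parser first, fall back to the regex.
def parse_leniently(line)
  CSV.parse_line(line)
rescue CSV::MalformedCSVError
  line.scan(FIELD_PATTERN).map do |field|
    # Drop the outer quotes that the regex keeps around quoted fields.
    quoted = field.length >= 2 && field.start_with?('"') && field.end_with?('"')
    quoted ? field[1..-2] : field
  end
end

bad  = '"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,7-24-11,1800,0600,"R2","12323","",""'
good = '"Sensitive",2742,107,"Test",2791,112,7-24-11,1800,0600,"R1","","",""'
p parse_leniently(bad)
p parse_leniently(good)
```

Note that the pattern skips empty unquoted fields (two commas in a row), so it only suits data where every empty value is written as "".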
You could use a regex to replace the nested double quotes with single quotes before it's passed to the parser.
Something like
.gsub(/(?<!^|,)"(?!,|$)/,"'")
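A sketch of how that replacement could be wired in before parsing. It uses a variant of the pattern above written with character classes, the modern CSV class instead of FasterCSV, and a hypothetical helper name; it assumes that a quote is legitimate only at the start or end of the line or next to a comma.

```ruby
require 'csv'

# Hypothetical helper: turn a double quote into a single quote when it is
# neither next to a comma nor at the start or end of the line.
def replace_inner_quotes(line)
  line.gsub(/(?<=[^,])"(?=[^,\r\n])/, "'")
end

line = '"Sensitive",2416,159,"Test "Malformed" Failure",2789,111,"",""'
cleaned = replace_inner_quotes(line)
puts cleaned
p CSV.parse_line(cleaned)
```

Like the original suggestion, this would also rewrite properly escaped quotes ("") inside a field, so it is only appropriate when the data never uses them.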

Illegal quoting on line using FasterCSV in ruby 1.8.7

I am getting an "Illegal quoting" error when parsing the content of an SQL dump. The dump file is a TXT file with tab (\t) separators.
require 'rubygems'
require 'faster_csv'
begin
FasterCSV.foreach(excel_file, :quote_char => '"',:col_sep =>'\t', :row_sep =>:auto, :headers => :first_row) do |row|
col= row.to_s.split(/\t/)
if col[3]!="" or !col[3].empty?
color_value=col[3].to_s.capitalize
# Insert color
color=Color.find_or_create_by_name(:name=>color_value)
elsif col[3].empty?
color_id= nil
end
end
rescue Exception => e
puts e
end
The program runs successfully on valid data, but when a row contains invalid data like the one below (@font-face ...), execution terminates with the error "Illegal quoting on line 3".
ID Name code comments
1 white 234 good
2 Black 222
3 red 343 @font-face { font-family: "Verdana"; .....}
Can anyone suggest how to skip a row when invalid data occurs in a column?
Thanks in advance.
I'm not sure if this will solve the error you are seeing, but you need to use double quotes around escaped characters, e.g.:
:col_sep => "\t"
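The difference between the two quoting styles can be checked directly in plain Ruby; the variable names below are only illustrative.

```ruby
single = '\t'   # a backslash followed by the letter t
double = "\t"   # one tab character

puts single.length
puts double.length

# Only the double-quoted version actually splits on tabs.
p "a\tb".split(double)
p "a\tb".split(single)
```

With '\t' as col_sep, the parser looks for a literal backslash followed by t, which is why the question's code ends up splitting each row by hand afterwards.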
FasterCSV isn't very kind to badly formatted data, and I don't know that there is a solution for this.
However, if your example file doesn't actually use " for quoting,
then perhaps just use a different quote_char (e.g. ').
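You can use the ASCII NUL character ("\x00") as the quote character, since it should never appear in your data. Note the double quotes, which are needed for the escape sequences to be interpreted:
FasterCSV.foreach(excel_file, :quote_char => "\x00", :col_sep => "\t", :row_sep => :auto, :headers => :first_row) do |row|
  ...
end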
You can find a chart of some ASCII chars here: http://www.bluesock.org/~willg/dev/ascii.html
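Both ideas from this thread (changing the quote character, or skipping rows that fail to parse) can be sketched with the modern CSV class, which replaced FasterCSV in Ruby 1.9 and later. The inline data is a simplified stand-in for the dump file.

```ruby
require 'csv'

# Tab-separated sample; the last row contains stray double quotes.
data = "ID\tName\tcode\tcomments\n" \
       "1\twhite\t234\tgood\n" \
       "2\tBlack\t222\t\n" \
       "3\tred\t343\t{ font-family: \"Verdana\"; }\n"

# Option 1: use NUL as the quote character so that double quotes are
# treated as ordinary data.
table = CSV.parse(data, col_sep: "\t", quote_char: "\x00", headers: true)
names = table.map { |row| row['Name'].capitalize }
p names

# Option 2: keep the default quote character and skip rows that do not parse.
good_rows = data.lines.drop(1).map do |line|
  begin
    CSV.parse_line(line, col_sep: "\t")
  rescue CSV::MalformedCSVError
    nil # skip this row
  end
end.compact
p good_rows.map(&:first)
```

Option 1 keeps every row, including the one with quotes; option 2 silently drops it, so logging the skipped line is usually worthwhile.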
