CSV - Unquoted fields do not allow \r or \n (line 2) - ruby-on-rails

Trying to parse a CSV file, but I keep getting the error message Unquoted fields do not allow \r or \n (line 2).
I found a similar topic here on SO, which suggested the following:
CSV.open('file.csv', :row_sep => "\r\n") do |csv|
but unfortunately this doesn't work for me... I can't change the CSV file, so I need to fix it in the code.
EDIT sample of CSV file:
A;B;C
1234;...
Is there any way to do it?
Many thanks!

First of all, you should set your column delimiter to ';', since that is not the separator CSV expects by default. This worked for me:
CSV.open('file.csv', :row_sep => :auto, :col_sep => ";") do |csv|
csv.each { |a,b,c| puts "#{a},#{b},#{c}" }
end
From the 1.9.2 CSV documentation:
Auto-discovery reads ahead in the data looking for the next \r\n,
\n, or \r sequence. A sequence will be selected even if it occurs
in a quoted field, assuming that you would have the same line endings
there.

A simpler solution, if the CSV was touched or saved by a program that may have applied odd formatting (such as Excel or a spreadsheet app):
Open the file with any plain text editor (I used Sublime Text 3)
Press the enter key to add a new line anywhere
Save the file
Remove the line you just added
Save the file again
Try the import again; the error should be gone
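If you would rather do the equivalent in code, here is a minimal sketch (assuming the file is small enough to read into memory and that stray \r characters are the culprit; the ';' separator matches the sample above):
require 'csv'

# Read the raw file, normalize Windows/old-Mac line endings to plain \n,
# then parse the cleaned string instead of the file on disk.
raw = File.read('file.csv')
normalized = raw.gsub(/\r\n?/, "\n")

CSV.parse(normalized, :col_sep => ';') do |row|
  p row
end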

In my case I was importing a LinkedIn CSV and got the error.
I removed the blank lines like this:
def import
  csv_text = File.read('filepath', :encoding => 'ISO-8859-1')
  # remove blank lines from the LinkedIn export
  csv_text = csv_text.gsub(/^$\n/, '')
  csv = CSV.parse(csv_text, :headers => true, :skip_blanks => true)
end

In my case I had to provide the encoding, and a quote character that was guaranteed not to occur in the data:
CSV.read("file.txt", 'rb:bom|UTF-16LE', {:row_sep => "\r\n", :col_sep => "\t", :quote_char => "\x00"})

I realize this is an old post, but I recently ran into a similar issue with a badly formatted CSV file that failed to parse with the standard Ruby CSV library.
I tried the SmarterCSV gem, which parsed the file in no time. It's an external library, so it might not be the best solution for everyone, but it beats parsing the file myself.
opts = { col_sep: ';', file_encoding: 'iso-8859-1', skip_lines: 5 }
SmarterCSV.process(file, opts).each do |row|
p row[:someheader]
end

Please see this thread: Unquoted fields do not allow \r or \n
Solution: read the file, strip the carriage returns, then parse the resulting string:
contents = File.read('file.csv').gsub("\r", '')
CSV.parse(contents) do |row|

In my case, the first row of the spreadsheet/CSV was a double-quoted bit of introduction text. The error I got was:
/Users/.../.rvm/rubies/ruby-2.3.0/lib/ruby/2.3.0/csv.rb:1880:in `block (2 levels) in shift': Unquoted fields do not allow \r or \n (line 1). (CSV::MalformedCSVError)
I deleted the quoted comment so the .csv contained ONLY the CSV data, saved it, and my program ran with no errors.
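If editing the file by hand isn't an option, here is a hedged sketch of the same idea in code: drop the offending first line before handing the rest to CSV (the file name is illustrative, and this assumes the introduction text occupies exactly one line):
require 'csv'

lines = File.readlines('data.csv')
lines.shift                              # discard the quoted introduction line
CSV.parse(lines.join, :headers => true).each do |row|
  p row
end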

If you have to deal with files coming from Excel with newlines inside cells, there is also a solution.
The big disadvantage of this approach is that strings may not contain semicolons (or, in the other variant, double quotes).
I chose to go with no semicolons:
if file.respond_to?(:read)
  csv_contents = file.read
elsif file.respond_to?(:path)
  csv_contents = File.read(file.path)
else
  logger.error "Bad file: #{file.class.name}: #{file.inspect}"
  return false
end

csv_contents = csv_contents.force_encoding("iso-8859-1").encode('utf-8') # In my case the files are Latin-1...

# Here is the important part (remove all newlines between quotes).
# result starts out non-nil so the loop runs; sub! returns nil once nothing is left to replace.
result = "string"
while !result.nil?
  result = csv_contents.sub!(/(\"[^\;]*)[\n\r]([^\;]*\")/) { $1 + ", " + $2 }
end

CSV.parse(csv_contents, :headers => false, :row_sep => :auto, :col_sep => ";") do |row|
  # do whatever
end
For me this solution works fine; if you deal with large files you could run into problems with it.
If you want to go with no quotes instead, just replace the semicolons in the regex with quotes.

Another simple solution to fix the weird formatting caused by Excel is to copy and paste the data into a Google spreadsheet and then download it as a CSV.

Related

Rails Import/Parse from CSV UTF-8 Missing Column

So I'm working on allowing users to import data from a CSV file. Right now all the fields import correctly, except whichever field comes first.
What I've discovered is that the file type is affecting the import.
My code looks like:
class Import < Operation
  require 'csv'

  def call(file, training_event_id)
    csv_data = CSV.parse(file.read, headers: true)
    list_occo = []
    csv_data.each do |row|
      occupant = Occupant.new
      occupant.account_number = row['Account Number']
      occupant.check_in = row['Check In']
      binding.pry
      occupant.training_event_id = training_event_id
      list_occo << occupant
    end
    binding.pry
    occo_errors = check_file(list_occo)
    list_occo.each(&:save) if occo_errors.empty?
    return occo_errors
  end
end
When I do the binding.pry and check on occupant, I get nil for the Account Number when the file is saved as CSV UTF-8. If I switch to plain CSV there's no issue. Is there a way to convert/switch a CSV UTF-8 file to CSV? I tried using some sort of encoding on the parse, like encoding: 'iso-8859-1', but that didn't work.
Is there a way to convert the CSV UTF-8 or is there a way to do a straight up file format check to ensure it's CSV and not CSV UTF-8?
Just in case someone comes across this issue in the future: I looked at the file in the Rails console using CSV.read(file.path) and noticed U+FEFF preceding the first column header. There's a rabbit hole of information about BOM and UTF-8 issues. Without wanting to do a CSV/File.open, I attempted things like doing a split, gsub, file checks on UTF-8, etc. Then I simply changed the csv_data line to be:
csv_data = CSV.parse(File.read(file, encoding: 'bom|utf-8'), headers: true)
Then in my controller I updated it from (params[:file]) to (params[:file].path), as I was getting an error of
no implicit conversion of ActionDispatch::Http::UploadedFile into String
Hopefully this helps someone else.
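As an alternative (just a sketch, not what the original poster used): read the upload yourself, strip a leading BOM from the string, and parse as usual. Here file is assumed to be the uploaded file object from the question.
require 'csv'

text = File.read(file.path)       # default external encoding, normally UTF-8
text.sub!(/\A\uFEFF/, '')         # drop a leading byte order mark if one is present
csv_data = CSV.parse(text, headers: true)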

Is there a way to clean a file of "invalid byte sequence in UTF-8" errors in Ruby?

I have a service that uploads data to our database via XML feeds provided by customers. Often these XML files are claimed to be UTF-8 encoded, but they clearly have quite a few invalid byte sequences. I can clean up these files and import them perfectly into our database by simply running the following Linux command before importing:
tr -cd '^[:print:]' < original.xml > clean.xml
Simply running this one Linux command allows me to import all of the data into my database using Nokogiri in Ruby on Rails.
The problem is that we're deploying on Heroku, and I can't preprocess the file with a Linux command. I've spent the last week searching the Internet for native Ruby on Rails based solutions to this problem, but none of them work. Before I run through all the suggestions I've tried, here is my original code:
require 'nokogiri'

data_source = ARGV[0]
data_file = open(data_source)
data_string = data_file.read
doc = Nokogiri::XML.parse(data_string)

data = []
doc.xpath(".//job").each do |node|
  hash = node.element_children.each_with_object(Hash.new) do |e, h|
    h[e.name.gsub(/ /, "_").strip.downcase.to_sym] = e.content
  end
  data.push(hash)
end
Running this on the raw file produces an error:
"Invalid byte sequence in UTF-8"
Here are all the helpful suggestions I've tried but all have failed.
Use Coder
Coder.clean!(data_string, "UTF-8")
Force Encoding
data_string.force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
Convert to UTF-16 and back to UTF-8
data_string.encode!('UTF-16', 'UTF-8', :invalid => :replace, :replace => '')
data_string.encode!('UTF-8', 'UTF-16')
Use valid_encoding?
data_string.chars.select{|i| i.valid_encoding?}.join
No characters are removed; generates "invalid byte sequence" errors.
Specify encoding on opening the file
I actually wrote a function that tries every encoding possible until it can open the file without errors and convert to UTF-8 (@file_encodings is an array of every possible file encoding):
@file_encodings.each do |enc|
  print "#{enc}..."
  conv_str = "r:#{enc}:utf-8"
  begin
    data_file = File.open(fname, conv_str)
    data_string = data_file.read
  rescue
    data_file = nil
    data_string = ""
  end
  data_string = data_string.encode(enc, :invalid => :replace, :undef => :replace, :replace => "")
  unless data_string.blank?
    print "\n#{enc} detected!\n"
    return data_string
  end
end
Use Regexp to remove non-printables:
data_string.gsub!(/[^[:print:]]/,"")
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
(I also tried variants including /[^a-zA-Z0-9~`!@#$%^&*()-_=+[{]}\|;:'",<.>/\?]/)
For all of the above, the results are the same: either "invalid byte sequence" errors occur or the file is cut off halfway through after reading only 4400 rows.
So why does the Linux tr command work perfectly, and yet NONE of these suggestions can do the job in Ruby on Rails?
What I ended up doing is extremely inelegant, but it gets the job done. I inspected each row that stopped Nokogiri (row.last) and looked for strange characters. Each one I found I added to a character class and then gsub!ed it out, like this (the control characters won't print here, but you get the idea):
data_string.gsub!(/[Crazy Control Characters]/,"")
But the purist in me insists there should be a more elegant, general solution.
Ruby 2.1 has a new method called String#scrub, which is exactly what you need.
If the string is invalid byte sequence then replace invalid bytes with
given replacement character, else returns self. If block is given,
replace invalid bytes with returned value of the block.
Check the documentation for more information.
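For example, a quick sketch of how scrub could slot into the code from the question (requires Ruby 2.1+; data_source is the path from ARGV[0] as above):
require 'nokogiri'

data_string = File.read(data_source)    # read the raw feed
clean_string = data_string.scrub('')    # drop invalid byte sequences instead of replacing them
doc = Nokogiri::XML.parse(clean_string)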
I found this on Stack Overflow for some other question, and it too worked fine for me. Assuming data_string is your XML:
data_string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Thanks for the responses. I did find something that works by testing all sorts of combinations of different tools. I hope this is helpful to other people who have shared the same frustration.
data_string.encode!("UTF-8", "UTF-8", invalid: :replace, undef: :replace, replace: "" )
data_string.gsub!(/[[:cntrl:]&&[^\n\r]]/,"")
As you can see, it's a combination of the "encode" method and a regexp to remove control characters (except for newlines).
My testing revealed that the file I was importing had TWO problems: (1) invalid UTF-8 byte sequences; and (2) unprintable control characters that forced Nokogiri to stop parsing before the end of the file. I had to fix both problems, in that order, otherwise gsub! throws the "invalid byte sequence" error.
Note that the first line in the code above could be substituted with EITHER of the following with the same successful result:
Coder.clean!(data_string,'UTF-8')
or
data_string.scrub!("")
This worked perfectly for me.
Try using a combination of force_encoding("ISO-8859-1") and encode("utf-8"):
data_string.force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
This helped me once.

How to convert ascii-8bit file data to readable string

I'm trying to parse the contents of a CSV file (saved on Windows Excel then uploaded to dropbox) from my dropbox via the Dropbox Core api.
I created a rake task (part of a Rails app) with the following code; it creates a magnum-opus.csv file on my local hard drive that contains the original text from the Excel file. Calling contents.encoding shows the encoding of contents is ASCII-8BIT.
contents, metadata = client.get_file_and_metadata('/magnum-opus.csv')
open('magnum-opus.csv', 'w') {|f| f.puts contents }
Instead of creating a local file, I'd like to convert the binary data in "contents" to readable text on the fly and parse through it. I don't want to save it anywhere and then have to open it.
How do I go about doing that?
If I do
p contents
I end up getting some type of unreadable data format ... \x00e\x00d\x00u
1) How do I convert this into a string I can parse through with Ruby?
2) The other thing I'm wondering: if I do
puts contents
The original text in the CSV file that is human readable is output to STDOUT. What is puts doing?
I tried calling CSV.parse on contents.encode("UTF-8", "binary", :invalid => :replace, :undef => :replace, :replace => '') but ended up getting an error such as
CSV::MalformedCSVError: Unquoted fields do not allow \r or \n
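For what it's worth, interleaved \x00 bytes like that are characteristic of UTF-16LE text that has been tagged as ASCII-8BIT. A sketch under that assumption (the actual encoding of the Dropbox file may differ, so treat this as a guess to verify):
require 'csv'

# contents comes back from the Dropbox API tagged ASCII-8BIT (raw bytes).
text = contents.force_encoding('UTF-16LE')   # assumption: the bytes are really UTF-16LE
text = text.encode('UTF-8')                  # transcode so CSV and String methods can use it
text.sub!(/\A\uFEFF/, '')                    # drop a byte order mark if Excel left one behind
CSV.parse(text, :headers => true) do |row|
  p row
end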

Rails 3, check CSV file encoding before import

In my app (Rails 3.0.5, Ruby 1.8.7), I created an import tool to import CSV data from file.
Problem: I asked my users to export the CSV file from Excel in UTF-8 encoding, but most of the time they don't do it.
How can I verify that the file is UTF-8 before importing? Otherwise the import will run but give strange results. I use FasterCSV to import.
Example of a bad CSV file:
;VallÈe du RhÙne;CÙte Rotie;
Thanks.
You can use Charlock Holmes, a character encoding detecting library for Ruby.
https://github.com/brianmario/charlock_holmes
To use it, you just read the file, and use the detect method.
contents = File.read('test.xml')
detection = CharlockHolmes::EncodingDetector.detect(contents)
# => {:encoding => 'UTF-8', :confidence => 100, :type => :text}
You can also convert the encoding to UTF-8 if it is not in the correct format:
utf8_encoded_content = CharlockHolmes::Converter.convert contents, detection[:encoding], 'UTF-8'
This saves users from having to do it themselves before uploading it again.
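Putting the two together for an import like the one in the question (a sketch: uploaded_file is a stand-in for however you receive the file, and the FasterCSV call mirrors the asker's semicolon-separated data rather than anything Charlock Holmes requires):
require 'charlock_holmes'
require 'fastercsv'

contents  = File.read(uploaded_file.path)
detection = CharlockHolmes::EncodingDetector.detect(contents)

# Convert to UTF-8 only when the detector says the file is something else.
unless detection[:encoding] == 'UTF-8'
  contents = CharlockHolmes::Converter.convert(contents, detection[:encoding], 'UTF-8')
end

FasterCSV.parse(contents, :col_sep => ';') do |row|
  # import row ...
end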
For Ruby 1.9 it's straightforward: you just tell it to expect UTF-8, and it will raise an error if the file isn't:
begin
lines = CSV.read('bad.csv', :encoding => 'utf-8')
rescue ArgumentError
puts "My users don't listen to me!"
end

Headers on the second row in FasterCSV?

G'day guys, I'm currently using FasterCSV to parse a CSV file in Ruby, and I'm wondering how to get rid of the initial row of data in the CSV (the initial row contains time/date information generated by another software package).
I tried using FasterCSV.table and then deleting row(0), then converting it to a CSV document and parsing it, but the row was still present in the document.
Any other ideas?
fTable = FasterCSV.table("sto.csv", :headers => true)
fTable.delete(0)
Three suggestions
Can you get FasterCSV to ignore the line?
You could use the :return_headers => true option to skip over the bad line. That'll work great if the second line isn't the real header. See here for more
:return_headers: When false, header rows are silently swallowed. If set to true, header rows are returned in a FasterCSV::Row object with identical headers and fields (save that the fields do not go through the converters).
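A minimal sketch of that approach (only useful when, as noted above, the line you want to drop is the one FasterCSV is already treating as the header row):
require 'fastercsv'

FasterCSV.foreach("sto.csv", :headers => true, :return_headers => true) do |row|
  next if row.header_row?   # skip the timestamp line that FasterCSV took as the header
  # process row ...
end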
Chop the line off with another tool
You don't need to use Ruby for this: how about chopping the file with one of the one-liner solutions suggested here? You can call the one-liners from Ruby using the system method.
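For instance, something along these lines (sto_clean.csv is just an illustrative name; tail -n +2 emits everything from the second line onward):
require 'fastercsv'

# Strip the first (timestamp) line with a shell one-liner, then parse the cleaned copy.
system("tail -n +2 sto.csv > sto_clean.csv")
fTable = FasterCSV.table("sto_clean.csv", :headers => true)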
Max Flexibility - parse the file line by line with FasterCSV
Have you considered reading the file directly, skipping the first line and then accepting or rejecting lines? Deep in the heart of my code is this parse method which treats the file as a series of lines, accepting or rejecting each. You could do something similar but skip over the first row.
The neat thing is that you get to determine which rows are acceptable by defining your own acceptable? method - only valid CSV data is passed to acceptable? the rest are thrown away in response to the exception.
def parse(file)
  #
  # Parse data
  #
  row = []
  file.each_line do |line|
    the_line = line.chomp
    begin
      row = FasterCSV.parse_line(the_line)
      ok, message = acceptable?(row)
      if not ok
        reject(file.lineno, the_line, message)
      else
        accept(row, the_line)
      end
    rescue FasterCSV::MalformedCSVError => e
      reject(file.lineno, the_line, e.to_s)
    end
  end
end
Hi, I'm doing just that with some data from the Australian Electoral Commission. The file in question has a date string on the first line and headers on the second:
require 'csv'
require 'open-uri'
filename = "http://results.aec.gov.au/15508/Website/Downloads/SenateGroupVotingTicketsDownload-15508.csv"
file = File.open(open(filename))
first_line = file.readline
CSV.parse(file, headers: true).each do |row|
puts row["State"]
end
I presume the file I quote still exists, but it can be replaced by the file in question. If you need to skip more rows, you have to call file.readline that number of times.
According to the docs, fTable = FasterCSV.table("sto.csv", :return_headers => false) should do what you want; .table implies :headers => true. The docs have this info.
