CSV file encoding in Rails with S3 and Heroku - ruby-on-rails

My Rails app uploads CSV files to S3, then subsequently pulls them down into a tempfile to send each row's data to a Sidekiq worker. I'm using CarrierWave and fog to handle the uploading.
This all worked beautifully until I recently switched to Heroku. Now, when I try to create my tempfile, I get the following error:
Error type Encoding::UndefinedConversionError
Error message "\xA2" from ASCII-8BIT to UTF-8
I've tried setting the encoding both when creating the tempfile and when reading the CSV file, but I keep getting the same error. I cannot reproduce this error on my local machine, which has made this entire process that much more fun :)
Currently, my Sidekiq worker calls the following method:
def upload_csv(filename, file_path)
  file = Tempfile.new(filename, Rails.root.join('tmp'), encoding: "ISO8859-1:utf-8").tap do |f|
    open(file_path).rewind
    f.write(open(file_path).read)
    f.close
  end
  CSV.foreach(file, headers: true, encoding: "ISO8859-1:utf-8") do |row|
    # do stuff to rows
  end
end
I understand the very basics of encoding, but I'm super stuck on this. Any insight would be appreciated.
Thanks!

Not sure if this will help anyone else, but I found a solution that works for me:
def upload_csv(filename, file_path)
  file = Tempfile.new(filename, Rails.root.join('tmp')).tap do |f|
    open(file_path).rewind
    # Relabel the downloaded bytes as UTF-8 before writing them out
    f.write(open(file_path).read.force_encoding('utf-8'))
    f.close
  end
  CSV.foreach(file, headers: true) do |row|
    # do stuff to rows
  end
end
Even though I could confirm that the file was UTF-8 encoded before it was uploaded, open(file_path).read.encoding was returning ASCII-8BIT. Ruby was getting confused about how to write the file, and tried to convert it from ASCII-8BIT to UTF-8.
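The difference between relabelling bytes and transcoding them can be sketched roughly as follows. This is only an illustration, not the original code: the sample bytes and the helper name relabel_as_utf8 are assumptions made up for the example.

```ruby
# Hypothetical helper: relabel raw bytes as UTF-8 without changing them.
def relabel_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# Sample data (an assumption): the UTF-8 bytes of a cent sign, tagged as
# binary (ASCII-8BIT), the way a downloaded body may arrive.
raw = "\xC2\xA2".b

# encode attempts a real ASCII-8BIT -> UTF-8 conversion, which is
# undefined for bytes above 0x7F, so it raises.
transcode_error =
  begin
    raw.encode(Encoding::UTF_8)
    nil
  rescue Encoding::UndefinedConversionError => e
    e.class
  end

# force_encoding only changes the label; the bytes stay the same.
relabelled = relabel_as_utf8(raw)

puts raw.encoding
puts transcode_error.inspect
puts relabelled.encoding
puts relabelled.valid_encoding?
```

Relabelling is only safe when the bytes really are UTF-8. If valid_encoding? returns false afterwards, the data is in some other encoding and needs to be transcoded from that encoding instead.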

Related

How to write a Tempfile as binary

I'm trying to write a string (an unzipped file) to a Tempfile like this:
temp_file = Tempfile.new([name, extension])
temp_file.write(unzipped_io.read)
This throws the following error when I do it with an image:
Encoding::UndefinedConversionError - "\xFF" from ASCII-8BIT to UTF-8
While researching this, I found that it happens because Ruby writes files with a default encoding (UTF-8). The file should instead be written as binary, so that no encoding conversion is applied.
With a regular File you could do this as follows:
File.open('/tmp/test.jpg', 'wb') do |file|
  file.write(unzipped_io.read)
end
How can I do this with Tempfile?
Tempfile.new passes options to File.open which accepts the options from IO.new, in particular:
:binmode
If the value is truth value, same as “b” in argument mode.
So to open a tempfile in binary mode, you'd use:
temp_file = Tempfile.new([name, extension], binmode: true)
temp_file.binmode? #=> true
temp_file.external_encoding #=> #<Encoding:ASCII-8BIT>
In addition, you might want to use Tempfile.create which takes a block and automatically closes and removes the file afterwards:
Tempfile.create([name, extension], binmode: true) do |temp_file|
  temp_file.write(unzipped_io.read)
  # ...
end
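As a rough, self-contained sketch of the block form above: the helper name round_trip_binary, the default file name, and the sample bytes are all assumptions for illustration only.

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a binary-mode tempfile and
# return what is read back from disk. The file is closed and removed
# when the block finishes.
def round_trip_binary(bytes, name: "sample", extension: ".bin")
  Tempfile.create([name, extension], binmode: true) do |temp_file|
    temp_file.write(bytes)
    temp_file.flush               # push buffered bytes to disk
    File.binread(temp_file.path)  # read back without any transcoding
  end
end

# Sample data (an assumption): the first bytes of a typical JPEG file.
jpeg_header = "\xFF\xD8\xFF\xE0".b
puts round_trip_binary(jpeg_header) == jpeg_header
```

Because the file is opened with binmode: true, the bytes are written as they are, whatever encoding settings the application has configured.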
I have encountered the solution in an old Ruby forum post, so I thought I would share it here, making it easier for people to find:
https://www.ruby-forum.com/t/ruby-binary-temp-file/116791
Apparently Tempfile has a binmode method (it is not listed in the Tempfile docs because it is delegated to the underlying File), which changes the writing mode to binary and thus avoids any encoding issues:
temp_file = Tempfile.new([name, extension])
temp_file.binmode
temp_file.write(unzipped_io.read)
Thanks, unknown person who mentioned it on ruby-forum.com in 2007!
Another alternative is IO.binwrite(path, file_content)

Rails Import/Parse from CSV UTF-8 Missing Column

So I'm working on allowing users to import data from a CSV file. Right now all the fields will import correctly, except whatever is the first field.
What I've discovered is the file type is affecting the import.
My code looks like:
class Import < Operation
  require 'csv'

  def call(file, training_event_id)
    csv_data = CSV.parse(file.read, headers: true)
    list_occo = []
    csv_data.each do |row|
      occupant = Occupant.new
      occupant.account_number = row['Account Number']
      occupant.check_in = row['Check In']
      binding.pry
      occupant.training_event_id = training_event_id
      list_occo << occupant
    end
    binding.pry
    occo_errors = check_file(list_occo)
    list_occo.each(&:save) if occo_errors.empty?
    return occo_errors
  end
end
When I hit the binding.pry and inspect occupant, the account number is nil if the file was saved as CSV UTF-8. If I switch to plain CSV, it's not an issue. Is there a way to convert/switch a CSV UTF-8 to CSV? I tried passing an encoding to the parse, like encoding: 'iso-8859-1', but that didn't work.
Is there a way to convert the CSV UTF-8 or is there a way to do a straight up file format check to ensure it's CSV and not CSV UTF-8?
Just in case someone comes across this issue in the future. I looked at the file in the rails console using CSV.read(file.path) and noticed U+FEFF preceding the first column header. There's a rabbit hole of information about BOM and UTF-8 issues. Without wanting to do a CSV/File.open I attempted things like doing a split, gsub, file checks on utf-8, etc. Then I simply changed the csv_data line to be:
csv_data = CSV.parse(File.read(file, encoding: 'bom|utf-8'), headers: true)
Then in my controller I updated it from (params[:file]) to (params[:file].path), as I was getting this error:
no implicit conversion of ActionDispatch::Http::UploadedFile into String
Hopefully this helps someone else.
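A small sketch of the symptom and the fix described above. The helper name parse_csv_without_bom, the temp file name, and the sample row are assumptions for illustration; only the 'bom|utf-8' encoding option comes from the answer.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: read a CSV file, letting Ruby drop a leading
# UTF-8 BOM via the "bom|utf-8" encoding option, then parse it.
def parse_csv_without_bom(path)
  CSV.parse(File.read(path, encoding: "bom|utf-8"), headers: true)
end

bom_lookup = nil
clean_lookup = nil

Tempfile.create(["occupants", ".csv"], binmode: true) do |f|
  # Sample file (an assumption): U+FEFF in front of the first header.
  f.write("\uFEFFAccount Number,Check In\n42,2020-01-01\n")
  f.flush

  # Plain read: the BOM stays glued to the first header name, so a
  # lookup by the bare header name misses.
  with_bom = CSV.parse(File.read(f.path, encoding: "utf-8"), headers: true)
  bom_lookup = with_bom.first["Account Number"]

  # BOM-aware read: the first header is the bare name again.
  clean_lookup = parse_csv_without_bom(f.path).first["Account Number"]
end

puts bom_lookup.inspect
puts clean_lookup.inspect
```

Only the first column is affected because the BOM sits at the very start of the file, in front of the first header.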

Parse remote csv on Rails 4

I keep getting the error file name is too long.
I am running rails on Heroku so I am trying to have an uploaded file saved on cloud, and then imported so it is not lost on their dyno.
I want to create a new object for each row in the csv. Parsing the CSV has worked perfectly before in development when using a temp file. But I have to change this for Heroku.
What is wrong about my code for the remote csv being parsed correctly?
def self.import_open_order(file_url)
  open(file_url) do |file|
    CSV.parse(self.parse_headers(file.read), headers: true) do |row|
      ...
This fixed it:
def self.import_open_order(file)
  imported_file = open(file)
  CSV.parse(self.parse_headers(imported_file), headers: true) do |row|
Since open(file).class = Tempfile... I was able to just create the Tempfile and pass it through CSV.parse
I swear I had already tried this but now it works!

MalformedCSVError with rails CSV (FasterCSV)

I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
  @survey = Survey.new(params[:survey])
  # Now we need to try and convert this to UTF-8 if it isn't already
  encoded = File.read(@survey.survey_data.current_path)
  encoding = CharlockHolmes::EncodingDetector.detect(encoded)
  # We've got a guess at the encoding,
  # so we can try and convert it but it
  # may still fail so we need to handle
  # that
  begin
    re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
    re_encoded = re_encoded.gsub(/\r\n?/, "\n")
    # Now replace the uploaded file
    File.open(@survey.survey_data.current_path, 'w') { |f|
      f.write(re_encoded)
    }
  rescue ArgumentError
    puts "UH OH!!!!!"
  end
  puts "#{@survey.survey_data.current_path}"
  @parsed = CSV.read(@survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"
Well that was irritating!
Turns out the file had a BOM, which was causing the CSV parser to break. Loading the file with the following (where encoding stands for the file's actual encoding)
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed at how long it took to track down, but it's now working, and with no need to convert to UTF-8 either!
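For illustration, a minimal sketch of that call against a file whose quoted header row starts with a BOM. The helper name read_rows_skipping_bom, the sample rows, and the choice of utf-8 as the encoding are assumptions, not details from the original post.

```ruby
require "csv"
require "tempfile"

# Hypothetical helper: read every row of a CSV file, asking Ruby to skip
# a leading BOM. "utf-8" is an assumed default; pass the real encoding.
def read_rows_skipping_bom(path, encoding: "utf-8")
  CSV.open(path, "rb:bom|#{encoding}") { |csv| csv.read }
end

rows = Tempfile.create(["survey", ".csv"], binmode: true) do |f|
  # Sample file (an assumption): a BOM directly before a quoted header.
  f.write("\uFEFF\"Survey\",\"RD\"\n\"s1\",\"x\"\n")
  f.flush
  read_rows_skipping_bom(f.path)
end

puts rows.first.inspect
```

The bom| prefix is only accepted together with a Unicode encoding such as utf-8 or utf-16le, so it works when the file really is Unicode with a BOM.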

Spreadsheet - encoding problem with reading cyrillic characters

I'm working on a rails app for a small shop. It needs to load an .xls file, parse it and maybe load to the database.
I use Spreadsheet gem to work with the file.
The problem is that the file contains russian characters which are displayed as "└ÛÛ.ExT H-1727F (ÓÝÓÙ¯Ò GP T304)"
The reference says, I need to specify the encoding, but I don't know which one is used in this file. I tried "win-1251" but it gave me an error about being unable to find a "utf-8 to win-1251 converter"
I then tried setting the encoding to "WINDOWS-1251", but it gave me this error:
U+00BE to WINDOWS-1251 in conversion from CP850 to UTF-8 to WINDOWS-1251
So then I tried CP850, which didn't throw an error, but the characters were still not readable.
There's not much code really.
# -*- encoding : utf-8 -*-
...
def show
  require 'spreadsheet'
  Spreadsheet.client_encoding = 'UTF-8'
  book = Spreadsheet.open 'c:\rails\renergy23\public\price-16-04-11.xls'
  @sheet = book.worksheet 0
end
For simplicity I don't load it into the database right now. Instead I output it in my view:
- 30.times do |i|
  = @sheet.row i+10
  %br
http://dl.dropbox.com/u/4976861/price-16-04-11.xls
I kinda solved this after 1.5 months by first saving the document in .xlsx and then saving it in .xls (97-2003). I couldn't use the .xlsx because of some weird OLE signature incorrect error.
