Can't parse files uploaded in UTF-16 on Heroku - ruby-on-rails

I'm writing a Rails app that allows the user to upload TSV (Tab-separated values) files to be parsed on the server. Those files are encoded in UTF-16. All goes fine in local, but when I try to open the file with such encoding on Heroku, I'm getting a warning that says warning: Unsupported encoding utf-16 ignored. When later I try to read such file, it obviously fails stating invalid byte sequence in UTF-8. See an excerpt of the code below:
File.open(params[:batch_import][:file].path, 'r:utf-16') do |f|
#recipients = Recipient.from_tsv(f.read)
end
Is there any walkaround that I can do?

UTF-16 files must be opened with the binary mode. Try this:
File.open(params[:batch_import][:file].path, 'rb:utf-16') do |f|
#recipients = Recipient.from_tsv(f.read)
end

Related

Ruby: Is there a way to specify your encoding in File.write?

TL;DR
How would I specify the mode of encoding on File.write, or how would one save image binary to a file in a similar fashion?
More Details
I'm trying to download an image from a Trello card and then upload that image to S3 so it has an accessible URL. I have been able to download the image from Trello as binary (I believe it is some form of binary), but I have been having issues saving this as a .jpeg using File.write. Every time I attempt that, it gives me this error in my Rails console:
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
from /app/app/services/customer_order_status_notifier/card.rb:181:in `write'
And here is the code that triggers that:
def trello_pics
#trello_pics ||=
card.attachments.last(config_pics_number)&.map(&:url).map do |url|
binary = Faraday.get(url, nil, {'Authorization' => "OAuth oauth_consumer_key=\"#{ENV['TRELLO_PUBLIC_KEY']}\", oauth_token=\"#{ENV['TRELLO_TOKEN']}\""}).body
File.write(FILE_LOCATION, binary) # doesn't work
run_me
end
end
So I figure this must be an issue with the way that File.write converts the input into a file. Is there a way to specify encoding?
AFIK you can't do it at the time of performing the write, but you can do it at the time of creating the File object; here an example of UTF8 encoding:
File.open(FILE_LOCATION, "w:UTF-8") do
|f|
f.write(....)
end
Another possibility would be to use the external_encoding option:
File.open(FILE_LOCATION, "w", external_encoding: Encoding::UTF_8)
Of course this assumes that the data which is written, is a String. If you have (packed) binary data, you would use "wb" for openeing the file, and syswrite instead of write to write the data to the file.
UPDATE As engineersmnky points out in a comment, the arguments for the encoding can also be passed as parameter to the write method itself, for instance
IO::write(FILE_LOCATION, data_to_write, external_encoding: Encoding::UTF_8)

How to write a Tempfile as binary

When trying to write a string / unzipped file to a Tempfile by doing:
temp_file = Tempfile.new([name, extension])
temp_file.write(unzipped_io.read)
Which throws the following error when I do this with an image:
Encoding::UndefinedConversionError - "\xFF" from ASCII-8BIT to UTF-8
When researching it I found out that this is caused because Ruby tries to write files with an encoding by default (UTF-8). But the file should be written as binary, so it ignores any file specific behavior.
Writing regular File you would be able to do this as following:
File.open('/tmp/test.jpg', 'rb') do |file|
file.write(unzipped_io.read)
end
How to do this in Tempfile
Tempfile.new passes options to File.open which accepts the options from IO.new, in particular:
:binmode
If the value is truth value, same as “b” in argument mode.
So to open a tempfile in binary mode, you'd use:
temp_file = Tempfile.new([name, extension], binmode: true)
temp_file.binmode? #=> true
temp_file.external_encoding #=> #<Encoding:ASCII-8BIT>
In addition, you might want to use Tempfile.create which takes a block and automatically closes and removes the file afterwards:
Tempfile.create([name, extension], binmode: true) do |temp_file|
temp_file.write(unzipped_io.read)
# ...
end
I have encountered the solution in an old Ruby forum post, so I thought I would share it here, making it easier for people to find:
https://www.ruby-forum.com/t/ruby-binary-temp-file/116791
Apparently Tempfile has an undocumented method binmode, which changes the writing mode to binary and thus ignoring any encoding issues:
temp_file = Tempfile.new([name, extension])
temp_file.binmode
temp_file.write(unzipped_io.read)
Thanks unknown person who mentioned it on ruby-forums.com in 2007!
Another alternative is IO.binwrite(path, file_content)

RoR - Download ZIP file using gmail gem

I am trying to set up some rake tasks. It requires me to connect to gmail and download a Zip file which is sent as an attachment.
I have written the following code(which works fine for downloading csv) -
gmail = Gmail.connect(ENV["USERNAME"], ENV["PASSWORD"])
msg = gmail.inbox.find(from: ENV["REC_USER"],
subject: args[:subject])
dir_path = "lib/mfu_payment_data/"
Dir.mkdir dir_path unless File.exists?(dir_path)
if msg.first
msg.first.attachments.each do |attachment|
File.write(File.join(dir_path,attachment.filename),attachment.body.decoded)
end
end
It throws the following error -
rake aborted!
Encoding::UndefinedConversionError: "\xED" from ASCII-8BIT to UTF-8
I assume that this has got something to do with the attachment.body.decoded, but I do not know how else to do this.
You can try writing the file in binary mode:
File.open('/path/to/file;, 'wb') { |file| file.write(attachment.body.decoded) }
"b" Binary file mode
Suppresses EOL <-> CRLF conversion on Windows. And
sets external encoding to ASCII-8BIT unless explicitly
specified.
The modes are described in the IO class which File inherits from.
I think you have many option to generate zip file
Download and unzip

MalformedCSVError with rails CSV (FasterCSV)

I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
#survey = Survey.new(params[:survey])
# Now we need to try and convert this to UTF-8 if it isn't already
encoded = File.read(#survey.survey_data.current_path)
encoding = CharlockHolmes::EncodingDetector.detect(encoded)
# We've got a guess at the encoding,
# so we can try and convert it but it
# may still fail so we need to handle
# that
begin
re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
re_encoded = re_encoded.gsub(/\r\n?/, "\n")
# Now replace the uploaded file
File.open(#survey.survey_data.current_path, 'w') { |f|
f.write(re_encoded)
}
rescue ArgumentError
puts "UH OH!!!!!"
end
puts "#{#survey.survey_data.current_path}"
#parsed = CSV.read(#survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"
Well that was irritating!
Turns out the file had a BOM which was causing the CSV parser to break. Loading the file with
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed how long it took to track down but it's now working and with no need to convert to UTF-8 now either!

Character Encoding issue in Rails v3/Ruby 1.9.2

I get this error sometimes "invalid byte sequence in UTF-8" when I read contents from a file. Note - this only happens when there are some special characters in the string. I have tried opening the file without "r:UTF-8", but still get the same error.
open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error
Contents of the file:
# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works
This is the CSV file I got from outside and I am trying to import it into my DB, it did not come with "# encoding: UTF-8" at the top, but I added this since I read somewhere it will fix this problem, but it did not. :(
Environment:
Rails v3.0.3
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]
Ruby has a notion of an external encoding and internal encoding for each file. This allows you to work with a file in UTF-8 in your source, even when the file is stored in a more esoteric format. If your default external encoding is UTF-8 (which it is if you're on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. What you're doing when you opening your file and passing "r:UTF-8" is forcing the same external encoding that Ruby is using by default.
Chances are, your source document isn't in UTF-8 and those non-ascii characters aren't mapping cleanly to UTF-8 (if they were, you would either get the correct characters and no error, and if they mapped by incorrectly, you would get incorrect characters and no error). What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:
File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.strip(",") }
If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
If you want to change your file encoding, you can use gem 'charlock holmes'
https://github.com/brianmario/charlock_holmes
$require 'charlock_holmes/string'
content = File.read('test2.txt')
if !content.is_utf8?
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end
Then you can save your new content in a temp file and overwrite your original file.
Hope this help.

Resources