Rails CSV file upload character encoding issues - ruby-on-rails

I know there are a lot of threads about this already, but none of the solutions suggested do not seem to work for me for some reason...
I am using:
Ruby 1.9.2
Rails 2.3.8
My users author CSV files in MS Excel and then need to upload these files to the web application. My web application and the database backend uses UTF-8 and all special characters, such as the £ sign, get corrupted on upload.
I am reading in the file like this:
#file = params[:import_file][:uploaded_data]
Then get encoding of the file using:
source_encoding = "UTF-8"
if #file.external_encoding
source_encoding = #file.external_encoding.name
end
For my test file the source encoding value is ASCII-8BIT.
Then I try to do:
#file.each {|line|
print "#{line.force_encoding(source_encoding).encode!("UTF-8") }\n"
}
in order to see if all texts are displaying ok. However this gives me error like this:
"\xA3" from ASCII-8BIT to UTF-8
If I am trying to read the CSV with:
dataArray = CSV.read(#file, encoding: source_encoding)
No errors this time, but all special characters go as ? characters.
Any pointers where I might be going wrong or is importing CSV file authored with MS Excel just a mission impossible?
Regards,
Olli

Related

How to correctly handle character encoding when using Postgresql's copy_data function?

In my Rails app, I managed to stream large CSV files directly from Postgres based on solutions mentioned in this SO post. My working code looks somewhat like so:
query = <A Long SQL Query String>
response.headers["Cache-Control"] = "no-cache"
response.headers["Content-Type"] = "text/csv; charset=utf-8"
response.headers["Content-Disposition"] =
%(attachment; filename="#{csv_filename}")
response.headers["Last-Modified"] = Time.now.ctime.to_s
conn = ActiveRecord::Base.connection.raw_connection
conn.copy_data("COPY (#{query}) TO STDOUT WITH (FORMAT CSV, HEADER TRUE, FORCE_QUOTE *, ESCAPE E'\\\\');") do
while row = conn.get_copy_data
response.stream.write row
end
end
response.stream.close
end
Some of the columns (VARCHAR) being queried have values as either English or Chinese strings. The CSV file resulting from the above code doesn’t show the Chinese characters as is. Instead, I get something like this:
大大 文文
Am I supposed to change the way I’m using the copy_data function, or is there something I could do to the CSV file to solve this? I’ve tried saving the file as UTF-8 .txt file, as well as trying the convert_to function mentioned in the copy_data documentation, but to no avail.
This depends of the original encoding included in the CSV file.
Do this on Linux :
file -i you_file
Are you sure it's not UTF-16 or GB 18030 ?
And also in what kind of encoding is setup your database ?
do a \l in psql to see this.
So it boiled down to my MS Excel not being able to render the Chinese chars correctly. On MacOS, opening the same .csv file using the Numbers app (or even Atom, for that matter) resolved this issue for me.

MalformedCSVError with rails CSV (FasterCSV)

I'm having serious issues trying to parse some CSV in rails right now.
Basically my app gets a user to upload a CSV file. The app then converts the file to ensure it is in UTF-8 format, then attempts to parse it and process it. Whenever the app attempts to parse it however, I get the MalformedCSVError stating "Illegal quoting on line 1"
Now what I don't get, is if I copy the original file into a new document and save it, then I can parse it on a rails console without a problem.
If I attempt to parse the original file, it complains about an invalid character for UTF-8 encoding (the file isn't in UTF-8 hence the app converts it)
If I attempt to parse the file which the app has converted to UTF-8 and changed the line endings to LF, it fails to parse.
If I do a file diff between the version the app has produced, and the copy/paste version that I have made (which works) there are 0 differences so I really can't figure out why one is parsable, and one is not.
Any suggestions? My app is processing the file as follows :
def create
#survey = Survey.new(params[:survey])
# Now we need to try and convert this to UTF-8 if it isn't already
encoded = File.read(#survey.survey_data.current_path)
encoding = CharlockHolmes::EncodingDetector.detect(encoded)
# We've got a guess at the encoding,
# so we can try and convert it but it
# may still fail so we need to handle
# that
begin
re_encoded = CharlockHolmes::Converter.convert(encoded, encoding[:encoding], 'UTF-8')
re_encoded = re_encoded.gsub(/\r\n?/, "\n")
# Now replace the uploaded file
File.open(#survey.survey_data.current_path, 'w') { |f|
f.write(re_encoded)
}
rescue ArgumentError
puts "UH OH!!!!!"
end
puts "#{#survey.survey_data.current_path}"
#parsed = CSV.read(#survey.survey_data.current_path)
end
The file uploading gem is CarrierWave if that makes any difference.
Please can someone help me as this is driving me insane!
Edit
The error says it's on line 1. Line 1 (assuming it doesn't index from 0) is
"Survey","RD","GarrysMDs","NigelsMDs","PaulsMDs","StephensMDs","BrinleyJ","CarolineP","DaveL","GrantR","GregS","Kent","NeilC","NicolaP","AndyC","DarrenS","DeanB","KarenF","PaulR","RichardF","SteveG","BrianG","GordonA","NickD","NickR","NickT","RayL","SimonH","EdmondH","JasonF","MikeS","SamanthaN","TimB","TravisF","AlanS","Q1","Q2","Q3","Q4","Q5","Q6","Q7","Q8PM","Q8N","Q9","Q10","Q11","Q12","Q13","Q14","Q15","Q16PM","Q16N","Q17PM","Q17N","Q18PM","Q18N","Q19","Q20","Q21","Q22","comment","Q23.1","Q23.2","Q23.3","TQ23.1","TQ23.2","VPM","VN","VQ1","VQ2","VQ3","VQ4","VQ5","VQ6","VQ7","VQ8N","VQ8PM","VQ9","VQ10","VQ11","VQ12","VQ13","VQ14","VQ15","VQ16","VQ16N","VQ16PM","VQ17","VQ17N","VQ17PM","VQ18","VQ18N","VQ18PM","VQ19","VQ20","VQ21","VQ22","VQ23.1","VQ23.2","VQ23.3","VRD","XQ16","XQ17","XQ18"
Well that was irritating!
Turns out the file had a BOM which was causing the CSV parser to break. Loading the file with
CSV.open("path/to/file.csv", "rb:bom|encoding")
allowed it to parse it perfectly! So annoyed how long it took to track down but it's now working and with no need to convert to UTF-8 now either!

Spreadsheet - encoding problem with reading cyrillic characters

I'm working on a rails app for a small shop. It needs to load an .xls file, parse it and maybe load to the database.
I use Spreadsheet gem to work with the file.
The problem is that the file contains russian characters which are displayed as "└ÛÛ.ExT H-1727F (ÓÝÓÙ¯Ò GP T304)"
The reference says, I need to specify the encoding, but I don't know which one is used in this file. I tried "win-1251" but it gave me an error about being unable to find a "utf-8 to win-1251 converter"
I've setting encoding to "WINDOWS-1251" but it gave me this error:
U+00BE to WINDOWS-1251 in conversion from CP850 to UTF-8 to WINDOWS-1251
So then I've tried CP850, which didn't throw an error, but the characters were still not readable.
There's not much code really.
# -*- encoding : utf-8 -*-
...
def show
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'
book = Spreadsheet.open 'c:\rails\renergy23\public\price-16-04-11.xls'
#sheet = book.worksheet 0
end
For simpicity I don't load it to the database right now. Instead I output it in my view:
- 30.times do |i|
= #sheet.row i+10
%br
http://dl.dropbox.com/u/4976861/price-16-04-11.xls
I kinda solved this after 1.5 months by first saving the document in .xlsx and then saving it in .xls (97-2003). I couldn't use the .xlsx because of some weird OLE signature incorrect error.

Character Encoding issue in Rails v3/Ruby 1.9.2

I get this error sometimes "invalid byte sequence in UTF-8" when I read contents from a file. Note - this only happens when there are some special characters in the string. I have tried opening the file without "r:UTF-8", but still get the same error.
open(file, "r:UTF-8").each_line { |line| puts line.strip(",") } # line.strip generates the error
Contents of the file:
# encoding: UTF-8
290919,"SE","26","Sk‰l","",59.4500,17.9500,, # this errors out
290956,"CZ","45","HornÌ Bradlo","",49.8000,15.7500,, # this errors out
290958,"NO","02","Svaland","",58.4000,8.0500,, # this works
This is the CSV file I got from outside and I am trying to import it into my DB, it did not come with "# encoding: UTF-8" at the top, but I added this since I read somewhere it will fix this problem, but it did not. :(
Environment:
Rails v3.0.3
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.5.0]
Ruby has a notion of an external encoding and internal encoding for each file. This allows you to work with a file in UTF-8 in your source, even when the file is stored in a more esoteric format. If your default external encoding is UTF-8 (which it is if you're on Mac OS X), all of your file I/O is going to be in UTF-8 as well. You can check this using File.open('file').external_encoding. What you're doing when you opening your file and passing "r:UTF-8" is forcing the same external encoding that Ruby is using by default.
Chances are, your source document isn't in UTF-8 and those non-ascii characters aren't mapping cleanly to UTF-8 (if they were, you would either get the correct characters and no error, and if they mapped by incorrectly, you would get incorrect characters and no error). What you should do is try to determine the encoding of the source document, then have Ruby transcode the document on read, like so:
File.open(file, "r:windows-1251:utf-8").each_line { |line| puts line.strip(",") }
If you need help determining the encoding of the source, give this Python library a whirl. It's based on the automatic charset detection fallback that was in Seamonkey/Mozilla (and is possibly still in Firefox).
If you want to change your file encoding, you can use gem 'charlock holmes'
https://github.com/brianmario/charlock_holmes
$require 'charlock_holmes/string'
content = File.read('test2.txt')
if !content.is_utf8?
detection = CharlockHolmes::EncodingDetector.detect(content)
utf8_encoded_content = CharlockHolmes::Converter.convert content, detection[:encoding], 'UTF-8'
end
Then you can save your new content in a temp file and overwrite your original file.
Hope this help.

convert jruby 1.8 string to windows encoding?

I want to export some data from my jruby on rails webapp to excel, so I create a csv string and send it as a download to the client using
send_data(text, :filename => "file.csv", :type => "text/csv; charset=CP1252", :encoding => "CP1252")
The file seems to be in UTF-8 which Excel cannot read correctly. I googled the problem and found that iconv can convert encodings. I try to do that with:
ic = Iconv.new('CP1252', 'UTF-8')
text = ic.iconv(text)
but when I send the converted text it does not make any difference. It is still UTF-8 and Excel cannot read the special characters. there are several solutions using iconv, so this seems to work for others. When I convert the file on the linux shell manually with iconv it works.
What am I doing wrong? Is there a better way?
Im using:
- jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM) Client VM 1.6.0_19) [i386-java]
- Debian Lenny
- Glassfish app server
- Iceweasel 3.0.6
Edit:
Do I have to include some gem to use iconv?
Solution:
S.Mark pointed out this solution:
You have to use UTF-16LE encoding to make excel understand it, like this:
text= Iconv.iconv('UTF-16LE', 'UTF-8', text)
Thanks, S.Mark for that answer.
According to my experience, Excel cannot handle UTF-8 CSV files properly. Try UTF-16 instead.
Note: Excel's Import Text Wizard appears to work with UTF-8 too
Edit: A Search on Stack Overflow give me this page, please take a look that.
According to that, adding a BOM (Byte Order Mark) signature in CSV will popup Excel Text Import Wizard, so you could use it as work around.
Do you get the same result with the following?
cp1252= Iconv.conv("CP1252", "UTF8", text)

Resources