Spreadsheets ruby gem encoding not working - ruby-on-rails

I'm getting a weird problem when I try to write UTF-8 strings to an .xls file with the Spreadsheet gem. It doesn't raise any errors, but I get an invalid spreadsheet full of random characters (opened in Excel and Calc, same result).
So I assume it is an encoding error, but I thought the library would automatically convert my strings to the encoding used by Excel. I tried converting them to ISO by hand (.encode('ISO-8859-1')), calling force_encoding with UTF-8, and many other combinations of these two methods. Some raise errors at run time, and the others just don't work. Is there anything special I should do?
Spreadsheets: http://spreadsheet.rubyforge.org/
Code:
book = Spreadsheet::Workbook.new
sheet = book.create_worksheet
lines.each do |line|
  sheet.row(row).concat(line) # line is in UTF-8
end
book.write # file

Try adding the following magic comment at the top of your Ruby script:
# encoding: UTF-8
The interpreter reads this line before processing your source code and sets the source encoding accordingly, so string literals in that file are treated as UTF-8. I assume this should solve your problem.
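Before changing anything, it can help to confirm what encoding each string actually carries. The following is only a minimal sketch, assuming Ruby 1.9 or newer; encoding_report is a hypothetical helper name and is not part of the Spreadsheet gem.

```ruby
# Hypothetical helper: describe each cell's encoding so mislabeled
# strings can be spotted before they are written to the workbook.
def encoding_report(cells)
  cells.map do |cell|
    { :value => cell, :encoding => cell.encoding.name, :valid => cell.valid_encoding? }
  end
end

# One well-formed string and one Latin-1 byte mislabeled as UTF-8.
line = ["caf\u00E9", "caf\xE9".dup.force_encoding("UTF-8")]

encoding_report(line).each do |info|
  puts "#{info[:value].inspect} #{info[:encoding]} valid=#{info[:valid]}"
end
```

A cell reported as invalid was probably read as raw bytes in some other encoding, in which case the magic comment alone is unlikely to help.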

Related

Ruby removing diacritics from filenames - how to preserve them?

I have a directory full of files which have Unicode characters with diacritics in their file names, e.g. ăn.mp3, bất.mp3. (They're Vietnamese words.)
I'm iterating over these files using Dir.glob("path/to/folder/*").each, but the diacritics don't work properly. For example:
Dir.glob("path/to/folder/*").each do |file|
  # e.g. file = "path/to/folder/bất.mp3"
  word = file.split("/").last.split(".").first # bất
  puts word[1] # outputs "a", but should be "ấ"
end
Bizarrely, if I run puts word then the diacritics appear correctly, but if I puts individual letters, they're not there. The file names eventually get saved as an attribute in a table in my Rails app, and all kinds of problems are occurring from the diacritics being inconsistent and disappearing.
Clearly something's wrong with my encoding, but I have no idea how to go about fixing this. This is a problem not just with Rails but with Ruby itself, because the above output is from irb, independent of any Rails app.
(I'm running Ruby 2.0.0p247.)
What the hell is going on?
There are two ways to produce a diacritic. One is to use the precomposed letter that already carries the diacritic. The other is to use the plain letter and immediately follow it with a special combining diacritic character. Are you sure you're not in the latter scenario? (If so, puts 'a' + word[2] should produce the letter with a diacritic.)
Also, are you sure your strings are properly encoded as UTF-8 (or UTF-16), rather than being treated as raw sequences of bytes?
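If the file names do use combining characters, the behaviour can be reproduced without touching the file system. This is only a sketch: the sample string is built by hand to mimic a decomposed name, \X relies on the regexp engine's grapheme support, and unicode_normalize assumes Ruby 2.2 or newer (it is not available on 2.0.0).

```ruby
# "bất" in decomposed form: b, a, combining circumflex (U+0302),
# combining acute (U+0301), t. Five code points, three visible letters.
decomposed = "ba\u0302\u0301t"
puts decomposed.length   # counts code points, not visible letters
puts decomposed[1]       # the bare "a"; the marks are separate characters

# Option 1: split on grapheme clusters so marks stay with their base letter.
letters = decomposed.scan(/\X/)
puts letters[1]

# Option 2 (assumes Ruby 2.2+): normalize to the precomposed form.
composed = decomposed.unicode_normalize(:nfc)
puts composed[1]
```

Normalizing once, before the names are saved to the database, keeps later comparisons and indexing consistent.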

Rails oracle raw16

I'm using Rails 3.2.1 and I have been stuck on a problem for quite a long time.
I'm using the Oracle enhanced adapter and I have a raw(16) (UUID) column. When I try to display the data, one of two things happens:
1) I see weird symbols
2) I get "incompatible character encoding: ASCII-8BIT and UTF-8".
In my application.rb file I added
config.encoding = 'utf-8'
and in my view file I added
'#encoding=utf-8'
But so far nothing has worked.
I also tried adding html_safe, but that failed too.
How can I safely display my UUID data?
Thank you very much
Answer:
I used the unpack method to convert the binary value with the format H8H4H4H4H12 and then joined the resulting array :-)
The RAW datatype is a string of bytes that can take any value. This includes binary data that doesn't translate to anything meaningful in ASCII or UTF-8 or in any character set.
You should really read Joel Spolsky's note about character sets and Unicode before continuing.
Now, since the data can't be translated reliably to a string, how can we display it? Usually we convert or encode it, for instance:
We could use the hexadecimal representation, where each byte is converted to two [0-9A-F] characters (in Oracle, using the RAWTOHEX function). This is fine for displaying small binary fields such as RAW(16).
We could also use other encodings such as Base64 (in Oracle, with the UTL_ENCODE package).
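On the Ruby side, the unpack approach mentioned in the answer above could look roughly like this. It is a sketch only: raw16_to_uuid is a hypothetical helper name, and the sample bytes stand in for whatever the adapter returns for the column.

```ruby
# Hypothetical helper: format a 16-byte RAW value as a UUID-style hex string.
def raw16_to_uuid(raw)
  # H8 H4 H4 H4 H12 reads 8, 4, 4, 4 and 12 hex digits, high nibble first.
  raw.unpack("H8H4H4H4H12").join("-")
end

# Sample bytes standing in for a value read from a RAW(16) column.
raw = ["00112233445566778899aabbccddeeff"].pack("H*")

puts raw16_to_uuid(raw)
puts [raw].pack("m0") # Base64 alternative, without line breaks
```

For the sample bytes the first line printed is 00112233-4455-6677-8899-aabbccddeeff. Either form is plain ASCII, so it can be rendered in a UTF-8 view without the incompatible encoding error.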

identifying problematic row of data giving mass import error

I am using activerecord-import to bulk insert a bunch of data in a .csv file into my rails app. Unfortunately, I am getting an error when I call import on my model.
ArgumentError (invalid byte sequence in UTF-8)
I know the problem is that I have a string with weird characters somewhere in the 1000+ rows of data that I am importing, but I can't figure out which row is the problem.
Does activerecord-import have any built-in error handling that I could use to figure out which row(s) were problematic (e.g. some option I could set when calling the import function on my model)? As far as I can tell, the answer is no.
Alternatively, can I write some code that would check the array that I am passing into activerecord-import to determine which rows have strings that are invalid in UTF-8?
Without being able to see the data, it is only possible to guess. Most likely, you have a byte sequence that is not valid UTF-8.
You should be able to check your file with
iconv -f UTF-8 -t UTF-8 <filename> > /dev/null
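The array passed to import can also be checked in Ruby before the call. The following is a minimal sketch, assuming each row is an array of values; invalid_row_indexes is a hypothetical helper and not part of activerecord-import.

```ruby
# Hypothetical helper: return the indexes of rows that contain a string
# which is not valid in its declared encoding.
def invalid_row_indexes(rows)
  rows.each_index.select do |i|
    rows[i].any? { |value| value.is_a?(String) && !value.valid_encoding? }
  end
end

rows = [
  ["Alice", "Paris"],
  ["Bob", "caf\xE9".dup.force_encoding("UTF-8")], # Latin-1 byte mislabeled as UTF-8
  ["Carol", 42]
]

invalid_row_indexes(rows).each do |i|
  puts "row #{i}: #{rows[i].inspect}"
end
```

Rows flagged this way can then be re-encoded from their real source encoding, or skipped, before calling import.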

Eliminating non-convertable characters on encoding change from UTF-8 to Shift_JIS with ruby 1.9

I need to write a CSV export program. Internally it uses UTF-8, and the data originates from user input via the web (so you can expect any character). It's a Japanese system, so I need to encode the output as Shift_JIS.
Now, when I convert from UTF-8 to Shift_JIS, I get errors like:
Encoding::UndefinedConversionError (U+7E6B from UTF-8 to Shift_JIS):
I want to either a) eliminate the character, or b) map the character to some other character
(or simply to the string '(U+7E6B)').
It seems I could catch the exception and strip the offending character at the byte level, but there must be an easier way to do this.
What is the best way to do this conversion?
[Converting my follow-up comments on the question into an answer]
I found that encode takes options, and passing
:undef => :replace, # for UndefinedConversionError
:replace => "?"
has the desired effect. You can also specify:
:invalid => :replace, # for InvalidByteSequenceError
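Put together, the conversion might look like the sketch below. The helper names to_sjis and to_sjis_marked are hypothetical, and an emoji is used as the unmappable character because it certainly has no Shift_JIS equivalent. The second helper uses the :fallback option of String#encode to produce a visible marker in the '(U+XXXX)' style the question asks about.

```ruby
# Hypothetical helper: characters with no Shift_JIS mapping, and malformed
# input bytes, are replaced with "?".
def to_sjis(text)
  text.encode("Shift_JIS", :undef => :replace, :invalid => :replace, :replace => "?")
end

# Hypothetical helper: substitute a visible "(U+XXXX)" marker instead.
def to_sjis_marked(text)
  text.encode("Shift_JIS", :fallback => lambda { |char| "(U+%04X)" % char.ord })
end

# "\u65E5\u672C\u8A9E" is the word for "Japanese"; the emoji cannot be mapped.
sample = "\u65E5\u672C\u8A9E \u{1F600}"

puts to_sjis(sample).encoding
puts to_sjis(sample).bytesize
puts to_sjis_marked(sample).bytesize
```

The replacement variant silently loses information, so the marker variant may be preferable when someone needs to see which characters were dropped from the export.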

Why does Rails 3 think xE2x80x89 means â x80 x89

I have a field scraped from a utf-8 page:
"O’Reilly"
And saved in a yml file:
:name: "O\xE2\x80\x99Reilly"
(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)
However when I load the value into a hash and yield it to a page tagged as utf-8, I get:
OâReilly
I looked up the character â, which is encoded in UTF-16 as x00E2, and the characters x80 and x89 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.
How do I make rails interpret a 3-byte UTF-8 code as a single character?
In Ruby 1.8, strings are sequences of bytes rather than characters:
$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"
Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to see that you output the correct string in HTML (I assume you want HTML since you mentioned Rails) is to convert non-printable characters to HTML entities; in your case to
O&#8217;Reilly
This takes some work, but it should help in cases where you send your HTML as UTF-8 but the end user has set their browser to override that and show Latin-1 or some other restricted charset.
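The entity conversion described above could be sketched as follows, assuming Ruby 1.9 or newer and a string that is already valid UTF-8. to_numeric_entities is a hypothetical helper, not a Rails method.

```ruby
# Hypothetical helper: replace every non-ASCII character with a numeric
# HTML entity so the markup survives a wrong charset setting.
def to_numeric_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |char| "&#%d;" % char.ord }
end

# "\u2019" is the right single quotation mark from the question.
puts to_numeric_entities("O\u2019Reilly")
```

This sketch does not escape &, < or >, so it would need to run alongside the usual HTML escaping rather than replace it.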
Ultimately this was caused by loading a syck file (generated by an external script) with psych (in rails). Loading with syck solved the issue:
# in the Ruby environment
puts YAML::ENGINE.yamler # => syck
# in Rails
puts YAML::ENGINE.yamler # => psych
# in the web app
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name] # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'
"I assume this means my app is outputting three UTF-16 characters instead of one UTF-8."
It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings tagged with the ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.
If this is the case you might have to force_encoding the read strings back to what they should have been, or set default_internal to cause the strings to be read back into UTF-8. Bit of a mess this.
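The difference between relabeling and transcoding can be seen in a small sketch. It assumes Ruby 2.0 or newer (for String#b) and uses a hand-built byte string in place of a value loaded from the YAML file.

```ruby
# The ten bytes from the question, tagged as binary (ASCII-8BIT).
raw = "O\xE2\x80\x99Reilly".b
puts raw.encoding
puts raw.bytesize

# force_encoding only relabels the bytes; nothing is converted.
fixed = raw.dup.force_encoding(Encoding::UTF_8)
puts fixed.length # characters, now that the three bytes form one apostrophe

# Treating the same bytes as ISO-8859-1 and transcoding reproduces the
# mangling: each byte becomes its own character.
mangled = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts mangled.length
```

force_encoding is only safe when the bytes really are UTF-8, so checking valid_encoding? afterwards is a reasonable precaution.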
