Ruby removing diacritics from filenames - how to preserve them? - ruby-on-rails

I have a directory full of files which have Unicode characters with diacritics in their file names, e.g. ăn.mp3, bất.mp3. (They're Vietnamese words.)
I'm iterating over these files using Dir.glob("path/to/folder/*").each, but the diacritics don't work properly. For example:
Dir.glob("path/to/folder/*").each do |file|
# e.g. file = "path/to/folder/bất.mp3"
word = file.split("/").last.split(".").first # bất
puts word[1] # outputs "a", but should be "ấ"
end
Bizarrely, if I run puts word then the diacritics appear correctly, but if I puts individual letters, they're not there. The file names eventually get saved as an attribute in a table in my Rails app, and all kinds of problems are occurring from the diacritics being inconsistent and disappearing.
Clearly something's wrong with my encoding, but I have no idea how to go about fixing this. This is a problem not just with Rails but with Ruby itself, because the above output is from irb, independent of any Rails app.
(I'm running Ruby 2.0.0p247.)
What the hell is going on?

There are two ways to produce a diacritic. One is to use a single precomposed letter that already carries the diacritic. The other is to use the plain letter and immediately follow it with a special combining diacritic character. Are you sure you're not in the latter scenario? (If so, puts 'a' + word[2] should produce the letter with a diacritic.)
Also, are you sure your strings are properly encoded using UTF-8 (or UTF-16), rather than being treated as raw sequences of bytes?
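If the file names do turn out to be in decomposed form (some file systems, HFS+ on macOS for example, store names that way), one possible fix is to normalize each name to the composed form (NFC) before indexing into it or saving it. The sketch below assumes Ruby 2.2 or newer, where String#unicode_normalize exists; it is not available on the 2.0.0 mentioned in the question. The helper name normalized_word is made up for illustration.

```ruby
# Assumption: Ruby >= 2.2 (String#unicode_normalize) and UTF-8 strings.

# Hypothetical helper: base name without directory or extension, in NFC form.
def normalized_word(path)
  File.basename(path, ".*").unicode_normalize(:nfc)
end

# "bất" spelled as b, a, combining circumflex, combining acute, t
decomposed_path = "path/to/folder/ba\u0302\u0301t.mp3"

raw  = File.basename(decomposed_path, ".*")
word = normalized_word(decomposed_path)

puts raw.length   # 5 (the two combining marks count as separate characters)
puts word.length  # 3
puts word[1]      # ấ (a single code point, U+1EA5)
```

Normalizing once, at the point where the names are read, keeps whatever ends up in the database consistent.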

Related

Handling UTF-8 Character with Latin1 db encoding

I keep getting an exception that ActiveRecord::StatementInvalid: PG::UntranslatableCharacter: ERROR: character with byte sequence 0xe2 0x80 0x99 in encoding "UTF8" has no equivalent in encoding "LATIN1". I did some checking and it looks like it is the backtick or apostrophe. What is the best way to handle this? Just strip out the character or convert the whole db to UTF-8? If it is converting to UTF-8 how can I do that permanently as it always seems to revert if you do it in the shell?
I don't understand what you mean by "revert, if done in the shell", but: You seem to have an application where some parts (at least the database) are using the LATIN1 encoding, and one part (your Rails app) is using UTF-8. IMO, it is best to have everything in Unicode, but to what extent a conversion makes sense cannot be said in general. For example, if your database is also being processed by other tools, and those expect Latin1, a conversion is not sensible.
In any case, you need to define a clear borderline between where you use which encoding, and handle conversion at this border. This applies not only to the database, but also, for example, to the HTML pages you are generating (hopefully UTF-8), to files uploaded by the users and processed by your application, and so on.
If you convert to an encoding where certain characters cannot be represented, as is the case here, you have only three choices:
Reject the data (they must have been generated somewhere, perhaps as user input in a web form),
Simply remove the offending characters
Replace the offending characters by a placeholder (for instance, a question mark)
None of these options is very pleasant, but if converting your database to UTF-8 is no option, you should deal with this problem at the point where the problem string is generated, and not when it is written into the database.
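The three choices can be sketched with String#encode and its undef/replace options. This assumes Ruby 1.9 or newer and that PostgreSQL's LATIN1 corresponds to Ruby's ISO-8859-1; the helper names are invented for illustration. The byte sequence 0xe2 0x80 0x99 from the error message is U+2019, the right single quotation mark.

```ruby
# Choice 1: reject. True only if every character exists in Latin-1.
def latin1_representable?(text)
  text.encode("ISO-8859-1")
  true
rescue Encoding::UndefinedConversionError
  false
end

# Choice 2: silently drop characters that Latin-1 cannot represent.
def strip_for_latin1(text)
  text.encode("ISO-8859-1", undef: :replace, replace: "")
end

# Choice 3: substitute a placeholder for each unrepresentable character.
def placeholder_for_latin1(text)
  text.encode("ISO-8859-1", undef: :replace, replace: "?")
end

input = "it\u2019s"  # contains U+2019 (bytes e2 80 99 in UTF-8)

puts latin1_representable?(input)                    # false
puts strip_for_latin1(input).encode("UTF-8")         # its
puts placeholder_for_latin1(input).encode("UTF-8")   # it?s
```

Whichever variant is chosen, it would run where the string enters the application rather than right before the INSERT.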

Best practices for creating a CSV file?

I am working in Swift although perhaps the language is not as relevant, and I am creating a relatively simple CSV file.
I wanted to ask for some recommendations in creating the files, in particular:
Should I wrap each column/value in single or double quotes? Or nothing? I understand if I use quotes I'll need to escape them appropriately in case the text in my file legitimately has those values. Same for \r\n
Is it ok to end each line with \r\n ? Anything specific to Mac vs. Windows I need to think about?
What encoding should I use? I'd like to make sure my csv file can be read by most readers (so on mobile devices, mac, windows, etc.)
Any other recommendations / tips to make sure the quality of my CSV is ideal for most readers?
I have a couple of apps that create CSV files.
Any column value that contains a newline or the field separator must be enclosed in quotes (double quotes is common, single quotes less so).
I end lines with just \n.
You may wish to give the user some options when creating the CSV file. Let them choose the field separator. While the comma is common, a tab is also common. You can also use a semi-colon, space, or other characters. Just be sure to properly quote values that contain the chosen field separator.
Using UTF-8 encoding is arguably the best choice for encoding the file. It lets you support all Unicode characters, and just about any tool that supports CSV can handle UTF-8. It avoids any issues with platform-specific encodings. But again, depending on the needs of your users, you may wish to give them the choice of encoding.
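The quoting rule above is language independent. Here is a minimal sketch of it in Ruby rather than Swift, with a configurable field separator and \n line endings as described. Doubling embedded double quotes is the RFC 4180 convention and goes slightly beyond what the answer states; csv_field and csv_line are illustrative names, not part of any library.

```ruby
# Quote a single value only when it needs it: when it contains the separator,
# a double quote, or a line break. Embedded quotes are doubled (RFC 4180 style).
def csv_field(value, sep = ",")
  s = value.to_s
  if s.include?(sep) || s.include?('"') || s.include?("\n") || s.include?("\r")
    '"' + s.gsub('"', '""') + '"'
  else
    s
  end
end

# Join one row of values into a line terminated by "\n".
def csv_line(values, sep = ",")
  values.map { |v| csv_field(v, sep) }.join(sep) + "\n"
end

print csv_line(["word", "b,c", 'say "hi"'])   # word,"b,c","say ""hi"""
print csv_line(["a;b", "c,d"], ";")           # "a;b";c,d
```

In production code a CSV library for the platform in use is likely the safer route; the sketch only makes the quoting rule concrete.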

CGI::unescape can't handle unescaping "wymiana+teflon%F3w"?

I am working on data imported from a legacy database into sqlite for development. The legacy database has a lot of URL-encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (the only one I've noticed so far; there may be more, as I haven't done any systematic testing yet): the letter "ó". For instance, using unescapeHTML on the string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1), but '\xF3' is not a valid UTF-8 string; that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of the modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-2 and what should be UTF-8. You might have a mix of several encodings (ISO-8859-2, or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference, as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
1. Send everything through your CGI::unescape_html(CGI::unescape(...)) converter.
2. Wrap that in an exception handler to trap the inevitable problems.
3. Stash the problem strings off to the side somewhere.
4. Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they make sense.
5. Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
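The "try UTF-8 first, then fall back" part of that procedure could look roughly like the sketch below. It works on strings that have already been percent-decoded (for example, the output of CGI::unescape), so it does not call CGI itself. The helper name repair_to_utf8 and the default fallback list are assumptions for illustration, and the result still needs to be eye-balled as described above.

```ruby
# Hypothetical helper: return str as valid UTF-8, or nil if nothing fits.
# Tries UTF-8 as-is, then each fallback encoding in the order given.
def repair_to_utf8(str, fallbacks = ["ISO-8859-2"])
  utf8 = str.dup.force_encoding("UTF-8")
  return utf8 if utf8.valid_encoding?

  fallbacks.each do |name|
    begin
      return str.dup.force_encoding(name).encode("UTF-8")
    rescue EncodingError
      next # this encoding cannot decode the bytes; try the next one
    end
  end
  nil # nothing worked: stash the string for manual inspection
end

good  = "wymiana teflon\xC3\xB3w".b  # bytes as produced from %C3%B3
legacy = "wymiana teflon\xF3w".b     # bytes as produced from %F3

puts repair_to_utf8(good)    # wymiana teflonów
puts repair_to_utf8(legacy)  # wymiana teflonów
```

Note that a single-byte encoding such as ISO-8859-2 accepts every byte value, so it never fails; it only makes sense as a fallback after the UTF-8 check, and more restrictive candidates would need to come before it in the list.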
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML

character encoding output to a file in linux

The working environment is jboss+mssql
I am doing a query and output the formatted result to a text file. The query result has some French accent characters.
On my local machine, everything works fine, but on the UAT server (linux box, UTF-8), the french accent characters become question marks.
Does anyone know how to solve it?
It depends on how you create your file - a code example would be helpful.
If you do specify an encoding explicitly, e.g. when creating a Writer, then if it doesn't match the locale of the machine on which you view the file, you may see question marks, placeholder boxes etc. instead of accented letters. You can use the locale command to check your locale and this will make it possible to learn the associated character encoding. This is just a matter of viewing the file. You say that the box is UTF-8, but do ensure that the app is also running under a UTF-8 locale - your user console and the server app may be using different locales.
If you do not specify the character encoding when writing, most often you will end up using the system's locale. In that case it may happen that this locale doesn't support the characters you need, so they are replaced with placeholders. A solution would be to change the locale with which your app is running e.g. by exporting the corresponding LC_* environmental variables.
So, the short checklist goes like this:
How do you write your file? Is the encoding specified explicitly?
What is the locale with which the app is running (output of locale command)?
Check the actual bytes written to your file using od -t x1 command or using a hex viewer like the one included in mc. Are the question marks actual question marks (hex code 3F), or rather some other character? If they take one byte, they're probably in one of the Latin-N (ISO 8859-N) encodings. If they take more than one byte, it's probably UTF-8 (I understand the letters a-z look normal, so it's not UTF-16).
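The question is about a Java app on JBoss, but the first and third checklist items can be illustrated in any language. The sketch below uses Ruby: it writes the same text with two explicitly specified encodings and then dumps the bytes, much like od -t x1 would. The helper name written_bytes_hex is made up, and the temporary directory is only there to keep the example self-contained.

```ruby
require "tmpdir"

# Write text to a temporary file using the given external encoding,
# then return the bytes actually on disk as space-separated hex.
def written_bytes_hex(text, encoding)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "out.txt")
    File.open(path, "w:#{encoding}") { |f| f.write(text) }
    File.binread(path).unpack("C*").map { |b| format("%02x", b) }.join(" ")
  end
end

puts written_bytes_hex("\u00E9", "UTF-8")       # c3 a9 (two bytes)
puts written_bytes_hex("\u00E9", "ISO-8859-1")  # e9    (one byte)
```

If the file on the UAT server shows 3f where an accented letter should be, the character was already lost at write time, which points at the encoding used by the writer rather than at the viewer.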

Is there a solution to the character encoding problem ("�") for Rails 2 / Ruby 1.8.7?

From the Rails 3 announcement listing the major new features:
Say goodbye to encoding issues
If you browse the Internet with any frequency, you will likely encounter the � character. This problem is extremely pervasive, and is caused by mixing and matching content with different encodings.
In a system like Rails, content comes from the database, your templates, your source files, and from the user. Ruby 1.9 gives us the raw tools to eliminate these problems, and in combination with Rails 3, � should be a thing of the past in Rails applications. Never struggle with corrupted data pasted by a user from Microsoft Word again!
I have an app where users often paste in text from MS Word and we encounter exactly this issue.
However we're running Rails 2 and Ruby 1.8.7. There is no immediate prospect of changing this.
I think the encoding problem usually manifests with typographer's quotes ("curly quotes"). Probably also things like em dashes and the ellipsis character.
I'm wondering if there's a routine I can run on the incoming data to overcome this problem.
It's OK if the quotes get turned into straight quotes, ellipses get turned into three periods, etc.
It could even be a utility that runs on the system level that I could call from my app with
processed_data = `system_command #{params[:incoming_data]}`
You can use the rchardet gem to detect the encoding of incoming strings, and the built-in Iconv libs to convert strings that need conversion:
require 'iconv'
require 'rchardet'
[...]
# Detect the likely encoding of the uploaded data, then convert it to UTF-8.
cd = CharDet.detect(params[:my_upload_form][:uploaded_file])
encoding = cd['encoding']
converted_string = Iconv.conv('UTF-8', encoding, params[:my_upload_form][:uploaded_file])
The example is working on an uploaded file, but of course you can apply it to data coming in from textareas or wherever else you think users may be pasting data in encodings other than the one you want.
Borrowed shamelessly from the kind gentleman at http://www.meeho.net/blog/2010/03/ruby-how-to-detect-the-encoding-of-a-string/.
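Once the text is known to be UTF-8, the substitution the question says would be acceptable (curly quotes to straight quotes, ellipsis to three periods) could be done with a small lookup table. This is a sketch, not a complete list of Word's typographic characters: the table contents and the names TYPOGRAPHIC_REPLACEMENTS and asciify_typography are illustrative. The keys are written as UTF-8 byte escapes to avoid 1.9-only syntax, but the code has only been reasoned through for Ruby 1.9 and later.

```ruby
# UTF-8 byte sequences of common "smart" characters and plain replacements.
TYPOGRAPHIC_REPLACEMENTS = {
  "\xE2\x80\x98" => "'",    # U+2018 left single quote
  "\xE2\x80\x99" => "'",    # U+2019 right single quote / apostrophe
  "\xE2\x80\x9C" => '"',    # U+201C left double quote
  "\xE2\x80\x9D" => '"',    # U+201D right double quote
  "\xE2\x80\x93" => "-",    # U+2013 en dash
  "\xE2\x80\x94" => "--",   # U+2014 em dash
  "\xE2\x80\xA6" => "..."   # U+2026 horizontal ellipsis
}.freeze

TYPOGRAPHIC_PATTERN = Regexp.union(*TYPOGRAPHIC_REPLACEMENTS.keys)

# Replace each typographic character with its plain-ASCII stand-in.
def asciify_typography(text)
  text.gsub(TYPOGRAPHIC_PATTERN) { |match| TYPOGRAPHIC_REPLACEMENTS[match] }
end

puts asciify_typography("\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\xA6\xE2\x80\x9D")
# "It's fine..."
```

This only helps when the input really is UTF-8; text that arrives in a Windows code page would need the detection and conversion step first.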
