How to properly handle invalid bytes in UTF-8 strings? - ruby-on-rails

I have a string with encoding ASCII-8BIT:
str = 'quindi \xE8 al \r\ngoverno'
I want to transcode it to UTF-8 so that the characters display correctly.
Naturally, \xE8 is not a valid byte sequence in UTF-8, so I get an error when I try:
str.encode 'utf-8'
Which raises:
Encoding::UndefinedConversionError: "\xE8" from ASCII-8BIT to UTF-8
Reading the docs for the encode method, I came up with this solution:
encode('UTF-8', invalid: :replace, undef: :replace)
This way all the invalid sequences are replaced with ?. But I want to display the proper character instead of ?, and there are several different byte sequences in this text: \xE8, \xE0, and so on.
Is there a way to automatically replace them with the right character?

Your string seems to be ISO-8859-1 encoded. This should work:
str = "quindi \xE8 al \r\ngoverno"
str.force_encoding('ISO-8859-1').encode('UTF-8')
#=> "quindi è al \r\ngoverno"
Note that you have to use double quotes; with single quotes, \xE8 is a literal backslash sequence rather than the byte 0xE8.
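For comparison, a minimal sketch (assuming the bytes really are ISO-8859-1 that arrived tagged as ASCII-8BIT) showing the lossy replace approach next to the lossless reinterpret-then-transcode approach:
str = "quindi \xE8 al \r\ngoverno".force_encoding('ASCII-8BIT')
# Lossy: invalid/undefined bytes are swapped for the replacement string
str.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
#=> "quindi ? al \r\ngoverno"
# Lossless, if the source really is ISO-8859-1: reinterpret the bytes, then transcode
str.force_encoding('ISO-8859-1').encode('UTF-8')
#=> "quindi è al \r\ngoverno"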

Related

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API is "manhã" but the API returns "manh�". Other chars like á or ç are being returned correctly; this is some bug in the API, I guess.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track: you can solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times when there are invalid characters in the source encoding; to avoid exceptions, you can instruct Ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with '' (an empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
---
Note: many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string as the specified encoding; it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can combine force_encoding with encode to transcode your string: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first example using encode above).
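If the broken bytes are simply unrecoverable (as with the truncated 0xC3 in the API response above), a sketch of stripping them before saving; this assumes Ruby >= 2.1 for String#scrub, and ApiLog is a hypothetical model name:
# The raw body arrives tagged as ASCII-8BIT, where every byte is "valid",
# so reinterpret it as UTF-8 first, then drop anything that is not valid UTF-8.
clean = response.dup.force_encoding('UTF-8').scrub('')
clean.valid_encoding?        #=> true, safe for a UTF8 Postgres column
ApiLog.create!(body: clean)  # hypothetical logging model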

How to convert Arabic digits to numeric in Ruby

Params also include Arabic digits, which I want to convert to Western digits:
"lexus/yr_٢٠٠١_٢٠٠٦"
I tried this one
params[:query].tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
but it gives an error:
Encoding::CompatibilityError Exception: incompatible character encodings: UTF-8 and ASCII-8BIT
To get around that, I do:
params[:query].force_encoding('utf-8').encode.tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
How can I convert this?
If you have enforced UTF-8 encoding then this should work fine.
str = "lexus/yr_٢٠٠١_٢٠٠٦"
str.tr('٠١٢٣٤٥٦٧٨٩','0123456789')
returns "lexus/yr_2001_2006"
ASCII-8BIT is not really an encoding. It is binary data, not something text-based, so transcoding ASCII-8BIT to UTF-8 is not a meaningful operation. I would recommend ensuring that the request that passes the query parameter from your text field uses a valid string encoding, if you can control this. You can use the String#valid_encoding? method in Ruby to check that you are receiving a correctly encoded string.
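Putting that together, a minimal sketch, assuming the parameter really is UTF-8 bytes that arrived tagged as ASCII-8BIT:
query = params[:query].to_s.dup
# Reinterpret the raw bytes as UTF-8, then verify they really are valid UTF-8
query.force_encoding('UTF-8')
raise ArgumentError, 'query is not valid UTF-8' unless query.valid_encoding?
# Transliterate the Arabic-Indic digits and strip spaces
query.tr('٠١٢٣٤٥٦٧٨٩', '0123456789').delete(' ')
#=> "lexus/yr_2001_2006"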

Percent-encoding a non-extended-ASCII char like extended chars

If we percent encode the char "€", we will have %E2%82%AC as result. Ok!
My problem:
a = %61
I already know it.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
If yes, will browsers and servers understand the result as the char "a"?
If we percent encode the char "€", we will have %E2%82%AC as result.
€ is Unicode codepoint U+20AC EURO SIGN. The byte sequence 0xE2 0x82 0xAC is how U+20AC is encoded in UTF-8. %E2%82%AC is the URL encoding of those bytes.
a = %61
I already know it.
For ASCII character a, aka Unicode codepoint U+0061 LATIN SMALL LETTER A, that is correct. It is encoded as byte 0x61 in UTF-8 (and most other charsets), and thus can be encoded as %61 in URLs.
Is it possible to encode "a" to something like %XX%XX or %XX%XX%XX?
Yes. Any character can be encoded using percent encoding in a URL. Simply encode the character in the appropriate charset, and then percent-encode the resulting bytes. However, most ASCII non-reserved characters do not require such encoding; you can just use them as-is.
If yes, will browsers and servers understand the result as the char "a"?
In URLs and URL-like content encodings (like application/x-www-form-urlencoded), yes.
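To illustrate, a small Ruby sketch; percent_encode_all_bytes is just an illustrative helper, not a standard API:
require 'cgi'
# Percent-encode every byte of the string's UTF-8 representation, even unreserved ASCII ones.
def percent_encode_all_bytes(str)
  str.bytes.map { |b| format('%%%02X', b) }.join
end
percent_encode_all_bytes('a')  #=> "%61"
percent_encode_all_bytes('€')  #=> "%E2%82%AC"
# Decoders treat both forms the same way:
CGI.unescape('%61')        #=> "a"
CGI.unescape('%E2%82%AC')  #=> "€"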

What extended ASCII encoding is this and how can I get ruby to understand it?

The characters 0x91, 0x92, 0x93, and 0x94 are supposed to represent what in Unicode are U+2018, U+2019, U+201c, and U+201d, or the "opening single quote", "closing single quote", "opening double quote", and "closing double quote". I thought that it was ISO-8859-1 but when I try to process a file using IO.read('file', :encoding=>'ISO-8859-1') it still does not recognize these characters.
If it isn't ISO-8859-1 then what is it? And if it is, why doesn't ruby recognize these characters?
UPDATE: Apparently this encoding is supposed to be Windows-1252. But ruby still does not recognize these characters when I do IO.read('file', :encoding=>'Windows-1252').
UPDATE 2: Nevermind, Windows-1252 works.
0x91 is the Windows-1252 representation of Unicode's \u2018 (AKA ‘):
>> "\x91".force_encoding('windows-1252').encode('utf-8')
=> "‘"
Windows-1252 and Latin-1 (AKA ISO 8859-1) are not the same; try using windows-1252 as the encoding:
IO.read('file', :encoding => 'windows-1252')
That will give you a string that knows it is Windows-1252. If you want UTF-8, then perhaps you want to specify the :internal_encoding and :external_encoding:
IO.read('file', :external_encoding => 'windows-1252', :internal_encoding => 'utf-8')
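For reference, a quick check (in irb) that the four bytes from the question come out as the expected curly quotes:
>> "\x91\x92\x93\x94".force_encoding('windows-1252').encode('utf-8')
=> "‘’“”"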

Inconsistent IO character reading when converting encoding

In Ruby 1.9.3-429, I am trying to parse plain text files with various encodings that will ultimately be converted to UTF-8 strings. Non-ASCII characters work fine with a file encoded as UTF-8, but problems come up with non-UTF-8 files.
Simplified example:
File.open(file) do |io|
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end
  line
end
Both files are just a single string áÁð encoded appropriately. I have checked that the files have been encoded correctly via $ file -i <file_name>
With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
The way I interpret this is that readchar is returning a string whose encoding was converted incorrectly, which causes slice to behave incorrectly.
Is this behavior correct, or am I specifying the file's external encoding incorrectly? I would rather not rewrite this process, so I am hoping I am making a mistake somewhere. There are reasons why I am parsing files this way, but I don't think they are relevant to my question. Specifying the internal and external encoding as options to File.open yielded the same results.
This behavior is a bug. See http://bugs.ruby-lang.org/issues/8516 for details.
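Until the bug is fixed, one possible workaround is to avoid the per-character transcoding path: read the raw bytes, transcode the whole string in one step, and then iterate over characters. A sketch, assuming charset holds an encoding name Ruby recognizes:
File.open(file, 'rb') do |io|
  # Read raw bytes, tag them with the source encoding, transcode once
  utf8 = io.read.force_encoding(charset.upcase).encode(Encoding::UTF_8)
  utf8.each_char do |char|
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
  end
end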
