What extended ASCII encoding is this and how can I get ruby to understand it? - ruby-on-rails

The characters 0x91, 0x92, 0x93, and 0x94 are supposed to represent what in Unicode are U+2018, U+2019, U+201c, and U+201d, or the "opening single quote", "closing single quote", "opening double quote", and "closing double quote". I thought that it was ISO-8859-1 but when I try to process a file using IO.read('file', :encoding=>'ISO-8859-1') it still does not recognize these characters.
If it isn't ISO-8859-1 then what is it? And if it is, why doesn't ruby recognize these characters?
UPDATE: Apparently this encoding is supposed to be Windows-1252. But ruby still does not recognize these characters when I do IO.read('file', :encoding=>'Windows-1252').
UPDATE 2: Never mind, Windows-1252 works.

0x91 is the Windows-1252 representation of Unicode's \u2018 (AKA ‘):
>> "\x91".force_encoding('windows-1252').encode('utf-8')
=> "‘"
Windows-1252 and Latin-1 (AKA ISO 8859-1) are not the same, so try using windows-1252 as the encoding:
IO.read('file', :encoding => 'windows-1252')
That will give you a string that knows it is Windows-1252. If you want UTF-8, then you probably want to specify both the :external_encoding and the :internal_encoding:
IO.read('file', :external_encoding => 'windows-1252', :internal_encoding => 'utf-8')
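As a quick sanity check, here is a minimal sketch (the file name sample.txt is made up) that writes those four bytes and reads them back through the Windows-1252 to UTF-8 conversion:
# Write the four Windows-1252 "smart quote" bytes to a throwaway file.
File.open('sample.txt', 'wb') { |f| f.write("\x91\x92\x93\x94") }

# Read them back, transcoding from Windows-1252 to UTF-8 on the way in.
text = IO.read('sample.txt',
               :external_encoding => 'windows-1252',
               :internal_encoding => 'utf-8')

puts text           #=> ‘’“”
puts text.encoding  #=> UTF-8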

Related

How to convert arabic digits to numeric in ruby

The params include Arabic digits, which I want to convert to ASCII digits:
"lexus/yr_٢٠٠١_٢٠٠٦"
I tried this:
params[:query].tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
but it gives an error:
Encoding::CompatibilityError Exception: incompatible character encodings: UTF-8 and ASCII-8BIT
To work around that, I do:
params[:query].force_encoding('utf-8').encode.tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
How can I convert this properly?
If you have enforced UTF-8 encoding then this should work fine.
str = "lexus/yr_٢٠٠١_٢٠٠٦"
str.tr('٠١٢٣٤٥٦٧٨٩','0123456789')
returns "lexus/yr_2001_2006"
ASCII-8BIT is not really an encoding: it is binary data, not something text-based, so transcoding ASCII-8BIT to UTF-8 is not a meaningful operation. I would recommend ensuring that the request that passes the query parameter through your text field uses a valid string encoding, if you can control this. You can use the String#valid_encoding? method in Ruby to check that you are receiving a correctly encoded string.
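For example, a minimal sketch of that check, with a made-up byte string standing in for the incoming parameter:
# Hypothetical incoming bytes, tagged ASCII-8BIT (binary), as in the error above.
raw = "lexus/yr_\xD9\xA2\xD9\xA0\xD9\xA0\xD9\xA1"

# The bytes are really UTF-8, so relabel them rather than transcode.
str = raw.force_encoding('UTF-8')

if str.valid_encoding?
  puts str.tr('٠١٢٣٤٥٦٧٨٩', '0123456789')  #=> lexus/yr_2001
else
  # Reject malformed input here instead of crashing later.
  raise ArgumentError, 'query parameter is not valid UTF-8'
end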

Lua hex string to ASCII?

I want to convert a hex code to a character (for the game ROBLOX).
Here's the page for the icon (U+25BA, BLACK RIGHT-POINTING POINTER):
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that character.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the Unicode code point with print"\u{25BA}" (the \u{} escape requires Lua 5.3+).
Output the UTF-8 encoding with hex escapes, print"\xE2\x96\xBA" (\x escapes require Lua 5.2+).
Output the UTF-8 encoding with decimal escapes, which work in any Lua version: print"\226\150\186".

How to properly handle invalid bytes in UTF-8 strings?

I have a string with encoding ASCII-8BIT:
str = 'quindi \xE8 al \r\ngoverno'
I want to transcode it to UTF-8 so the characters display properly.
Naturally, \xE8 is not a valid sequence in UTF-8, so I get an error when I try:
str.encode 'utf-8'
which raises:
Encoding::UndefinedConversionError: "\xE8" from ASCII-8BIT to UTF-8
Reading the docs for the encode method, I came up with this solution:
encode('UTF-8', invalid: :replace, undef: :replace)
This way all the invalid sequences are replaced with ?. But I want to display the proper character instead of ?, and I have several such escape sequences in this text (\xE8, \xE0, ...).
Is there a way to automatically replace them with the right character?
Your string seems to be ISO-8859-1 encoded. This should work:
str = "quindi \xE8 al \r\ngoverno"
str.force_encoding('ISO-8859-1').encode('UTF-8')
#=> "quindi è al \r\ngoverno"
Note that you have to use double quotes: with single quotes, '\xE8' is four literal characters (\, x, E, 8) rather than the single byte 0xE8.
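As a related note, String#encode also accepts the source encoding as a second argument, so the relabelling and transcoding can be done in one call:
str = "quindi \xE8 al \r\ngoverno"
str.encode('UTF-8', 'ISO-8859-1')
#=> "quindi è al \r\ngoverno"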

Inconsistent IO character reading when converting encoding

In Ruby 1.9.3-429, I am trying to parse plain text files with various encodings that will ultimately be converted to UTF-8 strings. Non-ascii characters work fine with a file encoded as UTF-8, but problems come up with non-UTF-8 files.
Simplified example:
File.open(file) do |io|
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0,1)
    line << char
  end
  line
end
Both files are just a single string áÁð encoded appropriately. I have checked that the files have been encoded correctly via $ file -i <file_name>
With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
My interpretation is that readchar is returning an incorrectly converted string, which causes slice to return the wrong result.
Is this behavior correct? Or am I specifying the file external encoding incorrectly? I would rather not rewrite this process so I am hoping I am making a mistake somewhere. There are reasons why I am parsing files this way, but I don't think those are relevant to my question. Specifying the internal and external encoding as an option in File.open yielded the same results.
This behavior is a bug. See http://bugs.ruby-lang.org/issues/8516 for details.
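Until that bug is fixed, one possible workaround (a sketch reusing file and charset from the question, and assuming the file fits in memory) is to let IO.read do the transcoding in a single pass and then walk the characters of the resulting string instead of calling readchar:
text = IO.read(file,
               :external_encoding => charset.upcase,
               :internal_encoding => 'utf-8')

text.each_char do |char|
  puts "Character #{char} has #{char.each_codepoint.count} codepoints"
end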

tackle different types of utf hyphens in ruby 1.8.7

We have various kinds of hyphens/dashes in text stored in the db. Before comparing that text with user input, I have to normalize every dash/hyphen variant to the plain hyphen-minus (ASCII 45).
The possible dashes we have to convert are:
Minus (−) U+2212
Hyphen-minus (-) U+002D
Hyphen (‐) U+2010
Soft hyphen U+00AD
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012
En dash (–) U+2013
Em dash (—) U+2014
Horizontal bar (―) U+2015
These all have to be converted to Hyphen-minus(-) using gsub.
I've used the CharDet gem to detect the character encoding of the fetched string; it reports windows-1252. I've tried Iconv to convert the encoding to ASCII, but it throws an Iconv::IllegalSequence exception.
ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'
Any idea how to accomplish this?
Caveat: I know nothing about Ruby, but you have problems that are nothing to do with the programming language that you are using.
You don't need to convert Hyphen-minus (-) U+002D to the simple hyphen/minus (ASCII 45); they are the same character.
You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.
Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 aka cp1252 -- however this may be by accident as chardet sometimes seems to report that as a default when it has exhausted other possibilities.
(a) These Unicode characters cannot be decoded into latin1 or cp1252 or ascii:
Minus (−) U+2212
Hyphen (‐) U+2010
Non-breaking hyphen (‑) U+2011
Figure dash (‒) U+2012
Horizontal bar (―) U+2015
What gives you the impression that they may possibly appear in the input or in the database?
(b) These Unicode characters can be decoded into cp1252 but not latin1 or ascii:
En dash (–) U+2013
Em dash (—) U+2014
These (most likely the EN DASH) are what you really need to convert to an ascii hyphen/dash. What was in the string that chardet reported as windows-1252?
(c) This can be decoded into cp1252 and latin1 but not ascii:
Soft hyphen U+00AD
If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ascii will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?
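That said, if the diagnosis above is right and the fetched bytes really are cp1252, then the en dash, em dash, and soft hyphen are the single bytes 0x96, 0x97, and 0xAD, and a minimal Ruby 1.8.7 sketch of the normalization (with a made-up input string) is:
# Ruby 1.8.7 strings are byte strings; the //n flag keeps the regexp byte-oriented.
raw = "2001\x962006"              # hypothetical bytes as fetched from the db
raw.gsub(/[\x96\x97\xAD]/n, '-')  #=> "2001-2006"
If the text has already been converted to UTF-8 instead, the same idea applies to the multi-byte sequences: gsub(/\xE2\x80[\x90-\x95]|\xE2\x88\x92|\xC2\xAD/n, '-') covers U+2010 through U+2015, the minus sign, and the soft hyphen.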