String length difference between ruby 1.8 and 1.9 - ruby-on-rails

I have a website that's running on Ruby 1.8.7. I have a validation on an incoming post that checks that we allow up to a max of 12000 characters. Spaces are counted as characters, and tabs and carriage returns are stripped off before the post is subjected to the validation.
Here is the post that is subjected to the validation: http://pastie.org/5047582
In Ruby 1.9 the string length shows up as 11909, which is correct. But when I check the length on Ruby 1.8.7 it turns out to be 12044.
I used codepad.org to run this Ruby code, which gives me http://codepad.org/OxgSuKGZ (it outputs the length as 12044, which is wrong), but when I run the same code in the console at codeacademy.org the string length is 11909.
Can anybody explain to me why this is happening?
Thanks

This is a Unicode issue. The string you are using contains characters outside the ASCII range, and the commonly used UTF-8 encoding represents those as 2 (or more) bytes.
Ruby 1.8 did not handle Unicode properly, and length simply gives the number of bytes in the string, which results in fun stuff like:
"ą".length
=> 2
Ruby 1.9 has better Unicode handling. This includes length returning the actual number of characters in the string, as long as Ruby knows the encoding:
"ä".length
=> 1
One possible workaround in Ruby 1.8 is using regular expressions, which can be made Unicode aware:
"ą".scan(/./mu).size
=> 1
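For the 12000-character validation in the original question, here is a rough sketch (mine, not part of the answer) of counting characters the same way on 1.8.7 and on 1.9; post_body stands in for the incoming post text:
def char_length(str)
  if str.respond_to?(:encoding)    # Ruby 1.9+: length already counts characters
    str.length
  else                             # Ruby 1.8: count UTF-8 characters via a /u regex
    str.scan(/./mu).size
  end
end

cleaned = post_body.gsub(/[\t\r]/, '')   # strip tabs and carriage returns, as described in the question
valid   = char_length(cleaned) <= 12_000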

Related

cyrillic strings Я̆ Я̄ Я̈ return length 2 instead of 1 in ruby and other programming languages

In Ruby, JavaScript and Java (others I didn't try), the Cyrillic chars Я̆ Я̄ Я̈ have length 2. When I try to check the length of a string with these chars inside, I get a bad output value.
"Я̈".mb_chars.length
#=> 2 #should be 1 (ruby on rails)
"Я̆".length
#=> 2 #should be 1 (ruby, javascript)
"Ӭ".length
#=> 1 #correct (ruby, javascript)
Please note that the strings are encoded in UTF-8 and each char behaves as a single character.
My question is: why is there such behaviour, and how can I correctly get the length of a string with these chars inside?
The underlying problem is that Я̈ is actually two code points: the Я and the umlaut are separate:
'Я̈'.chars
#=> ["Я", "̈"]
Normally you'd solve this sort of problem through Unicode normalization, but that alone won't help you here as there is no single code point for Я̈ or Я̆ (but there is for Ӭ).
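To illustrate (my sketch, not part of the answer, using String#unicode_normalize from Ruby 2.2+): composition only helps where a precomposed code point exists, which is the case for Ӭ but not for Я̈:
"Я̈".unicode_normalize(:nfc).length   #=> 2, unchanged: there is no precomposed Я̈ to compose into
"Ӭ".unicode_normalize(:nfd).length   #=> 2, the precomposed Ӭ splits into base letter + diaeresis
"Ӭ".unicode_normalize(:nfc).length   #=> 1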
You could strip off the diacritics before checking the length:
'Я̆'.gsub(/\p{Diacritic}/, '')
#=> "Я"
'Я̆'.gsub(/\p{Diacritic}/, '').length
#=> 1
You'll get the desired length but the strings won't be quite the same. This also works on things like Ӭ which can be represented by a single code point:
'Ӭ'.length
#=> 1
'Ӭ'.gsub(/\p{Diacritic}/, '')
#=> "Ӭ"
'Ӭ'.gsub(/\p{Diacritic}/, '').length
#=> 1
Unicode is wonderful and awesome and solves many problems that used to plague us. Unfortunately, Unicode is also horrible and complicated because human languages and glyphs weren't exactly designed.
Ruby 2.5 adds String#each_grapheme_cluster:
'Я̆Я̄Я̈'.each_grapheme_cluster.to_a #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.each_grapheme_cluster.count #=> 3
Note that you can't use each_grapheme_cluster.size, which is equivalent to each_char.size, so both would return 6 in the above example. (That looks like a bug; I've just filed a bug report.)
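On Rubies older than 2.5 you can get the same per-glyph count with the \X regex, which matches an extended grapheme cluster (a small sketch of mine, not part of the answer):
'Я̆Я̄Я̈'.scan(/\X/)          #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.scan(/\X/).length   #=> 3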
Try unicode-display_width which is built to give an exact answer to this question:
require "unicode/display_width"
Unicode::DisplayWidth.of "Я̈" #=> 1

Character Encoding not resolved

I have a text file with unknown character formatting; below is a snapshot:
\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134
Does anyone have an idea how I can convert it to normal text?
This is apparently how Lua stores strings. Each \nnn represents a single byte where nnn is the byte's value in decimal. (A similar notation is commonly used for octal, which threw me off for longer than I would like to admit. I should have noticed that there were digits 8 and 9 in the data!) This particular string is just plain old UTF-8.
$ perl -ple 's/\\(\d{3})/chr($1)/ge' <<<'\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134'
دموع المرأة أقوى نفوذاً من القوانين
You would obviously get a similar result simply by printing the string from Lua, though I'm not familiar enough with the language to tell you how exactly to do that.
Post scriptum: I had to look this up for other reasons, so here's how to execute Lua from the command line.
lua -e 'print("\216\175\217\133\217\136\216\185 \216\167\217\132\217\133\216\177\216\163\216\169 \216\163\217\130\217\136\217\137 \217\134\217\129\217\136\216\176\216\167\217\139 \217\133\217\134 \216\167\217\132\217\130\217\136\216\167\217\134\217\138\217\134")'

Ruby Gem randomly returns Encoding Error

So I forked this gem on GitHub, thinking that I may be able to fix and update some of the issues with it for use in a Rails project. I basically get this output:
irb(main):020:0> query = Query::simpleQuery('xx.xxx.xxx.xx', 25565)
=> [false, #<Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT>]
irb(main):021:0> query = Query::simpleQuery('xx.xxx.xxx.xx', 25565)
=> {:motd=>"Craftnet", :gametype=>"SMP", :map=>"world", :numplayers=>"0", :maxplayers=>"48"}
The first response is an example of the encoding error, and the second is the wanted output (IPs taken out). Basically this is querying a Minecraft server for information about it.
I tried using
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
But that just gave the same behaviour: it randomly spits encoding errors, and sometimes it doesn't.
Here is the relevant GitHub repo with all the code: RubyMinecraft
Any help would be greatly appreciated.
In the Query class there is this line:
@key = Array(key).pack('N')
This creates a String with an associated encoding of ASCII-8BIT (i.e. it’s a binary string).
Later @key gets used in this line:
query = @sock.send("\xFE\xFD\x00\x01\x02\x03\x04" + @key, 0)
In Ruby 2.0 the default encoding of String literals is UTF-8, so this is combining a UTF-8 string with a binary one.
When Ruby tries to do this it first checks to see if the binary string only contains 7-bit values (i.e. all bytes are less than or equal to 127, with the top bit being 0), and if it does it considers it compatible with UTF-8 and so combines them without further issue. If it doesn't (i.e. if it contains bytes greater than 127), then the two strings are not compatible and an Encoding::CompatibilityError is raised.
Whether an error is raised depends on the contents of @key, which is initialized from a response from the server. Sometimes this value happens to contain only 7-bit values, so no error is raised; at other times there is a byte with the high bit set, so it generates an error. This is why the errors appear to be “random”.
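A minimal illustration of that compatibility rule (my sketch, not taken from the gem's code):
utf8  = "\xFE\xFD\x00\x01\x02\x03\x04"   # UTF-8 literal containing bytes above 127
seven = [1, 2, 3].pack('N')              # ASCII-8BIT, but every byte is below 128
high  = [0xdeadbeef].pack('N')           # ASCII-8BIT with bytes above 127
utf8 + seven   # works: a 7-bit binary string counts as compatible
utf8 + high    # raises Encoding::CompatibilityError: UTF-8 and ASCII-8BIT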
To fix it you can specify that the string literal in the line where the two strings are combined should be treated as binary. The simplest way would be to use force_encoding like this:
query = #sock.send("\xFE\xFD\x00\x01\x02\x03\x04".force_encoding(Encoding::ASCII_8BIT) + #key, 0)

tackle different types of utf hyphens in ruby 1.8.7

We have different types of hyphens/dashes (in some text) populated in the DB. Before comparing them with some user input text, I have to normalize any type of dash/hyphen to the simple hyphen/minus (ASCII 45).
The possible dashes we have to convert are:
Minus (−) U+2212
Hyphen-minus (-) U+002D
Hyphen (‐) U+2010
Soft hyphen U+00AD
Non-breaking hyphen U+2011
Figure dash (‒) U+2012
En dash (–) U+2013
Em dash (—) U+2014
Horizontal bar (―) U+2015
These all have to be converted to Hyphen-minus(-) using gsub.
I've used the CharDet gem to detect the character encoding of the fetched string. It reports windows-1252. I've tried Iconv to convert the encoding to ASCII, but it throws an Iconv::IllegalSequence exception.
ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'
Any idea how to accomplish this?
Caveat: I know nothing about Ruby, but you have problems that have nothing to do with the programming language that you are using.
You don't need to convert Hyphen-minus (-) U+002D to the simple hyphen/minus (ASCII 45); they're the same thing.
You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.
Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 aka cp1252 -- however this may be by accident as chardet sometimes seems to report that as a default when it has exhausted other possibilities.
(a) These Unicode characters cannot be encoded in latin1, cp1252 or ASCII:
Minus (−) U+2212
Hyphen (‐) U+2010
Non-breaking hyphen U+2011
Figure dash (‒) U+2012
Horizontal bar (―) U+2015
What gives you the impression that they may possibly appear in the input or in the database?
(b) These Unicode characters can be encoded in cp1252 but not latin1 or ASCII:
En dash (–) U+2013
Em dash (—) U+2014
These (most likely the EN DASH) are what you really need to convert to an ASCII hyphen/dash. What was in the string that chardet reported as windows-1252?
(c) This one can be encoded in cp1252 and latin1 but not ASCII:
Soft hyphen U+00AD
If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ASCII will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?
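That said, if the bytes do turn out to be UTF-8, a gsub along these lines (my sketch, not part of the answer) maps the dash variants from the question to an ASCII hyphen-minus on Ruby 1.9+; text stands in for the string fetched from the DB. Under 1.8.7 you would put the literal UTF-8 characters in the character class and add the /u flag, since \u escapes aren't available there:
DASHES = /[\u2212\u2010\u00AD\u2011\u2012\u2013\u2014\u2015]/   # the variants listed in the question
normalized = text.gsub(DASHES, '-')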

Length of a unicode string

In my Rails (2.3, Ruby 1.8.7) application, I need to truncate a string to a certain length. The string is Unicode, and when running tests in the console, such as 'א'.length, I realized that double the length is returned. I would like an encoding-agnostic length, so that the same truncation would be done for a Unicode string or a latin1-encoded string.
I've gone over most of the Unicode material for Ruby, but am still a little in the dark. How should this problem be tackled?
Rails has an mb_chars method which returns multibyte characters. Try unicode_string.mb_chars.slice(0,50)
"ア".size # 3 in 1.8, 1 in 1.9
puts "ア".scan(/./mu).size # 1 in both 1.8 and 1.9
chars and mb_chars don't give you text elements, which is what you seem to be looking for.
For text elements you'll want the unicode gem.
mb_chars:
>> 'กุ'.mb_chars.size
=> 2
>> 'กุ'.mb_chars.first.to_s
=> "ก"
text_elements:
>> Unicode.text_elements('กุ').size
=> 1
>> Unicode.text_elements('กุ').first
=> "กุ"
You can use something like str.chars.slice(0, 50).join to get the first 50 characters of a string, no matter how many bytes it uses per character.
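If you'd rather not depend on Rails' mb_chars or the unicode gem, a plain-Ruby sketch (mine, not from the answers) that slices by character under 1.8.7 using the /u regex trick shown above, assuming the string's bytes are UTF-8:
def truncate_chars(str, limit)
  str.scan(/./mu).first(limit).join   # /u splits the bytes into UTF-8 characters
end
truncate_chars(unicode_string, 50)    # first 50 characters, regardless of byte width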
