Length of a Unicode string - ruby-on-rails

In my Rails (2.3, Ruby 1.8.7) application, I need to truncate a string to a certain length. The string is Unicode, and when running tests in the console, such as 'א'.length, I realized that double the expected length is returned (2 instead of 1, since א occupies two bytes in UTF-8). I would like an encoding-agnostic length, so that the same truncation is done for a Unicode string as for a Latin-1 encoded string.
I've gone over most of the Unicode material for Ruby, but am still a little in the dark. How should this problem be tackled?

Rails has an mb_chars method which returns a multibyte-aware proxy for the string. Try unicode_string.mb_chars.slice(0, 50)
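A rough console sketch of the difference (Rails 2.3 sets $KCODE = 'u' for you on Ruby 1.8; the string here is just an example):
'אבג'.length                     #=> 6 -- Ruby 1.8 counts bytes
'אבג'.mb_chars.length            #=> 3 -- characters
'אבג'.mb_chars.slice(0, 2).to_s  #=> "אב" -- to_s converts the Chars proxy back to a String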

"ア".size # 3 in 1.8, 1 in 1.9
puts "ア".scan(/./mu).size # 1 in both 1.8 and 1.9

chars and mb_chars don't give you text elements, which is what you seem to be looking for.
For text elements you'll want the unicode gem.
mb_chars:
>> 'กุ'.mb_chars.size
=> 2
>> 'กุ'.mb_chars.first.to_s
=> "ก"
text_elements:
>> Unicode.text_elements('กุ').size
=> 1
>> Unicode.text_elements('กุ').first
=> "กุ"
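So, for the original truncation problem, a minimal sketch using text elements (str and the limit of 50 are placeholders):
require 'unicode'
Unicode.text_elements(str).slice(0, 50).join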

You can use something like str.chars.slice(0, 50).join to get the first 50 characters of a string, regardless of how many bytes each character occupies.
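For example, in Ruby 2.0+, where chars returns an Array (on 1.9 it returns an Enumerator, so use chars.first(3).join there):
"ΩΨΘΣΔ".chars.slice(0, 3).join #=> "ΩΨΘ"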

Related

Cyrillic strings Я̆ Я̄ Я̈ return length 2 instead of 1 in Ruby and other programming languages

In Ruby, JavaScript and Java (others I didn't try), the Cyrillic chars Я̆ Я̄ Я̈ have length 2. When I try to check the length of a string with these chars inside, I get the wrong value.
"Я̈".mb_chars.length
#=> 2 #should be 1 (ruby on rails)
"Я̆".length
#=> 2 #should be 1 (ruby, javascript)
"Ӭ".length
#=> 1 #correct (ruby, javascript)
Please note that the strings are encoded in UTF-8 and each char displays as a single character.
My question is: why is there such behaviour, and how can I correctly get the length of strings containing these chars?
The underlying problem is that Я̈ is actually two code points: the Я and the umlaut are separate:
'Я̈'.chars
#=> ["Я", "̈"]
Normally you'd solve this sort of problem through Unicode normalization, but that alone won't help you here, as there is no single code point for Я̈ or Я̆ (but there is for Ӭ).
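A quick check (Ruby 2.2+ added String#unicode_normalize) shows NFC composition fixing Ӭ but not Я̈:
"Э\u0308".unicode_normalize(:nfc)        #=> "Ӭ" -- composes to the single code point U+04EC
"Я\u0308".unicode_normalize(:nfc).chars  #=> ["Я", "̈"] -- still two code points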
You could strip off the diacritics before checking the length:
'Я̆'.gsub(/\p{Diacritic}/, '')
#=> "Я"
'Я̆'.gsub(/\p{Diacritic}/, '').length
#=> 1
You'll get the desired length but the strings won't be quite the same. This also works on things like Ӭ which can be represented by a single code point:
'Ӭ'.length
#=> 1
'Ӭ'.gsub(/\p{Diacritic}/, '')
#=> "Ӭ"
'Ӭ'.gsub(/\p{Diacritic}/, '').length
#=> 1
Unicode is wonderful and awesome and solves many problems that used to plague us. Unfortunately, Unicode is also horrible and complicated because human languages and glyphs weren't exactly designed.
Ruby 2.5 adds String#each_grapheme_cluster:
'Я̆Я̄Я̈'.each_grapheme_cluster.to_a #=> ["Я̆", "Я̄", "Я̈"]
'Я̆Я̄Я̈'.each_grapheme_cluster.count #=> 3
Note that you can't use each_grapheme_cluster.size, which is equivalent to each_char.size; both would return 6 in the above example. (That looks like a bug; I've just filed a bug report.)
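The same enumerator also gives a clean way to truncate by user-perceived characters, e.g.:
'Я̆Я̄Я̈'.each_grapheme_cluster.first(2).join #=> "Я̆Я̄"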
Try unicode-display_width which is built to give an exact answer to this question:
require "unicode/display_width"
Unicode::DisplayWidth.of "Я̈" #=> 1

Convert string to two-byte hex Unicode in Ruby on Rails

I'm translating messages using Bing Translator and sending the results via SMS text message. The resulting strings often contain non-English characters, e.g. Korean, Japanese, Greek.
I am using the Clickatell SMS Gateway and according to the Clickatell spec here: http://www.clickatell.com/downloads/http/Clickatell_HTTP.pdf
...I think I should encode my strings in two-byte Unicode, hex-encoded.
For example, the Greek characters:
ΩΨΘ
After conversion should become:
03A903A80398
Which is then added to the querystring in my HTTP get request.
My issue, however, is finding the syntax to do this succinctly in my Ruby on Rails app.
I like fun ones like these. :) Try this:
input.codepoints.map { |c| "%04X" % c }.join
What I see:
[1] pry(main)> x = "\u03A9\u03A8\u0398"
=> "ΩΨΘ"
[2] pry(main)> x.codepoints.map { |c| "%04X" % c }.join
=> "03A903A80398"
No need to split the string -- use each_char to iterate instead.
Call map (or collect) directly on each_char, and use "%04X" with a capital "X" rather than a lower-case "x" followed by upcase:
input.each_char.map { |c| "%04X" % c.ord }.join
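For the reverse direction, a minimal sketch decoding the hex back into a string (assuming every code point fits the four-hex-digit form used above):
"03A903A80398".scan(/.{4}/).map { |h| h.to_i(16) }.pack("U*") #=> "ΩΨΘ"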

String length difference between ruby 1.8 and 1.9

I have a website that's running on Ruby 1.8.7. I have a validation on incoming posts that allows up to a maximum of 12000 characters. Spaces are counted as characters, and tabs and carriage returns are stripped off before the post is subjected to the validation.
Here is the post that is subjected to validation http://pastie.org/5047582
In Ruby 1.9 the string length shows up as 11909, which is correct. But when I check the length on Ruby 1.8.7, it turns out to be 12044.
I used codepad.org to run this Ruby code, which gives me http://codepad.org/OxgSuKGZ (it outputs the length as 12044, which is wrong), but when I run the same code in the console at codeacademy.org the string length is 11909.
Can anybody explain why this is happening?
Thanks
This is a Unicode issue. The string you are using contains characters outside the ASCII range, and the UTF-8 encoding that is frequently used encodes those as 2 (or more) bytes.
Ruby 1.8 did not handle Unicode properly, and length simply gives the number of bytes in the string, which results in fun stuff like:
"ą".length
=> 2
Ruby 1.9 has better Unicode handling. This includes length returning the actual number of characters in the string, as long as Ruby knows the encoding:
"ä".length
=> 1
One possible workaround in Ruby 1.8 is using regular expressions, which can be made Unicode aware:
"ą".scan(/./mu).size
=> 1
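Another Ruby 1.8 workaround is the jcode standard library, which adds a character-aware jlength method once $KCODE is set (a sketch for 1.8 only; jcode was removed in 1.9):
require 'jcode'
$KCODE = 'u'    # treat strings as UTF-8
"ą".jlength     #=> 1 -- counts characters, not bytes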

gsub string for all spaces in Ruby 1.8

I have a string with spaces (one simple space and one ideographic space):
"qwe rty uiop".gsub(/[\s]+/,'') #=> "qwe rtyuiop"
How can I add all space-codes (for example 3000, 2060, 205f) in my pattern?
In Ruby 1.9 I just added \u3000 and other codes, but how do it in 1.8?
I think I found the answer. ActiveSupport::Multibyte::Chars has a UNICODE_WHITESPACE constant. Solution:
pattern = ActiveSupport::Multibyte::Chars::UNICODE_WHITESPACE.collect do |c|
  c.pack "U*"
end.join '|'
puts "qwe rty uiop".mb_chars.gsub(/#{pattern}/,'')
#=> qwertyuiop
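If you'd rather not depend on ActiveSupport, a minimal sketch for plain Ruby 1.8 that builds the pattern from the code points mentioned in the question (the list is illustrative, not exhaustive):
codes = [0x3000, 0x2060, 0x205F]   # ideographic space, word joiner, medium mathematical space
pattern = codes.map { |cp| Regexp.escape([cp].pack("U")) }.join("|")
puts "qwe rty　uiop".gsub(/#{pattern}|\s+/u, '')
#=> qwertyuiop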

Regex "\w" doesn't process utf-8 characters in Ruby 1.9.2

Regex \w doesn't match UTF-8 characters in Ruby 1.9.2. Has anybody faced the same problem?
Example:
/[\w\s]+/u
In my Rails application.rb I've added config.encoding = "utf-8"
Define "doesn't match UTF-8 characters"? If you expect \w to match anything other than exactly the upper-case and lower-case ASCII letters, the ASCII digits, and the underscore, it won't: Ruby defines \w to be equivalent to [A-Za-z0-9_] regardless of Unicode. You probably want \p{Word} or something similar instead.
Ref: Ruby 1.9 Regexp documentation (see section "Character Classes").
You could always use something like
[a-zA-Z0-9_ñáéíóú]
instead of \w
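A quick check of the difference (Ruby 1.9+):
"añ" =~ /\A\w+\z/        #=> nil -- \w is ASCII-only, so ñ breaks the match
"añ" =~ /\A\p{Word}+\z/  #=> 0 -- \p{Word} is Unicode-aware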
