regex error(Java::JavaLang::ArrayIndexOutOfBoundsException: 4) on 3 character non-english strings - ruby-on-rails

The following steps reproduce the error. Any workarounds/fixes? Thanks.
Also this happens with
jruby 1.5.2/rails 2.3.9 and jruby 1.6/rails 3.0.5
regex = /(aaa|bbb):/
str = "\343\202\242:"
str =~ regex
Steps in action.
d:\myapp>jruby script/rails console
Loading development environment (Rails 3.0.5)
irb(main):001:0> regex = /(aaa|bbb):/
=> /(aaa|bbb):/
irb(main):002:0> str = "\343\202\242:"
=> "péó:"
irb(main):003:0> str =~ regex
Java::JavaLang::ArrayIndexOutOfBoundsException: 4
from org.jcodings.MultiByteEncoding.safeLengthForUptoFour(MultiByteEncoding.java:5
from org.jcodings.specific.NonStrictUTF8Encoding.length(NonStrictUTF8Encoding.java
from org.joni.Matcher.forwardSearchRange(Matcher.java:124)
from org.joni.Matcher.search(Matcher.java:432)
from org.jruby.RubyRegexp.search(RubyRegexp.java:1474)
from org.jruby.RubyRegexp.op_match(RubyRegexp.java:1391)
from org.jruby.RubyString.op_match(RubyString.java:1557)
from org.jruby.RubyString$i$1$0$op_match.call(RubyString$i$1$0$op_match.gen:65535)
from org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:
from org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:139)
from org.jruby.ast.CallOneArgNode.interpret(CallOneArgNode.java:57)
from org.jruby.ast.NewlineNode.interpret(NewlineNode.java:103)
from org.jruby.ast.RootNode.interpret(RootNode.java:129)
from org.jruby.evaluator.ASTInterpreter.INTERPRET_EVAL(ASTInterpreter.java:95)
from org.jruby.evaluator.ASTInterpreter.evalWithBinding(ASTInterpreter.java:160)
from org.jruby.RubyKernel.evalCommon(RubyKernel.java:1134)
... 158 levels...
from org.jruby.RubyKernel$s$1$0$require.call(RubyKernel$s$1$0$require.gen:65535)
from org.jruby.internal.runtime.methods.JavaMethod$JavaMethodOneOrNBlock.call(Java
from org.jruby.internal.runtime.methods.AliasMethod.call(AliasMethod.java:61)
from org.jruby.internal.runtime.methods.AliasMethod.call(AliasMethod.java:61)
from org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:
from org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:139)
from script.rails.__file__(script/rails:6)
from script.rails.load(script/rails)
from org.jruby.Ruby.runScript(Ruby.java:670)
from org.jruby.Ruby.runNormally(Ruby.java:574)
from org.jruby.Ruby.runFromMain(Ruby.java:423)
from org.jruby.Main.doRunFromMain(Main.java:278)
from org.jruby.Main.internalRun(Main.java:198)
from org.jruby.Main.run(Main.java:164)
from org.jruby.Main.run(Main.java:148)
from org.jruby.Main.main(Main.java:128)irb(main):004:0>

Perhaps the regex is not delimited.
How about
str =~ /(aaa|bbb):/
or
regex='(aaa|bbb):'
str =~ /regex/

I think because you're passing a string that contains multibyte characters you need to pass the /u regex parameter to parse it as a UTF-8 string.
Just checked this in different versions of Ruby and it only appears to happen in JRuby so I think you've found a bug ;)
If you use something like "string".to_java_string first it would appear to work but it's actually converting it to ISO-8859-1 first which you don't want. To maintain your encoding just use .to_java and pass it the regex.
Here's a work around that should work I think:
regex = /(aaa|bbb):/u
str = "\343\202\242:"
str.to_java =~ regex

Related

Incompatible Character Encoding in rails - how to just fail/skip sensibly?

I'm having an issue when importing Email subjects via IMAP.
I'm getting a problem, I think related to the £ sign in email subjects.
Having spent a couple of hours touring around various answers I can't seem to find anything that works...
If I try the following...
Using ruby 2.1.2
views/emails/index
=email.subject
incompatible character encodings: ASCII-8BIT and UTF-8
=email.subject.scrub
incompatible character encodings: ASCII-8BIT and UTF-8
= email.subject.encode!('UTF-8', 'UTF-8', :invalid => :replace)
invalid byte sequence in UTF-8
= email.subject.force_encoding('UTF-8')
invalid byte sequence in UTF-8
= email.subject.encode("UTF-8", invalid: :replace)
"\xA3" from ASCII-8BIT to UTF-8
/xA3 is the '£' sign which shouldn't be that unusual.
I'm currently working with the following...
-if email.subject.force_encoding('UTF-8').valid_encoding?
=email.subject
-else
"Can't display"
What I would ideally do is just have something which checked if the encoding was working, and then did something like #scrub is supposed to do... I'd even take it with '/xA3' perfectly happily so long as it wasn't throwing an error and I could basically see the text.
Any ideas on either how to do it properly or a fudge to solve the issue?
After much pain this is how I solved it.
You need to add default encoding to your environment.rb file, like so:
# Load the rails application
require File.expand_path('../application', __FILE__)
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
# Initialize the rails application
Stma::Application.initialize!
Apparently this is something to do with Ruby's roots in japan. When dealing with Japanese (or russian) characters this wouldn't be helpful so this sort of thing isn't there as standard.
I've then done the following:
mail_object = Mail.new(mail[0].attr["RFC822"])
subject = mail_object.subject.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') if mail_object.subject
body_part = (mail_object.text_part || mail_object.html_part || mail_object).body.decoded
body = body_part.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '') if body_part
from = mail_object.from.join(",") if mail_object.from #deals with multiple addresses
to = mail_object.to.join(",") if mail_object.to #deals with multiple addresses
That should get all the main pieces into strings / text you can easily work with that won't fail nastily if somethings missing/unusual...etc. Hope that helps somebody...

Soft signs in rails console

I want to create multiple categories via console and I want to be able add soft signs. At this moment I can't do that.
It's very important to project that I can save category names with soft signs.
Can somebody tip me where to search? I searched such tag - soft signs rails.
There wasn't any usefull resource.
Thanks
EDIT
Soft signs in my native language is like this.
Ā,Š,Ē,Ž with that symbol called soft sign abowe the character.
At this moment when I try to save new category record it shows me this kind off error
thodError: undefined methodcache_ancestry!' for #
But I am sure that I didn't change anything in models or controllers :(
What version of Ruby is this? What you're seeing there are either US-ASCII strings with UTF-8 data in them (Ruby 1.9) or byte arrays (Ruby 1.8).
If you're using Ruby 1.8, you may need to use Iconv to convert your encoding from US-ASCII to UTF-8. If you're using Ruby 1.9, then make sure you're creating UTF-8 strings and it should work just fine.
Note that those escape sequences are correct - that is the literal byte array of those characters, assuming the proper encoding is applied, so you may not need to actually change anything. If the bytes are right, everything's fine - you're just seeing ruby interpret the string as ASCII rather than UTF-8 or whatnot.
In Ruby 1.8, when you #inspect a string, you get the escaped version, but putsing it will show you the actual string:
1.8.7 :021 > s = "Komunālās mašīnas"
=> "Komun\304\201l\304\201s ma\305\241\304\253nas"
1.8.7 :022 > puts s
Komunālās mašīnas
In 1.9, you get the correct display all around, so long as your encoding is right:
1.9.3p327 :001 > s = "Komunālās mašīnas"
=> "Komunālās mašīnas"
1.9.3p327 :004 > s.force_encoding "US-ASCII"
=> "Komun\xC4\x81l\xC4\x81s ma\xC5\xA1\xC4\xABnas"
1.9.3p327 :005 > puts s
Komunālās mašīnas
Check this out Edgars:
#encoding: UTF-8
t = 'ŠšÐŽžÀÁÂÃÄAÆAÇÈÉÊËÌÎÑNÒOÓOÔOÕOÖOØOUÚUUÜUÝYÞBßSàaáaâäaaæaçcèéêëìîðñòóôõöùûýýþÿƒ'
fallback = {
'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj','Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A',
'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E', 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I',
'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U',
'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss','à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a',
'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u',
'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y', 'ƒ'=>'f'
}
p t.encode('us-ascii', :fallback => fallback)
See Ruby 1.9.x replace sets of characters with specific cleaned up characters in a string
EDIT:
To get all the characters for your language you will need to add them as desired to the fallback hash. When I run "Komunālās mašīnas" as the variable 't' I get this:
t = "Komunālās mašīnas"
t.encode('us-ascii', :fallback => fallback)
Encoding::UndefinedConversionError: U+0101 from UTF-8 to US-ASCII
You can tell from this where the problem lies by googling U+0101 which shows
http://www.charbase.com/0101-unicode-latin-small-letter-a-with-macron
So now you know which letter is not working and you can add it to the fallback hash like so:
fallback = { OTHER DEFINITIONS , 'ā'=>'a'}
Here's a place to start:
http://www.ascii-codes.com/cp775.html

Force strings to UTF-8 from any encoding

In my rails app I'm working with RSS feeds from all around the world, and some feeds have links that are not in UTF-8. The original feed links are out of my control, and in order to use them in other parts of the app, they need to be in UTF-8.
How can I detect encoding and convert to UTF-8?
Ruby 1.9
"Forcing" an encoding is easy, however it won't convert the characters just change the encoding:
str = str.force_encoding('UTF-8')
str.encoding.name # => 'UTF-8'
If you want to perform a conversion, use encode:
begin
str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
# ...
end
I would definitely read the following post for more information:
http://graysoftinc.com/character-encodings/ruby-19s-string
This will ensure you have the correct encoding and won't error out because it replaces any invalid or undefined character with a blank string.
This will ensure no matter what, that you have a valid UTF-8 string
str.encode(Encoding.find('UTF-8'), {invalid: :replace, undef: :replace, replace: ''})
Iconv
require 'iconv'
i = Iconv.new('UTF-8','LATIN1')
a_with_hat = i.iconv("\xc2")
Summary: the iconv gem does all the work of converting encodings. Make sure it's installed with:
gem install iconv
Now, you need to know what encoding your string is currently in as Ruby 1.8 treats Strings as an array of bytes (with no intrinsic encoding.) For example, say your string was in latin1 and you wanted to convert it to utf-8
require 'iconv'
string_in_utf8_encoding = Iconv.conv("UTF8", "LATIN1", string_in_latin1_encoding)
Only this solution worked for me:
string.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')
Note the binary argument.

How to URL encode a string in Ruby

How do I URI::encode a string like:
\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a
to get it in a format like:
%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A
as per RFC 1738?
Here's what I tried:
irb(main):123:0> URI::encode "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby/1.9.1/uri/common.rb:219:in `gsub'
from /usr/local/lib/ruby/1.9.1/uri/common.rb:219:in `escape'
from /usr/local/lib/ruby/1.9.1/uri/common.rb:505:in `escape'
from (irb):123
from /usr/local/bin/irb:12:in `<main>'
Also:
irb(main):126:0> CGI::escape "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
ArgumentError: invalid byte sequence in UTF-8
from /usr/local/lib/ruby/1.9.1/cgi/util.rb:7:in `gsub'
from /usr/local/lib/ruby/1.9.1/cgi/util.rb:7:in `escape'
from (irb):126
from /usr/local/bin/irb:12:in `<main>'
I looked all about the internet and haven't found a way to do this, although I am almost positive that the other day I did this without any trouble at all.
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".force_encoding('ASCII-8BIT')
puts CGI.escape str
=> "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
Nowadays, you should use ERB::Util.url_encode or CGI.escape. The primary difference between them is their handling of spaces:
>> ERB::Util.url_encode("foo/bar? baz&")
=> "foo%2Fbar%3F%20baz%26"
>> CGI.escape("foo/bar? baz&")
=> "foo%2Fbar%3F+baz%26"
CGI.escape follows the CGI/HTML forms spec and gives you an application/x-www-form-urlencoded string, which requires spaces be escaped to +, whereas ERB::Util.url_encode follows RFC 3986, which requires them to be encoded as %20.
See "What's the difference between URI.escape and CGI.escape?" for more discussion.
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a"
require 'cgi'
CGI.escape(str)
# => "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
Taken from #J-Rou's comment
I was originally trying to escape special characters in a file name only, not on the path, from a full URL string.
ERB::Util.url_encode didn't work for my use:
helper.send(:url_encode, "http://example.com/?a=\11\15")
# => "http%3A%2F%2Fexample.com%2F%3Fa%3D%09%0D"
Based on two answers in "Why is URI.escape() marked as obsolete and where is this REGEXP::UNSAFE constant?", it looks like URI::RFC2396_Parser#escape is better than using URI::Escape#escape. However, they both are behaving the same to me:
URI.escape("http://example.com/?a=\11\15")
# => "http://example.com/?a=%09%0D"
URI::Parser.new.escape("http://example.com/?a=\11\15")
# => "http://example.com/?a=%09%0D"
You can use Addressable::URI gem for that:
require 'addressable/uri'
string = '\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a'
Addressable::URI.encode_component(string, Addressable::URI::CharacterClasses::QUERY)
# "%5Cx12%5Cx34%5Cx56%5Cx78%5Cx9a%5Cxbc%5Cxde%5Cxf1%5Cx23%5Cx45%5Cx67%5Cx89%5Cxab%5Cxcd%5Cxef%5Cx12%5Cx34%5Cx56%5Cx78%5Cx9a"
It uses more modern format, than CGI.escape, for example, it properly encodes space as %20 and not as + sign, you can read more in "The application/x-www-form-urlencoded type" on Wikipedia.
2.1.2 :008 > CGI.escape('Hello, this is me')
=> "Hello%2C+this+is+me"
2.1.2 :009 > Addressable::URI.encode_component('Hello, this is me', Addressable::URI::CharacterClasses::QUERY)
=> "Hello,%20this%20is%20me"
Code:
str = "http://localhost/with spaces and spaces"
encoded = URI::encode(str)
puts encoded
Result:
http://localhost/with%20spaces%20and%20spaces
I created a gem to make URI encoding stuff cleaner to use in your code. It takes care of binary encoding for you.
Run gem install uri-handler, then use:
require 'uri-handler'
str = "\x12\x34\x56\x78\x9a\xbc\xde\xf1\x23\x45\x67\x89\xab\xcd\xef\x12\x34\x56\x78\x9a".to_uri
# => "%124Vx%9A%BC%DE%F1%23Eg%89%AB%CD%EF%124Vx%9A"
It adds the URI conversion functionality into the String class. You can also pass it an argument with the optional encoding string you would like to use. By default it sets to encoding 'binary' if the straight UTF-8 encoding fails.
If you want to "encode" a full URL without having to think about manually splitting it into its different parts, I found the following worked in the same way that I used to use URI.encode:
URI.parse(my_url).to_s

Ruby On Rails and UTF-8

I have an Rails application with SayController, hello action and view template say/hello.html.erb. When I add some cyrillic character like "ю", I get an error:
ArgumentError in SayController#hello
invalid byte sequence in UTF-8
Headers:
{"Cache-Control"=>"no-cache",
"X-Runtime"=>"11",
"Content-Type"=>"text/html; charset=utf-8"}
If I try to write this letter with embedded Ruby,
<%= "ю" %>
I don't get any error, but it displays a question mark in black square (�) instead of this letter.
I use Windows 7 x64, Ruby 1.9.1p378, Rails 2.3.5, WEBrick server.
A likely cause of this error is that the file which contains the cyrillic letters is not encoded in UTF8, but perhaps in some russian encoding like KOI8. This will cause the characters to be impossible to interpret in UTF8 (and rightly so!).
So double check that your file is properly encoded in UTF8.
Create a initializer file (e.g encoding_fix.rb) under your_app/config/initializers with the following content:
Encoding.default_internal = Encoding::UTF_8 if RUBY_VERSION > "1.9"
Encoding.default_external = Encoding::UTF_8 if RUBY_VERSION > "1.9"
This sets the encoding to utf8.

Resources