JSON encoding/decoding with unicode in rails - ruby-on-rails

I upgraded downgraded to rails 2.3.17 due to the security bugs, but now I can't decode json strings that I have saved down to a DB if they have unicode in them :(. Is there a way to process the string such that it decodes properly?
e = ActiveSupport::JSON.encode({'a' => "Hello Unicode \u2019"})
ActiveSupport::JSON.decode(e)
gives me
RangeError: 8217 out of char range
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/backends/okjson.rb:314:in `unquote'
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/backends/okjson.rb:251:in `strtok'
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/backends/okjson.rb:215:in `tok'
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/backends/okjson.rb:178:in `lex'
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/backends/okjson.rb:46:in `decode'
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/backends/okjson.rb:612:in `decode'
from /app/vendor/bundle/ruby/1.9.1/gems/activesupport-2.3.17/lib/active_support/json/decoding.rb:14:in `decode'
from (irb):30
from /usr/local/bin/irb:12:in `<main>'
I can't change the first line since it's coming from the DB like that.
This used to work.

You can change the backend JSON provider in ActiveSupport.
Add ActiveSupport::JSON.backend = "JSONGem" into an application initialiser (I added it to application.rb). This fixed the unicode parsing issues I had after I upgraded activesupport to 3.0.20.
See the vulnerability notice which caused this update - It mentions that this workaround should apply to 2.3.16 as well.
From rails console:
> ActiveSupport::VERSION::STRING
=> "3.0.20"
> ActiveSupport::JSON.decode('{"test":"string\u2019"}')
RangeError: 8217 out of char range
> ActiveSupport::JSON.backend = "JSONGem"
> ActiveSupport::JSON.decode('{"test":"string\u2019"}')
 => {"test"=>"string’"}

The JSON gem will handle this correctly.
As a note, the gem is much more strict than the other JSON parsers out there. For example:
{ 'test' : 'value' }
This is not valid JSON even though it looks okay.
For whatever reason the non-UTF-8 savvy JSON parser shipped as part of the 2.3.16 patch which is really sloppy on the part of the maintainer.

Switch to 2.3.15 which should be fine because that's when the fixes landed.
Curse the developer who started this project in rails
Begin work on porting to python post haste

Related

XML parsing in Ruby

I am using a REXML Ruby parser to parse an XML file. But on a 64 bit AIX box with 64 bit Ruby, I am getting the following error:
REXML::ParseException: #<REXML::ParseException: #<RegexpError: Stack overflow in
regexp matcher:
/^<((?>(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*))\s*((?>\s+(?:[\w:][\-\w\d.]*:)?[\w:][\-\w\d.]*\s*=\s*(["']).*?\3)*)\s*(\/)?>/mu>
The call for the same is something like this:
REXML::Document.new(File.open(actual_file_name, "r"))
Does anyone have an idea regarding how to solve this issue?
I've had several issues for REXML, it doesn't seem to be the most mature library. Usually I use Nokogiri for Ruby XML parsing stuff, it should be faster and more stable than REXML. After installing it with sudo gem install nokogiri, you can use something like this to get a DOM instance:
doc = Nokogiri.XML(File.open(actual_file_name, 'rb'))
# => #<Nokogiri::XML::Document:0xf1de34 name="document" [...] >
The documentation on the official webpage is also much better than that of REXML, IMHO.
I almost immediately found the answer.
The first thing I did was to search in the ruby source code for the error being thrown.
I found that regex.h was responsible for this.
In regex.h, the code flow is something like this:
/* Maximum number of duplicates an interval can allow. */
#ifndef RE_DUP_MAX
#define RE_DUP_MAX ((1 << 15) - 1)
#endif
Now the problem here is RE_DUP_MAX. On AIX box, the same constant has been defined somewhere in /usr/include.
I searched for it and found in
/usr/include/NLregexp.h
/usr/include/sys/limits.h
/usr/include/unistd.h
I am not sure which of the three is being used(most probably NLregexp.h).
In these headers, the value of RE_DUP_MAX has been set to 255! So there is a cap placed on the number of repetitions of a regex!
In short, the reason is the compilation taking the system defined value than that we define in regex.h!
This also answers my question which i had asked recently:
Regex limit in ruby 64 bit aix compilation
I was not able to answer it immediately as i need to have min of 100 reputation :D :D
Cheers!

iconv deprecation warning with ruby 1.9.3

I'm getting this warning when I run rspec:
/gems/activesupport-3.1.0/lib/active_support/dependencies.rb:240:in `block in require': iconv will be deprecated in the future, use String#encode instead.
I get the same warning with rails 3.1.0, 3.1.1, 3.1.2.rc2 versions. Seems it's related to sqlite3 gem, but I'm not sure. There are no warnings with ruby 1.9.2
Any suggestions how to deal with it?
You are getting this deprecation notice cause a library somewhere is requiring iconv.
iconv is a gem created by Matz that can be used to convert strings from one format to another.
For example this is often used:
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', content) this little bit of magic takes a UTF-8 string that may have invalid chars and converts it to a proper UTF-8 string.
It has been decided that in Ruby 1.9.3 we should not be using iconv any more and instead use the built-in String#encode. encode is more powerful and allows you more flexibility.
The theory is that the above example could be replaced with:
string.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => "?")
In practice it seems this is imperfect.
This also leads to a less than easy story for gem creators who wish to support 1.8:
content = RUBY_VERSION.to_f < 1.9 ?
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "content") :
"#{content}".encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '')
So, you have a gem somewhere that is requiring iconv, to find it:
Assuming your error message is: /gems/activesupport-3.1.0/lib/active_support/dependencies.rb:240
Open up /gems/activesupport-3.1.0/lib/active_support/dependencies.rb on line 240:
Add the line:
p caller if file =~ /iconv/
(just after: load_dependency(file) { result = super })
You will get a big fat stack trace:
rake --tasks
/home/sam/.rvm/gems/ruby-1.9.3-p125/gems/activesupport-3.2.6/lib/active_support/dependencies.rb:251:in `block in require': iconv will be deprecated in the future, use String#encode instead.
["/home/sam/.rvm/gems/ruby-1.9.3-p125/gems/calais-0.0.13/lib/calais.rb:5:in `'",
.. more omitted ..
This tells me it is the calais gem. Looking through pull requests, I am not the first. The pull has not been yanked in.
Depending on the gem, there may be an upgraded version that does not have this error, so I would recommend you upgrade your gems first. If you are unlucky you may be stuck with the unfortunate task of forking a gem to get rid of this (if for example your pull request to fix it languishes)
If you're seeing this, it's very probably not Rails. If you look at the method surrounding the line being referred to in the error you posted, you'll see the following:
def require(file, *)
result = false
load_dependency(file) { result = super }
result
end
I'm not saying it's your code, necessarily, but I'm certain that it's not actually the line in question where iconv is being called. In my case, I found that my project's code actually contained a reference to iconv.
If you want to check your code for such a reference, try grep -ir iconv ./ in your project directory.
When iconv is actually in a library it can be harder to find. By temporarily changing the above method to:
def require(file, *)
result = false
puts
puts caller.reverse
load_dependency(file) { result = super }
result
end
You can then easily run your code and grep out the relevant lines of the backtrace to find the root cause of the warning.
ruby your/code.rb 2>&1 | grep -B 5 iconv
Add this to the start of your program:
oldverb = $VERBOSE; $VERBOSE = nil
require 'iconv'
$VERBOSE = oldverb
and curse the people who think this is a professional way to handle deprecation.
You can pin down the exact location of the warning by generating exceptions for ActiveSupport::Deprecation, instead of just printing to the log. At the top of application.rb:
ActiveSupport::Deprecation.behavior = Proc.new do |message, backtrace|
raise message
end
Once you've figured out where the warning is coming from (by inspecting the full backtrace), remove this again.
To remove this warning...
go to your .rvm directory and find iconv.c (mine was at ~/.rvm/src/ruby-1.9.3-p125/ext/iconv/iconv.c)
edit that file are remove or comment out the call to warn_deprecated() (should be near the bottom)
from that file's directory, run ruby extconf.rb
then make
then make install
Should do the trick

Rails 3, Heroku: Taps Server Error: PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xba

I have a Rails 3.0.9 application running both locally in my dev env and remotely on a heroku app. I have a method that imports a CSV file into a model, and this file can contain non-english characters, like °,á,é,í, etc (it's in spanish).
I am currently able to import the complete file (75k records) without any problems in my local dev (SQLite) database; but, when uploading the db to heroku with heroku db:push, it fails with the error I'm posting in the title:
!!! Caught Server Exception
HTTP CODE: 500
Taps Server Error: PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xba
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
Apparently, Heroku has issues inserting the '°' character. (At the moment the file doesn't have any á,é,í, etc characters, but I suspect these might fail too.)
I have set in my application.rb file the default encoding, as follows:
#.../application.rb
config.encoding = "utf-8"
What else can I do to set the 'client encoding' and solve this problem?
The numero sign, º, is 0xBA in ISO-8869-1 not UTF-8. So your CSV file is encoded with Latin-1 but you're trying to store it in your database as UTF-8 without fixing the encoding.
You can try telling your CSV library that it is dealing with Latin-1 encoded text and maybe it will take care of converting to UTF-8. If that doesn't work, then you can do it yourself with Iconv:
ruby-1.9.2 > Iconv.iconv('UTF-8', 'ISO-8859-1', "\xba")
=> ["º"]
ruby-1.9.2 > Iconv.iconv('UTF-8', 'ISO-8859-1', "\xb0")
=> ["°"]
You're not having trouble with SQLite because SQLite tends be very forgiving and it has a very loose type system. PostgreSQL, OTOH, tends to be rather strict and properly complains if you try to feed it invalid data. I'd recommend that you stop developing on top of SQLite if you're going to be deploying to Heroku and PostgreSQL, there are other differences that will cause problems (the behavior of GROUP BY and LIKE for example).

ActiveSupport::JSON.decode does not properly handle literal line breaks

Is this the expected behavior? Note how the line break character gets lost.
ruby-1.9.2-p136 :001 > ActiveSupport::JSON.decode("{\"content\": \"active\n\nsupport\"}")
=> {"content"=>"active\nsupport"}
The same happens with unicode-escaped line breaks:
ruby-1.9.2-p136 :002 > ActiveSupport::JSON.decode("{\"content\": \"active\u000a\u000asupport\"}")
=> {"content"=>"active\nsupport"}
I'm using rails 3.0.3.
I eventually came across this ticket: https://rails.lighthouseapp.com/projects/8994/tickets/3479-activesupportjson-fails-to-decode-unicode-escaped-newline-and-literal-newlines
It seems this is a bug in ActiveSupport that will be fixed in Rails 3.0.5. For now I have patched activesupport and things are working as expected.
ruby-1.9.2-p136 :001 > ActiveSupport::JSON.decode("{\"content\": \"active\n\nsupport\"}")
=> {"content"=>"active\n\nsupport"}
To represent a newline in JSON data using double quotes, you must escape the newline:
ActiveSupport::JSON.decode("{\"content\": \"active\\n\\nsupport\"}")
Otherwise, you are inserting a newline into the JSON source and not the JSON data. Note that this would also work:
ActiveSupport::JSON.decode('{"content": "active\n\nsupport"}')
By using single quotes, you are no longer inserting a literal newline into the source of the JSON.
It is interesting to note the way ActiveSupport handles this by default (the default JSON backend is ActiveSupport::JSON::Backends::Yaml). By installing the json gem and changing the JSON backend to it (ActiveSupport::JSON.backend = 'JSONGem') and attempting to decode the same text (ActiveSupport::JSON.decode("{\"content\": \"active\\n\\nsupport\"}")) you get the following:
JSON::ParserError: 737: unexpected token at '{"content": "active
support"}'

invalid multibyte escape after upgrade to rails 3 and ruby 1.9.2 -- dtext = '[^\\x80]'

I am upgrading my app from rails 2 to 3 and when i 'require' this file that has an email address validator i get an 'invalid multibyte escape' error with:
dtext = '[^\\\\x80]'
pattern = /\A#{dtext}\z/
Any thoughts?
Try using:
pattern = /\A#{dtext}\z/, nil, 'n'
Check out details on encodings and regexp for more.
And I use and recommend this awesome article on encodings in Ruby.
Modify the rfc822.rb file and change the addr_spec line to the following:
addr_spec = Regexp.new("#{local_part}\\x40#{domain}", nil, 'n')
That should resolve the issue. I got the solution from another gem, see https://github.com/saepia/rfc822/blob/master/lib/rfc822.rb

Resources