Testing graceful handling of invalid UTF-8 in Rails

I'm currently trying to ensure that my Rails application gracefully handles POST requests containing invalid UTF-8 data (e.g. \xFF), using Rack middleware.
Unfortunately, writing tests for this is proving extremely difficult.
One thing I've tried is using Capybara to fill in one of my form fields with invalid UTF-8 and submit; however, this produces the following in the terminal amongst my test output, and it isn't being printed by Rails:
error : string is not in utf-8
Is there another way to emulate a POST containing invalid UTF-8 data, so I can verify that a 400 error (or similar) is returned?
NB: I'm trying to avoid running against a separate instance of the application (e.g. hitting it with curl); I want to drive the test directly with Capybara (or similar).

You could try posting a file attachment that contains invalid UTF-8 data, although whether that works depends on your form. Alternatively, since you're testing a rather obscure edge case, you could create a form that is only accessible in the dev/test environments, with its route likewise only available for testing.
This would at least allow you to target the code that handles the invalid UTF-8 in a safe, test-only way; a sketch of the route half follows.
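A minimal sketch of that idea (MyApp and utf8_probes#create are hypothetical names):

# config/routes.rb -- expose the probe endpoint outside production only
MyApp::Application.routes.draw do
  if Rails.env.test? || Rails.env.development?
    post "/utf8_probe" => "utf8_probes#create"
  end
end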

What do you mean by filling the form with invalid UTF-8? The characters you type into a form don't have any encoding; they are encoded when the form is submitted. Your sentence would make sense for encodings that cannot represent every character, but UTF-8 can encode them all.
If you want to send the byte \xFF to the server from a browser, it's as easy as opening the browser's developer tools, editing the form's attributes to accept-charset="ISO-8859-1", writing ÿ somewhere in the form, and pressing send. The ÿ gets encoded as %FF, which cannot be decoded as UTF-8.
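You can see in plain Ruby why that byte then trips the UTF-8 decoder:

# "ÿ" in ISO-8859-1 is the single byte 0xFF, which is never valid in UTF-8
latin1 = "ÿ".encode("ISO-8859-1")
latin1.bytes.to_a                               # => [255]
latin1.force_encoding("UTF-8").valid_encoding?  # => false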

As detailed in this ThoughtBot blog post, this worked for me in a unit test:
"hello joel\255".force_encoding('UTF-8')
I'm not sure how you'd convince Capybara to do this, though.
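One way to use that string without Capybara is to drive the middleware directly with Rack::MockRequest; a sketch, where ValidateUtf8 stands in for whatever your middleware class is called:

require 'rack/mock'

describe ValidateUtf8 do
  it "rejects a body that is not valid UTF-8" do
    inner = lambda { |env| [200, {"Content-Type" => "text/plain"}, ["ok"]] }
    app = ValidateUtf8.new(inner)
    response = Rack::MockRequest.new(app).post(
      "/users",
      :input => "user[name]=Test%FF",
      "CONTENT_TYPE" => "application/x-www-form-urlencoded"
    )
    expect(response.status).to eq(400)
  end
end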

You should be able to do this by constructing a POST with rack-test instead of going through Capybara.
For example, the following request spec (spec/requests/utf8_spec.rb):
require 'spec_helper'

describe "Invalid UTF-8" do
  it "handles invalid UTF-8 in post data gracefully" do
    post "/users", :user => {:name => "Test \xFF boom"}
  end
end
Produces the following when run:
1) Invalid UTF-8 handles invalid UTF-8 in post data gracefully
   Failure/Error: post "/users", :user => {:name => "Test \xFF boom"}
   ArgumentError:
     invalid byte sequence in UTF-8
   # ./app/controllers/users_controller.rb:17:in `create'
   # ./spec/requests/utf8_spec.rb:5:in `block (2 levels) in <top (required)>'
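Once the middleware is wired in, the same spec can assert on the status instead of blowing up; a sketch, assuming your middleware maps the error to a 400:

it "handles invalid UTF-8 in post data gracefully" do
  post "/users", :user => {:name => "Test \xFF boom"}
  expect(response.status).to eq(400)
end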

Related

Rails uses different encoding for 'puts' vs 'return'

I have a Rails method that ends like this:
puts encrypted
return encrypted
Console outputs:
#?$???z???e7Bw?1I?F???????s?w
=> "#\x9A$\xB1\xBA\xF4z\x8F\x97\xECe\a7Bw\xE01I\xEDF\xA6\xBE\xEA\xE0\xFC\xF6\xB9\x1Cs\x00\xC0w\x14"
Why do these results look different when output in the same place and without any encode/decode instructions?
How can I get Rails to output the longer version when I call puts encrypted?
IRB calls inspect on a value before displaying it, in order to expose bytes that would otherwise print as just ? (or worse) in situations like this.
Try p encrypted, then try puts encrypted.inspect.
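A quick illustration of the difference:

encrypted = "#\x9A$\xB1".force_encoding("ASCII-8BIT")

puts encrypted          # writes the raw bytes; the terminal renders them as #?$? or worse
p encrypted             # => "#\x9A$\xB1" (inspect escapes the unprintable bytes)
puts encrypted.inspect  # same output as p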

Ruby 2.0.0 String#match ArgumentError: invalid byte sequence in UTF-8

I see this a lot and haven't figured out a graceful solution. If user input contains invalid byte sequences, I need it not to raise an exception. For example:
# @raw_response comes from the user and contains invalid UTF-8,
# for example: @raw_response = "\xBF"
regex.match(@raw_response)
ArgumentError: invalid byte sequence in UTF-8
Numerous similar questions have been asked, and the answer generally appears to be to encode or force-encode the string. Neither of these works for me, however:
regex.match(@raw_response.force_encoding("UTF-8"))
ArgumentError: invalid byte sequence in UTF-8
or
regex.match(@raw_response.encode("UTF-8", :invalid => :replace, :replace => "?"))
ArgumentError: invalid byte sequence in UTF-8
Is this a bug with Ruby 2.0.0 or am I missing something?
What is strange is that it appears to encode correctly, yet match still raises an exception:
@raw_response.encode("UTF-8", :invalid => :replace, :replace => "?").encoding
=> #<Encoding:UTF-8>
In Ruby 2.0 the encode method is a no-op when encoding a string to its current encoding:
Please note that conversion from an encoding enc to the same encoding enc is a no-op, i.e. the receiver is returned without any changes, and no exceptions are raised, even if there are invalid bytes.
This changed in 2.1, which also added the scrub method as an easier way to do this.
If you are unable to upgrade to 2.1, you’ll have to encode into a different encoding and back in order to remove invalid bytes, something like:
if ! s.valid_encoding?
  s = s.encode("UTF-16be", :invalid => :replace, :replace => "?").encode('UTF-8')
end
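On Ruby 2.1+, String#scrub does the same cleanup in one step:

s = "Test \xBF boom"
s.valid_encoding?  # => false
s.scrub("?")       # => "Test ? boom"
s.scrub            # => "Test \uFFFD boom" (default Unicode replacement character)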
Since you're using Rails and not just Ruby, you can also use tidy_bytes. This works with Ruby 2.0, and it will probably give you back sensible data instead of just replacement characters.
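A sketch of the tidy_bytes route (the exact API varies a little between Rails versions):

# tidy_bytes guesses at the intended bytes (e.g. CP-1252 mojibake) instead of discarding them
ActiveSupport::Multibyte::Unicode.tidy_bytes("Test \xBF boom")
# or via the Chars proxy:
"Test \xBF boom".mb_chars.tidy_bytes.to_s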

UTF-8 and ActionMailer

I'm facing some problems with UTF-8 and ActionMailer. My application has a contact form that sends me an email when submitted. The problem is that when somebody enters characters like öäüß, I receive the message encoded like, for example:
=?UTF-8?Q?funktioniert_oder_nicht.=0D=0A=0D=0Ameine_Stra=C3=9Fe_ist_die?=
=?UTF-8?Q?_Bratwurststra=C3=9Fe=0D=0A=0D=0A=C3=B6=C3=A4?=
As I understand it, ActionMailer is UTF-8 ready by default. Analyzing my server's log, the params are already well encoded when the form is submitted (i.e. I can read the äüö in my log).
Any idea what I should change? Should I change my application to support ISO-8859-1?
Environment: Ruby 1.9 and Rails 3.1.
You are getting the UTF-8 bytes escaped with quoted-printable; this is the standard MIME encoded-word format for non-ASCII mail headers, and any mail client decodes it transparently.
ß is "\xC3\x9F" -> "=C3=9F"
String#unpack('M') will decode it:
$ ruby -e 'puts "Bratwurststra=C3=9Fe=0D=0A=0D=0A=C3=B6=C3=A4".unpack "M"'
Bratwurststraße
öä
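If you want to decode a whole RFC 2047 encoded-word, charset marker included, the mail gem that ActionMailer is built on can do it; a sketch:

require 'mail'

Mail::Encodings.value_decode("=?UTF-8?Q?_Bratwurststra=C3=9Fe?=")
# => " Bratwurststraße"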

Decoding a simple JSON string in Rails raises an error

I'm trying to round-trip encode/decode plain strings in JSON, but I'm getting an error.
In Rails 2.3 with Ruby 1.8.6, it used to work:
>> puts ActiveSupport::JSON.decode("abc".to_json)
abc
=> nil
In Rails 3.1 beta1 with Ruby 1.9.2, it raises an error:
ruby-1.9.2-p180 :001 > puts ActiveSupport::JSON.decode("abc".to_json)
MultiJson::DecodeError: 706: unexpected token at '"abc"'
from /home/stevenh/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/json/common.rb:147:in `parse'
from /home/stevenh/.rvm/rubies/ruby-1.9.2-p180/lib/ruby/1.9.1/json/common.rb:147:in `parse'
from /home/stevenh/.rvm/gems/ruby-1.9.2-p180/gems/multi_json-1.0.1/lib/multi_json/engines/json_gem.rb:13:in `decode'
[...]
This is pretty much the same question discussed in nil.to_json cannot be parsed back to nil?
But nil used to work in 2.3/1.8.7 as well:
puts ActiveSupport::JSON.decode(nil.to_json)
nil
Is this the new normal?
This change occurred with the switch from ActiveSupport's JSON backend to MultiJson, which was included in Rails 3.1.0.rc1. Per the MultiJson project team, the current behavior is correct and the previous implementation was faulty, due to RFC 4627's specification of the JSON grammar:
2. JSON Grammar
A JSON text is a sequence of tokens. The set of tokens includes six
structural characters, strings, numbers, and three literal names.
A JSON text is a serialized object or array.
JSON-text = object / array
As neither "abc" nor "\"abc\"" is a serialized object or array, an error when attempting to decode them is appropriate.
The diagrams from the JSON website reinforce this specification.
That being said, this would seem to imply a bug in the to_json implementation that results in:
ruby-1.9.2-p180 :001 > "abc".to_json
=> "\"abc\""
Yes, what is happening in Rails 3 is the new normal; the changes you illustrate reflect a maturing framework.
Methods named encode and decode should be expected to be fully compliant with the JSON spec, and to be inverses of one another.
String#to_json, on the other hand, is a convenience for building more complex JSON values; it is presumably used internally (within ActiveSupport) when Array#to_json or Hash#to_json encounter a String among their constituent elements.
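If you control both sides of the round trip, one workaround that stays within the spec is to wrap scalars so the serialized form is a JSON array (a sketch, not from the original answers):

require 'json'

payload = ["abc"].to_json    # => "[\"abc\"]" -- a valid JSON text
JSON.parse(payload).first    # => "abc"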
If you need to restore the old behavior, switch to the yajl-ruby backend:
# in your Gemfile
gem 'yajl-ruby'

# in your application.rb
require 'yajl/json_gem'
After those steps:
Loading development environment (Rails 3.2.8)
[1] pry(main)> puts ActiveSupport::JSON.decode("abc".to_json)
abc
=> nil
[2] pry(main)> puts ActiveSupport::JSON.decode(nil.to_json)
=> nil

Rails 3, heroku - PGError: ERROR: invalid byte sequence for encoding "UTF8":

I just got this strange error, seemingly at random, from Rails 3 on Heroku (Postgres):
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0x85 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". : INSERT INTO "comments" ("content") VALUES ('BTW∑I re-listened to the video' ......
The hint, while nice, isn't making anything click for me. Can I set the encoding somewhere? Should I even mess with that? Has anyone seen this, and/or does anyone have ideas on how to deal with this type of issue?
Thank you
From what I can gather, this is a problem where the string you're trying to insert into your PostgreSQL server isn't encoded as UTF-8. This is somewhat odd, because your Rails app should be configured to use UTF-8 by default.
There are a couple of ways you can try to fix this (in order of what I recommend):
Firstly, make sure that config.encoding is set to "utf-8" in config/application.rb.
If you're using Ruby 1.9, you can try to force the character encoding prior to insertion with toutf8 (String#toutf8, from the kconv standard library).
You can figure out what encoding your string is in, and manually run SET CLIENT_ENCODING TO 'ISO-8859-1'; (or whatever the encoding is) on your PostgreSQL connection before inserting the string. Don't forget to RESET CLIENT_ENCODING; after the statement to reset the encoding; see the sketch after this list.
If you're using Ruby 1.8 (which is more likely), you can use the iconv library to convert the string to UTF-8. See the documentation here.
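A sketch of the client_encoding approach through ActiveRecord, assuming the incoming string really is Latin-1:

conn = ActiveRecord::Base.connection
begin
  conn.execute("SET CLIENT_ENCODING TO 'ISO-8859-1';")
  comment.save!  # the INSERT now declares Latin-1 to the server
ensure
  conn.execute("RESET CLIENT_ENCODING;")
end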
A more hackish solution is to override the getter and setter in the model (i.e. content and content=) to encode and decode your string with Base64. It'd look something like this:
require 'base64'

class Comment
  def content
    Base64::decode64(self[:content])
  end

  def content=(value)
    self[:content] = Base64::encode64(value)
  end
end
From the blog post linked below, re-encode the incoming text using the charset declared by the sender:
text.force_encoding(charset).encode("UTF-8")
http://blog.zenlike.me/2013/04/06/sendgrid-parse-incoming-email-encoding-errors-for-rails-apps-using-postgresql/
