Isn't user data that comes in from a form in Rails going to be UTF-8 encoded? - ruby-on-rails

A Rails 3.2 app I'm contributing to has a method that coerces user input to UTF-8.
require "iconv"
def normalize(user_input_text)
Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(user_input_text.dup)
end
It basically encodes the string in UTF-8 and ignores characters that can't be transcoded.
But isn't all user data that's entering Rails through a form going to be UTF-8 encoded?
In other words, isn't this code specious and unnecessary?

These resources suggest that indeed you are right.
Now that the vast majority of web input is UTF-8, we set
the inbound parameters to UTF-8. This will eliminate many
cases of incompatible encodings between ASCII-8BIT and
UTF-8.
https://github.com/rails/rails/commit/25215d7285db10e2c04d903f251b791342e4dd6a
Rails 3 solves this very nicely by doing a number of things including interpreting params as UTF-8 and adding workarounds for Internet Explorer
http://jasoncodes.com/posts/ruby19-rails2-encodings

Related

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and corrupting the characters during the read then try using File.binread. File.binread is inherited from IO
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters then you can either find some process to 'uncorrupt' them, which is not fun, or strip them using by re-encoding from ASCII-8bit to UTF-8 stripping out the invalid characters.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
.encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...
(String#scrub does this but since you can't read it in as UTF-8 then you cant use it.)
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in latin1.
File.read uses the standard ruby encoding by default. If LANG contains something like "en_GB.utf8", File.read will associate the string with utf-8 encoding. You can verify this by logging the value of str.encoding (where str is the value of File.read).
File.read does not actually verify the encoding, it only slurps in the bytes and slaps on the encoding (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

Rails 3 dealing with special characters

I want to provide user with ability to fill-in input field with special characters (i.e. ¥ and others).
User input could be saved in xml file and later fetched and rendered back to form input.
What is the best practice of saving special symbols to xml (maybe using html entities or hexadecimal form)?
Thanks for advance.
I'd say if you save the file in utf-8 you will have no problems.
If some controller/view has problems with encoding you have to place this in the first line:
# encoding: utf-8
There's nothing special about them and you can don't need to encode them. Let your XML library deal with that, XML supports unicode ever since, and what you call "special symbols" are just unicode characters.

What is the _snowman param in Ruby on Rails 3 forms for?

In Ruby on Rails 3 (currently using Beta 4), I see that when using the form_tag or form_for helpers there is a hidden field named _snowman with the value of ☃ (Unicode \x9731) showing up.
So, what is this for?
This parameter was added to forms in order to force Internet Explorer (5, 6, 7 and 8) to encode its parameters as unicode.
Specifically, this bug can be triggered if the user switches the browser's encoding to Latin-1. To understand why a user would decide to do something seemingly so crazy, check out this google search. Once the user has put the web-site into Latin-1 mode, if they use characters that can be understood as both Latin-1 and Unicode (for instance, é or ç, common in names), Internet Explorer will encode them in Latin-1.
This means that if a user searches for "Ché Guevara", it will come through incorrectly on the server-side. In Ruby 1.9, this will result in an encoding error when the text inevitably makes its way into the regular expression engine. In Ruby 1.8, it will result in broken results for the user.
By creating a parameter that can only be understood by IE as a unicode character, we are forcing IE to look at the accept-charset attribute, which then tells it to encode all of the characters as UTF-8, even ones that can be encoded in Latin-1.
Keep in mind that in Ruby 1.8, it is extremely trivial to get Latin-1 data into your UTF-8 database (since nothing in the entire stack checks that the bytes that the user sent at any point are valid UTF-8 characters). As a result, it's extremely common for Ruby applications (and PHP applications, etc. etc.) to exhibit this user-facing bug, and therefore extremely common for users to try to change the encoding as a palliative measure.
All that said, when I wrote this patch, I didn't realize that the name of the parameter would ever appear in a user-facing place (it does with forms that use the GET action, such as search forms). Since it does, we will rename this parameter to _e, and use a more innocuous-looking unicode character.
This is here to support Internet Explorer 5 and encourage it to use UTF-8 for its forms.
The commit message seen here details it as follows:
Fix several known web encoding issues:
Specify accept-charset on all forms. All recent browsers, as well as
IE5+, will use the encoding specified
for form parameters
Unfortunately, IE5+ will not look at accept-charset unless at least one
character in the form's values is not
in the page's charset. Since the
user can override the default
charset (which Rails sets to UTF-8),
we provide a hidden input containing
a unicode character, forcing IE to
look at the accept-charset.
Now that the vast majority of web input is UTF-8, we set the inbound
parameters to UTF-8. This will
eliminate many cases of incompatible
encodings between ASCII-8BIT and
UTF-8.
You can safely ignore params[:_snowman]
In short, you can safely ignore this parameter.
Still, I am not sure why we're supporting old technologies like Internet Explorer 5. It seems like a very non-Ruby on Rails decision if you ask me.

Weird charactors on HTML page

i am using Last.fm API to fetch some info of artists .I save info in DB and then display on my webpage.
But characters like “ (double quote) are shown as “ .
Example Artist info http://www.last.fm/music/David+Penn
and i got the first line as " Producer, arranger, dj and musician from Madrid-Spain. He has his own record company “Zen Recordsâ€, and ".
Mine Db is UTF-8 but i dunno why this error is still coming .
This seems to be a character encoding error. Confirm that you are reading the webpage as the correct encoding and are showing the results in the correct encoding.
You should be using UTF-8 all the way through. Check that:
your connection to the database is UTF-8 (using mysql_set_charset);
the pages you're outputting are marked as UTF-8 (<meta http-equiv="Content-Type" content="text/html;charset=utf-8">);
when you output strings from the database, you HTML-encode them using htmlspecialchars() and not htmlentities().
htmlentities HTML-encodes all non-ASCII characters, and by default assumes you are passing it bytes in ISO-8859-1. So if you pass it “ encoded as UTF-8 (bytes 0xE2, 0x80, 0x9C), you'd get “, instead of the expected “ or “. This can be fixed by passing in utf-8 as the optional $charset argument.
However it's usually easier to just use htmlspecialchars() instead, as this leaves non-ASCII characters alone, as raw bytes instead of HTML entity references. This results in a smaller page output, so is preferable as long as you're sure the HTML you're producing will keep its charset information (which you can usually rely on, except in context like sending snippets of HTML in a mail or something).
htmlspecialchars() does have an optional $charset argument too, but setting it to utf-8 is not critical since that results in no change of behaviour over the default ISO-8859-1 charset. If you are producing output in old-school multibyte encodings like Shift-JIS you do have to worry about setting this argument correctly, but today that's quite rare as most sane people use UTF-8 in preference.

How to use send an email with accents using actionmailer

My environment.rb is like this:
ActionMailer::Base.default_charset = "iso-8859-1"
which should be enough for accents, but here is how the message's subject is being sent:
Convite para participação de projeto
Does anyone know what I have to do to fix it?
Is your data in iso-8859-1? From the look of the error example, it seems to be two bytes per character (note the repetition of Ã). Since 8859-1 uses 1 byte per character, my guess is that your data is in utf-8 format.
Also check that your db is not doing any conversions on data going in or out.
I urge you to use unicode / utf-8 everywhere--database, html, emails, etc. It's what all of the kool-kids are using these days. 8859-1 is so last century!
Regarding emails
config.action_mailer.default_charset = "utf-8"
is what I use.
Just remove your setting and let Rails do the work using the default charset, UTF-8.
I work with emails and special characters all the time, there's no need to perform any kind of conversion or setting, at least not on Rails 3. As long as your strings contains the right characters, you'll be fine.
Just make sure the encoding error is not ocurring when reading the data from your database.

Resources