Character encoding, how do I tell the difference? - ruby-on-rails

Characters coming out of my database are encoded differently than the same characters written directly in the source. For exmaple, the word Permissões shows a different result when the string is written directly in the HTML, than when the string is output from a db record.
# From the source
Addressable::URI.encode("Permissões.pdf") #=> "Permiss%C3%B5es.pdf"
# From the db
Addressable::URI.encode("Permissões.pdf") #=> "Permisso%CC%83es.pdf"
The encodings are different. But my database is set to UTF-8, and I am using HTML5. What could be causing this?
I am unable to download files I upload to S3 because of this issue. I tried to force the encoding attachment.path.encode("UTF-8") but that makes no diffrence.

To solve this, since I am using Rails, I used ActiveSupport::Multibyte::Unicode to normalize any unicode characters before they get inserted into the database.
before_save do
self.path = ActiveSupport::Multibyte::Unicode.normalize(path)
end

Related

ActiveStorage CSV file force encoding?

I have a CSV file that I'm uploading which runs into an issue when importing rows into the database:
Encoding::UndefinedConversionError ("\xCC" from ASCII-8BIT to UTF-8)
What would be the most efficient way to ensure each column is properly encoded for being placed in the database or ignored?
The most basic approach is to go through each row and each field and force encoding on the string but that seems incredibly inefficient. What would be a better way to handle this?
Currently it's just uploaded as a parameter (:csv_file). I then access it as follows:
CSV.parse(csv_file.download) within the model.
I'm assuming there's a way to force the encoding when CSV.parse is called on the activestorage file but not sure how. Any ideas?
Thanks!
The latest version ActiveStorage (6.0.0.rc1) adds an API to be able to download the file to a temp file, which you can then read from. I'm assuming that Ruby will read from the file using the correct encoding.
https://edgeguides.rubyonrails.org/active_storage_overview.html#downloading-files
If you don't want to upgrade to the RC of Rails 6 (like I don't) you can use this method to convert the string to UTF-8 while getting rid of the byte order mark that may be present in your file:
wrongly_encoded_string = active_record_model.attachment.download
correctly_encoded_string = wrongly_encoded_string.bytes.pack("c*").force_encoding("UTF-8")

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and corrupting the characters during the read then try using File.binread. File.binread is inherited from IO
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters then you can either find some process to 'uncorrupt' them, which is not fun, or strip them using by re-encoding from ASCII-8bit to UTF-8 stripping out the invalid characters.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
.encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...
(String#scrub does this but since you can't read it in as UTF-8 then you cant use it.)
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in latin1.
File.read uses the standard ruby encoding by default. If LANG contains something like "en_GB.utf8", File.read will associate the string with utf-8 encoding. You can verify this by logging the value of str.encoding (where str is the value of File.read).
File.read does not actually verify the encoding, it only slurps in the bytes and slaps on the encoding (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

Rails 3 dealing with special characters

I want to provide user with ability to fill-in input field with special characters (i.e. ¥ and others).
User input could be saved in xml file and later fetched and rendered back to form input.
What is the best practice of saving special symbols to xml (maybe using html entities or hexadecimal form)?
Thanks for advance.
I'd say if you save the file in utf-8 you will have no problems.
If some controller/view has problems with encoding you have to place this in the first line:
# encoding: utf-8
There's nothing special about them and you can don't need to encode them. Let your XML library deal with that, XML supports unicode ever since, and what you call "special symbols" are just unicode characters.

Problem with cyrillic characters in Ruby on Rails

In my rails app I work a lot with cyrillic characters. Thats no problem, I store them in the db, I can display it in html.
But I have a problem exporting them in a plain txt file. A string like "элиас" gets "—ç–ª–∏–∞—Å" if I let rails put in in a txt file and download it. Whats wrong here? What has to be done?
Regards,
Elias
Obviously, there's a problem with your encoding. Make sure you text is in Unicode before writing it to the text file. You may use something like this:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
your_unicode_text = ic.iconv(your_text + ' ')[0..-2]
Also, double check that your database encoding is UTF-8. Cyrillic characters can display fine in DB and in html with non-unicode encoding, e.g. KOI8-RU, but you're guaranteed to have problems with them elsewhere.

How to use send an email with accents using actionmailer

My environment.rb is like this:
ActionMailer::Base.default_charset = "iso-8859-1"
which should be enough for accents, but here is how the message's subject is being sent:
Convite para participação de projeto
Does anyone know what I have to do to fix it?
Is your data in iso-8859-1? From the look of the error example, it seems to be two bytes per character (note the repetition of Ã). Since 8859-1 uses 1 byte per character, my guess is that your data is in utf-8 format.
Also check that your db is not doing any conversions on data going in or out.
I urge you to use unicode / utf-8 everywhere--database, html, emails, etc. It's what all of the kool-kids are using these days. 8859-1 is so last century!
Regarding emails
config.action_mailer.default_charset = "utf-8"
is what I use.
Just remove your setting and let Rails do the work using the default charset, UTF-8.
I work with emails and special characters all the time, there's no need to perform any kind of conversion or setting, at least not on Rails 3. As long as your strings contains the right characters, you'll be fine.
Just make sure the encoding error is not ocurring when reading the data from your database.

Resources