Rails - Saving Mail Attachment in a Postgres DB results in PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xa0

Has anyone seen this error before?
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xa0
I'm trying to save incoming mail attachments, of any file type, to the database for processing.
Any ideas?

What type of column are you saving your data to? If the attachment could be of any type, you need a bytea column to ensure that the data is simply passed through as a blob (binary "large" object). As mentioned in other answers, that error indicates that some data sent to PostgreSQL that was tagged as being text in UTF-8 encoding was invalid.
I'd recommend you store email attachments as binary along with their MIME content-type header. The Content-Type header should include the character encoding needed to convert the binary content to text for attachments where that makes sense: e.g. "text/plain; charset=iso-8859-1".
If you want the decoded text available in the database, you can have the application decode it and store the textual content, maybe having an extra column for the decoded version. That's useful if you want to use PostgreSQL's full-text indexing on email attachments, for example. However, if you just want to store them in the database for later retrieval as-is, just store them as binary and leave worrying about text encoding to the application.
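A minimal Ruby sketch of the "binary plus Content-Type" idea, outside of Rails and without a database. StoredAttachment, charset and decoded_text are hypothetical names made up for illustration, and the regular expression is a simplification of real MIME header parsing:

```ruby
# Hypothetical value object: the raw attachment bytes plus the MIME
# Content-Type header they arrived with.
StoredAttachment = Struct.new(:content_type, :data) do
  # Returns the charset named in the header, or nil when none is given.
  def charset
    content_type.to_s[/charset="?([^";\s]+)"?/i, 1]
  end

  # Returns UTF-8 text for attachments that declare a charset, else nil.
  # The stored bytes themselves are never modified.
  def decoded_text
    return nil unless charset
    data.dup.force_encoding(charset).encode(Encoding::UTF_8)
  end
end

note = StoredAttachment.new("text/plain; charset=iso-8859-1", "caf\xE9".b)
puts note.decoded_text           # text version, e.g. for a full-text column
puts note.data.encoding          # the original stays binary (ASCII-8BIT)
```

An unknown charset name would raise ArgumentError in force_encoding, so real code would want to rescue that and fall back to storing the binary only.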

The 0xa0 is a non-breaking space, possibly latin1 encoding. In Python I'd use str.decode() and str.encode() to change it from its current encoding to the target encoding, here 'utf8'. But I don't know how you'd go about it in Rails.
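For what it's worth, the Ruby counterpart of Python's decode/encode pair is String#force_encoding followed by String#encode. A small sketch, assuming the bytes really are Latin-1 (latin1_to_utf8 is a hypothetical helper name):

```ruby
# Label the bytes as ISO-8859-1 (no bytes change), then transcode to UTF-8.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

raw = "10\xA0kg".b   # 0xA0 is a non-breaking space in Latin-1

# A lone 0xA0 byte is not a valid UTF-8 sequence, which is what PostgreSQL
# complains about.
puts raw.dup.force_encoding("UTF-8").valid_encoding?

converted = latin1_to_utf8(raw)
puts converted.valid_encoding?
# The non-breaking space is now the two-byte UTF-8 sequence C2 A0.
p converted.bytes.map { |b| format("%02x", b) }
```

This only helps for attachments that are actually text; arbitrary binary files should not be transcoded at all.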

I do not know about Rails, but when PG gives this error message it means that:
the connection between Postgres and your Rails client is correctly configured to use UTF-8 encoding, meaning that all text data going between the client and Postgres must be encoded in UTF-8;
and your Rails client erroneously sent some data in another encoding (most probably Latin-1, i.e. ISO-8859-1), so Postgres rejects it.
You must look into your client code where the data is inserted into the database: you are probably trying to insert a non-Unicode string, or some improper transcoding is taking place.

Related

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and is corrupting the characters during the read, then try using File.binread, which File inherits from IO:
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters then you can either find some process to 'uncorrupt' them, which is not fun, or get rid of them by re-encoding from ASCII-8BIT to UTF-8 with the invalid characters replaced (add replace: '' to the options if you want them removed rather than replaced by a placeholder).
...
attachments['attachment.txt'] = File.binread('/path/to/file')
.encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...
(String#scrub, available since Ruby 2.1, does the same job for a string that is tagged as UTF-8 but contains invalid bytes.)
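To make the difference between replacing and stripping concrete, here is a small sketch of the re-encoding approach. replace_invalid and strip_invalid are hypothetical helper names; the encode options themselves are standard String#encode options:

```ruby
# Re-encode binary data as UTF-8, substituting a placeholder for every byte
# that has no UTF-8 equivalent (the default placeholder is U+FFFD).
def replace_invalid(bytes)
  bytes.encode("UTF-8", "binary", invalid: :replace, undef: :replace)
end

# Same conversion, but drop the offending bytes entirely.
def strip_invalid(bytes)
  bytes.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: "")
end

raw = "price:\xA0100".b

p replace_invalid(raw)   # keeps a visible marker where the byte was
p strip_invalid(raw)     # the byte is gone
```

Note that both variants treat every byte above 0x7F as invalid, so a file that contains genuine UTF-8 multibyte characters would lose those too.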
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in latin1.
File.read uses Ruby's default external encoding (Encoding.default_external) unless told otherwise. If LANG contains something like "en_GB.utf8", File.read will associate the string with UTF-8 encoding. You can verify this by logging the value of str.encoding (where str is the value of File.read).
File.read does not actually verify the encoding, it only slurps in the bytes and slaps on the encoding (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
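The "slaps on the encoding" behaviour is easy to see in isolation. A rough sketch, assuming the file is Latin-1; with_latin1_file and read_latin1_as_utf8 are hypothetical helpers, and the UTF-8 label is passed explicitly so the result does not depend on LANG:

```ruby
require "tempfile"

# Writes a few Latin-1 bytes to a temp file and yields its path.
def with_latin1_file
  Tempfile.create("latin1") do |file|
    file.binmode
    file.write("caf\xE9\xA0au lait".b)
    file.flush
    yield file.path
  end
end

# Reads the file with the correct source encoding, then transcodes to UTF-8.
def read_latin1_as_utf8(path)
  File.read(path, encoding: "ISO-8859-1").encode("UTF-8")
end

with_latin1_file do |path|
  mislabelled = File.read(path, encoding: "UTF-8")
  puts mislabelled.encoding          # the label is UTF-8 ...
  puts mislabelled.valid_encoding?   # ... but the bytes are not valid UTF-8

  puts read_latin1_as_utf8(path).valid_encoding?
end
```

The read itself never raises; the failure only shows up later, when something tries to transcode or match against the mislabelled string.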
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

Character encoding, how do I tell the difference?

Characters coming out of my database are encoded differently than the same characters written directly in the source. For example, the word Permissões shows a different result when the string is written directly in the HTML than when the string is output from a db record.
# From the source
Addressable::URI.encode("Permissões.pdf") #=> "Permiss%C3%B5es.pdf"
# From the db
Addressable::URI.encode("Permissões.pdf") #=> "Permisso%CC%83es.pdf"
The encodings are different. But my database is set to UTF-8, and I am using HTML5. What could be causing this?
I am unable to download files I upload to S3 because of this issue. I tried to force the encoding with attachment.path.encode("UTF-8") but that makes no difference.
To solve this, since I am using Rails, I used ActiveSupport::Multibyte::Unicode to normalize any unicode characters before they get inserted into the database.
before_save do
self.path = ActiveSupport::Multibyte::Unicode.normalize(path)
end
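What differs between the two strings is the Unicode normalization form rather than the encoding: both are valid UTF-8, but one stores õ as a single code point (NFC) and the other as "o" plus a combining tilde (NFD). A sketch using plain Ruby's String#unicode_normalize, which I believe is the replacement on newer Rails versions where the ActiveSupport helper is no longer available. It uses the standard library's URI.encode_www_form_component in place of Addressable, and normalized_path is a hypothetical helper name:

```ruby
require "uri"

# Compose characters into NFC form so equal-looking paths are byte-equal.
def normalized_path(path)
  path.unicode_normalize(:nfc)
end

composed   = "Permiss\u00F5es.pdf"    # single code point U+00F5
decomposed = "Permisso\u0303es.pdf"   # "o" followed by U+0303 COMBINING TILDE

puts URI.encode_www_form_component(composed)
puts URI.encode_www_form_component(decomposed)
puts URI.encode_www_form_component(normalized_path(decomposed))
puts normalized_path(decomposed) == composed
```

Calling .encode("UTF-8") has no effect here because the string is already UTF-8; only normalization changes the code points.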

Convert Unicode to UTF-8 before sending json

My Rails app gets certain data in its database from another application. That data is stored as text and it may have some Unicode chars in it. My Rails app does have UTF-8 set as the default in the config. But when that data is sent as JSON to the Backbone front-end, those Unicode chars are not converted properly and the front-end displays ? or smart quotes instead of the proper char. How do I force the Rails backend to convert Unicode chars to UTF-8 in the JSON?
You can call .encode('UTF-8') on each field, which is not that good, or you can write your own JSON serializer, where you can encode to any encoding you want
http://matthewrobertson.org/blog/2013/08/06/active-record-serializers-from-scratch/
or patch the system one
http://api.rubyonrails.org/classes/ActiveModel/Serializers/JSON.html
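A rough sketch of the "encode each field" option, done once over a whole record instead of field by field. It assumes the other application wrote Windows-1252 bytes, which is only a guess; to_utf8 is a hypothetical helper, and plain JSON.generate stands in for whatever serializer the app uses:

```ruby
require "json"

# Recursively convert every string in a Hash/Array structure to valid UTF-8.
# Assumption: anything that is not already valid UTF-8 is Windows-1252.
def to_utf8(value, source_encoding = "Windows-1252")
  case value
  when String
    return value if value.encoding == Encoding::UTF_8 && value.valid_encoding?
    value.dup.force_encoding(source_encoding)
         .encode("UTF-8", invalid: :replace, undef: :replace)
  when Array then value.map { |item| to_utf8(item, source_encoding) }
  when Hash  then value.transform_values { |item| to_utf8(item, source_encoding) }
  else value
  end
end

# 0x93 and 0x94 are the curly double quotes in Windows-1252.
record = { "id" => 1, "title" => "\x93Zen Records\x94".b }
puts JSON.generate(to_utf8(record))
```

If the source encoding is wrong, this will produce different garbage rather than an error, so it is worth checking a few known records by eye.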

UTF Coding for CSV import

I am attempting to import a csv file into a rails application. I followed the directions given in a RailsCast > http://railscasts.com/episodes/396-importing-csv-and-excel
No matter what I do, however, I still get the following error:
ArgumentError in PropertiesController#import
invalid byte sequence in UTF-8 Products.
I'm hoping someone can help me find a solution.
Have you read the CSV documentation? The open method, along with new, supports multibyte character conversions on the fly:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.
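In practice that could look something like the sketch below, which assumes the uploaded file is ISO-8859-1 (a common cause of this error, but you would need to check your own file). read_csv_as_utf8 is a hypothetical helper name and the sample data is made up:

```ruby
require "csv"
require "tempfile"

# Reads a CSV file stored in source_encoding and hands back rows whose
# strings have been transcoded to UTF-8 while reading.
def read_csv_as_utf8(path, source_encoding = "ISO-8859-1")
  CSV.read(path, headers: true, encoding: "#{source_encoding}:UTF-8")
end

# Made-up sample: the second row contains a Latin-1 encoded "ã" (0xE3).
Tempfile.create(["properties", ".csv"]) do |file|
  file.binmode
  file.write("name,city\nJo\xE3o,Lisboa\n".b)
  file.flush

  read_csv_as_utf8(file.path).each do |row|
    puts "#{row["name"]} from #{row["city"]} (#{row["name"].encoding})"
  end
end
```

The "external:internal" pair in the encoding option follows the same idea as the "rb:UTF-32BE:UTF-8" mode string quoted above.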

Weird characters on HTML page

I am using the Last.fm API to fetch some info about artists. I save the info in the DB and then display it on my webpage.
But characters like “ (double quote) are shown as â€œ.
Example artist info: http://www.last.fm/music/David+Penn
and I got the first line as " Producer, arranger, dj and musician from Madrid-Spain. He has his own record company “Zen Recordsâ€, and ".
My DB is UTF-8 but I don't know why this error is still occurring.
This seems to be a character encoding error. Confirm that you are reading the webpage as the correct encoding and are showing the results in the correct encoding.
You should be using UTF-8 all the way through. Check that:
your connection to the database is UTF-8 (using mysql_set_charset);
the pages you're outputting are marked as UTF-8 (<meta http-equiv="Content-Type" content="text/html;charset=utf-8">);
when you output strings from the database, you HTML-encode them using htmlspecialchars() and not htmlentities().
htmlentities HTML-encodes all non-ASCII characters, and by default assumes you are passing it bytes in ISO-8859-1. So if you pass it “ encoded as UTF-8 (bytes 0xE2, 0x80, 0x9C), you'd get one entity reference per byte (rendering as something like â€œ), instead of the expected &ldquo; or &#8220;. This can be fixed by passing in utf-8 as the optional $charset argument.
However it's usually easier to just use htmlspecialchars() instead, as this leaves non-ASCII characters alone, as raw bytes instead of HTML entity references. This results in a smaller page output, so is preferable as long as you're sure the HTML you're producing will keep its charset information (which you can usually rely on, except in context like sending snippets of HTML in a mail or something).
htmlspecialchars() does have an optional $charset argument too, but setting it to utf-8 is not critical since that results in no change of behaviour over the default ISO-8859-1 charset. If you are producing output in old-school multibyte encodings like Shift-JIS you do have to worry about setting this argument correctly, but today that's quite rare as most sane people use UTF-8 in preference.
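The byte-level effect described above is not specific to PHP. As an illustration only, here is the same misreading reproduced in Ruby, assuming the wrong charset in play is Windows-1252 (the browser-flavoured superset of ISO-8859-1); misread_as_windows_1252 and repair_mojibake are hypothetical helper names:

```ruby
# Take correct UTF-8 text and reinterpret its bytes as Windows-1252,
# which is what happens when one layer assumes the wrong charset.
def misread_as_windows_1252(utf8_text)
  utf8_text.dup.force_encoding("Windows-1252").encode("UTF-8")
end

# Undo the damage: turn the characters back into their original bytes
# and label those bytes as UTF-8 again.
def repair_mojibake(garbled)
  garbled.encode("Windows-1252").force_encoding("UTF-8")
end

left_quote = "\u201C"                  # “ is bytes E2 80 9C in UTF-8
garbled = misread_as_windows_1252(left_quote)

puts garbled                           # three characters, one per byte
puts repair_mojibake(garbled) == left_quote
```

The repair only works while every byte survived the round trip; once characters have been dropped or replaced, the original text cannot be recovered this way.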

Resources