Stubborn character encoding errors when reading strings from a text file (Ruby/Rails)

I've been trying to import a long text file generated from a PDF reader application (SODA PDF). The source document is a script in PDF format.
The converted text files look OK in Notepad, but I get a variety of errors when trying to read the file into a string and manipulate it.
None of the following methods, which I've seen in various threads, seems to work:
clean1=Iconv.conv('ASCII//IGNORE', 'UTF8', s)
or
clean1=s.encode('UTF-8', invalid: :replace, undef: :replace, replace: '', UNIVERSAL_NEWLINE_DECORATOR: true)
or
clean1=s.gsub(/[\u0080-\u00ff]/,"")
The first method, using Iconv gives
Iconv::InvalidEncoding: invalid encoding ("ASCII", "UTF8")
when invoked.
The second method appears to work, but fails on various string manipulations like
lines = s.split("\n") unless s.blank?
with
ArgumentError: invalid byte sequence in UTF-8
(Either split or blank? will throw the exception.)
The third method also fails with the 'invalid byte sequence in UTF-8' error.
I am quite hazy on the whole character encoding thing, so excuse any obvious stupidity here.
I'm going to try character-by-character filtering, but that's kind of a pain since the documents I'm working with can be 100+ pages, and I'm hoping there's an easier solution.
Env: Win7 64/ ruby 1.9.3p484 (2013-11-22) [i386-mingw32] / Rails 4.0.3

I discovered that my source file was encoded in ISO-8859-1. I was able to convert it to UTF-8 and it all works fine now.

Related

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through similar questions here on Stack Overflow, but nothing addressed the problem I'm currently facing.
Edit: I called #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0, which is likely a non-breaking space in Latin-1 (ISO 8859-1).
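That reading checks out in a console: \xA0 decoded as Latin-1 is U+00A0, the non-breaking space.

```ruby
# Tag the lone byte as ISO-8859-1, then transcode; .dup guards against
# frozen string literals on newer Rubies.
"\xA0".dup.force_encoding("ISO-8859-1").encode("UTF-8") # => "\u00A0"
```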
If Ruby is having problems reading the file and corrupting the characters during the read, then try using File.binread (which File inherits from IO).
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters, then you can either find some process to 'uncorrupt' them, which is not fun, or strip them by re-encoding from ASCII-8BIT to UTF-8, dropping the invalid characters.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
.encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...
(String#scrub does this, but since you can't read the file in as UTF-8, you can't use it here.)
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in Latin-1.
File.read uses Ruby's default external encoding. If LANG contains something like "en_GB.utf8", File.read will tag the string as UTF-8. You can verify this by logging the value of str.encoding (where str is the string returned by File.read).
File.read does not actually validate the encoding; it just slurps in the bytes and slaps the encoding label on (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (with the result you are noticing).
If your text files are encoded in Latin-1, then use File.read(path, encoding: Encoding::ISO_8859_1). That way it should work. Let us know if it doesn't...
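A standalone sketch of the tag-without-validate behaviour described above (file name hypothetical; the encoding is passed explicitly so the example does not depend on your LANG setting):

```ruby
File.binwrite("latin1.txt", "caf\xE9") # 0xE9 is "é" in ISO-8859-1
s = File.read("latin1.txt", encoding: "UTF-8")
s.valid_encoding? # false: the bytes were merely tagged UTF-8, never checked
good = File.read("latin1.txt", encoding: Encoding::ISO_8859_1).encode("UTF-8")
good # "café"
```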
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

Rails View Encoding Issues

I'm using Ruby 2.0 and Rails 3.2.14. My view is littered with UTF-8 characters, mainly currency symbols like บาท and د.إ. I noticed some
(ActionView::Template::Error) "incompatible character encodings: ASCII-8BIT and UTF-8"
errors in our production logs and promptly tried visiting the page URL in my browser, without any issues. On digging in, I realised the error was actually being caused by BingBot and a few other spiders. When I tried to curl the same URL, I was able to reproduce the issue. So, if I try
curl http://localhost:3000/?x=✓
I get the error wherever UTF-8 symbols are used in the view code. I also realised that if I use HTML-encoded strings in place of the symbols, the error does not occur. However, I prefer using the actual symbols.
I have already tried setting Encoding.default_external = Encoding::UTF_8 in environment.rb and adding the # encoding: utf-8 magic comment to the top of the file; neither helps.
So why does this error occur? What is the difference between hitting this URL in a browser and with curl, besides cookies? And how do I go about fixing this issue so BingBot can index our site? Thanks.
The culprit that was leaking non-UTF-8 characters into my template was an innocuous meta tag for Facebook Open Graph:
%meta{property: "og:url", content: request.url}
And when the request is non-standard, this causes the encoding issue. Changing it to
%meta{property: "og:url", content: request.url.force_encoding('UTF-8')}
did the trick.
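What curl triggers can be reproduced outside Rails: a binary-tagged string holding the raw query bytes cannot be concatenated into a UTF-8 template, and force_encoding fixes it (assuming the bytes really are UTF-8):

```ruby
# The ✓ in the query string is bytes E2 9C 93; a non-standard request can
# leave the URL tagged ASCII-8BIT instead of UTF-8.
url = "http://localhost:3000/?x=\xE2\x9C\x93".dup.force_encoding("ASCII-8BIT")
template = "บาท " # UTF-8 literal from the view
begin
  template + url
rescue Encoding::CompatibilityError
  # incompatible character encodings: UTF-8 and ASCII-8BIT
end
fixed = template + url.force_encoding("UTF-8") # concatenates cleanly
```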
That error message usually occurs when you try to concatenate strings with different character encodings.
Is your database set to use UTF-8 as well?
If not, you could have a problem when you try to insert the non-UTF8 values into your UTF-8 template.

UTF Coding for CSV import

I am attempting to import a CSV file into a Rails application. I followed the directions given in a RailsCast: http://railscasts.com/episodes/396-importing-csv-and-excel
No matter what I do, however, I still get the following error:
ArgumentError in PropertiesController#import
invalid byte sequence in UTF-8 Products.
I'm hoping someone can help me find a solution.
Have you read the CSV documentation? The open and new methods support multibyte character conversions on the fly:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.

Character conversion in ruby 1.8.7 from pdftk unicode conversion results

I am parsing titles from PDF files using pdftk; the titles contain various language-specific characters.
The Ruby on Rails application I need to do this in is using Ruby 1.8.7 and Rails 2.3.14, so any encoding solutions built into Ruby 1.9 aren't an option for me right now.
Example of what I need to do:
If the title includes a ü, when I read the PDF content using pdftk (either on the command line or using the ruby pdf-toolkit gem), the "ü" gets converted to &#252;.
In my application, I really want this as the rendered ü, as that works fine for my needs in a web page and in an XML file.
I can convert the character explicitly in ruby using
>> string = "&#252;"
=> "&#252;"
>> string.gsub("&#252;","ü")
=> "ü"
but obviously I don't want to do this one by one.
I've tried using Iconv to do this, but I don't know what to specify as the to/from encodings to get the rendered character. I thought maybe this was just UTF-8, but it doesn't seem to convert to the rendered character:
>> Iconv.iconv("latin1", "utf-8", "&#252;").join
=> "&#252;"
I am little confused about what format to/from to use here to get the end result of the rendered character.
So how do I use Iconv or other tools to make this conversion for all the characters that pdftk converts to HTML codes like this?
Or how to tell pdftk to do this when I read the pdf file in the first place!
OK - I think the issue here is that the codes pdftk returns are HTML entities, so unescaping the HTML first is the path that works:
>> Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(string) ).join
=> "ü"
Update:
Using the following
pdf = PDF::Toolkit.open(file)
pdf.title = Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(pdf.title)).join
This seems to work for most languages, but when I apply it to Japanese and Chinese, it mangles things and doesn't reproduce the original as it appears in the PDF.
Update:
Getting closer - it appears that the HTML codes pdftk puts in the title for Japanese and Chinese already render correctly if I just unescape them and don't attempt any Iconv conversion:
CGI.unescapeHTML(pdf.title)
This renders correctly.
So... how do I test pdf.title ahead of time to see whether it is Chinese or Japanese (double-byte?) before applying the conversion needed for other languages?
Maybe something like:
string.gsub(/&#\d+;/){|x| x[/\d+/].to_i.chr}
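One caveat on that gsub: Integer#chr in 1.8 only covers single bytes (0-255), so it will raise on the Japanese and Chinese codepoints mentioned above. Packing the codepoint as UTF-8 handles the full range. A sketch:

```ruby
# [n].pack("U") encodes codepoint n as UTF-8, so it also covers
# codepoints above 255 (252 is ü, 26085 is 日).
"&#252; &#26085;".gsub(/&#(\d+);/) { [$1.to_i].pack("U") }
# => "ü 日"
```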

Weird characters on HTML page

I am using the Last.fm API to fetch some info about artists. I save the info in the DB and then display it on my webpage.
But characters like “ (double quote) are shown as â€œ.
Example Artist info http://www.last.fm/music/David+Penn
and I got the first line as " Producer, arranger, dj and musician from Madrid-Spain. He has his own record company â€œZen Recordsâ€, and ".
My DB is UTF-8, but I don't know why this error keeps occurring.
This seems to be a character encoding error. Confirm that you are reading the page with the correct encoding and outputting the results in the correct encoding.
You should be using UTF-8 all the way through. Check that:
your connection to the database is UTF-8 (using mysql_set_charset);
the pages you're outputting are marked as UTF-8 (<meta http-equiv="Content-Type" content="text/html;charset=utf-8">);
when you output strings from the database, you HTML-encode them using htmlspecialchars() and not htmlentities().
htmlentities HTML-encodes all non-ASCII characters, and by default assumes you are passing it bytes in ISO-8859-1. So if you pass it “ encoded as UTF-8 (bytes 0xE2, 0x80, 0x9C), you'd get â€œ, instead of the expected &ldquo; or a raw “. This can be fixed by passing utf-8 as the optional $charset argument.
However it's usually easier to just use htmlspecialchars() instead, as this leaves non-ASCII characters alone, as raw bytes instead of HTML entity references. This results in a smaller page output, so is preferable as long as you're sure the HTML you're producing will keep its charset information (which you can usually rely on, except in context like sending snippets of HTML in a mail or something).
htmlspecialchars() does have an optional $charset argument too, but setting it to utf-8 is not critical since that results in no change of behaviour over the default ISO-8859-1 charset. If you are producing output in old-school multibyte encodings like Shift-JIS you do have to worry about setting this argument correctly, but today that's quite rare as most sane people use UTF-8 in preference.