UTF Coding for CSV import - ruby-on-rails

I am attempting to import a CSV file into a Rails application. I followed the directions given in a RailsCast: http://railscasts.com/episodes/396-importing-csv-and-excel
No matter what I do, however, I still get the following error:
ArgumentError in PropertiesController#import
invalid byte sequence in UTF-8 Products.
I'm hoping someone can help me find a solution.

Have you read the CSV documentation? The open and new methods support multibyte character conversion on the fly:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.
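As a sketch of that mode string (assuming, for illustration, that the import file is Latin-1 rather than UTF-32BE; the file name and contents are made up):

```ruby
require 'csv'

# Hypothetical import file, written here in Latin-1 so the example is
# self-contained (0xE9 is "é" in ISO-8859-1).
File.binwrite('products.csv', "name,price\ncaf\xE9,4.50\n")

names = []
# "r:ISO-8859-1:UTF-8" tells CSV the file is Latin-1 and asks Ruby to
# transcode it to UTF-8 before CSV parses it, as the doc describes.
CSV.open('products.csv', 'r:ISO-8859-1:UTF-8', headers: true) do |csv|
  csv.each { |row| names << row['name'] }
end

names.first # => "café", tagged as UTF-8
```

If the mode string is omitted, CSV assumes Encoding.default_external, which is what triggers the "invalid byte sequence in UTF-8" error on Latin-1 input.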


ActiveStorage CSV file force encoding?

I have a CSV file that I'm uploading which runs into an issue when importing rows into the database:
Encoding::UndefinedConversionError ("\xCC" from ASCII-8BIT to UTF-8)
What would be the most efficient way to ensure each column is properly encoded for being placed in the database or ignored?
The most basic approach is to go through each row and each field and force encoding on the string but that seems incredibly inefficient. What would be a better way to handle this?
Currently it's just uploaded as a parameter (:csv_file). I then access it as follows:
CSV.parse(csv_file.download) within the model.
I'm assuming there's a way to force the encoding when CSV.parse is called on the activestorage file but not sure how. Any ideas?
Thanks!
The latest version of ActiveStorage (6.0.0.rc1) adds an API to download the file to a tempfile, which you can then read from. I'm assuming that Ruby will read from the file using the correct encoding.
https://edgeguides.rubyonrails.org/active_storage_overview.html#downloading-files
If you don't want to upgrade to the RC of Rails 6 (like I don't) you can use this method to convert the string to UTF-8 while getting rid of the byte order mark that may be present in your file:
wrongly_encoded_string = active_record_model.attachment.download
correctly_encoded_string = wrongly_encoded_string.bytes.pack("c*").force_encoding("UTF-8")
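For what it's worth, the pack/force_encoding line above re-tags the bytes as UTF-8 but does not by itself remove a UTF-8 byte order mark; a sketch of doing both (the sample bytes are made up to stand in for the downloaded attachment):

```ruby
# Simulate bytes downloaded from ActiveStorage: a UTF-8 BOM followed by CSV data.
wrongly_encoded = "\xEF\xBB\xBFname\ncaf\xC3\xA9\n".dup.force_encoding('ASCII-8BIT')

# Re-tag the bytes as UTF-8 (equivalent to the pack/force_encoding trick above)...
utf8 = wrongly_encoded.bytes.pack('c*').force_encoding('UTF-8')
# ...then drop the byte order mark, if one is present, before CSV.parse sees it.
utf8 = utf8.delete_prefix("\uFEFF")

utf8.valid_encoding?      # => true
utf8.start_with?('name')  # => true
```

Without the delete_prefix step, the BOM would survive as part of the first header name.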

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and corrupting the characters during the read, then try using File.binread (which is inherited from IO):
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters then you can either find some process to 'uncorrupt' them, which is not fun, or strip them by re-encoding from ASCII-8BIT to UTF-8 and replacing the invalid characters.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
  .encode('utf-8', 'binary', invalid: :replace, undef: :replace)
...
(String#scrub does this, but since you can't read it in as UTF-8 you can't use it here.)
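That said, scrub can be made to work if the binary string is first re-tagged as UTF-8, since invalid sequences are only detectable under that encoding; a sketch (sample bytes made up):

```ruby
# Bytes from File.binread are tagged ASCII-8BIT, where every byte is "valid",
# so scrub alone is a no-op. Re-tagging as UTF-8 first exposes the invalid
# sequences to scrub ("\xE9" on its own is not valid UTF-8).
raw   = "caf\xE9 latte".dup.force_encoding('ASCII-8BIT')
clean = raw.force_encoding('UTF-8').scrub('')

clean                  # => "caf latte"
clean.valid_encoding?  # => true
```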
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in latin1.
File.read uses Ruby's default external encoding by default. If LANG contains something like "en_GB.utf8", File.read will tag the string with UTF-8 encoding. You can verify this by logging the value of str.encoding (where str is the value returned by File.read).
File.read does not actually verify the encoding, it only slurps in the bytes and slaps on the encoding (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
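A minimal sketch of that suggestion (the file path and contents are made up for the demonstration):

```ruby
# Hypothetical Latin-1 attachment, written here so the example is
# self-contained (0xA0 is a non-breaking space in ISO-8859-1).
File.binwrite('attachment.txt', "price:\xA0100".b)

# Read it tagged with the encoding it is actually in, then transcode to
# UTF-8 for the rest of the application (and ActionMailer).
str  = File.read('attachment.txt', encoding: Encoding::ISO_8859_1)
utf8 = str.encode('UTF-8')

utf8.valid_encoding?  # => true
utf8                  # => "price:\u00A0100" (0xA0 became U+00A0)
```

The difference from force_encoding is that encode actually converts the bytes, so later transcoding in ActionMailer no longer fails.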
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

How to identify character encoding from website?

What I'm trying to do:
I'm getting a list of URIs from a database and downloading them,
removing the stopwords and counting the frequency with which words appear in the webpage,
then trying to save the result in MongoDB.
The Problem:
When I try to save the result in the database I get the error
bson.errors.InvalidDocument: the document must be valid UTF-8
It appears to be related to codes like '\xc3someotherstrangewords' and '\xe2something'.
When I'm processing the webpages I try to remove the punctuation, but I can't remove accents because I'd end up with wrong words.
What I already tried
I've tried to identify the character encoding through the header of the webpage;
I've tried to use chardet;
I've tried re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore'),
but those aren't good for non-English languages because they remove the accents.
What I want to know is:
does anyone know how to identify the characters and translate them to the right word/encoding?
e.g. take '\xe2' from the webpage and translate it to 'â'
(English isn't my first language so forgive me)
EDIT: in case anyone wants to see the source code
It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
from bs4 import BeautifulSoup
import urllib
url = 'http://www.google.de'
fh = urllib.urlopen(url)
html = fh.read()
soup = BeautifulSoup(html, 'html.parser')
# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a utf-8 string that you can store in mongo
encoded_text = text.encode('utf-8')
See also the answers to this question.

lua reading chinese character

I have the following xml that I would like to read:
chinese xml - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
korean xml - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently, I try to use LuaXML to parse the XML, which contains Chinese characters. However, when I print it to the console, the Chinese characters are not displayed correctly and show up as garbage.
I would like to ask if there is any way to parse Chinese or Korean characters into a Lua table.
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert "中美" into Chinese characters.
I needed to do one additional step, converting the numeric character entities in the string using this method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving into XML format:
string.gsub(l,"&#([0-9]+);", function(c) return string.char(tonumber(c)) end)
For LuaXML, I have also come across the method xml.registerCode(decoded, encoded).
Under that method, it says that
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters and how do I use it?

Rails - Saving Mail Attachment in a Postgres DB, results in PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xa0

Has anyone seen this error before?
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xa0
I'm trying to save incoming mail attachments, of any file type, to the database for processing.
Any ideas?
What type of column are you saving your data to? If the attachment could be of any type, you need a bytea column to ensure that the data is simply passed through as a blob (binary "large" object). As mentioned in other answers, that error indicates that some data sent to PostgreSQL that was tagged as being text in UTF-8 encoding was invalid.
I'd recommend you store email attachments as binary along with their MIME content-type header. The Content-Type header should include the character encoding needed to convert the binary content to text for attachments where that makes sense: e.g. "text/plain; charset=iso-8859-1".
If you want the decoded text available in the database, you can have the application decode it and store the textual content, maybe having an extra column for the decoded version. That's useful if you want to use PostgreSQL's full-text indexing on email attachments, for example. However, if you just want to store them in the database for later retrieval as-is, just store them as binary and leave worrying about text encoding to the application.
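For illustration, a migration for that kind of storage might look like this (table and column names are hypothetical, not from the question):

```ruby
# Hypothetical migration: store the raw attachment bytes in a bytea column
# alongside the MIME content type needed to interpret them later.
class CreateMailAttachments < ActiveRecord::Migration[6.0]
  def change
    create_table :mail_attachments do |t|
      # t.binary maps to bytea in PostgreSQL: the bytes pass through
      # untouched, so any attachment type is safe to store.
      t.binary :body
      # Keep the Content-Type header (including charset, where applicable),
      # e.g. "text/plain; charset=iso-8859-1", so the app can decode later.
      t.string :content_type
      t.timestamps
    end
  end
end
```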
The 0xa0 is a non-breaking space, possibly latin1 encoding. In Python I'd use str.decode() and str.encode() to change it from its current encoding to the target encoding, here 'utf8'. But I don't know how you'd go about it in Rails.
I do not know about Rails, but when PG gives this error message it means that:
the connection between Postgres and your Rails client is correctly configured to use UTF-8 encoding, meaning that all text data going between the client and Postgres must be encoded in UTF-8;
and your Rails client erroneously sent some data encoded in another encoding (most probably Latin-1 / ISO-8859): therefore Postgres rejects it.
You must look into your client code where the data is inserted into the database, probably you try to insert a non-unicode string or there is some improper transcoding taking place.
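As a sketch of the kind of client-side fix this implies (assuming the stray bytes are Latin-1, as the 0xa0 suggests; the sample data is made up):

```ruby
# Bytes that arrived from a Latin-1 source, tagged BINARY (0xE9 is "é").
data = "caf\xE9".b

# Tag the bytes with the encoding they actually are, then transcode to the
# connection encoding (UTF-8) before building the INSERT.
utf8 = data.force_encoding('ISO-8859-1').encode('UTF-8')

utf8.valid_encoding?  # => true
utf8                  # => "café"
```

With the string properly transcoded, Postgres accepts it; sending the raw Latin-1 bytes over a UTF-8 connection is what produces the 0xa0 error.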
