ActiveStorage CSV file force encoding? - ruby-on-rails

I have a CSV file that I'm uploading which runs into an issue when importing rows into the database:
Encoding::UndefinedConversionError ("\xCC" from ASCII-8BIT to UTF-8)
What would be the most efficient way to ensure each column is properly encoded for being placed in the database or ignored?
The most basic approach is to go through each row and each field and force encoding on the string but that seems incredibly inefficient. What would be a better way to handle this?
Currently it's just uploaded as a parameter (:csv_file). I then access it as follows:
CSV.parse(csv_file.download) within the model.
I'm assuming there's a way to force the encoding when CSV.parse is called on the activestorage file but not sure how. Any ideas?
Thanks!

The latest version of ActiveStorage (6.0.0.rc1) adds an API for downloading the file to a temp file, which you can then read from. I'm assuming that Ruby will read from the file using the correct encoding.
https://edgeguides.rubyonrails.org/active_storage_overview.html#downloading-files
If you don't want to upgrade to the RC of Rails 6 (like I don't), you can use this method to tag the downloaded string as UTF-8. Note that it only relabels the bytes, so a byte order mark that may be present in your file still has to be stripped separately:
wrongly_encoded_string = active_record_model.attachment.download
correctly_encoded_string = wrongly_encoded_string.bytes.pack("c*").force_encoding("UTF-8")
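A minimal sketch of the same idea using plain Ruby strings, assuming the uploaded file really is UTF-8 (possibly with a byte order mark). The `raw` literal is only a stand-in for what `csv_file.download` returns; ActiveStorage itself is not involved here:

```ruby
require "csv"

# Stand-in for csv_file.download: raw bytes tagged ASCII-8BIT (binary),
# starting with a UTF-8 byte order mark (EF BB BF).
raw = "\xEF\xBB\xBFname,city\nJos\xC3\xA9,S\xC3\xA3o Paulo\n".b

# Relabel the bytes as UTF-8 (no bytes change), replace any invalid
# sequences with "?", then drop the BOM if one is present.
text = raw.dup.force_encoding(Encoding::UTF_8)
text = text.scrub("?").delete_prefix("\uFEFF")

rows = CSV.parse(text, headers: true)
rows.each { |row| puts "#{row["name"]} lives in #{row["city"]}" }
```

Relabelling once on the whole string avoids touching each field individually. If the file is not UTF-8 (Latin-1, for example), the bytes would need to be transcoded with `String#encode` instead of relabelled.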

Related

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send said email resulted in an invalid byte sequence in UTF-8 error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer so although it ought to go without saying, here's representative code for the attachment action within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and is corrupting the characters during the read, try using File.binread, which File inherits from IO.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters, you can either find some process to 'uncorrupt' them, which is not fun, or strip them by re-encoding from ASCII-8BIT to UTF-8 and dropping the invalid characters.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
.encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
...
(String#scrub does this too, but only on a string that is already tagged as UTF-8, so you can't use it directly on a binary read without calling force_encoding first.)
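A small sketch comparing the two stripping approaches, assuming a hypothetical file that is mostly UTF-8 with one stray Latin-1 byte. The `raw` value stands in for the result of `File.binread`:

```ruby
# Hypothetical file content: UTF-8 text ("naïve") followed by a stray
# Latin-1 non-breaking space (byte A0), tagged as binary.
raw = "na\u00EFve".b + "\xA0".b

# Treating the bytes as binary drops every byte >= 0x80, including the
# two bytes that make up the valid "ï".
ascii_only = raw.encode("UTF-8", "binary", invalid: :replace, undef: :replace, replace: "")

# Relabelling as UTF-8 and scrubbing removes only the invalid byte.
scrubbed = raw.dup.force_encoding("UTF-8").scrub("")

puts ascii_only
puts scrubbed
```

The binary re-encode is the blunter tool: it keeps ASCII only. Scrubbing after `force_encoding` keeps any valid UTF-8 sequences, which matters if the files contain legitimate non-ASCII text.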
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in latin1.
File.read uses Ruby's default external encoding (Encoding.default_external) by default. If LANG contains something like "en_GB.utf8", File.read will associate the string with UTF-8 encoding. You can verify this by logging the value of str.encoding (where str is the value returned by File.read).
File.read does not actually verify the encoding, it only slurps in the bytes and slaps on the encoding (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (and with the result you are noticing).
If your text files are encoded in latin1, then use File.read(path, encoding: Encoding::ISO_8859_1). This way, it may work. Let us know if it doesn't...
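As a rough illustration, assuming the file really is Latin-1, the following writes a temporary file containing byte A0, reads it back with an explicit encoding, and transcodes it to UTF-8. The file name and content are made up for the example:

```ruby
require "tempfile"

# Create a hypothetical Latin-1 file containing a non-breaking space (A0).
file = Tempfile.new(["latin1", ".txt"])
file.binmode
file.write("price:\xA0100".b)
file.close

# Tell Ruby what the bytes actually are, then transcode to UTF-8.
latin1 = File.read(file.path, encoding: Encoding::ISO_8859_1)
utf8 = latin1.encode(Encoding::UTF_8)

puts latin1.encoding
puts utf8.valid_encoding?
```

Because the string is correctly labelled when it is read, a later transcoding step has a defined mapping for every byte and does not need to guess.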
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

Encode Carrierwave attachment to base64 in Rails

I am using the Carrierwave gem to upload attachments to my model. I added elasticsearch with the mapper attachments plugin to allow for full text search of the attachments.
Carrierwave and elasticsearch work fine, but in order to get the full text search working I need to encode the attachment as base64.
I have followed this tutorial (http://rny.io/rails/elasticsearch/2013/08/05/full-text-search-for-attachments-with-rails-and-elasticsearch.html) but I assume there have been some changes to either Rails or Carrierwave, as I can't get it to work. Specifically, when I try to encode the attachment as base64, I get the following TypeError:
no implicit conversion of CarrierWave::SanitizedFile into String
The error is in the following line of the model:
File.open(Base64.encode64(File.read(document.file)))
If I replace the path with a url to an actual file it works fine.
I have searched SO and the only relevant answer I can find gives me the same error: Carrierwave encode file to base64 as process
I am a complete rails newbie and hopefully this is something that's obvious to everyone except me, but we're all beginners at first, right?
Thanks!
CarrierWave's read method returns the content of the file. So assuming Document is your model and file is your uploader attribute, this should work:
Base64.encode64(document.file.read)

Rails DateTime re-format using #{DateTime.now}

I have a task in Rails that exports an XML file. It contains this line:
tmp_filename="#{Rails.root}/tmp/orders-#{o.id}-#{DateTime.now}.xml"
and this outputs the xml file with a filename like
orders-42-2015-01-28T17:22:35+00-00.xml
This is how it shows up when it's uploaded directly to Amazon S3. The problem is that I need to get rid of the colons and use only dashes, because the system that consumes these files doesn't handle colons in the filename properly.
The strange thing is that when I download the file from s3 it downloads as dashes.
I'm not sure how or if I can use strftime on #{}
Could anybody help with what I'm trying to do? Or is this just an Amazon S3 thing, and the file is actually already being generated with - rather than :?
Strftime doesn't seem to work on amazon s3, the file still uploads as the original format even after adding
tmp_filename="#{Rails.root}/tmp/orders-#{o.id}-#{DateTime.now.strftime('%d-%m-%Y-%H%M%S')}.xml"
and it also adds an extra +00:00 at the end for some reason that I can't get rid of
Can't you just format the DateTime without colons, for example:
tmp_filename="#{Rails.root}/tmp/orders-#{o.id}-#{DateTime.now.strftime('%Y-%m-%d-%H-%M-%S')}.xml"
With this you'll get the time in format like below, without colons:
irb(main):010:0> DateTime.now.strftime('%Y-%m-%d-%H-%M-%S')
=> "2015-01-29-10-50-30"

Character encoding, how do I tell the difference?

Characters coming out of my database are encoded differently than the same characters written directly in the source. For example, the word Permissões shows a different result when the string is written directly in the HTML than when the string is output from a db record.
# From the source
Addressable::URI.encode("Permissões.pdf") #=> "Permiss%C3%B5es.pdf"
# From the db
Addressable::URI.encode("Permissões.pdf") #=> "Permisso%CC%83es.pdf"
The encodings are different. But my database is set to UTF-8, and I am using HTML5. What could be causing this?
I am unable to download files I upload to S3 because of this issue. I tried to force the encoding with attachment.path.encode("UTF-8") but that makes no difference.
To solve this, since I am using Rails, I used ActiveSupport::Multibyte::Unicode to normalize any unicode characters before they get inserted into the database.
before_save do
  self.path = ActiveSupport::Multibyte::Unicode.normalize(path)
end
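The two percent-encoded results above differ because one string uses a precomposed "õ" (bytes C3 B5) and the other uses "o" followed by a combining tilde (bytes CC 83). A minimal sketch of that difference, using Ruby's built-in `String#unicode_normalize` and the standard library's `URI.encode_www_form_component` in place of ActiveSupport and Addressable (a substitution made here so the example runs without gems):

```ruby
require "uri"

# Same visible text, two different code point sequences.
composed   = "Permiss\u00F5es.pdf"   # "õ" as one code point (NFC form)
decomposed = "Permisso\u0303es.pdf"  # "o" + combining tilde (NFD form)

puts composed == decomposed                      # the strings are not equal
puts URI.encode_www_form_component(composed)     # contains %C3%B5
puts URI.encode_www_form_component(decomposed)   # contains %CC%83

# Normalizing to NFC makes both strings byte-for-byte identical.
normalized = decomposed.unicode_normalize(:nfc)
puts normalized == composed
```

Both strings are valid UTF-8, which is why calling `encode("UTF-8")` changes nothing; only normalization makes them compare equal.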

UTF Coding for CSV import

I am attempting to import a csv file into a rails application. I followed the directions given in a RailsCast > http://railscasts.com/episodes/396-importing-csv-and-excel
No matter what I do, however I still get the following error:
ArgumentError in PropertiesController#import
invalid byte sequence in UTF-8 Products.
I'm hoping someone can help me find a solution.
Have you read the CSV documentation? The open method, along with new, supports multibyte character conversions on the fly:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.
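A short sketch of what the quoted passage describes, assuming the uploaded file is Latin-1 (ISO-8859-1); the actual source encoding of your file is something you would need to determine. The temporary file and its contents are made up for the example:

```ruby
require "csv"
require "tempfile"

# Create a hypothetical Latin-1 CSV file ("José", "São Paulo").
file = Tempfile.new(["products", ".csv"])
file.binmode
file.write("name,city\nJos\xE9,S\xE3o Paulo\n".b)
file.close

# The mode string names the file's encoding first and the encoding to
# transcode to second, so CSV only ever sees UTF-8 data.
rows = CSV.open(file.path, "r:ISO-8859-1:UTF-8", headers: true) { |csv| csv.read }

rows.each { |row| puts "#{row["name"]} - #{row["city"]}" }
```

With the source encoding declared in the mode string, the "invalid byte sequence in UTF-8" error does not arise, because the bytes are no longer assumed to be UTF-8 in the first place.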
