identifying problematic row of data giving mass import error - ruby-on-rails

I am using activerecord-import to bulk insert a bunch of data in a .csv file into my rails app. Unfortunately, I am getting an error when I call import on my model.
ArgumentError (invalid byte sequence in UTF-8)
I know the problem is that I have a string with weird characters somewhere in the 1000+ rows of data that I am importing, but I can't figure out which row is the problem.
Does activerecord-import have any error handling built in that I could use to figure out which row(s) were problematic (e.g. some option I could set when calling the import function on my model)? As far as I can tell, the answer is no.
Alternatively, can I write some code that would check the array that I am passing into activerecord-import to determine which rows have strings that are invalid in UTF-8?

Without being able to see the data, it is only possible to guess. Most likely, you have a character combination that is not UTF-8 valid.
You should be able to check your file with:
iconv -f UTF-8 -t UTF-8 <filename> > /dev/null
On most systems, iconv stops at the first invalid byte sequence and reports its position.
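To answer the second part of the question, you can also find the offending rows in Ruby before calling import, using `String#valid_encoding?`. A minimal sketch (the sample rows and their layout are made up for illustration):

```ruby
# Report every row that contains a string which is not valid UTF-8.
rows = [
  ["Alice", "alice@example.com"],
  ["Bob \xFF", "bob@example.com"]   # contains an invalid byte
]

invalid = rows.each_with_index.select do |row, _i|
  row.any? { |v| v.is_a?(String) && !v.valid_encoding? }
end

invalid.each { |row, i| puts "Row #{i}: #{row.inspect}" }
```

Run this over the array you pass to activerecord-import and it will print the index of every row that would trigger the ArgumentError.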

Related

How to remove an invalid sequence from a Genbank file containing multiple genome sequences based on ID

I have a ~3 GB Genbank file containing complete Genbank annotations for ~20,000 bacterial genome sequences. My goal is to use BioPython to parse these sequences, and write individual fasta files for non-duplicate sequences with something like the following:
from Bio import SeqIO

records = SeqIO.parse(r'C:\Users\aaa\aaa\file.gb', 'genbank')
organism_dict = {}  # tracks sequence names already written
for record in records:
    seq_name = record.id
    if seq_name not in organism_dict:
        organism_dict[seq_name] = True
        with open(output_folder + seq_name, 'w') as handle:
            SeqIO.write(record, handle, 'fasta')
This works perfectly fine for the first ~2,000 sequences, but then reaches an entry with an invalid footer and produces the error message ValueError: Sequence line mal-formed 'title>NCBI/ffsrv11 - WWW Error 500 Diagnostic'.
I managed to find the sequence causing the error, so what I'd like to do is delete it from my Genbank file and manually download it as a fasta file later. However, I can't open the file in a text editor (due to its size), and I can't parse the file (due to the error), so I'm wondering if anyone has an idea of how to remove a sequence based on Genbank ID. I'm open to non-python options.
Thank you in advance,
Daniel
Try adding a try/except around the write, where the except clause writes record.id to a separate file. If the try block fails, that record won't be written, and the except will collect the IDs for later download.

Spreadsheets ruby gem encoding not working

I'm getting a weird problem when I try to write strings (that are UTF-8) into an .xls file with the Spreadsheet gem. It doesn't give errors, but I get an invalid spreadsheet with random characters (opened in Excel and Calc, same thing).
So I assume it is an encoding error, but I thought the lib would automatically convert my strings to the encoding used by Excel... I tried converting them to ISO by hand (.encode('ISO-8859-1')), force_encoding to UTF-8, and many other combinations of these two methods. Some give execution errors, and the others just don't work. Is there anything special I should do?
Spreadsheets: http://spreadsheet.rubyforge.org/
Code:
book = Spreadsheet::Workbook.new
sheet = book.create_worksheet
lines.each_with_index do |line, row|
  sheet.row(row).concat(line) # line is an array of UTF-8 strings
end
book.write file
You should try adding the following magic comment at the top of your Ruby script:
# encoding: UTF-8
Before processing your source code, the interpreter reads this line and sets the proper source encoding, so I assume this should solve your problem.
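If that doesn't help, note that `.encode` and `.force_encoding` do very different things, which may explain why mixing them gave inconsistent results. A quick illustration:

```ruby
# encode transcodes the bytes; force_encoding only relabels them.
s = "héllo"                                   # 6 bytes in UTF-8 ("é" is 2 bytes)
latin   = s.encode('ISO-8859-1')              # converted: "é" becomes 1 byte
relabel = s.dup.force_encoding('ISO-8859-1')  # same 6 bytes, new label

puts latin.bytesize    # => 5
puts relabel.bytesize  # => 6
```

The relabeled string still holds UTF-8 bytes, so a consumer that trusts the label will render garbage, which matches the symptom described above.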

Rails oracle raw16

I'm using Rails 3.2.1 and I have been stuck on a problem for quite a while.
I'm using the Oracle enhanced adapter, and I have a raw(16) (uuid) column. When I try to display the data, there are two situations:
1) I see weird symbols
2) I get "incompatible character encoding: ASCII-8BIT and UTF-8"
In my application.rb file I added
config.encoding = 'utf-8'
and in my view file I added
# encoding: utf-8
But so far nothing has worked.
I also tried adding html_safe, but it failed.
How can I safely display my uuid data?
Thank you very much
Answer:
I used the unpack method to convert the binary with the parameters H8H4H4H4H12 and in the end joined the array :-)
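That approach looks roughly like this (the sample bytes are made up):

```ruby
# Format 16 raw bytes as a dashed UUID string.
raw = "\x01\x23\x45\x67\x89\xAB\xCD\xEF\x01\x23\x45\x67\x89\xAB\xCD\xEF".b

uuid = raw.unpack('H8H4H4H4H12').join('-')
puts uuid  # => "01234567-89ab-cdef-0123-456789abcdef"
```

The H8/H4/.../H12 directives split the 16 bytes into the five hex groups of the standard UUID layout.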
The RAW datatype is a string of bytes that can take any value. This includes binary data that doesn't translate to anything meaningful in ASCII or UTF-8 or in any character set.
You should really read Joel Spolsky's note about character sets and unicode before continuing.
Now, since the data can't be translated reliably to a string, how can we display it? Usually we convert or encode it, for instance:
we could use the hexadecimal representation, where each byte is converted to two [0-9A-F] characters (in Oracle, using the RAWTOHEX function). This is fine for displaying a small binary field such as RAW(16).
we could also use other encodings, such as base64 (in Oracle, with the UTL_ENCODE package).
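The same two encodings are easy to apply on the Ruby side, as a sketch:

```ruby
raw = "\x00\xFF\x10\x20".b   # arbitrary sample bytes

hex = raw.unpack1('H*')  # hexadecimal, analogous to Oracle's RAWTOHEX
b64 = [raw].pack('m0')   # base64 without line breaks
puts hex  # => "00ff1020"
puts b64  # => "AP8QIA=="
```

Either string is safe to embed in a UTF-8 view, unlike the raw bytes themselves.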

Unpacking ActiveRecord binary blob to hex string drops escape characters ("%25" converted to "%")

I have a Ruby-on-Rails app that accepts a binary file upload, stores it as an ActiveRecord object in a local database, and passes a hex equivalent of the binary blob to a back-end web service for processing. This usually works great.
Two days ago, I ran into a problem with a file containing the hex sequence \x25\x32\x35, %25 in ASCII. The binary representation of the file was stored properly in the database but the hex string representation of the file that resulted from
sample.binary.unpack('H*').to_s
was incorrect. After investigating, I found that those three bytes were converted to the hex string 25, the representation of %. It should have been 253235, the representation of %25.
It makes sense for Ruby or Rails or ActiveRecord to do this: %25 is the proper URL-encoded value for %. However, I need to turn off this optimization or validation or whatever it is. I need blob.unpack('H*') to include a hex equivalent for every byte of the blob.
One (inefficient) way to solve this is to store a hex representation of the file in the database. Grabbing the file directly from the HTTP POST request works fine:
params[:sample].read.unpack('H*').to_s
That stores the full 253235. Something about the roundtrip to the database (sqlite) or the HTTPClient post from the front-end web service to the back-end web service (hosted within WEBrick) is causing the loss of fidelity.
Eager to hear any ideas, willing to try whatever to test out suggestions. Thanks.
This is a known issue with Rails and its SQLite adapter.
There is a bug filed here in the old Rails tracking system (with a patch):
https://rails.lighthouseapp.com/projects/8994/tickets/5040
And a new bug filed here in the new Rails issue tracking system:
https://github.com/rails/rails/issues/2407
Any string that contains '%00' will be mangled when converted to binary and back, and a binary that contains the string '%25' will be converted to '%', which is what you are seeing.
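Note that `unpack('H*')` itself is lossless; you can confirm the bytes survive when the database round trip is taken out of the picture:

```ruby
data = "\x25\x32\x35".b   # the bytes "%25"

puts data.unpack1('H*')  # => "253235", one hex pair per byte
```

So the fidelity is lost in the adapter's escaping, not in the unpack call.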

Bad characters from CSV into database

I am trying to figure out why I keep getting bad characters when I import information into my database from a CSV file.
Setup:
Database is UTF-8 encoding
HTML Page = UTF-8 Encoding (Meta Tag)
What I'm receiving when the file is imported contains garbled characters, but in the CSV file everything looks clean, and the actual number is +1 (250) 862-8350.
So I don't know what the issue is, my hunch is something to do with a form of trimming but I haven't been able to figure out what it is... any light would be appreciated!
Well, I found my answer, and it's somewhat embarrassing. When my phone number gets put into the database, I run it through my cleaner and then encode the data... but I didn't notice that my database column was set to a small character count, and my encoded value was longer than what would fit in the column. In short, I changed my column from 32 to 64 characters and that solved the problem.
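That failure mode is easy to reproduce: truncating a UTF-8 string at an arbitrary byte boundary, which is effectively what an undersized column does, can split a multi-byte character and leave invalid bytes behind:

```ruby
s = "café"               # 5 bytes in UTF-8 ("é" is 2 bytes)
cut = s.byteslice(0, 4)  # cuts in the middle of "é"

puts cut.valid_encoding?  # => false
```

The leftover half-character is what shows up as "bad characters" after the import.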
Thank you for your time in trying to help me though!