Bad characters from CSV into database - character-encoding

I am trying to figure out why I keep getting bad characters when I import information into my database from a CSV file.
Setup:
Database: UTF-8 encoding
HTML page: UTF-8 encoding (meta tag)
What I'm receiving when the file is imported is a phone number full of bad characters.
But in the CSV file everything looks clean, and the actual number is +1 (250) 862-8350
So I don't know what the issue is. My hunch is that it's something to do with some form of trimming, but I haven't been able to figure out what it is... any light would be appreciated!

Well, I found out my answer, and it's somewhat embarrassing. When my phone number gets put into the database I run it through my cleaner and then encode the data... But I didn't notice that my database column was set to a small character count, and the encoded value was longer than what could be inserted into the database... So in short, I made my column 64 characters instead of 32 and it solved the problem.
Thank you for your time in trying to help me though!
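For anyone hitting the same thing, here is a minimal sketch in Ruby of the check that would have caught this: compare the cleaned-and-encoded value against the column limit before inserting, so a too-narrow column can't silently truncate it. The original post doesn't say which language or encoder was used, so the names and the CGI.escape step are purely illustrative.

require "cgi"

COLUMN_LIMIT = 32  # the width the column was originally created with

# Stand-in for the "run it through my cleaner, then encode" step described above.
def fits_column?(raw, limit: COLUMN_LIMIT)
  encoded = CGI.escape(raw.strip)
  encoded.bytesize <= limit   # compare the encoded length, in bytes, against the column limit
end

fits_column?("+1 (250) 862-8350")  # => true only if the encoded value actually fits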

Related

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer name columns: one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found regardless of the underlying match algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or even UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the collation.
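The symptom ("only exact matches work for Arabic") is what you would expect if the bytes were being compared in the wrong encoding; the distance algorithm itself is encoding-agnostic once it sees real Unicode characters. A rough Ruby illustration of Levenshtein over normalized Unicode strings - not how Talend's tFuzzyMatch is implemented, just the general idea:

def levenshtein(a, b)
  # Normalize and split into characters so Arabic (or any script) is compared
  # codepoint by codepoint rather than byte by byte.
  a = a.unicode_normalize(:nfc).chars
  b = b.unicode_normalize(:nfc).chars
  prev = (0..b.size).to_a
  a.each_with_index do |ca, i|
    curr = [i + 1]
    b.each_with_index do |cb, j|
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + (ca == cb ? 0 : 1)].min
    end
    prev = curr
  end
  prev.last
end

levenshtein("محمد", "محمود")  # => 1, a near match, once both strings are valid Unicode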

identifying problematic row of data giving mass import error

I am using activerecord-import to bulk insert a bunch of data in a .csv file into my Rails app. Unfortunately, I am getting an error when I call import on my model.
ArgumentError (invalid byte sequence in UTF-8)
I know the problem is that I have a string with weird characters somewhere in the 1000+ rows of data that I am importing, but I can't figure out which row is the problem.
Does activerecord-import have any error handling built in that I could use to figure out which row(s) were problematic (e.g. some option I could set when calling the import function on my model)? As far as I can tell, the answer is no.
Alternatively, can I write some code that would check the array that I am passing into activerecord-import to determine which rows have strings that are invalid in UTF-8?
Without being able to see the data, it is only possible to guess. Most likely, you have a character combination that is not UTF-8 valid.
You should be able to check your file with
iconv -f utf8 <filename>
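If you would rather find the offending rows from Ruby before building the array you hand to activerecord-import, here is a small sketch (the file name is an assumption) that reports which physical lines contain bytes that are not valid UTF-8:

bad_lines = []
File.foreach("data.csv", encoding: "BINARY").with_index(1) do |line, number|
  # Read the raw bytes, then ask whether they form a valid UTF-8 string.
  bad_lines << number unless line.dup.force_encoding("UTF-8").valid_encoding?
end
puts "Lines with invalid UTF-8: #{bad_lines.inspect}"

The same valid_encoding? check works on the strings already in your import array, if you prefer to scan those instead of the file.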

Strange character encoding issue

I have some data which has been imported into Postgres, for use in a Rails application. However somehow the foreign accents have become strangely encoded:
ä appears as âˆšÂ§
á appears as âˆšÂ°
é appears as âˆšÂ©
ó appears as âˆšâ‰¥
I'm pretty sure the problem is with the integrity of the data, rather than any problem with Rails. It doesn't seem to match any encoding I try:
# Replace "cp1252" with any other encoding, to no effect
"Trollâ§ttan".encode("cp1252").force_encoding("UTF-8") #-> junk
If anyone was able to identify what kind of encoding mixup I'm suffering from, that would be great.
As a last resort, I may have to manually replace each corrupted accent character, but if anyone can suggest a programmatic solution (or even a starting point for fixing this - I've found it very hard to debug), I'd be very grateful.
It's hardly possible with recent versions of PostgreSQL to have invalid UTF8 inside a UTF8 database. There are other plausible possibilities that may lead to that output, though.
In the typical case of é appearing as Ã©, one of the following is going on:
1. The contents of the database are valid, but some client-side layer is interpreting the bytes from the database as if they were iso-latin-something, whereas they are UTF8.
2. The contents are valid and the SQL client-side layer is valid, but the terminal/software/webpage with which you're looking at this is configured for iso-latin1 or a similar single-byte encoding (win1252, iso-latin9...).
3. The contents of the database consist of the wrong characters with a valid UTF8 encoding. This is what you end up with if you take iso-latin-something bytes, convert them to their UTF8 representation, then take the resulting byte stream as if it was still in iso-latin, and reconvert it once again to UTF8, and insert that into the database.
Note that while the Ã© sequence is typical in UTF8 versus iso-latin confusion, the presence of an additional â in all your sample strings is uncommon. It may be the result of another misinterpretation on top of the primary one. If you're in case #3, that may mean that an automated fix based on search-and-replace will be harder than the normal case, which is already tricky.
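A quick Ruby sketch of case 3, and of the reverse round trip that repairs the plain single-layer mix-up (with the extra layer suspected above, a real fix would need more detective work):

mangled = "é".encode("UTF-8").force_encoding("ISO-8859-1").encode("UTF-8")
# => "Ã©"  -- wrong characters, but perfectly valid UTF-8, so Postgres happily stores it

repaired = mangled.encode("ISO-8859-1").force_encoding("UTF-8")
# => "é"   -- convert back to Latin-1 bytes, then relabel that byte sequence as UTF-8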

Recommended column delimiter for Click stream data to consumed by SSIS

I am working with some click stream data and I need to give the vendor specifications for a preferred format to be consumed by SSIS.
As it's URL data in a text file, which column delimiter would you recommend? I was thinking pipe "|", but I realize that pipes can be used within a URL.
I did some testing with multiple characters as the delimiter, like |^|, but when creating a flat file connection there is no such option in SSIS - I had to type these characters in. And when I went to edit the flat file connection manager, it had changed to {|}^{|}. It just made me nervous, even though the import succeeded.
I just wanted to see if anybody has good ideas as to what would be a safe column delimiter to use.
Probably tab-delimited would be fairly safe, at least assuming that by "clickstream" you mean a list of URLs or something similar. But in theory any delimiter should be fine as long as the supplier quotes the data appropriately.
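To illustrate the quoting point, here is a sketch using Ruby's standard CSV library (the vendor's tooling isn't specified in the question, so this is only to show the mechanism): with a tab delimiter and forced quoting, even a URL containing both a pipe and a tab round-trips intact.

require "csv"

url  = "https://example.com/path?a=1|2\tb=3"             # hostile URL with a pipe and a tab
line = CSV.generate_line(["visitor-1", url], col_sep: "\t", force_quotes: true)
CSV.parse_line(line, col_sep: "\t")                       # => ["visitor-1", url] -- survives the round trip

On the consuming side, the SSIS flat file connection manager's Text qualifier property (set to the double quote) is the matching half of that arrangement.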

Unpacking ActiveRecord binary blob to hex string drops escape characters ("%25" converted to "%")

I have a Ruby-on-Rails app that accepts a binary file upload, stores it as an ActiveRecord object in a local database, and passes a hex equivalent of the binary blob to a back-end web service for processing. This usually works great.
Two days ago, I ran into a problem with a file containing the hex sequence \x25\x32\x35, %25 in ASCII. The binary representation of the file was stored properly in the database but the hex string representation of the file that resulted from
sample.binary.unpack('H*').to_s
was incorrect. After investigating, I found that those three bytes were converted to the hex string 25, the representation for %. It should have been 253235, the representation for %25.
It makes sense for Ruby or Rails or ActiveRecord to do this. %25 is the proper URL-encoded value for %. However, I need to turn off this optimization or validation or whatever it is. I need blob.unpack('H*') to include a hex equivalent for every byte of the blob.
One (inefficient) way to solve this is to store a hex representation of the file in the database. Grabbing the file directly from the HTTP POST request works fine:
params[:sample].read.unpack('H*').to_s
That stores the full 253235. Something about the round trip to the database (SQLite) or the HTTPClient POST from the front-end web service to the back-end web service (hosted within WEBrick) is causing the loss of fidelity.
Eager to hear any ideas, willing to try whatever to test out suggestions. Thanks.
This is a known issue with Rails and its SQLite adapter.
There is a bug filed in the old Rails ticketing system (with a patch):
https://rails.lighthouseapp.com/projects/8994/tickets/5040
And a new bug filed in the current Rails issue tracking system:
https://github.com/rails/rails/issues/2407
Any string that contains '%00' will be mangled when converting to binary and back. A binary that contains the string '%25' will be converted to '%', which is what you are seeing.
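If patching or upgrading isn't an option, one workaround sketch (not the fix from the tickets above; the column and model names here are hypothetical) is to bypass the adapter's binary escaping entirely by persisting a hex string in a text column:

class Sample < ActiveRecord::Base
  # Assumes a text column named binary_hex in place of the original binary column.
  def binary=(bytes)
    self.binary_hex = bytes.unpack("H*").first   # store the hex representation
  end

  def binary
    [binary_hex].pack("H*")                      # rebuild the raw bytes on read
  end
end

It costs roughly double the storage, but the value that reaches the back-end service is exactly the unpack('H*') string you already need.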
