Encoding error PostgreSQL 8.4 - ruby-on-rails

I am importing data from a CSV file. One of the fields has an accent (Telefónica O2 UK Limited). The application throws an error while inserting the data into the table:
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xf36e6963
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
: INSERT INTO "companies" ("name", "validated")
VALUES(E'Telef?nica O2 UK Limited', 't')
Data entry through the forms works when I enter names with accents and umlauts.
How do I work around this issue?
Edit
I addressed the issue by converting the file's encoding: I uploaded the CSV file to Google Docs and re-exported it as CSV.

The error message is pretty clear: your client_encoding setting is UTF8 and you are trying to insert a character that isn't encoded in UTF8 (if the CSV comes from MS Excel, the file is probably encoded in Windows-1252 instead).
You can either convert it in your application, or alter your PostgreSQL connection to match the encoding you want to insert (letting PostgreSQL do the conversion for you). You can do so by executing SET CLIENT_ENCODING TO 'WIN1252'; on your PostgreSQL connection before inserting the data. After the import, reset it to its original value with RESET CLIENT_ENCODING;.
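In a Rails app you could run those two statements on the ActiveRecord connection around the import. A minimal sketch, assuming the file really is Windows-1252; the companies.csv path and the Company model/columns are placeholders:

require 'csv'

conn = ActiveRecord::Base.connection
conn.execute("SET CLIENT_ENCODING TO 'WIN1252'")   # let PostgreSQL convert incoming bytes
begin
  CSV.foreach('companies.csv') do |row|            # hypothetical file name
    Company.create!(name: row[0], validated: true) # hypothetical model and columns
  end
ensure
  conn.execute('RESET CLIENT_ENCODING')            # restore the original setting
end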
HTH!

I think you can try the Ruby gem rchardet, which may be a better solution. Example code:
require 'rchardet'
cd = CharDet.detect(string_of_unknown_encoding)
encoding = cd['encoding']
converted_string = Iconv.conv('UTF-8', encoding, string_of_unknown_encoding)
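Note that Iconv is deprecated and no longer ships with modern Ruby, so on newer versions you can pair rchardet with String#encode instead. A small sketch along the same lines:

require 'rchardet'

cd = CharDet.detect(string_of_unknown_encoding)   # guess the source encoding
converted_string = string_of_unknown_encoding.encode('UTF-8', cd['encoding'])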
Here are some related links:
https://github.com/jmhodges/rchardet
http://www.meeho.net/blog/2010/03/ruby-how-to-detect-the-encoding-of-a-string/

Related

Sybase ASE: show the character encoding used by the database

I am working on a Sybase ASE database and would like to know the character encoding (UTF8 or ASCII or whatever) used by the database.
What's the command to show which character encoding the database uses?
The command you're looking for is actually a system stored procedure:
1> sp_helpsort
2> go
... snip ...
Sort Order Description
------------------------------------------------------------------
Character Set = 190, utf8
Unicode 3.1 UTF-8 Character Set
Class 2 Character Set
Sort Order = 50, bin_utf8
Binary sort order for the ISO 10646-1, UTF-8 multibyte encoding character set (utf8).
... snip ...
From this output we see that this particular ASE dataserver has been configured with a default character set of utf8 and a default sort order of binary (bin_utf8). This means all data is stored as utf8 and all indexing/sort operations are performed using a binary sort order.
Keep in mind ASE can perform character set conversions (for reads and writes) based on the client's character set configuration, though the success of those conversions depends on the character sets involved (e.g., a client connecting with utf8 may find many characters cannot be converted for storage in a dataserver configured with a default character set of iso_1).
With a query:
select
    cs.name        as server_character_set,
    cs.description as character_set_description
from
    master..syscharsets cs
    left outer join master..sysconfigures cfg
        on cs.id = cfg.value
where
    cfg.config = 131
Example output:
server_character_set    character_set_description
utf8                    Unicode 3.1 UTF-8 Character Set

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API response is "manhã" but the API returns "manh�". Other chars like á or ç are returned correctly; I guess this is a bug in the API.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track: you should be able to solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times when there are invalid characters in the source encoding; to avoid exceptions, you can instruct Ruby how to handle them:
# This transcodes the string to UTF-8 and replaces any invalid/undefined characters with '' (empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
Note, many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string with the specified encoding; it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can use force_encoding as part of a transcode if you write response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first encode example above).
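To see the difference concretely, here is a small irb-style sketch (it assumes the raw bytes really are Latin-1; 0xE3 is "ã" in ISO-8859-1):

s = "manh\xE3".force_encoding('ASCII-8BIT')      # raw Latin-1 bytes, tagged as binary

s.force_encoding('UTF-8').valid_encoding?        # => false: the bytes were only relabelled
s.force_encoding('ISO-8859-1').encode('UTF-8')   # => "manhã": the bytes were actually transcoded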

MemSQL load data infile does not support hexadecimal delimiter

From this, MySQL's load data infile command works well with a hexadecimal delimiter like X'01' or X'1e' in my case. But the same load data infile command can't be run on MemSQL.
I tried specifying various forms of the same delimiter \x1e, like:
'0x1e' or 0x1e
X'1e'
'\x1e' or 'x1e'
None of the above work; they throw either a syntax error or another error like the one below.
Here it looks like the delimiter isn't resolved correctly:
mysql> load data local infile '/container/data/sf10/region.tbl.hex' into table REGION CHARACTER SET utf8 fields terminated by '\x1e' lines terminated by '\n';
ERROR 1261 (01000): Row 1 doesn't contain data for all columns
This is syntax error:
mysql> load data local infile '/container/data/sf10/region.tbl.hex' into table REGION CHARACTER SET utf8 fields terminated by 0x1e lines terminated by '\n';
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '0x1e lines terminated by '\n'' at line 1
mysql>
The data is actually delimited by the non-printable hexadecimal character \x1e and lines are terminated by a regular \n. Using cat -A you can see the delimiter characters as ^^, so the delimiter should be correct.
$ cat -A region.tbl.hex
0^^AFRICA^^lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to $
1^^AMERICA^^hs use ironic, even requests. s$
Is there a correct way to use hex values as a delimiter? I can't find such information in the documentation.
For comparison, the hex delimiter (0x1e) works fine on MySQL:
mysql> load data local infile '/tmp/region.tbl.hex' into table region CHARACTER SET utf8 fields terminated by 0x1e lines terminated by '\n';
Query OK, 5 rows affected (0.01 sec)
Records: 5 Deleted: 0 Skipped: 0 Warnings: 0
MemSQL added support for hex delimiters in 6.7, in the form shown in the last code block in your question. Prior to that, you would need the literal quoted 0x1e character in your SQL string, which is annoying to do from a CLI. If you're on an older version, you may need to upgrade.
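If upgrading isn't an option, one workaround for the CLI problem is to build the statement in a script, where the literal 0x1e byte can be embedded inside the quoted delimiter. A rough sketch using the mysql2 gem (host, credentials, and file/table names are placeholders):

require 'mysql2'

client = Mysql2::Client.new(host: 'memsql-host', username: 'root',
                            database: 'tpch', local_infile: true)

# "\x1e" embeds the literal record-separator byte inside the quoted delimiter
sql = "load data local infile '/container/data/sf10/region.tbl.hex' " \
      "into table REGION character set utf8 " \
      "fields terminated by '\x1e' lines terminated by '\\n'"

client.query(sql)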

importing csv into database returns invalid byte

I'm trying to import a CSV file into my database but I get this error:
PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xe2 0x80 0x22
How do I go about fixing this? The CSV comes from an external provider.
You need to find out which encoding your CSV file uses. Ask the provider of the file, or just experiment with an editor that can switch encodings.
Then you just need to convert the string before parsing it with CSV. For example, if it is ISO-8859-15 (Windows Western Europe with Euro), you can convert the string like this:
def convert_iso(st)
  st.force_encoding('iso-8859-15').encode('utf-8')
end
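Wired into a CSV import it could look like this (a minimal sketch; the import.csv name and the ISO-8859-15 assumption are placeholders):

require 'csv'

raw = File.binread('import.csv')   # read raw bytes without guessing an encoding
CSV.parse(convert_iso(raw), headers: true) do |row|
  # insert the row into the database here
end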

Ruby Gem randomly returns Encoding Error

So I forked this gem on GitHub, thinking that I may be able to fix and update some of the issues with it for use in a Rails project. I basically get this output:
irb(main):020:0> query = Query::simpleQuery('xx.xxx.xxx.xx', 25565)
=> [false, #<Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT>]
irb(main):021:0> query = Query::simpleQuery('xx.xxx.xxx.xx', 25565)
=> {:motd=>"Craftnet", :gametype=>"SMP", :map=>"world", :numplayers=>"0", :maxplayers=>"48"}
The first response is an example of the encoding error, and the second is the wanted output (IPs redacted). Basically this is querying a Minecraft server for information about it.
I tried using
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
But that just gave the same behavior, raising encoding errors seemingly at random.
Here is the relevant GitHub repo with all the code: RubyMinecraft
Any help would be greatly appreciated.
In the Query class there is this line:
@key = Array(key).pack('N')
This creates a String with an associated encoding of ASCII-8BIT (i.e. it's a binary string).
Later @key gets used in this line:
query = @sock.send("\xFE\xFD\x00\x01\x02\x03\x04" + @key, 0)
In Ruby 2.0 the default encoding of String literals is UTF-8, so this is combining a UTF-8 string with a binary one.
When Ruby tries to do this it first checks whether the binary string contains only 7-bit values (i.e. all bytes are less than or equal to 127, with the high bit being 0). If it does, Ruby considers it compatible with UTF-8 and combines them without further issue. If it doesn't (i.e. it contains bytes greater than 127), the two strings are not compatible and an Encoding::CompatibilityError is raised.
Whether an error is raised depends on the contents of @key, which is initialized from a response from the server. Sometimes this value happens to contain only 7-bit values, so no error is raised; at other times there is a byte with the high bit set, so it raises an error. This is why the errors appear to be "random".
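You can reproduce the behavior in irb. A small sketch using the same packet prefix: the concatenation only fails when the binary string contains a byte with the high bit set:

prefix = "\xFE\xFD\x00\x01\x02\x03\x04"          # UTF-8 literal containing bytes above 127

prefix + "abc".force_encoding('ASCII-8BIT')      # works: the binary part is 7-bit only
prefix + "\x80".force_encoding('ASCII-8BIT')     # raises Encoding::CompatibilityError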
To fix it you can specify that the string literal on the line where the two strings are combined should be treated as binary. The simplest way is to use force_encoding like this:
query = @sock.send("\xFE\xFD\x00\x01\x02\x03\x04".force_encoding(Encoding::ASCII_8BIT) + @key, 0)
