Migrating a MySQL 4 database with latin1 charset to MySQL 5 with utf8 - character-encoding

I have an old MySQL 4 database with the latin1 character set and content in Cyrillic that I need to migrate to MySQL 5 with utf8. When I make the mysqldump I see strange characters and I cannot make a proper restore.
Any help?

You need to make a dump with the parameter --default-character-set set to the character set of the source database (latin1), change the charset of the db and tables in the dump file, and then restore it with the charset of the target database (utf8).
Here is the procedure:
http://itworkarounds.blogspot.com/2011/07/mysql-database-migration-and-character.html
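A minimal sketch of that procedure, assuming a source database olddb and a target database newdb (both names are placeholders):

# Dump using the source database's character set so the bytes are not re-interpreted
mysqldump -u root -p --default-character-set=latin1 olddb > dump.sql

# Rewrite the charset declarations in the dump file (GNU sed shown)
sed -i 's/CHARSET=latin1/CHARSET=utf8/g' dump.sql

# Restore with the target database's character set
mysql -u root -p --default-character-set=utf8 newdb < dump.sql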

Related

Sybase ASE: show the character encoding used by the database

I am working on a Sybase ASE database and would like to know the character encoding (UTF8 or ASCII or whatever) used by the database.
What's the command to show which character encoding the database uses?
The command you're looking for is actually a system stored procedure:
1> sp_helpsort
2> go
... snip ...
Sort Order Description
------------------------------------------------------------------
Character Set = 190, utf8
Unicode 3.1 UTF-8 Character Set
Class 2 Character Set
Sort Order = 50, bin_utf8
Binary sort order for the ISO 10646-1, UTF-8 multibyte encoding character set (utf8).
... snip ...
From this output we see this particular ASE dataserver has been configured with a default character set of utf8 and default sort order of binary (bin_utf8). This means all data is stored as utf8 and all indexing/sort operations are performed using a binary sort order.
Keep in mind ASE can perform character set conversions (for reads and writes) based on the client's character set configuration, though the success of those conversions will depend on the character sets in question (e.g., a client connecting with utf8 may find many characters cannot be converted for storage in a dataserver defined with a default character set of iso_1).
With a query:
select
    cs.name        as server_character_set,
    cs.description as character_set_description
from
    master..syscharsets cs
    left outer join master..sysconfigures cfg
        on cs.id = cfg.value
where
    cfg.config = 131
Example output:
server_character_set    character_set_description
utf8                    Unicode 3.1 UTF-8 Character Set

MemSQL load data infile does not support hexadecimal delimiter

MySQL's load data infile command works well with a hexadecimal delimiter like X'01', or X'1e' in my case. But the same load data infile command can't be run on MemSQL.
I tried specifying various forms of the same delimiter \x1e, like:
'0x1e' or 0x1e
X'1e'
'\x1e' or 'x1e'
All of the above fail with either a syntax error or some other error.
This one looks like the delimiter can't be resolved correctly:
mysql> load data local infile '/container/data/sf10/region.tbl.hex' into table REGION CHARACTER SET utf8 fields terminated by '\x1e' lines terminated by '\n';
ERROR 1261 (01000): Row 1 doesn't contain data for all columns
This is syntax error:
mysql> load data local infile '/container/data/sf10/region.tbl.hex' into table REGION CHARACTER SET utf8 fields terminated by 0x1e lines terminated by '\n';
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '0x1e lines terminated by '\n'' at line 1
mysql>
The data is actually delimited by the non-printable hexadecimal character \x1e, with lines terminated by a regular \n. Using cat -A, the delimiter characters show up as ^^, so the delimiter itself should be correct.
$ cat -A region.tbl.hex
0^^AFRICA^^lar deposits. blithely final packages cajole. regular waters are final requests. regular accounts are according to $
1^^AMERICA^^hs use ironic, even requests. s$
Is there a correct way to use hex values as a delimiter? I can't find such information in the documentation.
For the purpose of comparison, hex delimiter (0x1e) can work well on MySQL:
mysql> load data local infile '/tmp/region.tbl.hex' into table region CHARACTER SET utf8 fields terminated by 0x1e lines terminated by '\n';
Query OK, 5 rows affected (0.01 sec)
Records: 5 Deleted: 0 Skipped: 0 Warnings: 0
MemSQL supports hex delimiters as of version 6.7, of the form in the last code block in your question. Prior to that, you would need the literal quoted 0x1e character in your SQL string, which is annoying to do from a CLI. If you're on an older version you may need to upgrade.
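On versions before 6.7, one way to get the literal 0x1e byte into the SQL string from a CLI is shell command substitution; a sketch, reusing the file and table from the question (the host name is a placeholder):

# printf emits the raw 0x1e byte, which the shell splices into the quoted delimiter
mysql -h memsql-host -u root --local-infile -e \
"load data local infile '/container/data/sf10/region.tbl.hex' into table REGION CHARACTER SET utf8 fields terminated by '$(printf '\x1e')' lines terminated by '\n';"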

How to store Japanese kanji character in Oracle database

This Japanese character 𠮷, which takes four bytes in UTF-8, is saved as ???? in an Oracle database, whereas other Japanese characters are saved properly.
The configuration in boot.rb of my rails application contains:
ENV['NLS_LANG'] = 'AMERICAN_AMERICA.UTF8'
and SQL Developer for the Oracle DB shows:
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CHARACTERSET JA16SJISTILDE
The datatype of the column is NVARCHAR2.
Try NLS_LANG=AMERICAN_AMERICA.AL32UTF8
The Oracle character set UTF8 is actually CESU-8, whereas AL32UTF8 is the commonly known UTF-8.
If you stay within the Basic Multilingual Plane (BMP), then UTF8 and AL32UTF8 are equal; however, for characters above U+FFFF they differ.
𠮷 is U+20BB7, which is in the Supplementary Ideographic Plane.
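To see which character sets the database actually uses (an NVARCHAR2 column follows the national character set, not NLS_CHARACTERSET), a quick check is:

-- NLS_CHARACTERSET covers VARCHAR2/CHAR; NLS_NCHAR_CHARACTERSET covers NVARCHAR2/NCHAR
select parameter, value
from nls_database_parameters
where parameter in ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');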

Delphi XE, Firebird and UTF8

I'm upgrading a D7 program to XE, and under Delphi 7 I had code like this...
ParamByName ('Somefield').AsString:=someutf8rawbytestring;
Under XE, if someutf8rawbytestring contains Unicode characters such as Cyrillic script, they appear as ???? in the DB.
I see that someutf8rawbytestring is 8 characters long for my 4-character string, which is correct. But in the DB there are just four characters.
I'm using Firebird 2 through TIBQuery with XE, updating a VARCHAR field with character set NONE.
So it looks like the utf8 is being detected and somehow converted back to Unicode code points, and that conversion is then failing on the way to the DB. I've tried setting the VARCHAR field to UTF8 encoding, but with the same result.
So how should this be handled?
EDIT: I can use a database tool to put some non-ASCII data in my DB field, and when I read it back it comes back as a utf8-encoded string that I can run UTF8Decode on, and it's correct. But writing data back to this field seems impossible without getting a bunch of ???? in the DB. I've tried ParamByName ('Somefield').AsString:=somewidestring; and ParamByName ('Somefield').AsWideString:=somewidestring; and I just get rubbish in the DB...
EDIT2: Here's the code (in one iteration) ...
procedure TFormnameEdit.savename(id : integer);
begin
  with DataModule.UpdateNameQuery do begin
    ParamByName('Name').AsString := UTF8Encode(NameEdit.Text);
    ParamByName('ID').AsInteger := id;
    ExecSQL;
    Transaction.Commit;
  end;
end;
As #Lightbulb recommended, adding lc_ctype=UTF8 to the TIBDatabase params solved the problem.
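A minimal sketch of that fix, assuming the TIBDatabase component is named Database (the name is a placeholder):

// Tell Firebird which character set this client uses, so string
// parameters are transliterated correctly on the way to the server
Database.Params.Add('lc_ctype=UTF8');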

Encoding error PostgreSQL 8.4

I am importing data from a CSV file. One of the fields has an accent (Telefónica O2 UK Limited). The application throws an error while inserting the data into the table.
PGError: ERROR: invalid byte sequence for encoding "UTF8": 0xf36e6963
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
: INSERT INTO "companies" ("name", "validated")
VALUES(E'Telef?nica O2 UK Limited', 't')
The data entry through the forms works when I enter names with accents and umlaut.
How do I workaround this issue?
Edit
I addressed the issue by converting the file's encoding: I uploaded the CSV file to Google Docs and exported it back to CSV.
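An equivalent command-line conversion, assuming the CSV is Windows-1252 (typical for files saved from MS Excel; the file name is a placeholder):

# Convert the CSV to UTF-8 before importing
iconv -f WINDOWS-1252 -t UTF-8 companies.csv > companies-utf8.csv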
The error message is pretty clear: your client_encoding setting is UTF8 and you are trying to insert a character which isn't encoded in UTF8 (if it's a CSV from MS Excel, your file is probably encoded in Windows-1252 instead).
You could either convert it in your application, or alter your PostgreSQL connection to match the encoding you want to insert (thus letting PostgreSQL do the conversion for you). You can do so by executing SET CLIENT_ENCODING TO 'WIN1252'; on your PostgreSQL connection before trying to insert the data. After the import you should reset it to its original value with RESET CLIENT_ENCODING;
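A sketch of that session-level approach (the INSERT is the one from the question):

-- Declare that this client sends Windows-1252 bytes
SET CLIENT_ENCODING TO 'WIN1252';
-- ... run the CSV import, e.g. ...
INSERT INTO "companies" ("name", "validated") VALUES (E'Telefónica O2 UK Limited', 't');
-- Restore the original setting afterwards
RESET client_encoding;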
HTH!
I think you can try the Ruby gem rchardet, which may be a better solution. Example code:
require 'rchardet'
require 'iconv'
cd = CharDet.detect(string_of_unknown_encoding)
encoding = cd['encoding']
converted_string = Iconv.conv('UTF-8', encoding, string_of_unknown_encoding)
Here are some related links:
https://github.com/jmhodges/rchardet
http://www.meeho.net/blog/2010/03/ruby-how-to-detect-the-encoding-of-a-string/
