The Japanese character 𠮷, which takes four bytes in UTF-8, is saved as ???? in an Oracle database, whereas other Japanese characters are saved properly.
The configuration in boot.rb of my rails application contains:
ENV['NLS_LANG'] = 'AMERICAN_AMERICA.UTF8'
and SQL Developer shows these settings for the Oracle database:
NLS_LANGUAGE AMERICAN
NLS_TERRITORY AMERICA
NLS_CHARACTERSET JA16SJISTILDE
The datatype of the column is NVARCHAR2.
Try NLS_LANG=AMERICAN_AMERICA.AL32UTF8
Oracle's character set UTF8 is actually CESU-8, whereas AL32UTF8 is what is commonly known as UTF-8.
If you stay within the Basic Multilingual Plane (BMP), UTF8 and AL32UTF8 are identical; they differ only for characters above U+FFFF.
𠮷 is U+20BB7, which is in the Supplementary Ideographic Plane.
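To make the difference concrete, here is a minimal Python sketch (not Oracle code). `cesu8_encode` is a hypothetical helper of mine; it assumes CESU-8 means "encode each UTF-16 code unit as its own UTF-8 sequence", so a supplementary character becomes two 3-byte sequences instead of one 4-byte sequence.

```python
import struct

def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit gets its own UTF-8 sequence."""
    utf16 = text.encode("utf-16-be")
    units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
    # 'surrogatepass' lets Python encode lone surrogates, which strict UTF-8 forbids
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

ch = "\U00020BB7"  # the character from the question
print(ch.encode("utf-8").hex())   # f0a0aeb7 (4 bytes, real UTF-8)
print(cesu8_encode(ch).hex())     # eda182edbeb7 (6 bytes, CESU-8)
# Inside the BMP both encodings give the same bytes
print(cesu8_encode("\u3042") == "\u3042".encode("utf-8"))  # True
```

A client that announces one of these encodings while the data is in the other will only go wrong on characters above U+FFFF, which would match the symptom in the question.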
I am working on a Sybase ASE database and would like to know the character encoding (UTF8 or ASCII or whatever) used by the database.
What's the command to show which character encoding the database uses?
The command you're looking for is actually a system stored procedure:
1> sp_helpsort
2> go
... snip ...
Sort Order Description
------------------------------------------------------------------
Character Set = 190, utf8
Unicode 3.1 UTF-8 Character Set
Class 2 Character Set
Sort Order = 50, bin_utf8
Binary sort order for the ISO 10646-1, UTF-8 multibyte encodin
g character set (utf8).
... snip ...
From this output we see this particular ASE dataserver has been configured with a default character set of utf8 and default sort order of binary (bin_utf8). This means all data is stored as utf8 and all indexing/sort operations are performed using a binary sort order.
Keep in mind that ASE can perform character set conversions (for reads and writes) based on the client's character set configuration. Whether those conversions succeed depends on the character sets in question (e.g., a client connecting with utf8 may find that many characters cannot be converted for storage in a dataserver defined with a default character set of iso_1).
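The conversion limit itself is not specific to ASE. As a rough illustration only, this Python sketch uses Python's `latin-1` codec as a stand-in for iso_1 (my assumption that both mean ISO 8859-1); `convertible` is a hypothetical helper name, not an ASE function.

```python
def convertible(text: str, codec: str) -> bool:
    """Return True if every character in text exists in the target character set."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

print(convertible("caf\u00e9", "latin-1"))  # True: e-acute exists in ISO 8859-1
print(convertible("\u3042", "latin-1"))     # False: Japanese hiragana does not
print(convertible("\u3042", "utf-8"))       # True: UTF-8 covers all of Unicode
```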
With a query:
select
cs.name as server_character_set,
cs.description as character_set_description
from
master..syscharsets cs left outer join
master..sysconfigures cfg on
cs.id = cfg.value
where
cfg.config = 131
Example output:
server_character_set character_set_description
utf8 Unicode 3.1 UTF-8 Character Set
Our application automatically modifies the layout of Arabic text when it is followed by a bracket and I was wondering whether this was the correct behaviour or not?
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left to right languages, but for Arabic languages the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker and I need to know if this is actually correct. Is the Arabic format the equivalent of the English format or has something gone horribly wrong?
The text in Arabic should be shown like:
ستاكوفيرفلوو(1.0)
I added the HTML entity for RLM (Right-to-Left Mark) to fix the text. You should do the same if your application doesn't support bidi natively. You can add the RLM in any of these ways:
HTML Entity (decimal) &#8207;
HTML Entity (hex) &#x200F;
HTML Entity (named) &rlm;
How to type in Microsoft Windows Alt +200F
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)
UTF-8 (binary) 11100010:10000000:10001111
UTF-16 (hex) 0x200F (200f)
UTF-16 (decimal) 8,207
UTF-32 (hex) 0x0000200F (200f)
UTF-32 (decimal) 8,207
C/C++/Java source code "\u200F"
Python source code u"\u200F"
(note: StackOverflow right transliteration is ستاك-أوفرفلو)
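If the labels are built in code, the mark can be appended there. The Python sketch below is only an illustration: `format_label` and `contains_rtl` are hypothetical names, and placing the RLM right after the closing bracket is my assumption, since the answer does not say exactly where it was inserted.

```python
import unicodedata

RLM = "\u200F"  # RIGHT-TO-LEFT MARK

def contains_rtl(text: str) -> bool:
    """True if any character has a strong right-to-left bidi class (Hebrew, Arabic, ...)."""
    return any(unicodedata.bidirectional(c) in ("R", "AL") for c in text)

def format_label(structure_id: str, version: str) -> str:
    """Build 'ID(version)' and add an RLM after the bracket for right-to-left IDs."""
    label = f"{structure_id}({version})"
    # Assumption: the mark goes after the closing bracket so the bracket joins the RTL run
    return label + RLM if contains_rtl(structure_id) else label

print(repr(format_label("stackoverflow", "1.5")))        # 'stackoverflow(1.5)'
print(repr(format_label("\u0633\u062a\u0627\u0643", "1.0")))  # ends with '\u200f'
```

Left-to-right IDs are returned unchanged, so the English format is not affected.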
I have a PDF with the following mapping:
<019A> <0074>
<039E> <00A9>
<019F> <00740069>
<01B5> <0075>
<01C0> <0076>
<01C7> <0079>
<03EC> <0030>
In this mapping, CID <019F> represents the ligature ti.
Its target <00740069> is \u0074 (t) followed by \u0069 (i), hence the ligature ti.
How do I get the actual Unicode code point of the ligature? Or do I have to keep track of such patterns and replace the CID mapping with the ligature's actual Unicode character myself?
Thanks.
Essentially, you cannot assume that a character code maps to only one Unicode character. You will have to output both characters. A mapping can even contain more than two Unicode characters; some fonts have ligatures for "ffl" as well.
Note also that the Unicode specification has dedicated single-character code points for some ligatures: https://en.wikipedia.org/wiki/Typographic_ligature
It's possible that these special ligature characters are used in the mapping.
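As a minimal sketch of reading such a mapping, assuming each line is a `<source> <target>` pair and the target is a run of UTF-16BE code units, the Python below decodes every target into a string of one or more characters. `parse_mapping` is a hypothetical helper, not part of any PDF library.

```python
import re
import unicodedata

PAIR = re.compile(r"\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*")

def parse_mapping(lines):
    """Map each character code to its Unicode string (which may be several characters)."""
    mapping = {}
    for line in lines:
        m = PAIR.fullmatch(line)
        if m:
            # The target hex string is treated as UTF-16BE, so <00740069> becomes "ti"
            mapping[int(m.group(1), 16)] = bytes.fromhex(m.group(2)).decode("utf-16-be")
    return mapping

table = parse_mapping(["<019A> <0074>", "<039E> <00A9>", "<019F> <00740069>"])
print(table[0x019F])       # ti
print(len(table[0x019F]))  # 2

# A dedicated ligature code point, if one is used, decomposes to plain letters:
print(unicodedata.normalize("NFKC", "\uFB04"))  # ffl
```

Storing strings rather than single code points means a ligature needs no special case during text extraction.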
I have an old MySQL 4 database with the latin1 character set and content in Cyrillic that I need to migrate to MySQL 5 with utf8. When I make the mysqldump I see strange characters and I can't make a proper recovery.
Any help?
You need to make the dump with the parameter --default-character-set set to the charset of the source database (latin1), change the charset of the db and tables in the dump text file, and then restore it with the charset of the target database (utf8).
The procedure is described here:
http://itworkarounds.blogspot.com/2011/07/mysql-database-migration-and-character.html
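Why dumping as latin1 helps can be shown with a small Python sketch. It assumes the latin1-labelled column really holds UTF-8 bytes (the question does not say which encoding the Cyrillic bytes are in), and it uses Python's `latin-1` codec only as a byte-preserving stand-in; MySQL's latin1 is not exactly the same thing. `recover` is a hypothetical helper.

```python
def recover(garbled: str, real_encoding: str) -> str:
    """Undo a wrong latin1 decode: get the original bytes back, then decode them properly."""
    # latin-1 maps every byte 0x00-0xFF to one code point, so this round trip is lossless
    return garbled.encode("latin-1").decode(real_encoding)

original = "Привет"
# Simulate what a latin1-labelled column shows when it really holds UTF-8 bytes
garbled = original.encode("utf-8").decode("latin-1")
print(garbled != original)                    # True: these are the "strange characters"
print(recover(garbled, "utf-8") == original)  # True: the bytes were never damaged
```

If the stored bytes turn out to be in another encoding such as cp1251, the same idea applies with that name passed as `real_encoding`, and the dump file would need converting before the utf8 restore.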
We have different types of hyphens/dashes (in some text) stored in the db. Before comparing them with user input text, I have to normalize every type of dash/hyphen to a simple hyphen/minus (ASCII 45).
The possible dashes we have to convert are:
Minus (−) U+2212: &minus;, &#8722; or &#x2212;
Hyphen-minus (-) U+002D: &#45;
Hyphen (‐) U+2010
Soft hyphen U+00AD
Non-breaking hyphen U+2011: &#8209;
Figure dash (‒) U+2012 (8210): &#x2012; or &#8210;
En dash (–) U+2013 (8211): &#8211;, &ndash; or &#x2013;
Em dash (—) U+2014 (8212): &#8212;, &mdash; or &#x2014;
Horizontal bar (―) U+2015 (8213): &#8213; or &#x2015;
These all have to be converted to Hyphen-minus(-) using gsub.
I used the CharDet gem to detect the character encoding of the fetched string; it reports windows-1252. I tried Iconv to convert the encoding to ASCII, but it throws an Iconv::IllegalSequence exception.
ruby -v => ruby 1.8.7 (2009-06-12 patchlevel 174) [i686-darwin9.8.0]
rails -v => Rails 2.3.5
mysql encoding => 'latin1'
Any idea how to accomplish this?
Caveat: I know nothing about Ruby, but you have problems that have nothing to do with the programming language you are using.
You don't need to convert Hyphen-minus (-) U+002D to a simple hyphen/minus (ASCII 45); they're the same thing.
You believe that the database encoding is latin1. The statement "My data is encoded in ISO-8859-1 aka latin1" is up there with "The check is in the mail" and "Of course I'll still love you in the morning". All it tells you is that it is a single-byte-per-character encoding.
Presuming that "fetched string" means "byte string extracted from the database", chardet is very likely quite right in reporting windows-1252 aka cp1252 -- however this may be by accident as chardet sometimes seems to report that as a default when it has exhausted other possibilities.
(a) These Unicode characters cannot be encoded in latin1 or cp1252 or ascii:
Minus (−) U+2212: &minus;, &#8722; or &#x2212;
Hyphen (‐) U+2010
Non-breaking hyphen U+2011: &#8209;
Figure dash (‒) U+2012 (8210): &#x2012; or &#8210;
Horizontal bar (―) U+2015 (8213): &#8213; or &#x2015;
What gives you the impression that they may possibly appear in the input or in the database?
(b) These Unicode characters can be encoded in cp1252 but not in latin1 or ascii:
En dash (–) U+2013 (8211): &#8211;, &ndash; or &#x2013;
Em dash (—) U+2014 (8212): &#8212;, &mdash; or &#x2014;
These (most likely the EN DASH) are what you really need to convert to an ascii hyphen/dash. What was in the string that chardet reported as windows-1252?
(c) This can be encoded in cp1252 and latin1 but not in ascii:
Soft Hyphen U+00AD
If a string contains non-ASCII characters, any attempt (using iconv or any other method) to convert it to ascii will fail, unless you use some kind of "ignore" or "replace with ?" option. Why are you trying to do that?
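The order of operations suggested by the points above is: decode the bytes first, replace the dash characters second, and only then worry about ASCII. Here is a minimal sketch in Python rather than Ruby; `re.sub` plays the role of gsub, `normalize_dashes` is a hypothetical name, and it assumes the stored bytes really are cp1252, as chardet reported.

```python
import re

# Every dash-like character from the question except hyphen-minus itself
DASH_RE = re.compile("[\u2212\u2010\u00AD\u2011\u2012\u2013\u2014\u2015]")

def normalize_dashes(text: str) -> str:
    """Replace each Unicode dash/hyphen variant with ASCII hyphen-minus (U+002D)."""
    return DASH_RE.sub("-", text)

raw = b"pages 10\x9620 \x97 draft"  # 0x96 and 0x97 are en dash and em dash in cp1252
text = raw.decode("cp1252")         # decode first (assumption: the data is cp1252)
clean = normalize_dashes(text)
print(clean)                        # pages 10-20 - draft
print(clean.encode("ascii"))        # works only because no non-ASCII characters remain
```

If other non-ASCII characters can appear in the data, the final `encode("ascii")` would still fail, which is the reason for asking why ASCII is needed at all.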