Does the character encoding ISO-8859-13 cover extinct Baltic languages? - character-encoding

I was wondering whether the character encoding ISO-8859-13 covers extinct Baltic languages like Golyad', Semigallian, Selonian, Old Curonian, and the Western Baltic languages.

In addition to what @McDowell said, I'd recommend using UTF-8, just in case those languages need characters that ISO-8859-13 doesn't cover.
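As a minimal illustration (a Python sketch of my own, not from the original answer - the sample characters are only guesses at what such languages might need), you can probe whether a given character is representable in ISO-8859-13 and fall back to UTF-8 when it is not:

# Probe whether individual characters are representable in ISO-8859-13.
# The sample characters are illustrative only; substitute whatever letters
# the language you care about actually requires.
samples = ["ā", "ē", "ž", "ś", "ǫ"]
for ch in samples:
    try:
        ch.encode("iso8859_13")
        print(repr(ch), "is representable in ISO-8859-13")
    except UnicodeEncodeError:
        print(repr(ch), "is not in ISO-8859-13 - use UTF-8")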

Related

Why choose EUC-JP over UTF-8 or SHIFT-JIS?

I've been working with a Japanese company who chooses to encode our files with EUC-JP.
I've been curious for quite a while now and have tried asking superiors why EUC-JP over Shift-JIS or UTF-8, but I only get answers like "it's convention" or similar.
Do you know why the initial coders might have chosen EUC-JP over other character encoding?
Unlike Shift-JIS, EUC-JP is ASCII-safe: any byte whose eighth bit is zero is ASCII. It was also historically popular on Unix variants. Either of these things could have been an important factor a long time ago, before UTF-8 was generally adopted. Check the Wikipedia article for more details.
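To see the difference concretely, here is a small Python sketch (my own illustration, not part of the original answer) that encodes one katakana character both ways; only the Shift-JIS output contains a byte in the ASCII range (0x5C, the backslash):

# In EUC-JP every byte of a multibyte character has its eighth bit set,
# so no byte of a Japanese character can be mistaken for an ASCII byte.
# In Shift-JIS the trail byte can fall into the ASCII range, e.g. 0x5C ("\").
text = "ソ"  # katakana SO, the classic example of the 0x5C problem
for codec in ("euc_jp", "shift_jis"):
    data = text.encode(codec)
    print(codec, data.hex(), ["ASCII range" if b < 0x80 else "high bit set" for b in data])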

How to distinguish whether a WCHAR is Chinese, Japanese or ASCII?

For example, in this Delphi code:
var
  wchar_IsASCii: array[0..1] of WCHAR;
begin
  wchar_IsASCii[0] := 'A';
  wchar_IsASCii[1] := 'じ';
end;
How can I tell that wchar_IsASCii[0] belongs to ASCII and that wchar_IsASCii[1] does not?
Actually, I only need to know whether a Unicode character belongs to ASCII - that's all I mean by distinguishing whether a WCHAR is Chinese, Japanese or ASCII.
I don't know Delphi, but what I can tell you is that you need to determine what range the character falls into in Unicode. Here is a link about finding CJK characters in Unicode: What's the complete range for Chinese characters in Unicode?
Unless Delphi has some nice library for distinguishing Chinese and Japanese characters, you're going to have to determine that yourself. Here is a good answer on SO for how to do that:
Testing for Japanese/Chinese Characters in a string
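As a language-agnostic sketch of the range checks described above (written in Python; the same code-point comparisons apply to the value Delphi's Ord or ConvertToUtf32 gives you), a rough classification could look like this:

# Classify a character by the Unicode block its code point falls into.
def classify(ch):
    cp = ord(ch)
    if cp <= 0x7F:
        return "ASCII"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana (Japanese)"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana (Japanese)"
    if 0x4E00 <= cp <= 0x9FFF:
        return "CJK Unified Ideograph (Chinese, Japanese or Korean)"
    return "something else"

for ch in ("A", "じ", "カ", "漢"):
    print(ch, classify(ch))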
The problem is... what do you mean by ASCII? The original ASCII standard is a 7-bit code - it does not even fill a byte.
Then if you come to so-called "extended ASCII" - one-byte items - the upper half can be next to anything. It can be Greek on one machine, European diacritics on another, Cyrillic on a third... and so on.
So I think that if all you need is to test whether you have a 7-bit ASCII character - ruling out the extended characters from the French, German, Spanish and Scandinavian alphabets - then, since Unicode was designed as yet another superset of ASCII, what you need is to check that (Ord(charVar) >= 0) and (Ord(charVar) <= $7F).
However, if you really need to tell languages apart - if you consider Greek and Cyrillic somewhat ASCII-like but the Japanese alphabets (there are two, by the way: Hiragana and Katakana) not, or if you consider French and German more or less ASCII-like but Russian not - you would have to look at the Unicode ranges.
http://www.unicode.org/charts/index.html
To obtain the 32-bit code point (UCS-4) you can use http://docwiki.embarcadero.com/Libraries/XE3/en/System.Character.ConvertToUtf32
There are also the near-standard IBM Classes for Unicode (ICU), but it seems that no good Delphi translation exists: see Has anyone used ICU with Delphi?
You can use the Jedi Code Library, but its tables (the comments contradict each other) are from either Unicode 4.1 or 5.0, not from the current 6.2 - though for Japanese, version 5.0 should be enough.
http://wiki.delphi-jedi.org/wiki/JCL_Help:TUnicodeBlock
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockFromChar
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockName#TUnicodeBlock
You can also use the Microsoft MLang interface to query Internet-style language codes (RFC 1766):
http://msdn.microsoft.com/en-us/library/aa741220.aspx
http://msdn.microsoft.com/en-us/library/aa767880.aspx
http://msdn.microsoft.com/en-us/library/aa740986.aspx
http://www.transl-gunsmoker.ru/2011/05/converting-between-lcids-and-rfc-1766.html
http://www.ietf.org/rfc/rfc1766.txt
Generally, a character belongs to ASCII if its code is in the range 0x0000..0x007F; see http://www.unicode.org/charts/PDF/U0000.pdf. Newer Delphi versions have the class function TCharacter.IsAscii, but for some strange reason it is declared as private.
ASCII characters have a decimal value below 128 (0..127).
However, unless you are running a teletype machine from the 1960s, ASCII characters may not be sufficient. ASCII characters only cover English-language letters. If you actually need to support "Western European" characters such as umlauted vowels, graves, etc., found in German, French, Spanish, Swedish and so on, then testing for a Unicode char value <= 127 won't suffice. You might get away with testing for a char value <= 255, as long as you don't need to work with Eastern European scripts.
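A tiny Python sketch (my own illustration) of why those two thresholds differ:

# "code point <= 127" matches only ASCII; "<= 255" also picks up Western
# European accents (the Latin-1 range) but still misses other scripts.
for ch in ("A", "é", "ő", "じ"):
    cp = ord(ch)
    if cp <= 0x7F:
        kind = "ASCII"
    elif cp <= 0xFF:
        kind = "Latin-1 range (Western European accents)"
    else:
        kind = "beyond Latin-1"
    print(ch, cp, kind)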

What specific languages does the character encoding EUC-JP cover?

I want to know which specific languages the encoding EUC-JP actually covers.
Short answer: Japanese.
Longer answer: http://en.wikipedia.org/wiki/Extended_Unix_Code#EUC-JP
EUC-JP was, of course, designed for Japanese, so it includes all the essential Japanese characters:
22/64 of CJK Symbols and Punctuation
87/96 of Hiragana
90/96 of Katakana
12157/20992 of CJK Unified Ideographs [Kanji]
155/240 of Halfwidth and Fullwidth Forms
But it supports Western scripts as well:
128/128 of Basic Latin [ASCII]
82/128 of Latin-1 Supplement
120/128 of Latin Extended-A
17/208 of Latin Extended-B
6/80 of Spacing Modifier Letters
71/144 of Greek and Coptic
92/256 of Cyrillic
15/112 of General Punctuation
4/80 of Letterlike Symbols
6/112 of Arrows
32/256 of Mathematical Operators
1/256 of Miscellaneous Technical
32/128 of Box Drawing
12/96 of Geometric Shapes
7/256 of Miscellaneous Symbols
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_
`abcdefghijklmnopqrstuvwxyz{|}~¡¢£¤¦§¨©ª¬®¯°±´¶¸º¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍ
ÎÏÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĀāĂ㥹ĆćĈĉĊċČčĎ
ďĐđĒēĖėĘęĚěĜĝĞğĠġĢĤĥĦħĨĩĪīĮįİıIJijĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŊŋŌōŐőŒœŔŕ
ŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǵˇ˘˙˚
˛˝΄΅ΆΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξοπρςστυ
φχψωϊϋόύώЁЂЃЄЅІЇЈЉЊЋЌЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзи
йклмнопрстуфхцчшщъыьэюяёђѓєѕіїјљњћќўџ‐―‖‘’“”†‡‥…‰′″※℃№™Å←↑→↓⇒⇔
∀∂∃∇∈∋−√∝∞∠∧∨∩∪∫∬∴∵∽≒≠≡≦≧≪≫⊂⊃⊆⊇⊥⌒─━│┃┌┏┐┓└┗┘┛├┝┠┣┤┥┨┫┬┯┰┳┴┷┸┻┼
┿╂╋■□▲△▼▽◆◇○◎●◯★☆♀♂♪♭♯ 
So you could use EUC-JP to write not only Japanese, but also English, Spanish, French, German, Greek, Russian, etc. (but not Arabic or Hebrew).
It's hard to answer the question of exactly which languages are "supported" because of ambiguities about exactly which characters are required for a language (e.g., Does Dutch need the IJ ligature, or is "IJ" adequate? Are "café" and "jalapeño" English words?)
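If you want to probe this in practice, one simple approach (a Python sketch of my own, using the built-in euc_jp codec; the expected failures just reflect the coverage listed above) is to ask the codec whether a piece of text can be encoded at all:

# Try to encode sample text in EUC-JP; scripts outside its repertoire raise
# UnicodeEncodeError.
samples = {
    "Japanese": "日本語のテキスト",
    "Russian": "русский текст",
    "French/Spanish": "café, jalapeño",
    "Hebrew": "עברית",
}
for language, text in samples.items():
    try:
        text.encode("euc_jp")
        print(language, "- representable in EUC-JP")
    except UnicodeEncodeError:
        print(language, "- NOT representable in EUC-JP")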

Can someone explain ja_JP.UTF8?

I know what UTF-8 is, but what does the *.utf8 part of the name mean?
From the answer to my post
Locale = ja_JP
Encoding = UTF-8
Before Unicode, handling non-English characters was done using tricks like code pages (like this) and special character sets (like this: Shift_JIS). UTF-8 contains a much larger range of characters with a completely different mapping system (i.e. the way each character is addressed by number).
When setting ja_JP.UTF8 as the locale, the "UTF8" part signifies the encoding for the special characters needed. For example, when you output a currency amount in the Japanese locale, you will need the ¥ character. The encoding information defines which character set is used to display the ¥.
I'm assuming there could also exist a ja_JP.Shift_JIS locale. One difference from the UTF8 one - among others - would be that the ¥ sign is represented in a way that works in that specific encoding.
Why ja_JP?
The two codes ja_JP signify language (I think based on this ISO norm) and country (based on this one). This is important if a language is spoken in more than one country. In the German-speaking area, for example, the Swiss format numbers differently than the Germans: 1'000'000 vs. 1.000.000. The country code serves to define these distinctions within the same language.
In which context? ja_JP tells us that the string is in the Japanese language. That does not have anything to do with the character encoding, but is probably used - depending on context - for sorting, keyboard input and the language of displayed text in the program.
At a guess, I'd say each utf8 file with that naming convention contains a language definition for translating your site.
It's a locale name. The basic format is language_COUNTRY. ja = Japanese language, JP = Japan.
In addition to a date format, currency symbol, etc., each locale is associated with a character encoding. This is a historical legacy from the days when every language had its own encoding. Now, UTF-8 provides a common encoding for every locale.
The reason .UTF8 is part of the locale name is to distinguish it from older locales with a different encoding. For example, there's a ja_JP.EUC-JP locale available on my system. And for Germany, there's the choice of de_DE (obsolete pre-Euro locale with ISO-8859-1 encoding), de_DE@euro (ISO-8859-15 encoding, to provide the € sign), and de_DE.UTF-8.
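As a hedged sketch (Python here; the locale must actually be installed on the system, e.g. generated with locale-gen on Debian/Ubuntu, otherwise setlocale raises an error), you can see the encoding that a locale name carries:

import locale

name = "ja_JP.UTF-8"  # language "ja", country "JP", encoding "UTF-8"
try:
    locale.setlocale(locale.LC_ALL, name)
    print("currency symbol:", locale.localeconv()["currency_symbol"])  # the yen sign
    print("codeset:", locale.nl_langinfo(locale.CODESET))  # "UTF-8"; nl_langinfo is Unix-only
except locale.Error:
    print(name, "is not installed on this system")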

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional Western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same for both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be a custom implementation by a hardware or software supplier.
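A one-line sketch (my own, in Python) of that backwards compatibility:

# Pure-ASCII text produces byte-for-byte identical data whether it is
# encoded as ASCII or as UTF-8.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))  # every byte is below 0x80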
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
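To illustrate that ambiguity (a small Python sketch of my own, using a few encodings as examples), the very same high byte decodes to different characters, or to nothing at all, depending on the encoding you assume:

# One byte with the high bit set means different things under different
# encodings - and under UTF-8 a lone high byte is simply invalid.
raw = bytes([0xE9])
for codec in ("latin-1", "cp1252", "koi8-r", "utf-8"):
    try:
        print(codec, "->", raw.decode(codec))
    except UnicodeDecodeError:
        print(codec, "-> invalid byte sequence")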
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
These words are taken from an online resource (a cool website, though) because I found them useful and appropriate as an answer.
At first ASCII included only capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
This set of only 128 characters was published as a standard in 1967, containing everything needed to write in the English language.
In 1981, IBM developed an extension of the 8-bit ASCII code, called "code page 437"; in this version some obsolete control characters were replaced with graphic characters. 128 characters were also added, with new symbols, signs, graphics and Latin letters, plus all the punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way the characters ranging from 128 to 255 were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM PC", considered the first personal computer.
The operating system of this model, MS-DOS, also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the Code Pages that defined particular characters represented by particular values. Any multibyte (> 255 decimal value) is not ASCII.

Resources