What is the charset for Spanish Windows?
Spanish (es): ISO-8859-1, Windows-1252
The only correct, acceptable character set nowadays is the Universal Character Set (UCS), and the only correct, acceptable encodings are Unicode Transformation Formats (UTF).
One of the worst misfeatures Windows has is the silly notion of a locale having several character sets and encodings associated with it: an 8-bit legacy so-called "OEM" character set that comes from the DOS days, an 8-bit legacy so-called "ANSI" character set that comes from the early Windows days, and so-called "wide character" UTF-16 Little-Endian, which is what Windows supports when applications are "Unicode". While the so-called "OEM" thing is left for DOS applications, most of the Windows API is annoyingly duplicated into "A" (ANSI) functions and "W" (wide char) functions.
In the case of the Spanish locale, the "OEM" character set is CP850, the "ANSI" character set is CP1252, and of course there's UTF-16 Little-Endian which is what you should be using.
My recommendation is that you avoid CP1252 and CP850 like the plague and use UTF-16LE, as well as develop your applications with Unicode semantics and conventions. Some applications also support UTF-8, which is more convenient for European languages.
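To make the A/W split concrete, here is a minimal sketch using Python's ctypes on Windows, purely for illustration; the user32 MessageBoxA/MessageBoxW pair is one real example of the duplicated API:

import ctypes
import sys

# Minimal sketch (Windows only). The same API exists as ...A (the "ANSI" code
# page, CP1252 on a Spanish locale) and ...W (UTF-16LE "wide" strings). Text
# outside the ANSI code page only survives through the W variant.
if sys.platform == "win32":
    user32 = ctypes.windll.user32
    text = "Español: ÁáÉéÍíÓóÚúÑñ¡¿"
    user32.MessageBoxW(None, text, "Unicode (W)", 0)                  # str is passed as UTF-16 wide chars
    user32.MessageBoxA(None, text.encode("cp1252"), b"ANSI (A)", 0)   # bytes in the legacy code page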
Include a meta tag in the head of your HTML file:
<meta charset="iso-8859-1">
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
Not 100% on topic but probably very useful to some who will arrive here by a Google search, as I did: ÁáÉéÍíÓóÚúÑñ¡¿
Related
I've been working with a Japanese company that chooses to encode our files with EUC-JP.
I've been curious for quite a while now and have tried asking superiors why EUC-JP over Shift-JIS or UTF-8, but I only get answers like "it's convention" or such.
Do you know why the initial coders might have chosen EUC-JP over other character encodings?
Unlike Shift-JIS, EUC-JP is ASCII-safe - any byte where the eighth bit is zero is ASCII. It was also historically popular in Unix variants. Either of these things could have been an important factor a long time ago, before UTF-8 was generally adopted. Check the Wikipedia article for more details.
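To see what "ASCII-safe" means at the byte level, here is a quick sketch in Python; the character 表 is the classic problem case for Shift-JIS:

# In Shift-JIS, the trailing byte of a double-byte character can fall in the
# ASCII range - for 表 (U+8868) it is 0x5C, the ASCII backslash, which confuses
# naive byte-oriented tools. In EUC-JP every byte of a multi-byte character has
# the eighth bit set, so this cannot happen.
ch = "表"
sjis = ch.encode("shift_jis")
eucjp = ch.encode("euc_jp")
print(sjis.hex())                     # '955c' - contains the ASCII byte 0x5c
print(eucjp.hex())                    # every byte is >= 0x80
print(any(b < 0x80 for b in sjis))    # True
print(any(b < 0x80 for b in eucjp))   # False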
There is a public project called Moby containing several word lists. Some files contain European alphabet symbols and were created in pre-Unicode times. The readme, dated 1993, reads:
"Foreign words commonly used in English usually include their
diacritical marks, for example, the acute accent e is denoted by ASCII
142."
Wikipedia says that the last ASCII symbol has number 127.
For example this file: http://www.gutenberg.org/files/3203/files/mobypos.txt contains symbols that I couldn't read in any of the various Latin encodings. (There are plenty of such symbols at the very end of the section of words beginning with B, just before the letter C.)
Could someone please advise what encoding should be used for reading this file, or how it can be converted to some readable modern encoding?
A little research suggests that the encoding for this page is Mac OS Roman, which has é at position 142. Viewing the page you linked and changing the encoding (in Chrome, View → Encoding → Western (Macintosh)) seems to display all the words correctly (it is incorrectly reporting ISO-8859-1).
How you deal with this depends on the language / tools you are using. Here’s an example of how you could convert into UTF-8 with Ruby:
require 'open-uri'

# Fetch the raw bytes, label them as Mac OS Roman, then transcode to UTF-8.
s = open('http://www.gutenberg.org/files/3203/files/mobypos.txt').read
s.force_encoding('macroman')   # tag the existing bytes (no conversion yet)
s.encode!('utf-8')             # now actually convert them to UTF-8
You are right in that ASCII only goes up to position 127 (it's a 7-bit encoding), but there are a large number of 8-bit encodings that are supersets of ASCII, and people sometimes refer to those as "Extended ASCII". It appears that whoever wrote the readme you refer to didn't know about the variety of encodings and thought the one he happened to be using at the time was universal.
There isn’t a general solution to problems like this, as there is no guaranteed way to determine the encoding of some text from the text itself. In this case I just used Wikipedia to look through a few until I found one that matched. Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a good place to start reading about character sets and encodings if you want to learn more.
It appears that many Traditional Chinese characters are supported by the Windows implementation of the Simplified Chinese character set, which started as GB2312 / GBK and, from my understanding, now supports the entire GB18030.
The question is, are all the Traditional Chinese characters there? If yes, doesn't it make the Big5 character set totally redundant?
There are only a handful of printable non-private-use characters (¯ˍ‧╴) present in code page 950 (Big5) but not in 936 (GB2312), none of which are Chinese ideographs. In that sense, yes, Big5 is “redundant”.
However, for various technical and political reasons, it's highly unlikely that all Traditional Chinese documents will immediately be converted to GB18030 (or UTF-8), so expect Windows to support CP950 for years to come.
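If you want to check the claim about the handful of Big5-only characters yourself, here is a rough sketch in Python; note that Python's bundled cp950/cp936 tables only approximate the Windows code pages, so the exact output may differ slightly:

# Rough sketch: find BMP characters that can be encoded in cp950 (Big5) but
# not in cp936 (GBK). Results depend on Python's codec tables.
def encodable(ch, codec):
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

only_in_big5 = [chr(cp) for cp in range(0x80, 0x10000)
                if encodable(chr(cp), "cp950") and not encodable(chr(cp), "cp936")]
print(len(only_in_big5), "".join(only_in_big5))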
For example, this Delphi code:
wchar_IsASCii: array[0..1] of WCHAR;
wchar_IsASCii[0] := 'A';
wchar_IsASCii[1] := 'じ';
How can I tell that wchar_IsASCii[0] belongs to ASCII, and that wchar_IsASCii[1] does not?
Actually, I only need to know whether a Unicode char belongs to ASCII; that's all. How do I distinguish whether a WCHAR is Chinese, Japanese, or ASCII?
I don't know Delphi, but what I can tell you is that you need to determine what range the character fits into in Unicode. Here is a link about finding CJK characters in Unicode: What's the complete range for Chinese characters in Unicode?
And unless Delphi has some nice library for distinguishing Chinese and Japanese characters, you're going to have to determine that yourself. Here is a good answer on SO for how to do that:
Testing for Japanese/Chinese Characters in a string
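As a sketch of what "determine the range" means in practice - written in Python rather than Delphi, and covering only the most common blocks, not every CJK extension:

# Classify a character by its Unicode range.
def classify(ch):
    cp = ord(ch)
    if cp <= 0x7F:
        return "ASCII"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana (Japanese)"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana (Japanese)"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "CJK ideograph (the script alone can't tell Chinese from Japanese)"
    return "something else"

print(classify("A"))    # ASCII
print(classify("じ"))   # Hiragana (Japanese)
print(classify("中"))   # CJK ideograph (...)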
The problem is... what do you mean by ASCII? The original ASCII standard is a 7-bit code - it does not even fill a byte.
Then if you move on to so-called "extended ASCII" - one-byte items - the upper half can be next to anything. It can be Greek on one machine, European diacritics on another, Cyrillic on a third one... etc.
So I think that if all you need is to test whether you have a 7-bit ASCII character - ruling out the extended characters from the French, German and Spanish alphabets and all the Scandinavian ones - then, since Unicode was designed as yet another superset of Latin-1 (and thus of ASCII), what you need is to check that (Ord(charVar) >= 0) and (Ord(charVar) <= $7F).
However, if you really need to tell languages apart - if you consider Greek and Cyrillic somewhat ASCII and the Japanese alphabets (there are two, by the way: Hiragana and Katakana) not, or if you consider French and German more or less ASCII-like but Russian not - you would have to look at the Unicode ranges.
http://www.unicode.org/charts/index.html
To get the 32-bit code point of the UCS-4 standard you can use http://docwiki.embarcadero.com/Libraries/XE3/en/System.Character.ConvertToUtf32
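For what that conversion does under the hood, here is the surrogate-pair arithmetic as a small Python sketch (the function name is mine; the pair shown encodes U+20BB7):

# Combine a UTF-16 high/low surrogate pair into one 32-bit code point.
def surrogate_pair_to_utf32(hi, lo):
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

print(hex(surrogate_pair_to_utf32(0xD842, 0xDFB7)))   # 0x20bb7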
There are also the next-to-standard IBM Classes for Unicode (ICU), but it looks like no good Delphi translation exists: Has anyone used ICU with Delphi?
You can use the Jedi CodeLib, but its tables are (the comments contradict each other) either from Unicode 4.1 or 5.0, not from the current 6.2, though for Japanese version 5.0 should be enough.
http://wiki.delphi-jedi.org/wiki/JCL_Help:TUnicodeBlock
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockFromChar
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockName#TUnicodeBlock
You can also use the Microsoft MLang interface to query internet-style language codes (RFC 1766):
http://msdn.microsoft.com/en-us/library/aa741220.aspx
http://msdn.microsoft.com/en-us/library/aa767880.aspx
http://msdn.microsoft.com/en-us/library/aa740986.aspx
http://www.transl-gunsmoker.ru/2011/05/converting-between-lcids-and-rfc-1766.html
http://www.ietf.org/rfc/rfc1766.txt
Generally, a character belongs to ASCII if its code is in the range 0x0000..0x007F, see http://www.unicode.org/charts/PDF/U0000.pdf. A newer Delphi has the class function TCharacter.IsAscii, but for some strange reason it is declared as private.
ASCII characters have a decimal value of 127 or less.
However, unless you are running a teletype machine from the 1960s, ASCII chars may not be sufficient. ASCII chars will only cover English-language characters. If you actually need to support "Western European" characters such as the umlaut vowels, graves, etc. found in German, French, Spanish, Swedish, etc., then testing for a Unicode char value <= 127 won't suffice. You might get away with testing for a char value <= 255, as long as you don't need to work with Eastern European scripts.
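A small Python sketch of what those two thresholds do and do not cover:

# <= 127 accepts only ASCII; <= 255 also accepts the Latin-1 supplement
# (é, ñ, ä, ...), but still rejects e.g. Polish ą (U+0105) and anything further east.
for ch in ("A", "é", "ñ", "ą"):
    print(ch, ord(ch), ord(ch) <= 127, ord(ch) <= 255)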
What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics or accented letters, and sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional Western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages, both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same for both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding) or they can be a custom implementation by a hardware or software supplier.
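A tiny Python sketch of that backwards compatibility:

# For code points 0-127, ASCII and UTF-8 produce byte-for-byte identical output;
# above that, UTF-8 switches to multi-byte sequences that plain ASCII cannot express.
print("Hola".encode("ascii") == "Hola".encode("utf-8"))   # True
print("ñ".encode("utf-8"))                                # b'\xc3\xb1' - two bytes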
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
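To make that concrete, here is a quick Python sketch showing one and the same non-ASCII byte turning into different characters, or into no character at all, depending on the assumed encoding:

raw = b"\xe9"                    # a single byte with the high bit set - not ASCII
print(raw.decode("latin-1"))     # 'é' in ISO-8859-1
print(raw.decode("cp1252"))      # 'é' in Windows-1252 as well
print(raw.decode("mac_roman"))   # a different character in Mac OS Roman
try:
    raw.decode("utf-8")          # on its own it is not even a valid UTF-8 sequence
except UnicodeDecodeError as err:
    print("invalid as UTF-8:", err)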
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
These words are taken from an online resource (a cool website, by the way) because I found them useful and appropriate for writing an answer.
At first ASCII only included capital letters and numbers, but in 1967 the lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
So this set of only 128 characters was published in 1967 as a standard, containing all that is needed to write in the English language.
In 1981, IBM developed an extension of the ASCII code to 8 bits, called "code page 437". In this version some obsolete control characters were replaced with graphic characters. 128 characters were also added, with new symbols, signs, graphics and Latin letters, and all the punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way the "ASCII" characters ranging from 128 to 255 were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM-PC", considered the first personal computer.
The operating system of this model, "MS-DOS", also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the code pages that defined particular characters represented by particular values. Anything multibyte (with a decimal value above 255) is not ASCII.