What Character Encoding Is This? - character-encoding

I need to clean up some files containing French text. The problem is that the files erroneously contain multiple encodings within the same file.
I think some sections are ISO-8859-1 (Latin-1), but other parts have text encoded as single-byte characters that look like 'extended' ASCII. In other words, it is plain 7-bit ASCII plus the following:
0x82 for é (e acute)
0x8a for è (e grave)
0x88 for ê (e circumflex)
0x85 for à (a grave)
0x87 for ç (c cedilla)
What encoding is this?

That's the original IBM PC encoding, Code page 437.

This website shows 0x87 for the cedilla. I haven't looked much further than this, but I bet the rest of your characters could be found there as well.
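
A quick way to confirm this, as a minimal sketch using Python's built-in cp437 codec (assuming you can get at the raw bytes):

# Decode the suspect bytes with the cp437 codec; they come out as the accented letters listed above.
suspect = bytes([0x82, 0x8A, 0x88, 0x85, 0x87])
print(suspect.decode("cp437"))  # -> éèêàç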

Related

Which codepage is 0x81 = ü, 0x94 = ö, 0x9A = Ü?

I've got a CSV file with a character encoding I can't identify. From its content (German-language entries) I could find the following characters matching some single-byte character encodings:
0x81 = ü
0x94 = ö
0x9A = Ü
Which Codepage is this? Is there any website where you can maybe lookup code pages by known entries?
I was assuming this could be WINDOWS-1252 or ISO-8859-1, but it's neither of them.
As I found out by some more trial and error, the encoding is CP437, also called the "DOS" codepage. Weird to see such an encoding still in use nowadays.
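
If you'd rather check than rely on trial and error, here is a minimal Python sketch that prints how a few common single-byte codepages interpret those bytes (the candidate list is just an illustrative guess):

# Show how each candidate codepage interprets the three bytes from the CSV.
candidates = ["cp437", "cp850", "cp1252", "iso-8859-1"]
for codec in candidates:
    decoded = bytes([0x81, 0x94, 0x9A]).decode(codec, errors="replace")
    print(f"{codec:12} -> {decoded}")
# cp437 and cp850 both yield "üöÜ" here; cp1252 and latin-1 do not.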

Where should my brackets be in relation to the text for Arabic languages?

Our application automatically modifies the layout of Arabic text when it is followed by a bracket and I was wondering whether this was the correct behaviour or not?
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left to right languages, but for Arabic languages the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker and I need to know if this is actually correct. Is the Arabic format the equivalent of the English format or has something gone horribly wrong?
The text in Arabic should be shown like:
ستاكوفيرفلوو(1.0) ‏
I added the HTML entity for the RLM (Right-to-Left Mark), ‏, to fix the text. You should do the same if your application doesn't support bidi natively. You can add the RLM in any of these ways (a small programmatic sketch follows the list):
HTML Entity (decimal) ‏
HTML Entity (hex) &#x200F;
HTML Entity (named) &rlm;
How to type in Microsoft Windows Alt +200F
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)
UTF-8 (binary) 11100010:10000000:10001111
UTF-16 (hex) 0x200F (200f)
UTF-16 (decimal) 8,207
UTF-32 (hex) 0x0000200F (200f)
UTF-32 (decimal) 8,207
C/C++/Java source code "\u200F"
Python source code u"\u200F"
(note: the correct transliteration of StackOverflow is ستاك-أوفرفلو)
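
As a concrete illustration, a minimal Python sketch of appending the RLM after the bracketed version (format_structure is a hypothetical helper, not part of any real application):

# U+200F RIGHT-TO-LEFT MARK: invisible, but it gives the bidi algorithm a
# right-to-left context for the neutral "(", ")" and digit characters.
RLM = "\u200F"

def format_structure(structure_id: str, version: str) -> str:
    return f"{structure_id}({version}){RLM}"

print(format_structure("ستاكوفيرفلوو", "1.0"))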

Character c cedilla (small) is displayed as capital

I am facing an issue when displaying the small c cedilla character (U+00E7 ç), used in the French language, on a handset.
When it is sent via USSGW/SS7 as small c cedilla, it is displayed on the handset as capital C cedilla (U+00C7 Ç).
For info, the character is encoded with gsm7bit.
Do you have any solution or idea for this situation?
The original ETSI TS 100 900 V7.2.0 (1999-07) Digital cellular telecommunications system (Phase 2+);
Alphabets and language-specific information
(GSM 03.38 version 7.2.0 Release 1998) defined byte 0x09 as Ç (capital C with cedilla).
Subsequently in GSM 03.38 to Unicode mappings, a clarification was made:
General notes:
This table contains the data the Unicode Consortium has on how ETSI GSM 03.38 7-bit default alphabet characters map into Unicode. This mapping is based on ETSI TS 100 900 V7.2.0 (1999-07), with a correction of 0x09 to small c-cedilla, instead of capital C-cedilla.
and in the table:
0x08 0x00F2 # LATIN SMALL LETTER O WITH GRAVE
0x09 0x00E7 # LATIN SMALL LETTER C WITH CEDILLA
#0x09 0x00C7 # LATIN CAPITAL LETTER C WITH CEDILLA (see note above)
0x0A 0x000A # LINE FEED
So there you have it: this character was remapped at some point. It is likely that you are encoding the character correctly, but an older device, or something using a library built against the old standard, is interpreting the character according to the original mapping, resulting in the capital letter.
I'm not seeing a mapping for Ç in the corrected table, so it shouldn't appear any more.
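
Python has no built-in GSM 7-bit codec, but as a minimal sketch (with a hypothetical, heavily truncated mapping table) of how the two readings of the standard differ:

# Tiny excerpt of the GSM 03.38 default-alphabet-to-Unicode table around 0x09.
GSM_CORRECTED = {0x08: "\u00F2", 0x09: "\u00E7", 0x0A: "\u000A"}  # corrected mapping: small ç
GSM_1999_TEXT = {0x08: "\u00F2", 0x09: "\u00C7", 0x0A: "\u000A"}  # original ETSI text: capital Ç

septet = 0x09
print(GSM_CORRECTED[septet])  # ç -- what you intend to send
print(GSM_1999_TEXT[septet])  # Ç -- what a device built to the old mapping shows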

In what charset is 0xE1 an "a" with an umlaut?

I am trying to identify an extended ascii charset where 0xE1 is an "a" with an umlaut (8859-1 character E4)
and 0xF5 is a "u" with an umlaut (8859-1 character FC).
Has anyone seen this charset before? It quite possibly dates back to the 80's.
To my knowledge, there is no standard extended-ASCII character set which uses those positions. Here is a list of the standard character sets: http://www.columbia.edu/kermit/csettables.html
However, the character set you referenced is in fact used for interfacing with some LED displays; for example, HT1632-compliant LED displays use the same character set: http://blog.thiseldo.co.uk/wp-filez/USB_HT1632_Matrix.pde
I hope this helps.
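
If you'd rather search programmatically than eyeball code charts, here is a minimal Python sketch that scans the codecs shipped with the standard library for one where 0xE1 decodes to "ä" and 0xF5 to "ü" (a device-specific ROM font, like the LED controller mentioned above, naturally won't be among them):

import encodings.aliases

# Try every codec name Python knows about; print any that match both constraints.
for name in sorted(set(encodings.aliases.aliases.values())):
    try:
        if bytes([0xE1]).decode(name) == "ä" and bytes([0xF5]).decode(name) == "ü":
            print("match:", name)
    except Exception:
        pass  # skip multibyte, stateful, or non-text codecs that can't decode a lone byte

Finding no match would be consistent with the answer above: it is likely a device-specific font table rather than a registered charset.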
These are in the commonly used ISO-8859-1 character set, also known as latin-1.

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers from the resulting parts.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes three bytes to encode: it's because the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E2 80 93 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
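
The byte sequence is easy to confirm outside Erlang as well; for instance, a minimal Python check of the en dash's encodings:

# U+2013 EN DASH: one 16-bit code unit in UTF-16, three bytes in UTF-8.
dash = "\u2013"
print(dash.encode("utf-8"))                # b'\xe2\x80\x93'  (the 226,128,147 above)
print(dash.encode("utf-16-be"))            # b'\x20\x13'
print(list("0 \u2013 1".encode("utf-8")))  # [48, 32, 226, 128, 147, 32, 49]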

Resources