I'm trying to figure out how to decode some corrupt characters I have in a spreadsheet. There is a list of website titles: some in English, some in Greek, some in other languages. For example, the Greek phrase ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ shows as ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë. So the whitespace is OK, but the actual letters have gone all wrong.
I have noticed that letters got converted to pairs of symbols:
Ε - Œï
Λ - Œõ
And so on. So it's almost always Œ and then some other symbol after it.
I went further, removed the repeated letter and checked the difference between the character codes of the actual phrase and what was left of the corrupted phrase: ord('ï') - ord('Ε') and so on. The difference is almost the same all the time:
678
678
678
676
676
677
676
678
0 (this is a whitespace)
676
678
678
0 (this is a whitespace)
765
768
753
678
I have manually decoded some of the other letters from other titles:
Greek
Œë Α
Œî Δ
Œï Ε
Œõ Λ
Œó Η
Œô Ι
Œö Κ
Œù Ν
Œ° Ρ
Œ§ Τ
Œ© Ω
Œµ ε
Œª λ
œÑ τ
ŒØ ί
Œø ο
œÑ τ
œâ ω
ŒΩ ν
Symbols
‚Äò ‘
‚Äô ’
‚Ä¶ …
‚Ä† †
‚Äú “
Other
√© é
It's good I have a translation for this phrase, but there are a couple of others I don't have translation for. I would be glad to see any kind of advice because searching around StackOverflow didn't show me anything related.
It's a character encoding issue. The text is UTF-8 that has been read as Mac OS Roman (figured it out by educated guesses on this site). The IANA name for this encoding is macintosh, and its Windows code page number is 10000.
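To see why every Greek letter turned into a pair starting with Œ, it can help to look at the bytes. This is a minimal sketch, assuming Python 3 and its built-in mac_roman codec; the helper name show_mojibake is made up for illustration.

```python
def show_mojibake(ch):
    """Return (UTF-8 bytes as hex, those same bytes misread as Mac OS Roman)."""
    raw = ch.encode('utf-8')                    # Greek letters take two bytes each
    as_hex = ' '.join(f'{b:02x}' for b in raw)
    return as_hex, raw.decode('mac_roman')      # each byte becomes one wrong character

for letter in 'ΕΛΩτ':
    print(letter, *show_mojibake(letter))
```

The lead byte CE (or CF for some lowercase letters) is what shows up as Œ (or œ); the second byte gives the character that varies.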
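Here's a Python function that repairs such strings by re-encoding the garbled text as macintosh bytes and decoding those bytes as UTF-8:
def macToUtf8(s):
    # Recover the original bytes, then read them as the UTF-8 they really are.
    return bytes(s, 'macintosh').decode('utf-8')
print(macToUtf8('ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë'))
# outputs: ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ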
My best guess is that your spreadsheet was saved on a Mac Computer, or perhaps saved using some Macintosh-based setting.
See also this issue: What encoding does MAC Excel use?
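Since the spreadsheet mixes damaged and undamaged titles, a more defensive variant may be useful. The sketch below assumes that any cell which cannot be round-tripped was never damaged; the name fix_mojibake and its parameters are hypothetical.

```python
def fix_mojibake(text, wrong='mac_roman', right='utf-8'):
    """Undo a UTF-8-read-as-Mac-Roman mix-up; return text unchanged if it does not fit."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters Mac OS Roman cannot represent,
        # or the recovered bytes are not valid UTF-8: assume it was fine already.
        return text

# Build a damaged sample the same way the spreadsheet presumably was damaged.
broken = 'ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ'.encode('utf-8').decode('mac_roman')
print(fix_mojibake(broken))
print(fix_mojibake('Plain English title'))
```

Pure ASCII titles pass through unchanged because ASCII bytes mean the same thing in both encodings.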
I am facing an issue when displaying the small c cedilla character (U+00E7 ç), used in French, on a handset.
When it is sent via USSGW/SS7 as small c cedilla, it is displayed on the handset as capital C cedilla (U+00C7 Ç).
For info, the character is encoded with gsm7bit.
Do you have any solution or idea for this situation?
The original ETSI TS 100 900 V7.2.0 (1999-07) Digital cellular telecommunications system (Phase 2+);
Alphabets and language-specific information
(GSM 03.38 version 7.2.0 Release 1998) defined byte 0x09 as Ç (capital C with cedilla).
Subsequently in GSM 03.38 to Unicode mappings, a clarification was made:
General notes:
This table contains the data the Unicode Consortium has on how ETSI GSM 03.38 7-bit default alphabet characters map into Unicode. This mapping is based on ETSI TS 100 900 V7.2.0 (1999-07), with a correction of 0x09 to small c-cedilla, instead of capital C-cedilla.
and in the table:
0x08 0x00F2 # LATIN SMALL LETTER O WITH GRAVE
0x09 0x00E7 # LATIN SMALL LETTER C WITH CEDILLA
#0x09 0x00C7 # LATIN CAPITAL LETTER C WITH CEDILLA (see note above)
0x0A 0x000A # LINE FEED
So there you have it: this character was remapped at some point. It is likely that you are encoding the character correctly, but an older device, or something using a library built on the old standard, is interpreting it according to the original mapping, resulting in the capital letter.
I'm not seeing a mapping for Ç so it shouldn't appear any more.
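The effect of the two table versions can be sketched in a few lines of Python. The dictionaries below contain only the three rows quoted above, not the full 128-entry alphabet, and the names GSM_ORIGINAL, GSM_CORRECTED and decode_septets are made up for illustration.

```python
# Only the entries quoted above; a real GSM 03.38 table has 128 rows.
GSM_ORIGINAL = {0x08: '\u00f2', 0x09: '\u00c7', 0x0A: '\n'}   # 0x09 as printed in the 1999 ETSI text
GSM_CORRECTED = {0x08: '\u00f2', 0x09: '\u00e7', 0x0A: '\n'}  # 0x09 per the Unicode Consortium mapping

def decode_septets(septets, table):
    """Map each 7-bit value through the given table ('?' for rows not listed here)."""
    return ''.join(table.get(s, '?') for s in septets)

sent = [0x09]                                  # what the sender emits for a small c cedilla
print(decode_septets(sent, GSM_CORRECTED))     # handset using the corrected table
print(decode_septets(sent, GSM_ORIGINAL))      # handset using the original table
```

The same septet on the wire yields ç or Ç depending only on which table the receiving side uses.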
Doing some reading and came across this block of code on the topic of Unicode Escapes in Ruby:
money = "\u{20AC 20 A3 20 A5}" # => "€ £ ¥"
I understand that in this Ruby syntax the actual spaces between the braces don't output an encoded space, which is the reason for the code point 20. What I don't understand is why there's a code point 20 at the very beginning of the {}, right after the \u. No space has been output in the result, and I copied it verbatim from the book.
It’s not a 20 at the beginning, it’s 20AC, which is the code point for €. The contents of the braces are a space separated list of codepoints (in hex format). 20AC is €, 20 is a space, A3 is £ and A5 is ¥.
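The same reading of the list can be reproduced outside Ruby. This is a small Python sketch of how such a space-separated list of hex code points turns into a string; decode_codepoints is a made-up helper, not part of either language.

```python
def decode_codepoints(spec):
    """Turn a space-separated list of hex code points into a string."""
    return ''.join(chr(int(cp, 16)) for cp in spec.split())

print(decode_codepoints('20AC 20 A3 20 A5'))   # prints: € £ ¥
```

Splitting on whitespace gives five tokens (20AC, 20, A3, 20, A5), which is why the result has five characters and no leading space.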
I received files which, sadly, I cannot get info about how they were generated. I need to parse these files.
The file is entirely ASCII besides for one character: 0xDB (in decimal it gives 219).
Obviously (from looking at the file) this character is a currency symbol. I know it because:
it is mandatory for these files to contain a currency symbol anywhere an amount appears
there's no other currency symbol (neither $ nor euro nor anything else) anywhere in the files
every time 0xDB appears, it's next to an amount
I think that in these files that 0xDB is supposed to represent the Euro symbol (it is actually very highly probable that this 0xDB appears everywhere a Euro symbol is supposed to appear).
The file command says this about the files:
ISO-8859 English text, with CRLF, LF line terminators
A hexdump gives this:
00000030 71 75 61 6e 74 20 db 32 2e 36 30 0a 20 41 49 4d |quant .2.60. AIM|
^^ ^
The files are all otherwise normally formatted and parsable. I'm getting all the information fine except for that weird 0xDB character.
Does anyone know what's going on? How did a currency symbol (supposedly the euro symbol) somehow become a 0xDB?
It's neither ISO-8859-1 (aka ISO Latin 1) nor ISO-8859-15, because in both cases code point 219 corresponds to 'Û' (just as Unicode code point 219 is 'LATIN CAPITAL LETTER U WITH CIRCUMFLEX').
It's not extended-ASCII.
It could be Mac OS Roman
It's MacRoman. In fact it has to be -- that's the only charset in which the Euro sign maps to 0xDB.
Here's the full charset mapping for MacRoman.
Using the macroman script, one learns:
$ macroman 0xDB
MacRoman DB ⇒ U+20AC ‹€› \N{ EURO SIGN }
You can go the other way, too:
$ macroman U+00E9
MacRoman 8E ⇐ U+00E9 ‹é› \N{ LATIN SMALL LETTER E WITH ACUTE }
And we know that U+20AC EURO SIGN is indeed a currency symbol because of the uniprops script’s output:
$ uniprops -a U+20AC
U+20AC <€> \N{ EURO SIGN }:
\pS \p{Sc}
All Any Assigned InCurrencySymbols Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph GrBase Print Symbol X_POSIX_Graph X_POSIX_Print
Age=2.1 Bidi_Class=ET Bidi_Class=European_Terminator BC=ET Block=Currency_Symbols Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=A East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=PR Line_Break=Prefix_Numeric LB=PR Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX _X_Begin
0xDB represents the Euro sign in the Mac OS Roman character encoding.
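The candidate encodings can also be compared directly. The sketch below assumes Python 3 and its built-in codecs, and decodes the sixteen bytes from the hexdump in the question under each one.

```python
# The sixteen bytes shown in the hexdump line of the question.
line = bytes.fromhex('7175616e7420db322e36300a2041494d')

# Decode the same bytes under each candidate and compare what 0xDB becomes.
decoded = {codec: line.decode(codec)
           for codec in ('latin-1', 'iso8859-15', 'mac_roman')}

for codec, text in decoded.items():
    print(f'{codec:11}', repr(text))
```

Only the mac_roman line shows a euro sign in front of 2.60; the two ISO-8859 decodings both give Û there.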
I need to clean up some files containing French text. The problem is that the files erroneously contain multiple encodings within the same file.
I think some sections are ISO8859-1 (Latin 1) but other parts have text encoded in single-byte characters that look like 'extended' ASCII. In other words, it is 7-bit ASCII plus the following:
0x82 for é (e acute)
0x8a for è (e grave)
0x88 for ê (e circumflex)
0x85 for à (a grave)
0x87 for ç (c cedilla)
What encoding is this?
That's the original IBM PC encoding, Code page 437.
This website here shows a link with 0x87 for cedilla. I haven't looked much further than this, but I bet the rest of your information could be found here as well.
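For the clean-up itself, one possible heuristic is sketched below in Python. It assumes that the encoding never changes within a line and that the text is French: accented French letters in CP437 fall in 0x80-0x9F, a range that Latin-1 reserves for control characters. The function name decode_line is hypothetical.

```python
def decode_line(raw):
    """Decode one line, guessing CP437 when any byte in 0x80-0x9F is present."""
    if any(0x80 <= b <= 0x9F for b in raw):
        return raw.decode('cp437')      # e.g. 0x82 -> é, 0x87 -> ç
    return raw.decode('latin-1')        # e.g. 0xE9 -> é, 0xE7 -> ç

# The five bytes from the question, checked against Python's cp437 codec.
print(bytes([0x82, 0x8A, 0x88, 0x85, 0x87]).decode('cp437'))   # prints: éèêàç

print(decode_line(b'fran\x87ais'))   # CP437 section
print(decode_line(b'fran\xe7ais'))   # Latin-1 section
```

A line that mixes both encodings would still come out wrong, so the result is worth checking by eye on a sample of the files.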
I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (hex 2013). Your code is receiving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
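The arithmetic behind E2 80 93 can be checked by hand. Here is a Python sketch of the three-byte UTF-8 layout; utf8_three_bytes is a made-up helper that covers only code points from U+0800 to U+FFFF and ignores the surrogate gap.

```python
def utf8_three_bytes(cp):
    """Encode a code point in the range U+0800..U+FFFF as three UTF-8 byte values."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError('this sketch only handles three-byte sequences')
    return [0xE0 | (cp >> 12),            # 1110xxxx: top 4 bits
            0x80 | ((cp >> 6) & 0x3F),    # 10xxxxxx: middle 6 bits
            0x80 | (cp & 0x3F)]           # 10xxxxxx: low 6 bits

print(utf8_three_bytes(0x2013))               # [226, 128, 147]
print(list('0 \u2013 1'.encode('utf-8')))     # [48, 32, 226, 128, 147, 32, 49]
```

The second line reproduces the Erlang list from the question, confirming that the parser handed over raw UTF-8 bytes.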
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
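The compatibility argument can be illustrated with a short Python sketch. It uses the utf-16-le codec as a stand-in for UCS-2, which is an assumption that holds only for characters in the Basic Multilingual Plane.

```python
text = 'A\u2013B'                      # 'A', en dash, 'B'

utf8 = text.encode('utf-8')
ucs2 = text.encode('utf-16-le')        # same as UCS-2 for these characters

print(utf8)                            # ASCII letters stay single bytes
print(ucs2)                            # every character is widened to two bytes
print(0 in utf8, 0 in ucs2)            # prints: False True
```

The zero bytes in the wide encoding are what break null-terminated C strings, while the UTF-8 form contains none and leaves the ASCII letters untouched.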