In which encoding is 0xDB a currency symbol? - character-encoding

I received files which, sadly, I cannot get info about how they were generated. I need to parse these files.
The file is entirely ASCII except for one character: 0xDB (219 in decimal).
From looking at the file, this character is obviously a currency symbol. I know this because:
it is mandatory for these files to contain a currency symbol wherever an amount appears
there is no other currency symbol (neither $ nor € nor anything else) anywhere in the files
every time 0xDB appears, it is next to an amount
I think that in these files that 0xDB is supposed to represent the Euro symbol (it is actually very highly probable that this 0xDB appears everywhere a Euro symbol is supposed to appear).
The file command says this about the files:
ISO-8859 English text, with CRLF, LF line terminators
An hexdump gives this:
00000030 71 75 61 6e 74 20 db 32 2e 36 30 0a 20 41 49 4d |quant .2.60. AIM|
(note the db byte, and the corresponding . in the ASCII column)
The files are all otherwise normally formatted and parsable. I'm actually extracting all the information fine except for that odd 0xDB character.
Does anyone know what's going on? How did a currency symbol (supposedly the euro symbol) somehow become a 0xDB?
It's neither ISO-8859-1 (aka ISO Latin 1) nor ISO-8859-15, because in both of those, code point 219 corresponds to 'Û' (just as Unicode code point 219 is LATIN CAPITAL LETTER U WITH CIRCUMFLEX).
It's not extended-ASCII.

It could be Mac OS Roman

It's MacRoman. In fact it has to be -- that's the only charset in which the Euro sign maps to 0xDB.
Here's the full charset mapping for MacRoman.

Using the macroman script, one learns:
$ macroman 0xDB
MacRoman DB ⇒ U+20AC ‹€› \N{ EURO SIGN }
You can go the other way, too:
$ macroman U+00E9
MacRoman 8E ⇐ U+00E9 ‹é› \N{ LATIN SMALL LETTER E WITH ACUTE }
And we know that U+20AC EURO SIGN is indeed a currency symbol because of the uniprops script’s output:
$ uniprops -a U+20AC
U+20AC <€> \N{ EURO SIGN }:
\pS \p{Sc}
All Any Assigned InCurrencySymbols Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph GrBase Print Symbol X_POSIX_Graph X_POSIX_Print
Age=2.1 Bidi_Class=ET Bidi_Class=European_Terminator BC=ET Block=Currency_Symbols Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=A East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=PR Line_Break=Prefix_Numeric LB=PR Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX _X_Begin

0xDB represents the Euro sign in the Mac OS Roman character encoding.
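This is easy to verify with Python's built-in mac_roman codec (a quick sketch using only the standard library):

```python
# Decode the mystery byte with the Mac OS Roman codec
print(bytes([0xDB]).decode('mac_roman'))   # €

# And the reverse: the euro sign encodes back to 0xDB
print('€'.encode('mac_roman').hex())       # db

# Whereas Latin-1 gives the capital U with circumflex instead
print(bytes([0xDB]).decode('latin-1'))     # Û
```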


Greek and special characters show as mojibake - how to decode?

I'm trying to figure out how to decode some corrupted characters I have in a spreadsheet. There is a list of website titles: some in English, some in Greek, some in other languages. For example, the Greek phrase ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ shows as ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë. So the whitespace is OK, but the actual letters have all gone wrong.
I have noticed that letters got converted to pairs of symbols:
Ε - Œï
Λ - Œõ
And so on. So it's almost always Œ and then some other symbol after it.
I went further: I removed the repeated 'Œ' and checked the difference in code points between the actual phrase and what was left of the corrupted phrase: ord('ï') - ord('Ε'), and so on. The difference is almost the same all the time:
678
678
678
676
676
677
676
678
0 (this is a whitespace)
676
678
678
0 (this is a whitespace)
765
768
753
678
I have manually decoded some of the other letters from other titles:
Greek
Œë Α
Œî Δ
Œï Ε
Œõ Λ
Œó Η
Œô Ι
Œö Κ
Œù Ν
Œ° Ρ
Œ§ Τ
Œ© Ω
Œµ ε
Œª λ
œÑ τ
ŒØ ί
Œø ο
œÑ τ
œâ ω
ŒΩ ν
Symbols
‚Äò ‘
‚Äô ’
‚Ķ …
‚Ć †
‚Äú “
Other
√© é
It's good I have a translation for this phrase, but there are a couple of others I don't have translation for. I would be glad to see any kind of advice because searching around StackOverflow didn't show me anything related.
It's a character encoding issue. The string appears to be in the Mac OS Roman encoding (figured out by educated guesses on this site). The IANA name for this encoding is macintosh, and its Windows code page number is 10000.
Here's a Python function that will decode macintosh to utf-8 strings:
def macToUtf8(s):
    return bytes(s, 'macintosh').decode('utf-8')

print(macToUtf8('ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë'))
# outputs: ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ
My best guess is that your spreadsheet was saved on a Mac Computer, or perhaps saved using some Macintosh-based setting.
See also this issue: What encoding does MAC Excel use?
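The corruption itself is easy to reproduce: take the UTF-8 bytes of the Greek text and (mis)decode them as Mac Roman. A sketch of the round trip in both directions:

```python
original = 'ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ'

# Mis-decoding the UTF-8 bytes as Mac Roman yields exactly the mojibake above
mojibake = original.encode('utf-8').decode('mac_roman')
print(mojibake)   # ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë

# The repair is the inverse trip: re-encode as Mac Roman, decode as UTF-8
assert mojibake.encode('mac_roman').decode('utf-8') == original
```

This also explains the recurring Œ: the Greek capital letters live at U+0391 through U+03A9, so their UTF-8 encoding always starts with the byte 0xCE, and 0xCE is Œ in Mac Roman.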

Setting charset on thermal printer via ESC/POS

I have a thermal printer "MPT-II" from an unknown Chinese brand, that has both USB and Bluetooth. I can successfully print text using:
Loyverse app on Android
JavaScript
Raw HEX or decimal
However, only using the Loyverse app am I able to input special characters, and by special characters I mean the Danish characters æøå/ÆØÅ.
If I open up any BLE tool on Windows (Bluetooth LE Lab, for example), I can select the correct characteristic and send something like 104 101 108 108 111 13 10, which prints "hello" on the printer. I've read a bit about the ESC R and ESC t commands, but how exactly do I set those modes? I've tried prepending them to each command, such as 27 82 4 104 101 108 108 111 13 10, where 27 82 4 corresponds to ESC R 4, and 4 selects Denmark I.
According to the printer's manual, it states the following:
GB18030 character set, ASCII characters, user defined characters, bar codes CODE39, EAN13, EAN8, CODABAR, CODE93, ITF, bitmaps.
According to that list, the Danish character set is not supported. I'm not sure how the Loyverse app is doing it correctly, but the text is the same using raw commands and Loyverse, so I don't think Loyverse is converting to a bitmap and sending that data.
So my real question is: How do I send the correct character set for my printer? Maybe the character set is already correct, but the ASCII character for æøå/ÆØÅ are wrong?
EDIT: I have confirmed that something works with the ESC commands. If I do 27 97 2 followed by my "hello" sequence, the text is printed to the right (right-aligned). So that definitely works. I have tried probably all character sets so far using ESC R and ESC t, but none of them work :(
EDIT 2: I have now tested every single combination of ESC R and ESC t. I went through the entire list printing some Chinese characters, and every single one of the 150+ combinations I tried returned the same Chinese character. So ESC R / ESC t is definitely not the command I should be using to change the charset.
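For printers that do honour ESC t, the conventional approach looks like the sketch below. Everything here is an assumption about this particular printer: the code-page number n and its mapping are firmware-specific, and n=5 selecting CP865 ("Nordic", which contains æøå/ÆØÅ) is only a common convention, not a guarantee.

```python
ESC = b'\x1b'

def escpos_danish(text: str, n: int = 5, encoding: str = 'cp865') -> bytes:
    # ESC t n selects character code table n. Which table each n maps to
    # varies between printers, so n=5 -> CP865 here is an assumption.
    select_codepage = ESC + b't' + bytes([n])
    # The text must be encoded in the SAME code page the printer was told
    # to use, otherwise æøå come out as the wrong glyphs.
    return select_codepage + text.encode(encoding) + b'\r\n'

payload = escpos_danish('æøå ÆØÅ')
print(payload.hex(' '))   # 1b 74 05, then the CP865 bytes for æøå ÆØÅ
```

The key point is the pairing: sending ESC t alone does nothing visible unless the bytes that follow are encoded in the matching code page.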

ASCII Space Character in Ruby Unicode Escape

Doing some reading and came across this block of code on the topic of Unicode Escapes in Ruby:
money = "\u{20AC 20 A3 20 A5}" # => "€ £ ¥"
I understand that in this Ruby syntax the literal spaces between the {}'s don't output an encoded space (that's the reason for the explicit code point 20), but what I don't understand is why there's a code point 20 at the very beginning of the {}, right after the \u. No space is output at the start of the result, and I copied it verbatim from the book.
It’s not a 20 at the beginning, it’s 20AC, which is the code point for €. The contents of the braces are a space separated list of codepoints (in hex format). 20AC is €, 20 is a space, A3 is £ and A5 is ¥.
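The same decomposition can be checked in Python, since Ruby's \u{...} is just a whitespace-separated list of hex code points:

```python
# The contents of Ruby's \u{20AC 20 A3 20 A5}, as a list of code points
codepoints = [0x20AC, 0x20, 0xA3, 0x20, 0xA5]
money = ''.join(chr(cp) for cp in codepoints)

print(money)                        # € £ ¥
print(money.startswith('\u20ac'))   # True: the string begins with €, not a space
```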

Load testing tool URL encoding system

I have a load testing tool (Borland's SilkPerformer) that is encoding / character as \x252f. Can anyone please tell me what encoding system the tool might be using?
Two different escape sequences have been combined:
a C string hexadecimal escape sequence.
a URL-encoding scheme (percent encoding).
See this little diagram:
+---------> C escape in hexadecimal notation
: +------> hexadecimal number, ASCII for '%'
: : +---> hexadecimal number, ASCII for '/'
\x 25 2f
Explained step for step:
\ starts a C string escape sequence.
x is an escape in hexadecimal notation. Two hex digits are expected to follow.
25 is % in ASCII, see ASCII Table.
% starts a URL encode, also called percent-encoding. Two hex digits are expected to follow.
2f is the slash character (/) in ASCII.
The slash is the result.
Now I don't know why your software chooses to encode the slash character in such a roundabout way. Slash characters in URLs need to be URL-encoded when they don't denote path separators (the role the backslash plays on Windows), so you will often find the slash character encoded as %2f. That's normal. But I find it odd and a bit suspicious that the percent character is additionally encoded as a C string hexadecimal escape sequence.
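Both layers can be peeled off programmatically. A sketch in Python, where the unicode_escape codec handles the C-style \xNN layer and urllib handles the percent layer:

```python
from urllib.parse import unquote

raw = r'\x252f'   # what the tool emitted, as literal text

# Layer 1: interpret the C-style hex escape -> the literal text "%2f"
step1 = raw.encode('ascii').decode('unicode_escape')
print(step1)            # %2f

# Layer 2: percent-decode -> the original slash
print(unquote(step1))   # /
```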

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode: the dash in your input isn't an ASCII hyphen (hex 2D), but a Unicode en dash (hex 2013). Your code is receiving it in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E2 80 93 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
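The same byte list can be inspected in Python to confirm all of this (the Erlang list from the question, decoded as UTF-8):

```python
data = bytes([48, 32, 226, 128, 147, 32, 49])   # the Erlang list from the question

text = data.decode('utf-8')
print(ascii(text))                      # '0 \u2013 1': an en dash, not a hyphen
print('\u2013'.encode('utf-8').hex())   # e28093, i.e. the bytes 226, 128, 147

print(text.split('\u2013'))             # ['0 ', ' 1'], matching re:split's result
```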
