Do all character sets have ASCII in common? - character-encoding

The reason I ask is that there's a "standard" for affix files that says to read the first line of a file, and it will tell you how the file is encoded:
The first line specifies the character set used for both the
wordlist and the affix file (should be all uppercase).
For example:
SET ISO8859-1
That strikes me as being both unreasonable and unreliable, unless all character sets have the 7-bit ASCII range in common, which would allow you to "taste" up to the first newline byte(s): 0xA or 0xD.
But I have no idea if the ASCII range is common to all character sets or not.

No. EBCDIC is non-ASCII based, and is still used in IBM mainframe-based software environments with extreme backwards-compatibility requirements.
More popular are UTF-16 and UTF-32, which although ASCII-based, are backwards-incompatible due to all the extra 00 bytes.
Still, there are only a few ways to encode the Basic Latin alphabet. (What distinguishes most of the hundreds of character encodings that exist is their handling of accented and non-Latin letters.) So, the program that reads these files only needs to handle a few possible ways of encoding the word SET:
53 45 54 for ASCII-based encodings (Windows-1252, UTF-8, etc.)
E2 C5 E3 for EBCDIC-based encodings (if these are considered worth supporting at all)
00 53 00 45 00 54 for UTF-16BE
53 00 45 00 54 00 for UTF-16LE
00 00 00 53 00 00 00 45 00 00 00 54 for UTF-32BE
53 00 00 00 45 00 00 00 54 00 00 00 for UTF-32LE
The decoder could simply look for them all.
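A minimal sketch of that lookup, assuming the file starts directly with SET and carries no byte order mark. The table and the function name sniff_family are made up for illustration; the byte patterns are the six listed above.

```python
# Byte patterns for the word "SET", longest first so a longer pattern
# is never shadowed by a shorter one.
SET_SIGNATURES = [
    (bytes.fromhex("00 00 00 53 00 00 00 45 00 00 00 54"), "UTF-32BE"),
    (bytes.fromhex("53 00 00 00 45 00 00 00 54 00 00 00"), "UTF-32LE"),
    (bytes.fromhex("00 53 00 45 00 54"), "UTF-16BE"),
    (bytes.fromhex("53 00 45 00 54 00"), "UTF-16LE"),
    (bytes.fromhex("53 45 54"), "ASCII-based"),
    (bytes.fromhex("E2 C5 E3"), "EBCDIC-based"),
]


def sniff_family(head: bytes):
    """Return the encoding family suggested by the first bytes, or None."""
    for signature, family in SET_SIGNATURES:
        if head.startswith(signature):
            return family
    return None
```

Once the family is known, the rest of the first line can be decoded with a representative of that family (for example UTF-16LE itself, or plain ASCII) to read the declared character set name.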

Unable to get the correct CRC16 with a given Polynomial

I'm struggling with an old radiation sensor and its communication protocol.
The sensor is event-driven: the master starts the communication with a data transmission or a data request.
Each data telegram uses a CRC16 to check only the variable data block and a CRC8 to check the whole telegram.
My main problem is the CRC16. According to the datasheet, the polynomial used to check the data block is: CRC16 = X^14 + X^12 + X^5 + 1 --> 0x5021 ??
I captured some data with a valid CRC16 and tried to replicate the expected value in order to send my own data transmission, but I can't get the same value.
I'm using the Sunshine CRC calculator and have tried every possible combination with that polynomial.
I also tried CRC RevEng, but got no results.
Here are a few data blocks with the correct CRC16:
Data | CRC16 (MSB LSB)
14 00 00 0A | 1B 84
15 00 00 0C | 15 88
16 00 00 18 | 08 1D
00 00 00 00 | 00 00
00 00 00 01 | 19 D8
00 00 00 02 | 33 B0
01 00 00 00 | 5A DC
08 00 00 00 | c6 c2
10 00 00 00 | 85 95
80 00 00 00 | 0C EC
ff ff ff ff | f3 99
If I send an invalid CRC16 in the telegram, the sensor sends a negative acknowledgement containing the expected value, so I can try any data in order to test or get more examples if needed.
In case it is useful, the sensor uses an 8-bit 8051 microprocessor, and this is an example of a valid CRC8 checked with the Sunshine CRC calculator:
CRC8 = X^8 + X^6 + X^3 + 1 --> 0x49
Input reflected Result reflected
control byte | Data |CRC16 | CRC8
01 0E 01 00 24 2A 06 ff ff ff ff f3 99 |-> 0F
Any help is appreciated !
Looks like a typo in the polynomial. An n-bit CRC polynomial always starts with x^n, like your correct 8-bit polynomial. The 16-bit polynomial should read X^16 + X^12 + X^5 + 1, which in fact is a very common 16-bit CRC polynomial.
To preserve the note in the comment, the four data bytes in the examples are swapped in each pair of bytes, which needs to be undone to get the correct CRC. (The control bytes in the CRC8 example are not swapped.)
So 14 00 00 0a becomes 00 14 0a 00, for which the above-described CRC gives the expected 0x1b84.
I would guess that the CRC is stored in the stream also swapped, so the message as bytes would be 00 14 0a 00 84 1b. That results in a sequence whose total CRC is 0.
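The steps above can be sketched as follows. The thread does not state the remaining CRC parameters, so this assumes the bit-reflected form of X^16 + X^12 + X^5 + 1 (constant 0x8408) with an initial value of 0 and no final XOR, which is what reproduces the sample values from the question. The function names are hypothetical.

```python
def crc16_reflected(data: bytes) -> int:
    """Bitwise CRC-16, polynomial x^16 + x^12 + x^5 + 1, processed LSB first.

    Assumed parameters (not stated in the thread): init 0, no final XOR.
    0x8408 is 0x1021 with its bits reversed.
    """
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x8408
            else:
                crc >>= 1
    return crc


def swap_pairs(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit pair: 14 00 00 0a -> 00 14 0a 00."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes((data[i + 1], data[i]))
    return bytes(out)


def sensor_crc16(block: bytes) -> int:
    """CRC of the data block as listed in the question's table."""
    return crc16_reflected(swap_pairs(block))
```

For the first table row, sensor_crc16(bytes.fromhex("14 00 00 0A")) gives 0x1B84, and running crc16_reflected over 00 14 0a 00 84 1b gives 0, consistent with the guess that the CRC is stored low byte first.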

What is this unusual text being used with loadstring() in Lua?

I have some Lua code which appears to be an attempt to secure the code by obscurity. My understanding of the loadstring() function is a text string is composed of Lua source code text and then converted to executable Lua code by the loadstring() method.
With the following Lua source, I tried to read the contents of the variable code by invoking print on the variable code; while I did see some valid source text in the converted string, a majority of the characters were not displayed (I assume ones with character codes below 40 and above 176). Note that there are some particularly high values in there for ASCII, e.g. 231 is obviously in the extended set, being the trademark sign. Additionally, there are several null characters in there. All this makes me doubt if it is indeed ASCII.
Could someone please tell me if the string is valid Lua source, and how to be able to get Lua to return the string as printable characters so that I can see what this code does?
When I run my version with print in the Lua console on Windows I get many empty boxes; presumably the console can only print pure ASCII?
Note that the code is executed using Lua version 5.0.2
code='\27\76\117\97\80\1\4\4\4\6\8\9\9\8\182\9\147\104\231\245\125\65\12\0\0\0\64\108\117\97\101\109\103\46\108\117\97\0\1\0\0\0\0\0\0\5\23\0\0\0\8\0\0\0\16\0\0\0\17\0\0\0\17\0\0\0\17\0\0\0\17\0\0\0\17\0\0\0\18\0\0\0\18\0\0\0\19\0\0\0\20\0\0\0\21\0\0\0\35\0\0\0\35\0\0\0\26\0\0\0\49\0\0\0\49\0\0\0\37\0\0\0\59\0\0\0\59\0\0\0\54\0\0\0\61\0\0\0\66\0\0\0\2\0\0\0\4\0\0\0\104\52\120\0\1\0\0\0\22\0\0\0\7\0\0\0\77\111\100\117\108\101\0\12\0\0\0\22\0\0\0\0\0\0\0\12\0\0\0\4\13\0\0\0\122\122\97\78\111\100\101\78\97\109\101\115\0\4\6\0\0\0\90\90\65\48\49\0\4\6\0\0\0\90\90\65\48\50\0\4\14\0\0\0\122\122\97\84\101\120\116\90\101\105\108\101\110\0\4\12\0\0\0\122\122\97\80\111\115\105\116\105\111\110\0\3\0\0\0\0\0\0\240\63\4\8\0\0\0\122\122\97\84\101\120\116\0\4\1\0\0\0\0\4\20\0\0\0\122\122\97\67\117\114\114\101\110\116\84\101\120\116\86\97\108\117\101\0\4\9\0\0\0\122\122\97\83\101\116\117\112\0\4\10\0\0\0\122\122\97\83\101\108\101\99\116\0\4\9\0\0\0\122\122\97\82\101\115\101\116\0\4\0\0\0\0\0\0\0\2\0\0\0\0\1\0\7\14\0\0\0\3\0\0\0\3\0\0\0\4\0\0\0\4\0\0\0\4\0\0\0\5\0\0\0\5\0\0\0\5\0\0\0\5\0\0\0\4\0\0\0\5\0\0\0\7\0\0\0\7\0\0\0\8\0\0\0\4\0\0\0\7\0\0\0\115\116\114\116\98\108\0\0\0\0\0\13\0\0\0\16\0\0\0\40\102\111\114\32\103\101\110\101\114\97\116\111\114\41\0\5\0\0\0\11\0\0\0\12\0\0\0\40\102\111\114\32\115\116\97\116\101\41\0\5\0\0\0\11\0\0\0\2\0\0\0\118\0\5\0\0\0\11\0\0\0\0\0\0\0\2\0\0\0\4\7\0\0\0\98\117\102\102\101\114\0\4\1\0\0\0\0\0\0\0\0\14\0\0\0\65\0\0\1\7\0\0\1\0\0\0\1\3\128\1\2\222\0\128\1\5\0\0\4\198\0\0\5\83\1\2\4\7\0\0\4\29\0\0\1\84\254\127\0\5\0\0\1\27\0\1\1\27\128\0\0\0\0\0\0\26\0\0\0\1\1\0\4\18\0\0\0\27\0\0\0\28\0\0\0\28\0\0\0\29\0\0\0\29\0\0\0\30\0\0\0\32\0\0\0\32\0\0\0\32\0\0\0\32\0\0\0\32\0\0\0\33\0\0\0\33\0\0\0\33\0\0\0\33\0\0\0\33\0\0\0\27\0\0\0\35\0\0\0\2\0\0\0\8\0\0\0\122\122\97\70\105\108\101\0\0\0\0\0\17\0\0\0\6\0\0\0\122\101\105\108\101\0\3\0\0\0\16\0\0\0\1\0\0\0\7\0\0\0\77\111\100\117\108\101\0\5\0\0\0\4\5\0\0\0\114\101\97\100\0\0\4\14\0\0\0\122\122\97\84\101\120\
116\90\101\105\108\101\110\0\4\12\0\0\0\122\122\97\80\111\115\105\116\105\111\110\0\3\0\0\0\0\0\0\240\63\0\0\0\0\18\0\0\0\148\3\128\0\139\62\0\1\153\0\1\1\85\128\125\0\20\0\128\0\148\2\128\0\4\0\0\2\6\63\1\2\4\0\0\3\70\191\1\3\73\128\1\2\4\0\0\2\4\0\0\3\70\191\1\3\140\191\1\3\201\128\126\2\212\251\127\0\27\128\0\0\0\0\0\0\37\0\0\0\1\2\0\7\21\0\0\0\39\0\0\0\39\0\0\0\39\0\0\0\39\0\0\0\39\0\0\0\40\0\0\0\40\0\0\0\40\0\0\0\40\0\0\0\43\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\46\0\0\0\49\0\0\0\3\0\0\0\6\0\0\0\118\97\108\117\101\0\0\0\0\0\20\0\0\0\9\0\0\0\110\111\100\101\78\97\109\101\0\0\0\0\0\20\0\0\0\20\0\0\0\122\122\97\83\101\108\101\99\116\101\100\80\111\115\105\116\105\111\110\0\10\0\0\0\20\0\0\0\1\0\0\0\7\0\0\0\77\111\100\117\108\101\0\7\0\0\0\4\8\0\0\0\122\122\97\84\101\120\116\0\4\14\0\0\0\122\122\97\84\101\120\116\90\101\105\108\101\110\0\4\20\0\0\0\122\122\97\67\117\114\114\101\110\116\84\101\120\116\86\97\108\117\101\0\4\5\0\0\0\67\97\108\108\0\4\5\0\0\0\90\90\65\48\0\4\14\0\0\0\58\65\99\116\105\118\97\116\101\78\111\100\101\0\3\0\0\0\0\0\0\240\63\0\0\0\0\21\0\0\0\4\0\0\2\4\0\0\3\198\190\1\3\6\128\1\3\201\0\125\2\4\0\0\2\4\0\0\3\134\190\1\3\201\0\126\2\0\0\0\2\197\0\0\3\1\1\0\4\0\128\0\5\65\1\0\6\147\1\2\4\1\1\0\5\0\0\1\6\147\129\2\5\129\1\0\6\89\0\2\3\27\128\0\0\0\0\0\0\54\0\0\0\1\0\0\4\19\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\56\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\57\0\0\0\59\0\0\0\0\0\0\0\1\0\0\0\7\0\0\0\77\111\100\117\108\101\0\7\0\0\0\4\5\0\0\0\67\97\108\108\0\4\13\0\0\0\122\122\97\78\111\100\101\78\97\109\101\115\0\3\0\0\0\0\0\0\240\63\4\14\0\0\0\58\65\99\116\105\118\97\116\101\78\111\100\101\0\4\4\0\0\0\97\108\108\0\3\0\0\0\0\0\0\0\0\3\0\0\0\0\0\0\0\64\0\0\0\0\19\0\0\0\5\0\0\0\4\0\0\1\198\190\0\1\6\191\0\1\193\0\0\2\147\128\0\1\1\1\0\2\65\1\0\3\89\0\2\0\5\0\0\0\4\0\0\1\198\190\0\1\6\192\0\1\193\0\0\2\147\128\0\1\1\1\0\2\65\1\0\3
\89\0\2\0\27\128\0\0\23\0\0\0\34\0\0\0\202\0\0\1\10\0\1\2\65\0\0\3\129\0\0\4\95\0\0\2\137\0\125\1\10\0\0\2\137\128\126\1\201\63\127\1\73\64\128\1\73\64\129\1\98\0\0\2\0\128\0\0\137\128\129\1\162\0\0\2\0\128\0\0\137\0\130\1\226\0\0\2\0\128\0\0\137\128\130\1\27\0\1\1\27\128\0\0';
return loadstring(code)();
This string is a valid chunk of Lua code precompiled into bytecode. The header says it's for Lua 5.0. It's not text and doesn't need decoding, so it can be run directly with loadstring().
Here are a few more details than Vlad's answer gives, for anyone who may come across this posting.
The Lua loadstring() function accepts a string of characters that is either Lua source text or Lua bytecode. It appears that the function determines which one it has by checking whether the first character of the string is an escape character (0x1b, decimal 27).
The loadstring() function returns an anonymous function so in the code sample:
code='\27\76\117\97\80 ...';
return loadstring(code)();
you have a text string that contains Lua bytecode, as indicated by the leading escape character of \27, and then a call to loadstring() to create a function which is then executed.
The first few characters of the text string contain the precompiled Lua header (see Lua 5.2 Bytecode and Virtual Machine). The length of this header varies depending on the version of Lua. However the first few characters seem to be fairly standard. code='\27\76\117\97\80 ... contains the escape character (0x1b or decimal 27), the capital letter L (decimal 76), the lower case letter u (decimal 117), the lower case letter a (decimal 97), and the Lua version (decimal 80 is 0x50 indicating version 5.0).
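As a rough illustration of that check, here is a sketch in Python rather than Lua. It turns the decimal escapes of such a literal into bytes and inspects the first five. The helper names are made up, and the parser only understands \ddd escapes, which is all the string in the question uses.

```python
import re


def lua_escapes_to_bytes(literal: str) -> bytes:
    """Convert a run of Lua decimal escapes such as \\27\\76 into raw bytes.

    Only \\ddd escapes are handled; any other characters are ignored.
    """
    return bytes(int(digits) for digits in re.findall(r"\\(\d{1,3})", literal))


def describe_chunk(data: bytes) -> str:
    """Say whether the data starts with the precompiled-chunk signature."""
    if data[:4] != b"\x1bLua":
        return "no bytecode signature (would be treated as source text)"
    version = data[4]
    # The version byte packs major and minor as two hex digits: 0x50 -> 5.0.
    return f"precompiled chunk, Lua {version >> 4}.{version & 0x0F}"
```

Calling describe_chunk(lua_escapes_to_bytes(r"\27\76\117\97\80\1\4")) reports a Lua 5.0 chunk, matching the reading of the first five bytes given above.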
The following example is from Lua 5.2 Bytecode and Virtual Machine.
What exactly is in the bytecode? Here is the hexdump of hello.luac
(made by hd on my system).
00000000 1b 4c 75 61 52 00 01 04 04 04 08 00 19 93 0d 0a |.LuaR...........|
00000010 1a 0a 00 00 00 00 00 00 00 00 00 01 04 07 00 00 |................|
00000020 00 01 00 00 00 46 40 40 00 80 00 00 00 c1 80 00 |.....F@@........|
00000030 00 96 c0 00 01 5d 40 00 01 1f 00 80 00 03 00 00 |.....]@.........|
00000040 00 04 06 00 00 00 48 65 6c 6c 6f 00 04 06 00 00 |......Hello.....|
The format is not officially documented, and needs to be
reverse-engineered. The necessary material is in the Lua source code,
of course, in several places, mainly ldump.c and lundump.c. I have
also cross-checked with NFI and LAT, but any remaining errors are
mine.
The code starts with an 18-byte file header, which is the same for all
official Lua 5.2 bytecode compiled on a machine like yours, whether by
luac or load or loadfile. Lua 5.1 only had a 12-byte header, similar
to the first 12 bytes of this one.
Byte numbers are in origin-1 decimal (mostly showing the arithmetic)
and origin-0 hex.
1 x00: 1b 4c 75 61
LUA_SIGNATURE from lua.h.
5 x04: 52 00
Binary-coded decimal 52 for the Lua version, 00 to say the bytecode is
compatible with the "official" PUC-Rio implementation.
5+2 x06: 01 04 04 04 08 00
Six system parameters. On x386 machines they mean:
little-endian, 4-byte integers, 4-byte VM instructions, 4-byte size_t
numbers, 8-byte Lua numbers, floating-point. These parameters must all
match up between the bytecode file and the Lua interpreter, otherwise
the bytecode is invalid.
7+6 x0c: 19 93 0d 0a 1a 0a
Present in all bytecode produced by Lua 5.2 from PUC-Rio. Described in
lundump.h as "data to catch conversion errors". Might be constructed from
binary-coded decimal 1993 (the year it all started), Windows line
terminator, MS-DOS text file terminator, Unix line terminator.
After these 18 bytes come the functions defined in the file. Each function
starts with an 11-byte function header.
13+6 x12: 00 00 00 00
Line number in source code where chunk starts. 0 for the main chunk.
19+4 x16: 00 00 00 00
Line number in source code where chunk stops. 0 for the main chunk.
23+4 x1a: 00 01 04
Number of parameters, vararg flag, number of registers used by this
function (not more than 255, obviously). Local variables are stored in
registers; there may not be more than 200 of them (see lparser.c).
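The 18-byte layout quoted above can be split mechanically. The sketch below is in Python and applies to the Lua 5.2 header from the quoted hexdump, not to the Lua 5.0 string in the question. The function name and dictionary keys are invented for illustration, and the six system parameters are kept as a raw tuple rather than being named individually.

```python
def split_lua52_header(chunk: bytes) -> dict:
    """Split the first 18 bytes of a Lua 5.2 chunk using the quoted offsets."""
    if len(chunk) < 18:
        raise ValueError("need at least 18 bytes")
    return {
        "signature": chunk[0:4],              # x00: 1b 4c 75 61
        "version": chunk[4],                  # x04: 0x52 for Lua 5.2
        "format": chunk[5],                   # x05: 00 for the official format
        "system_params": tuple(chunk[6:12]),  # x06: six system parameters
        "tail": chunk[12:18],                 # x0c: conversion-error check
    }


# First 18 bytes of the hello.luac hexdump quoted above.
HELLO_HEADER = bytes.fromhex("1b 4c 75 61 52 00 01 04 04 04 08 00 19 93 0d 0a 1a 0a")
```

Applied to HELLO_HEADER, the tail comes out as 19 93 followed by CR LF, 1a and LF, which matches the description of the "data to catch conversion errors".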

Why would it be natural to think of a Unicode encoding as an array of 32-bit integers?

I was reading the Python guide about Unicode. In this section, it says:
To summarize the previous section: a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff. This sequence needs to be represented as a set of bytes (meaning, values from 0-255) in memory. The rules for translating a Unicode string into a sequence of bytes are called an encoding.
The first encoding you might think of is an array of 32-bit integers. In this representation, the string “Python” would look like this:
   P           y           t           h           o           n
0x 50 00 00 00 79 00 00 00 74 00 00 00 68 00 00 00 6f 00 00 00 6e 00 00 00
   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Why might we think of 32-bit integers if code points are numbers from 0 to 0x10ffff? Is it perhaps assuming that we are on a 32-bit system?
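One way to look at the numbers involved, as a small sketch rather than a full answer: the largest code point needs 21 bits, so it does not fit in a 16-bit integer, and 32 bits is the next commonly available integer width. The byte layout quoted from the guide is what Python's utf-32-le codec produces. The variable names are only illustrative.

```python
# Bits needed for the largest code point, 0x10FFFF.
bits_needed = (0x10FFFF).bit_length()

# Little-endian 32-bit units for "Python", as in the guide's table.
encoded = "Python".encode("utf-32-le")
as_hex = encoded.hex(" ")  # the separator argument needs Python 3.8+

print(bits_needed)
print(as_hex)
```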

Ruby 1.9.3 Why does "\x03".force_encoding("UTF-8") get \u0003 ,but "\x03".force_encoding("UTF-16") gets "\x03"

Ruby 1.9.3
irb(main):036:0* "\x03".force_encoding("UTF-16")
=> "\x03"
irb(main):040:0* "\x03".force_encoding("UTF-8")
=> "\u0003"
Why does "\x03".force_encoding("UTF-8") give \u0003 while "\x03".force_encoding("UTF-16") ends up as "\x03"? I thought it should be the other way round.
Because the single byte "\x03" is not a valid byte sequence in UTF-16, but it is a valid one in UTF-8 (ASCII 03, ETX, end of text). You have to use at least two bytes to represent a Unicode code point in UTF-16.
That's why "\x03" can be treated as Unicode \u0003 in UTF-8 but not in UTF-16.
To represent "\u0003" in UTF-16, you have to use two bytes, either 00 03 or 03 00, depending on the byte order. That's why we need to specify the byte order in UTF-16. For the big-endian version, the byte sequence should be
FE FF 00 03
For the little-endian, the byte sequence should be
FF FE 03 00
The byte order mark should appear at the beginning of a string, or at the beginning of a file.
Starting from Ruby 1.9, a String is just a byte sequence with a specific encoding as a tag. force_encoding is a method that changes the encoding tag; it does not affect the byte sequence. You can verify that by inspecting "\x03".force_encoding("UTF-8").bytes.
If you see "\u0003", that doesn't mean you have a String represented by the two bytes 00 03, but some byte(s) that represent the Unicode code point 0003 under the specific encoding carried by that String. It may be:
03 //tagged as UTF-8
FE FF 00 03 //tagged as UTF-16
FF FE 03 00 //tagged as UTF-16
03 //tagged as GBK
03 //tagged as ASCII
00 00 FE FF 00 00 00 03 // tagged as UTF-32
FF FE 00 00 03 00 00 00 // tagged as UTF-32
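The same distinction can be checked outside Ruby. The sketch below is a Python analogue, not Ruby's force_encoding: it asks whether a byte string is decodable under a given encoding, and uses the explicit big-endian and little-endian codec names so that no byte order mark is involved. The helper try_decode is a made-up name.

```python
def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A lone 03 byte is a complete UTF-8 sequence but only half a UTF-16 unit.
print(try_decode(b"\x03", "utf-8") == "\u0003")
print(try_decode(b"\x03", "utf-16-be") is None)

# Two bytes in either order make a valid UTF-16 code unit for U+0003.
print(try_decode(b"\x00\x03", "utf-16-be") == "\u0003")
print(try_decode(b"\x03\x00", "utf-16-le") == "\u0003")
```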

4 byte checksum, sum32 algorithm for Epson printers

I'm programming low-level communication with an Epson TM-T88IV thermal printer on a Linux device; the printer accepts only hexadecimal packets. I have read the manual trying to understand how the checksum is built, but I can't manage to recreate it.
The manual says that the checksum is 4 bytes representing the 2-byte sum of all the data in the packet sent.
I currently have four working examples that I found by listening to a port on a Windows computer running a different program. The last 4 bytes are the checksum (03 marks the end of the data and is included in the checksum calculation, according to the manual).
02 AC 00 01 1C 00 00 03 30 30 43 45
02 AC 00 00 1C 80 80 1C 00 00 1C 00 00 1C 03 30 32 32 31
02 AD 07 01 1C 00 00 1C 31 30 03 30 31 35 33
02 AD 00 00 1C 80 80 1C 00 00 1C 00 00 1C 03 30 32 32 32
I have read somewhere that there is a sum32 algorithm, but I can't find any example of it or how to program it.
Wow, this is a bad algorithm! In case someone else ends up trying to understand Epson's terrible low-level communication manual, this is how the checksum is done:
The checksum base is 30 30 30 30.
Sum all the bytes of the data package in hexadecimal (for example, 02+89+00+00+1C+80+80+1C+00+01+1C+09+0C+1C+03 = 214).
Then separate the result digit by digit; if it's a letter, add 1 to the value (for example B2 would be 2|1|4).
Add it to the checksum base digit by digit, starting from right to left (this would be a checksum of 30 32 31 34).
Note: It works perfectly, but for some reason the examples I posted above don't seem to match so well. They are all the printer's responses, but from slightly after it had a hardware problem and had to be reformatted by technical support, so maybe it got fixed.
I hope it helps somebody somewhere.
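One possible reading of the steps above, offered as an assumption rather than something the manual excerpt confirms: the bytes are summed into a 16-bit total and that total is sent as four ASCII hexadecimal digits, so 0x214 becomes 30 32 31 34. Under this reading the hex digits A to F would be sent as the uppercase ASCII letters, as the captured checksum 30 30 43 45 ("00CE") suggests. The function name is hypothetical.

```python
def ascii_hex_checksum(packet: bytes) -> bytes:
    """Sum all packet bytes (including the trailing 03) and return the
    16-bit total as four uppercase ASCII hex digits."""
    total = sum(packet) & 0xFFFF
    return f"{total:04X}".encode("ascii")


# The example from the answer: the bytes sum to 0x214.
example = bytes.fromhex("02 89 00 00 1C 80 80 1C 00 01 1C 09 0C 1C 03")
print(ascii_hex_checksum(example).hex(" "))
```

With this reading, the four packets captured in the question also appear to check out. For instance, 02 AC 00 01 1C 00 00 03 sums to 0xCE, which gives 30 30 43 45.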
