How can I convert a GUID into a byte array in Ruby? - ruby-on-rails

In order to save data traffic we want to send our GUID's as array of bytes instead of as a string (with the use of Google Protocol Buffers).
How can I convert a string representation of a GUID in Ruby to an array of bytes:
Example:
Guid: 35918bc9-196d-40ea-9779-889d79b753f0
=> Result: C9 8B 91 35 6D 19 EA 40 97 79 88 9D 79 B7 53 F0
In .NET this seems to be natively implemented:
http://msdn.microsoft.com/en-us/library/system.guid.tobytearray%28v=vs.110%29.aspx

Your example GUID is in a Microsoft specific format. From Wikipedia:
Other systems, notably Microsoft's marshalling of UUIDs in their COM/OLE libraries, use a mixed-endian format, whereby the first three components of the UUID are little-endian, and the last two are big-endian.
So in order to get that result, we have to move the bits around a little. Specifically, we have to change the endianess of the first three components. Let's start by breaking the GUID string apart:
guid = '35918bc9-196d-40ea-9779-889d79b753f0'
parts = guid.split('-')
#=> ["35918bc9", "196d", "40ea", "9779", "889d79b753f0"]
We can convert these hex-strings to binary via:
mixed_endian = parts.pack('H* H* H* H* H*')
#=> "5\x91\x8B\xC9\x19m#\xEA\x97y\x88\x9Dy\xB7S\xF0"
Next let's swap the first three parts:
big_endian = mixed_endian.unpack('L< S< S< A*').pack('L> S> S> A*')
#=> "\xC9\x8B\x915m\x19\xEA#\x97y\x88\x9Dy\xB7S\xF0"
L denotes a 32-bit unsigned integer (1st component)
S denotes a 16-bit unsigned integer (2nd and 3rd component)
< and > denote little-endian and big-endian, respectively
A* treats the remaining bytes as an arbitrary binary string (we don't have to convert these)
If you prefer an array of bytes instead of a binary string, you'd just use:
big_endian.bytes
#=> [201, 139, 145, 53, 109, 25, 234, 64, 151, 121, 136, 157, 121, 183, 83, 240]
PS: if your actual GUID isn't Microsoft specific, you can skip the swapping part.

Related

The DHT in JPEG does not contain the actual huffman code, how come?

The DHT contains 16 bytes that just contains count of how many values were encoded with huffman code of each length from 1 bit all the way to 16 bits. After this, it contains the actual values that were encoded, all these value are 8 bits in size.
Q: Why is huffman code not stored, how does decoder derive the codes?
Q: If there are say 4 values that have huffman code of 3 bits long, we shall write them as 4 bytes. Does it matter what order they are in or they have to be in ascending or descending order? I do know that the values must be in order such that the values with 1 bit huffman code are then followed by values with 2 bit huffman code e.t.c.
Q: I have used jpegsnoop to look at huffman table of different files, some made in MS paint and some were downloaded. I find that they all have the SAME table.
Here are the DHT entries I got from JPEG snoop:
Destination ID = 1
Class = 1 (AC Table)
Codes of length 01 bits (000 total):
Codes of length 02 bits (002 total): 00 01
Codes of length 03 bits (001 total): 02
Codes of length 04 bits (002 total): 03 11
Codes of length 05 bits (004 total): 04 05 21 31
Codes of length 06 bits (004 total): 06 12 41 51
Codes of length 07 bits (003 total): 07 61 71
Codes of length 08 bits (004 total): 13 22 32 81
Codes of length 09 bits (007 total): 08 14 42 91 A1 B1 C1
Codes of length 10 bits (005 total): 09 23 33 52 F0
Codes of length 11 bits (004 total): 15 62 72 D1
Codes of length 12 bits (004 total): 0A 16 24 34
Codes of length 13 bits (000 total):
Codes of length 14 bits (001 total): E1
Codes of length 15 bits (002 total): 25 F1
Codes of length 16 bits (119 total): 17 18 19 1A 26 27 28 29 2A 35 36 37 38 39 3A 43
44 45 46 47 48 49 4A 53 54 55 56 57 58 59 5A 63
64 65 66 67 68 69 6A 73 74 75 76 77 78 79 7A 82
83 84 85 86 87 88 89 8A 92 93 94 95 96 97 98 99
9A A2 A3 A4 A5 A6 A7 A8 A9 AA B2 B3 B4 B5 B6 B7
B8 B9 BA C2 C3 C4 C5 C6 C7 C8 C9 CA D2 D3 D4 D5
D6 D7 D8 D9 DA E2 E3 E4 E5 E6 E7 E8 E9 EA F2 F3
F4 F5 F6 F7 F8 F9 FA
Total number of codes: 162
And
Destination ID = 1
Class = 0 (DC / Lossless Table)
Codes of length 01 bits (000 total):
Codes of length 02 bits (003 total): 00 01 02
Codes of length 03 bits (001 total): 03
Codes of length 04 bits (001 total): 04
Codes of length 05 bits (001 total): 05
Codes of length 06 bits (001 total): 06
Codes of length 07 bits (001 total): 07
Codes of length 08 bits (001 total): 08
Codes of length 09 bits (001 total): 09
Codes of length 10 bits (001 total): 0A
Codes of length 11 bits (001 total): 0B
Codes of length 12 bits (000 total):
Codes of length 13 bits (000 total):
Codes of length 14 bits (000 total):
Codes of length 15 bits (000 total):
Codes of length 16 bits (000 total):
Total number of codes: 012
The AC table compresses RRRRSSSS that contain zero-run length and AC coefficient magnitude while the DC table compresses SSSS. Thus, I think that the AC table must contain total of 255 entries (exlcuded 0) while the DC table must be 15 entries (excluded 0). However, neither of the tables contain this many total number of codes. WHY?
Q: Why is huffman code not stored, how does decoder derive the codes?
The reason the Huffman tables is defined as they are rather than with the actual codes is that it is much smaller and simpler to encode that way. PNG uses a similar but different method.
Keep in mind that to store the Huffman codes in the JPEG stream you would need to include both the length and the code itself. The result would be much larger than encoding a count of lengths.
Q: If there are say 4 values that have huffman code of 3 bits long, we shall write them as 4 bytes. Does it matter what order they are in or they have to be in ascending or descending order?
If the Huffman code has 3 bits, it is written as three bits to the JPEG stream. The codes are generated in ascending order.
Q: I have used jpegsnoop to look at huffman table of different files, some made in MS paint and some were downloaded. I find that they all have the SAME table.
The encoder is being lazy and using a fixed Huffman table. There is a sample Huffman table in the JPEG standard that they often use. To generate optimal Huffman codes, the encoder must make two passes over the data. With a preset table, the encoder only needs to make one pass.
F.1.2.1.2 and F.1.2.2.1 of the JPEG Specification explain why the Huffman tables are not fully populated. For baseline encoding DC difference values are limited to 11 bits (table F.1) and AC values are limited to 10 bits (table F.2).
Since DC Huffman symbols only need SSSS values from 0 to 11 their Huffman trees need only 12 codes as you've reported.
AC Huffman symbols have a prefix zero run count from 0 to 15. With 11 bit sizes that works out to 16 * 11 = 176 symbols. However, they don't include the symbols 0x10, 0x20, ... 0xE0 because they are redundant. They encode a run of 1, 2, ... 14 zeros followed by a 0 value. If an encoder has, say, 7 zero values followed by a 3 bit value it can encode that as 0x73. There would be no point encoding it with two symbols 0x60;0x03.
Ignoring those 14 useless symbols we end up with 162 codes as you have reported.
By the way, the 0xF0 (ZRL) value is needed because there isn't a symbol that can express a run of 16 zeros followed by a value thus is cannot be merged.
I don't know why the JPEG spec limits the DC and AC values to a certain number of bits. I speculate that the extra precision would have no effect or is typically thrown away by quantization. Or maybe it has to do with the mathematics of the Inverse Discrete Cosine Transform. Keep in mind that these Huffman encoded values are (quantized) coefficients for the IDCT and are only indirectly related to the continuous tone RGB output.
The Huffman encoding is almost fully determined by the relative frequencies of all 256 symbols (except tiebreaker rules). This means you can choose many, many formats to encode those relative frequencies; the most simple one would be to simply store all these frequencies. The receiver can then rebuild the encoding from that order.
Background: the two least frequent characters of a Huffman encoding share the same (long) prefix, and differ only in the last bit. This combination is then assigned a joint frequency (sum of both combinations), which is used recursively to determine the prefix. Finally, you end up with two sets, one holding X characters and the other holding 256-X characters. The first set has prefix 0 and the second set has prefix 1.
Yes, that's arbitrary, you could swap those 0 and 1, and have a similar table and the same compression ratio - a 0 is just as long as a one. That's why you have detailed rules (e.g. most common set gets 0, tiebreaker is first byte in set)
Back to encoding. You want to store these relative frequencies efficiently, as we're using compression here. Now, as I pointed out, when we have codes suffix-0 and suffix-1, they're both equally long (namely the suffix length plus one). So we know from the fact that there are 119 unique 16 bits codes that there are 60 unique prefixes with length 15. Calculating backwards, we also know that there are two unique symbols with length 15, total 62, so there must be 31 prefixes of length 14. We can backtrack again to prefixes of length 1.
Again, it's necessary to point out that we don't know here the exact values of those prefixes, and the matching symbols. This depends on the tiebreaker rules, as pointed out, but those rules are fixed for JPEG.
JPEG does have a bit of a special case: Huffman codes for very rare symbols should be longer than 16 bits. That's inconvenient, so in building the table you don't choose the two sets with the least combined frequency if either of them already has a long suffix - combining those two subsets would just make the suffix even longer. You see this with all the 16-bit codes in the example: most should have had longer codes in a pure Huffman encoding.
I think the worst case is if the most frequent character appears 50%, the next 25%, etc. You'll get codes 0, 10, 110, 1110 etcetera. That's unary counting, which is indeed optimal for that case, but the longest code is now 256 bits. You'd need a document with 2^256 bytes to have a frequency of 2^-256, though.

What's the representation of a set of char in PASCAL?

They ask me to represet a set of char like into "map memory". What chars are in the set? The teacher told us to use ASCII code, into a set of 32 bytes.
A have this example, the set {'A', 'B', 'C'}
(The 7 comes from 0111)
= {00 00 00 00 00 00 00 00 70 00
00 00 00 00 00 00 00 00 00 00
00}
Sets in pascal can be represented in memory with one bit for every element; if the bit is 1, the element is present in the set.
A "set of char" is the set of ascii char, where each element has an ordinal value from 0 to 255 (it should be 127 for ascii, but often this set is extended up to a byte, so there are 256 different characters).
Hence a "set of char" is represented in memory as a block of 32 bytes which contain a total of 256 bits. The character "A" (upper case A) has an ordinal value of 65. The integer division of 65 by 8 (the number of bits a byte can hold) gives 8. So the bit representing "A" in the set resides in the byte number 8. 65 mod 8 gives 1, which is the second bit in that byte.
The byte number 8 will have the second bit ON for the character A (and the third bit for B, and the fourth for C). All the three characters together give the binary representation of 0000.1110 ($0E in hex).
To demonstrate this, I tried the following program with turbo pascal:
var
ms : set of char;
p : array[0..31] of byte absolute ms;
i : integer;
begin
ms := ['A'..'C'];
for i := 0 to 31 do begin
if i mod 8=0 then writeln;
write(i,'=',p[i],' ');
end;
writeln;
end.
The program prints the value of all 32 bytes in the set, thanks to the "absolute" keyword. Other versions of pascal can do it using different methods. Running the program gives this result:
0=0 1=0 2=0 3=0 4=0 5=0 6=0 7=0
8=14 9=0 10=0 11=0 12=0 13=0 14=0 15=0
16=0 17=0 18=0 19=0 20=0 21=0 22=0 23=0
24=0 25=0 26=0 27=0 28=0 29=0 30=0 31=0
where you see that the only byte different than 0 is the byte number 8, and it contains 14 ($0E in hex, 0000.1110). So, your guess (70) is wrong.
That said, I must add that nobody can state this is always true, because a set in pascal is implementation dependent; so your answer could also be right. The representation used by turbo pascal (on dos/windows) is the most logical one, but this does not exclude other possible representations.

Strtofloat / Floattostr conversion

I am having an issue with the StrToFloat routine. I am running Delphi 7 on Windows Vista with the regional format set to German (Austria)
If I run the following code -
DecimalSeparator:='.';
anum:=StrToFloat('50.1123');
edt2.Text:=FloatToStr(anum);
when I convert the string to a float anum becomes 50,1123 and when I convert it back to a sting it becomes '50.1123'
How do it so that when I convert the string to a float the number appears with a decimal point rather than a comma as the decimal separator.
thanks
Colin
You have to appreciate the difference between a floating-point number and a textual representation of it (that is, a string of characters).
A floating-point number, as it is normally stored in a computer (e.g. in a Delphi float variable), does not have a decimal separator. Only a textual representation of it does. If the IDE displays the anum as '50,1123' this simply means that the IDE uses your computer's local regional settings when it creates a textual representation of the number inside the IDE.
In your computer's memory, the value '50.1123' (or, if you prefer, '50,1123'), is only stored using ones and zeroes. In hexadecimal notation, the number is stored as 9F AB AD D8 5F 0E 49 40 and contains no information about how it should be displayed. It is not like you can grab a magnifying glass and direct it to a RAM module to find a tiny, tiny, string '50.1123' (or '50,1123').
Of course, when you want to display the number to the user, you use FloatToStr which takes the number and creates a string of characters out of it. The result can be either '50.1123' or '50,1123', or something else. (In memory, these strings are 35 30 2C 31 31 32 33 and 35 30 2E 31 31 32 33 (ASCII), respectively.)

Assembly Language: Memory Bytes and Offsets

I am confused as to how memory is stored when declaring variables in assembly language. I have this block of sample code:
val1 db 1,2
val2 dw 1,2
val3 db '12'
From my study guide, it says that the total number of bytes required in memory to store the data declared by these three data definitions is 8 bytes (in decimal). How do I go about calculating this?
It also says that the offset into the data segment of val3 is 6 bytes and the hex byte at offset 5 is 00. I'm lost as to how to calculate these bytes and offsets.
Also, reading val1 into memory will produce 0102 but reading val3 into memory produces 3132. Are apostrophes represented by the 3 or where does it come from? How would val2 be read into memory?
You have two bytes, 0x01 and 0x02. That's two bytes so far.
Then you have two words, 0x0001 and 0x0002. That's another four bytes, making six to date.
The you have two more bytes making up the characters of the string '12', which are 0x31 and 0x32 in ASCII (a). That's another two bytes bringing the grand total to eight.
In little-endian format (which is what you're looking at here based on the memory values your question states), they're stored as:
offset value
------ -----
0 0x01
1 0x02
2 0x01
3 0x00
4 0x02
5 0x00
6 0x31
7 0x32
(a) The character set you're using in this case is the ASCII one (you can follow that link for a table describing all the characters in that set).
The byte values 0x30 thru 0x39 are the digits 0 thru 9, just as the bytes 0x41 thru 0x5A represent the upper-case alpha characters. The pseudo-op:
db '12'
is saying to insert the bytes for the characters '1' and '2'.
Similarly:
db 'Pax is a really cool guy',0
would give you the hex-dump representation:
addr +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +A +B +C +D +E +F +0123456789ABCDEF
0000 50 61 78 20 69 73 20 61 20 72 65 61 6C 6C 79 20 Pax is a really
0010 63 6F 6F 6C 20 67 75 79 00 cool guy.
val1 is two consecutive bytes, 1 and 2. db means "direct byte". val2 is two consecutive words, i.e. 4 bytes, again 1 and 2. in memory they will be 1, 0, 2, 0, assuming you're on a big endian machine. val3 is a two bytes string. 31 and 32 in are 49 and 50 in hexadecimal notation, they are ASCII codes for the characters "1" and "2".

Addressing memory data in 32 bit protected mode with nasm

So my book says i can define a table of words like so:
table: dw "13,37,99,99"
and that i can snatch values from the table by incrementing the index into the address of the table like so:
mov ax, [table+2] ; should give me 37
but instead it places 0x2c33 in ax rather than 0x3337
is this because of a difference in system architecture? maybe because the book is for 386 and i'm running 686?
0x2C is a comma , and 0x33 is the character 3, and they appear at positions 2 and 3 in your string, as expected. (I'm a little confused as to what you were expecting, since you first say "should give me 37" and later say "rather than 0x3337".)
You have defined a string constant when I suspect that you didn't mean to. The following:
dw "13,37,99,99"
Will produce the following output:
Offset 00 01 02 03 04 05 06 07 08 09 0A 0B
31 33 2C 33 37 2C 39 39 2C 39 39 00
Why? Because:
31 is the ASCII code for '1'
33 is the ASCII code for '3'
2C is the ASCII code for ','
...
39 is the ASCII code for '9'
NASM also null-terminates your string by putting 0 byte at the end (If you don't want your strings to be null-terminated use single quotes instead, '13,37,99,99')
Take into account that ax holds two bytes and it should be fairly clear why ax contains 0x2C33.
I suspect what you wanted was more along the lines of this (no quotes and we use db to indicate we are declaring byte-sized data instead of dw that declares word-sized data):
db 13,37,99,99
This would still give you 0x6363 (ax holds two bytes / conversion of 99, 99 to hex). Not sure where you got 0x3337 from.
I recommend that you install yourself a hex editor and have an experiment inspecting the output from NASM.

Resources