MNIST database parsing in C

I am trying to parse the MNIST database of handwritten digits. However, the values fread gives me aren't right. I have changed the endianness, but the numerical values are still not correct. The database is here: http://yann.lecun.com/exdb/mnist/
int ChangeEndianness(int value) {
    int result = 0;
    result |= (value & 0x000000FF) << 24;
    result |= (value & 0x0000FF00) << 8;
    result |= (value & 0x00FF0000) >> 8;
    result |= (value & 0xFF000000) >> 24;
    return result;
}
FILE *imageTestFiles = fopen("train-images-idx3-ubyte.gz","r");
if(imageTestFiles == NULL) {
    perror("File Not Found");
}
int magic_number_bytes;
fread(&magic_number_bytes, sizeof(int), 1, imageTestFiles);
printf("%d\n", ChangeEndianness(magic_number_bytes));
All this is supposed to do is print the "magic number", which for the images file is 2051 or 0x00000803, but it instead prints 529205256, which is 0x1F8B0808. I am sort of new to C; I have always used Java before. Thanks in advance!

The file must first be decompressed; simply removing the .gz extension is not enough.
You can tell the code is operating on a compressed file because 0x1F8B is the magic number of the gzip file format.
If you use xxd to display the file contents right after downloading, you get the observed 0x1F8B0808:
$ xxd -p train-images-idx3-ubyte.gz | head -c 8
1f8b0808
However, if you decompress the file:
$ gunzip train-images-idx3-ubyte.gz
$ xxd -p train-images-idx3-ubyte | head -c 8
00000803
you get the expected magic number for the MNIST data.
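A minimal sketch of reading the decompressed header in C might look like this (note the "rb" mode, and that the byte swap is done on an unsigned type so sign extension cannot creep in):
#include <stdio.h>
#include <stdint.h>

/* Big-endian to host order; same idea as ChangeEndianness above,
   but on an unsigned type so no sign extension can occur. */
static uint32_t be32_to_host(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

int main(void) {
    /* Open the *decompressed* file, and in binary mode ("rb"). */
    FILE *f = fopen("train-images-idx3-ubyte", "rb");
    if (f == NULL) { perror("fopen"); return 1; }

    uint32_t header[4]; /* magic, image count, rows, columns */
    if (fread(header, sizeof(uint32_t), 4, f) != 4) {
        fprintf(stderr, "short read\n");
        fclose(f);
        return 1;
    }
    printf("magic:  %u\n", be32_to_host(header[0])); /* expect 2051 */
    printf("images: %u\n", be32_to_host(header[1]));
    printf("rows:   %u\n", be32_to_host(header[2]));
    printf("cols:   %u\n", be32_to_host(header[3]));
    fclose(f);
    return 0;
}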

Related

code snippet lack of understanding

I am wrestling with an Objective-C code snippet I need to convert.
One of the functions is as follows:
+ (float)calcTemp:(NSData *)data {
    char scratchVal[data.length];
    [data getBytes:&scratchVal length:data.length];
    UInt16 temp;
    temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
    return (float)temp;
}
This line I just can't seem to grasp:
temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
I know it's probably a novice question (but I am a noob); if someone could explain that line to me I would greatly appreciate it, in particular the address references and the operator uses.
In the code snippet I don't see why they call the getBytes:length: method on data, since its result doesn't seem to be used anywhere. But mainly, I'm just trying to understand the line that I pointed out.
The line
temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
is creating an unsigned 16-bit integer value from two bytes originating in scratchVal. A single & in this context is not the address-of operator but bitwise AND. The lower byte of temp is set from the first byte contained in scratchVal, and the upper byte of temp is set by left-shifting the second byte contained in scratchVal. The two resulting numbers are joined together using bitwise OR |. The masks 0xff and 0xff00 guard against sign extension and other unwanted bits, ensuring all undesirables are zero.
Presented visually, if scratchVal contains the bits aaaaaaaa bbbbbbbb in the first two bytes then temp will end up as an unsigned integer with the bit pattern bbbbbbbbaaaaaaaa.
The second question asked why they're calling -getBytes:length:. The line
[data getBytes:&scratchVal length:data.length];
reads the bytes from data into the scratchVal temporary buffer.
In response to the question in the comment
why it is needed to left shift the bits to concatenate them
A simple assignment won't work. Assuming again that scratchVal is a char buffer containing the bits aaaaaaaa bbbbbbbb, the code
temp = scratchVal[0];
would make temp equal to the UInt16 equivalent of the bits aaaaaaaa. You can't use addition because the result will be whatever value comes from adding the two bytes together (aaaaaaaa + bbbbbbbb).
Using real numbers as an example, suppose the first two bytes of scratchVal are equal to 0x7f 0x7f.
temp = scratchVal[0] + scratchVal[1];
turns out to be 0x7f + 0x7f = 0xfe, which is not what this code is meant to produce.
Building the value using OR can be better understood by breaking it down into steps.
The first part of the expression is scratchVal[0] & 0xff = 0x7f & 0xff = 0x7f
The second part is (scratchVal[1] << 8) & 0xff00 = (0x7f << 8) & 0xff00 = 0x7f00 & 0xff00 = 0x7f00
The final result in this case is 0x7f | 0x7f00 = 0x7f7f.
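For a quick experiment, the same combine can be reproduced as a few lines of plain C (a sketch of the worked example above, not the original Objective-C):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    char scratchVal[2] = { 0x7f, 0x7f }; /* the example bytes */

    /* Low byte from scratchVal[0], high byte from scratchVal[1]. */
    uint16_t temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);

    printf("0x%04x\n", temp); /* prints 0x7f7f */
    return 0;
}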

How do you convert 8-bit bytes to 6-bit characters?

I have a specific requirement to convert a stream of bytes into a character encoding that happens to be 6-bits per character.
Here's an example:
Input: 0x50 0x11 0xa0
Character Table:
010100 T
000001 A
000110 F
100000 SPACE
Output: "TAF "
Logically I can understand how this works:
Taking 0x50 0x11 0xa0 and showing it as binary:
01010000 00010001 10100000
regrouped into 6-bit units this is 010100 000001 000110 100000, which is "TAF ".
What's the best way to do this programmatically (pseudo-code or C++)? Thank you!
Well, every 3 bytes, you end up with four characters. So for one thing, you need to work out what to do if the input isn't a multiple of three bytes. (Does it have padding of some kind, like base64?)
Then I'd probably take each 3 bytes in turn. In C#, which is close enough to pseudo-code for C :)
for (int i = 0; i < array.Length; i += 3)
{
    // Top 6 bits of byte i
    int value1 = array[i] >> 2;
    // Bottom 2 bits of byte i, top 4 bits of byte i+1
    int value2 = ((array[i] & 0x3) << 4) | (array[i + 1] >> 4);
    // Bottom 4 bits of byte i+1, top 2 bits of byte i+2
    int value3 = ((array[i + 1] & 0xf) << 2) | (array[i + 2] >> 6);
    // Bottom 6 bits of byte i+2
    int value4 = array[i + 2] & 0x3f;

    // Now use value1...value4, e.g. putting them into a char array.
    // You'll need to decode from the 6-bit number (0-63) to the character.
}
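A rough C translation of the same loop, decoding the question's example input (only the four codes from the example table are filled in; a real decoder would define all 64 entries):
#include <stdio.h>

int main(void) {
    /* Partial character table, just the codes from the example. */
    char table[64] = {0};
    table[0x14] = 'T';  /* 010100 */
    table[0x01] = 'A';  /* 000001 */
    table[0x06] = 'F';  /* 000110 */
    table[0x20] = ' ';  /* 100000 */

    unsigned char in[] = { 0x50, 0x11, 0xa0 };
    for (size_t i = 0; i + 2 < sizeof in; i += 3) {
        putchar(table[in[i] >> 2]);                               /* top 6 of byte i */
        putchar(table[((in[i] & 0x03) << 4) | (in[i + 1] >> 4)]); /* 2 + 4 bits */
        putchar(table[((in[i + 1] & 0x0f) << 2) | (in[i + 2] >> 6)]); /* 4 + 2 bits */
        putchar(table[in[i + 2] & 0x3f]);                         /* bottom 6 of byte i+2 */
    }
    putchar('\n'); /* prints "TAF " */
    return 0;
}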
Just in case someone is interested, here is another variant that extracts 6-bit numbers from the stream as soon as they appear there. That is, results can be obtained even if fewer than 3 bytes have been read so far, which is useful for unpadded streams.
The code carries state between reads: the accumulator a holds leftover bits, and n stores how many bits from the previous byte are still waiting in the accumulator.
int n = 0;
unsigned char a = 0;
unsigned char b = 0;

/* read_byte() and store_6bit() are assumed to be provided by the caller */
while (read_byte(&b)) {
    /* save (6-n) most significant bits of input byte to proper position
       in accumulator */
    a |= (b >> (n + 2)) & (077 >> n);
    store_6bit(a);
    a = 0;

    /* save remaining least significant bits of input byte to proper
       position in accumulator */
    a |= (b << (4 - n)) & ((077 << (4 - n)) & 077);
    if (n == 4) {
        store_6bit(a);
        a = 0;
    }
    n = (n + 2) % 6;
}

How to solve a Zlib adler32 rolling checksum problem?

I am using the adler32 function from zlib to calculate the weak checksum of a chunk of memory x (4096 bytes in length). Everything is fine, but now I would like to compute a rolling checksum when chunks from different files do not match. However, I am not sure how to write a function that does this on the value returned by zlib's adler32. So if the checksum does not match, how do I calculate the rolling checksum from the original checksum, the byte at x + 1, and the byte at x + 4096 + 1? Basically, I am trying to build an rsync implementation.
Pysync has implemented rolling on top of zlib's Adler32 like this (Python 2 code, hence the ord() calls on string characters):
_BASE = 65521  # largest prime smaller than 65536
_NMAX = 5552   # largest n such that 255n(n+1)/2 + (n+1)(BASE-1) <= 2^32-1
_OFFS = 1      # default initial s1 offset

import zlib

class adler32:
    def __init__(self, data=''):
        value = zlib.adler32(data, _OFFS)
        self.s2, self.s1 = (value >> 16) & 0xffff, value & 0xffff
        self.count = len(data)

    def update(self, data):
        value = zlib.adler32(data, (self.s2 << 16) | self.s1)
        self.s2, self.s1 = (value >> 16) & 0xffff, value & 0xffff
        self.count = self.count + len(data)

    def rotate(self, x1, xn):
        x1, xn = ord(x1), ord(xn)
        self.s1 = (self.s1 - x1 + xn) % _BASE
        self.s2 = (self.s2 - self.count * x1 + self.s1 - _OFFS) % _BASE

    def digest(self):
        return (self.s2 << 16) | self.s1

    def copy(self):
        n = adler32()
        n.count, n.s1, n.s2 = self.count, self.s1, self.s2
        return n
But as Peter stated, rsync does not use Adler32 directly, but a faster variant of it.
The code of the rsync tool is a bit hard to read, but check out librsync. It is a completely separate project and much more readable. Take a look at rollsum.c and rollsum.h, which contain an efficient implementation of the variant in C macros:
/* the Rollsum struct type */
typedef struct _Rollsum {
    unsigned long count;  /* count of bytes included in sum */
    unsigned long s1;     /* s1 part of sum */
    unsigned long s2;     /* s2 part of sum */
} Rollsum;

#define ROLLSUM_CHAR_OFFSET 31

#define RollsumInit(sum) { \
    (sum)->count = (sum)->s1 = (sum)->s2 = 0; \
}

#define RollsumRotate(sum, out, in) { \
    (sum)->s1 += (unsigned char)(in) - (unsigned char)(out); \
    (sum)->s2 += (sum)->s1 - (sum)->count * ((unsigned char)(out) + ROLLSUM_CHAR_OFFSET); \
}

#define RollsumRollin(sum, c) { \
    (sum)->s1 += ((unsigned char)(c) + ROLLSUM_CHAR_OFFSET); \
    (sum)->s2 += (sum)->s1; \
    (sum)->count++; \
}

#define RollsumRollout(sum, c) { \
    (sum)->s1 -= ((unsigned char)(c) + ROLLSUM_CHAR_OFFSET); \
    (sum)->s2 -= (sum)->count * ((unsigned char)(c) + ROLLSUM_CHAR_OFFSET); \
    (sum)->count--; \
}

#define RollsumDigest(sum) (((sum)->s2 << 16) | ((sum)->s1 & 0xffff))
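A small usage sketch (mine, not from librsync) shows how the macros fit together; it assumes the typedef and macros above are in scope and that len >= 4096:
#include <stddef.h>
#include <stdio.h>

/* Requires the Rollsum typedef and macros shown above. */
void roll_over_buffer(const unsigned char *buf, size_t len) {
    Rollsum sum;
    RollsumInit(&sum);

    /* Fill the first 4096-byte window one byte at a time. */
    for (size_t i = 0; i < 4096; i++)
        RollsumRollin(&sum, buf[i]);
    printf("window at 0: %08lx\n", RollsumDigest(&sum));

    /* Slide: drop the oldest byte and add the newest in one step. */
    for (size_t i = 4096; i < len; i++) {
        RollsumRotate(&sum, buf[i - 4096], buf[i]);
        printf("window at %zu: %08lx\n", i - 4095, RollsumDigest(&sum));
    }
}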

How do I create an FCS for PPP packets?

I am trying to create a software simulation on an Ubuntu GNU/Linux machine which will work like PPPoE. I would like this simulator to take outgoing packets, strip off the ethernet header, insert the PPP flags (7E, FF, 03, 00, and 21) and place the IP layer information in the PPP packet. I am having trouble with the FCS that goes after the data. From what I can tell, the cell modem I am using has a 2 byte FCS using the CRC16-CCITT method. I have found several pieces of software that will calculate this checksum, but none of them produce what is coming out the serial line (I have a serial line "sniffer" that shows me everything the modem is being sent).
I have been looking into the source of pppd and the linux kernel itself, and I can see that both of them have a method of adding an FCS to the data. It seems quite difficult to implement, as I have no experience in kernel hacking. Can someone come up with a simple way (preferably in Python) of calculating an FCS that matches the one that the kernel produces?
Thanks.
P.S. If anyone wants, I can add a sample of the data output I am getting to the serial modem.
This uses the simple Python library crcmod:
import crcmod  # pip3 install crcmod

fcsData = "A0 19 03 61 DC"
fcsData = ''.join(fcsData.split(' '))
print(fcsData)

crc16 = crcmod.mkCrcFun(0x11021, rev=True, initCrc=0x0000, xorOut=0xFFFF)
print(hex(crc16(bytes.fromhex(fcsData))))
fcs = hex(crc16(bytes.fromhex(fcsData)))
I recently did something like this while testing code to kill a PPP connection.
This worked for me:
# RFC 1662 Appendix C
def mkfcstab():
    P = 0x8408

    def valiter():
        for b in range(256):
            v = b
            i = 8
            while i:
                v = (v >> 1) ^ P if v & 1 else v >> 1
                i -= 1
            yield v & 0xFFFF

    return tuple(valiter())

fcstab = mkfcstab()

PPPINITFCS16 = 0xffff  # Initial FCS value
PPPGOODFCS16 = 0xf0b8  # Good final FCS value

def pppfcs16(fcs, bytelist):
    for b in bytelist:
        fcs = (fcs >> 8) ^ fcstab[(fcs ^ b) & 0xff]
    return fcs
To get the value:
fcs = pppfcs16(PPPINITFCS16, (ord(c) for c in frame)) ^ 0xFFFF
and swap the bytes (I used chr((fcs & 0xFF00) >> 8), chr(fcs & 0x00FF))
Got this from mbed.org PPP-Blinky:
// http://www.sunshine2k.de/coding/javascript/crc/crc_js.html - correctly calculates
// the 16-bit FCS (CRC) on our frames (choose CRC16_CCITT_FALSE)
int crc;

void crcReset()
{
    crc = 0xffff;  // crc restart
}

void crcDo(int x)  // cumulative crc
{
    for (int i = 0; i < 8; i++) {
        crc = ((crc & 1) ^ (x & 1)) ? (crc >> 1) ^ 0x8408 : crc >> 1;  // crc calculator
        x >>= 1;
    }
}

int crcBuf(char *buf, int size)  // crc on an entire block of memory
{
    crcReset();
    for (int i = 0; i < size; i++)
        crcDo(*buf++);
    return crc;
}
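As a rough sketch of how crcBuf would be used to produce the two FCS bytes that go on the wire (RFC 1662 complements the final CRC and transmits the least significant octet first; the sample bytes below are simply borrowed from the crcmod example above):
#include <stdio.h>

/* Requires crc, crcReset, crcDo and crcBuf from the block above. */

int main(void)
{
    char frame[] = { (char)0xA0, 0x19, 0x03, 0x61, (char)0xDC };
    int fcs = crcBuf(frame, (int)sizeof frame) ^ 0xFFFF;  /* complement per RFC 1662 */

    /* PPP transmits the FCS least significant octet first. */
    printf("FCS bytes on the wire: %02X %02X\n", fcs & 0xFF, (fcs >> 8) & 0xFF);
    return 0;
}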

How to use some text processing (awk etc.) to put some character in a text file at certain lines

I have a text file which has hex values, one value on each line, with many such values one below another. I need to do some analysis of the values, for which I need to put some kind of delimiter/marker, say a '#', in this file before line numbers 32, 47, 62, 77, and so on; the difference between two consecutive line numbers in this pattern is always 15.
I am trying to do it using awk. I tried a few things but they didn't work.
What is the awk command to do it?
Any other solution involving some other language/script/tool is also welcome.
Thank you.
-AD
This is how you can use AWK for it:
awk 'BEGIN{ i=0; } \
{ if (FNR < 32) { print $0 } \
  else { i++; if (i % 15 == 1) { printf "#%s\n", $0 } else { print $0 } } \
}' inputfile.txt > outputfile.txt
How it works:
BEGIN initializes the counter before any input is read
FNR < 32 passes the first 31 records through unchanged (input lines are called records, and FNR is the AWK variable that counts them)
From record 32 onward the counter i increments, and i % 15 == 1 prefixes a # on record 32 and on every 15th record after it (47, 62, ...)
$0 prints the record (the line) as is
You can type the whole thing, whitespace and all, on a single command line, skipping the trailing '\' continuations.
Or, you can use it as an AWK file:
# File: comment.awk
BEGIN { i = 0 }
{
    if (FNR < 32) {
        print $0
    } else {
        i++
        if (i % 15 == 1) {
            printf "#%s\n", $0
        } else {
            print $0
        }
    }
}
And run it as:
awk -f comment.awk inputfile.txt > outputfile.txt
Hope this helps you get more out of AWK.
Python:
f_in = open("file.txt")
f_out = open("file_out.txt", "w")

offset = 4  # 0 <= offset < 15; first marker after the fourth line in this example
for num, line in enumerate(f_in):
    if not (num - offset) % 15:
        f_out.write("#\n")
    f_out.write(line)
Haskell:
offset = 31
chunk_size = 15

main = do
    (h, t) <- fmap (splitAt offset . lines) getContents
    mapM_ putStrLn h
    mapM_ ((putStrLn "#" >>) . mapM_ putStrLn) $
        map (take chunk_size) $
        takeWhile (not . null) $
        iterate (drop chunk_size) t
