Importing MNIST dataset with Fortran

A Linux/GFortran question.
I know exactly what my problem is but I can't figure out how to solve it...
I want to import the MNIST dataset images and labels into Fortran arrays to play around with Machine Learning algorithms using Fortran. I've done this with Python but I can't replicate reading the data files with Fortran.
The dataset files and file layout descriptions are at:
http://yann.lecun.com/exdb/mnist/
The 2 problems I'm struggling with are...
1) The data in the files is stored as unsigned bytes, and I can't find an equivalent datatype in Fortran. I'm using integer(kind=1) to read the first 4 bytes successfully, which constitute the file's magic number, but I'm worried about incorrectly reading the value of one of these bytes into the signed integer(kind=1) datatype (see the first sketch below).
2) The data is stored in big-endian format, so when I read the number of images, rows and columns, which are stored as 4-byte integers, on my little-endian machine I get the obvious gobbledegook. Ideally, I would like to be able to specify the endianness of a variable read from a file in an edit descriptor. Is this possible? (See the second sketch below.)
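For problem 1, a common workaround is to read each byte into integer(kind=1) exactly as described, then mask it into a wider integer so the bit pattern is reinterpreted as the 0-255 value it represents. A minimal sketch (the helper name to_unsigned is made up here, and it would need to live in a module or a CONTAINS section so its interface is explicit):

    ! Hypothetical helper: reinterpret a signed 1-byte integer read from the
    ! file as the unsigned 0..255 value it actually encodes.
    elemental function to_unsigned(b) result(u)
      integer(kind=1), intent(in) :: b
      integer :: u
      u = iand(int(b), 255)   ! e.g. a byte read as -1 maps to 255
    end function to_unsigned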
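For problem 2, there is no edit descriptor for endianness, but since this is GFortran the byte swapping can be requested when the file is opened, via the CONVERT= specifier on OPEN (or the -fconvert=big-endian compile flag). A minimal sketch, assuming the usual train-images-idx3-ubyte file name, of reading the header and the first image this way:

    program read_mnist_header
      implicit none
      integer(kind=4) :: magic, n_images, n_rows, n_cols
      integer(kind=1), allocatable :: pixels(:)
      integer :: u

      ! Unformatted stream access; GFortran's CONVERT= extension does the
      ! big-endian -> little-endian swap for each item read.
      open(newunit=u, file='train-images-idx3-ubyte', access='stream', &
           form='unformatted', convert='big_endian', status='old')

      read(u) magic, n_images, n_rows, n_cols   ! four big-endian 4-byte integers
      print *, 'magic  =', magic                ! 2051 is expected for the image file
      print *, 'images =', n_images, ' rows =', n_rows, ' cols =', n_cols

      ! Pixel data is one unsigned byte per pixel; read it as integer(kind=1)
      ! and widen it with a helper such as to_unsigned above when using it.
      allocate(pixels(n_rows * n_cols))
      read(u) pixels                            ! first image
      close(u)
    end program read_mnist_header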
Any assistance would be much appreciated.
Kind regards

Related

Best way to store processed text data for streaming to gensim?

I've got several hundred pandas data frames, each of which has a column of very long strings that need to be processed/sentencized and finally tokenized before modeling with word2vec.
I can store them in any format on the disk, before I build a stream to pass them to gensim's word2vec function.
What format would be best, and why? The most important criterion would be performance vis-a-vis training (which will take many days), but coherent structure to the filesystem would also be nice.
Would it be crazy to store several million or maybe even a few billion text files containing one sentence each? Or perhaps some sort of database? If this were numerical data I'd use HDF5, but it's text. The cleanest would be to store them in the original data frames, but that seems less ideal from an I/O perspective, because I'd have to load each (largish) data frame every epoch.
What makes the most sense here?
As you do your preprocessing/tokenization of all the source data that you want to be part of a single training session, append the results to a single plain-text file.
Use space-separated words, and end each 'sentence' (or any other useful text-chunk that's less than 10,000 words long) with a newline.
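For example, the resulting corpus file would simply look like this (made-up sentences, purely to illustrate the layout: one sentence per line, tokens separated by single spaces):

    the quick brown fox jumps over the lazy dog
    a second already tokenized sentence goes on the next line
    and so on for every sentence in every data frame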
Then you can use the corpus_file option to specify your pre-tokenized training data, and you will get the maximum possible multithreading benefit. (That mode directs each thread to open its own view into a range of the single file, so there is no blocking on any distributor thread.)

GLTF validation error MESH_PRIMITIVE_ACCESSOR_WITHOUT_BYTESTRIDE

So I'm working on an OBJ to glTF 2.0 converter, and for simplicity I've decided to use one file for every kind of buffer: I have positions.bin (vertices), indices.bin, Normals.bin and Uvs.bin. The exported files open in the Windows 10 viewer, but the glTF validator prints a bunch of MESH_PRIMITIVE_ACCESSOR_WITHOUT_BYTESTRIDE errors.
The file is structured so that every binary buffer file has just one buffer view and many accessors with offsets (one for each face).
Am I doing something wrong, or is the validator not working as expected? My data is tightly packed, so I see no reason to have a byteStride...
I don't have any hosting, so I'm using WeTransfer here, sorry for that.
Example file
This question has been answered here: https://github.com/KhronosGroup/glTF/issues/1198
To sum up the explanation: the byteStride can be deduced by the software that reads the glTF as long as the bufferView isn't shared among accessors. Tightly packed data still has a byteStride; it just happens to be equal to the size of a single element, and it MUST be specified when it can't be deduced.
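For instance (an illustrative fragment with made-up sizes, not taken from the file above; componentType 5126 is FLOAT and target 34962 is ARRAY_BUFFER), a bufferView of tightly packed VEC3 float positions that is shared by two accessors would declare the byteStride explicitly, here 3 floats x 4 bytes = 12:

    "bufferViews": [
      { "buffer": 0, "byteOffset": 0, "byteLength": 72, "byteStride": 12, "target": 34962 }
    ],
    "accessors": [
      { "bufferView": 0, "byteOffset": 0,  "componentType": 5126, "count": 3, "type": "VEC3" },
      { "bufferView": 0, "byteOffset": 36, "componentType": 5126, "count": 3, "type": "VEC3" }
    ]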

Reading TIFF files

I need to read and interpret a binary file containing a TIFF image. I know readers exist for doing this, but I want to go the hard way. I found the TIFF format description and need to parse the binary file in small chunks. Assume I was able to read the complete binary file into memory, so I have a variable containing one long list of bytes.
I know via the format definition what the meaning is of the different groups of n bytes.
How can one define character variables with different lengths (sometimes 2, sometimes 3, sometimes 4, etc.) so that each variable's address points to the right position in the image variable array?
In other words, assume my image is loaded into an array Image containing all bytes of the file.
I want to map the first 2 bytes onto a string of length 2, so that by pointing its address at the first position of the Image array those 2 bytes are automatically associated with the first character string. A second string of 4 bytes would have another meaning, so I would make its address point to the 3rd position of the Image array.
Is this feasible in C++? I remember this being a normal way of working for dynamic memory allocation in Fortran 77, in a simulation code I analysed a long time ago.
Thanks in advance for the hints!
Regards,
Stefan
The C++ language is easily capable of processing TIFF files from a byte array. The idea you have in mind is basically correct, but there are a few problems with it. C strings are zero-terminated, and the strings which appear in TIFF files are not necessarily zero-terminated, since their length is specified explicitly. It really is simpler to create a dedicated data structure to hold the TIFF-specific data fields and then parse the binary data into that structure. Your method will also immediately run into trouble with the Motorola/Intel byte-order issue if your machine has the opposite endianness to the file.

Can HDF5 perform "value mapping"?

If I have a 32^3 array of 64-bit integers, but it contains only a dozen different values, can you tell HDF5 to use an "internal mapping" to save memory and/or disk space? What I mean is that the array would be accessed normally with 64-bit ints, but each value would internally be stored as a byte (?) index into a table of 64-bit ints, potentially saving about 7/8 of the memory and/or disk space. If this is possible, does it actually save memory, disk space, or both?
I don't believe that HDF5 provides this functionality right out of the box, but there is no reason why you couldn't implement routines to write your data to an HDF5 file and read it back again in the way that you seem to want. I suppose you could write your look-up table and your array into different datasets.
It's possible, though I have no evidence to indicate it, that HDF5's compression facility would compress your integer dataset enough to save a useful amount of space.
Then again, for the HDF5 files I work with (tens of GBs) I wouldn't bother to devise my own encoding scheme to save the modest amount of space that a 32768-element array of 64-bit numbers might be able to dispense with. Sure, you could transform a dataset of 2097152 bits into one of 131072 (using 4-bit indices, since a dozen distinct values fit in 4 bits), but disk space (even RAM) just isn't that tight these days.
I'm beginning to form the impression that you are trying to use HDF5 on, perhaps, a smartphone :-)

Comparing two text files of almost 3GB using MapReduce (Cloudera Hadoop 0.20.2)

I'm trying to do the following in Hadoop MapReduce (written in Java, on a Linux OS).
Text files 'rules-1' and 'rules-2' (3GB in total) contain rules, each rule separated by an end-of-line character, so the files can be read using the readLine() function.
These files 'rules-1' and 'rules-2' need to be imported as a whole from HDFS in every map function in my cluster, i.e. these files are not splittable across different map functions.
The input to the mapper's map function is a text file called 'record' (each line terminated by an end-of-line character), so from the 'record' file we get the (key, value) pairs. This file is splittable and can be given as input to the different map functions used in the whole MapReduce process.
What needs to be done is to compare each value (i.e. each line from the record file) with the rules inside 'rules-1' and 'rules-2'.
The problem is, if I pull every line of the rules-1 and rules-2 files into a static ArrayList only once, so that each mapper can share the same ArrayList and compare its elements with each input value from the record file, I get a memory overflow error, since 3GB cannot be stored in the ArrayList at once.
Alternatively, if I import only a few lines from the rules-1 and rules-2 files at a time and compare them to each value, the MapReduce job takes a long time to finish.
Could you provide me with any alternative ideas for how this can be done without the memory overflow error? Would it help if I put those rules files inside an HDFS-backed database or something? I'm running out of ideas, and would really appreciate your suggestions.
If your input files are small, you can load them into static variables and use the rules as the input.
If that is not the case, I can suggest the following approaches:
a) Give rules-1 and rules-2 a high replication factor, close to the number of nodes you have. Then you can read rules-1 and rules-2 from HDFS for each record in the input relatively efficiently, because it will be a sequential read from the local datanode.
b) If you can come up with a hash function which, when applied to a rule and to an input string, predicts without false negatives that they can match, then you can emit this hash for the rules and for the input records and resolve all possible matches in the reducer. It will be very similar to the way a join is done using MapReduce.
c) I would also consider other optimization techniques, like building search trees or sorting, since otherwise the problem looks computationally expensive and will take forever...
On this page, find "Real-World Cluster Configurations"; it covers file size configuration.
You could use the parameter "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory available to your mappers. You might not be able to run as many map slots per server, but with more servers in your cluster you could still parallelize your job.
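For example, something along these lines in conf/mapred-site.xml (the 2 GB heap size below is only an illustration; pick a value that fits your nodes):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048m</value>
    </property>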
Read the record text file as the input to your MapReduce job, and read the keyword (rules) text file from within the mapper function using the HDFS API. Split the input value with a StringTokenizer on value.toString(). In your mapper, the HDFS file-reading code will read the rules file line by line, so use two while loops to do the comparison. Whenever you have data you want to keep, send it to the reducer.
Split the 3GB text file into several text files and run your previous MapReduce program on all of those text files as usual.
For splitting the text file I wrote a Java program, in which you decide how many lines you want to write into each text file.

Resources