Recurring attributes for HDF5 dataset - hdf5

I am seeing attributes like dhm* (for example dhm8a48ced6a620499f12ff5f72b4e59f62ae6357) attached to some of the datasets in my HDF5 files. What do they mean or represent? Their value is an array of 8-bit integers, and the arrays are always of a fixed size.
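A minimal h5py sketch for dumping these attributes so they can be inspected (the file name here is a placeholder):

import h5py
import numpy as np

# "example.h5" is a placeholder; point this at the file in question.
with h5py.File("example.h5", "r") as f:
    def dump_attrs(name, obj):
        # Print every attribute whose name starts with "dhm", with its dtype and shape.
        for key, value in obj.attrs.items():
            if key.startswith("dhm"):
                value = np.asarray(value)
                print(name, key, value.dtype, value.shape)
    f.visititems(dump_attrs)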

Related

What is a "label" in Caffe?

In Caffe, when you define the inputs for the NN in the protobuf file, you can provide "data" and "label". I'm guessing label contains the expected output for the training data (what is normally called the y values in the machine learning literature).
My problem is that in the caffe.proto file, label is defined as a scalar (int or long). With data, at least, I can set it to a numpy array, because it takes string values. If I'm training for more than one prediction output, how can I pass it as an array?
Or am I mistaken? What is label? What is it for? And how can I pass the y values to Caffe?
The basic use case of Caffe used to be image classification: assigning a single integer label per input image. Thus, the "datum" data structure reserves space for a 4D float array (batches of 3-channel images) and a single integer "label" per image in the batch.
This restriction can easily be overcome using an HDF5 input data layer.
See e.g., this answer.
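For illustration, a minimal h5py sketch of writing such a file, assuming the HDF5Data layer's top names are "data" and "label" and that each sample gets a vector of y values (shapes and file names are placeholders):

import h5py
import numpy as np

n_samples, n_channels, height, width = 100, 3, 32, 32
n_outputs = 5  # more than one prediction target per sample

X = np.random.rand(n_samples, n_channels, height, width).astype(np.float32)
y = np.random.rand(n_samples, n_outputs).astype(np.float32)

# Dataset names must match the "top" names of the HDF5Data layer.
with h5py.File("train.h5", "w") as f:
    f.create_dataset("data", data=X)
    f.create_dataset("label", data=y)

# The HDF5Data layer reads the HDF5 file path(s) from a plain-text list file.
with open("train_h5_list.txt", "w") as f:
    f.write("train.h5\n")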

How to convert plain text into feature/value pair format

I checked various SVM classifiers, which use a feature/value pair format for classification. (I am focusing on SVMlight - http://svmlight.joachims.org/) The format looks like this:
-1 1:0.43 3:0.12 9284:0.2 # abcdef
But as I am getting user input in the form of plain text, to classify it using SVMlight I need to convert the plain text to this format.
How can this be done?
You have to use some real-valued embedding. In other words, you have data in the space of texts, which is more or less the space of variable-length sequences of words. There are numerous approaches, each better for a different purpose; the simplest ones include:
encode at the word level, so each word is a "dimension": you create a dictionary of words and assign each word a consecutive integer. Now each document can be encoded as a vector, where each feature's value is, for example, "whether the word is in the document" (set of words), or "how many times the word occurs" (bag of words, also known as term frequency, tf), or some more complex statistic (such as tf-idf: term frequency multiplied by inverse document frequency).
encode at the level of n-grams, similar to the previous one, but instead of enumerating each word you enumerate each n-gram (an n-gram is any sequence of n words); this gives a more syntactic feature, but requires significantly more data to train on.
use some "magical encoding" or specialized "string kernels".
The first two approaches can easily be done using scikit-learn's tfidf vectorizer, see http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html . The last one requires more complex software.
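A minimal sketch of that route, vectorizing a few placeholder documents with TfidfVectorizer and writing them out in SVMlight's feature:value format with scikit-learn's dump_svmlight_file:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import dump_svmlight_file

# Placeholder documents and binary labels (+1 / -1, as SVMlight expects).
docs = ["cheap pills buy now", "meeting rescheduled to friday", "you won a free prize"]
labels = [-1, 1, -1]

# Bag of words weighted by tf-idf; each distinct word becomes a feature index.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of shape (n_docs, n_features)

# Writes "label index:value index:value ..." lines readable by svm_learn.
dump_svmlight_file(X, labels, "train.dat", zero_based=False)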

How to predict using scikit?

I have trained an estimator, called clf, using the fit method and saved the model to disk. The next time I run the program, it will load clf from disk.
My problem is:
how to predict a sample which is saved on disk? I mean, how to load it and predict?
how to get the label name instead of the label integer after predict?
how to predict a sample which is saved on disk? I mean, how to load it and predict?
You have to use the same array representation for the new samples as the one used for the samples passed to the fit method. If you want to predict a single sample, the input must be a 2D numpy array with shape (1, n_features).
How to read your original file from the HDD and convert it to a numpy array representation suitable for the classifier is a domain-specific issue: it depends on whether you are trying to classify text files, jpeg files, frames in a video file, rows in a database, log lines for syslog-monitored services...
how to get the label name instead of the label integer after predict?
Just keep a list of label names and ensure that the integers used as target values when fitting are in the range [0, n_classes). For instance with label_names = ['ham', 'spam'], if you have predictions in {0, 1} then you can do:
new_samples = ...  # 2D array with shape (n_samples, n_features)
label_names = ['ham', 'spam']
predictions = [label_names[pred] for pred in clf.predict(new_samples)]
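Putting the two parts together, a hedged sketch (file names are placeholders) that persists clf with joblib, reloads it in a later run, and maps the predicted integer back to a name:

import numpy as np
from joblib import dump, load

# Persist the fitted estimator once...
dump(clf, 'clf.joblib')

# ...then, in a later run, reload it and predict a single sample read from disk.
clf = load('clf.joblib')
sample = np.loadtxt('sample.txt').reshape(1, -1)  # 2D array with shape (1, n_features)
label_names = ['ham', 'spam']
print(label_names[clf.predict(sample)[0]])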

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on large CSV data (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and the pandas + sklearn packages to analyze data from Kaggle's digit recognizer tutorial.
The data is available on the webpage, here is a link to my code, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv
function. To bypass this problem temporarily, I have to restart the
kernel, which then read_csv function successfully loads the file, but
the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully, after making changes to the dataframe, I can pass the features and labels to the KNeighborsClassifier's fit() function. At this point, similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data accordingly, but the problem is that the predictive model is overwritten every time by a chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
Also, if your data is very sparse (many zero values) it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited to sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
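For instance, a minimal sketch assuming the layout of the Kaggle digit recognizer CSV (header row, label in the first column):

import numpy as np

# Parse the CSV straight into a float32 array, skipping the header row,
# so only one copy of the data is held in memory.
data = np.loadtxt('train.csv', dtype=np.float32, delimiter=',', skiprows=1)
y = data[:, 0]   # first column: the digit label
X = data[:, 1:]  # remaining columns: the pixel features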
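If the features are mostly zeros (as digit pixels often are), a hedged sketch of going sparse, reusing X and y from the loadtxt example above; SGDClassifier is just one example of an estimator that accepts sparse input:

from scipy.sparse import csr_matrix
from sklearn.linear_model import SGDClassifier

X_sparse = csr_matrix(X)  # stores only the non-zero entries
clf = SGDClassifier()
clf.fit(X_sparse, y)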
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325

How are matrices stored in memory?

Note - this may be more related to computer organization than software, I'm not sure.
I'm trying to understand something related to data compression, say for JPEG photos. Essentially a very dense matrix is converted (via discrete cosine transforms) into a much sparser matrix. Supposedly it is this sparse matrix that is stored. Take a look at this link:
http://en.wikipedia.org/wiki/JPEG
Compare the original 8x8 sub-block image example to matrix "B", which is transformed to have overall lower-magnitude values and many more zeros throughout. How is matrix B stored such that it saves much more memory than the original matrix?
The original matrix clearly needs 8x8 (number of entries) x 8 bits/entry, since values can range from 0 to 255. OK, so I think it's pretty clear we need 64 bytes of memory for this. Matrix B, on the other hand, hmmm. The best-case scenario I can think of is that values range from -26 to +5, so at most an entry (like -26) needs 6 bits (5 bits to form 26, 1 bit for the sign, I guess). So then you could store 8x8x6 bits = 48 bytes.
The other possibility I see is that the matrix is stored in a "zig-zag" order from the top left. Then we can specify a start and an end address and just keep storing along the diagonals until we're only left with zeros. Let's say it's a 32-bit machine; then 2 addresses (start + end) will take 8 bytes; for the other non-zero entries at 6 bits each, say, we have to go along almost all the top diagonals to store a total of 28 elements. In total this scheme would take 29 bytes.
To summarize my question: if JPEG and other image encoders claim to save space by using algorithms to make the image matrix less dense, how is this extra space actually realized on my hard disk?
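For reference, the three byte counts above, spelled out (the 28-element figure is the question's own estimate):

# 8x8 entries at 8 bits each: the raw block
print(8 * 8 * 8 // 8)             # 64 bytes
# 8x8 entries at 6 bits each (values fit in -26..+5)
print(8 * 8 * 6 // 8)             # 48 bytes
# zig-zag scheme: two 32-bit addresses plus 28 non-zero entries at 6 bits each
print(2 * 32 // 8 + 28 * 6 // 8)  # 8 + 21 = 29 bytes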
Cheers
The DCT needs to be accompanied by other compression schemes that take advantage of the runs of zeros / high-frequency occurrences. A simple example is run-length encoding.
JPEG uses a variant of Huffman coding.
As it says under "Entropy coding", a zig-zag pattern is used, together with RLE, which already reduces the size in many cases. However, as far as I know the DCT doesn't give a sparse matrix per se; rather, together with the quantization step it lowers the entropy of the matrix. This is also the point where the compression becomes lossy: the input matrix is transformed with the DCT, then the values are quantized, and then Huffman encoding is applied.
The simplest compression would take advantage of repeated sequences of symbols (zeros). A matrix in memory may look like this (in decimal):
0000000000000100000000000210000000000004301000300000000004
After compression it may look like this
(0,13)1(0,11)21(0,12)43010003(0,11)4
(Symbol,Count)...
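A hedged Python sketch of that idea, collapsing only the longer runs of zeros (the run-length threshold here is an arbitrary choice):

from itertools import groupby

def rle_zeros(digits):
    # Replace runs of '0' longer than 3 with a (0,count) marker; keep everything else verbatim.
    out = []
    for symbol, run in groupby(digits):
        run = list(run)
        if symbol == '0' and len(run) > 3:
            out.append('(0,%d)' % len(run))
        else:
            out.extend(run)
    return ''.join(out)

print(rle_zeros('0000000000000100000000000210000000000004301000300000000004'))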
As I understand it, JPEG doesn't only compress, it also drops data. After the 8x8 block is transformed to the frequency domain, it drops the insignificant (high-frequency) data, which means it effectively only has to save the significant 6x6 or even 4x4 portion. That is how it can achieve a higher compression ratio than lossless methods (like GIF).
