Caffe's way of doing data shuffling

Is shuffling done by setting the --shuffle flag, as found in create_imagenet.sh below?
GLOG_logtostderr=1 $TOOLS/convert_imageset \
--resize_height=$RESIZE_HEIGHT \
--resize_width=$RESIZE_WIDTH \
--shuffle \
I mean, I don't need to shuffle it manually afterwards if the flag already does it. What about the labels: are they shuffled automatically in the generated lmdb file?

Using the convert_imageset tool creates a copy of your training/validation data in a binary database file (either in lmdb or leveldb format). The data encoded in the dataset consists of pairs of an example and its corresponding label.
Therefore, when shuffling the dataset, the labels are shuffled together with the data, so the correspondence between each example and its ground-truth label is maintained.
There is no need to shuffle the data again during training.
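If you want to double-check, here is a minimal sketch (assuming the lmdb Python package and Caffe's Python bindings are installed; the LMDB path is just an example) that reads a few entries back and prints the label stored alongside each image:
import lmdb
import caffe

env = lmdb.open("examples/imagenet/ilsvrc12_train_lmdb", readonly=True)
with env.begin() as txn:
    for i, (key, value) in enumerate(txn.cursor()):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.ParseFromString(value)  # each record holds the pixels AND its label
        print(key, datum.label)
        if i >= 4:  # just peek at the first few entries
            break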


How can I one hot encode nii.gz files stored in two different folders?

I have read all of these images, which I am going to use for binary classification, and stored them in two different NumPy arrays. Now I need to one-hot encode these images and then feed them to a neural network.
I don't understand how I can one-hot encode two different NumPy arrays and then feed them to a neural network.
array_1 contains all the images that will be labelled as 1, and array_2 contains all the images that will be labelled as 0.
The Python package platipy has functionality to encode (multiple-valued) label maps.
To install:
pip install -U pip
pip install platipy
Here is a short example:
import SimpleITK as sitk
from platipy.imaging.label.utils import binary_encode_structure_list
img_label_1 = sitk.ReadImage("img_label_1.nii.gz")
img_label_2 = sitk.ReadImage("img_label_2.nii.gz")
img_label_3 = sitk.ReadImage("img_label_3.nii.gz")
# etc., for however many labels you have
label_list = [img_label_1, img_label_2, img_label_3]
img_encoded = binary_encode_structure_list(label_list)
If you need to use numpy, then you can just convert this SimpleITK image into a 3D numpy array:
arr_encoded = sitk.GetArrayFromImage(img_encoded)
N.B. You can also decode an encoded label map (e.g. the output of your NN) using tools in platipy:
from platipy.imaging.label.utils import binary_decode_image
label_list = binary_decode_image(img_prediction_encoded)
Hope this helps!

Create LMDB for image dataset with k-hot labels

I want to create a classifier for an image dataset in which each image can belong to several of the classes, so the target values are k-hot vectors. I created a text file in which each line contains the path of an image file, a space, and a k-hot vector, but when I try to run the scripts to create the lmdb files, they raise errors saying the files cannot be opened or found. When I try the same process with the same data but just a single number as the class label, everything works fine. So I think the tool cannot parse the .txt file correctly when the labels are vectors.
Any suggestions...
Thank you
Caffe "Data" layers and convert_imageset script were written with a very specific use case in mind: image classification. Therefore the basic element stored in (and fetched from) LMDB by caffe is Datum that has a room for a single integer label.
You can see a more lengthy discussion on this subject here
That does not mean Caffe cannot handle other types of inputs/tasks.
You can use an "HDF5Data" layer instead. When it comes to HDF5 inputs, Caffe imposes almost no restrictions on the input shape and size.
See, e.g., this answer and this one for more details on how to actually make it work.
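For example, here is a minimal sketch (assuming h5py and numpy; the file names and sizes are illustrative) of writing images together with k-hot label vectors to an HDF5 file for an "HDF5Data" layer. The dataset names ("data", "label") must match the layer's top blob names, and the arrays should be float32 or float64:
import h5py
import numpy as np

n, c, h, w, n_classes = 100, 3, 224, 224, 20          # illustrative sizes
images = np.zeros((n, c, h, w), dtype=np.float32)     # your preprocessed images go here
labels = np.zeros((n, n_classes), dtype=np.float32)   # k-hot target vectors
with h5py.File("train.h5", "w") as f:
    f.create_dataset("data", data=images)
    f.create_dataset("label", data=labels)
# the "HDF5Data" layer is pointed at a text file listing the .h5 files, one per line
with open("train_h5_list.txt", "w") as f:
    f.write("train.h5\n")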

Caffe - How to imbalance Cifar10 data

I'm doing research on the impact of imbalanced data with the Caffe framework. Now I am trying to make a new CIFAR-10 distribution by removing some of the data from a specified class. I read the CIFAR-10 documentation. It says that each record in the .bin file has the structure
1 * 8-bit label | 3 * 1024 bytes of RGB pixel data
So I wrote a script to filter out the data for that class and make a new .bin file.
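For reference, a filtering script along these lines could look like the sketch below (my own illustration, not the asker's actual code; the class index and keep fraction are made up). Each CIFAR-10 record is 1 label byte followed by 3*1024 pixel bytes:
import random

RECORD_SIZE = 1 + 3 * 1024  # 1 label byte + 3072 pixel bytes per record
TARGET_CLASS = 0            # hypothetical class to under-sample
KEEP_FRACTION = 0.1         # keep only 10% of its examples
with open("data_batch_1.bin", "rb") as src, open("data_batch_1_imbalanced.bin", "wb") as dst:
    while True:
        record = src.read(RECORD_SIZE)
        if len(record) < RECORD_SIZE:
            break
        if record[0] == TARGET_CLASS and random.random() > KEEP_FRACTION:
            continue  # drop this example
        dst.write(record)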
Then I run the following script in Caffe to create the LMDB dataset:
#!/usr/bin/env sh
# This script converts the cifar data into leveldb format.
EXAMPLE=examples/cifar10
DATA=data/cifar10
DBTYPE=lmdb
echo "Creating $DBTYPE..."
rm -rf $EXAMPLE/cifar10_train_$DBTYPE $EXAMPLE/cifar10_test_$DBTYPE
./build/examples/cifar10/convert_cifar_data.bin $DATA $EXAMPLE $DBTYPE
echo "Computing image mean..."
./build/tools/compute_image_mean -backend=$DBTYPE \
$EXAMPLE/cifar10_train_$DBTYPE $EXAMPLE/mean.binaryproto
echo "Done."
However, after I filter out that data, the resulting LMDB still has the same size and doesn't look any different from the unfiltered one. Can somebody tell me what I should do to make the data imbalanced?

How to predict using scikit?

I have trained an estimator, called clf, using the fit method and saved the model to disk. The next time the program runs, it loads clf from disk.
My problem is:
How do I predict a sample that is saved on disk? I mean, how do I load it and predict?
How do I get the sample's label name instead of the label integer after predict?
How do I predict a sample that is saved on disk? I mean, how do I load it and predict?
You have to use the same array representation for the new samples as the one used for the samples passed to the fit method. If you want to predict a single sample, the input must be a 2D numpy array with shape (1, n_features).
The way to read your original file from the HDD and convert it to a numpy array representation suitable for the classifier is a domain-specific issue: it depends on whether you are trying to classify text files, jpeg files, frames in a video file, rows in a database, log lines from syslog-monitored services...
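For example, a minimal sketch, assuming the estimator was persisted with joblib and the sample is stored as a one-row CSV of numeric features (both are assumptions; the question does not say how the files were saved):
import numpy as np
from joblib import load

clf = load("clf.joblib")                           # restore the fitted estimator
sample = np.loadtxt("sample.csv", delimiter=",")   # 1D array of feature values
prediction = clf.predict(sample.reshape(1, -1))    # reshape to (1, n_features)
print(prediction[0])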
How do I get the sample's label name instead of the label integer after predict?
Just keep a list of label names and ensure that the integers used as target values when fitting are in the range [0, n_classes). For instance, with ['ham', 'spam'], if your predictions are in {0, 1} then you can do:
new_samples = ...  # 2D array with shape (n_samples, n_features)
label_names = ['ham', 'spam']
predictions = [label_names[pred] for pred in clf.predict(new_samples)]

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on large CSV data (~75MB) without running into memory problems?
I'm using IPython Notebook as the programming environment, and the pandas and sklearn packages to analyze data from Kaggle's digit recognizer tutorial.
The data is available on the webpage, my code is linked, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
A "MemoryError" occurs when loading the large dataset using the read_csv function. To bypass this problem temporarily, I have to restart the kernel, after which read_csv successfully loads the file, but the same error occurs when I run the same cell again.
When read_csv loads the file successfully, after making changes to the DataFrame I can pass the features and labels to KNeighborsClassifier's fit() function. At this point, a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data accordingly, but the problem is that the predictive model is overwritten each time a new chunk is fitted.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all its rows, but two different columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the model). At this point you will have two copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
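For instance, a minimal sketch (assuming a Kaggle-style train.csv with a header row and the label in the first column; adjust to the real file layout):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# load straight into a float32 array, skipping the intermediate DataFrame copy
data = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = data[:, 0].astype(np.int64)   # digit labels
X = data[:, 1:]                   # pixel features
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)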
Also, if your data is very sparse (many zero values) it is better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited for sparse data and I am not sure there exists a direct CSV-to-scipy.sparse parser.
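As a rough sketch of the sparse route (the toy data here is made up; whether a given estimator accepts sparse input is noted in its docstring):
import numpy as np
from scipy import sparse
from sklearn.neighbors import KNeighborsClassifier

X_dense = np.array([[0., 0., 3.], [4., 0., 0.], [0., 5., 0.], [0., 0., 6.]])
y = np.array([0, 1, 0, 1])
X_sparse = sparse.csr_matrix(X_dense)   # only the non-zero entries are stored
clf = KNeighborsClassifier(n_neighbors=1).fit(X_sparse, y)
print(clf.predict(sparse.csr_matrix([[0., 0., 2.5]])))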
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
