How do I use scikit-learn to train a model on a large CSV file (~75MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the webpage, link to my code, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading the large dataset with the read_csv function. To work around this temporarily I have to restart the kernel, after which read_csv loads the file successfully, but the same error occurs when I run the same cell again.
When read_csv does load the file successfully, and after I have made changes to the dataframe, I can pass the features and labels to KNeighborsClassifier's fit() function. At this point a similar memory error occurs.
I tried the following:
Iterate through the CSV file in chunks, and fit the data accordingly, but the problem is that the predictive model is overwritten every time for a chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype across all rows, but two different columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the models). At this point you will have 2 copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
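A minimal sketch of the numpy.loadtxt route, assuming the CSV has one header row and the label in the first column (an assumption about the file layout; the in-memory sample below is made up for illustration). Passing dtype makes loadtxt allocate the array in the target type directly rather than going through a DataFrame first:

```python
import io
import numpy as np

def load_labeled_csv(source, dtype=np.float32):
    """Parse a CSV with a header row and the label in the first column."""
    # skiprows=1 drops the header; dtype asks loadtxt to build the array
    # in the requested type instead of the float64 default.
    data = np.loadtxt(source, delimiter=",", skiprows=1, dtype=dtype, ndmin=2)
    labels = data[:, 0].astype(np.int64)
    # Column slicing returns a view; an estimator may still copy it if it
    # needs a contiguous array.
    features = data[:, 1:]
    return features, labels

# Tiny in-memory stand-in for the real file (hypothetical values).
sample = io.StringIO("label,pixel0,pixel1,pixel2\n1,0,255,12\n7,3,0,0\n")
X, y = load_labeled_csv(sample)
print(X.dtype, X.shape, y)
```

For a real file, pass its path instead of the StringIO object.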
Also, if your data is very sparse (many zero values), it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to find out). However, the CSV format itself is not very well suited to sparse data, and I am not sure a direct CSV-to-scipy.sparse parser exists.
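To give a rough idea of the saving, here is a small sketch that compares the memory used by a dense array with that of its scipy.sparse CSR equivalent. The array shape and the roughly 5% fill rate are arbitrary assumptions chosen for illustration, and in practice you would build the sparse matrix chunk by chunk rather than from a full dense copy:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Synthetic data: mostly zeros, with about 5% of the entries set to 1.
dense = np.zeros((1000, 784), dtype=np.float32)
mask = rng.random(dense.shape) < 0.05
dense[mask] = 1.0

# CSR stores only the non-zero values plus two index arrays.
X_sparse = sparse.csr_matrix(dense)
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)

print("dense bytes: ", dense.nbytes)
print("sparse bytes:", sparse_bytes)
```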
Edit: for reference, KNeighborsClassifier allocates a temporary distance array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
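One way to keep that temporary array small is to predict in slices, so that at most (batch_size, n_samples_train) distances exist at a time. The sketch below assumes a hypothetical helper named predict_in_batches and synthetic data; newer scikit-learn releases may already limit this memory internally, so treat it as an illustration rather than a required workaround:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def predict_in_batches(model, X, batch_size=1000):
    """Call model.predict on consecutive row slices and join the results."""
    parts = [model.predict(X[start:start + batch_size])
             for start in range(0, X.shape[0], batch_size)]
    return np.concatenate(parts)

# Synthetic stand-in data (hypothetical sizes).
rng = np.random.default_rng(0)
X_train = rng.random((500, 20))
y_train = rng.integers(0, 10, size=500)
X_test = rng.random((250, 20))

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
batched = predict_in_batches(knn, X_test, batch_size=100)
print(batched[:10])
```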
I want to do Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html and it says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried connecting to a distributed cluster and doing:
x = da.from_array(i, 100000)
t = model.predict(x)
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However this does not trigger any computation on the cluster (observed with dashboard), and runs my 1TB RAM machine into a memory error when to_parquet computes, even for a test where the numpy size of x and t is 7GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue with the size of the input x. It has the shape (24507731,8). If I instead just throw in random data with the shape (24507,8), the computation finishes. This is quite surprising, as ParallelPostfit is supposed to make predictions on large data possible in the first place.
I trained a scikit-learn RandomForestRegressor model on 19GB of training data. I would like to save it to disk in order to use it later for inference. As recommended in other Stack Overflow questions, I tried the following:
Pickle
pickle.dump(model, open(filename, 'wb'))
The model was saved successfully. Its size on disk was 1.9 GB.
loaded_model = pickle.load(open(filename, 'rb'))
Loading the model resulted in a MemoryError (despite 16 GB RAM).
cPickle - the same result as Pickle
Joblib
joblib.dump(est, 'random_forest.joblib', compress=3)
It also ends with the MemoryError while loading the file.
Klepto
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d['sklearn_random_forest'] = est
d.dump()
The archive is created, but when I try to load it using the following code, I get KeyError: 'sklearn_random_forest'
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d.load(model_params)
est = d[model_params]
I tried saving a dictionary object using the same code, and it worked, so the code is correct. Apparently Klepto cannot persist sklearn models. I played with the cached and serialized parameters and it didn't help.
Any hints on how to handle this would be much appreciated. Is it possible to save the model in JSON, XML, maybe HDFS, or maybe other formats?
Try using joblib.dump()
joblib.dump() accepts a "compress" parameter. It takes integer values between 0 and 9; the higher the value, the more compressed your file gets. A compress value of 3 usually suffices.
The only downside is that the higher the compress value, the slower the write/read speed!
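A minimal round trip with joblib.dump and joblib.load, using a placeholder object and a temporary directory (both are assumptions made for illustration; a fitted estimator is saved the same way). Note that compression shrinks the file on disk; the loaded object still needs its full size in RAM:

```python
import os
import tempfile

import joblib
import numpy as np

# Placeholder object standing in for a fitted model (hypothetical content).
obj = {"weights": np.zeros((1000, 100)), "name": "demo"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    # compress=3 trades some write/read speed for a smaller file.
    joblib.dump(obj, path, compress=3)
    compressed_size = os.path.getsize(path)
    restored = joblib.load(path)

print("bytes on disk:", compressed_size)
print("bytes in RAM for the array:", restored["weights"].nbytes)
```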
The size of a Random Forest model is not strictly dependent on the size of the dataset that you trained it with. Instead, there are other parameters that you can see on the Random Forest classifier documentation which control how big the model can grow to be. Parameters like:
n_estimators - the number of trees
max_depth - how "tall" each tree can get
min_samples_split and min_samples_leaf - the number of samples that allow nodes in the tree to split/continue splitting
If you have trained your model with a high number of estimators, large max depth, and very low leaf/split samples, then your resulting model can be huge - and this is where you run into memory problems.
In these cases, I've often found that training smaller models (by controlling these parameters) -- as long as it doesn't kill the performance metrics -- will resolve this problem, and you can then fall back on joblib or the other solutions you mentioned to save/load your model.
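To see the effect of these parameters on model size, the sketch below trains two forests on the same synthetic data (the data, parameter values, and the helper name total_nodes are all assumptions made for illustration) and compares their node counts and pickled sizes:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def total_nodes(forest):
    """Sum the node counts of all trees in a fitted forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.random((2000, 8))
y = X[:, 0] + rng.normal(scale=0.1, size=2000)

# Default settings let every tree grow until its leaves are pure.
big = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Capping depth and leaf size bounds how large each tree can get.
small = RandomForestRegressor(
    n_estimators=20, max_depth=6, min_samples_leaf=20, random_state=0
).fit(X, y)

for name, model in [("big", big), ("small", small)]:
    print(name, total_nodes(model), "nodes,", len(pickle.dumps(model)), "bytes")
```

Whether the smaller model is accurate enough has to be checked on your own validation data.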
I am trying to use an LSTM in TensorFlow.
For that, I have data in a text file (10MB).
When I try to copy the data into a NumPy array I get a memory error.
Any suggestions on how to get the data ready so that I can use it in an LSTM?
I read the data from the file, before processing it with TensorFlow, using this function:
import numpy as np

def read_data(fname):
    with open(fname, encoding="utf8") as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    # flatten the lines into one list of words
    content = [word for line in content for word in line.split()]
    content = np.array(content)
    return content
At np.array(content) it raises a memory error. How can I get around this so that I can use this data in an LSTM in TensorFlow?
Please also suggest whether there is any LSTM implementation that can read large amounts of data.
A memory error indeed means that the NumPy array does not fit into memory, because of the overhead of storing lists of strings in NumPy. The problem is that you are not creating a single matrix of words. Each word list in content has a different length, so calling np.array will create an array for each line and then add them into one large NumPy array. This is not what NumPy is for: it is efficient when dealing with numerical tensors, not lists of lists of strings.
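A small illustration of one source of string overhead, assuming the words end up in a single NumPy string array: NumPy stores such arrays with a fixed-width unicode dtype, so every element takes as much space as the longest word. The word list here is made up:

```python
import numpy as np

# Hypothetical sample: two short words and one long one.
words = ["a", "to", "internationalization"]
arr = np.array(words)

# The dtype width is set by the longest string, at 4 bytes per character.
print(arr.dtype, arr.itemsize, arr.nbytes)

# Total characters actually needed versus characters allocated.
print(sum(len(w) for w in words), "vs", arr.itemsize // 4 * len(words))
```

A single very long token in the file is therefore enough to inflate the whole array.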
Here is a related question.
If you plan to use TensorFlow, you can use the tf.data.Dataset API. It can load the file line by line, and you can then apply everything you need within TensorFlow, e.g. applying tf.strings.split (by calling the map method) and padding + batching the data.
You will end up with something like this:
tf.data.TextLineDataset(fname).map(lambda s: tf.strings.split([s])[0])
Note that before batching and passing it into LSTM you need to convert the strings to vocabulary indices and call embedding lookup on the indices.
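The string-to-index step can be sketched in plain Python, independently of TensorFlow. The helper name encode_words and the sample text are assumptions for illustration; the idea is to stream the file line by line and keep only compact integer ids instead of the strings themselves:

```python
import io
from array import array

import numpy as np

def encode_words(lines):
    """Map each whitespace-separated word to an integer id, line by line."""
    vocab = {}
    ids = array("i")  # compact buffer of C ints instead of a Python list
    for line in lines:
        for word in line.split():
            # setdefault assigns the next free id the first time a word appears
            ids.append(vocab.setdefault(word, len(vocab)))
    return np.array(ids, dtype=np.int32), vocab

# In-memory stand-in for the text file (hypothetical content).
sample = io.StringIO("the cat sat\non the mat\n")
ids, vocab = encode_words(sample)
print(ids, vocab)
```

With a real file, pass the open file object so that lines are read lazily; the resulting ids are what an embedding lookup expects.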
I have a model to train on a large data set that does not fit into RAM. So, basically my plan is to slice the data set creating a DataSet instance with input vectors and associated labels for every chunk. E.g. if I have 1M input vectors/labels I'd split them into 10 chunks each having 100K records.
Then I'd put a chunk into 2 INDArray objects (for inputs and labels), create a DataSet and call model.fit() with that data set, repeating this procedure for every chunk and repeating the whole process until say the model's score reaches some value.
My questions are:
1. Do I understand the process correctly?
2. Can the INDArray instances be reused? Would it be right to allocate them once and then just fill them up with data set chunks over and over again?
You don't have to do any of this. Workspaces already solve your allocation problem:
http://deeplearning4j.org/workspaces
Just use the standard DataVec -> RecordReaderDataSetIterator -> DataSet pattern.
That already handles minibatches for you.
I'm currently using the fb torch resnet library from GitHub.
It's my first time using Torch and Lua, so I'm running into some problems.
My goal is to save the feature vector of a specific layer (the last average pooling of ResNet) to a file, together with the class of the input image. All input images are from the CIFAR-10 dataset.
The file format that I want is as follows:
image1.txt := class index of image and feature vector of image 1 of cifar-10
image2.txt := class index of image and feature vector of image 2 of cifar-10
// and so on through all images of cifar-10
I have seen some sample code in that GitHub repository, extract-features.lua.
Because it's my first time with Lua, I find it hard to understand this code and to modify it the way I want. I also don't want to save my data in the t7 file format.
How can I access only one specific layer of the network in Torch via Lua? (the last average pooling)
How can I access the values of the layer and the classification result index?
How can I read all the images from the CIFAR-10 file (t7 batch)?
Sorry for so many questions, but I'm finding Torch hard to use because of the small number of community threads and posts about it. Please bear with me.
How can I access only one specific layer of the network in Torch via Lua? (the last average pooling)
To access a layer you just have to load the model and get the layer by its integer position. If you run print(model) you will be able to see at which position the last average pooling is.
model = torch.load(path_to_model):cuda()
avg_pooling_layer = model:get(position_of_the_avg_pooling_layer)
How can I access the values of the layer and the classification result index?
I do not quite understand what you mean by this. If you want to see the output or the weights of a specific layer (following the code above), you need to get these elements from the layer table. Again, to see which elements are available, use print(avg_pooling_layer).
weights = avg_pooling_layer.weight -- get the weights of the layer (nil for layers without parameters, such as pooling)
output = avg_pooling_layer.output -- get the output of the layer (filled in after a forward pass)
How can I read all the images from the CIFAR-10 file (t7 batch)?
To read the images from a t7 file, use the torch function torch.load (used before to load the model).
cifar_10 = torch.load("path_to_cifar-10.t7")
Once loaded, you may have the training and test sets in subtables or functions. Again, print the table and see which values are the ones you need.
Hope this helps!