Dask-ml ParallelPostFit not using distributed and causing memory error on local machine - dask

I want to do Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html, which says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:
import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, 100000)   # i is the input array
t = model.predict(x)           # model is the ParallelPostFit-wrapped estimator
t = client.persist(t)          # client is the dask.distributed Client
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However, this does not trigger any computation on the cluster (observed via the dashboard), and my 1 TB RAM machine runs into a memory error when to_parquet computes, even for a test where the NumPy size of x and t is only 7 GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue with the size of the input x. It has the shape (24507731, 8). If I instead just throw in random data with the shape (24507, 8), the computation finishes. This is quite surprising, as ParallelPostFit is supposed to make prediction on large data possible in the first place.
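For reference, here is a minimal sketch of the fully lazy pattern from the linked example, without persisting the predictions on the client; the scheduler address, the input array i, and the fitted estimator are placeholders:

import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.wrappers import ParallelPostFit

client = Client("scheduler-address:8786")        # placeholder cluster address
model = ParallelPostFit(estimator=fitted_model)  # fitted_model: a trained sklearn estimator

x = da.from_array(i, chunks=(100000, 8))  # row chunks of 100000, all 8 columns together
t = model.predict(x)                      # lazy dask array, nothing is computed yet

# Convert to a dask dataframe and write; to_parquet triggers the computation
# and each worker writes its own partitions, so nothing is gathered on the client.
df = dd.from_dask_array(t, columns="prediction").to_frame()
df.to_parquet("xy.parquet")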

Related

How to save large sklearn RandomForestRegressor model for inference

I trained a scikit-learn RandomForestRegressor model on 19 GB of training data. I would like to save it to disk in order to use it later for inference. As has been recommended in other StackOverflow questions, I tried the following:
Pickle
pickle.dump(model, open(filename, 'wb'))
Model was saved successfully. Its size on disk was 1.9 GB.
loaded_model = pickle.load(open(filename, 'rb'))
Loading of the model resulted in a MemoryError (despite 16 GB RAM)
cPickle - the same result as Pickle
Joblib
joblib.dump(est, 'random_forest.joblib', compress=3)
It also ends with the MemoryError while loading the file.
Klepto
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d['sklearn_random_forest'] = est
d.dump()
Archive is created, but when I want to load it using the following code, I get KeyError: 'sklearn_random_forest'
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d.load(model_params)
est = d[model_params]
I tried saving a dictionary object using the same code, and it worked, so the code is correct. Apparently klepto cannot persist sklearn models. I played with the cached and serialized parameters and it didn't help.
Any hints on how to handle this would be much appreciated. Is it possible to save the model in JSON, XML, maybe HDFS, or maybe other formats?
Try using joblib.dump()
In this method, you can use the compress parameter. It takes integer values between 0 and 9; the higher the value, the more compressed your file gets. Ideally, a compress value of 3 should suffice.
The only downside is that the higher the compress value, the slower the write/read speed!
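For illustration, a minimal round trip with compression (the filename is a placeholder, and model stands for the fitted RandomForestRegressor from the question):

import joblib

# Write the fitted model with moderate compression (0 = none, 9 = maximum).
joblib.dump(model, "random_forest.joblib", compress=3)

# Later, for inference:
model = joblib.load("random_forest.joblib")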
The size of a Random Forest model is not strictly dependent on the size of the dataset that you trained it with. Instead, there are other parameters that you can see on the Random Forest classifier documentation which control how big the model can grow to be. Parameters like:
n_estimators - the number of trees
max_depth - how "tall" each tree can get
min_samples_split and min_samples_leaf - the number of samples that allow nodes in the tree to split/continue splitting
If you have trained your model with a high number of estimators, large max depth, and very low leaf/split samples, then your resulting model can be huge - and this is where you run into memory problems.
In these cases, I've often found that training smaller models (by controlling these parameters), as long as it doesn't kill the performance metrics, will resolve this problem, and you can then fall back on joblib or the other solutions you mentioned to save/load your model.
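As an illustration only (the hyperparameter values below are arbitrary placeholders, not recommendations), a deliberately constrained forest looks like this:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for the real 19 GB training set.
X_train = np.random.rand(1000, 20)
y_train = np.random.rand(1000)

# Fewer and shallower trees with larger leaves keep the serialized model small,
# at some possible cost in accuracy.
small_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=20,
    min_samples_leaf=5,
    n_jobs=-1,
)
small_model.fit(X_train, y_train)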

SelectKBest gives memory error even with sparse matrix

I have some spreadsheet data that is over a GB, and I want to use random forest. Following some other questions on here I was able to tune the algorithm to work with my data, but unfortunately, to get the best performance I needed to do one-hot encoding of a categorical feature, and now my input matrix has over 3000 features, resulting in a memory error.
I'm trying to reduce these features, so I'm using SelectKBest with chi2, which according to the docs can deal with my sparse matrix, but I'm still getting a memory error.
I tried using to_sparse with fill_value=0, which seems to reduce the memory footprint, but when I call fit_transform I get a MemoryError:
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>()
      4 Y_sparse = df_processed.loc[:,'Purchase'].to_sparse(fill_value=0)
      5
----> 6 X_new = kbest.fit_transform(X_sparse, Y_sparse)
The code producing this is:
kbest = SelectKBest(mutual_info_regression, k=5)
X_sparse = df_processed.loc[:, df_processed.columns != 'Purchase'].to_sparse(fill_value=0)
Y_sparse = df_processed.loc[:, 'Purchase'].to_sparse(fill_value=0)
X_new = kbest.fit_transform(X_sparse, Y_sparse)
I simply want to reduce the 3000 features to something more manageable, say 20, that correlate well with my Y values (a continuous response).
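As a point of reference, SelectKBest does accept a true scipy.sparse matrix directly; here is a minimal sketch with synthetic data (the shapes and density are placeholders):

import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic stand-ins for the one-hot encoded features and the continuous target.
X = sparse.random(10000, 3000, density=0.01, format="csr", dtype=np.float64)
y = np.random.rand(10000)

# Keep the 20 features that score highest against the response.
kbest = SelectKBest(mutual_info_regression, k=20)
X_new = kbest.fit_transform(X, y)
print(X_new.shape)  # (10000, 20)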
The reason you are getting an error on everything is that, to do anything in pandas or sklearn, the entire dataset has to be loaded into memory along with all the other data from temporary steps.
Instead of doing one-hot encoding, try binary encoding or hashing encoding. One-hot encoding has a linear growth rate, n, where n is the number of categories in a categorical feature. Binary encoding has a log_2(n) growth rate, so you will be able to avoid the memory error. If that is not enough, try hashing encoding.
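A minimal sketch with the third-party category_encoders package (the column names and values are placeholders):

import pandas as pd
import category_encoders as ce  # pip install category_encoders

df = pd.DataFrame({"category_col": ["a", "b", "c", "a"], "amount": [1, 2, 3, 4]})

# Binary encoding expands each categorical column into roughly log2(n_categories)
# binary columns instead of n_categories one-hot columns.
encoder = ce.BinaryEncoder(cols=["category_col"])
df_encoded = encoder.fit_transform(df)
print(df_encoded.head())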

How to put datasets created by torchvision.datasets in GPU in one operation?

I’m dealing with CIFAR10 and I use torchvision.datasets to create it. I need a GPU to accelerate the calculation, but I can’t find a way to put the whole dataset onto the GPU at one time. My model needs to use mini-batches, and it is really time-consuming to deal with each batch separately.
I've tried moving each mini-batch to the GPU separately, but it seems really time-consuming.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from matplotlib.pyplot import plot
from torch.utils.data import DataLoader

num_workers = 1  # Set this as needed

def time_gpu_cast(batch_size=1):
    # `dataset` is the CIFAR10 dataset built with torchvision.datasets
    start_time = time()
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda(); y.cuda()
    return time() - start_time

# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]
# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))
plot(*zip(*cast_times))  # Plot the time taken
For num_workers = 1, the resulting timings showed no benefit from casting the entire dataset at once compared with reasonably large batches.
And with parallel loading (num_workers = 8), the gap was even clearer.
I've got an answer and I'm gonna try it later. It seems promising.
You can write a dataset class where, in the __init__ function, you read the entire dataset, apply all the transformations you need, and convert the samples to tensors. Then send that tensor to the GPU (assuming there is enough memory). In the __getitem__ function you can then simply use the index to retrieve elements of the tensor, which already lives on the GPU.
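A minimal sketch of that idea for CIFAR10 (the class name and device string are placeholders, and it assumes the whole tensor fits in GPU memory):

import torch
from torch.utils.data import Dataset
import torchvision
import torchvision.transforms as T

class GPUCIFAR10(Dataset):
    def __init__(self, root="./data", train=True, device="cuda"):
        base = torchvision.datasets.CIFAR10(root=root, train=train, download=True,
                                            transform=T.ToTensor())
        # Apply the transforms once, stack everything into one tensor, move it to the GPU.
        self.data = torch.stack([img for img, _ in base]).to(device)
        self.targets = torch.tensor(base.targets, device=device)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        # Indexing returns tensors that already live on the GPU.
        return self.data[idx], self.targets[idx]

Use it with a DataLoader that has num_workers=0, since the samples are already CUDA tensors.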

tensorflow conv2d memory consumption explain?

output = tf.nn.conv2d(input, weights, strides = [1,3,3,1], padding = 'VALID')
My input has shape 200x225x225x1 and the weights are 15x15x1x64. Hence, the output has shape 200x71x71x64, since (225 - 15)/3 + 1 = 71.
TensorBoard shows that this operation consumes 768 MB in total. Assuming it accounts for the size of the input (38.6 MB), the weights (0.06 MB), and the output (246.2 MB), the total memory consumption should not exceed 300 MB. So where does the rest of the memory consumption come from?
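For reference, the quoted sizes follow from the tensor shapes at 4 bytes per float32 element; a quick check:

def mib(shape, bytes_per_elem=4):  # float32
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2**20

print(round(mib((200, 225, 225, 1)), 1))  # ~38.6  (input)
print(round(mib((15, 15, 1, 64)), 2))     # ~0.05  (weights)
print(round(mib((200, 71, 71, 64)), 1))   # ~246.2 (output)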
Although I'm not able to reproduce your graph and values based on the information provided, it's possible that you're seeing additional memory usage due to intermediary values materialized during the computation of Conv2D. It's also possible that the instrumentation is incorrect (e.g. reshape operations that do not result in a copy of Tensor memory end up duplicating the "memory usage" in the TF Node Stats instrumentation). Without a reproducible test case, it's hard to say more. If you do feel like this is a bug in TensorFlow, please do raise an issue on GitHub!

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on large CSV data (~75 MB) without running into memory problems?
I'm using IPython notebook as the programming environment, and pandas+sklearn packages to analyze data from kaggle's digit recognizer tutorial.
The data is available on the Kaggle digit recognizer page. KNeighborsClassifier is used for the prediction, and the problem is described below.
Problem:
"MemoryError" occurs when loading large dataset using read_csv
function. To bypass this problem temporarily, I have to restart the kernel; the read_csv function then loads the file successfully, but the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully, and after making changes to the dataframe, I can pass the features and labels to KNeighborsClassifier's fit() function. At this point, a similar memory error occurs.
I tried the following:
Iterating through the CSV file in chunks and fitting the data accordingly, but the problem is that the predictive model is overwritten by each new chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two columns can have distinct datatypes (e.g. integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the models). At this point you will have 2 copies of your dataset in memory.
To avoid this you could write / reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt for instance (have a look at the docstring for the parameters).
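For instance, a minimal sketch for the Kaggle digit-recognizer train.csv (assuming the label is in the first column and there is a header row):

import numpy as np

# Parse the CSV straight into float32, the dtype scikit-learn will use,
# so no intermediate DataFrame copy is created.
data = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = data[:, 0]    # first column: digit label
X = data[:, 1:]   # remaining columns: pixel values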
Also, if your data is very sparse (many zero values) it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited for sparse data, and I am not sure there exists a direct CSV-to-scipy.sparse parser.
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
