I have some spreadsheet data that is over 1 GB and I want to use random forest on it. Following some other questions on here, I was able to tune the algorithm to work with my data, but unfortunately, to get the best performance I needed to one-hot encode a categorical feature, and now my input matrix has over 3000 features, resulting in a memory error.
I'm trying to reduce these features, so I'm using SelectKBest with chi2, which according to the docs can deal with my sparse matrix, but I'm still getting a memory error.
I tried using to_sparse with fill_value=0, which seems to reduce the memory footprint, but when I call fit_transform I get a memory error:
MemoryError                               Traceback (most recent call last)
<ipython-input> in <module>()
4 Y_sparse = df_processed.loc[:,'Purchase'].to_sparse(fill_value=0)
5
----> 6 X_new = kbest.fit_transform(X_sparse, Y_sparse)
from sklearn.feature_selection import SelectKBest, mutual_info_regression

kbest = SelectKBest(mutual_info_regression, k=5)
X_sparse = df_processed.loc[:, df_processed.columns != 'Purchase'].to_sparse(fill_value=0)
Y_sparse = df_processed.loc[:, 'Purchase'].to_sparse(fill_value=0)
X_new = kbest.fit_transform(X_sparse, Y_sparse)
I simply want to reduce the 3000 features to something more manageable, say 20, that correlate well with my Y values (a continuous response).
The reason you are getting an error on everything is that to do anything in Pandas or sklearn, the entire dataset has to be loaded into memory, along with all the other data from temporary steps.
Instead of doing one-hot encoding, try binary encoding or hashing encoding. One-hot encoding has a linear growth rate n, where n is the number of categories in a categorical feature. Binary encoding has a log_2(n) growth rate, so you will be able to avoid the memory error. If not, try hashing encoding.
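For illustration, here is a minimal sketch of both encodings. It assumes a DataFrame df_processed with one high-cardinality categorical column named 'my_cat' (a hypothetical name) and the third-party category_encoders package for the binary encoder; the hashing variant uses scikit-learn's FeatureHasher.

import category_encoders as ce
from sklearn.feature_extraction import FeatureHasher

# Binary encoding: a feature with n categories becomes ~log2(n) columns
# instead of the n columns produced by one-hot encoding.
binary_enc = ce.BinaryEncoder(cols=['my_cat'])
df_binary = binary_enc.fit_transform(df_processed)

# Hashing encoding: the number of output columns is fixed up front,
# no matter how many categories appear in the data.
hasher = FeatureHasher(n_features=32, input_type='string')
X_hashed = hasher.transform([[v] for v in df_processed['my_cat'].astype(str)])  # sparse matrix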
Related
I want to do Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html, which says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:
import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, chunks=100000)   # i is the input numpy array
t = model.predict(x)
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However, this does not trigger any computation on the cluster (observed with the dashboard), and it runs my 1 TB RAM machine into a memory error when to_parquet computes, even for a test where the numpy size of x and t is 7 GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue with the size of the input x. It has the shape (24507731, 8). If I instead just throw in random data with the shape (24507, 8), the computation finishes. This is quite surprising, as ParallelPostFit is supposed to make prediction on large data possible in the first place.
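For reference, the fully lazy pattern suggested by the sentence quoted from the dask docs would skip client.persist() on the predictions and let to_parquet drive the computation, so each worker writes its own partitions. This is only a hedged sketch reusing the names from the question (client, the input array i, and model as a dask_ml ParallelPostFit wrapper); the thread does not confirm whether it avoids the memory blow-up described above.

import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, chunks=(100000, i.shape[1]))            # chunk along rows only
t = model.predict(x)                                         # still a lazy dask array
df = dd.from_dask_array(t, columns='prediction').to_frame()  # 1-D array -> Series -> DataFrame
df.to_parquet("xy.parquet")                                  # workers write the parquet parts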
I trained a Sklearn RandomForestRegressor model on 19 GB of training data. I would like to save it to disk in order to use it later for inference. As has been recommended in other stackoverflow questions, I tried the following:
Pickle
pickle.dump(model, open(filename, 'wb'))
The model was saved successfully. Its size on disk was 1.9 GB.
loaded_model = pickle.load(open(filename, 'rb'))
Loading the model resulted in a MemoryError (despite 16 GB RAM).
cPickle - the same result as Pickle
Joblib
joblib.dump(est, 'random_forest.joblib', compress=3)
It also ends with the MemoryError while loading the file.
Klepto
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d['sklearn_random_forest'] = est
d.dump()
The archive is created, but when I want to load it using the following code, I get KeyError: 'sklearn_random_forest'
d = klepto.archives.dir_archive('sklearn_models', cached=True, serialized=True)
d.load(model_params)
est = d[model_params]
I tried saving a dictionary object using the same code, and it worked, so the code is correct. Apparently Klepto cannot persist sklearn models. I played with the cached and serialized parameters and it didn't help.
Any hints on how to handle this would be very much appreciated. Is it possible to save the model in JSON, XML, maybe HDFS, or some other format?
Try using joblib.dump()
This method has a "compress" parameter, which takes integer values between 0 and 9; the higher the value, the more compressed your file gets. Ideally, a compress value of 3 should suffice.
The only downside is that the higher the compress value, the slower the write/read speed!
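A minimal sketch of the save/load round trip with compression, assuming est is the fitted RandomForestRegressor from the question:

import joblib

joblib.dump(est, 'random_forest.joblib', compress=3)   # smaller file, slower write
est_loaded = joblib.load('random_forest.joblib')       # reads are also slower at higher compress levels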
The size of a Random Forest model is not strictly dependent on the size of the dataset that you trained it with. Instead, there are other parameters that you can see on the Random Forest classifier documentation which control how big the model can grow to be. Parameters like:
n_estimators - the number of trees
max_depth - how "tall" each tree can get
min_samples_split and min_samples_leaf - the number of samples that allow nodes in the tree to split/continue splitting
If you have trained your model with a high number of estimators, large max depth, and very low leaf/split samples, then your resulting model can be huge - and this is where you run into memory problems.
In these cases, I've often found that training smaller models (by controlling these parameters) -- as long as it doesn't kill the performance metrics -- will resolve this problem, and you can then fall back on joblib or the other solutions you mentioned to save/load your model.
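As a hedged sketch of that advice, here is a RandomForestRegressor constrained via the parameters listed above; the specific values are illustrative, not tuned for any particular dataset.

from sklearn.ensemble import RandomForestRegressor

est = RandomForestRegressor(
    n_estimators=100,      # fewer trees -> smaller model
    max_depth=20,          # cap how tall each tree can get
    min_samples_split=10,  # require more samples before a node may split
    min_samples_leaf=5,    # stop growing branches once leaves get small
    n_jobs=-1,
)
# est.fit(X_train, y_train)   # fit as usual, then joblib.dump(est, ...) to save it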
output = tf.nn.conv2d(input, weights, strides = [1,3,3,1], padding = 'VALID')
My input has shape 200x225x225x1 and the weights are 15x15x1x64. Hence, the output has shape 200x71x71x64, since (225 - 15)/3 + 1 = 71.
TensorBoard shows that this operation consumes 768 MB in total (see pic below). Assuming it takes into account the size of the input (38.6 MB), the weights (0.06 MB), and the output (246.2 MB), the total memory consumption should not exceed 300 MB. So where does the rest of the memory consumption come from?
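The byte arithmetic behind those numbers, assuming float32 (4 bytes per value), can be checked directly:

MiB = 1024 ** 2

input_bytes  = 200 * 225 * 225 * 1 * 4        # ~38.6 MiB
weight_bytes = 15 * 15 * 1 * 64 * 4           # ~0.06 MiB
output_bytes = 200 * 71 * 71 * 64 * 4         # ~246.1 MiB

print(input_bytes / MiB, weight_bytes / MiB, output_bytes / MiB)
print((input_bytes + weight_bytes + output_bytes) / MiB)   # ~284.8 MiB, well under 768 MB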
Although I'm not able to reproduce your graph and values based on information provided, it's possible that you're seeing additional memory usage due to intermediary values materialized during the computation of Conv2D. It's also possible that the instrumentation is incorrect. (e.g. reshape operations that do not result in a copy of Tensor memory end up duplicating the "memory usage" in the TF Node Stats instrumentation.) Without a reproducible test case, it's hard to say more. If you do feel like this is a bug in TensorFlow, please do raise an issue on Github!
How can we make a working classifier for sentiment analysis, given that we need to train the classifier on huge data sets?
I have a huge data set to train on, but the classifier object (here using Python) gives a memory error when using 3000 words, and I need to train on more than 100K words.
What I thought of was dividing the huge data set into smaller parts, making a classifier object for each, storing each in a pickle file, and using all of them together. But it seems that using all the classifier objects for testing is not possible, as only one of the objects is used during testing.
The solutions that come to mind are either to combine all the saved classifier objects stored in the pickle files (which is just not happening) or to keep appending new training data to the same object (but again, it gets overwritten rather than appended to).
I don't know why, but I could not find any solution to this problem, even though it is fundamental to machine learning. Every machine learning project needs to be trained on a huge data set, and the object size when training on those data sets will always give a memory error.
So, how do I solve this problem? I am open to any solution, but would especially like to hear what is done by people who work on real machine learning projects.
Code Snippet :
import nltk
from nltk.corpus import movie_reviews

# Each document is a (list of words, category) pair.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]   # the 3000-word feature vocabulary

def find_features(document):
    # Boolean bag-of-words: which of the feature words appear in the document.
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]
numtrain = int(len(documents) * 90 / 100)       # 90/10 train/test split
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
PS: I am using the NLTK toolkit with NaiveBayes. My training dataset is opened and stored in documents.
There are two things you seem to be missing:
Datasets for text are usually extremely sparse, and you should store them as sparse matrices. With such a representation, you should be able to store millions of documents in memory with a vocabulary of 100,000.
Many modern learning methods are trained in a mini-batch scenario, meaning that you never need the whole dataset in memory; instead, you feed the model random subsets of the data but still train a single model. This way your dataset can be arbitrarily large, memory consumption is constant (fixed by the minibatch size), and only the training time scales with the number of samples.
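A minimal sketch of both ideas in scikit-learn (not the answerer's code): HashingVectorizer keeps the feature matrix sparse with a fixed width, and SGDClassifier.partial_fit trains a single model one mini-batch at a time. iter_batches() is a hypothetical generator yielding (texts, labels) chunks of the corpus.

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)   # sparse output, no vocabulary kept in memory
clf = SGDClassifier()                              # a linear model that supports partial_fit
classes = ['neg', 'pos']                           # all labels must be declared up front

for texts, labels in iter_batches():               # hypothetical generator of corpus chunks
    X = vectorizer.transform(texts)                # scipy sparse matrix, never densified
    clf.partial_fit(X, labels, classes=classes)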
I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm. Adapting these examples with k-means to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I have read so far -- please correct me here if needed -- DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem is now that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.
My minimal code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = []
for item in [database]:          # [database] stands in for the document source
    docs.append(item)

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)
X = X.todense()                  # <-- This line was needed to resolve the issue
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...
(My documents are already processed, i.e., stopwords have been removed and a Porter stemmer has been applied.)
When I run this code, I get the following error when instantiating DBSCAN and calling fit():
...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range
Clicking on the line in dbscan_.py that throws the error, I noticed the following lines:
...
X = np.asarray(X)
n = X.shape[0]
...
When I use these two lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after that command X.shape is (), so X.shape[0] bombs -- before, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something computes heavily, but after some seconds I get another error:
...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity
In short, I have no clue how to get DBSCAN working, or what I might have missed in general.
It looks like sparse representations for DBSCAN are supported as of Jan. 2015.
I upgraded sklearn to 0.16.1 and it worked for me on text.
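A minimal sketch of the upgraded setup, assuming the docs list from the question and scikit-learn >= 0.16.1: the sparse TF-IDF matrix can be passed to DBSCAN directly, so the X.todense() workaround is no longer needed. metric='cosine' is my addition as a common choice for text; the answer itself only says that sparse input works.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

X = TfidfVectorizer(min_df=1).fit_transform(docs)                         # stays sparse
labels = DBSCAN(eps=0.3, min_samples=10, metric='cosine').fit_predict(X)  # cluster labels, -1 = noise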
The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but not with the same dimensionality.
Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.
You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.
You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.
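As a hedged illustration of that advice (assuming the sparse TF-IDF matrix X from the question), one way to pick epsilon is to look at the distribution of pairwise cosine distances on a sample and choose a value below which you would still call two documents similar:

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

D = cosine_distances(X[:1000])              # a sample is enough to see the spread
upper = D[np.triu_indices_from(D, k=1)]     # unique document pairs only
print(np.percentile(upper, [5, 25, 50]))    # candidate eps values to inspect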
Mean Shift may actually need your data to be a vector space of fixed dimensionality.