conversion of xarray to numpy array - oom-kill event - memory

I'm using xarray to read in and modify a data set for my analysis.
Here is the data repr:
To plot the data I have to convert it to a numpy array:
Z_diff.values
When doing so I get the error message:
slurmstepd: error: Detected 1 oom-kill event(s) in step 33179783.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
I'm using the following settings:
#SBATCH --ntasks-per-node=16
#SBATCH --nodes=4
#SBATCH --mem=250G

It looks like you're just running out of memory. When you use dask with xarray (link), the data is split into chunks (183 of them in your case), so only a small portion of the dataset is kept in memory at once. Numpy arrays don't work that way: the whole dataset is read into memory when you access .values, which is what exceeds your memory.
Depending on what you're trying to plot, you might be able to plot each chunk individually, or plot data from each chunk one at a time on the same plot, or just plot a subset of the data to avoid reading it all into memory at once. Lastly, if it's available, you could also request more memory until you no longer hit this error.
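For example, a minimal sketch of the subset approach (the dimension name time and the use of xarray's built-in plotting are assumptions about your dataset and workflow):
import matplotlib.pyplot as plt

# Select a small slice first; only that slice is pulled into memory when plotted.
# The dimension name "time" is an assumption about this dataset.
subset = Z_diff.isel(time=0)
subset.plot()                    # xarray's matplotlib wrapper, loads just this 2-D slice
plt.savefig("z_diff_t0.png")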

Related

Dask-ml ParallelPostFit not using distributed and causing memory error on local machine

I want to do Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html, which says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:
import dask.array as da
import dask.dataframe as dd

x = da.from_array(i, chunks=100000)  # i is the input array
t = model.predict(x)                 # model is the ParallelPostFit-wrapped estimator
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However, this does not trigger any computation on the cluster (observed via the dashboard), and it runs my 1 TB RAM machine into a memory error when to_parquet computes, even for a test where the numpy size of x and t is 7 GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue of the size of the input x. It has shape (24507731, 8). If I instead just throw in random data with shape (24507, 8), the computation finishes. This is quite surprising, since ParallelPostFit is supposed to make prediction on large data possible in the first place.
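For reference, a minimal sketch of writing the predictions partition by partition instead of persisting them (an assumption-laden sketch, not a verified fix: it assumes model is an already-fitted estimator, i is a 2-D array, and dask_ml.wrappers.ParallelPostFit is used for the blockwise prediction):
import dask.array as da
import dask.dataframe as dd
from dask_ml.wrappers import ParallelPostFit

clf = ParallelPostFit(estimator=model)               # wrap the fitted estimator
x = da.from_array(i, chunks=(100000, i.shape[1]))    # chunk by rows, keep all columns together
preds = clf.predict(x)                               # lazy dask array, one block per chunk of x
df = dd.from_dask_array(preds, columns="prediction").to_frame()
df.to_parquet("xy.parquet")                          # each worker writes its own partitions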

dask read_parquet runs out of memory

I'm trying to read a big (won't fit in memory) parquet dataset and then sample from it. Each partition of the dataset fits perfectly in memory.
The dataset is about 20 GB of data on disk, divided into 104 partitions of about 200 MB each. I don't want to use more than 40 GB of memory at any point, so I'm setting n_workers and memory_limit accordingly.
My hypothesis was that Dask would load as many partitions as it could handle, sample from them, drop them from memory and then continue loading the next ones. Or something like that.
Instead, judging by the execution graph (104 load operations in parallel, each followed by a sample), it looks like it tries to load all partitions simultaneously, and therefore the workers keep getting killed for running out of memory.
Am I missing something?
This is my code:
from dask.distributed import Client
import dask.dataframe as dd

client = Client(n_workers=4, memory_limit=10e9)  # 10 GB per worker
df = dd.read_parquet('/path/to/dataset/')
df = df.sample(frac=0.01)
df = df.compute()
To reproduce the error, you can create a mock dataset 1/10th the size of the one I was trying to load using the code below, and try my code with memory_limit=1e9 (1 GB) to compensate.
from dask.distributed import Client
from dask import datasets

client = Client()  # add restrictions depending on your system here
df = datasets.timeseries(end='2002-12-31')
df = df.repartition(npartitions=104)
df.to_parquet('./mock_dataset')
Parquet is an efficient binary format, with encoding and compression. There is a very good chance that in memory, it takes up far more space than you think.
In order to sample the data at 1%, each partition is loaded and expanded into memory in its entirety before being sub-selected. This comes with considerable memory overhead from buffer copies. Each worker thread needs to hold the currently-processed chunk as well as the results accumulated so far on that worker, and then a task copies all of these for the final concat operation (which also involves copies and overhead).
The general recommendation is that each worker should have access to "several times" the in-memory size of each partition; in your case those partitions are ~200 MB on disk and bigger in memory.
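One way to keep peak memory per worker down, sketched here under the assumption that the 1% sample is not needed as a single in-memory pandas DataFrame, is to skip the final .compute() and write the sampled frame back to disk partition by partition:
import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=4, memory_limit=10e9)
df = dd.read_parquet('/path/to/dataset/')
sampled = df.sample(frac=0.01)
sampled.to_parquet('./sampled_dataset')  # each partition is loaded, sampled, written, then released
# If the sample does fit comfortably in memory, it can be collected afterwards:
# small_df = dd.read_parquet('./sampled_dataset').compute()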

LSTM machine learning panda

I am actually trying to use TensorFlow with an LSTM.
For that, I have the data in a text file (10 MB).
When I try to copy the data into numpy I get a memory error.
Any suggestions on how to get the data ready so that I can use it in an LSTM?
I read the data from the file before processing it in TensorFlow with this function:
import numpy as np

def read_data(fname):
    with open(fname, encoding="utf8") as f:
        content = f.readlines()
    content = [x.strip() for x in content]
    content = [word for i in range(len(content)) for word in content[i].split()]
    content = np.array(content)
    return content
At np.array(content) it gives a memory error. How can I get around this so that I can use this data in an LSTM in TensorFlow?
Please also suggest if there is an LSTM setup that can read large amounts of data.
The memory error indeed means that you cannot fit the numpy array into your memory, because of the overhead of storing string lists in numpy. The problem is that you are not creating a single matrix of words: each word list in content has a different length, so calling np.array will create an array for each line and then merge them into one large numpy array. This is not what numpy is for. Numpy is efficient when dealing with numerical tensors, not lists of lists of strings.
Here is a related question.
If you plan to use TensorFlow, you can use the tf.data Dataset API. It can load the file line by line, and you can then apply all the processing you need within TensorFlow, e.g., splitting the strings (by calling the map method with tf.strings.split) and padding + batching the data.
You will end up with something like this:
tf.data.TextLineDataset(fname).map(lambda s: tf.strings.split([s])[0])
Note that before batching the data and passing it into the LSTM you need to convert the strings to vocabulary indices and call an embedding lookup on those indices.
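Putting those pieces together, a minimal sketch could look like this (the file names data.txt and vocab.txt, the one-token-per-line vocabulary format, and the batch size of 32 are assumptions for illustration):
import tensorflow as tf

# Hypothetical vocabulary file with one token per line (assumption).
vocab = tf.lookup.StaticVocabularyTable(
    tf.lookup.TextFileInitializer(
        "vocab.txt",
        key_dtype=tf.string, key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
        value_dtype=tf.int64, value_index=tf.lookup.TextFileIndex.LINE_NUMBER),
    num_oov_buckets=1)

dataset = (
    tf.data.TextLineDataset("data.txt")              # read one line at a time
    .map(lambda s: tf.strings.split([s])[0])         # split each line into tokens
    .map(vocab.lookup)                               # tokens -> integer ids
    .padded_batch(32, padded_shapes=[None])          # pad variable-length lines per batch
)
# dataset now yields [batch, max_len] int64 tensors ready for an embedding layer + LSTM.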

tensorflow conv2d memory consumption explain?

output = tf.nn.conv2d(input, weights, strides=[1, 3, 3, 1], padding='VALID')
My input has shape 200x225x225x1 and weights has shape 15x15x1x64. Hence, the output has shape 200x71x71x64, since (225 - 15)/3 + 1 = 71.
TensorBoard shows that this operation consumes 768 MB in total. Assuming that accounts for the size of the input (38.6 MB), weights (0.06 MB) and output (246.2 MB), the total memory consumption should not exceed 300 MB. So where does the rest of the memory consumption come from?
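For reference, a quick check of the arithmetic above (assuming float32, i.e. 4 bytes per element, and MiB as the reported unit):
# Tensor sizes in MiB, assuming float32 (4 bytes per element).
MiB = 1024 ** 2
input_mb  = 200 * 225 * 225 * 1 * 4 / MiB   # ~38.6
weight_mb = 15 * 15 * 1 * 64 * 4 / MiB      # ~0.055
output_mb = 200 * 71 * 71 * 64 * 4 / MiB    # ~246.2
print(input_mb + weight_mb + output_mb)     # ~284.9, well below the reported 768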
Although I'm not able to reproduce your graph and values from the information provided, it's possible that you're seeing additional memory usage due to intermediate values materialized during the computation of Conv2D. It's also possible that the instrumentation is incorrect (e.g., reshape operations that do not result in a copy of Tensor memory end up duplicating the "memory usage" in the TF Node Stats instrumentation). Without a reproducible test case it's hard to say more. If you do feel this is a bug in TensorFlow, please raise an issue on GitHub!

Scikit and Pandas: Fitting Large Data

How do I use scikit-learn to train a model on a large CSV dataset (~75 MB) without running into memory problems?
I'm using an IPython notebook as the programming environment, and the pandas + sklearn packages to analyze data from Kaggle's digit recognizer tutorial.
The data is available on the webpage, here is a link to my code, and here is the error message:
KNeighborsClassifier is used for the prediction.
Problem:
"MemoryError" occurs when loading large dataset using read_csv
function. To bypass this problem temporarily, I have to restart the
kernel, which then read_csv function successfully loads the file, but
the same error occurs when I run the same cell again.
When the read_csv function loads the file successfully, after making changes to the dataframe, I can pass the features and labels to the KNeighborsClassifier's fit() function. At this point, similar memory error occurs.
I tried the following:
Iterate through the CSV file in chunks and fit the data accordingly, but the problem is that the predictive model is overwritten with every new chunk of data.
What do you think I can do to successfully train my model without running into memory problems?
Note: when you load the data with pandas it will create a DataFrame object where each column has a homogeneous datatype for all the rows, but two different columns can have distinct datatypes (e.g., integers, dates, strings).
When you pass a DataFrame instance to a scikit-learn model it will first allocate a homogeneous 2D numpy array with dtype np.float32 or np.float64 (depending on the implementation of the models). At this point you will have 2 copies of your dataset in memory.
To avoid this you could write or reuse a CSV parser that directly allocates the data in the internal format / dtype expected by the scikit-learn model. You can try numpy.loadtxt, for instance (have a look at the docstring for the parameters).
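For example, a minimal sketch with numpy.loadtxt (assuming a purely numeric CSV with a header row and the label in the first column, as in the digit recognizer train.csv):
import numpy as np

# Parse straight into the float32 layout scikit-learn expects, skipping the header row.
data = np.loadtxt("train.csv", delimiter=",", skiprows=1, dtype=np.float32)
y = data[:, 0]    # labels (first column, by assumption)
X = data[:, 1:]   # pixel features, already a homogeneous float32 array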
Also, if your data is very sparse (many zero values), it will be better to use a scipy.sparse data structure and a scikit-learn model that can deal with such an input format (check the docstrings to know). However, the CSV format itself is not very well suited for sparse data, and I am not sure there exists a direct CSV-to-scipy.sparse parser.
Edit: for reference, KNeighborsClassifier allocates a temporary distances array with shape (n_samples_predict, n_samples_train), which is very wasteful when only (n_samples_predict, n_neighbors) is needed instead. This issue can be tracked here:
https://github.com/scikit-learn/scikit-learn/issues/325
