Out of memory error for convolution using Theano - machine-learning

I am doing a convolution in Theano:
theano.tensor.nnet.conv.conv2d(x,h, border_mode='full')
and it runs out of memory, I get the following message:
RuntimeError: GpuCorrMM failed to allocate working memory of 3591 x 319086
Apply node that caused the error: GpuCorrMM_gradInputs{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, (True, False, True, False)), CudaNdarrayType(float32, (False, True, False, False))]
Inputs shapes: [(1, 513, 1, 7), (1, 1, 513, 622)]
Inputs strides: [(0, 7, 0, 1), (0, 0, 622, 1)]
Inputs values: ['not shown', 'not shown']
I have tried setting the Theano flag 'optimizer_excluding=conv_dnn', but it still didn't work. Is there any way around this?

You are trying to allocate a matrix which needs something like 9TB of memory. An individual neuron needs 2.5GB of memory. The only optimization I know of for such issues is to either decrease the number of units or buy more RAM. Loads of RAM :)

In my case, I disabled g++ at runtime by simply removing the (MinGW) bin directory from the PATH variable. Processing is slow, but it completes.
My execution environment: Windows Vista 32-bit, Intel 2.16 GHz CPU, 4.00 GB RAM, and no GPU.

Related

How to reduce the `dask_ml.xgboost` worker memory consumption?

I've been testing the dask_ml.xgboost regressor on a synthetic 10GB dataset. When training, the memory usage of the workers exceeds the amount available on my local laptop. I am aware that I can try running on an online dask cluster with more memory, or that I can sample the data (and ignore the rest) before training. But is there a different solution? I tried limiting the number and the depth of the trees generated, subsampling the rows and columns, and changing the tree construction algorithm, but the workers still run out of memory.
Given a fixed memory allocation, is there a way to reduce the memory consumption of each worker when training dask_ml.xgboost?
Here is a code snippet:
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.xgboost import XGBRegressor
client = Client(memory_limit='7GB')
ddf = dd.read_csv('10GB_float.csv')
X = ddf[ddf.columns.difference(['float_1'])].persist()
y = ddf['float_1'].persist()
reg = XGBRegressor(
    objective='reg:squarederror', n_estimators=10, max_depth=2, tree_method='hist',
    subsample=0.001, colsample_bytree=0.5, colsample_bylevel=0.5,
    colsample_bynode=0.5, n_jobs=-1)
reg.fit(X, y)
The synthetic dataset 10GB_float.csv has 50 columns and 26758707 rows containing random floats (float64) ranging from 0 to 1. Below are the cluster details:
Cluster
Workers: 4
Cores: 12
Memory: 28.00 GB
And some information about my local laptop:
Memory: 31.1 GiB
Processor: Intel® Core™ i7-8750H CPU @ 2.20GHz × 12
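For context, a rough estimate of the raw in-memory footprint of the dataset (just arithmetic on the row/column counts above; actual usage is higher once dask partitions, the persisted X/y copies, and XGBoost's own structures are counted):
rows, cols = 26_758_707, 50
bytes_per_float64 = 8

raw_gb = rows * cols * bytes_per_float64 / 1e9
print(f"~{raw_gb:.1f} GB of raw float64 values")  # ~10.7 GB, against a 7 GB per-worker limit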
Additionally, here are the parameters of XGBRegressor (using .get_params()):
{'base_score': 0.5,
'booster': 'gbtree',
'colsample_bylevel': 0.5,
'colsample_bynode': 0.5,
'colsample_bytree': 0.5,
'gamma': 0,
'importance_type': 'gain',
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 2,
'min_child_weight': 1,
'missing': None,
'n_estimators': 10,
'n_jobs': -1,
'nthread': None,
'objective': 'reg:squarederror',
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1,
'scale_pos_weight': 1,
'seed': None,
'silent': None,
'subsample': 0.001,
'verbosity': 1,
'tree_method': 'hist'}
Thank you very much for your time!
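One general lever for this kind of setup, independent of dask_ml.xgboost itself (a sketch only; the dtype and blocksize values are illustrative and assume float32 precision is acceptable for your data), is to read the CSV as float32 and in smaller partitions, which roughly halves the raw footprint and lowers each task's peak memory:
import dask.dataframe as dd

ddf = dd.read_csv(
    '10GB_float.csv',
    dtype='float32',    # halves the in-memory size versus the default float64
    blocksize='64MB',   # smaller partitions -> lower peak memory per task
)
X = ddf[ddf.columns.difference(['float_1'])].persist()
y = ddf['float_1'].persist()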

Handling Xarray/Dask Memory

I'm trying to use Xarray and Dask to open a multi-file dataset. However, I'm running into memory errors.
I have files that are typically this shape:
xr.open_dataset("/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/outdata/fesom//LIG_coupled_fesom_thetao_19680101.nc")
<xarray.Dataset>
Dimensions:  (depth: 46, nodes_2d: 126859, time: 366)
Coordinates:
  * time     (time) datetime64[ns] 1968-01-02 1968-01-03 ... 1969-01-01
  * depth    (depth) float64 -0.0 10.0 20.0 30.0 ... 5.4e+03 5.65e+03 5.9e+03
Dimensions without coordinates: nodes_2d
Data variables:
    thetao   (time, depth, nodes_3d) float32 ...
Attributes:
    output_schedule:  unit: d first: 1 rate: 1
30 files --> 41.5 GB
I also can set up a dask.distributed Client object:
Client()
<Client: 'tcp://127.0.0.1:43229' processes=8 threads=48, memory=68.72 GB>
So, I would suppose there is enough memory for the data to be loaded. However, when I then run xr.open_mfdataset, I very often get these sorts of warnings:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 8.25 GB -- Worker memory limit: 8.59 GB
I guess there is something I can do with the chunks argument?
Any help would be much appreciated; unfortunately I'm not sure where to begin trying. I could, in principle, open just the first file (they will always have the same shape) to figure out how to ideally rechunk the files.
Thanks!
Paul
Examples of the chunks and parallel keywords to the opening functions, which correspond to how you utilise dask, can be found in this doc section.
That should be all you need!
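For instance, a minimal sketch of what that could look like here (the glob pattern and chunk size are placeholders to tune; the point is that chunks= makes each file load lazily as dask arrays and parallel=True opens the files in parallel):
import xarray as xr

ds = xr.open_mfdataset(
    "/work/ba0989/a270077/coupled_ice_paper/model_data/coupled/LIG_coupled/"
    "outdata/fesom/LIG_coupled_fesom_thetao_*.nc",  # hypothetical glob over the 30 files
    combine="by_coords",
    parallel=True,        # open the files in parallel with dask
    chunks={"time": 30},  # e.g. ~30 timesteps per chunk; tune so chunks stay well under the worker limit
)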

Error: "DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")" using Flux in Julia

I'm still new to Julia and to machine learning in general, but I'm quite eager to learn. In the current project I'm working on, I have a dimension mismatch problem and can't figure out what to do.
I have two arrays as follow:
x_array:
9-element Array{Array{Int64,N} where N,1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 72, 73]
[11, 12, 13, 14, 15, 16, 17, 72, 73]
[18, 12, 19, 20, 21, 22, 72, 74]
[23, 24, 12, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 72, 74]
[36, 37, 38, 39, 40, 38, 41, 42, 72, 73]
[43, 44, 45, 46, 47, 48, 72, 74]
[49, 50, 51, 52, 14, 53, 72, 74]
[54, 55, 41, 56, 57, 58, 59, 60, 61, 62, 63, 62, 64, 72, 74]
[65, 66, 67, 68, 32, 69, 70, 71, 72, 74]
y_array:
9-element Array{Int64,1}
75
76
77
78
79
80
81
82
83
and the next model using Flux:
model = Chain(
    LSTM(10, 256),
    LSTM(256, 128),
    LSTM(128, 128),
    Dense(128, 9),
    softmax
)
I zip both arrays, and then feed them into the model using Flux.train!
data = zip(x_array, y_array)
Flux.train!(loss, Flux.params(model), data, opt)
and it immediately throws the following error:
ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 9")
Now, I know that the first dimension of matrix A is the sum of the hidden layers (256 + 256 + 128 + 128 + 128 + 128) and the second dimension is the input layer, which is 10. The first thing I did was change the 10 to a 9, but then it just throws this error:
ERROR: DimensionMismatch("dimensions must match")
Can someone explain to me what dimensions are the ones that mismatch, and how to make them match?
Introduction
First off, you should know that from an architectural standpoint, you are asking something very difficult from your network; softmax re-normalizes outputs to be between 0 and 1 (weighted like a probability distribution), which means that asking your network to output values like 77 to match y will be impossible. That's not what is causing the dimension mismatch, but it's something to be aware of. I'm going to drop the softmax() at the end to give the network a fighting chance, especially since it's not what's causing the problem.
Debugging shape mismatches
Let's walk through what actually happens inside of Flux.train!(). The definition is surprisingly simple. Ignoring everything that doesn't matter to us, we are left with:
for d in data
    gs = gradient(ps) do
        loss(d...)
    end
end
Therefore, let's start by pulling the first element out of your data, and splatting it into your loss function. You didn't specify your loss function or optimizer in the question. Although softmax usually means you should use crossentropy loss, your y values are very much not probabilities, and so if we drop the softmax we can just use the dead-simple mse() loss. For optimizer, we'll default to good old ADAM:
model = Chain(
    LSTM(10, 256),
    LSTM(256, 128),
    LSTM(128, 128),
    Dense(128, 9),
    #softmax,        # commented out for now
)
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(0.001)
data = zip(x_array, y_array)
Now, to simulate the first run of Flux.train!(), we take first(data) and splat that into loss():
loss(first(data)...)
This gives us the error message you've seen before: ERROR: DimensionMismatch("matrix A has dimensions (1024,10), vector B has length 12"). Looking at our data, we see that yes, indeed, the first element of our dataset has a length of 12. So we will change our model to expect 12 values instead of 10:
model = Chain(
    LSTM(12, 256),
    LSTM(256, 128),
    LSTM(128, 128),
    Dense(128, 9),
)
And now we re-run:
julia> loss(first(data)...)
50595.52542674723 (tracked)
Huzzah! It worked! We can run this again:
julia> loss(first(data)...)
50578.01417593167 (tracked)
The value changes because the RNN holds memory within itself which gets updated each time we run the network, otherwise we would expect the network to give the same answer for the same inputs!
The problem comes, however, when we try to run the second training instance through our network:
julia> loss([d for d in data][2]...)
ERROR: DimensionMismatch("matrix A has dimensions (1024,12), vector B has length 9")
Understanding LSTMs
This is where we run into Machine Learning problems more than programming problems; the issue here is that we have promised to feed that first LSTM network a vector of length 10 (well, 12 now) and we are breaking that promise. This is a general rule of deep learning; you always have to obey the contracts you sign about the shape of the tensors that are flowing through your model.
Now, the reason you're using LSTMs at all is probably because you want to feed in ragged data, chew it up, then do something with the result. Maybe you're processing sentences, which are all of variable length, and you want to do sentiment analysis, or some such. The beauty of recurrent architectures like LSTMs is that they are able to carry information from one execution to the next, and so they can build up an internal representation of a sequence as it is applied to one time point after another.
When building an LSTM layer in Flux, you are therefore declaring not the length of the sequence you will feed in, but rather the dimensionality of each time point; imagine if you had an accelerometer reading that was 1000 points long and gave you X, Y, Z values at each time point; to read that in, you would create an LSTM that takes in a dimensionality of 3, then feed it 1000 times.
Writing our own training loop
I find it very instructive to write our own training loop and model execution function so that we have full control over everything. When dealing with time series, it's often easy to get confused about how to call LSTMs and Dense layers and whatnot, so I offer these simple rules of thumb:
When mapping from one time series to another (E.g. constantly predict future motion from previous motion), you can use a single Chain and call it in a loop; for every input time point, you output another.
When mapping from a time series to a single "output" (E.g. reduce sentence to "happy sentiment" or "sad sentiment") you must first chomp all the data up and reduce it to a fixed size; you feed many things in, but at the end, only one comes out.
We're going to re-architect our model into two pieces; first the recurrent "pacman" section, where we chomp up a variable-length time sequence into an internal state vector of pre-determined length, then a feed-forward section that takes that internal state vector and reduces it down to a single output:
pacman = Chain(
    LSTM(1, 128),    # map from timepoint size 1 to 128
    LSTM(128, 256),  # blow it up even larger to 256
    LSTM(256, 128),  # bottleneck back down to 128
)
reducer = Chain(
    Dense(128, 9),
    #softmax,        # keep this commented out for now
)
The reason we split it up into two pieces like this is because the problem statement wants us to reduce a variable-length input series to a single number; we're in the second bullet point above. So our code naturally must take this into account; we will write our loss(x, y) function so that, instead of calling model(x), it does the pacman dance and then calls the reducer on the output. Note that we also must reset!() the RNN state so that the internal state is cleared for each independent training example:
function loss(x, y)
    # Reset internal RNN state so that it doesn't "carry over" from
    # the previous invocation of `loss()`.
    Flux.reset!(pacman)

    # Iterate over every timepoint in `x`, keeping only the last output.
    # (`local` makes `y_hat` visible after the loop, since `for` introduces
    # its own scope.)
    local y_hat
    for x_t in x
        y_hat = pacman(x_t)
    end

    # Take the very last output from the recurrent section, reduce it
    y_hat = reducer(y_hat)

    # Calculate reduced output difference against `y`
    return Flux.mse(y_hat, y)
end
Feeding this into Flux.train!() actually trains, albeit not very well. ;)
Final observations
Although your data is all Int64's, it's pretty typical to use floating point numbers with everything except embeddings (an embedding is a way to take non-numeric data such as characters or words and assign numbers to them, kind of like ASCII); if you're dealing with text, you're almost certainly going to be working with some kind of embedding, and that embedding will dictate what the dimensionality of your first LSTM is, whereupon your inputs will all be "one-hot" encoded.
softmax is used when you want to predict probabilities; it's going to ensure that for each input, the outputs are all between [0...1] and moreover that they sum to 1.0, like a good little probability distribution should. This is most useful when doing classification, when you want to wrangle your wild network output values of [-2, 5, 0.101] into something where you can say "we have 99.1% certainty that the second class is correct, and 0.7% certainty it's the third class."
When training these networks, you're often going to want to batch multiple time series through your network at once for hardware efficiency. This is both simple and complex: on the one hand, it just means that instead of passing a single Sx1 vector through (where S is the size of your embedding) you pass an SxN matrix; on the other hand, it means that the number of timesteps of everything within your batch must match (because the SxN must remain the same across all timesteps, so if one time series ends before the others in your batch you can't just drop it and thereby reduce N halfway through a batch). So what most people do is pad their timeseries all to the same length.
Good luck in your ML journey!

ZeroDivisionError: float division by zero during net_segment inference patch aggregation

I ran (on Ubuntu 16.04 in a Google Cloud VM Instance):
net_segment inference -c <path-to-config>
for a binary segmentation problem using unet_2d with softmax and a (96,96,1) spatial window.
This was after I trained my model for 10 epochs and saved the checkpoint. I'm not sure why it's raising a zero division error from windows_aggregator_resize.py. What is the cause of this issue and what can I do to fix it?
Here are some inference settings and the corresponding error:
pixdim: (1.0, 1.0, 1.0)
[NETWORK]
batch_size: 1
cutoff: (0.01, 0.99)
name: unet_2d
normalisation: False
volume_padding_size: (96, 96, 0)
reg_type: L2
window_sampling: resize
multimod_foreground_type: and
[INFERENCE]
border = (96,96,0)
inference_iter = -1
output_interp_order = 0
spatial_window_size = (96,96,2)
INFO:niftynet: Accessing /home/xchaosfailx1/niftynet/models/MSD/heart_la_seg/models/model.ckpt-10 ...
INFO:niftynet: Restoring parameters from /home/xchaosfailx1/niftynet/models/MSD/heart_la_seg/models/model.ckpt-10
INFO:niftynet: Cleaning up...
WARNING:niftynet: stopped early, incomplete loops
INFO:niftynet: stopping sampling threads
INFO:niftynet: SegmentationApplication stopped (time in second 17.07).
Traceback (most recent call last):
File "/home/xchaosfailx1/.local/bin/net_segment", line 11, in <module>
sys.exit(main())
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/__init__.py", line 139, in main
app_driver.run_application()
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/engine/application_driver.py", line 275, in run_application
self._inference_loop(session, loop_status)
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/engine/application_driver.py", line 493, in _inference_loop
self._loop(iter_generator(itertools.count(), INFER), sess, loop_status)
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/engine/application_driver.py", line 442, in _loop
iter_msg.current_iter_output[NETWORK_OUTPUT])
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/application/segmentation_application.py", line 390, in interpret_output
batch_output['window'], batch_output['location'])
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/engine/windows_aggregator_resize.py", line 55, in decode_batch
self._save_current_image(window[batch_id, ...], resize_to_shape)
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/engine/windows_aggregator_resize.py", line 82, in _save_current_image
[float(p) / float(d) for p, d in zip(window_shape, image_shape)]
File "/home/xchaosfailx1/.local/lib/python3.5/site-packages/niftynet/engine/windows_aggregator_resize.py", line 82, in <listcomp>
[float(p) / float(d) for p, d in zip(window_shape, image_shape)]
ZeroDivisionError: float division by zero
For reproducing the error:
changed the padding in niftynet.network.unet_2d.py from valid to same
dataset [Task2_Heart] : https://drive.google.com/drive/folders/1HqEgzS8BV2c7xYNrZdEAnrHk7osJJ--2
updated config:
https://drive.google.com/open?id=1RI111BZLv4Lhf9cGvHo_sAHRt_k5Xt0I
I didn't check the inference data, but I think spatial_window_size in [INFERENCE] should be (96, 96, 1), as that's what you set in training.
The mistake that I made was that I set the border (96,96,0) under [INFERENCE] to the same shape as my spatial window (96,96,1), so when the batch was cropped in decode_batch, the cropped image ended up with 0s in its shape. Hence, when the zoom ratio was calculated in _save_current_image, it led to a ZeroDivisionError. The temporary fix was to remove the volume padding and change the border to (0,0,0).
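To illustrate the arithmetic behind that (a standalone sketch, not NiftyNet's actual cropping code apart from the final list comprehension, which is the line from the traceback; the target shape is a made-up placeholder):
# Window and border as set in the config above
spatial_window_size = (96, 96, 1)
border = (96, 96, 0)

# Cropping the border off both sides of the window leaves zero voxels
# along the first two axes.
image_shape = tuple(max(w - 2 * b, 0) for w, b in zip(spatial_window_size, border))
print(image_shape)  # (0, 0, 1)

# The zoom-ratio computation in _save_current_image then divides by those zeros:
window_shape = (320, 232, 1)  # hypothetical shape to resize back to
try:
    zoom_ratio = [float(p) / float(d) for p, d in zip(window_shape, image_shape)]
except ZeroDivisionError as err:
    print(err)  # float division by zero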

False Positives with Face recognition

I have a CNN trained on images (cropped faces) of Mark Ruffalo. For my positive class I have around 200 images, and for the negative datapoints I have sampled 200 random faces.
The model has high recall but very low precision. How could I increase the precision? Also, I am constrained by the number of positive images that I have. I am ready to compromise on recall in this tradeoff.
I have tried increasing the number of negative samples, but that introduces a form of bias and the model starts classifying everything as negative to reach a local optimum.
I have based my CNN on OverFeat:
local features = nn.Sequential()
features:add(nn.SpatialConvolutionMM(3, 96, 11, 11))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolutionMM(96, 256, 5, 5))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
features:add(nn.SpatialConvolutionMM(256, 512, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- 24x24x512
features:add(nn.SpatialConvolutionMM(512, 1024, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
--11x11x1024
features:add(nn.SpatialConvolutionMM(1024, 1024, 3, 3))
features:add(nn.ReLU())
features:add(nn.SpatialMaxPooling(2, 2, 2, 2))
-- 1.3. Create Classifier (fully connected layers)
local classifier = nn.Sequential()
classifier:add(nn.View(1024*4*4))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(1024*4*4, 3072))
classifier:add(nn.Threshold(0, 1e-6))
classifier:add(nn.Dropout(0.5))
classifier:add(nn.Linear(3072, 4096))
classifier:add(nn.Threshold(0, 1e-6))
classifier:add(nn.Linear(4096, noutputs))
model = nn.Sequential():add(features):add(classifier)
Kindly Help
Try playing with the raw output of the CNN instead of taking the sign() of the output node (since this is a positive/negative classification, I assume there is only one output, in the range [-1, 1]).
For instance, for one sample, the output could be [0.9], indicating that the positive class should be picked. But if you play with these values, you can hopefully find a specific threshold that gives you the precision you need. In other words, if you find that anything greater than [-0.35] should actually be chosen as the positive class because it gives you better precision, then -0.35 should be your threshold value.
This is where ROC analysis comes in handy.
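As a concrete sketch of that idea (hedged: this uses a precision-recall curve rather than an ROC curve, since precision is the metric you want to control, and the scores/labels below are placeholders for your own held-out validation outputs):
import numpy as np
from sklearn.metrics import precision_recall_curve

# Raw network outputs in [-1, 1] on a held-out set, plus the true labels
# (1 = Mark Ruffalo, 0 = random face). Placeholder values for illustration.
scores = np.array([-0.9, -0.6, -0.35, -0.1, 0.2, 0.7, 0.9])
labels = np.array([0, 0, 1, 0, 1, 1, 1])

precision, recall, thresholds = precision_recall_curve(labels, scores)

# precision/recall have one more entry than thresholds; drop the last point
# so the arrays line up with the thresholds.
target_precision = 0.95
candidates = thresholds[precision[:-1] >= target_precision]

if len(candidates):
    # The lowest threshold that meets the precision target keeps the most recall.
    print("classify as positive when score >=", candidates.min())
else:
    print("no threshold reaches the target precision on this validation set")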
Let me know if this helps.
