Storing specific shared variables in CPU memory

Is it possible in Theano to selectively keep some shared variables on the CPU? I have a huge output-layer matrix over the entire vocabulary (~2M entries) that won't fit in GPU memory. I have experimented with reducing its size through sampling, but I want to see if I can use the entire matrix. One option would be to set device=cpu,init_gpu_device=gpu in the Theano flags, but this seems to use the GPU only on a need basis. I checked the tutorial and it doesn't seem to have more details.
I wonder if it is possible to specify that one or a few shared variables be stored on the CPU; I guess one would do this when creating the shared variable. Having some of the variables on the GPU should be faster than keeping everything on the CPU, right? Or does Theano somehow figure out which ones to keep or move automatically? I would appreciate some explanation.
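For reference, those flags have to be in effect before Theano is first imported; a minimal sketch of one way to do that from inside a script (the placement shown here is only illustrative):

```python
import os

# Must run before the first `import theano`.
# device=cpu keeps the default storage on the host, while
# init_gpu_device=gpu still initializes the GPU so computations
# can be moved there explicitly.
os.environ['THEANO_FLAGS'] = 'device=cpu,init_gpu_device=gpu'

import theano
```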

In newer Theano (I forget whether it was Theano 0.8.2 or the development version of Theano 0.9), there is a different interface: you can do theano.shared(data, target='cpu').
Continue to initialize the GPU as you did before.
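A minimal sketch of how that could look for the setup described in the question (the sizes and variable names are hypothetical; only the oversized output matrix is pinned to the host, everything else stays on the default device):

```python
import numpy as np
import theano

floatX = theano.config.floatX
vocab_size, hidden = 2000000, 512  # hypothetical sizes

# Huge output-layer weights kept in host (CPU) memory:
W_out = theano.shared(
    np.zeros((vocab_size, hidden), dtype=floatX),
    name='W_out',
    target='cpu',
)

# Smaller parameters left on the default device (the GPU, if one is configured):
W_hid = theano.shared(np.zeros((hidden, hidden), dtype=floatX), name='W_hid')
```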

Related

Is there a way to use distributed training with DASK using my GPU?

As of now, LightGBM supports GPU training and distributed training (using Dask).
If it is possible, how can I use distributed training with Dask together with my GPU, or is there another way to do so?
My actual task is to combine the power of the GPU and distributed training in a LightGBM model.
It may be that I am missing a concept, because I'm a beginner.
I'm not a LightGBM expert, so it might be better to wait for someone to chime in. But from what I've been able to find, LightGBM does not currently support using Dask and GPU training together.
See https://github.com/microsoft/LightGBM/issues/4761#issuecomment-956358341:
Right now the dask interface doesn't directly support distributed training using GPU, you can subscribe to #3776 if you're interested in that. Are you getting any warnings about this? I think it probably isn't using the GPU at all.
Furthermore, if your data fits in a single machine then it's probably best not using distributed training at all. The dask interface is there to help you train a model on data that doesn't fit on a single machine by having partitions of the data on different machines which communicate with each other, which adds some overhead compared to single-node training.
And https://github.com/microsoft/LightGBM/issues/3776:
The Dask interface in https://github.com/microsoft/LightGBM/blob/706f2af7badc26f6ec68729469ec6ec79a66d802/python-package/lightgbm/dask.py currently only supports CPU-based training.
Anyway, if you have only one GPU, Dask shouldn't be of much help.
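If the goal is just GPU training on a single machine (no Dask), a sketch along these lines should work, assuming LightGBM was installed/built with GPU support; the dataset here is only a placeholder:

```python
import lightgbm as lgb
from sklearn.datasets import make_classification

# Placeholder data; replace with your own.
X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

# device='gpu' requires a GPU-enabled LightGBM build;
# otherwise LightGBM raises an error at training time.
model = lgb.LGBMClassifier(device='gpu', n_estimators=200)
model.fit(X, y)
```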

CatBoost Machine Learning hyperparameters: why not always use `thread_count = -1`?

With respect specifically to CatBoost:
1. Under what scenarios might one want to use fewer than the maximum number of threads of one's CPU? I cannot find an answer to this.
2. Is there a fixed cost/overhead associated with each core utilized? I.e., is more always better for all data set types/sizes?
3. Do the answers to the questions above generalize to all machine learning algorithms?
I think that most of the reasons for changing the thread_count are not CatBoost-specific; other libraries like sklearn offer the same feature. Reasons for not running with all CPUs are:
- Debugging: if there is a problem, it can be handy to have only one thread, which keeps the process simpler.
- You want other processes on your machine to have CPU power, especially if you share a server for in-memory data analysis with a team of data scientists. Your colleagues won't be happy if you take all the resources.
- Your job is so small that it simply does not need all the resources.
- You parallelize in another way: for example, you try different hyperparameters using cross-validation. Then it makes more sense to dedicate one CPU to training each model rather than training one model with all CPUs and then moving on to train the next model with all CPUs (see the sketch after this answer).
I hope this answers question 1. This generalizes to other in-memory ML libraries like sklearn.
Regarding question 2, I'm not sure. CatBoost does the parallelisation somewhere in its C++ code and uses it via Cython in the Python package. I assume it introduces some overhead (since parallel computing always introduces overhead), but it's probably not too much. You could find out by timing some experiments.
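As a concrete illustration of the cross-validation point above, a sketch (the hyperparameter values and data are only examples): cap thread_count per model and run the candidates in parallel instead of giving each fit every core.

```python
from joblib import Parallel, delayed
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Placeholder data; replace with your own.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

def fit_one(depth):
    # Each model is restricted to a single thread...
    model = CatBoostClassifier(depth=depth, iterations=200,
                               thread_count=1, verbose=False)
    return depth, cross_val_score(model, X, y, cv=3).mean()

# ...and the hyperparameter candidates are trained in parallel instead.
results = Parallel(n_jobs=4)(delayed(fit_one)(d) for d in [4, 6, 8, 10])
print(results)
```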

TensorFlow iOS memory warnings

We are building an iOS app to perform image classification using the TensorFlow library.
Using our machine learning model (91MB, 400 classes) and the TensorFlow 'simple' example, we get memory warnings on any iOS device with 1GB of RAM. Devices with 2GB of RAM do not show any warnings, while devices with less than 1GB run out of memory completely and the app crashes.
We are using the latest TensorFlow code from the master branch that includes this iOS memory performance commit, which we thought might help but didn't.
We have also tried setting various GPU options on our TF session object, including set_allow_growth(true) and set_per_process_gpu_memory_fraction().
Our only changes to the TF 'simple' example code are a wanted_width and wanted_height of 299, and an input_mean and input_std of 128.
Has anyone else run into this? Is our model simply too big?
You can use memory mapping; have you tried that? TensorFlow provides documentation for it. You can also round your weight values to fewer decimal places.
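For the weight-rounding suggestion, the TensorFlow 1.x graph transform tool can be driven from Python roughly like this (the file names and the input/output node names are placeholders for your own graph):

```python
import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# Load the frozen GraphDef (placeholder path).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_model.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# round_weights snaps each weight to one of 256 levels, which usually has a
# negligible accuracy cost but makes the graph file compress much better.
transformed = TransformGraph(graph_def,
                             inputs=['Mul'],            # placeholder input node
                             outputs=['final_result'],  # placeholder output node
                             transforms=['round_weights(num_steps=256)'])

with tf.gfile.GFile('frozen_model_rounded.pb', 'wb') as f:
    f.write(transformed.SerializeToString())
```

For the runtime memory itself, the memory-mapping route (converting the GraphDef to the memmapped format so the weights are not all loaded into RAM) is the documented fix for large models on iOS.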

Tensorflow or Theano: Is there a special option for evaluation to use less memory?

I am exploring some of the deep learning libraries including Chainer, Torch, TensorFlow, and Theano.
I was formerly a Chainer user, and I find that Theano and TensorFlow offer great flexibility and seem to have good future potential.
However, what keeps me from moving to Theano or TensorFlow is a memory issue. Is there an option to make Theano or TensorFlow not keep the computation history? In Chainer this can be done by setting the volatile flag, so that I can evaluate large data with less memory, because it does not keep data that is only needed for calculating gradients.
I work primarily with RNNs, and the typical approach to training RNNs is truncated BPTT. However, I have found it useful, and slightly more accurate, to feed the full sequence to the network when I only want forward computation, not backpropagation.
I tried to find such an option in the documentation of both frameworks, but couldn't. Is there a reason this feature cannot be implemented?
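For comparison, the Chainer mechanism referred to above looked roughly like this in the Chainer 1.x API (the array contents are placeholders):

```python
import numpy as np
from chainer import Variable

x_data = np.random.randn(64, 100).astype(np.float32)  # placeholder batch

# Training-time variable: the computation history is kept for backprop.
x_train = Variable(x_data)

# Evaluation-time variable: volatile='on' tells Chainer 1.x not to build the
# backward graph, so intermediates needed only for gradients are freed.
x_eval = Variable(x_data, volatile='on')
```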

Would it work, and be faster, if I called functions from the OpenCV GPU module inside my kernel function?

OpenCV has a GPU-accelerated Computer Vision module, gpu (http://docs.opencv.org/modules/gpu/doc/gpu.html). Many of its functions already use GPU techniques, so I can use those functions directly. But I wonder whether it would be faster if I wrote my own kernel and called OpenCV GPU-module functions inside each kernel. The scenario is that I have many images, and to handle each image I would call an OpenCV GPU-module function, so it would be nested parallelism (parallel within parallel).
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster, unless somebody already implemented that same algorithm using the approach you have in mind, and then shared a report about the benchmark tests.
There are a number of factors involved:
- The type of operation you are trying to implement: techniques with high arithmetic intensity are certainly a better fit for GPUs; however, not all problems can be modeled for GPUs.
- The size of the input images: the time spent sending data from RAM to the GPU might not pay off in the end, so running the algorithm on the CPU can be faster for small images.
- The model/power of the CPU and GPU: if the computer has a really weak GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV's GPU module will always run its algorithms faster than the CPU you have. Test it, measure it! The only way to know for sure is through experimentation and benchmarking.
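A rough way to do that benchmarking from Python, assuming an OpenCV build with CUDA support (the cv2.cuda bindings correspond to the old gpu module; the image size and the resize operation are arbitrary examples):

```python
import time
import cv2
import numpy as np

img = np.random.randint(0, 256, (2160, 3840, 3), dtype=np.uint8)  # fake 4K frame

def bench(fn, n=50):
    # Average wall-clock time per call.
    t0 = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - t0) / n

# CPU version of the operation.
cpu_ms = bench(lambda: cv2.resize(img, (1920, 1080))) * 1000

# GPU version: upload once, then run the same operation on the device.
gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)
gpu_ms = bench(lambda: cv2.cuda.resize(gpu_img, (1920, 1080))) * 1000

print(f"CPU: {cpu_ms:.2f} ms/frame   GPU: {gpu_ms:.2f} ms/frame")
```

Note that this GPU timing ignores the cost of downloading results back to host memory, which is exactly the kind of transfer overhead mentioned above, so include it if your real pipeline needs the data back on the CPU.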
