We are building an iOS app to perform image classification using the TensorFlow library.
Using our machine learning model (91MB, 400 classes) and the TensorFlow 'simple' example, we get memory warnings on any iOS device with 1GB of RAM. Devices with 2GB of RAM do not show any warnings, while devices with less than 1GB run out of memory completely and crash the app.
We are using the latest TensorFlow code from the master branch that includes this iOS memory performance commit, which we thought might help but didn't.
We have also tried setting various GPU options on our TF session object, including set_allow_growth(true) and set_per_process_gpu_memory_fraction().
Our only changes to the TF 'simple' example code are a wanted_width and wanted_height of 299, and an input_mean and input_std of 128.
Has anyone else run into this? Is our model simply too big?
Have you tried memory mapping? TensorFlow provides documentation for it. You can also round your weight values to fewer decimal places.
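To make the two suggestions above concrete, here is a minimal, framework-independent sketch using only the Python standard library. It is not TensorFlow's own mechanism: the file layout (raw little-endian float32 values) and the helper names `write_weights`, `read_weight_slice` and `round_weights` are assumptions made for illustration. It shows how a memory map lets you read part of a weight file without loading the whole file onto the heap, and what rounding weights looks like.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = 4  # bytes per float32 value (assumed file layout)


def write_weights(path, weights):
    """Write weights to disk as raw little-endian float32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%df" % len(weights), *weights))


def read_weight_slice(path, start, count):
    """Read `count` float32 values starting at index `start` via a memory map.

    The OS pages the file in on demand, so only the pages backing the
    requested slice need to be resident in memory.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            fmt = "<%df" % count
            return list(struct.unpack_from(fmt, mapped, start * FLOAT_SIZE))


def round_weights(weights, decimals):
    """Round each weight to a fixed number of decimal places."""
    return [round(w, decimals) for w in weights]


if __name__ == "__main__":
    # Hypothetical weights, chosen to be exactly representable in float32.
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_weights(path, [0.5, 1.25, -2.0, 3.75])
    print(read_weight_slice(path, 1, 2))
    print(round_weights([0.123456, -1.98765], 2))
```

Note that rounding alone does not shrink values that are still stored as 32-bit floats. It mainly makes the file compress better, so it helps download size more than runtime memory.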
I am currently playing around with some generative models, such as stable-diffusion, and I was wondering whether it is technically possible and actually sensible to fine-tune the model on a GeForce RTX 3070 with 8GB of VRAM. It's just to play around a bit, so I would use a small dataset and I don't expect good results. To my understanding, if I turn the batch size down far enough and use lower-resolution images, it should be technically possible. Or am I missing something? Their repository says that you need a GPU with at least 24GB.
I have not started coding yet because I wanted to check whether it is even possible before I set everything up and then find out it does not work.
I've read about someone who was able to train with 12GB of RAM by following the instructions in this video:
https://www.youtube.com/watch?v=7bVZDeGPv6I&ab_channel=NerdyRodent
It sounds a bit painful though. You would definitely want to try the --xformers and --lowvram command-line arguments when you start up SD. I would love to hear how this turns out and whether you get it working.
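As a rough sanity check on the memory question, here is a back-of-the-envelope sketch in Python. The parameter count (about 860 million for the diffusion U-Net) and the byte sizes (fp32 weights, fp32 gradients, two fp32 Adam moments per parameter) are assumptions for illustration, not figures from the repository. Activations are left out, and they are the only part that shrinks with batch size and image resolution.

```python
GIB = 2 ** 30

# Assumption: approximate parameter count of the network being fine-tuned.
ASSUMED_UNET_PARAMS = 860_000_000


def training_memory_gib(num_params,
                        bytes_per_weight=4,           # fp32 weights
                        bytes_per_grad=4,             # fp32 gradients
                        optimizer_bytes_per_param=8):  # two fp32 Adam moments
    """Estimate the fixed memory cost of full fine-tuning, excluding activations."""
    per_param = bytes_per_weight + bytes_per_grad + optimizer_bytes_per_param
    return num_params * per_param / GIB


if __name__ == "__main__":
    fixed = training_memory_gib(ASSUMED_UNET_PARAMS)
    print("fixed cost: %.1f GiB" % fixed)
    print("fits in 8 GiB before activations:", fixed < 8)
```

Under these assumptions the fixed cost alone exceeds 8GB, regardless of batch size. That would explain why low-memory setups tend to rely on tricks such as half precision, memory-efficient attention, or training only a small subset of the weights.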
I currently do text-to-speech using Tacotron 2 and HiFi-GAN. It works well on a GPU, but after deploying to a server and running the model on the CPU, the result is not as good as before.
So my question is: does inference on the CPU lower the model's accuracy?
If yes, please explain or point me to a reference paper or article.
One more thing: I noticed that when I run
model.cuda().eval().half()
and save the Tacotron 2 model, the model size is reduced by half and it seems to run fine. If I use this half-size model, will it lower the accuracy too?
You want to look into mixed-precision training (see the NVIDIA and TensorFlow documentation).
Machine learning doesn't usually need high-precision floating point.
GPUs and frameworks can take this into account to speed up training.
However, during deployment the model doesn't take this into account.
The deeper your model, the more this may be an issue because the slight differences add up.
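The point about small differences adding up can be illustrated with a toy sketch in plain Python, using the `struct` module's half-precision format (`"e"`). This is not a measurement of Tacotron 2 or of any GPU/CPU difference. The helpers `to_half` and `accumulate` are hypothetical names, and the repeated addition stands in for a long chain of operations in a deep network.

```python
import struct


def to_half(x):
    """Round a Python float to the nearest float16 value and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(values, half=False):
    """Sum values, optionally rounding the running total to float16 each step."""
    total = 0.0
    for v in values:
        total += v
        if half:
            total = to_half(total)
    return total


if __name__ == "__main__":
    values = [0.1] * 1000
    full = accumulate(values)             # double-precision running sum
    low = accumulate(values, half=True)   # float16 running sum
    print("single value error:", abs(to_half(0.1) - 0.1))
    print("accumulated drift: ", abs(low - full))
```

A single conversion loses very little, but the error of the running total grows as more operations are chained. That is the effect described above for deeper models.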
I am running the camera iOS example distributed with TensorFlow, and it is quite slow: 4-5 seconds per inference on an iPhone 6, running the inception5h.zip model.
To my understanding, this is the GoogLeNet model, which is lightweight, and the iOS code pulls its first output layer, which is about half the size of the full model. I ran the same model with the Python interface on my MacBook, where it takes 30 ms per inference.
So I am wondering why running the same model is about 150x slower on iOS than on the MacBook. It seems I'm doing something obviously wrong.
This isn't well-documented yet, but you need to pass in optimization flags to the compile script to get a fast version of the library. Here's an example:
tensorflow/contrib/makefile/compile_ios_tensorflow.sh "-Os"
That should bring your speed up a lot. Informally, I see a second or less with GoogLeNet on a 5S.
Is it possible in Theano to selectively keep some shared variables on the CPU? I have a huge matrix in the output layer over the entire vocabulary (~2M) that won't fit in GPU memory. I have experimented with reducing its size through sampling, but I want to see if I can use the entire matrix. One way to do it is to use device=cpu,init_gpu_device=gpu in the Theano flags, but this seems to use the GPU only on an as-needed basis. I checked the tutorial and it doesn't seem to have more details.
I wonder if it is possible to specify that one or a few shared variables should be stored on the CPU, perhaps when creating the shared variable. Having some of the variables on the GPU will be faster than having everything on the CPU, right? Or does Theano somehow figure out automatically which ones to keep or move? I would appreciate some explanation.
In newer Theano (I forget whether it is Theano 0.8.2 or the dev version of Theano 0.9), there is a different interface. You can do theano.shared(data, target='cpu')
Continue to initialize the GPU as you did before.
I'm working on a robot vision system whose main purpose is to detect objects. I want to choose one of these libraries (CImg, OpenCV), and I have some knowledge of both.
The robot I'm using runs Linux with a 1GHz CPU and 1GB of RAM, I'm using C++, and the image size is 320p.
I want real-time image processing at around 20 out of 25 frames per second.
In your opinion, which library is more powerful? I have tested both and they have about the same processing time. OpenCV is slightly better, and I think that's because I use pointers in my OpenCV code.
Please share your ideas and your reasons.
Thanks.
I think you can probably get the best performance by integrating OpenCV with IPP.
See this reference, http://software.intel.com/en-us/articles/intel-integrated-performance-primitives-intel-ipp-open-source-computer-vision-library-opencv-faq/
Here is another reference http://experienceopencv.blogspot.com/2011/07/speed-up-with-intel-integrated.html
Further, once you have settled on an algorithm that works well, you can usually isolate it and then do serious optimization (memory optimization, porting to assembly, and so on) that a ready-made library might not provide.
It really depends on what you want to do (what kind of objects you want to detect, the accuracy you need, which algorithm you are using, etc.) and how much time you have. For generic computer vision or image processing, I would stick with OpenCV. As Dipan said, do consider further optimization. In my experience with optimization for computer vision, the bottleneck is usually memory interconnect bandwidth (or memory itself), so you might have to spend extra cycles (computation) to save on communication. Make sure you understand the algorithm really well before optimizing it further; that can sometimes give huge improvements compared with what the compiler achieves.