CoreML + MobileNet SSD v1 slow on my MacBook Air - iOS

I'm new to Core ML and have only been toying with it since today. I tried to use a machine learning model called MobileNet SSD in real time. It works, but it's rather slow. I see people talking about at least 20-30 fps, but my MacBook gets to maybe three at most. I'm not sure where to start looking for what I did wrong, though.
I based my project on https://github.com/vonholst/SSDMobileNet_CoreML (which is for iOS; I translated it to macOS). If I run that project in the simulator, it's slow there too.
I also tried using that GitHub project with the iPhone simulator by feeding it the same image over and over again (rather than sampling from the camera), and it still gets stuck at about the same frame rate.
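To try to rule out my capture code entirely, I was also going to time the raw model outside the app with something along these lines (a rough Python sketch using coremltools; the model filename and input name are just placeholders for my converted model):

import time
from PIL import Image
import coremltools as ct

# Load the converted model and a fixed test image (names are placeholders).
model = ct.models.MLModel("ssd_mobilenet.mlmodel")
image = Image.open("test.jpg").resize((300, 300))  # MobileNet SSD input size

model.predict({"image": image})  # warm-up run
start = time.time()
runs = 50
for _ in range(runs):
    model.predict({"image": image})
print("average latency: %.1f ms" % ((time.time() - start) / runs * 1000))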
What could cause this?

Related

Is it technically possible to fine-tune Stable Diffusion (any checkpoint) on a GPU with 8GB of VRAM?

I am currently playing around with some generative models, such as Stable Diffusion, and I was wondering whether it is technically possible, and actually sensible, to fine-tune the model on a GeForce RTX 3070 with 8GB of VRAM. It's just to play around a bit, so the dataset is small and I don't expect good results out of it, but to my understanding, if I turn the batch size down far enough and use lower-resolution images, it should be technically possible. Or am I missing something? On their repository they say that you need a GPU with at least 24GB.
I have not started coding yet because I wanted to first check whether it's even possible, before I end up setting everything up and then find out it does not work.
I've read about a person who was able to train using 12GB of RAM, following the instructions in this video:
https://www.youtube.com/watch?v=7bVZDeGPv6I&ab_channel=NerdyRodent
It sounds a bit painful, though. You would definitely want to try the --xformers and --lowvram command line arguments when you start up SD. I would love to hear how this turns out and if you get it working.
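If you go through the diffusers library instead of the webui, the usual memory-saving knobs look roughly like this (just a sketch of the settings, not a full training loop; the checkpoint id is only an example):

import torch
from diffusers import StableDiffusionPipeline

# Load in half precision to roughly halve weight/activation memory.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

# Recompute activations in the backward pass instead of storing them.
pipe.unet.enable_gradient_checkpointing()
# Memory-efficient attention (requires the xformers package).
pipe.enable_xformers_memory_efficient_attention()

# A training loop would then use batch_size=1 at a reduced resolution
# (e.g. 256x256) and gradient accumulation to simulate a larger batch.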

Training ML Models on iOS Devices

Is there any way to train PyTorch models directly on-device on an iPhone via the GPU? The PyTorch Mobile docs seem to be completely focused on inference only, as do the iOS app examples (https://github.com/pytorch/ios-demo-app). I did find this article about using the MPS backend on Macs (https://developer.apple.com/metal/pytorch/), but I'm not sure whether that is at all viable for iOS devices. There's also this prototype article about using the iOS GPU for PyTorch Mobile (https://pytorch.org/tutorials/prototype/ios_gpu_workflow.html), but it too seems to be focused on inference only.
We are attempting to train a large language model on the iPhone 14 and in order to make that possible given the memory constraints, we would like to a) discard intermediate activations and recompute them, and b) manage memory directly to write some intermediate activations to the filesystem and later read them back. We suspect that converting a PyTorch model to CoreML format and using CoreML for training would prevent us from making these low-level modifications, but PyTorch might have the APIs necessary for this. If there's any examples/pointers that anyone can link to that would be great.
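For (a), the mechanism we had in mind on the PyTorch side is activation checkpointing, roughly like this minimal sketch (the TinyBlock module is made up purely for illustration); the open question is whether anything equivalent can run on the iPhone itself:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Made-up block purely for illustration.
class TinyBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)

blocks = nn.ModuleList(TinyBlock() for _ in range(8))
x = torch.randn(4, 128, 256, requires_grad=True)

h = x
for block in blocks:
    # Activations inside the block are not kept; they are recomputed
    # when backward() reaches this point.
    h = checkpoint(block, h, use_reentrant=False)
h.sum().backward()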

TensorFlow iOS memory warnings

We are building an iOS app to perform image classification using the TensorFlow library.
Using our machine learning model (91MB, 400 classes) and the TensorFlow 'simple' example, we get memory warnings on any iOS device with 1GB of RAM. Devices with 2GB of RAM do not show any warnings, while devices with less than 1GB run out of memory entirely and crash the app.
We are using the latest TensorFlow code from the master branch that includes this iOS memory performance commit, which we thought might help but didn't.
We have also tried setting various GPU options on our TF session object, including set_allow_growth(true) and set_per_process_gpu_memory_fraction().
Our only changes to the TF 'simple' example code are a wanted_width and wanted_height of 299, and an input_mean and input_std of 128.
Has anyone else run into this? Is our model simply too big?
You can use memory mapping; have you tried that? TensorFlow provides documentation for it. You can also round your weight values to fewer decimal places.
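If I remember right, the weight rounding/quantization is done on the frozen graph with the graph transform tool before you bundle it into the app; from memory it looks something like this (TF 1.x-era API, and the tensor names are placeholders for your graph):

import tensorflow as tf
from tensorflow.tools.graph_transforms import TransformGraph

# Load the frozen graph that gets bundled into the app.
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

# "input"/"output" are placeholders for the graph's real tensor names.
transformed = TransformGraph(
    graph_def, ["input"], ["output"],
    ["quantize_weights"],  # or e.g. "round_weights(num_steps=256)"
)

with tf.gfile.GFile("frozen_model_quantized.pb", "wb") as f:
    f.write(transformed.SerializeToString())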

TensorFlow running slow on iOS

I am running the camera iOS example distributed by TensorFlow, and it is quite slow: 4-5 seconds per inference on an iPhone 6, running the inception5h.zip model.
To my understanding, this is the GoogLeNet model, which is lightweight, and the iOS code pulls its first output layer, which is about half the size of the full model. I ran the same model with the Python interface on my MacBook, and it takes 30 ms per inference.
So I am wondering why running the same model is about 150x slower on iOS than on the MacBook. It seems I'm doing something obviously wrong.
This isn't well-documented yet, but you need to pass in optimization flags to the compile script to get a fast version of the library. Here's an example:
tensorflow/contrib/makefile/compile_ios_tensorflow.sh "-Os"
That should bring your speed up a lot; informally, I see a second or less with GoogLeNet on a 5S.

Image Processing - BeagleBone vs Raspberry Pi

I've been researching for a while and have found tons of helpful resources on this subject, but I figured I would lay down my specifications here so I can get some recommendations from people experienced in this area. It seems like a BeagleBone or a Raspberry Pi with a Logitech or Microsoft camera are my best options at this point.
My target speed is 50 fps (20 ms per image) including the processing involved. From what I've looked at, this doesn't seem feasible, considering most webcams don't go much past 30 fps. More specifically, I need to take the endpoints of an object (like a sheet of paper) and calculate where the midpoint is. Nothing incredibly fancy. 1080p isn't a requirement; I can most likely go much lower. Python is preferable over C and C++ since I've already done a lot of image processing with Python.
It looks like a lot of the code I'll be needing is mostly open-source already, so I really just need to figure out what controller/camera combo I should be using.
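For context, the per-frame processing I have in mind is roughly the following (an OpenCV sketch; the threshold value and camera index are placeholders):

import cv2

cap = cv2.VideoCapture(0)  # camera index is a placeholder

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Bright sheet of paper against a darker background; threshold is a guess.
    _, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        sheet = max(contours, key=cv2.contourArea)
        # Leftmost and rightmost contour points act as the two endpoints.
        left = tuple(sheet[sheet[:, :, 0].argmin()][0])
        right = tuple(sheet[sheet[:, :, 0].argmax()][0])
        mid = ((left[0] + right[0]) // 2, (left[1] + right[1]) // 2)
        print("endpoints:", left, right, "midpoint:", mid)

cap.release()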
It's still a bit of a toss-up between the two; however, here are my views.
The BBB will use a USB webcam, and that will take a certain amount of processing power just to get the image. After that you can manipulate it with SimpleCV.
The RPi has a camera board that they say will only use < 3% of the CPU, and the rest can be used for processing your image. Plus, you can overclock the RPi to 1 GHz.
Using the RPi with a basic webcam does not give a very good result, whereas the RPi camera works directly on the CSI bus and is set to do 1080p natively. Plus, they now have drivers for the camera that work with SimpleCV too.
IMHO the RPi B and camera board would be technically faster than the BBB, but it also depends on what manipulation you plan to do.
Marc
