Is it possible to run SSD or YOLO on a Raspberry Pi 3 for live object detection (2-4 frames per second)?
I've tried this SSD implementation in Python, but it takes 14 seconds per frame.
I recently started looking into object detection for a project of mine and was wondering if I am missing something to get things off the ground.
I want to implement a real-time object detection system on a Raspberry Pi 3 for surveillance of an open space, e.g. a garden. I have already tried a few available solutions. I don't need to detect many classes (only 3: person, dog, bicycle), so maybe the fastest option can be retrained with fewer filters and parameters, thereby decreasing the total compute time.
Darknet (YOLO) [https://github.com/pjreddie/darknet]: Installed the default Darknet and tested YOLOv2; it runs on a Raspberry Pi 3 at approximately 450 seconds per image. Tiny YOLO ran at 40 seconds per image.
TensorFlow (Google) Object Detection API [https://github.com/tensorflow/models/blob/master/object_detection/g3doc/installation.md]: I tried all available networks. The best performing one was the SSD Inception network, which runs at 26 seconds per image.
Microsoft Embedded Learning Library (ELL) [https://github.com/Microsoft/ELL]: I could not get this to work for compilation reasons, but I will try to check it out again later. Please let me know if this worked for you and how it performs on object detection tasks.
Darknet-NNPACK [https://github.com/thomaspark-pkj/darknet-nnpack]: Here Darknet is optimized for ARM processors, with the convolutions implemented via FFT-based routines, which speeds things up a lot.
This has shown the most promise so far, but it has its problems.
Installed it and tested YOLO (full v1); on a Raspberry Pi 3 each image requires approximately 45 seconds, which is 10x faster than the default YOLO build. Tiny YOLO ran at 1.5 seconds per frame but gives no results.
This is possibly a bug caused by version conflicts between the model weights and the cfg files. I opened a GitHub issue [https://github.com/thomaspark-pkj/darknet-nnpack/issues/13] a while back but have yet to receive a response.
MXNet SSD [https://github.com/zhreshold/mxnet-ssd]: Port of SSD to MXNet (not compiled with NNPACK). SSD with ResNet-50: 88 seconds per image; SSD with Inception v3: 35 seconds per image.
Caffe-YOLO [https://github.com/yeahkun/caffe-yolo]: Running yolo_small in Caffe works at 24 seconds per frame; yolo_tiny works at 5 seconds per frame. This looks like the fastest of the ones I have tried, unless the Darknet-NNPACK issue can be solved.
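On the idea of retraining the fastest option for only 3 classes: in Darknet's YOLOv2/Tiny-YOLO cfg files this mostly comes down to setting `classes` in the `[region]` section and the `filters` value of the convolutional layer just before it. A minimal sketch of that arithmetic (the anchor count of 5 is the YOLOv2 default and is an assumption here):

```python
# Filter count for the conv layer preceding the [region] layer in a YOLOv2/Tiny-YOLO cfg.
num_classes = 3        # person, dog, bicycle
num_anchors = 5        # YOLOv2 default (assumption; check the "num" field in your cfg)
coords_plus_obj = 5    # x, y, w, h, objectness

filters = num_anchors * (num_classes + coords_plus_obj)
print(filters)         # 40 -> set "filters=40" and "classes=3" before retraining
```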
I managed to run MobileNet-SSD on the Raspberry Pi and got around 4-5 fps. The problem is that it consumes around 80-90% of the Pi's resources, which makes the camera's RTSP connection fail during a lot of activity, so you lose a lot of frames and get a ton of artifacts. So I purchased the NCS stick and plugged it into the Pi; now I can run at 4 fps and the Pi's resource usage is pretty low, around 30%.
The NCS with MobileNet-SSD takes about 0.80 seconds to process an image.
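For reference, a minimal sketch of the kind of CPU-only MobileNet-SSD loop described above, using OpenCV's DNN module with the Caffe MobileNet-SSD weights. The file names and the 300x300 / 0.007843 / 127.5 preprocessing are assumptions based on the commonly used public model, not details taken from the original answer:

```python
import cv2
import numpy as np

# Assumed file names for the widely used Caffe MobileNet-SSD model.
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

cap = cv2.VideoCapture(0)              # USB webcam or Pi camera via V4L2
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    # Standard preprocessing for this model: 300x300 input, scale 1/127.5, mean 127.5.
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()         # shape (1, 1, N, 7): id, label, conf, box
    for i in range(detections.shape[2]):
        confidence = detections[0, 0, i, 2]
        if confidence > 0.5:
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```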
One option is using the Movidius NCS; using the Raspberry Pi alone will only work if the models are much, much smaller.
Regarding the NCS implementation:
You should be able to make MobileNet-SSD run at ~8 fps. There are examples that work for simple use cases. I'm currently working on a similar object detector based on the Darknet reference model; it runs at ~15 fps with the NCS, but the model isn't available yet.
I'll open source it once it works well.
Here it is:
https://github.com/martinbel/yolo2NCS
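For context, the inference loop on the NCS usually has roughly the following shape. This is only a sketch assuming the original NCSDK v1 `mvnc` Python API; NCSDK v2 renamed parts of this API, and the graph file name and preprocessing are placeholders:

```python
import numpy as np
from mvnc import mvncapi as mvnc   # NCSDK v1 API (assumption; v2 uses different names)

devices = mvnc.EnumerateDevices()
device = mvnc.Device(devices[0])
device.OpenDevice()

# "graph" is a placeholder for a model compiled with the NCSDK's mvNCCompile tool.
with open("graph", "rb") as f:
    graph_blob = f.read()
graph = device.AllocateGraph(graph_blob)

def infer(image_fp32):
    # The NCS expects half-precision input tensors.
    graph.LoadTensor(image_fp32.astype(np.float16), "user object")
    output, _ = graph.GetResult()
    return output

# ... preprocess frames and call infer() in the capture loop ...

graph.DeallocateGraph()
device.CloseDevice()
```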
Related
I am new to the RAPIDS AI world and I decided to try cuML and cuDF out for the first time.
I am running Ubuntu 18.04 on WSL 2. My main OS is Windows 11. I have 64 GB of RAM and a laptop RTX 3060 6 GB GPU.
At the time I am writing this post, I am running a TSNE fit on a cuDF dataframe composed of approximately 26 thousand values stored in 7 columns (all the values are numerical or binary, since the categorical ones have been one-hot encoded).
While classifiers like LogisticRegression or SVM were really fast, TSNE seems to be taking a while to output results (it's been more than an hour now, and it is still going, even though the dataframe is not that big). The Task Manager is telling me that 100% of the GPU is being used for the calculations, even though running "nvidia-smi" in Windows PowerShell returns that only 1.94 GB out of a total of 6 GB are currently in use. This seems odd to me, since I have read papers reporting RAPIDS AI's TSNE algorithm being 20x faster than the standard scikit-learn one.
I wonder if there is a way of increasing the amount of dedicated GPU memory to perform faster computations, or if this is just an issue related to WSL 2 (perhaps it limits GPU usage to just 2 GB).
Any suggestion or thoughts?
Many thanks
The Task Manager is telling me that 100% of the GPU is being used for the calculations
I'm not sure the Windows Task Manager can tell you the GPU throughput that is actually being achieved for the computations.
"nvidia-smi" on the windows powershell, the command returns that only 1.94 GB out of a total of 6 GB are currently in use
Memory utilisation is a different metric from GPU throughput. Any GPU application will only use as much memory as it requests, and there is no correlation between higher memory usage and higher throughput, unless the application specifically documents a way to achieve higher throughput by using more memory (for example, a different algorithm for the same computation may use more memory).
TSNE seems to be taking a while to output results (it's been more than an hour now, and it is still going, even though the dataframe is not that big).
This definitely seems odd, and not the expected behavior for a small dataset. What version of cuML are you using, and what is your method argument for the fit task? Could you also open an issue at www.github.com/rapidsai/cuml/issues with a way to access your dataset so the issue can be reproduced?
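To illustrate the `method` argument being asked about, here is a minimal sketch of a cuML TSNE call on a cuDF dataframe. The data is a random placeholder standing in for the ~26k x 7 dataframe in the question, and the accepted `method` values depend on the installed cuML version:

```python
import numpy as np
import cudf
from cuml.manifold import TSNE

# Placeholder data roughly matching the question's shape (~26k rows, 7 numeric columns).
data = {f"f{i}": np.random.rand(26000).astype("float32") for i in range(7)}
df = cudf.DataFrame(data)

# 'method' selects the algorithm; 'barnes_hut' is commonly the fast choice,
# but check the options available in your cuML version.
tsne = TSNE(n_components=2, method="barnes_hut", perplexity=30)
embedding = tsne.fit_transform(df)
print(embedding.shape)   # (26000, 2)
```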
I want to ask a semi-theoretical question.
I'm using a Docker image which utilizes the nvidia-container-runtime to communicate with the GPU on my machine.
The purpose of the Docker image is to run a GUI application that presents images (photos) at a high rate (1 Hz - 10 Hz). However, as we noticed, there are some delays in the presentation rate compared to running the same application on the bare OS (without the overhead of a Docker container). Has anyone encountered this issue? Can it be resolved somehow? As a note, the display rate should be as exact as possible, meaning we can't allow delays of more than 10 ms.
I would like to connect 2 USB webcams to a Raspberry Pi and be able to get at least 1920 x 1080 frames at 10 fps using OpenCV. Has anyone done this and knows if it is possible? I am worried that the Pi has only one USB bus (USB 2.0) and might run into a USB bandwidth problem.
Currently I am using an Odroid, and it has a USB 2.0 and a USB 3.0 bus, so I can connect one camera to each without any problems.
What I have found in the past is that no matter what bandwidth options you select through OpenCV, the cameras try to take up as much bandwidth as they want.
This has led to multiple cameras on a single USB port being a no-no.
That being said, this will depend on your camera and is very likely worth testing; a quick test sketch follows below. I regularly use Microsoft HD-3000 cameras and they do not like working on the same port, even on my beefy i7 laptop. This is because the limitation is in the USB host bandwidth, not processing power.
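A quick way to test this on your own hardware is something like the sketch below: it asks both cameras for MJPG (compressed) frames, which typically needs far less USB bandwidth than raw YUYV. Whether the cameras honour these requests varies by model and driver, so treat it as a starting point:

```python
import cv2

def open_cam(index, width=1920, height=1080, fps=10):
    cap = cv2.VideoCapture(index)
    # Request MJPG so the camera sends compressed frames (much less USB bandwidth
    # than raw YUYV); whether this is honoured depends on the camera and driver.
    cap.set(cv2.CAP_PROP_FOURCC, cv2.VideoWriter_fourcc(*"MJPG"))
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
    cap.set(cv2.CAP_PROP_FPS, fps)
    return cap

cams = [open_cam(0), open_cam(1)]
for _ in range(100):                              # grab ~10 s of frames from both cameras
    for i, cap in enumerate(cams):
        ok, frame = cap.read()
        if not ok:
            print(f"camera {i}: dropped frame")   # a hint of USB bandwidth trouble
for cap in cams:
    cap.release()
```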
I have had a similar development process to yours in the past, though, and selected an Odroid XU4 because it has multiple USB host controllers for the cameras. It also means you have a metric tonne more processing power available and, more importantly, you can buy and use the on-board chip if you want to create a custom electronics design.
I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50 ms when there is only one parameter server and one worker. In order to process a huge number of samples, I am using distributed asynchronous SGD (ASGD) training. However, I found that increasing the worker count does not increase throughput: for example, 40 machines achieve 1.5 million samples per second, and after doubling the parameter server and worker machine counts, the cluster still only processes 1.5 million samples per second, or even worse. The reason is that each step takes much longer when the cluster is large. Does TensorFlow have good scalability, and is there any advice for speeding up training?
The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
A general example of doing the math: suppose you have 250M parameters and each backward pass takes 1 second. This means each worker will be sending 1 GB/sec of data and receiving 1 GB/sec of data. If you have 40 machines, that's 80 GB/sec of transfer between workers and the parameter servers. Suppose the parameter server machines only have 1 GB/sec full-duplex NIC cards. This means that if you have fewer than 40 parameter server shards, the NIC speed will be the bottleneck.
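As a sanity check, that back-of-the-envelope calculation can be written out directly. The 250M parameters, 1-second step and 1 GB/sec NICs are the illustrative numbers from the paragraph above, not measurements:

```python
# Back-of-the-envelope bandwidth estimate from the example above.
params = 250e6            # model parameters
bytes_per_param = 4       # float32 gradients / weights
step_time_s = 1.0         # one backward pass per second per worker
workers = 40
nic_gbps = 1.0            # full-duplex NIC bandwidth per PS shard, each direction

gb_per_step = params * bytes_per_param / 1e9       # ~1 GB sent and ~1 GB received
per_worker_gbps = gb_per_step / step_time_s        # per direction, per worker
total_gbps = workers * per_worker_gbps * 2         # both directions, whole cluster
print(total_gbps)                                  # ~80 GB/sec across the cluster

# Each PS shard can absorb ~1 GB/sec per direction, so to avoid a NIC bottleneck:
print(workers * per_worker_gbps / nic_gbps)        # need at least ~40 PS shards
```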
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards. Can your cluster handle 80GB/sec of data flowing between 80 machines? Google designs their own network hardware to handle their interconnect demands, so this is an important problem constraint.
Once you have checked that your network hardware can handle the load, I would check the software. I.e., suppose you have a single worker: how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps inefficient scheduling of threads or some such.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when you try to send messages larger than 100 MB. The issue was fixed in an upstream gRPC release but has not yet been integrated into a TensorFlow release, so the current workaround is to break messages into pieces of 100 MB or smaller.
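One way to keep individual tensors under that size with the TF1-era API is to partition large variables so each slice stays well below ~100 MB. A sketch under those assumptions (the shape and shard count are placeholders; whether this maps onto your model is for you to check):

```python
import tensorflow as tf   # TF1-style API (assumption about the version in use)

# Split a large weight matrix into 16 slices so each piece stays well under
# ~100 MB; the slices can also be spread across parameter server shards.
partitioner = tf.fixed_size_partitioner(num_shards=16)

with tf.variable_scope("model", partitioner=partitioner):
    # ~256M float32 params would be ~1 GB unpartitioned, ~64 MB per slice here.
    big_weights = tf.get_variable("big_weights",
                                  shape=[500000, 512],
                                  dtype=tf.float32)
```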
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
a benchmark of sending messages between workers (local)
a benchmark of a sharded parameter server (local)
I recently came across an adapter that would allow me to use laptop memory in my desktop. See the item below:
http://www.amazon.co.uk/Laptop-Desktop-Adapter-Connector-Converter/dp/B009N7XX4Q/ref=sr_1_1?ie=UTF8&qid=1382361582&sr=8-1&keywords=Laptop+to+desktop+memory
Both the desktop and the laptop use DDR3.
My question is: are these adapters reliable?
I have 8 GB available and I was wondering if they could be put to use in my gaming rig.
The desktop is an i7 machine generally used for gaming and some basic development.
Judging by how it looks, the adapter should be reliable. There is not much to it; it simply extends the small SO-DIMM (laptop) module to the larger DIMM (desktop) form factor. You can draw an analogy with A-B USB cables.
What you should also consider is whether both RAM modules use the same frequency, and possible heat issues, since you will have to cool the laptop memory more than if it were desktop-sized: more current flows through the smaller module compared to a desktop-sized RAM stick. Then again, you have the extension board to handle and disperse some of the heat, so unless you are doing really intensive RAM operations you should be fine. You should still check the working frequency of both: if the laptop module is faster than the maximum your computer supports, you won't get the faster performance and the module will run at the system bus frequency; if it is slower, the system bus will run at the module's frequency.
Use standard features of the module as a reference to calculate its width: measure them in the image and scale against a reference item, then check against your system. Use the contacts or the locking grooves (or the module length) for scaling, since they have standard dimensions on all modules.