How to reduce object detection time on a model trained over ResNet?

I have a custom object detection model trained over ResNet50 using PyTorch. It takes ~300ms to return a detection result after passing an image through the model.
Is there any way to reduce the detection time?

300ms with ResNet50 is not so bad. It looks like you managed to run it on a GPU. When I looked at it the last time, it did not work on GPU on iOS or even Linux.
Three things to check and experiment with:
Image resolution.
Anchor allocation.
Number of layers. For some tasks, resnet18 performs only slightly worse than resnet50.
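To see why image resolution is worth checking first: the cost of a convolutional layer grows with the number of output positions, so halving each side of the input cuts the multiply-accumulate count to roughly a quarter. Below is a minimal sketch for a single 3x3 convolution with "same" padding. The helper name conv_macs, the channel counts and the 800x800 / 400x400 sizes are illustrative assumptions, not values from the question.

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3, stride=1):
    """Multiply-accumulate count of one conv layer with 'same' padding."""
    out_h = -(-height // stride)  # ceiling division
    out_w = -(-width // stride)
    return out_h * out_w * in_channels * out_channels * kernel * kernel


# Hypothetical layer: 64 -> 64 channels at two input resolutions.
full = conv_macs(800, 800, 64, 64)
half = conv_macs(400, 400, 64, 64)
print(f"full: {full:,} MACs, half: {half:,} MACs, ratio: {full / half:.1f}x")
```

Whether accuracy survives the smaller input depends on the size of the objects you detect, so this only tells you what you gain, not what you lose.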

Related

Can I retrain Inception's final layer using depth images from Kinect?

I would like to know whether I can use a data set of signs made with a Kinect to retrain Inception's final layer, as described in the TensorFlow tutorial that uses ordinary RGB images. I am new to this field. Opinions are much appreciated.
The short answer is: no, you cannot fine-tune only the last layer, but you can fine-tune the whole pre-trained network. The first layers of the pre-trained network are looking for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with some contrast stretching and data whitening ((x-mu)/sigma). This will take a very long time for the "ivy league" networks like Inception and ResNet. Also, keep in mind that most Python-based deep learning frameworks rely on PIL image loaders, which by default assume images have 8-bit channels mapped to the range [0, 1]. These loaders cast all 16-bit pixels to ones.
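The contrast stretching and whitening step mentioned above can be sketched without any image library. This is only a minimal sketch on a flat list of raw 16-bit depth values. The function names and the sample readings are hypothetical, and a real pipeline would more likely operate on NumPy arrays.

```python
from statistics import fmean, pstdev


def contrast_stretch(depth, lo=None, hi=None):
    """Linearly rescale raw depth values to the range [0, 1]."""
    lo = min(depth) if lo is None else lo
    hi = max(depth) if hi is None else hi
    span = (hi - lo) or 1  # avoid division by zero on a constant frame
    return [(d - lo) / span for d in depth]


def whiten(values):
    """Standardise values to zero mean and unit variance: (x - mu) / sigma."""
    mu = fmean(values)
    sigma = pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]


raw_depth = [500, 1200, 1800, 4000]  # hypothetical 16-bit depth readings
stretched = contrast_stretch(raw_depth)
whitened = whiten(stretched)
print(stretched)
print(whitened)
```

Doing the rescaling yourself on the raw 16-bit values also sidesteps the 8-bit assumption of the default image loaders.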
If the recognised/tracked objects (hands) are masked, which means your background is set to a single value or barely has any gradient in it, the network will overfit on the silhouette of the object because this is where the strongest edges are. The solution is to colorise the depth image using normal maps, HSA, HSV or JET colour coding to convert it into a 3x8-bit channel image. This makes the training converge much faster, and in my latest experiments we found that you can fine-tune the "ivy league" networks on the colorised depth.
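As an illustration of the colorising idea, here is a minimal sketch that maps a single depth value to an 8-bit RGB triple through the hue channel of HSV, using only the standard library. The function name depth_to_rgb, the 500 to 4000 depth range and the choice to use two thirds of the hue circle are my assumptions, not part of the answer above.

```python
import colorsys


def depth_to_rgb(depth, d_min, d_max):
    """Map one depth value to an 8-bit (R, G, B) triple, using hue as the colour code."""
    # Normalise to [0, 1] and clamp out-of-range readings.
    t = min(max((depth - d_min) / (d_max - d_min), 0.0), 1.0)
    # Use only 2/3 of the hue circle (red to blue) so near and far do not share a colour.
    r, g, b = colorsys.hsv_to_rgb(t * 2.0 / 3.0, 1.0, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))


near = depth_to_rgb(500, 500, 4000)   # closest reading in the assumed range
far = depth_to_rgb(4000, 500, 4000)   # farthest reading in the assumed range
print(near, far)
```

Applied to every pixel, this gives a three-channel 8-bit image that a network pre-trained on RGB can take as input.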
Since you are new to this field, I suggest reading about transfer learning and its three types, then applying whichever form suits your data set. If your data set is very similar to the data the pre-trained model was built on, you can retrain only the last layers. If your data is not similar, you have to fine-tune the existing model before using it.
The deeper the layer in a neural network, the more data-specific the extracted features become, so you have to take care of those specific layers if your data set is not very similar to the data set of the pre-built model. The early layers contain more generic features.

Shape detection - TensorFlow

I'm trying to train a model to detect basic shapes like circle, square, rectangle, etc. using TensorFlow. What would be the best input data set: the shape images themselves, or edge images extracted with OpenCV?
We can detect shapes using OpenCV too. What would be the added advantage of using machine learning?
Sample images are given for training the model.
I would recommend starting with this guide for doing classification, not object detection:
https://kiosk-dot-codelabs-site.appspot.com/codelabs/tensorflow-for-poets/#0
Classification assigns one tag to one picture (99% square, 1% circle). Object detection classifies several objects within the picture and locates each one (x_min=3, y_min=8, x_max=20, y_max=30, 99% square). Your case looks more like a classification problem.
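To make the difference concrete, here is a small sketch of what the two kinds of output look like as data. The class names Classification and Detection are hypothetical and only reuse the example numbers from the paragraph above. They are not the output types of any TensorFlow API.

```python
from dataclasses import dataclass


@dataclass
class Classification:
    """One label distribution for the whole picture."""
    scores: dict  # label -> probability


@dataclass
class Detection:
    """One labelled box; a detector returns a list of these per picture."""
    label: str
    score: float
    x_min: int
    y_min: int
    x_max: int
    y_max: int


image_result = Classification(scores={"square": 0.99, "circle": 0.01})
detections = [Detection("square", 0.99, x_min=3, y_min=8, x_max=20, y_max=30)]

best_label = max(image_result.scores, key=image_result.scores.get)
print(best_label, detections[0])
```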
You don't need the full Docker installation as in the guide.
If you have Python 3.6 on your system, you can just do:
pip install tensorflow
And then jump to "4. Retrieving the images"
I had to try it out myself, so I downloaded the first 100 pictures of squares and circles from Google with the "fatkun batch download image" add-on from the Chrome Web Store.
In my first 10 tests I got accuracy between 92.0% (0.992..) and 99.58%. If your examples are more uniform than a mixed bag of pictures from Google, you will probably get better results.
You may want to check out object detection in TensorFlow.
https://github.com/tensorflow/models/tree/master/research/object_detection
There is a pre-trained model here
http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_11_06_2017.tar.gz
One potential advantage of using neural nets for the detection is that they can reduce the CPU cycles needed. This is useful on mobile devices.
For example, the Hough transform (https://en.wikipedia.org/wiki/Hough_transform) is too expensive to calculate, but if a convolutional neural net is used instead, more possibilities open up for real-time image processing.
To actually train a new model, see https://www.tensorflow.org/tutorials/deep_cnn

Semantic segmentation for large images

I am working with a limited number of large images, each of which can have 3072*3072 pixels. To train a semantic segmentation model using FCN or U-Net, I construct a large training set in which each training image is 128*128.
In the prediction stage, I cut a large image into small pieces of 128*128, the same size as the training set, feed these pieces into the trained model, and get the predicted masks. Afterwards, I stitch these small patches together to get the mask for the whole image. Is this the right way to perform semantic segmentation on large images?
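The cut-and-stitch procedure described in the question can be sketched in plain Python on nested lists. The helper names tile_boxes, crop and stitch are hypothetical, the tiles do not overlap, and the model call is left out: the round trip below only checks that cutting and stitching are consistent.

```python
def tile_boxes(height, width, tile=128):
    """Yield (top, left, bottom, right) boxes that cover the image without overlap."""
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            yield top, left, min(top + tile, height), min(left + tile, width)


def crop(image, box):
    """Cut one tile out of an image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def stitch(tiles, height, width):
    """Paste (box, patch) pairs back into a full-size mask."""
    mask = [[0] * width for _ in range(height)]
    for (top, left, bottom, right), patch in tiles:
        for r in range(top, bottom):
            mask[r][left:right] = patch[r - top][: right - left]
    return mask


# A 3072x3072 image cut into 128x128 tiles gives 24 * 24 tiles.
boxes = list(tile_boxes(3072, 3072))
print(len(boxes))

# Round trip on a tiny 4x6 "image" with 2x2 tiles; a real pipeline would
# replace crop(...) with the model's predicted mask for that tile.
small = [[r * 6 + c for c in range(6)] for r in range(4)]
rebuilt = stitch([(b, crop(small, b)) for b in tile_boxes(4, 6, tile=2)], 4, 6)
print(rebuilt == small)
```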
Your solution is often used for this kind of problem. However, I would argue that it depends on the data if it truly makes sense. Let me give you two examples you can still find on kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics about what you're trying to solve, I can only give a general recommendation: try to keep the input images as large as your hardware allows. In many situations I would rather downsample the original images than break them into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extent that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advise feeding the big image (3072x3072) directly into Caffe.
A batch of small images will fit better into memory, and parallel processing will also come into play.
Data Augmentation will also be feasible.
Output for big Image:
As for the output for the big image, you are better off resizing the input of the FCN to 3072x3072 during the test phase, because the layers of an FCN can accept inputs of any size.
Then you will get 3072x3072 segmented image as output.
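The reason an FCN accepts any input size is that convolution and pooling layers only fix the kernel, stride and padding, not the image size. The sketch below applies the usual output-size formula to a hypothetical encoder of three 3x3 convolutions (padding 1), each followed by 2x2 max pooling. The layer stack is my assumption for illustration, not the network from the question.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_out(size):
    """Hypothetical encoder: three blocks of 3x3 conv (padding 1) + 2x2 max pooling."""
    for _ in range(3):
        size = conv_out(size, kernel=3, padding=1)  # keeps the size
        size = conv_out(size, kernel=2, stride=2)   # halves the size
    return size


for side in (128, 3072):
    out = encoder_out(side)
    print(f"{side}x{side} input -> {out}x{out} feature map")

# The weights are the same in both cases; only the activations grow.
print((3072 * 3072) // (128 * 128), "times more positions per layer")
```

The same weights run at both sizes, but every layer has to hold 576 times more activations for the full image, which is where memory becomes the limit.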

Is this image too complex for a shallow NN classifier?

I am trying to classify a series of images like this one, with each class comprising images taken from a similar cellular structure:
I've built a simple network in Keras to do this, structured as:
1000 - 10
The network unaltered achieves very high (>90%) accuracy on MNIST classification, but almost never higher than 5% on these types of images. Is this because they are too complex? My next approach would be to try stacked deep autoencoders.
Seriously, I don't expect any non-convolutional model to work well on this type of data.
A non-convolutional net works well for MNIST because the data is well preprocessed (each digit is centered and resized to a fixed size). Your images are not.
You may notice in your pictures that certain motifs recur, like the darker dots, at different positions and sizes. If you don't use a convolutional model you will not capture that efficiently (e.g. you will have to recognize a dark dot moved a little bit in the image as a completely different object).
Because of this, I think you should try a convolutional MNIST model instead of the classic one, or simply design your own.
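The point about a moved dot can be shown on a toy example. In this sketch (all names and the 6x6 image size are made up for illustration), a fully connected unit whose weights match the dot in one position gives no response once the dot moves, while a small filter slid over the image responds equally strongly in both cases.

```python
def make_image(size, top, left, dot=2):
    """Blank size x size image with a dot x dot square of ones at (top, left)."""
    return [[1 if top <= r < top + dot and left <= c < left + dot else 0
             for c in range(size)] for r in range(size)]


def dense_response(image, weights):
    """One fully connected unit: a dot product over the flattened image."""
    return sum(p * w for row, wrow in zip(image, weights) for p, w in zip(row, wrow))


def conv_max_response(image, kernel):
    """Slide the kernel over the image and keep the strongest response."""
    k, n = len(kernel), len(image)
    return max(
        sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
        for r in range(n - k + 1) for c in range(n - k + 1)
    )


dot_a = make_image(6, 0, 0)
dot_b = make_image(6, 3, 3)  # same motif, moved
kernel = [[1, 1], [1, 1]]

dense_a = dense_response(dot_a, dot_a)  # weights tuned to the first position
dense_b = dense_response(dot_b, dot_a)  # same weights, moved dot
conv_a = conv_max_response(dot_a, kernel)
conv_b = conv_max_response(dot_b, kernel)
print(dense_a, dense_b, conv_a, conv_b)
```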
First question: if you run the training longer, do you get better accuracy? You may not have trained long enough.
Also, what is the accuracy on the training data and what is the accuracy on the testing data? If they are both high, you can run longer or use a more complex model. If training accuracy is better than testing accuracy, you are essentially at the limits of your data (i.e. brute-force scaling of model size won't help, but clever improvements such as convolutional nets might).
Finally, with complex and noisy data you may need a lot of data to make a reasonable classification, so you need many, many images.
Deep stacked autoencoders, as I understand them, are an unsupervised method and are not directly suitable for classification.

Why use sliding windows with convolutional neural nets in object detection?

I read that CNNs (with both convolution and max-pooling layers) are shift-invariant, but most object detection methods use a sliding window detector with non-maximum suppression. Is it necessary to use sliding windows with CNNs when doing object detection?
Basically, instead of training the network on small 50x50 patches of images containing the desired object, why not train on entire images where the object is present somewhere? All I can think of is practical/performance reasons (doing forward pass on smaller patches instead of whole images), but is there also a theoretical explanation I'm overlooking?
Internally, a CNN is doing a sliding window. Convolution on a 2D image is nothing more than a linear filter applied in a sliding window manner. This is simply a nice mathematical expression of the very same operation, which helps us do neat optimization. Max pooling, on the other hand, helps us to be robust to small shifts and noise. So, effectively, feeding an image to the network means using (many!) sliding windows on it. Can we pass big images instead of small ones? Sure, but you will get extremely big tensors (just compute how many numbers you will need, this is huge), and you will get a really complex optimization problem. Nowadays we optimize in a millions-dimensional space. Working with whole images might lead to billions of dimensions (or even more). Optimization complexity grows exponentially with the dimension, so you will end up with an extremely slow method (not in terms of the computation itself, but of convergence).
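To make the "convolution is a sliding window" point concrete, here is a minimal sketch that writes the operation as explicit loops and follows it with max pooling. The function names and the 5x5 toy images are assumptions for illustration. What deep learning libraries call convolution is the cross-correlation shown here.

```python
def conv2d_valid(image, kernel):
    """Cross-correlation written as an explicit sliding window (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


def single_pixel(size, row, col):
    """Blank image with one bright pixel."""
    return [[1 if (r, c) == (row, col) else 0 for c in range(size)] for r in range(size)]


kernel = [[1, 1], [1, 1]]
conv_a = conv2d_valid(single_pixel(5, 1, 1), kernel)
conv_b = conv2d_valid(single_pixel(5, 0, 1), kernel)  # pixel moved up by one
pool_a = max_pool(conv_a)
pool_b = max_pool(conv_b)
print(conv_a != conv_b, pool_a == pool_b)
```

In this toy case the two feature maps differ while the pooled outputs agree. That only holds while the shift stays inside one pooling window, so pooling gives robustness to small shifts rather than full shift invariance.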
