Training custom objects in YOLOv3: how does it work? - opencv

I have a project that needs to detect people in anime-style videos.
I just tested YOLOv3 608x608 with COCO on a GTX 1050 Ti,
but the speed is only about ~1.5 FPS, and I need at least 10 FPS on the 1050 Ti for my project.
1. Does the number of classes affect detection speed? (I assume COCO looks for about 80 kinds of objects in a picture; if I only need to find one kind of object, will it be 80x faster?)
2. My original training images are 1920*1080; should I resize them to 608x608 before labeling and training?
3. Is there a labeling tool I should use? In the README.md at https://github.com/AlexeyAB/darknet, <x> <y> <width> <height> seems to need to be calculated and entered by hand, which seems too hard. Maybe there is a tool where I just crop where the object is in the image?
4. If the object is not a square in the image, how does YOLO know which part is the object? How do I avoid training background as object?
Do I have to remove all background and fill it with black, keeping only the object in the image?
5. Is the output always a box? Can I train and get the output as a mask? If I detect a mask, will it be slower than a box, since it seems to carry more information?
6. To get a good result, how many training and test images should I make?
I know these are just noob questions in CV, but I really want to know this without spending weeks in training to find out the answers myself. An answer will be appreciated!

3.
https://en.wikipedia.org/wiki/List_of_manual_image_annotation_tools
You should be able to get corner coordinates as output from an image annotation tool and convert them to YOLO's format (see the sketch after this list).
4.
With enough training images with different backgrounds, the model should learn to ignore the background. A black background is still a background. I guess varying the background is a kind of data augmentation, so it might help reduce overfitting.
5.
If it does not support masks out of the box, you may want to do background subtraction as an extra post-processing step on the output.
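Darknet's label files expect, per object, a class id followed by the box center and size, all normalized to [0, 1] by the image width and height. A minimal sketch of converting corner coordinates from an annotation tool into that format (the function and variable names here are just illustrative, not from the darknet repo):

def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h):
    # Convert pixel corner coordinates to YOLO's normalized
    # <x_center> <y_center> <width> <height> format.
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return x_center, y_center, width, height

# e.g. a box from (400, 300) to (900, 800) in a 1920x1080 frame:
print(corners_to_yolo(400, 300, 900, 800, 1920, 1080))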

1) In my opinion, a GTX 1050 Ti is not enough to run YOLOv3, because the model (i.e. the number of layers) has become extremely large compared with the previous versions. The number of classes does not matter much in this case. If you want faster inference, you should upgrade your GPU, e.g. to a 1070 Ti.
2) Whatever the size of the input images, they will be resized to the pre-defined size specified in the cfg file, so you don't need to resize the input images yourself.

1) I think it may affect the speed a bit, because with fewer classes you get fewer convolutional filters before each YOLO layer (you set this up in the .cfg file; see the sketch after this list), but it's not likely to be an 80x speed-up.
2) Maybe? YOLO is going to resize them when training and testing anyway, so you could if you really want to, but high-resolution images usually work better in my experience.
3) I like OpenLabelling (you can just Google it; it's on GitHub).
4) You may want to give YOLO negative images that contain no objects at all, to prevent it from picking up on the background.
5) YOLO doesn't do masks.
6) About 1k images per class will probably work; you can get by with 500, but the rule of thumb is: the more, the better.
If you're interested, I've put out a whole series on YOLO on YouTube, so you may want to check it out: https://youtu.be/TP67icLSt1Y
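For reference, AlexeyAB's README describes updating the .cfg when you change the number of classes: each [yolo] layer's classes line and the filters of the convolutional layer immediately before it. A quick sketch of that arithmetic (the 3 is the number of anchor boxes per scale in YOLOv3):

def yolo_v3_filters(num_classes, anchors_per_scale=3):
    # filters for the conv layer right before each [yolo] layer:
    # (classes + 5) * anchors, where 5 = x, y, w, h, objectness
    return (num_classes + 5) * anchors_per_scale

print(yolo_v3_filters(80))  # 255, as in the stock COCO yolov3.cfg
print(yolo_v3_filters(1))   # 18 for a single-class detector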

Related

Image processing technique for image segmentation

I'm trying to create a model that segments various parts of an aerial image.
I'm using a dataset found on Kaggle: https://www.kaggle.com/datasets/bulentsiyah/semantic-drone-dataset
My question is about the right way to prepare images for semantic segmentation.
In this case, is it better to simply resize the images (e.g. 6000x4000 to 256x256 pixels), or is it better to resize them less and then create patches (e.g. 6000x4000 to 1024x1024 pixels and then 256x256 pixel patches)?
I think that resizing an image too much may cause loss of information, but at the same time patching cannot guarantee a full view of the image.
I also found a notebook that got 96% accuracy just by resizing, so I'm not sure how to proceed:
https://www.kaggle.com/code/yesa911/aerial-semantic-segmentation-96-acc/notebook
I think there is no single correct answer to this. Depending on the number and size of the areas you want to segment, it seems unlikely that you will get a proper/accurate segmentation at such small image sizes. However, if there are only large, easily detectable areas in the image, I would definitely go for the approach without patches, since the patch approach is considerably more complex, with more variables to consider (patch size, overlapping patches, edge treatment). It would save you a lot of implementation time on preprocessing and stitching afterwards.
TL;DR: I would start without patching and, if the result is sufficient, stop there. Otherwise, try the patching approach afterwards.
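If you do end up trying the patch approach, here is a minimal NumPy sketch of cutting an image into fixed-size tiles (the sizes mirror the numbers in the question; a real pipeline would also handle overlap, leftover edges, and the matching mask patches):

import numpy as np

def make_patches(image, patch_size=256):
    # Cut an H x W x C array into non-overlapping patch_size x patch_size tiles.
    # Edges that don't fill a whole patch are dropped for simplicity.
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

img = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a resized aerial image
print(len(make_patches(img)))  # 16 patches of 256x256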

How to resize (reshape) images in a CNN? Mathematical intuition behind resizing

I have been working with images for a few months for my internship, and recently I have been wondering whether there is a mathematical way of resizing images.
Resizing images becomes a fairly difficult task, because freshers like me often have little experience with image pre-processing.
My problem statement was gender classification using the human eye. However, I found it difficult because:
The images were 3-channel
The images were rectangular (17:11 aspect ratio)
I did try to resize the images by following a few blogs which said to start small and then go up. While that could have worked, I still did not understand how small. I resized them to 800x800 at random and got a Resource Exhausted error (I was using a GPU).
So I ask the community whether there is any mathematical formula or generalized way of doing the resizing task.
Thank you in advance.
This partially answers your question. Normally, many people use transfer learning and a pre-designed architecture for computer vision tasks. Since almost all of these architectures are designed for a square input shape, you can get better results by making your input images square. Another solution would be to pad your 17:11 images with zero values to make them square. (You need to test which one works best in your case, but the common practice is reshaping to square.)
It is fine to have 3-channel images; almost all architectures are designed for 3-channel input (even for black-and-white images it is suggested to repeat the single channel so the model gets a 3-channel input).
About resizing
In theory, you resize the image to match the model you are going to use. For example, LeNet-5 accepts MNIST images of size 28x28. Larger images generally give better model performance, but since your images are quite low resolution, you can start with 28x28 or 224x224 architectures and later move to bigger ones to see if it helps in your case.
About the error: it's expected; your model was larger than your GPU memory, so you got an out-of-memory error. You can use a smaller model (and a smaller input image size) on your device, or you need a device with more GPU memory.
Finally, you should consider the size of the architecture you are going to reuse to determine the correct resize for your dataset. If you are designing your own model, the best starting point can be something around 28x28 (basically using LeNet) and then growing it based on needs/performance.
Resizing can be as easy as calling a transform with PyTorch's torchvision transforms (i.e., you don't need to manually create a resized copy of the dataset):

import torchvision.transforms as T

transform = T.Compose([
    T.Resize((224, 224)),  # resize on the fly as each sample is loaded
])
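As a counterpart to the zero-padding idea mentioned above, here is a minimal sketch of padding a rectangular (e.g. 17:11) PIL image to a square with zeros using torchvision's functional API; the helper name is just illustrative:

import torchvision.transforms.functional as F

def pad_to_square(img, fill=0):
    # Zero-pad a PIL image so that width == height, keeping it centered.
    w, h = img.size  # PIL convention: (width, height)
    side = max(w, h)
    pad_left = (side - w) // 2
    pad_top = (side - h) // 2
    pad_right = side - w - pad_left
    pad_bottom = side - h - pad_top
    return F.pad(img, [pad_left, pad_top, pad_right, pad_bottom], fill=fill)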

find mosquitos' heads in the image

I have images of mosquitos similar to these ones, and I would like to automatically circle the head of each mosquito in the images. They are obviously in different orientations, and there is a random number of them in different images. Some error is fine. Any ideas for algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to create your training set. For this you would extract small images with examples of what is a mosquito head and what is not.
Then you can use those images to train a classification algorithm. Be careful to keep the training set balanced, since data skewed towards one class will hurt the algorithm's performance. Since images are 2D and such algorithms usually take 1D arrays as input, you will need to flatten your images to that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could do the trick too. If you decide to use support vector machines, I strongly recommend libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. They also have an easy-to-follow guide targeted at beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, you should be able to achieve tolerance to orientation. If you don't have enough data, you could create more training rows with some samples rotated, to get a more representative training set.
For prediction, given an image, cut it using a grid where each cell has the same dimensions as the ones used in your training set. Then pass each of these cells to the classifier and mark the squares where the classifier gives a positive output. If you really need circles, take the center of each positive square; the radius would be half the square's side length (sorry for stating the obvious).
After you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others), since we have not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos at the same scale, we might still miss some of them simply because they didn't fit our grid perfectly. To address this, repeat the procedure (grid cut and predict) while rescaling the image to different sizes. How many sizes? Well, you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" you use, so that is also something I would recommend experimenting with.
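A minimal sketch of this grid-and-classify idea, using scikit-learn's SVC rather than libsvm's own bindings (the window size, stride, and training data below are placeholders):

import numpy as np
from sklearn.svm import SVC

WIN = 32  # window side in pixels, chosen arbitrarily for this sketch

# X_train: flattened WIN x WIN patches (row-major order), y_train: 1 = head, 0 = not head
X_train = np.random.rand(200, WIN * WIN)      # placeholder data
y_train = np.random.randint(0, 2, size=200)   # placeholder labels
clf = SVC(kernel="rbf").fit(X_train, y_train)

def detect(gray_image, step=WIN):
    # Slide a WIN x WIN window over the image and return centers of positive cells.
    hits = []
    h, w = gray_image.shape
    for y in range(0, h - WIN + 1, step):
        for x in range(0, w - WIN + 1, step):
            patch = gray_image[y:y + WIN, x:x + WIN].reshape(1, -1)
            if clf.predict(patch)[0] == 1:
                hits.append((x + WIN // 2, y + WIN // 2))  # circle center; radius is WIN // 2
    return hits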
Here is some research that may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitos before taking the picture, and use a white, unmarked background, perhaps even something illuminated from below. This will make separating the mosquitos much easier.
Then threshold the image. For example, here I made a quick attempt by taking the red channel, subtracting the blue channel times 5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be mosquitos via connected component labeling. If a blob is large enough to be two mosquitos, cut it out and apply some more dilation/erosion to it.
Once you have a single blob like this,
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is thickest.
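A rough OpenCV sketch of that pipeline (the red-minus-5x-blue trick and the threshold of 80 come from the answer; the file path, kernel size, and area limits are placeholders to tune):

import cv2
import numpy as np

img = cv2.imread("mosquitos.jpg")                      # BGR image, placeholder path
b = img[:, :, 0].astype(np.int16)
r = img[:, :, 2].astype(np.int16)
score = np.clip(r - 5 * b, 0, 255).astype(np.uint8)    # red channel minus 5x blue
_, mask = cv2.threshold(score, 80, 255, cv2.THRESH_BINARY)

kernel = np.ones((5, 5), np.uint8)                     # placeholder structuring element
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # erosion then dilation removes thin legs

n, labels, stats, _ = cv2.connectedComponentsWithStats(mask)
for i in range(1, n):                                  # label 0 is the background
    if 500 < stats[i, cv2.CC_STAT_AREA] < 50000:       # placeholder size range for one mosquito
        ys, xs = np.nonzero(labels == i)
        pts = np.column_stack([xs, ys]).astype(np.float64)
        mean, eigvecs = cv2.PCACompute(pts, np.empty((0)))
        print(i, mean, eigvecs[0])                     # first eigenvector = body direction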

Using flipped images for machine learning dataset

I've got a binary classification problem. I'm trying to train a neural network to recognize objects in images. Currently I have about 1500 50x50 images.
The question is whether extending my current training set with the same images flipped horizontally is a good idea or not. (The images are not symmetric.)
Thanks
I think you can take this much further: not just flipping the images horizontally, but also rotating the image in 1-degree steps. This will give you 360 samples for every instance in your training set. Depending on how fast your algorithm is, this may be a pretty good way to ensure that the algorithm isn't only trained to recognize images and their mirrors.
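A small sketch of that kind of augmentation with torchvision, applied on the fly instead of materializing 360 rotated copies (the exact rotation range is the knob to tune; the values here are arbitrary):

import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),   # mirror half of the samples
    T.RandomRotation(degrees=180),   # random angle drawn from [-180, 180]
    T.ToTensor(),
])
# pass `augment` as the transform of your Dataset so each epoch sees new variants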
It's possible that it's a good idea, but then again, I don't know the goal or the domain of the image recognition. Let's say the images contain characters and you're asking the image recognition software to determine whether an image contains a forward slash / or a backslash \; then flipping the image will make your training data useless. If your domain doesn't suffer from such issues, then I'd say it's a good idea to flip them and even rotate them by varying degrees.
I have used flipped images with AdaBoost with great success in this course: http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/Schedule.php
(from the zip "TrainingImages.tar.gz").
I know there is some information on the pros/cons of using flipped images somewhere in the slides (on the homepage), but I can't find it. Another great resource is http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/DownloadMaterial/FaceLab/Manual.pdf (together with the slides), which goes through things like finding objects at different scales and orientations.
If the image patches are not symmetric, I don't think it's a good idea to flip them. A better idea is to apply similarity transforms to the training set, within some limits. Another way to increase the dataset is to add Gaussian-smoothed templates to it. Make sure that the numbers of positive and negative samples stay proportional; too many positives and too few negatives might skew the classifier and give bad performance on the test set.
It depends on what your NN is based on. If you are extracting rotation-invariant features, or features that do not depend on the spatial position within the image (like histograms), and training your NN on those features, then rotating will not be a good idea.
If you are training directly on pixel values, then it might be a good idea.
Some more details might be useful.

How to recognize the same image at different sizes?

We as humans can recognize these two images as the same image:
For a computer, it is easy to recognize the two images if they are the same size, so we would need a preprocessing step before recognition, like scaling; but if we look more closely at the scaling process, we see that it is not an efficient way.
Now, could you help me find a way to convert images into representations that don't depend on size or pixel location, to use as input for a recognition method?
Thanks in advance.
I have several ideas:
Give the image several color thresholds. This way you get large areas of the same color. The shapes of those areas can be traced with curves, which are mathematical objects. If you do this for both the larger and the smaller image, you can check whether the curves match.
Try to define key spots in the image. I don't know exactly how this works, but you can look up face detection algorithms. In such an algorithm there is a mathematical description of how a face should look. If you define enough objects in such algorithms, you can define multiple objects in the images and see whether they match at the same spots.
You could also check whether the Predator algorithm accepts images of multiple sizes. If so, your problem is solved.
It looks like you assume that the human brain recognizes images in a computationally efficient way, which is not really true. That algorithm is so complicated that we have not found it yet, and a large part of your brain is devoted to dealing with visual data.
When it comes to software, there are some scale- (or affine-) invariant algorithms. One such algorithm is the LeNet-5 neural network.
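As a concrete, though different, illustration of scale-invariant matching (using OpenCV's ORB keypoints rather than the LeNet-5 mentioned above), you could compare local descriptors, which are largely insensitive to image size; the file paths are placeholders:

import cv2

img_small = cv2.imread("photo_small.jpg", cv2.IMREAD_GRAYSCALE)
img_large = cv2.imread("photo_large.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img_small, None)
kp2, des2 = orb.detectAndCompute(img_large, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = matcher.match(des1, des2)
print(len(matches), "descriptor matches")  # many good matches suggest the same image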
