I am training a YOLOv5 model for golf ball detection. I am facing serious issues with false detections, since a golf ball is white: the model detects similar white shapes as golf balls. Moreover, it only detects a golf ball in a few scenarios, for example at a specific distance. I have a large dataset annotated on Roboflow (roughly 60,000 annotated images of golf balls). I am using yolov5s.pt as I have to deploy the model on mobile devices. If anyone knows how to tackle this issue, please do mention it.
I have modified and improved my dataset several times to cover multiple scenarios, but this didn't work at all. mAP@0.5 for the ball class is 95+, yet the model still produces false positives.
It's hard to say without more information (which isn't provided in the question). Perhaps one natural approach is to gather more data: gather images that are representative of the scenarios where the model makes errors, and annotate them.
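One common way to act on that with YOLOv5 is to add the scenes where the model fires falsely as background (negative) images, i.e. training images that contain no golf ball and carry an empty label file. Below is a minimal sketch, assuming the standard Ultralytics dataset layout; the folder names are placeholders for your own export.

import shutil
from pathlib import Path

# Frames where the model produced false positives (hypothetical folder).
hard_negatives = Path("hard_negatives")
# Assumed YOLOv5-style dataset layout; adjust to your Roboflow export.
train_images = Path("dataset/images/train")
train_labels = Path("dataset/labels/train")

for img in hard_negatives.glob("*.jpg"):
    shutil.copy(img, train_images / img.name)
    # An empty label file marks the image as background (no golf balls present).
    (train_labels / img.with_suffix(".txt").name).touch()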
Every model will have false positives; it is unlikely that any model will ever be perfect. It is simply a matter of how frequently or rarely those errors happen, but you will probably never drive the error rate to zero.
I am creating a CNN to recognize a friend's speech. She uses unique vocalizations (not in any language) to communicate. To begin, I recorded 60 samples of three of these sounds (180 .wav samples total). After training the model, I was getting near perfect accuracy from both the test and validation data. I then recorded new sounds immediately after this training, and was getting about 50 percent accuracy, which showed some level of learning and generalizing, since random guesses on 3 classes should have been getting about 33% accuracy.
The next day I tried recording new audio again, and the model's predictions were as good as random. My guess as to the problem is that the model is sensitive to very small changes in environment. It showed some learning immediately after training because the environment would have been very similar. However, the following day, there were probably more substantial changes to environment (background noise, distance from microphone, sitting in different part of the room, etc.). Does this seem like a reasonable guess as to the cause of the problem? And if so, how can I make my model less sensitive to environment? Would adding white noise help? Are there ways to add background noise to my samples? Any help would be appreciated.
That is to be expected! 180 samples is not nearly enough to train a CNN on. A CNN contains thousands to millions of parameters, so you could quite possibly have more parameters to tune than bytes of data in your dataset!
Furthermore, your claim of getting perfect accuracy on the test set seems suspicious. I'd wager that you have accidentally used test data to train the model.
You could try "growing" your dataset by adding randomized noise to the sound files. I don't think that would help much though. The network would become resilient to the type of white noise you added, but probably not to the type of noise found in actual recordings. For example, in speech recognition noises people make when speaking like breathing, saying "uh" and "eh", or clearing their throats, can confuse the recognizer. It is very hard to synthetically add such noises.
Also, while two sounds may sound similar to human ears, their waveforms might be completely different. A song played in different keys sounds similar or even identical to human ears but has totally different waveforms. You get the same effect if you listen to someone talking indoors vs. outdoors vs. in a noisy bar. Even whether someone is standing or sitting can completely change the sound of their voice.
Bottom line: you need waaaay more data. I also recommend experimenting with RNNs and bidirectional RNNs. They are a better fit for temporal data like sound samples than CNNs. Generally they also require fewer parameters, so training is faster.
I'm looking for a way to detect humans in a picture. For instance, regarding the picture below, I'd like to coarsely determine how many people are in the scene. I must be able to detect both standing and sitting people. I do not mind not detecting people located behind a physical object (such as the glass in the bus picture).
AFAIK, such a problem can rather easily be solved by training deep neural networks. However, my coworkers would like me to also implement a detection technique based on general image processing techniques. I've spent several days looking for techniques designed by researchers but I couldn't find anything else than saliency-based techniques (which may be fine, but I'd like to test several techniques based on old-fashioned image processing).
I'd like to mention that I'm not new to the topic of image segmentation: I used to segment aortas in medical scans. However, that task was easier IMHO since scans share similar characteristics, whereas in this use case (human detection in a bus, for instance) the pictures will have very different characteristics (e.g. image contrast can vary strongly depending on whether the picture was taken during the day or at night).
Long story short, I'd like to know whether there is some segmentation technique for human detection that would be worth giving a shot, given that the image features vary a lot.
Is deep learning the only way to detect humans in a picture?
No. Is it the best way we know? Depends on your conditions.
The simplest way of doing detection is to generate lots of random bounding boxes and then solve a classification problem on each crop. Here is some pythonic pseudo-code:
def detect_people(image):
    """
    Find all people in an image.

    Parameters
    ----------
    image : image object

    Returns
    -------
    people : list of axis-aligned bounding boxes (aabb)
        Each bounding box contains a person.
    """
    people = []
    for aabb in generate_random_aabb(image):
        crop = crop_image(image, aabb)
        if is_person(crop):
            people.append(aabb)  # keep the box, not the crop, as documented above
    return people
In this case is_person can be any classifier, e.g. boosted decision stumps as used in the Viola–Jones object detection framework. Speaking of which: that would likely be the way to go without DL, but it is much more complicated to explain.
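As a concrete non-DL baseline, OpenCV ships a pre-trained HOG + linear SVM people detector. A rough sketch of using it (the image path is a placeholder):

import cv2

# Pre-trained HOG + linear SVM pedestrian detector bundled with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("bus.jpg")  # placeholder path
# Sliding-window detection over an image pyramid; returns axis-aligned boxes.
rects, weights = hog.detectMultiScale(image, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in rects:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)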
Object Detection vs Segmentation
Your question mixes both. Object detection gives you bounding boxes (coarse) for instances. Semantic segmentation labels all pixels by classes, but does not distinguish different instances of the same class (e.g. different people). Instance segmentation is like object detection, but is fine-grained and aims for pixel-exact results.
If you are interested in segmentation, I can recommend my paper: A Survey of Semantic Segmentation
I have been training a cascade classifier to detect a kind of plant. Here is a sample image of what I want to detect:
I sampled the little green plants for positives, and made negatives out of images with a similar background and no green plants (as suggested by many sources). I used many images similar to this one for sampling.
I did not have a lot of training data so of course I did not expect some idealistic classification results.
I have set the usual parameters (min_hit_rate 0.95, max_false_alarm 0.5, etc.). I have tried training with 5, 6, 7, 8, 9 and 10 stages. The strange thing is that during the training process I get a hit rate of 1 through all stages, and after 5 stages I get a good acceptance ratio of 0.004 (similar for the later stages 6, 7, 8, ...).
I tried testing my classifier on the same image I used for the training samples, and the behavior is very illogical:
the classifier detects almost everything EXCEPT the positive samples I took from it (the very samples that had a hit ratio of 1 during training).
the classifier is really, really slow: it took over an hour for a single input image (down-sampled, scale factor 1.1).
I do not get how the same samples could be classified as positives during training (through all the stages) and then NONE of them detected on the image (while there are a lot of false positives around them).
I checked everything a million times (I thought that I somehow mixed positives and negatives but I did not).
Can someone help me with this issue?
I can try and help but of course I can't train this thing for you unless you send me your images.
In my experience, if you aren't getting the desired results, you are simply giving traincascade the wrong images or not enough of them (positives, negatives, or both).
I did not get great results until I created an annotation file using the built-in opencv_annotation tool. Have you done that? How many positives?
Did your negatives contain the background that you are attempting to detect your object in? This is key and can't be overlooked.
Also, I would use LBP, it's much faster.
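For reference, a trained cascade is loaded and run from Python roughly like this (a sketch; cascade.xml and the image path are placeholders, and the same code works for Haar or LBP cascades):

import cv2

cascade = cv2.CascadeClassifier("cascade.xml")  # output of opencv_traincascade
image = cv2.imread("plants.jpg")                # placeholder test image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Larger scaleFactor and minNeighbors trade recall for speed and fewer false hits.
hits = cascade.detectMultiScale(gray, scaleFactor=1.2, minNeighbors=5,
                                minSize=(24, 24))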
If you or anyone is still stuck and have some positives created, send them to me and I'll see if I can train this thing.
Also, I have written what is hopefully a one-stop tutorial about this stuff after my experiences with it:
http://johnallen.github.io/opencv-object-detection-tutorial/
I've been tasked with carrying out a benchmark of an existing classifier for my company. The biggest problem currently is differentiating between different types of transportation, like recognizing whether I'm currently on a train, driving a car, or bicycling, so this is the main focus.
I've been reading a lot about LSTMs, http://en.wikipedia.org/wiki/Long_short_term_memory, and their recent success in handwriting and speech recognition, where the time between significant events can be pretty long.
So, my first thought about the train/bus problem is that there probably isn't such a clear and short cycle as there is when walking or running, for instance, so long-term memory is probably crucial.
Has anyone tried anything similar with decent results?
Or are there other techniques that could potentially solve this problem better?
I've worked on mode of transportation detection using smartphone accelerometers. The main result I've found is that almost any classifier will do; the key problem is then the set of features. (This is no different from many other machine learning problems.) My feature set ended up containing time-domain and frequency-domain values, both taken from time-series sliding-window segmentation.
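To illustrate, here is a sketch of the kind of per-window features I mean, computed with numpy; the window length and the exact feature list are just examples, not my actual feature set.

import numpy as np

def window_features(acc, win=256, step=128):
    """Time- and frequency-domain features from a 1-D acceleration magnitude signal."""
    feats = []
    for start in range(0, len(acc) - win + 1, step):
        w = acc[start:start + win]
        spectrum = np.abs(np.fft.rfft(w - w.mean()))
        feats.append([
            w.mean(), w.std(), w.min(), w.max(),  # time-domain statistics
            spectrum.argmax(),                    # dominant frequency bin
            spectrum.sum(),                       # total spectral energy
        ])
    return np.array(feats)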
Another problem is that the accelerometer can be placed anywhere. On the body, it can be anywhere and in any orientation. If the user is driving, is the phone in a pocket, in a bag, on a car seat, attached to a suction-cup window mount, etc.?
If you want to avoid these problems, use GPS instead of the accelerometer. You can make relatively accurate classifications with that sensor, but the cost is the battery usage.
I want to do pedestrian detection and tracking.
Input: Video Stream from CCTV camera.
Output:
Number of people going from left to right
Number of people going from right to left
Number of people in the middle
What have I done so far:
For pedestrian detection I am using HOG and SVM. The detection is decent, but with a high false positive rate. It is also very slow, as I am running on the Android platform.
Question:
After detection, how do I calculate the values listed above? Can anyone tell me which tracking algorithm I should use, and point me to a good algorithm for pedestrian detection?
Or should I use a tracking algorithm at all? Is there a way to do this without one?
Any references to code/blogs/technical papers are appreciated.
Platform: C++ & OpenCV / android.
--Thanks
This is somewhat close to a research problem.
You may want to have a look at this website, which gathers a lot of references.
In particular, the work done by the group from Oxford presented therein is pretty close to what you are doing, since they are using HOG for detection. (That work has been extremely illuminating for me.)
EPFL and Jülich have also done work in this field.
You may also want to take a look at this review, which describes several detection/tracking techniques, often involving variants of the HOG algorithm.
Along with @Acorbe's response, I suggest the publications section of this (archived) website.
A recent work at the end of last year also released a code base here:
https://bitbucket.org/rodrigob/doppia
There are also earlier pedestrian detection works that have released code:
https://sites.google.com/site/wujx2001/home/c4
http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
The most accurate way is to use a tracking algorithm, rather than purely statistical counting of appearances, i.e. counting detections that occur on the left, right, and in the middle.
You can use extended statistical models that estimate how many inputs produce each of the outputs and back-validate the inputs from the output detections.
My experience is that tracking leads to better results than the approach above, but it is also a bit more complicated. We are talking about multi-target tracking, where the critical part is matching each detection with the tracked model that should be updated based on it. If a detection is matched with the wrong model, that is where the problems start.
In the YouTube video linked below I built a multi-target tracker using a simple LBP people detector, with multiple models and a Kalman filter for tracking. Both capabilities are available in OpenCV. When something is detected you need to create a new Kalman filter for each object and update it whenever you match the same detection again; predict when the detection is missing in a frame, and remove the Kalman filter when it is no longer necessary to track the object.
1 Detect.
2 Match detections with Kalman filters, e.g. using the Hungarian algorithm and the L2 norm.
3 The hard part: decide whether a Kalman filter should be created, removed, or updated, or whether the object was not detected in this frame and its position should be predicted instead. There is a lot of work here (see the sketch below).
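A minimal sketch of that detect/match/update loop in Python, using OpenCV's Kalman filter and SciPy's Hungarian-algorithm implementation. Detections are assumed to be (x, y) centre points and all parameter values are illustrative.

import numpy as np
import cv2
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def make_kalman(x, y):
    """Constant-velocity Kalman filter for one tracked object."""
    kf = cv2.KalmanFilter(4, 2)  # state: x, y, vx, vy; measurement: x, y
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.errorCovPost = np.eye(4, dtype=np.float32)
    kf.statePost = np.array([[x], [y], [0], [0]], np.float32)
    return kf

def step(trackers, detections):
    """One frame: predict every tracker, match to detections, correct or spawn."""
    predictions = [kf.predict()[:2].ravel() for kf in trackers]
    matched = set()
    if predictions and detections:
        # Cost matrix of L2 distances between predicted and detected positions.
        cost = np.linalg.norm(np.array(predictions)[:, None, :] -
                              np.array(detections)[None, :, :], axis=2)
        rows, cols = linear_sum_assignment(cost)
        for r, c in zip(rows, cols):
            trackers[r].correct(np.array(detections[c], np.float32).reshape(2, 1))
            matched.add(c)
    # Unmatched detections start new trackers (removal logic is omitted here).
    for i, det in enumerate(detections):
        if i not in matched:
            trackers.append(make_kalman(*det))
    return trackers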
The pure statistical approach is less accurate; the second one is for experienced people and takes at least one month of coding and three months of tuning. If you need to be faster and your resources are quite limited, you can achieve your results with smart statistics over pure detections: much faster and only a little less accurate. People tend to trust human judgement of images and video, but even multi-target tracking is capable of beating a human. Try to count and register each person in a video and count the exit points yourself; beyond a certain number of people you will not be able to do it. It really depends on what you want, the application, the customer you have, and the results you show to customers. If the output is four numbers (incoming, left, right, middle) and your error is 20 percent, that is still much better than what one bored, poorly paid guard would achieve by counting all day long.
https://www.youtube.com/watch?v=d-RCKfVjFI4
On my blog you can find datasets for people detection and car detection, as well as scripts for learning, ideas, tutorials, and tracking examples:
Opencv blog tutorials code and ideas
You can use KLT for this purpose, as it will tell you the flow of a person travelling from left to right. You can then compute the direction from the length of the line, which in the given example is drawn using cv2.line; you can use the input parameters of this function to compute your case, with a little math involved. If there is a flow of pixels from left to right, this is case 1; from right to left, case 3; and for no flow, case 2. Or you can use this basic tutorial to track object movement. LINK
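As a rough sketch of the KLT idea with OpenCV (all parameter values are illustrative): track corner points between consecutive frames and look at the sign of their mean horizontal displacement.

import cv2

def horizontal_flow(prev_frame, next_frame):
    """Mean horizontal displacement of KLT-tracked points between two frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Pick corner points to track in the previous frame.
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                 qualityLevel=0.01, minDistance=7)
    if p0 is None:
        return 0.0
    # Track the points into the next frame with pyramidal Lucas-Kanade.
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, p0, None)
    good_new = p1[status.ravel() == 1]
    good_old = p0[status.ravel() == 1]
    dx = good_new[:, 0, 0] - good_old[:, 0, 0]
    # > 0 means flow from left to right, < 0 from right to left, ~0 no flow.
    return float(dx.mean()) if len(dx) else 0.0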