This is a general question. I tried detecting objects in pictures downloaded from Roboflow datasets with YOLOv7. These were quite simple objects on a solid background (sun in the sky, balloons in the sky, rabbits on grass, etc.). Even though I had a good amount of pictures (500 to 1000) and a good number of epochs, the results were not good. Many objects were not detected, and those that were detected had poorly fitted bounding boxes that either contained too much of the picture or were too small and did not cover the whole object. So I have some general questions about improving YOLOv7 results:
How many pictures are generally needed to get good results for one object class?
How many epochs are needed?
Is there a way to improve the backbone, i.e. the neural network behind YOLOv7?
Is there a way to improve the hyperparameters ("learning parameters")?
I have tried googling this and experimenting with the arguments for train.py. It would be nice if you could help me out.
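For reference, this is roughly the kind of call I have been experimenting with (paths and values are placeholders for my own dataset, and flag names may differ slightly between yolov7 versions):

    python train.py --workers 8 --device 0 --batch-size 16 --epochs 300 \
        --data data/my_dataset.yaml --img 640 640 \
        --cfg cfg/training/yolov7.yaml --weights yolov7_training.pt \
        --hyp data/hyp.scratch.custom.yaml --name my_run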
I am trying to perform DBSCAN on 18 million data points, so far just 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million with ELKI and that took an hour. I have used Spark before but unfortunately it does not have DBSCAN available.
Therefore, my first question is if anyone can recommend a way of running DBSCAN on this much data, likely in a distributed way?
Next, the nature of my data is that ~85% of it lies in one huge cluster (anomaly detection). The only technique I have been able to come up with to allow me to process more data is to replace a big chunk of that huge cluster with one data point in such a way that it can still reach all its neighbours (the deleted chunk is smaller than epsilon).
Can anyone provide any tips on whether I'm doing this right, or whether there is a better way to reduce the complexity of DBSCAN when you know that most of the data is in one cluster centered around (0.0,0.0)?
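To make the idea concrete, this is roughly the reduction I have in mind (plain numpy sketch; the eps/2 radius is just my current choice, not something I have validated):

    import numpy as np

    def collapse_center(X, eps, center=(0.0, 0.0)):
        """X: (n_samples, 2) data with a huge cluster around `center`;
        eps: the epsilon I intend to use for DBSCAN."""
        center = np.asarray(center, dtype=float)
        dist = np.linalg.norm(X - center, axis=1)
        # Drop everything closer than eps/2 to the centre and keep one representative.
        # Anything that was within eps/2 of a dropped point is still within eps of the
        # representative; longer links may be lost, so connectivity is only approximate,
        # and local densities (minPts counts) near the boundary change as well.
        keep = dist >= eps / 2.0
        return np.vstack([center[None, :], X[keep]])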
Have you added an index to ELKI, and tried the parallel version? Except for the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.
DBSCAN is not a good approach for anomaly detection - noise is not the same as anomalies. I'd rather use a density-based anomaly detection method. There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.
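For illustration, one density-based method (certainly not the only option) is the Local Outlier Factor; a minimal scikit-learn sketch, assuming you only care about a top fraction of outliers:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    X = np.random.randn(10000, 2)  # stand-in for your data

    # contamination ~ the fraction you expect to be anomalous (e.g. the top 10%)
    lof = LocalOutlierFactor(n_neighbors=50, contamination=0.1)
    labels = lof.fit_predict(X)              # -1 = outlier, 1 = inlier
    scores = -lof.negative_outlier_factor_   # larger = more anomalous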
If you already know that most of your data is in one huge cluster, why don't you directly model that big cluster and remove it / replace it with a smaller approximation?
Subsample. There usually is next to no benefit to using the entire data. Even (or in particular) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data into, e.g., 32 subsets, then clustering each of these subsets, and joining the results back together. These 32 parts can be trivially processed in parallel on separate cores or computers; and because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32 = 1024.
This in particular holds for DBSCAN: larger data usually means you also want to use much larger minPts. But then the results will not differ much from a subsample with smaller minPts.
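A rough sketch of the split-and-cluster idea, using scikit-learn's DBSCAN (parameter values are placeholders; merging matching clusters across parts is left out):

    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_in_parts(X, eps, min_samples, n_parts=32, seed=0):
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))
        labels = np.full(len(X), -1)          # -1 = noise, as in sklearn
        offset = 0
        for part in np.array_split(order, n_parts):
            # each part could run on a separate core or machine
            part_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[part])
            clustered = part_labels >= 0
            labels[part[clustered]] = part_labels[clustered] + offset
            if clustered.any():
                offset += part_labels.max() + 1
        return labels   # clusters found in different parts still carry distinct labels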
But by all means: before scaling to larger data, make sure your approach solves your problem, and is the smartest way of solving it. Clustering for anomaly detection is like trying to smash a screw into the wall with a hammer. It works, but maybe using a nail instead of a screw is the better approach.
Even if you have "big" data and are proud of doing "big data", always begin with a subsample. Unless you can show that the result quality increases with data set size, don't bother scaling up; the overhead is too high unless you can prove the value.
I was playing with a Keras GAN from Jeff Heaton's website.
The saying goes that the more data we have, the better the results we should get, and I wanted to test this hypothesis. I also wanted to know whether the GAN might just copy a sample from the training data.
That's why I created images with numbers ranging from 1 to 20000 (a rough generation sketch follows the list):
128px x 128px
Numbers are centered
Used the same colours for all (dark blue & yellow)
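Images like these can be generated with something along these lines (PIL sketch; it needs a reasonably recent Pillow for textbbox, and the font and exact colours are placeholders, not necessarily what I actually used):

    import os
    from PIL import Image, ImageDraw, ImageFont

    FONT = ImageFont.truetype("DejaVuSans-Bold.ttf", 48)  # placeholder font

    def make_image(number, size=128, bg=(20, 30, 90), fg=(250, 210, 40)):
        img = Image.new("RGB", (size, size), bg)
        draw = ImageDraw.Draw(img)
        text = str(number)
        left, top, right, bottom = draw.textbbox((0, 0), text, font=FONT)
        w, h = right - left, bottom - top
        # centre the number on the canvas
        draw.text(((size - w) / 2 - left, (size - h) / 2 - top), text,
                  font=FONT, fill=fg)
        return img

    os.makedirs("numbers", exist_ok=True)
    for n in range(1, 20001):
        make_image(n).save(f"numbers/{n}.png")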
So to test this theory, I first trained the GAN with 5000 images. This is the result I got:
And then I trained with 20000 images:
I can't really see a big improvement. What gives? Do I need to try with many more images (50,000)? Do I need to improve the architecture of the GAN?
You do not need to modify the architecture of the GAN.
If I were you, I would first look for a good metric to check how your results improve or worsen.
Honestly, looking at the two batches, the second one looks much more diverse, so it makes sense that the GAN is able to generate a wider range of numbers since it has seen many more pictures.
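For example, a quick way to check whether the GAN is just copying training samples is a nearest-neighbour lookup of each generated image against the training set (plain numpy sketch; images are assumed to be flattened and scaled the same way):

    import numpy as np

    def nearest_train_distance(generated, train):
        """generated: (n_gen, d), train: (n_train, d) flattened images in [0, 1].
        Returns, for each generated image, the L2 distance to its closest training image."""
        return np.array([np.min(np.linalg.norm(train - g, axis=1)) for g in generated])

    # distances near zero suggest the generator is memorising training images

For a more standard measure of sample quality and diversity, the Fréchet Inception Distance (FID) is commonly used.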
I am quite new to the area of facial expression recognition, and currently I'm doing research on this using deep learning, specifically CNNs. I have some questions with regard to preparing and/or preprocessing my data.
I have segmented videos of frontal facial expressions (e.g. a 2-3 second video of a person expressing a happy emotion, based on his/her annotations).
Note: the expressions displayed by my participants are of quite low intensity (not exaggerated expressions, closer to micro-expressions).
General Question: Now, how should I prepare my data for training the CNN (I am leaning towards using a deep learning library, TensorFlow)?
Question 1: I have read some deep learning-based facial expression recognition (FER) papers that suggest taking the peak of the expression (most probably a single image) and using that image as part of your training data. How would I know the peak of an expression? What's my basis? If I take only a single image, wouldn't some important frames capturing the subtlety of my participants' expressions be lost?
Question 2: Or would it also be reasonable to run the segmented video through OpenCV in order to detect (e.g. with Viola-Jones), crop and save the faces per frame, and use those images as part of my training data with their appropriate labels? I'm guessing some of the face frames will be redundant. However, since we know that the participants in our data show low-intensity expressions (micro-expressions), some movements of the face could also be important.
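This is roughly the pipeline I have in mind for Question 2 (OpenCV sketch; the video path, output folder and label are placeholders, and the cascade is the frontal-face one bundled with opencv-python):

    import os
    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def extract_faces(video_path, out_dir, label):
        os.makedirs(out_dir, exist_ok=True)
        cap = cv2.VideoCapture(video_path)
        i = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            # Viola-Jones style detection on each frame
            for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                         minNeighbors=5):
                crop = frame[y:y + h, x:x + w]
                cv2.imwrite(os.path.join(out_dir, f"{label}_{i:05d}.png"), crop)
                i += 1
        cap.release()

    extract_faces("happy_clip_01.mp4", "faces/happy", "happy")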
I would really appreciate anyone who can answer, thanks a lot!
As #unique monkey already pointed out, this is generally a supervised learning task. If you wish to extract an independent "peak" point, I recommend that you scan the input images and find the one in each sequence whose reference points deviate most from the subject's resting state.
If you didn't get a resting state, then how are the video clips cropped? For instance, were the subjects told to make the expression and hold it? What portion of the total expression (before, express, after) does the clip cover? Take one or both endpoints of the video clip; graph the movements of the reference points from each end, and look for a frame in which the difference is greatest, but then turns toward the other endpoint.
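A rough sketch of what I mean, assuming you already have the reference points (landmarks) per frame as an array:

    import numpy as np

    def peak_frame(landmarks, resting):
        """landmarks: (n_frames, n_points, 2) reference points per frame;
        resting: (n_points, 2) resting-state reference points.
        Returns the index of the frame that deviates most from the resting state."""
        deviation = np.linalg.norm(landmarks - resting[None, :, :], axis=2).sum(axis=1)
        return int(np.argmax(deviation))

    # without a resting frame, use an endpoint of the clip as the reference instead:
    # peak = peak_frame(landmarks, landmarks[0])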
Answer 1: Commonly we depend on human judgement to decide which frame is the peak of the expression (I think you can tell the difference between a smile and a laugh).
Answer 2: If you want a good result, I would suggest not treating the data as crudely as in this method.
I am using Emgu-CV to identify each person in a big room.
My camera is static and indoors.
I would like to count the number of people who visited the room; that is, I want to recognize each person even if I capture their images from different angles at different times of the day.
I am using Haar classifiers to detect the faces, heads and full bodies in the image, and then I compare these with the already detected image portions using template matching so that I can recognize the person. But I am getting very poor results.
Is this the right approach for this problem? Can anyone suggest a better approach?
Or are there any better libraries available that can solve this problem?
I think Template Matching is the weak point in your system. I would suggest training a Haar cascade for each person individually; that replaces (detection + recognition) with (detection of one precise object). Of course, this only works if the number of people you want to recognize is rather small. Alternatively, you can use other feature-based methods such as SURF, but note their licence.
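For reference, the plain template-matching call (shown here in OpenCV's Python binding; Emgu-CV wraps the same function) is essentially this, and it only scores highly when scale, pose and lighting match the stored template, which is why it breaks down across viewpoints:

    import cv2

    scene = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)           # placeholder paths
    template = cv2.imread("person_face.png", cv2.IMREAD_GRAYSCALE)

    result = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
    # max_val is close to 1.0 only when the face appears at roughly the same
    # scale and angle as the stored template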
We are trying to count the number of people in a static image; this number may be as large as 100-150. I did some research and found some ways to do this, but we are not sure which one will work best. So my question is this: will haartraining give us good results?
If you have other ideas, please share them with me.
Thank you.
I would say it depends on how your haartraining goes. Are you wanting to train a classifier specifically to detect faces, and then detect the faces in the image and count them up? It's possible that your model will do one or both of these things: (a) count things that are not faces as faces, or (b) count single faces multiple times. If you can get strong training sets for both positive and negative images, it could definitely be worth a shot as far as ballparking a number. I wouldn't expect it to be exact, though.
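As a quick baseline before training anything yourself, OpenCV's stock frontal-face cascade already gives a rough count (Python sketch; the image path is a placeholder and the detection parameters will need tuning for a dense crowd):

    import cv2

    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("crowd.jpg")                       # placeholder path
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=3,
                                     minSize=(20, 20))
    print("approximate head count:", len(faces))
    # expect both misses and double counts; treat this number as a ballpark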