I've managed to extract HOG features from positive and negative images (from INRIA's person dataset) using OpenCV's HOGDescriptor::compute function.
I've also managed to pack the data correctly and feed it into CvSVM for training purposes.
I have several questions:
While extracting features, I used positive images with dimensions of 96 x 128, while the negative images are on average 320 x 240. I have been using a window size of 64 x 128 for HOG extraction. Should I use a different window size?
The extracted feature vector for a positive image has around 28800 values, while the one for a negative image has 500000+. I have been truncating the negative vectors to 28800 values. I think this is wrong, since I believe I'm losing too much information when feeding these features to the SVM. How should I tackle this? (It seems I can only feed samples of the same length for negative and positive images.)
When predicting on images bigger than 64 x 128 (or 96 x 160), should I use a sliding window? Large negative images still give me more than 500000 features, which I can't feed into the SVM because the sample size doesn't match.
Why can't you just resize all your patches to the same size? The HOG descriptor depends on the window size, block size and cell size. You should try different combinations. With small cells you can capture fine details, but you lose generality, and vice versa.
1.) Don't understand the question
2.) Make all descriptors the same size by extracting HOG from resized images.
3.) Don't understand the question
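To see why resizing fixes the length mismatch, here is a minimal sketch of how the descriptor length follows from the window, block, stride and cell sizes. The default values are assumed to be OpenCV's HOGDescriptor defaults (16x16 blocks, 8x8 block stride, 8x8 cells, 9 bins); the function names and the 8-pixel window stride are made up for illustration.

```python
def hog_descriptor_length(win, block=(16, 16), stride=(8, 8), cell=(8, 8), nbins=9):
    """Number of values in the HOG descriptor of ONE detection window.

    All sizes are (width, height) in pixels. The defaults are assumed to
    mirror OpenCV's HOGDescriptor defaults.
    """
    win_w, win_h = win
    blocks_x = (win_w - block[0]) // stride[0] + 1
    blocks_y = (win_h - block[1]) // stride[1] + 1
    cells_per_block = (block[0] // cell[0]) * (block[1] // cell[1])
    return blocks_x * blocks_y * cells_per_block * nbins


def windows_in_image(img, win, win_stride=(8, 8)):
    """How many window positions fit in an image (no padding, assumed stride)."""
    img_w, img_h = img
    if img_w < win[0] or img_h < win[1]:
        return 0
    nx = (img_w - win[0]) // win_stride[0] + 1
    ny = (img_h - win[1]) // win_stride[1] + 1
    return nx * ny


if __name__ == "__main__":
    per_window = hog_descriptor_length((64, 128))
    print("values per 64x128 window:", per_window)
    # A larger image holds many windows, so the concatenated output grows with it.
    print("windows in a 320x240 image:", windows_in_image((320, 240), (64, 128)))
```

With these settings every 64 x 128 patch yields the same number of values, so positives and negatives line up as SVM rows. The total for a whole image depends on the window stride and padding you pass to compute, so the exact counts in the question may differ from this sketch.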
I have been playing around with some generative models, namely StyleGAN2 and diffusion models for image generation, specifically on a homegrown dataset of 64 x 64 resolution. I have roughly 260,000 data samples.
These images are all grayscale, and every pixel intensity really matters. For example, if the intensity (out of 255) of a pixel is 160 when it should be 155 for the image to look correct, this can have a big impact on the overall quality of the image. This is within the context of the other pixels, of course; otherwise we'd always be generating the same image.
The self-referential scores (FID and Kernel Inception Distance, KID) can be very low at the end of training, i.e. when the loss has converged, because the distributions of generated samples and real samples are very similar. You can also see this visually: the images look roughly correct. But unfortunately, I really need the pixel values to be narrower, i.e. closer to the correct intensity.
One solution I was thinking of was to increase the resolution to 128x128 and essentially duplicate the pixel values in each dimension. This way the "noise" of individual pixels can be averaged out over 4 pixels during generation. Does this make sense, or does this sound naive? Thanks!
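For concreteness, the proposal can be sketched as two small helpers: duplicating each pixel into a 2x2 block before training, and averaging each 2x2 block of a generated image back into one pixel. This is only an illustration of the idea in plain Python lists; the function names are made up, and it does not show whether a generator's errors on the four duplicated pixels are independent enough for the averaging to help.

```python
def upsample_2x(img):
    """Duplicate every pixel into a 2x2 block (nearest-neighbour upsampling)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def average_2x2(img):
    """Average each non-overlapping 2x2 block back into a single pixel."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]


if __name__ == "__main__":
    small = [[155, 160], [0, 255]]
    big = upsample_2x(small)
    print(average_2x2(big))  # recovers the original values as floats
    # A generated block whose four pixels scatter around the target value:
    print(average_2x2([[157, 153], [156, 154]]))
```

If the four errors are strongly correlated (for example, the whole block is too bright), the average inherits that error, so the benefit depends on how the model actually fails.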
I am testing printed digits (0-9) on a Convolutional Neural Network. It gives 99+% accuracy on the MNIST dataset, but when I trained it on images generated from fonts installed on my computer (Arial, Calibri, Cambria, Cambria Math, Times New Roman), using 104 images per font (25 fonts in total, 4 images per font with small differences), the training error rate does not go below 80%, i.e. 20% accuracy. Why?
Here is a sample of the images for the digit "2":
I resized every image to 28 x 28.
Here are more details:
Training data size = 28 x 28 images.
Network parameters - As LeNet5
Architecture of Network -
Input Layer -28x28
| Convolutional Layer - (Relu Activation);
| Pooling Layer - (Tanh Activation)
| Convolutional Layer - (Relu Activation)
| Local Layer(120 neurons) - (Relu)
| Fully Connected (Softmax Activation, 10 outputs)
This works, giving 99+% accuracy on MNIST. Why is it so bad with computer-generated fonts? A CNN can handle a lot of variance in data.
I see two likely problems:
Preprocessing: MNIST is not only 28px x 28px, but also:
The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
Source: MNIST website
Overfitting:
MNIST has 60,000 training examples and 10,000 test examples. How many do you have?
Did you try dropout (see paper)?
Did you try dataset augmentation techniques? (e.g. slightly shifting the image, probably changing the aspect ratio a bit, you could also add noise - however, I don't think those will help)
Did you try smaller networks? (And how big are your filters / how many filters do you have?)
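The centering step from the MNIST description quoted above can be sketched roughly as follows. This is a minimal plain-Python version that assumes the glyph has already been size-normalized to fit a 20x20 box by some image library; the function names are made up for illustration, and shifts are rounded to whole pixels.

```python
def center_of_mass(img):
    """Return (row, col) of the intensity-weighted centre of a 2D list."""
    total = sum(sum(row) for row in img)
    if total == 0:
        # Empty image: fall back to the geometric centre.
        return ((len(img) - 1) / 2.0, (len(img[0]) - 1) / 2.0)
    cy = sum(y * v for y, row in enumerate(img) for v in row) / total
    cx = sum(x * v for row in img for x, v in enumerate(row)) / total
    return cy, cx


def center_in_field(glyph, size=28):
    """Paste glyph into a size x size field so its centre of mass sits mid-field."""
    cy, cx = center_of_mass(glyph)
    target = (size - 1) / 2.0
    dy = int(round(target - cy))
    dx = int(round(target - cx))
    field = [[0] * size for _ in range(size)]
    for y, row in enumerate(glyph):
        for x, v in enumerate(row):
            ty, tx = y + dy, x + dx
            if 0 <= ty < size and 0 <= tx < size:
                field[ty][tx] = v
    return field


if __name__ == "__main__":
    # A 20x20 glyph with all its ink in the top-left corner.
    glyph = [[255 if y < 4 and x < 4 else 0 for x in range(20)] for y in range(20)]
    field = center_in_field(glyph)
    print(center_of_mass(field))
```

Applying the same normalization to the font images before training or testing makes them look more like what an MNIST-trained network expects.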
Remarks
Interesting idea! Did you try simply applying the trained MNIST network on your data? What are the results?
It may be an overfitting problem. This can happen when your network is too complex for the problem it has to solve.
Check this article: http://es.mathworks.com/help/nnet/ug/improve-neural-network-generalization-and-avoid-overfitting.html
It definitely looks like an issue of overfitting. I see that you have two convolution layers, two max pooling layers and two fully connected layers. But how many weights in total? You only have 96 examples per class, which is certainly smaller than the number of weights you have in your CNN. As a rule of thumb, you want at least 5 times more instances in your training set than weights in your CNN.
You have two solutions to improve your CNN:
Shake each instance in the training set: shift each digit by about 1 pixel in every direction. This alone multiplies your training set by 9.
Use a transformer layer. It adds an elastic deformation to each digit at each epoch. This strengthens the learning a lot by artificially increasing your training set. Moreover, it makes the network much more effective at predicting other fonts.
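The one-pixel shifts suggested above could look something like this. It is a small sketch on plain 2D lists, with hypothetical helper names; pixels shifted in from outside the image are filled with 0, which assumes a black background.

```python
def shift(img, dy, dx, fill=0):
    """Return a copy of img moved dy rows down and dx columns right."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def shifted_copies(img):
    """All 9 versions of img shifted by -1, 0 or +1 pixel on each axis."""
    return [shift(img, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


if __name__ == "__main__":
    digit = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    copies = shifted_copies(digit)
    print(len(copies))  # one original plus eight shifted versions
```

The elastic deformation from the second suggestion is not shown here, since it would need an image library.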
Viola-Jones' AdaBoost method is very popular for face detection. We need lots of positive and negative samples to train a face detector.
The rule for collecting positive samples is simple: images which contain faces. But the rule for collecting negative samples is not very clear: images which do not contain faces.
But there are so many scenes that do not contain faces (sky, river, house, animals, etc.). Which should I collect? How can I know I have collected enough negative samples?
One suggested idea for negative samples: take the positive samples, crop out the face region, and use the remaining part as negative samples. Does this work?
You have asked many questions inside your thread.
Amount of samples. As a rule of thumb: when you train a detector you need roughly a few thousand positive and negative examples per stage. A typical detector has 10-20 stages. Each stage reduces the number of negatives by a factor of 2. So you will need roughly 3,000 - 10,000 positive examples and ~5,000,000 to 100,000,000 negative examples.
Which negatives to take. A rule of thumb: you need to find a face in a given environment, so take that environment as negative examples. For instance, if you try to detect faces of students sitting in a classroom, then take as negative examples images from the classroom (walls, windows, human body, clothes, etc.). Taking images of the moon or of the sky will probably not help you. If you don't know your environment, then just take as many different natural images as possible (under different lighting conditions).
Should you take facial parts (like an eye or a nose) as negatives? You can, but taking only those negatives is definitely not enough. The real strength of the detector will come from the negative images which represent the typical background of the faces.
How to collect/generate negative samples - You don't actually need many negative images. You can take 1000 images and generate 10,000,000 negative samples from them. Here is how you do it. Suppose you take a photo of a car at 1 megapixel resolution (1000x1000 pixels). Suppose then that you want to train a face detector to work at a resolution of 20x20 pixels (like OpenCV did). You take your big 1000x1000 image and cut it into pieces of 20x20. You get 2,500 pieces (50x50). This is how a single big image yields 2,500 negative examples. Now you can take the same big image and cut it into pieces of 10x10 pixels. You will now have an additional 10,000 negative examples. Each of these is 10x10 pixels, and you can enlarge it by a factor of 2 so that all samples have the same size. You can repeat this process as much as you want (cutting the input image into pieces of different sizes). Mathematically speaking, if your image is of size NxN, you can generate O(N^4) negative examples from it by taking each possible rectangle inside it.
In step 4, I described how to take a single big image and cut it into a large number of negative examples. I must warn you that negative examples should not have high covariance, so I don't recommend taking only one image and generating 1 million negative examples from it. As a rule of thumb, create a library of 1000 images (or download random images from Google). Verify that none of the images contains faces. Crop about 10,000 negative examples from each image and you have a decent 10,000,000 negative examples. Train your detector. In the next step you can cut each image into ~50,000 (partially overlapping) pieces and thus enlarge your number of negatives to 50 million. You will start getting very good results with it.
Final enhancement step of the detector. When you already have a rather good detector, run it on many images. It will produce false detections (detecting a face where there is no face). Gather all those false detections and add them to your negative set. Now retrain the detector once again. The more such iterations you do, the better your detector becomes.
Real numbers - The best face detectors today (like Facebook's) use hundreds of millions of positive examples and billions of negatives. As positive examples they take not only frontal faces but faces in many orientations, with different facial expressions (smiling, shouting, angry, ...), different age groups, different genders, different races (Caucasians, blacks, Thai, Chinese, ...), with or without glasses/hat/sunglasses/make-up, etc. You will not be able to compete with the best, so don't get angry if your detector misses some faces.
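The tile arithmetic from the "How to collect/generate negative samples" step can be checked with a short sketch. The helper names are made up for illustration; tile_boxes yields (x, y, width, height) crop boxes that you would pass to whatever image library you use, and count_rectangles counts every axis-aligned rectangle with pixel-aligned corners, which grows on the order of N^4.

```python
def tile_boxes(width, height, tile):
    """Yield (x, y, w, h) boxes for non-overlapping tile x tile crops."""
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield (x, y, tile, tile)


def count_rectangles(n):
    """Number of axis-aligned sub-rectangles of an n x n pixel image."""
    per_axis = n * (n + 1) // 2  # choices of (start, end) along one axis
    return per_axis ** 2


if __name__ == "__main__":
    print(len(list(tile_boxes(1000, 1000, 20))))  # 20x20 pieces
    print(len(list(tile_boxes(1000, 1000, 10))))  # 10x10 pieces
    print(count_rectangles(1000))                 # all possible rectangles
```

Most of those rectangles overlap heavily, which is the covariance problem mentioned above, so in practice you would sample only a small fraction of them per image.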
Good luck
So following up from here, I now need to collect negative samples, for cascaded classification using OpenCV.
With positive samples, I know that all samples should have the same aspect ratio.
What about negative samples?
Should they all be larger than the positive samples (since OpenCV is going to paste positives on top of negatives to create the test images)?
Should they all be the same size?
Can they be arbitrary sizes?
Should they too have the same aspect ratio among themselves?
From OpenCV doc on Cascade Classifier Training:
Negative samples are taken from arbitrary images. These images must not contain detected objects. [...] Described images may be of different sizes. But each image should be (but not nessesarily) larger then a training window size, because these images are used to subsample negative image to the training size.
Suppose we calculate HOG features of image patches of different sizes, ranging from 64 * 64 to 128 * 128. Now, if we want to run k-means on these, should we normalize the patches which belong to different scales? I know HOG features are normalized, but does scale matter?
Normally, the HOG representations are normalized. However, you must be careful with the block size. In fact, you must have the same number of blocks whatever the size of the image. Otherwise, you obtain descriptors of different lengths and k-means cannot be performed. This means that larger images get larger blocks. The resulting histograms will contain information from more gradients, so they aren’t invariant at this stage. However, applying the histogram normalization gives the final descriptor its scale invariance.
Yet, if you are not sure whether the histogram normalization is performed well, you can extract the descriptor for an image and for its resized version and compare them.
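One way to keep the number of blocks fixed is to scale the cell size with the patch size. The sketch below only does the bookkeeping, under assumed settings that are not from the question: square patches, 8 cells per side, blocks of 2x2 cells moved one cell at a time, and 9 orientation bins. The function name is hypothetical.

```python
def hog_layout(patch_size, cells_per_side=8, cells_per_block=2, nbins=9):
    """Return (cell size in pixels, descriptor length) for a square patch.

    The cell size grows with the patch, so the number of cells and blocks
    (and therefore the descriptor length) stays the same at every scale.
    """
    if patch_size % cells_per_side != 0:
        raise ValueError("patch size must be a multiple of cells_per_side")
    cell = patch_size // cells_per_side
    blocks_per_side = cells_per_side - cells_per_block + 1
    length = blocks_per_side ** 2 * cells_per_block ** 2 * nbins
    return cell, length


if __name__ == "__main__":
    for size in (64, 96, 128):
        cell, length = hog_layout(size)
        print(size, "px patch -> cell", cell, "px, descriptor length", length)
```

Every patch size then produces a vector of the same length, which is what k-means needs. Resizing all patches to one size before extraction is an equivalent alternative.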
Good luck!