I'm trying to find facial landmarks using dlib. So I'm fitting my own model on the HELEN dataset (2000 items downloaded from here). But the accuracy is very low, whereas when I use shape_predictor_68_face_landmarks.dat the accuracy is high. I've read the Kazemi paper and set nu = 0.1, depth = 4 and oversampling_amount = 20, but it still works badly. What's wrong?
The dlib landmark trainer works fine with the default parameters if the dataset is large enough, but the bounding box for the detected face should be square. Since faces are almost left-right symmetric, augmenting the dataset by mirroring the images horizontally may increase the accuracy.
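Something like this should work for setting those options with dlib's Python API (a rough sketch; the XML and output file names are placeholders, and the XML is the image/box/part listing produced by dlib's imglab tool):

import dlib

# training options discussed in the question
options = dlib.shape_predictor_training_options()
options.nu = 0.1
options.tree_depth = 4
options.oversampling_amount = 20
options.be_verbose = True

# the XML lists the training images, square face boxes and landmark parts
dlib.train_shape_predictor("helen_training.xml", "my_predictor.dat", options)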
Each layer in a CNN reduces the size of the input via convolution and max-pooling operations. Convolution is translation equivariant, while max-pooling is translation invariant. Correct me if this is wrong: each time max-pooling is applied, the precision of a feature's location is reduced. So the feature maps of the final conv layer in a very deep CNN will have a large receptive field (w.r.t. the original image), but the location of this feature (in the original image) is not discernible from looking at this feature map alone.
If this is true, how can bounding boxes be so accurate when we do localisation with a deep CNN? I understand how classification works, but making accurate bounding box predictions is confusing me.
Perhaps a toy example will clarify my confusion:
Say we have a dataset of images with dimension 256x256x1, and we want to predict whether a cat is present, and if so, where it is, so our target is something like [sigmoid_cat_present, cat_location].
Our vanilla CNN (let's assume something like VGG) will take in the image and transform it to something like 16x16x256 in the last convolutional layer. Each pixel in this final 16x16 feature map can be influenced by a much larger region in the original image. So if we determine a cat is present, how can [cat_location] be refined to a value more granular than this effective receptive field?
To add to your question: how about pixel-perfect accuracy of a segmentation boundary?!
Your intuition regarding down-sampling via max-pooling is correct. Normal CNNs have that limitation. However, there have been some improvements recently to overcome it.
The breakthrough on this problem came around 2015-2016 in the form of the U-Net and the atrous/dilated convolutions introduced in DeepLab.
Dilated (atrous) convolutions, previously described for wavelet analysis without signal decimation, expand the window size without increasing the number of weights by inserting zero values into the convolution kernels. Dilated convolutions have been shown to decrease blurring in semantic segmentation maps, and are purported to work at least in part by extracting long-range information without the need for pooling.
Using U-Net architectures is another method that seeks to retain high spatial frequency information by directly adding skip connections between early and late layers. In other words, down-sampling followed by up-sampling, with the skip connections carrying the fine spatial detail across.
In TensorFlow, atrous convolutions are implemented with the function:
tf.nn.atrous_conv2d
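A minimal usage sketch (TF 1.x-style API; the shapes and dilation rate are arbitrary choices for illustration):

import tensorflow as tf

# NHWC input and an HWIO filter; rate=2 dilates the 3x3 kernel by inserting
# zeros between its taps, enlarging the receptive field without any pooling
images = tf.placeholder(tf.float32, [None, 256, 256, 1])
filters = tf.Variable(tf.random_normal([3, 3, 1, 32]))
features = tf.nn.atrous_conv2d(images, filters, rate=2, padding="SAME")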
There are many more methods and this is an ongoing research area.
I am trying to implement a shape detector in TensorFlow. For this, I have two classes: one is only vertical rectangles and the other is only right arrows, like the following images.
Training is done with 190 samples for each class.
I trained the model two times, one without ZCA whitened training data and another one with ZCA whitened training data, with the same network architecture and the same number of iterations.
When the following image of a down arrow is tested with the first model, it is predicted as a rectangle with 99.99 percent confidence, but when the same image is tested with the second model (trained with the ZCA-whitened samples), it is predicted as an arrow with 100 percent confidence.
I want to know how ZCA whitening changed the prediction that drastically, even though no data augmentation (such as rotation) was used for training.
Any kind of help would be greatly appreciated.
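For context, ZCA whitening is usually computed along these lines (a minimal NumPy sketch, assuming a data matrix X with one flattened image per row; eps is an arbitrary small constant):

import numpy as np

def zca_whiten(X, eps=1e-5):
    # X: (n_samples, n_features); zero-center each feature first
    X = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)                    # feature covariance
    U, S, _ = np.linalg.svd(cov)                     # eigendecomposition of cov
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T    # ZCA transform
    return X @ W

This removes correlations between pixels and equalizes their variances, which tends to emphasize edge-like structure in the whitened images.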
I have built an FCN for image segmentation. The object to be segmented occupies only very few pixels relative to the image size (1024x1024). This results in very high accuracy, even if I only train with 10 images instead of 18000 (my full training set).
My approach to solving this is to use some kind of weighted accuracy, so that the metric actually says something about the performance of identifying the small object (right now the accuracy is high simply because so many pixels are not the object, so not classifying anything as the object still yields high accuracy).
How do I decide the weight? Anybody with some experience?
As you wrote, use a custom weight function which penalizes misclassification of under-represented pixels more. You can derive the weight from the ratio between the number of object pixels and all of the pixels in the image, or you can tune it by hand - just make sure you follow a metric which tells you the accuracy on the object pixels. Hope it helps.
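For example, a rough sketch in TensorFlow (1.x-style API) of deriving a positive-class weight from that pixel ratio and plugging it into a weighted binary cross-entropy; the mask file path and tensor shapes are made up for illustration:

import numpy as np
import tensorflow as tf

labels = np.load("masks.npy")                   # binary masks, shape (n, 1024, 1024)
object_fraction = labels.mean()                 # fraction of object pixels
pos_weight = (1.0 - object_fraction) / object_fraction   # up-weight the rare class

logits_ph = tf.placeholder(tf.float32, [None, 1024, 1024])
labels_ph = tf.placeholder(tf.float32, [None, 1024, 1024])
loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(
        targets=labels_ph, logits=logits_ph, pos_weight=float(pos_weight)))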
You can use the infogain loss layer for a "weighted" loss.
The infogain loss is a generalization of the commonly used cross-entropy loss. It is defined using a weight matrix H (of size L-by-L, where L is the number of classes): for a sample with ground-truth label l, the loss is
L(p) = -sum_j H(l, j) * log(p_j)
where p is the vector of predicted class probabilities. With H equal to the identity matrix, this reduces to the usual cross-entropy loss.
You can find more details on this loss here.
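As a small illustration of the weighting (plain NumPy, not the Caffe layer itself; the 10x weight on the rare object class is made up):

import numpy as np

# H[l, j] weights -log(p_j) when the true class is l; a diagonal H gives a
# per-class weighted cross-entropy, here penalizing object errors 10x more
H = np.diag([1.0, 10.0])

def infogain_loss(p, label):
    # p: predicted class probabilities, label: ground-truth class index
    return -np.dot(H[label], np.log(p))

print(infogain_loss(np.array([0.9, 0.1]), label=0))  # ~0.11 (background, correct)
print(infogain_loss(np.array([0.9, 0.1]), label=1))  # ~23.0 (object missed, 10x penalty)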
I trained a CNN (in TensorFlow) for digit recognition using the MNIST dataset.
Accuracy on test set was close to 98%.
I wanted to predict the digits using data which I created myself and the results were bad.
What did I do to the images written by me?
I segmented out each digit, converted it to grayscale, resized it to 28x28 and fed it to the model.
How come I get such low accuracy on my own data set but such high accuracy on the test set?
Are there other modifications that I'm supposed to make to the images?
EDIT:
Here is the link to the images and some examples:
Excluding bugs and obvious errors, my guess would be that your problem is that you are capturing your handwritten digits in a way that is too different from your training set.
When capturing your data you should try to mimic as much as possible the process used to create the MNIST dataset:
From the official MNIST dataset website:
The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. The images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field.
If your data is processed differently in the training and test phases, then your model is not able to generalize from the training data to the test data.
So I have two pieces of advice for you:
Try to capture and process your digit images so that they look as similar as possible to the MNIST dataset (see the sketch after this list);
Add some of your own examples to your training data to allow your model to train on images similar to the ones you are classifying;
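A rough sketch of that preprocessing with OpenCV and NumPy, assuming a white-on-black digit crop as input (the 20x20 box and center-of-mass centering follow the MNIST description quoted above; the function name is just illustrative):

import cv2
import numpy as np

def mnist_style(digit):
    # digit: 2D uint8 array, white digit on a black background
    h, w = digit.shape
    scale = 20.0 / max(h, w)                            # fit into a 20x20 box
    digit = cv2.resize(digit, (max(1, int(round(w * scale))),
                               max(1, int(round(h * scale)))))
    canvas = np.zeros((28, 28), dtype=np.uint8)
    y0 = (28 - digit.shape[0]) // 2                     # paste roughly centered
    x0 = (28 - digit.shape[1]) // 2
    canvas[y0:y0 + digit.shape[0], x0:x0 + digit.shape[1]] = digit
    ys, xs = np.indices(canvas.shape)                   # intensity-weighted
    total = float(canvas.sum())                         # center of mass
    cy, cx = (ys * canvas).sum() / total, (xs * canvas).sum() / total
    M = np.float32([[1, 0, 14 - cx], [0, 1, 14 - cy]])  # shift it to (14, 14)
    return cv2.warpAffine(canvas, M, (28, 28))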
For those who still have a hard time with the poor quality of CNN-based models on MNIST:
https://github.com/christiansoe/mnist_draw_test
Normalization was the key.
Has anyone ever tried to train a HOG + linear SVM pedestrian detector on the TUD-Brussels dataset (introduced on this website):
https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/people-detection-pose-estimation-and-tracking/multi-cue-onboard-pedestrian-detection/
I tried to implement it in OpenCV through Visual Studio 2012. I cropped positive samples from the original positive images based on their annotations (about 1777 samples in total). Negative samples were cropped randomly from the original negative images, 20 samples for each image (about 3840 samples in total).
I also applied two rounds of bootstrapping (mining hard examples and retraining) to improve its performance. However, the test result for this detector on TUD-Brussels was awful: about a 97% miss rate at an FPPI (false positives per image) of 1. I found another paper which achieved a reasonable result when training on TUD-Brussels with HOG (see Figure 3(a)):
https://www1.ethz.ch/igp/photogrammetry/publications/pdf_folder/walk10cvpr.pdf.
Does anybody have any idea about training HOG + linear SVM on TUD-Brussels?
I faced a similar situation recently. I developed an image classifier with HOG and a linear SVM in Python using PyCharm. The problem I faced was that it took a lot of time to train.
Solution:
Simple: I resized each image to 250x250. It really increased performance in my situation.
Resize each image
Convert it to grayscale
Compute PCA
Flatten that and append it to the training list
Append the label to the training labels
import os
import cv2          # OpenCV 2.4.x API (cv2.SVM was removed in OpenCV 3)
import numpy as np

path1 = 'class1/'                    # folder with the training images (illustrative)
listing1 = os.listdir(path1)
training_set = []
training_labels = []

for file in listing1:
    img = cv2.imread(path1 + file)
    res = cv2.resize(img, (250, 250))                    # resize to 250x250
    gray_image = cv2.cvtColor(res, cv2.COLOR_BGR2GRAY)   # convert to grayscale
    xarr = np.squeeze(np.array(gray_image).astype(np.float32))
    m, v = cv2.PCACompute(xarr)                          # mean and eigenvectors (OpenCV 2.4)
    arr = np.array(v)
    flat_arr = arr.ravel()                               # flatten the eigenvectors
    training_set.append(flat_arr)
    training_labels.append(1)                            # label for this class
Now Training
trainData = np.float32(training_set)
responses = np.float32(training_labels)

# illustrative linear-SVM parameters; tune C for your own data
svm_params = dict(svm_type=cv2.SVM_C_SVC, kernel_type=cv2.SVM_LINEAR, C=1.0)
svm = cv2.SVM()
svm.train(trainData, responses, params=svm_params)
svm.save('svm_data.dat')
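To use the saved model later, something like the following should work (same OpenCV 2.4-style API as above; flat_arr here stands for a feature vector prepared exactly like the training ones):

svm2 = cv2.SVM()
svm2.load('svm_data.dat')
prediction = svm2.predict(np.float32(flat_arr))   # returns the predicted label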