I trained up a very vanilla CNN using Keras/Theano that does a pretty good job of detecting whether a small (32x32) portion of an image contains a (relatively simple) object of type A or B (or neither). The output is an array of three numbers [prob(neither class), prob(A), prob(B)]. Now, I want to take a big image (512x680, methinks) and sweep across it, running the trained model on each 32x32 sub-image to generate a feature map that's 480x648, at each point consisting of a 3-vector of the aforementioned probabilities. Basically, I want to use my whole trained CNN as a (nonlinear) filter with three-dimensional output. At the moment, I am cutting each 32x32 patch out of the image one at a time and running the model on it, then dropping the resulting 3-vectors into a big 3x480x648 array. However, this approach is very slow. Is there a faster/better way to do this?
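For concreteness, a minimal sketch of the sliding-window pass described above, batching a whole row of windows into each `predict` call instead of one window at a time; the names `model` and `image`, and the channels-last single-channel input, are assumptions:

```python
import numpy as np

# Sketch of the sliding-window pass, batching one row of windows per predict()
# call. Assumes `model` is the trained Keras model, `image` is a 2-D (H, W)
# array, and the network expects (32, 32, 1) inputs.
patch = 32
H, W = image.shape
rows, cols = H - patch + 1, W - patch + 1
feature_map = np.zeros((3, rows, cols), dtype=np.float32)

for i in range(rows):
    # All windows whose top edge is at row i, as one (cols, 32, 32) batch.
    batch = np.stack([image[i:i + patch, j:j + patch] for j in range(cols)])
    probs = model.predict(batch[..., np.newaxis], batch_size=cols, verbose=0)
    feature_map[:, i, :] = probs.T   # probs has shape (cols, 3)
```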
I would like to know whether I can use a dataset of signs captured with a Kinect to retrain Inception's final layer, as described in the TensorFlow tutorial that uses ordinary RGB images. I am new to this field. Opinions are much appreciated.
The short answer is: "No, you cannot fine-tune only the last layer, but you can fine-tune the whole pre-trained network." The first layers of the pre-trained network are looking for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with some contrast stretching and data whitening ((x - mu) / sigma). This will take a very long time for the ivy-league networks like Inception and ResNet. Also keep in mind that most Python-based deep learning frameworks rely on PIL image loaders, which by default assume 8-bit channels mapped into the range [0, 1]; these loaders cast all 16-bit pixels to ones.
If the recognised/tracked object (hands) is masked, which means your background is set to the same value or barely has any gradient in it, the network will overfit on the silhouette of the object, because that is where the strongest edges are. The solution for this is to colourise the depth image using normal maps or HSA, HSV, or JET colour coding to convert it into a 3x8-bit-channel image. This makes the training converge much faster, and in my recent experiments we found that you can fine-tune the ivy-league networks on the colourised depth.
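As a concrete illustration of both preprocessing routes (16-bit loading plus (x - mu)/sigma whitening for the first option, JET colourisation for the second), here is a small OpenCV sketch; the file name and the per-image normalisation are assumptions:

```python
import cv2
import numpy as np

# Option 1: train from scratch on raw depth. Load the frame as 16-bit
# (a PIL-style 8-bit loader would clip it), stretch the contrast, then whiten.
depth = cv2.imread("frame_0001.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
depth = (depth - depth.min()) / max(depth.max() - depth.min(), 1e-6)   # contrast stretch to [0, 1]
depth_whitened = (depth - depth.mean()) / (depth.std() + 1e-6)         # (x - mu) / sigma

# Option 2: colourise the depth so a pre-trained RGB network can be fine-tuned.
depth_8u = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
depth_jet = cv2.applyColorMap(depth_8u, cv2.COLORMAP_JET)   # (H, W, 3) uint8 JET image
```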
Since you are new to this field, I would suggest reading about what transfer learning is and the three forms it can take, and then applying whichever form fits your dataset. If your dataset is very similar to the one the pre-trained model was built on, you can get away with retraining only the last layers; if your data is not similar, you have to fine-tune more of the existing model.
As you go deeper into a neural network, the extracted features become more specific to the training data, so you have to take care with the deeper layers if your dataset is not very similar to the pre-built model's dataset. The early layers contain more generic features.
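For example, a rough tf.keras sketch of the two regimes (freeze everything and retrain only a new head for similar data, or unfreeze deeper blocks for dissimilar data); this is not the retrain.py script from the TensorFlow tutorial, and the class count, input size, and optimizer are placeholders:

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

num_sign_classes = 24  # placeholder: set to the number of signs in your dataset

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))

# New classification head for the sign classes.
x = GlobalAveragePooling2D()(base.output)
outputs = Dense(num_sign_classes, activation="softmax")(x)
model = Model(base.input, outputs)

# Data very similar to ImageNet: freeze the base and train only the new head.
# Data not similar (e.g. depth frames): unfreeze the deeper blocks as well and
# fine-tune them with a small learning rate.
for layer in base.layers:
    layer.trainable = False

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```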
I've seen quite a few CNN code examples for ID'ing images, but they generally relate to a 1-to-1 input-to-target relationship (like the MNIST handwritten numerals set), and most seem to use similar image dimensions (pixels) for the input image and training images.
So...what is the usual approach for identifying multiple objects in one image? (like several people, or any other relatively complex scene). I've seen it done often enough, but haven't seen design approaches mentioned. Does this require some type of preprocessing or can this be handled directly by a CNN?
I would say the best-known family of techniques for retrieving multiple objects from an image is the Detection family.
With Detection, the basic idea is to generate one or more Proposal windows of different sizes and aspect ratios within the image, either with a deterministic scheme or with a randomised proposal algorithm.
For each Proposal window, the Classification algorithm is then executed to reveal what that specific area of the image represents.
The next step would usually be to run a Merge process to combine all neighbouring areas into one single classification output.
Note: A None class is often also used to represent an area with no specific class found.
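A very rough Python sketch of that propose / classify / merge loop; the window sizes, stride, score threshold, the `classify` callback, and the use of plain non-maximum suppression for the merge step are all assumptions:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def detect(image, classify, win_sizes=((64, 64), (128, 128)), stride=32, thresh=0.8):
    """`classify(patch)` is assumed to return a (label, score) pair, with "none" for no class."""
    H, W = image.shape[:2]
    candidates = []
    # Proposal windows of different sizes swept over the image.
    for wh, ww in win_sizes:
        for y in range(0, H - wh + 1, stride):
            for x in range(0, W - ww + 1, stride):
                label, score = classify(image[y:y + wh, x:x + ww])
                if label != "none" and score >= thresh:
                    candidates.append(((x, y, x + ww, y + wh), label, score))
    # Merge: keep the highest-scoring box per region and drop overlapping neighbours.
    candidates.sort(key=lambda c: c[2], reverse=True)
    kept = []
    for box, label, score in candidates:
        if all(iou(box, k[0]) < 0.5 for k in kept if k[1] == label):
            kept.append((box, label, score))
    return kept
```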
I'm trying to use a pretrained VGG16 as an object localizer in TensorFlow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and toss on either a 4-D or a 4000-D fc layer for bounding-box regression. I'm not trying to do anything fancy here (sliding windows, RCNN), just get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.image.sample_distorted_bounding_box op, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning, crop to that, then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the detection or segmentation (like MS-COCO) challenges, or is it completely different?
Anything helps...
Thanks
Localization is usually performed by intersecting sliding windows in which the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask with your object filled in, and it tries to output a pixel mask of the same size.
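On the rescale/crop question above: a minimal sketch of carrying a ground-truth box through the "resize the shortest side to 256, take the central 224x224 crop" preprocessing, clipping so the transformed coordinates never go negative; the function and variable names are my own:

```python
def transform_box(box, orig_w, orig_h, short_side=256, crop=224):
    """Map a box (x1, y1, x2, y2) from the original image into the cropped image."""
    # Scale so the shortest side becomes `short_side`.
    scale = short_side / float(min(orig_w, orig_h))
    new_w, new_h = orig_w * scale, orig_h * scale

    # Offsets of the central crop inside the resized image.
    off_x = (new_w - crop) / 2.0
    off_y = (new_h - crop) / 2.0

    x1, y1, x2, y2 = [c * scale for c in box]
    x1, x2 = x1 - off_x, x2 - off_x
    y1, y2 = y1 - off_y, y2 - off_y

    # Clip to the crop so coordinates stay in [0, crop]; a box that ends up
    # with zero area fell entirely outside the crop.
    clip = lambda v: max(0.0, min(float(crop), v))
    return clip(x1), clip(y1), clip(x2), clip(y2)
```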
I'm trying to set up an object classification system with OpenCV. When I detect a new object in a scene, I want to know if the new object belongs to a known object class (is it a box, a bottle, something unknown, etc.).
My steps so far:
Cutting down the image to the ROI where a new object could appear
Calculating keypoints for every image (cv::SurfFeatureDetector)
Calculating descriptors for each keypoint (cv::SurfDescriptorExtractor)
Generating a vocabulary using Bag of Words (cv::BOWKMeansTrainer)
Calculating response histograms (cv::BOWImgDescriptorExtractor)
Using the response histograms to train a cv::SVM for every object class
Using the same set of images again to test the classification
I know that there is still something wrong with my code, since the classification doesn't work yet.
But I don't really know when I should use the full image (cut down to the ROI) and when I should extract the new object from the image and use just the object itself.
It's my first step into object recognition/classification, and I've seen people use both full images and extracted objects, but I just don't know when to use which.
I hope someone can clarify this for me.
You should not use the same images for both testing and training.
In training, ideally you need to extract an ROI which includes just one dominant object, since the algorithm will assume that the codewords extracted from the positive samples are the ones that should be present in a test image to label it as positive. However, if you have a really big dataset like ImageNet, the algorithm should generalize.
In testing, you don't need to extract an ROI, because SIFT/SURF are scale-invariant features. However, it's good to have one dominant object in the test images as well.
I think you should train one classifier for each of your object classes. This is called a one-vs-all classifier.
One little note: if you don't want to worry about these issues and have a big dataset, just go with convolutional neural networks. They have really good generalization capability and are inherently multi-label thanks to their fully connected last layer.
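To tie this back to the pipeline in the question, here is a rough sketch in OpenCV's Python bindings (the C++ classes map over one-to-one); `training_rois`, `training_labels`, and `class_names` are assumed to exist, SURF requires an opencv-contrib build, and a separate image set should be used for testing:

```python
import cv2
import numpy as np

surf = cv2.xfeatures2d.SURF_create(400)   # needs the contrib modules

# 1) Build the vocabulary from descriptors of the training ROIs
#    (each ROI assumed to contain one dominant object and yield keypoints).
bow_trainer = cv2.BOWKMeansTrainer(100)   # 100 codewords, a placeholder choice
for roi in training_rois:
    _, desc = surf.detectAndCompute(roi, None)
    if desc is not None:
        bow_trainer.add(desc)
vocabulary = bow_trainer.cluster()

# 2) Response histograms for every training ROI.
matcher = cv2.BFMatcher(cv2.NORM_L2)
bow_extract = cv2.BOWImgDescriptorExtractor(surf, matcher)
bow_extract.setVocabulary(vocabulary)

def histogram(img):
    return bow_extract.compute(img, surf.detect(img))

samples = np.vstack([histogram(r) for r in training_rois]).astype(np.float32)

# 3) One SVM per object class (one-vs-all): label 1 for that class, 0 for the rest.
svms = {}
for cls in class_names:
    labels = np.array([1 if l == cls else 0 for l in training_labels], dtype=np.int32)
    svm = cv2.ml.SVM_create()
    svm.train(samples, cv2.ml.ROW_SAMPLE, labels)
    svms[cls] = svm
```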
I need to prepare training data which I will then use with OpenCV's cascade classifier. I understand that for training data I'll need to provide rectangular images as samples, with aspect ratios that correspond to the -w and -h parameters in OpenCV's training commands.
I was fine with this idea, but then I saw the web-based annotation tool LabelMe.
People have labelled objects in LabelMe using complex polygons!
Can these polygons somehow be used in cascade training?
Wouldn't using irregular polygons improve the classification results?
If not, then what is the use of the complex polygons that outline objects in LabelMe'd images?
Data sets annotated with LabelMe are used for many different purposes. Some of them, like image segmentation, require tight boundaries, rather than bounding boxes.
On the other hand, the cascade classifier in OpenCV is designed to classify rectangular image regions. It is then used as part of a sliding-window object detector, which also works with bounding boxes.
Whether tight boundaries help improve object detection is an interesting question. There is evidence that the background pixels caught by the bounding box actually help the classification.
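If you still want to reuse LabelMe's polygon annotations for cascade training, the usual shortcut is to collapse each polygon to its bounding rectangle; a minimal sketch (the polygon coordinates are made up):

```python
import numpy as np
import cv2

def polygon_to_bbox(points):
    """Collapse a LabelMe-style polygon [(x, y), ...] to a rectangle (x, y, w, h)."""
    pts = np.array(points, dtype=np.int32).reshape(-1, 1, 2)
    return cv2.boundingRect(pts)

# Example: the resulting x y w h is what goes into the positive-samples
# description file that OpenCV's sample-creation tooling reads.
print(polygon_to_bbox([(120, 80), (180, 75), (200, 140), (130, 150)]))
```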