I'm trying to understand and use the Faster R-CNN algorithm on my own data.
My question is about ROI coordinates: what we have as labels, and what we want in the end, are ROI coordinates in the input image. However, if I understand it correctly, anchor boxes are given in the convolutional feature map, then the ROI regression gives ROI coordinates relatively to an anchor box (so easily translatable to coordinates in conv feature map coordinates), and then the Fast-RCNN part does the ROI pooling using the coordinates in the convolutional feature map, and itself (classifies and) regresses the bounding box coordinates.
Considering that between the raw image and the convolutional features, some convolutions and poolings occured, possibly with strides >1 (subsampling), how do we associate coordinates in the raw images to coordinates in feature space (in both ways) ?
How are we supposed to give anchor boxes sizes: relatively to the input image size, or to the convolutional feature map ?
How is the bounding box regressed by Fast-RCNN expressed ? (I would guess: relatively to the ROI proposal, similarly to the encoding of the proposal relatively to the anchor box; but I'm not sure)
It looks like it's actually an implementation question, the method itself does not answer that.
A good way to do it though, that is used by Tensorflow Object Detection API, is to always give coordinates and ROI sizes relatively to the layer's input size. That is, all coordinates and sizes will be real numbers between 0 and 1. Likewise for the anchor boxes.
This handles nicely the problem of the downsampling, and allows easy computations of ROI coordinates.
When you don't use an activation function on a layer, the result will be raw numbers. These raw numbers are basically associated with the coordinates (labels) directly.
Using an activation function such as softmax or relu will give a probability value, which leads into a classification solution, instead of regression.
Related
I'm trying to use a pretrained VGG16 as an object localizer in Tensorflow on ImageNet data. In their paper, the group mentions that they basically just strip off the softmax layer and either toss on a 4D/4000D fc layer for bounding box regression. I'm not trying to do anything fancy here (sliding windows, RCNN), just get some mediocre results.
I'm sort of new to this and I'm just confused about the preprocessing done here for localization. In the paper, they say that they scale the image to 256 as its shortest side, then take the central 224x224 crop and train on this. I've looked all over and can't find a simple explanation on how to handle localization data.
Questions: How do people usually handle the bounding boxes here?...
Do you use something like the tf.sample_distorted_bounding_box command, and then rescale the image based on that?
Do you just rescale/crop the image itself, and then interpolate the bounding box with the transformed scales? Wouldn't this result in negative box coordinates in some cases?
How are multiple objects per image handled?
Do you just choose a single bounding box from the beginning ,crop to that, then train on this crop?
Or, do you feed it the whole (centrally cropped) image, and then try to predict 1 or more boxes somehow?
Does any of this generalize to the Detection or segmentation (like MS-CoCo) challenges, or is it completely different?
Anything helps...
Thanks
Localization is usually performed as an intersection of sliding windows where the network identifies the presence of the object you want.
Generalizing that to multiple objects works the same.
Segmentation is more complex. You can train your model on a pixel mask with your object filled, and you try to output a pixel mask of the same size
I have some conceptual issues in understanding SURF and SIFT algorithm All about SURF. As far as my understanding goes SURF finds Laplacian of Gaussians and SIFT operates on difference of Gaussians. It then constructs a 64-variable vector around it to extract the features. I have applied this CODE.
(Q1 ) So, what forms the features?
(Q2) We initialize the algorithm using SurfFeatureDetector detector(500). So, does this means that the size of the feature space is 500?
(Q3) The output of SURF Good_Matches gives matches between Keypoint1 and Keypoint2 and by tuning the number of matches we can conclude that if the object has been found/detected or not. What is meant by KeyPoints ? Do these store the features ?
(Q4) I need to do object recognition application. In the code, it appears that the algorithm can recognize the book. So, it can be applied for object recognition. I was under the impression that SURF can be used to differentiate objects based on color and shape. But, SURF and SIFT find the corner edge detection, so there is no point in using color images as training samples since they will be converted to gray scale. There is no option of using colors or HSV in these algorithms, unless I compute the keypoints for each channel separately, which is a different area of research (Evaluating Color Descriptors for Object and Scene Recognition).
So, how can I detect and recognize objects based on their color, shape? I think I can use SURF for differentiating objects based on their shape. Say, for instance I have a 2 books and a bottle. I need to only recognize a single book out of the entire objects. But, as soon as there are other similar shaped objects in the scene, SURF gives lots of false positives. I shall appreciate suggestions on what methods to apply for my application.
The local maxima (response of the DoG which is greater (smaller) than responses of the neighbour pixels about the point, upper and lover image in pyramid -- 3x3x3 neighbourhood) forms the coordinates of the feature (circle) center. The radius of the circle is level of the pyramid.
It is Hessian threshold. It means that you would take only maximas (see 1) with values bigger than threshold. Bigger threshold lead to the less number of features, but stability of features is better and visa versa.
Keypoint == feature. In OpenCV Keypoint is the structure to store features.
No, SURF is good for comparison of the textured objects but not for shape and color. For the shape I recommend to use MSER (but not OpenCV one), Canny edge detector, not local features. This presentation might be useful
I have a few doubts about how to approach my goal. I have an outside camera who is recording people and I want to draw an ellipse on every person.
Right now what I do is get the feature points of the people from the frame (I get them using a mask to only have the feature points on the people), set a EM algorithm and train it with my samples (the feature points extracted). The number of clusters is twice the number of people from the image (I get it before start the EM algorithm using other methods such as pixel counting with a codebook).
My question is
(a) Do I have to just train it only for the first frame and then use predict in the following frames? or,
(b) use train with the feature points in every frame?
Right now I am doing the option b) (I don't use predict) because I don't really know how to use the predict.
If I do a), can you help me with it and after that how to draw the ellipses?. If I do b), can you help me drawing an ellipse for every person? Since right know I got different ellipses for the same person using the cov, mean, etc (one for the arm, for example).
What I want to achieve is this paper using the Gaussian model: Link
If you would draw bounding boxes, rather then ellipses, you could use the function groupRectanlges to merge the different bounding boxes.
But, more important - for people detection, you can simply use openCV's person detector (based on HOG) or latent svm detector with the person model.
You should do b) anyway because, otherwise you'll try to match the keypoints to the clusters (persons) in the first frame. After a few seconds this would not be relevant.
It seems reasonable to assume that from frame to frame change is not going to be overwhelming, so reusing the results of the training on frame N-1 is a good seed to train on frame N, likely to converge faster that running EM from scratch on each frame.
in order to draw the ellipses you can leverage from the mixture of gaussian example in the python bindings:
https://github.com/opencv/opencv/blob/master/samples/python/gaussian_mix.py
Note if you use a diagonal covariance matrix, your ellipses are going to be aligned "straight", their own axis aligned with the X and Y axis of the frame, you can skip the calculation of the angle of the ellipse
I want to differentiate between two classes of objects through the differences in the shape of blob(blob is in the form of binary image) using shape descriptors and machine learning .I want to ask if there is any good shape feature which I can use to detect the descriptors for the irregular contour or blob obtained ?
there is a large body of work associated with shape descriptors, these methods work on either the outer edge detected pixels (the boundary) or the full filled-in binary shape. Both approaches rely on making the shape descriptors invariant to translation, rotation and scaling, and some to skew. The classical boundary method is Fourier Descriptors and the classic filled in method is Moment Invariants, both are covered in most good image processing textbooks and are easy to implement with OpenCV.
The answer is very subjective on the kinds of shapes you are looking for. If the contours of the shapes are discriminative enough, you can try shape context. To classify shapes, feed in these features into any classifier -- SVM or random forests for instance.
If the shapes have consistently occuring corners, then you can extract the corners using FAST or SURF, and describe the regions around the corners using SIFT or SURF. In this case, shapes are best recognised by feature matching or bags of words.
in my project I want to crop the ROI of an image. For this I create a map with the regions of interesst. Now I want to crop the area which has the most important pixels (black is not important, white is important).
Has someone an idea how to realize it? I think this is a maximazion problem
The red border in the image below is an example how I want to crop this image
If I understood your question correctly, you have computed a value at every point in the image. These values suggests the "importance"/"interestingness"/"saliency" of each point. The matrix/image containing these values is the "map" you are referring to. Your goal is to get the bounding box for regions of interests (ROI) with high "importance" score.
The way I think you can go about segmenting the ROIs is to apply Graph Cut based segmentation computing a "score" at each pixel using your importance map. The result of the segmentation is a binary mask that masks the "important" pixels. Next, run OpenCV's findcontours function on this binary mask to get the individual connected components. Then use OpenCV's boundingRect function on the contours returned by findContours(...) to get the bounding boxes.
The good thing about using a Graph Cut based segmentation algorithm in this way is that it will join up fragmented components i.e. the resulting binary mask will tend not to have pockets of small holes even if your "importance" map is noisy.
One Graph Cut based segmentation algorithm already implemented in OpenCV is the GrabCut algorithm. A quick hack would be to apply it on your "importance" map to get the binary mask I mentioned above. A more sophisticated approach would be to build the foreground and background (color perhaps?) model using your "importance" map and passing it as input to the function. More details on GrabCut in OpenCV can be found here: http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html?highlight=grabcut#void grabCut(InputArray img, InputOutputArray mask, Rect rect, InputOutputArray bgdModel, InputOutputArray fgdModel, int iterCount, int mode)
If you would like greater flexibility, you can hack your own graphcut based segmentation algorithm using the following MRF library. This library allows you to specify your custom objective function in computing the graph cut: http://vision.middlebury.edu/MRF/code/
To use the MRF library, you will need to specify the "cost" at each point in your image indicating whether that point is "foreground" or "background". You can also think of this dichotomy as "important" or "not important" instead of "foreground" vs "background".
The MRF library's goal is to return you a label at each point such that total cost of assigning those labels is as small as possible. Hence, the game is to come up with a function to compute a small cost for points you consider important and large otherwise.
Specifically, the cost at each point is composed of 2 parts: 1) The data term/function and 2) The smoothness term/function. As mentioned earlier, the smaller the data term at each point, the more likely that point will be selected. If your "importance" score s_ij is in the range [0, 1], then a common way to compute your data term would be -log(s_ij).
The smoothness terms is a way to suggest whether 2 neighboring pixels p, q, should have the same label i.e. both "foreground", "background", or one "foreground" and the other "background". Similar to the data cost, you have to construct it such that the cost is small for neighbor pixels having similar "importance" score so that they will be assigned the same label. This term is responsible for "smoothing" the resulting mask so that you will not have pixels of low "importance" sprinkled within regions of high "importance" and vice versa. If there are such regions, OpenCV's findContours(...) function mentioned above will return contours for these regions, which can be filtered out perhaps by checking their size.
Details on functions to compute the cost can be found in the GrabCut paper: GrabCut
This blog post provides a bit more detail (and code) on creating your own graphcut segmentation algorithm in OpenCV: http://www.morethantechnical.com/2010/05/05/bust-out-your-own-graphcut-based-image-segmentation-with-opencv-w-code/
Another paper showing how to perform graph cut segmentation on grayscale images (your case), with better notations, and without the complicated image matting part (not implemented in OpenCV's version) in the GrabCut paper is this: Graph Cuts and Efficient N-D Image Segmentation
Hope this helps.