Difference between feature detection and object detection - opencv

I know that most common object detection involves Haar cascades and that there are many techniques for feature detection such as SIFT, SURF, STAR, ORB, etc... but if my end goal is to recognizes objects doesn't both ways end up giving me the same result? I understand using feature techniques on simple shapes and patterns but for complex objects these feature algorithms seem to work as well.
I don't need to know the difference in how they function but whether or not having one of them is enough to exclude the other. If I use Haar cascading, do I need to bother with SIFT? Why bother?
thanks
EDIT: for my purposes I want to implement object recognition on a broad class of things. Meaning that any cups that are similarly shaped as cups will be picked up as part of class cups. But I also want to specify instances, meaning a NYC cup will be picked up as an instance NYC cup.

Object detection usually consists of two steps: feature detection and classification.
In the feature detection step, the relevant features of the object to be detected are gathered.
These features are input to the second step, classification. (Even Haar cascading can be used
for feature detection, to my knowledge.) Classification involves algorithms
such as neural networks, K-nearest neighbor, and so on. The goal of classification is to find
out whether the detected features correspond to features that the object to be detected
would have. Classification generally belongs to the realm of machine learning.
Face detection, for example, is an example of object detection.
EDIT (Jul. 9, 2018):
With the advent of deep learning, neural networks with multiple hidden layers have come into wide use, making it relatively easy to see the difference between feature detection and object detection. A deep learning neural network consists of two or more hidden layers, each of which is specialized for a specific part of the task at hand. For neural networks that detect objects from an image, the earlier layers arrange low-level features into a many-dimensional space (feature detection), and the later layers classify objects according to where those features are found in that many-dimensional space (object detection). A nice introduction to neural networks of this kind is found in the Wolfram Blog article "Launching the Wolfram Neural Net Repository".

Normally objects are collections of features. A feature tends to be a very low-level primitive thing. An object implies moving the understanding of the scene to the next level up.
A feature might be something like a corner, an edge etc. whereas an object might be something like a book, a box, a desk. These objects are all composed of multiple features, some of which may be visible in any given scene.

Invariance, speed, storage; few reasons, I can think on top of my head. The other method to do would be to keep the complete image and then check whether the given image is similar to glass images you have in your database. But if you have a compressed representation of the glass, it will need lesser computation (thus faster), will need lesser storage and the features tells you the invariance across images.
Both the methods you mentioned are essentially the same with slight differences. In case of Haar, you detect the Haar features then you boost them to increase the confidence. Boosting is nothing but a meta-classifier, which smartly chooses which all Harr features to be included in your final meta-classification, so that it can give a better estimate. The other method, also more or less does this, except that you have more "sophisticated" features. The main difference is that, you don't use boosting directly. You tend to use some sort of classification or clustering, like MoG (Mixture of Gaussian) or K-Mean or some other heuristic to cluster your data. Your clustering largely depends on your features and application.
What will work in your case : that is a tough question. If I were you, I would play around with Haar and if it doesn't work, would try the other method (obs :>). Be aware that you might want to segment the image and give some sort of a boundary around for it to detect glasses.

Related

Detecting object in an image

currently I am struggling in implementing an algorithm to locate an object in an image. Suppose I have 100 training sets that all have a cat in there and each has the correct coordinate of each cat. My first idea is to create a fixed-sized square and traverse through the image. For each collection of pixels contained in the square, we can use it as a point for Support Vector Machine algorithm.
The problem is that I am not sure how to do that since usually, each point represents each class (has object or has no object) and it usually has a simple d features, while for this case, it has a dx3 matrix as its features (each features has RGB value).
A simple help will be welcome, thanks!
If I have understood your question well, applying machine learning for image Processing and computer vision is a bit different from Other kinds of problems. main difference is that you should somehow overcome the issue of locality and scale. does all kitties always appear in the specific coordinate (x,y) ?! of course not! they can be anywhere in the scene. so how is it possible to give a specific point to SVM for an object? it will not be generalized at all. This is the reason almost all basic operation in computer vision has something to do with convolution operation to extract features independent from their location. A pixel alone carries zero useful information, you need to analyse groups of pixels. there are 2 approach you can take:
classic methods:
use OpenCV to perform noise removal, edge detection, feature extraction using methods like SIFT and feed those features to a model like SVM, not the raw unprocessed pixels. feature extraction means to reach from d features to k more meaningful representation of inputs where usually (k < d) if not always.
Deep Learning:
Convolutional Neural networks(CNNs) Have shed lights on many Computer vision tasks which were far beyond reach until recently and more importantly with frameworks like Keras and tensorflow most problems in computer vision is just programming tasks to be honest and doesn't require much knowledge as one needed before. because (CNNs) extract features themselves and you don't need to do the feature engineering anymore which requires a well educated and knowledgeable person on the task.
so, choose whatever method you see fit for kitty detection =^.^= .

Is deep learning the only way to detect humans in a picture?

I'm looking for a way to detect humans in a picture. For instance, regarding the picture below, I'd like to coarsely determine how many people are in the scene. I must be able to detect both standing and sitting people. I do not mind not detecting people located behind a physical object (such as the glass in the bus picture).
AFAIK, such a problem can rather easily be solved by training deep neural networks. However, my coworkers would like me to also implement a detection technique based on general image processing techniques. I've spent several days looking for techniques designed by researchers but I couldn't find anything else than saliency-based techniques (which may be fine, but I'd like to test several techniques based on old-fashioned image processing).
I'd like to mention that I'm not new to the topic of image segmentation & I used to segment aortas in medical scans. However, this task was easier IMHO since scanners have similar features: in this use-case (human detection in a bus, for instance), the pictures will have very different characteristics (e.g. image contrast can strongly vary, whether it's been taken during the day or at night).
Long story short, I'd like to know if there's some segmentation technique for human detection for which it'd be interesting giving a shot, given the fact that the images features vary a lot?
Is deep learning the only way to detect humans in a picture?
No. Is it the best way we know? Depends on your conditions.
The simplest way of detection is to generate lots of random bounding boxes and then solving the classification problem of the crop. Here is some pythonic pseudo-code:
def detect_people(image):
"""
Find all people in image.
Parameters
----------
image : image object
Returns
-------
people : list of axis-aligned bounding boxes (aabb)
Each bounding box contains a person
"""
people = []
for aabb in generate_random_aabb(image):
crop = crop_image(image, aabb)
if is_person(crop):
people.append(crop)
return people
In this case is_person can be any classifier, e.g. boosted decision stumps as used in the Viola–Jones object detection framework. Speaking of which: That would likely be the way to go without DL, but is much more complicated to explain.
Object Detection vs Segmentation
Your question mixes both. Object detection gives you bounding boxes (coarse) for instances. Semantic segmentation labels all pixels by classes, but does not distinguish different instances of the same class (e.g. different people). Instance segmentation is like object detection, but is fine-grained and aims for pixel-exact results.
If you are interested in segemantation, I can recommend my paper: A Survey of Semantic Segmentation

Implementing Face Recognition using Local Descriptors (Unsupervised Learning)

I'm trying to implement a face recognition algorithm using Python. I want to be able to receive a directory of images, and compute pair-wise distances between them, when short distances should hopefully correspond to the images belonging to the same person. The ultimate goal is to cluster images and perform some basic face identification tasks (unsupervised learning).
Because of the unsupervised setting, my approach to the problem is to calculate a "face signature" (a vector in R^d for some int d) and then figure out a metric in which two faces belonging to the same person will indeed have a short distance between them.
I have a face detection algorithm which detects the face, crops the image and performs some basic pre-processing, so the images i'm feeding to the algorithm are gray and equalized (see below).
For the "face signature" part, I've tried two approaches which I read about in several publications:
Taking the histogram of the LBP (Local Binary Pattern) of the entire (processed) image
Calculating SIFT descriptors at 7 facial landmark points (right of mouth, left of mouth, etc.), which I identify per image using an external application. The signature is the concatenation of the square root of the descriptors (this results in a much higher dimension, but for now performance is not a problem).
For the comparison of two signatures, I'm using OpenCV's compareHist function (see here), trying out several different distance metrics (Chi Square, Euclidean, etc).
I know that face recognition is a hard task, let alone without any training, so I'm not expecting great results. But all I'm getting so far seems completely random. For example, when calculating distances from the image on the far right against the rest of the image, I'm getting she is most similar to 4 Bill Clintons (...!).
I have read in this great presentation that it's popular to carry out a "metric learning" procedure on a test set, which should significantly improve results. However it does say in the presentation and elsewhere that "regular" distance measures should also get OK results, so before I try this out I want to understand why what I'm doing gets me nothing.
In conclusion, my questions, which I'd love to get any sort of help on:
One improvement I though of would be to perform LBP only on the actual face, and not the corners and everything that might insert noise to the signature. How can I mask out the parts which are not the face before calculating LBP? I'm using OpenCV for this part too.
I'm fairly new to computer vision; How would I go about "debugging" my algorithm to figure out where things go wrong? Is this possible?
In the unsupervised setting, is there any other approach (which is not local descriptors + computing distances) that could work, for the task of clustering faces?
Is there anything else in the OpenCV module that maybe I haven't thought of that might be helpful? It seems like all the algorithms there require training and are not useful in my case - the algorithm needs to work on images which are completely new.
Thanks in advance.
What you are looking for is unsupervised feature extraction - take a bunch of unlabeled images and find the most important features describing these images.
The state-of-the-art methods for unsupervised feature extraction are all based on (convolutional) neural networks. Have look at autoencoders (http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity) or Restricted Bolzmann Machines (RBMs).
You could also take an existing face detector such as DeepFace (https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf), take only feature layers and use distance between these to group similar faces together.
I'm afraid that OpenCV is not well suited for this task, you might want to check Caffe, Theano, TensorFlow or Keras.

Difference between Low-Level and High-Level Feature Detection/ Extraction

According to this Wikipedia article Feature Extraction examples for Low-Level algorithms are Edge Detection, Corner Detection etc.
But what are High-Level algorithms?
I only found this quote from the Wikipedia article Feature Detection (computer vision):
Occasionally, when feature detection is computationally expensive and there are time constraints, a higher level algorithm may be used to guide the feature detection stage, so that only certain parts of the image are searched for features.
Could you give an example of one of these higher level algorithms?
There isn't a clear cut definition out there, but my understanding of "high-level" algorithms are more in tune with how we classify objects in real life. For low-level feature detection algorithms, these are mostly concerned with finding corresponding points between images, or finding those things that classify as something even remotely interesting at the lowest possible level you can think of - things like finding edges or lines in an image (in addition to finding interesting points of course). In addition, anything dealing with pixel intensities or colours directly is what I would consider low-level too.
High-level algorithms are mostly in the machine learning domain. These algorithms are concerned with the interpretation or classification of a scene as a whole. Things like body pose classification, face detection, classification of human actions, object detection and recognition and so on. These algorithms are concerned with training a system to recognize or classify something, then you provide it some unknown input that it has never seen before and its job is to either determine what is happening in the scene, or locate a region of interest where it detects an action that the system is trained to look for. This latter fact is probably what the Wikipedia article is referring to. You would have some sort of pre-processing stage where you have some high-level system that determines salient areas in the scene where something important is happening. You would then apply low-level feature detection algorithms in this localized area.
There is a great high-level computer vision workshop that talks about all of this, and you can find the slides and code examples here: https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/teaching/courses/ss-2019-high-level-computer-vision/
Good luck!
High-level features are something that we can directly see and recognize, like object classification, recognition, segmentation and so on. These are usually the goal of CV research, which is always based on 'low-level' features and algorithms.
Two of them are used in machine specially x-ray machine
Concerned Scene as a whole and edges of lines to help soft ware of machine to take good decision.
I think we should not confuse with high-level features and high-level inference. To me, high-level features mean shape, size, or a combination of low-level features etc. are the high-level features. While classification is the decision made based on the high-level features.

machine learning - svm feature fusion techique

for my final thesis i am trying to build up an 3d face recognition system by combining color and depth information. the first step i did, is to realign the data-head to an given model-head using the iterative closest point algorithm. for the detection step i was thinking about using the libsvm. but i dont understand how to combine the depth and the color information to one feature vector? they are dependent information (each point consist of color (RGB), depth information and also scan quality).. what do you suggest to do? something like weighting?
edit:
last night i read an article about SURF/SIFT features i would like to use them! could it work? the concept would be the following: extracting this features out of the color image and the depth image (range image), using each feature as a single feature vector for the svm?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (One-against-all or One-against-many).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example you can concatenate and do PCA (or some non linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for data in vision that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say transform all features to be in the range [-1,1] or [0,1])

Resources