I have a slight confusion differentiating between object recognition and object detection. Some people say object detection is a sub-topic of object recognition. Can someone clarify the difference between these two topics?
To the best of my knowledge.
Object Recognition is responding to the question "What is the object in the image"
Whereas,
Object detection is answering the question "Where is that object?"
I hope someone can illustrate the difference, ideally with an example of each.
There is not a clear answer to this in the literature and many authors give these two terms different meanings or use them interchangeably, depending on the application. If I remember correctly, Szeliski in "Computer vision: Algorithms and applications" defines them in a way similar to this:
Object detection: to notice there is an object in an image and to know where it is in the image. So, you can outline the object but you may not know what object it is.
Object (or instance) recognition: to actually say what object you have detected, perhaps providing additional information, such as where the object is located in 3D space.
In some applications, such as recognizing an object in order to grasp it with a robotic arm, recognition is just a verification step done after detection: if you are not able to recognize the object, you cannot verify the detection and you discard it (because it may be a false positive). For this reason "detection" and "recognition" are sometimes used to mean the same task.
Object recognition - which object is in the given image (which contains an object alone).
Object detection - which object is in the given image (which depicts a scene containing more than one object and is generally taken without constraints of background or view point) and where is it located.
If we treat faces as a subset of objects, then face detection is detecting a face in an image, and face recognition is recognizing that face as, for example, Angelina Jolie.
I'm doing research for my final project: I want to build object detection and motion classification like Amazon Go. I have read a lot of research on object detection with SSD or YOLO and on video classification using CNN+LSTM, and I want to propose a training pipeline like this (a rough sketch is given after the list):
Real-time detection of multiple objects (in my case: persons) with SSD/YOLO
Get the object's bounding box and crop the frame
Feed the cropped frames into a CNN+LSTM to predict the motion (whether the person is walking or taking items)
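A rough sketch of this pipeline; `detect_persons` and `motion_model` are hypothetical placeholders for an SSD/YOLO detector and a trained CNN+LSTM classifier, so only the glue code is shown:

```python
import cv2  # OpenCV for video capture and cropping

SEQ_LEN = 16  # number of cropped frames fed to the CNN+LSTM at once (assumed)

def run_pipeline(video_path, detect_persons, motion_model):
    cap = cv2.VideoCapture(video_path)
    buffers = {}  # person_id -> list of cropped frames
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Step 1: real-time person detection (SSD/YOLO) -> list of (id, (x, y, w, h))
        for person_id, (x, y, w, h) in detect_persons(frame):
            # Step 2: crop the frame to the person's bounding box
            crop = cv2.resize(frame[y:y + h, x:x + w], (112, 112))
            buffers.setdefault(person_id, []).append(crop)
            # Step 3: once enough frames are buffered, classify the motion
            if len(buffers[person_id]) == SEQ_LEN:
                label = motion_model.predict(buffers[person_id])  # e.g. "walking" / "taking item"
                print(person_id, label)
                buffers[person_id] = []
    cap.release()
```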
Is it possible to make this work in a real-time environment?
Or is there a better method for real-time detection and motion classification?
If you want to use it in a real-time application, several other things must be considered that do not show up until the algorithm is deployed in a real environment.
About your proposed 3-step method: it could already work well, but the first step would have to be very accurate. I think it is better to combine the three steps into one, because the motion pattern of a person is itself a good feature for that person.
My idea is as follows:
1. a video classification dataset that labels only the movement of the person or object
2. a CNN+LSTM based video classification method
This should cover your project properly; a minimal sketch of such a model follows.
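A minimal sketch of such a CNN+LSTM video classifier with tf.keras; the clip length, frame size, layer sizes and number of classes are illustrative assumptions:

```python
import tensorflow as tf

NUM_CLASSES = 2  # e.g. walking vs. taking an item (assumed labels)

# Input: a clip of 16 RGB frames of 112x112 pixels.
frames = tf.keras.Input(shape=(16, 112, 112, 3))

# Per-frame CNN feature extractor, applied to every frame with TimeDistributed.
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
])
x = tf.keras.layers.TimeDistributed(cnn)(frames)

# The LSTM aggregates per-frame features over time into one clip-level prediction.
x = tf.keras.layers.LSTM(64)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(frames, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```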
This answer needs more detail; if you are interested, I can expand on it.
Had pretty much the same problem. Motion prediction does not work that well in complex real-life situations, even on fairly simple examples.
I'm building a 4K video-processing tool. The current approach looks like the following (a rough sketch is given after the list):
do rough but super fast segmentation
extract bounding box and shape
apply some "meta vision magic"
do precise segmentation within identified area
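A rough sketch of such a coarse-to-fine pass with OpenCV (OpenCV 4 API assumed); the Otsu threshold and the GrabCut refinement are illustrative stand-ins, not the author's actual "meta vision" step:

```python
import cv2
import numpy as np

def coarse_to_fine(frame, scale=0.25):
    # 1. rough but fast segmentation on a downscaled copy
    small = cv2.resize(frame, None, fx=scale, fy=scale)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 2. bounding box of the largest blob, scaled back to full resolution
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    rect = (int(x / scale), int(y / scale), int(w / scale), int(h / scale))

    # 3./4. precise segmentation only inside the identified area (GrabCut here)
    gc_mask = np.zeros(frame.shape[:2], np.uint8)
    bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
    cv2.grabCut(frame, gc_mask, rect, bgd, fgd, 3, cv2.GC_INIT_WITH_RECT)
    return np.where((gc_mask == cv2.GC_FGD) | (gc_mask == cv2.GC_PR_FGD),
                    255, 0).astype(np.uint8)
```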
As of now the approach looks far more flexible compared to motion tracking.
"Meta vision" intended to properly track shape evolution:
(See in action)
A side-by-side comparison with meta vision disabled versus enabled makes the difference clear.
I am quite new to the area of facial expression recognition and currently I'm doing research on it with deep learning, specifically CNNs. I have some questions about preparing and/or preprocessing my data.
I have segmented videos of frontal facial expressions (e.g. 2-3 seconds video of a person expressing a happy emotion based on his/her annotations).
Note: the expressions displayed by my participants are of quite low intensity (not exaggerated; closer to micro-expressions).
General question: how should I prepare my data for training with a CNN (I am leaning toward using a deep learning library, TensorFlow)?
Question 1: I have read some deep learning based facial expression recognition (FER) papers that suggest taking the peak of the expression (most probably a single image) and using that image as part of the training data. How would I know the peak of an expression? What is my basis? If I take a single image, wouldn't some important frames capturing the subtlety of my participants' expressions be lost?
Question 2: Or would it also be correct to process the segmented video in OpenCV in order to detect (e.g. with Viola-Jones), crop and save the faces per frame, and use those images as part of my training data with their appropriate labels? I'm guessing some of the face frames will be redundant. However, since we know the participants in our data show low-intensity expressions (micro-expressions), some movements of the face could also be important. A rough sketch of this per-frame cropping is below.
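For reference, the per-frame cropping described in Question 2 could look roughly like this with OpenCV's Haar cascade (Viola-Jones); the paths, output size and file naming are assumptions:

```python
import cv2
import os

# Viola-Jones face detector shipped with OpenCV.
cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_faces(video_path, out_dir, label):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:
            face = cv2.resize(gray[y:y + h, x:x + w], (96, 96))
            # the file name encodes the expression label for later training
            cv2.imwrite(os.path.join(out_dir, f"{label}_{idx:05d}.png"), face)
            idx += 1
    cap.release()
```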
I would really appreciate anyone who can answer, thanks a lot!
As #unique monkey already pointed out, this is generally a supervised learning task. If you wish to extract an independent "peak" point, I recommend that you scan the input images and find the one in each sequence whose reference points deviate most from the subject's resting state.
If you didn't get a resting state, then how are the video clips cropped? For instance, were the subjects told to make the expression and hold it? What portion of the total expression (before, express, after) does the clip cover? Take one or both endpoints of the video clip; graph the movements of the reference points from each end, and look for a frame in which the difference is greatest, but then turns toward the other endpoint.
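A small sketch of that "peak frame" heuristic, assuming the facial landmarks have already been extracted per frame (e.g. with dlib); only the deviation-and-argmax step is shown:

```python
import numpy as np

def peak_frame_index(landmarks_per_frame):
    # landmarks_per_frame: array of shape (num_frames, num_points, 2)
    landmarks = np.asarray(landmarks_per_frame, dtype=float)
    rest = landmarks[0]  # assumed resting state (first frame)
    # mean Euclidean displacement of every landmark w.r.t. the resting frame
    deviation = np.linalg.norm(landmarks - rest, axis=2).mean(axis=1)
    return int(np.argmax(deviation))
```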
Answer 1: Commonly we rely on human judgment to decide which frame is the peak of the expression (I think you can tell the difference between a smile and a laugh).
Answer 2: If you want good results, I suggest you do not treat the data as crudely as that method does.
In reference manual of OpenCV 2.4.3 KeyPoint::class_id is described as "object id that can be used to clustered keypoints by an object they belong to".
Because of my limited knowledge about keypoints, I can't understand the purpose of class_id.
My other question is: using a feature detector, descriptor and matcher we can detect the matching keypoints of a query object in a training image, but how can we segment out the query object in the training image? Can we use the GrabCut or watershed algorithm? If yes, how?
Answer of either question will be helpful.
Thanks in advance...
The class_id member field can be used, in practice, to store any information that you find useful for each keypoint. As the documentation says, you can store, for example, the id of a detected object.
For example, you have an image, extract keypoints from it (e.g. with SURF) and run some object detector with these features; the result is that each input feature now carries in class_id the id of the corresponding object, or -1 if it is attached to no object. I don't actually know whether this field is filled by any OpenCV function or whether it is just for your own use.
Regarding your second question, matching features may not be enough to segment out your object, because features can be located inside the object, so that you don't get any information about the outline. This is a good starting point, though, if you have more information about the object. For example, if you know that your object is planar (a poster on the wall), you can use the feature matches to compute a homography between the input and the training images. This would give you an outline. Or, for example, a large collection of close matched features may be an indication of the presence of an object. From that, you may try some other color segmentation, edge matching, etc.
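As a rough illustration of the homography idea for a planar object (not the exact code from any tutorial), a sketch using ORB instead of SURF (SURF sits in OpenCV's non-free module) could look like this; the image paths are placeholders:

```python
import cv2
import numpy as np

query = cv2.imread("query_object.png", cv2.IMREAD_GRAYSCALE)
scene = cv2.imread("training_image.png", cv2.IMREAD_GRAYSCALE)

# detect and describe keypoints in both images
orb = cv2.ORB_create(1000)
kp1, des1 = orb.detectAndCompute(query, None)
kp2, des2 = orb.detectAndCompute(scene, None)

# match descriptors and keep the best matches
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:50]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# project the query image corners into the scene to get the object outline
h, w = query.shape
corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]]).reshape(-1, 1, 2)
outline = cv2.perspectiveTransform(corners, H)
cv2.polylines(scene, [np.int32(outline)], True, 255, 3)
```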
I have found some short description about the class_id of a keypoint.
OpenCV says that:
class_id is object ID, that can be used to cluster keypoints by an object they belong to.
Here is the link.
Hope this is somewhat helpful.
I'm trying to do object detection jobs using OpenCV, but something confuses me. Tracking and prediction algorithms like CamShift and Kalman filters can fulfill the task of tracking, while SURF matching methods can also do that.
I don't quite understand the difference between the two approaches. I have done some coding based on the feature2d (using SURF) and motion_analysis_and_object_tracking (using CamShift) OpenCV tutorials. It seems like they're just two means to one end. Am I right, or am I missing some concept?
And is it a good idea to combine CamShift tracking with SURF feature matching? Maybe more techniques can be applied, like contour matching?
The short answer is (a rough sketch follows the list):
Detect the object of interest using keypoints (SURF) or any other approach.
Get the bounding rectangle of the object and pass it as input to an object tracker (e.g. CamShift).
Run the object tracker until the object is lost, then detect again.
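A rough sketch of that detect-then-track loop with OpenCV's CamShift; `detect_object` is a hypothetical placeholder for your SURF-based (or any other) detector returning a bounding box:

```python
import cv2
import numpy as np

def track(video_path, detect_object):
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    x, y, w, h = detect_object(frame)          # step 1: detection
    track_window = (x, y, w, h)                # step 2: bounding rect for the tracker

    # build a hue histogram of the detected region for CamShift back-projection
    roi = frame[y:y + h, x:x + w]
    hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    term = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)

    while True:                                # step 3: track until the object is lost
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        ret, track_window = cv2.CamShift(back_proj, track_window, term)
        pts = cv2.boxPoints(ret).astype(np.int32)
        cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cap.release()
```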
Object tracking is the process of finding the position of an object using information from previous frames. The difference between tracking and detection is that, while both processes localize the position of the object, detection does not use any information from previous frames to localize it.
Look at "Object Tracking: A Survey" by Alper Yilmaz, Omar Javed and Mubarak Shah. This paper contains comprehensive overview of detection and tracking techniques.
I have a basic understanding of image processing and am now studying the "Digital Image Processing" book by Gonzalez in depth.
Given an image, and knowing the approximate form of the object of interest (e.g. circle, triangle),
what is the best algorithm/method to find this object in the image?
The object can be slightly deformed, so a brute-force approach will not help.
You may try using Histograms of Oriented Gradients (also called Edge Orientation Histograms). We have used them for detecting road signs. http://en.wikipedia.org/wiki/Histogram_of_oriented_gradients and the papers by Bill Triggs should get you started.
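For a quick start, a minimal sketch of computing a HOG descriptor with OpenCV could look like this; the image path and the default 64x128 window size are assumptions, and classification on top of the descriptor (e.g. an SVM trained on your shapes) is up to you:

```python
import cv2

img = cv2.imread("road_sign.png", cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, (64, 128))      # match OpenCV's default HOG window size

hog = cv2.HOGDescriptor()             # default: 9 bins, 8x8 cells, 16x16 blocks
descriptor = hog.compute(img)         # flat feature vector of edge-orientation histograms
print(descriptor.size)                # 3780 values for the default parameters
```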
I recommend the Hough transform, which allows you to find any pattern described by an equation. What's more, the Hough transform also works well for slightly deformed objects.
The algorithm and implementation itself is quite simple.
More details can be found here: http://en.wikipedia.org/wiki/Hough_transform ; source code for the algorithm is even included on the referenced page (http://www.rob.cs.tu-bs.de/content/04-teaching/06-interactive/HNF.html).
I hope that helps you.
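For circles specifically, a minimal OpenCV sketch of the Hough transform might look like this; the image path and the parameter values are illustrative:

```python
import cv2
import numpy as np

img = cv2.imread("scene.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)        # smoothing reduces spurious votes

circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=40,
                           param1=100, param2=30, minRadius=10, maxRadius=100)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        cv2.circle(img, (x, y), r, (0, 255, 0), 2)  # draw each detected circle

# cv2.HoughLinesP works analogously for straight edges, and
# cv2.createGeneralizedHoughBallard() handles arbitrary shape templates.
```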
I would look at your problem in two steps:
first finding your object's outer boundary:
I'm assuming your image has enough contrast that you can easily threshold it to get a binary image of your object. You then need to extract the object boundary as a chain code.
then analyzing the boundary's shape to deduce the form (circle, polygon,...):
You can calculate the curvature in each point of the boundary chain and thus determine how many sharp angles (i.e. high curvature value) there are in your shape. Several sharp angles means you have a polygon, none means you have a circle (constant curvature).
You can find a description on how to get your object's boundary from the binary image and ways of analysing it in Gonzalez's Digital Image Processing, chapter 11.
I also found this insightful presentation on binary image analysis (PPT) and a MATLAB script that implements some of the techniques that Gonzalez discusses in DIP.
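A related quick check in OpenCV, in the same spirit as the curvature analysis above, is to approximate the boundary with a polygon and count its vertices; a small sketch (OpenCV 4 API assumed, thresholds illustrative):

```python
import cv2

def classify_shape(binary_image):
    contours, _ = cv2.findContours(binary_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    # tolerance proportional to the perimeter, so slight deformation is ignored
    eps = 0.02 * cv2.arcLength(contour, True)
    approx = cv2.approxPolyDP(contour, eps, True)
    corners = len(approx)
    if corners == 3:
        return "triangle"
    if corners == 4:
        return "quadrilateral"
    return "circle-like" if corners > 8 else f"polygon with {corners} corners"
```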
I strongly recommend OpenCV; it's a great computer vision library that helps greatly with anything related to computer vision. Their website isn't really attractive or helpful, but the API is really powerful.
A book that helped me a lot since there isn't a load of documentation on the web is Learning OpenCV. The documentation that comes with the API is good, but not great for learning how to use it.
Related to your problem, you could use a Canny edge detector to find the border of your item and then analyse it, or you could proceed with a Hough transform to search for lines and/or circles.
You could specifically look into "face recognition", since that is a specific, well-studied topic, as opposed to the more general "face detection", etc. EmguCV can be useful for you; it is a .NET wrapper for the Intel OpenCV image processing library.
It looks like professor Jean Rouat from the University of Sherbrooke has found a way to find objects in images by processing them with spiking neural networks. His technology, named RN-SPIKES, seems to be available for licensing.