Detect missing object using image comparison - opencv

I have an app that needs to take two photos of a room and detect if one of the objects in the first photos is missing in the second photo. How can it do it using OpenCV or any other tool?

If your object of interest has well known structure - i.e. it's an augmented reality marker, then this task is well described and almost trivial to implement. If you don't have a priori knowledge about the missing object, the only thing you can do is to compare 2 photos i.e. like in panorama stitching which can be done with SIFT or SURF (please read further about these). If the object is more complicated and if both the images were taken with similar field of view and the model of the missing object is known a priori (i.e. you have a picture of the object of interest and ONLY it), you can use SIFT or SURF to detect features both in the a priori model image and in the input image (the one where you want to find the missing object), perform feature matching and look for the densest cluster of matching features (i.e. use particle swarm optimization algorithm). You need to be aware that this won't work in every case and might give you quite a lot of false-positives. Disturbances such as the object looking different from every side, differences in lighting, partial occlusions etc will easily break this model. This approach does not require dewarping of the image etc. since SIFT is scale, translation and rotation invariant.
There are plenty of more robust methods for this, including deep learning (look for scientific papers on object detection), but the one with feature matching seems to be the easiest thing to try first.

Related

Is deep learning the only way to detect humans in a picture?

I'm looking for a way to detect humans in a picture. For instance, regarding the picture below, I'd like to coarsely determine how many people are in the scene. I must be able to detect both standing and sitting people. I do not mind not detecting people located behind a physical object (such as the glass in the bus picture).
AFAIK, such a problem can rather easily be solved by training deep neural networks. However, my coworkers would like me to also implement a detection technique based on general image processing techniques. I've spent several days looking for techniques designed by researchers but I couldn't find anything else than saliency-based techniques (which may be fine, but I'd like to test several techniques based on old-fashioned image processing).
I'd like to mention that I'm not new to the topic of image segmentation & I used to segment aortas in medical scans. However, this task was easier IMHO since scanners have similar features: in this use-case (human detection in a bus, for instance), the pictures will have very different characteristics (e.g. image contrast can strongly vary, whether it's been taken during the day or at night).
Long story short, I'd like to know if there's some segmentation technique for human detection for which it'd be interesting giving a shot, given the fact that the images features vary a lot?
Is deep learning the only way to detect humans in a picture?
No. Is it the best way we know? Depends on your conditions.
The simplest way of detection is to generate lots of random bounding boxes and then solving the classification problem of the crop. Here is some pythonic pseudo-code:
def detect_people(image):
"""
Find all people in image.
Parameters
----------
image : image object
Returns
-------
people : list of axis-aligned bounding boxes (aabb)
Each bounding box contains a person
"""
people = []
for aabb in generate_random_aabb(image):
crop = crop_image(image, aabb)
if is_person(crop):
people.append(crop)
return people
In this case is_person can be any classifier, e.g. boosted decision stumps as used in the Viola–Jones object detection framework. Speaking of which: That would likely be the way to go without DL, but is much more complicated to explain.
Object Detection vs Segmentation
Your question mixes both. Object detection gives you bounding boxes (coarse) for instances. Semantic segmentation labels all pixels by classes, but does not distinguish different instances of the same class (e.g. different people). Instance segmentation is like object detection, but is fine-grained and aims for pixel-exact results.
If you are interested in segemantation, I can recommend my paper: A Survey of Semantic Segmentation

Making a trained model (machine learning) from 3D models

i have a database with almost 20k 3D files, they are drawings from machine parts designed in a CAD software (solid works). Im trying to build a trained model from all of this 3D models, so i can build a 3D object Recognition App when someone can take a picture from one of this parts (in the real world) and the app can provide useful information about material , size , treatment and so on.
If anyone already do something similar, any information you can provide me would be greatly appreciated!
Some ideas:
1) Several pictures: instead of only one. As Rodrigo commented and Brad Larson tried to circumvent with his method, the problem with the user taking only one picture for the input is that you are necessarily lacking information to make a triangulation and form a point cloud in 3D. With 4 pictures taken from a slightly different angle, you can already reconstruct parts of the object. Comparing point clouds would make the endeavor much easier for any ML algorithm, Neuronal Networks (NN), Support Vector Machine (SVM) or others. A common standard to create point clouds is ASTM E2807, which uses the e57 file format.
On the downside a 3D vision algorithm might be heavy on the user's device, and is not the easiest to implement.
2) Artificial picture training: By training on pre-computed artificial pictures like Brad Larson suggested, you take over much of the computation, to the user's benefit. Be aware that you should probably use "features" extracted from the pictures, not the complete picture, both to train and to classify. The problem with this method is that you might be very sensitive to lighting and background context. You should take care to produce CAD pictures that have the same lightning conditions for all objects, so that the classifier doesn't overfit certain aspects of the "pictures" that do not belong to the object.
This aspect is where solution 1) is much more stable, it is less sensitive to the visual context.
3) Scale: The size of your object is an important descriptor. You should thus add scale information to your object descriptor before training. You could ask the user to take pictures with a reference object. Alternatively you can ask the user to make a rule-of-thumb estimate of the object size ("What are the approximate dimensions of the object, in [cm]?"). Providing size could make your algorithm significantly faster and more accurate.
If your test data in production is mainly images of the 3D object, then the method in the comment section by Brad Larson is the better approach and it is also easier to implement and takes a lot less effort and resources to get it up and running.
However if you want to classify between 3D models there are existing networks which exist to classify 3D point clouds. You will have to convert these models to point clouds and use them as training samples. One of those and which I have used is Voxnet. I also suggest you to add more variations to the training data like different rotations of the 3D model.
You can used Pre-Trained 3D Deep Neural Networks as there are many networks that could help you in your work and would produce high accuracy.

Medical image processing feature descriptors

I'm looking for local and global descriptors for medical image processing. I know about SIFT/SURF/GLOH/HOG, that are mainly applied to computer vision problems, but I would like to know if they are also applied to medical images to describe features or if there are specific descriptors in this field.
I would really appreciate any hint.
Thanks in advance,
Federico
If you want to use standard SIFTs for the multimodal matching, you have to a adjust it a little bit - make it invariant to image inversion.
There is a good paper about it by Kelman et.al "Keypoint Descriptors for Matching Across Multiple Image Modalities and Non-linear Intensity Variations"
There are also more special descriptors for multimodal matching, see "An efficient approach for robust multimodal retinal image registration based on UR-SIFT features and PIIFD descriptors" by Ghassabi et.al.
I assumed you need the descriptors for matching.
I'd personally submitted a poster submission and got it accepted for using SIFT as part of the feature detection and matching framework that my work was intended to do.
The feature detection methods you mentioned are good for general images and will work as a good general initial input for your framework, too. Now, since every anatomical region and every modality lives in its own feature domain(ie. brain regions done by MR, live regions done by CT, they all probably imply distinctive landmarks); its best that you first identify what it is unique in your or near your target anatomical region and then see if the aforementioned algorithms would locate your distinctive features(distinctive enough that it has to be in your region and no where else), then find ways to differentiate from the bag of features(that get detected along with your distinctive features). And the result sets would be the key features/descriptors that you would like to keep.
So, Yes, many feature detection algorithms have been extensively used for various areas in medical imaging.

image segmentation techniques

I am working on a computer vision application and I am stuck at a conceptual roadblock. I need to recognize a set of logos in a video, and so far I have been using feature matching methods like SIFT (and ASIFT by Yu and Morel), SURF, FERNS -- basically everything in the "Common Interfaces of Generic Descriptor Matchers" section of the OpenCV documentation. But recently I have been researching methods used in OCR/Random Trees classifier (I was playing with this dataaset: http://archive.ics.uci.edu/ml/datasets/Letter+Recognition) and thinking that this might be a better way to go about finding the logos. The problem is that I can't find a reliable way to automatically segment an arbitrary image.
My questions:
Should I bother looking into methods other than descriptor/keypoint, or is this the
best way to recognize a typical logo (stylized, few colors, sharp edges)?
How can I segment an arbitary image (or a video frame, in my case) so that I can properly
match against a sample database?
It would seem that HaarCascades work in a similar way (databases of samples), but I
can't figure out how the processes are related. Is there segmentation going on there?
Sorry of these questions are too broad. I'm trying to wrap my head around this stuff with little help. Thanks!
It seems like segmentation is not what you want. I think it has to do more with object detection and recognition. You want to detect the presence of a certain set of logos, in a certain set of images. This doesn't seem related to segmentation which is about labeling surfaces or areas of a common color, texture, shape, etc., although examining segmentation based methods may be useful.
I would definitely encourage you to look at problem and examine all possible methods that can be applied, not only the fashionable ones (such as SIFT, GLOH, SURF, etc). I would recommend you look at older, simpler methods like simple template matching, chamfering, etc.
Haar cascades became popular after a 2000 paper by Viola and Jones used for face detection (similar to what you see in modern point and click cameras). It does sound a bit similar to the problem you are interested in. You should perhaps also examine this part of the problem, but try not to focus too much on the learning part.

A good method for detecting the presence of a particular feature in an image

I have made a videochat, but as usual, a lot of men like to ehm, abuse the service (I leave it up to you to figure the nature of such abuse), which is not something I endorse in any way, nor do most of my users. No, I have not stolen chatroulette.com :-) Frankly, I am half-embarassed to bring this up here, but my question is technical and rather specific:
I want to filter/deny users based on their video content when this content is of offending character, like user flashing his junk on camera. What kind of image comparison algorithm would suit my needs?
I have spent a week or so reading some scientific papers and have become aware of multiple theories and their implementations, such as SIFT, SURF and some of the wavelet based approaches. Each of these has drawbacks and advantages of course. But since the nature of my image comparison is highly specific - to deny service if a certain body part is encountered on video in a range of positions - I am wondering which of the methods will suit me best?
Currently, I lean towards something along the following (Wavelet-based plus something I assume to be some proprietary innovations):
http://grail.cs.washington.edu/projects/query/
With the above, I can simply draw the offending body part, and expect offending content to be considered a match based on a threshold. Then again, I am unsure whether the method is invariable to transformations and if it is, to what kind - the paper isn't really specific on that.
Alternatively, I am thinking that a SURF implementation could do, but I am afraid that it could give me false positives. Can such implementation be trained to recognize/give weight to specific feature?
I am aware that there exist numerous questions on SURF and SIFT here, but most of them are generic in that they usually explain how to "compare" two images. My comparison is feature specific, not generic. I need a method that does not just compare two similar images, but one which can give me a rank/index/weight for a feature (however the method lets me describe it, be it an image itself or something else) being present in an image.
Looks like you need not feature detection, but object recognition, i.e. Viola-Jones method.
Take a look at facedetect.cpp example shipped with OpenCV (also there are several ready-to-use haarcascades: face detector, body detector...). It also uses image features, called Haar Wavelets. You might be interested to use color information, take a look at CamShift algorithm (also available in OpenCV).
This is more about computer vision. You have to recognize objects in your image/video sequence, whatever... for that, you can use a lot of different algorithms (most of them work in the spectral domain, that's why you will have to use a transformation).
In order to be accurate, you will also need a knowledge base or, at least, some descriptors that will define the object.
Try OpenCV, it has some algorithms already implemented (and basic descriptors included).
There are applications/algorithms out there that you can "train" (like neural networks) and are able to identify objects based on the training. Most of them (at least, the good ones) are not very popular and can only be found in research groups specialized in computer vision, object recognition, AI, etc.
Good luck!

Resources