I'm curious about which technologies and algorithms the large companies use. I found only that Microsoft uses the PhotoDNA technology, but that is only responsible for comparing photos. I'm also interested in how they automatically detect pornographic images.
For example, do they use any of these methods: skin detection, ROI detection, bag-of-visual-words?
Seminal work in this field was done in Fleck, Margaret M., David A. Forsyth, and Chris Bregler. "Finding Naked People." Computer Vision—ECCV '96, Springer Berlin Heidelberg, 1996, pp. 593-602. The approach detects skin-colored regions and then determines whether or not the regions match predefined human shapes. More on their skin detection algorithm here: http://www.cs.hmc.edu/~fleck/naked-skin.html
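For a rough feel of the skin-colour step, here is a minimal OpenCV sketch using a fixed YCrCb threshold. Note this is not the Fleck/Forsyth model (which works in a log-opponent colour space with texture filtering); it is just a common, simple approximation:

#include <opencv2/imgproc.hpp>

// Simple skin-colour mask via fixed YCrCb thresholds (approximation only,
// not the log-opponent model used in the Fleck/Forsyth paper).
cv::Mat skinMask(const cv::Mat& bgr)
{
    cv::Mat ycrcb, mask;
    cv::cvtColor(bgr, ycrcb, cv::COLOR_BGR2YCrCb);
    // Widely used (but approximate) skin range for the Cr and Cb channels.
    cv::inRange(ycrcb, cv::Scalar(0, 133, 77), cv::Scalar(255, 173, 127), mask);
    // Remove small speckles before reasoning about regions.
    cv::morphologyEx(mask, mask, cv::MORPH_OPEN,
                     cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5)));
    return mask;   // non-zero where the pixel colour falls in the skin range
}

Regions of this mask would then be tested against body-shape models, as the paper describes.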
More recent papers with summaries of current methods are available:
http://iseclab.org/people/cplatzer/papers/sfcs05-platzer.pdf
http://arxiv.org/abs/1402.5792
You may also take a look at: What is the best way to programatically detect porn images?
Update 2016: use a convnet. Convnets are far better at building high-resolution filters. I wrote about it in more detail here:
http://blog.clarifai.com/what-convolutional-neural-networks-see-at-when-they-see-nudity/
https://www.slideshare.net/mobile/RyanCompton1/what-convnets-look-at-when-they-look-at-nudity
I'm doing research for my final project. I want to build object detection and motion classification like Amazon Go. I have read a lot of research, such as object detection with SSD or YOLO and video classification using CNN+LSTM, and I want to propose a training pipeline like this (a rough sketch of steps 1-2 follows the list):
Real-time detection of multiple objects (in my case: persons) with SSD/YOLO
Get the bounding box of each detected object and crop the frame
Feed the cropped frames to a CNN+LSTM model to predict the motion (whether the person is walking or taking items)
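Roughly, for steps 1-2 I imagine something like the following with OpenCV's DNN module (the model file names are placeholders, "person" is assumed to be class 0 as in COCO, and non-maximum suppression is omitted for brevity):

#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Steps 1-2: detect people with a YOLO-style model and crop them out of the frame.
// The net would be loaded once, e.g.:
//   cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov3.cfg", "yolov3.weights"); // placeholder paths
std::vector<cv::Mat> detectAndCropPeople(const cv::Mat& frame, cv::dnn::Net& net)
{
    cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0 / 255.0, cv::Size(416, 416),
                                          cv::Scalar(), /*swapRB=*/true, /*crop=*/false);
    net.setInput(blob);
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());

    std::vector<cv::Mat> crops;
    for (const cv::Mat& out : outs)
        for (int i = 0; i < out.rows; ++i) {
            const float* row = out.ptr<float>(i);   // [cx, cy, w, h, objectness, class scores...]
            float objectness  = row[4];
            float personScore = row[5];             // assuming class 0 == "person"
            if (objectness * personScore < 0.5f) continue;

            int w = int(row[2] * frame.cols), h = int(row[3] * frame.rows);
            int x = int(row[0] * frame.cols - w / 2), y = int(row[1] * frame.rows - h / 2);
            cv::Rect box = cv::Rect(x, y, w, h) & cv::Rect(0, 0, frame.cols, frame.rows);
            if (box.area() > 0)
                crops.push_back(frame(box).clone());  // cropped person, ready for the CNN+LSTM
        }
    return crops;
}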
Is it possible to make this work in a real-time environment?
Or is there a better method for real-time detection and motion classification?
If you want to use this in a real-time application, several other issues must be considered that only show up once the algorithm is deployed in a real environment.
Regarding your proposed three-step method: it could give good results, but only if the first step is very accurate. I think it is better to combine the three steps into one, because the type of motion is itself a good feature for recognising a person, so all the steps can be gathered into a single one.
My idea is as follows:
1. Build a video classification dataset that simply tags the movement of each person or object.
2. Train a CNN-LSTM-based video classification method on it.
This should solve your problem properly.
This answer needs more detail; if you are interested, I can expand on it.
Had pretty much the same problem. Motion prediction does not work that well in complex real-life situations. Here is a simple one:
(See in action)
I'm building a 4K video processing tool (some examples). The current approach looks like the following:
do rough but super fast segmentation
extract bounding box and shape
apply some "meta vision magic"
do precise segmentation within identified area
(See in action)
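To make that pipeline concrete, here is a rough sketch of the rough-segmentation, bounding-box and precise-segmentation steps using standard OpenCV building blocks (background subtraction plus GrabCut). The actual tool uses its own segmentation code, so treat this only as an illustration:

#include <opencv2/imgproc.hpp>
#include <opencv2/video.hpp>
#include <vector>

// Rough segmentation -> bounding box -> precise segmentation within that box.
// bg would be created once with cv::createBackgroundSubtractorMOG2().
cv::Mat segmentMovingObject(const cv::Mat& frame, cv::Ptr<cv::BackgroundSubtractorMOG2>& bg)
{
    // 1. Rough but fast segmentation: background subtraction.
    cv::Mat fgMask;
    bg->apply(frame, fgMask);
    cv::threshold(fgMask, fgMask, 200, 255, cv::THRESH_BINARY);   // drop shadow pixels (value 127)

    // 2. Extract the bounding box of the largest foreground blob.
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(fgMask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    if (contours.empty()) return cv::Mat();
    size_t largest = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;
    cv::Rect box = cv::boundingRect(contours[largest]);

    // 3. Precise segmentation inside the identified area (GrabCut initialised from the box).
    cv::Mat mask(frame.size(), CV_8UC1, cv::Scalar(cv::GC_BGD)), bgModel, fgModel;
    cv::grabCut(frame, mask, box, bgModel, fgModel, 3, cv::GC_INIT_WITH_RECT);
    cv::Mat object = (mask == cv::GC_FGD) | (mask == cv::GC_PR_FGD);
    return object;   // binary mask of the precisely segmented object
}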
As of now, this approach looks far more flexible compared to motion tracking.
"Meta vision" is intended to properly track shape evolution:
(See in action)
Let's compare:
Meta vision disabled
Meta vision enabled
I've been playing with the excellent GPUImage library, which implements several feature detectors: Harris, FAST, Shi-Tomasi, Noble. However, none of those implementations help with the feature extraction and matching part; they simply output a set of detected corner points.
My understanding (which is shaky) is that the next step would be to examine each of those detected corner points and extract a feature from them, which would result in a descriptor - i.e., a 32- or 64-bit number that could be used to index the point near other, similar points.
From reading Chapter 4.1 of [Computer Vision: Algorithms and Applications, Szeliski], I understand that a best-bin-first approach would help to efficiently find neighbouring features to match, etc. However, I don't actually know how to do this and I'm looking for some example code that does it.
I've found this project [https://github.com/Moodstocks/sift-gpu-iphone] which claims to implement as much as possible of the feature extraction in the GPU. I've also seen some discussion that indicates it might generate buggy descriptors.
And in any case, that code doesn't go on to show how the extracted features would be best matched against another image.
My use case is trying to find objects in an image.
Does anyone have any code that does this, or at least a good implementation that shows how the extracted features are matched? I'm hoping not to have to rewrite the whole set of algorithms.
thanks,
Rob.
First, you need to be careful with SIFT implementations, because the SIFT algorithm is patented and the owners of those patents require license fees for its use. I've intentionally avoided using that algorithm for anything as a result.
Finding good feature detection and extraction methods that also work well on a GPU is a little tricky. The Harris, Shi-Tomasi, and Noble corner detectors in GPUImage are all derivatives of the same base operation, and probably aren't the fastest way to identify features.
As you can tell, my FAST corner detector isn't operational yet. The idea there is to use a lookup texture based on a local binary pattern (why I built that filter first to test the concept), and to have that return whether it's a corner point or not. That should be much faster than the Harris, etc. corner detectors. I also need to finish my histogram pyramid point extractor so that feature extraction isn't done in an extremely slow loop on the GPU.
The use of a lookup texture for a FAST corner detector is inspired by this paper by Jaco Cronje on a technique they refer to as BFROST. In addition to using the quick, texture-based lookup for feature detection, the paper proposes using the binary pattern as a quick descriptor for the feature. There's a little more to it than that, but in general that's what they propose.
Feature matching is done by Hamming distance, but while there are quick CPU-side and CUDA instructions for calculating that, OpenGL ES doesn't have one. A different approach might be required there. Similarly, I don't have a good solution for finding a best match between groups of features beyond something CPU-side, but I haven't thought that far yet.
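For reference, the CPU-side computation is just XOR plus a population count. A minimal sketch, assuming the descriptors are 256-bit binary strings packed into four 64-bit words:

#include <cstdint>
#include <bit>   // std::popcount (C++20); use __builtin_popcountll on older compilers

// Hamming distance between two 256-bit binary descriptors,
// each packed into four 64-bit words.
int hammingDistance256(const uint64_t a[4], const uint64_t b[4])
{
    int d = 0;
    for (int i = 0; i < 4; ++i)
        d += std::popcount(a[i] ^ b[i]);   // count differing bits per word
    return d;
}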
It is a primary goal of mine to have this in the framework (it's one of the reasons I built it), but I haven't had the time to work on this lately. The above are at least my thoughts on how I would approach this, but I warn you that this will not be easy to implement.
For object recognition these days (as of a couple of weeks ago), it's best to use TensorFlow / convolutional neural networks for this.
Apple recently added some Metal sample code: https://developer.apple.com/library/content/samplecode/MetalImageRecognition/Introduction/Intro.html#//apple_ref/doc/uid/TP40017385
To do feature detection within an image, I draw your attention to the KAZE/AKAZE algorithms, which work out of the box with OpenCV.
http://www.robesafe.com/personal/pablo.alcantarilla/kaze.html
For iOS, I glued the AKAZE class together with another stitching sample to illustrate.
cv::Ptr<cv::AKAZE> detector = cv::AKAZE::create();   // from <opencv2/features2d.hpp>
std::vector<cv::KeyPoint> keypoints;
detector->detect(mat, keypoints);                    // this will find the keypoints
cv::drawKeypoints(mat, keypoints, mat);              // draw them back onto the image
// one of the resulting keypoints (keypoints[255]), dumped from the debugger; the AKAZE descriptors take the place of SIFT descriptors:
.. [255] = {
pt = (x = 645.707153, y = 56.4605064)
size = 4.80000019
angle = 0
response = 0.00223364262
octave = 0
class_id = 0 }
https://github.com/johndpope/OpenCVSwiftStitch
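To go from detected keypoints to actual matching between two images (the part the question asks about), a minimal CPU-side sketch with AKAZE's binary descriptors and a brute-force Hamming matcher could look like this (the 0.8 ratio threshold is just a common choice, not something prescribed by the library):

#include <opencv2/features2d.hpp>
#include <vector>

// Minimal sketch of matching AKAZE descriptors between two images.
// img1 and img2 are assumed to be already-loaded grayscale cv::Mat images.
std::vector<cv::DMatch> matchAkaze(const cv::Mat& img1, const cv::Mat& img2)
{
    cv::Ptr<cv::AKAZE> akaze = cv::AKAZE::create();
    std::vector<cv::KeyPoint> kp1, kp2;
    cv::Mat desc1, desc2;
    akaze->detectAndCompute(img1, cv::noArray(), kp1, desc1);
    akaze->detectAndCompute(img2, cv::noArray(), kp2, desc2);

    // AKAZE descriptors are binary, so match them with Hamming distance.
    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(desc1, desc2, knn, 2);

    // Lowe-style ratio test to discard ambiguous matches.
    std::vector<cv::DMatch> good;
    for (const auto& m : knn)
        if (m.size() == 2 && m[0].distance < 0.8f * m[1].distance)
            good.push_back(m[0]);
    return good;
}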
Here is a GPU accelerated SIFT feature extractor:
https://github.com/lukevanin/SIFTMetal
The code is written in Swift 5 and uses Metal compute shaders for most operations (scaling, Gaussian blur, keypoint detection and interpolation, feature extraction). The implementation is largely based on the paper and code from the "Anatomy of the SIFT Method" article published in the Image Processing On Line (IPOL) journal in 2014 (http://www.ipol.im/pub/art/2014/82/). Some parts are based on code by Rob Whess (https://github.com/robwhess/opensift), which I believe is now used in OpenCV.
For feature matching I tried using a kd-tree with the best-bin-first (BBF) method proposed by David Lowe. While BBF does provide some benefit up to about 10 dimensions, with a higher number of dimensions such as used by SIFT it is no better than quadratic search, due to the "curse of dimensionality". That is to say, if you compare 1,000 descriptors against 1,000 other descriptors, it still ends up making 1,000 x 1,000 = 1,000,000 comparisons - the same as brute-force pairwise comparison.
In the linked code I use a different approach optimised for performance over accuracy. I use a trie to locate the general vicinity for potential neighbours, then search a fixed number of sibling leaf nodes for the nearest neighbours. In practice this matches about 50% of the descriptors, but only makes 1000 * 20 = 20,000 comparisons - about 50x faster and scales linearly instead of quadratically.
I am still testing and refining the code. Hopefully it helps someone.
I wonder how we evaluate feature detection/extraction methods (SIFT, SURF, MSER, ...) for object detection and tracking tasks such as pedestrians, lane vehicles, etc. Are there standard metrics for comparison? I have read blog posts like http://computer-vision-talks.com/2011/07/comparison-of-the-opencvs-feature-detection-algorithms-ii/ and some research papers like this. The problem is that the more I learn, the more confused I get.
It is very hard to evaluate feature detectors per se, because features are only computational artifacts, not things that you are actually searching for in images. Feature detectors do not make sense outside their intended context, which, for the descriptors you mention, is affine-invariant matching of image parts.
The very first usage of SIFT, SURF and MSER was multi-view reconstruction and automatic 3D reconstruction pipelines. Thus, these features are usually assessed through the quality of the 3D reconstruction or image-part matching that they provide. Roughly speaking, you have a pair of images related by a known transform (an affine transform or a homography) and you measure the difference between the estimated homography (from the feature detector) and the real one.
This is also the method used in the blog post that you quote by the way.
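As a concrete sketch of that measurement (the RANSAC setting and the 20-pixel evaluation grid are arbitrary choices of mine), one can estimate a homography from the matched keypoints and compare it to the ground-truth one by the average reprojection difference:

#include <opencv2/calib3d.hpp>
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

// Score a detector/descriptor by how well the homography estimated from its
// matches agrees with the ground-truth homography H_gt of the image pair.
// pts1/pts2 are the matched keypoint locations in image 1 and image 2.
double homographyError(const std::vector<cv::Point2f>& pts1,
                       const std::vector<cv::Point2f>& pts2,
                       const cv::Mat& H_gt, const cv::Size& imageSize)
{
    cv::Mat H_est = cv::findHomography(pts1, pts2, cv::RANSAC);
    if (H_est.empty()) return -1.0;   // estimation failed (too few / too bad matches)

    // Compare the two homographies by the mean reprojection difference
    // over a grid of points covering the first image.
    std::vector<cv::Point2f> grid, byEst, byGt;
    for (int y = 0; y < imageSize.height; y += 20)
        for (int x = 0; x < imageSize.width; x += 20)
            grid.emplace_back((float)x, (float)y);
    cv::perspectiveTransform(grid, byEst, H_est);
    cv::perspectiveTransform(grid, byGt, H_gt);

    double err = 0.0;
    for (size_t i = 0; i < grid.size(); ++i)
        err += std::hypot(byEst[i].x - byGt[i].x, byEst[i].y - byGt[i].y);
    return err / grid.size();   // lower is better
}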
In order to assess the practical interest of a detector (and not only its precision in an ideal multi-view pipeline), some additional measurements of stability (under geometric and photometric changes) were added: does the number of detected features vary, does the quality of the estimated homography vary, etc.
Incidentally, it happens that these detectors may also work (although it was not their design purpose) for object detection and tracking (in tracking-by-detection cases). In this case, their performance is classically evaluated on more-or-less standardized image datasets, and typically expressed in terms of precision (probability of a correct answer, linked to the false alarm rate) and recall (probability of finding an object when it is present). You can read, for example, Wikipedia on this topic.
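For concreteness, both numbers come directly from counts of correct detections, false alarms and misses:

// precision = fraction of reported detections that are correct,
// recall    = fraction of ground-truth objects that were found.
struct DetectionScore { double precision; double recall; };

DetectionScore score(int truePositives, int falsePositives, int falseNegatives)
{
    DetectionScore s;
    s.precision = truePositives / double(truePositives + falsePositives);
    s.recall    = truePositives / double(truePositives + falseNegatives);
    return s;
}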
Addendum: what exactly do I mean by incidentally?
Well, as written above, SIFT and the like were designed to match planar and textured image parts. This is why you always see examples with similar images from a graffiti dataset.
Their extension to detection and tracking was then developed in two different ways:
While doing multi-view matching (with a spherical rig), Furukawa and Ponce built a kind of 3D locally-planar object model, which they then applied to object detection in the presence of severe occlusions. This work exploits the fact that an interesting object is often locally planar and textured;
Other people developed a less original (but still efficient in good conditions) approach by assuming that they have a query image of the object to track. Individual frame detections are then performed by matching (using SIFT, etc.) the template image with the current frame. This exploits the facts that there are few false matches with SIFT, that objects are usually observed at a distance (hence are usually almost planar in images) and that they are textured. See for example this paper.
I am currently working on a project where I have to extract the facial expression of a user (only one user at a time from a webcam) like sad or happy.
The best possibility I found so far:
I used OpenCV for face detection.
A user on an OpenCV board suggested looking into AAM (active appearance models) and ASM (active shape models), but all I found were papers.
So I'm using Active Shape Models with Stasm, which gives me access to 77 different points (landmarks) within the face, as in the picture.
Now I want to know:
What is the best learning method to use on the Cohn-Kanade database to classify the emotions (happy, ...)?
What is the best method to classify the facial expressions in a video in real time?
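One baseline I am considering (a rough sketch only; it assumes the 77 Stasm landmarks of each training image are flattened into a single 154-value row, with the emotion label taken from the Cohn-Kanade annotations) is a plain SVM via OpenCV's ml module:

#include <opencv2/ml.hpp>
#include <opencv2/core.hpp>

// Rough baseline: one row per training image, 77 landmarks -> 154 floats,
// label = emotion id from the Cohn-Kanade annotations.
cv::Ptr<cv::ml::SVM> trainBaseline(const cv::Mat& samples /* CV_32F, N x 154 */,
                                   const cv::Mat& labels  /* CV_32S, N x 1  */)
{
    cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);
    svm->train(samples, cv::ml::ROW_SAMPLE, labels);
    return svm;
}

// For the real-time part: feed the landmarks of each captured frame to the trained model.
int predictEmotion(const cv::Ptr<cv::ml::SVM>& svm,
                   const cv::Mat& landmarkRow /* 1 x 154, CV_32F */)
{
    return static_cast<int>(svm->predict(landmarkRow));
}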
Look here for a similar solution video and a description of the algorithm: http://www2.isr.uc.pt/~pedromartins/ - see "Identity and Expression Recognition on Low Dimensional Manifolds", 2009.
I have a basic understanding of image processing and am now studying the "Digital Image Processing" book by Gonzalez in depth.
Given an image, and knowing the approximate form of the object of interest (e.g. circle, triangle),
what is the best algorithm/method to find this object in the image?
The object can be slightly deformed, so a brute-force approach will not help.
You may try using Histograms of Oriented Gradients (also called Edge Orientation Histograms). We have used them for detecting road signs. http://en.wikipedia.org/wiki/Histogram_of_oriented_gradients and the papers by Bill Triggs should get you started.
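A minimal sketch of computing a HOG descriptor for a candidate window with OpenCV (the window/block/cell sizes here are just common defaults scaled to a 64x64 patch, not values from our road-sign work):

#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Compute a HOG descriptor for a 64x64 patch; the resulting vector can then be
// fed to a classifier (e.g. an SVM) trained on positive/negative examples of the shape.
std::vector<float> hogFeatures(const cv::Mat& gray /* CV_8U */)
{
    cv::Mat patch;
    cv::resize(gray, patch, cv::Size(64, 64));
    cv::HOGDescriptor hog(cv::Size(64, 64),   // window size
                          cv::Size(16, 16),   // block size
                          cv::Size(8, 8),     // block stride
                          cv::Size(8, 8),     // cell size
                          9);                 // orientation bins
    std::vector<float> descriptor;
    hog.compute(patch, descriptor);
    return descriptor;
}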
I recommend you use the Hough transform, which allows you to find any pattern described by an equation. What's more, the Hough transform also works well for slightly deformed objects.
The algorithm and implementation itself is quite simple.
More details can be found here: http://en.wikipedia.org/wiki/Hough_transform ; even source code for this algorithm is included on a referenced page (http://www.rob.cs.tu-bs.de/content/04-teaching/06-interactive/HNF.html).
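For example, if the approximate form is a circle, a rough OpenCV sketch would be (all parameter values are illustrative and need tuning for your images):

#include <opencv2/imgproc.hpp>
#include <vector>

// Find roughly circular objects with the Hough circle transform.
std::vector<cv::Vec3f> findCircles(const cv::Mat& gray /* CV_8U */)
{
    cv::Mat blurred;
    cv::GaussianBlur(gray, blurred, cv::Size(9, 9), 2.0);    // suppress noise first

    std::vector<cv::Vec3f> circles;                          // (x, y, radius) per circle
    cv::HoughCircles(blurred, circles, cv::HOUGH_GRADIENT,
                     1,        // accumulator resolution (same as image)
                     30,       // minimum distance between circle centres
                     100, 40,  // Canny high threshold, accumulator threshold
                     10, 100); // min / max radius in pixels
    return circles;
}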
I hope that helps you.
I would look at your problem in two steps:
first finding your object's outer boundary:
I'm assuming your image has enough contrast that you can easily threshold it to get a binary image of your object. You then need to extract the object's boundary chain code.
then analyzing the boundary's shape to deduce the form (circle, polygon,...):
You can calculate the curvature at each point of the boundary chain and thus determine how many sharp angles (i.e. high curvature values) there are in your shape. Several sharp angles mean you have a polygon; none means you have a circle (constant curvature).
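As a rough sketch of both steps with OpenCV (using a polygonal approximation to count sharp corners instead of computing curvature explicitly, which is a simplification on my part):

#include <opencv2/imgproc.hpp>
#include <vector>

// Extract the boundary of the largest blob in a binary image, then count sharp
// corners via a polygonal approximation. Many corners -> polygon; few or none -> circle.
int countSharpCorners(const cv::Mat& binary /* CV_8U, object in white */)
{
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(binary, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_NONE);
    if (contours.empty()) return 0;

    // Keep the largest contour as the object boundary.
    size_t largest = 0;
    for (size_t i = 1; i < contours.size(); ++i)
        if (cv::contourArea(contours[i]) > cv::contourArea(contours[largest]))
            largest = i;

    // Approximate the boundary with a polygon; the tolerance (here 2% of the
    // perimeter) controls what counts as a "sharp" corner.
    std::vector<cv::Point> poly;
    double eps = 0.02 * cv::arcLength(contours[largest], true);
    cv::approxPolyDP(contours[largest], poly, eps, true);
    return static_cast<int>(poly.size());
}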
You can find a description on how to get your object's boundary from the binary image and ways of analysing it in Gonzalez's Digital Image Processing, chapter 11.
I also found this insightful presentation on binary image analysis (PPT) and a MATLAB script that implements some of the techniques that Gonzalez talks about in DIP.
I strongly recommend you use OpenCV; it's a great computer vision library that helps greatly with anything related to computer vision. Their website isn't really attractive or helpful, but the API is really powerful.
A book that helped me a lot, since there isn't a load of documentation on the web, is Learning OpenCV. The documentation that comes with the API is good, but not great for learning how to use it.
Related to your problem, you could use a Canny edge detector to find the border of your item and then analyse it, or you could proceed with a Hough transform to search for lines and/or circles.
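A minimal sketch of the Canny step (the thresholds are illustrative and need tuning):

#include <opencv2/imgproc.hpp>

// Edge map of a grayscale image; feed the result to contour analysis or a Hough transform.
cv::Mat edgeMap(const cv::Mat& gray /* CV_8U */)
{
    cv::Mat blurred, edges;
    cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 1.5);  // reduce noise before edge detection
    cv::Canny(blurred, edges, 50, 150);                    // low / high hysteresis thresholds
    return edges;
}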
You could specifically look into 'face recognition', since that is a specific topic in its own right, as opposed to plain 'face detection' etc. EmguCV can be useful for you; it is a .NET wrapper for the Intel OpenCV image processing library.
It looks like professor Jean Rouat from the University of Sherbrooke has found a way to find objects in images by processing them with a spiking neural network. His technology, named RN-SPIKES, seems to be available for licensing.