I came across top down and bottom up approaches while reading this paper
"https://arxiv.org/abs/1611.08050" on image processing.
I got a vague about top down approach from this paragraph:
"top-down approach: apply a separately trained human detector (based on object detection techniques such as the ones we discussed before), find each person, and then run pose estimation on every detection."
But couldn't understand bottom up approach from this:
"bottom-up approaches recognize human poses from pixel-level image evidence directly. They can solve both problems above: when you have information from the entire picture you can distinguish between the people, and you can also decouple the runtime from the number of people on the frameā¦ at least theoretically."
Please help me understand these concepts. Thank you.
Both paragraph's are from this blog : "https://medium.com/neuromation-blog/neuronuggets-understanding-human-poses-in-real-time-b73cb74b3818"
There are two person in the picture. all human has 15 joint(key point)
Top-down approach
find two bounding boxes including each person
estimate human joint(15 key-point) per each bounding box
In this example, Top-down approach need pose estimation twice.
Bottom-up approach
estimate all human joint(30 key-point) in the picture
classify which joint(15 key-point) are included in the same person
In this example, pose estimator doesn't care how many people are in the picture. they only consider how they can classify each joint to the each person.
In general situation, Top-down approach consume time much more than Bottom-up, because Top-down approach need N-times pose estimation by person detector results.
Related
My xmas holiday project this year was to build a little Android app, which should be able to detect arbitrary Euro coins in a picture, recognize their value and sum the values up.
My assumptions/requirements for the picture for a good recognition are
uniform background
picture should be roughly the size of a DinA4 paper
coins may not overlap, but may touch each other
number-side of the coins must be up/visible
My initial thought was, that for the coin value-recognition later it would be best to first detect the actual coins/their regions in the picture. Any recognition then would run on only these regions of the picture, where actual coins are found.
So the first step was to find circles. This i have accomplished using this OpenCV 3 pipeline, as suggested in several books and SO postings:
convert to gray
CannyEdge detection
Gauss blurring
HoughCircle detection
filtering out inner/redundant circles
The detection works rather successfully IMHO, here a picture of the result:
Coins detected with HoughCircles with blue border
Up to the recognition now for every found coin!
I searched for solutions to this problem and came up with
template matching
feature detection
machine learning
The template matching seems very inappropriate for this problem, as the coins can be arbitrary rotated with respect to a template coin (and the template matching algorithm is not rotation-invariant! so i would have to rotate the coins!).
Also pixels of the template coin will never exactly match those of the region of the formerly detected coin. So any algorithm computing the similarity will produce only poor results, i think.
Then i looked into feature detection. This seemed more appropriate to me. I detected the features of a template-coin and the candidate-coin picture and drew the matches (combination of ORB and BRUTEFORCE_HAMMING). Unfortunately the features of the template-coin were also detected in the wrong candidate coins.
See the following picture, where the template or "feature" coin is on the left, a 20 Cents coin. To the right there are the candidate coin, where the left-most coin is a 20 Cents coin. I actually expected this coin to have the most matches, unfortunately not. So again, this seems not to be a viable way to recognize the value of coins.
Feature-matches drawn between a template coin and candidate coins
So machine learning is the third possible solution. From university i still now about neural networks, how they work, etc. Unfortunately my practical knowledge is rather poor AND i don't know Support Vector Machines (SVM) at all, which is the machine learning supported by OpenCV.
So my question is actually not source-code related, but more how to setup the learning process.
Should i learn on the plain coin-images or should i first extract features and learn on the features? (i think: features)
How much positives and negatives per coin should be given?
Would i have to learn also on rotated coins or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if i only trained it on non-rotated coins?
One of my picture-requirements above ("DinA4") limits the size of the coin to a certain size, e.g. 1/12 of the picture-height. Should i learn on coins of roughly the same size or different sizes? I think, that different sizes would result in different features, which would not help the learning process, what do you think?
Of course, if you have a different possible solution, this is also welcome!
Any help is appreciated! :-)
Bye & Thanks!
Answering your questions:
1- Should i learn on the plain coin-images or should i first extract features and learn on the features? (i think: features)
For many object classification tasks it's better to extract the features first and then train a classifier using a learning algorithm. (e.g the features can be HOG and the learning algorithm can be something like SVM or Adaboost). It's mainly due to the fact that the features have more meaningful information compared to the pixel values. (They can describe edges,shapes, texture, etc.) However, the algorithms like deep learning will extract the useful features as a part of learning procedure.
2 - How much positives and negatives per coin should be given?
You need to answer this question depending on the variation in the classes you want to recognize and the learning algorithm you use. For SVM , if you use HOG features and want to recognize specific numbers on coins you won't need much.
3- Would i have to learn also on rotated coins or would this rotation be handled "automagically" by the SVM? So would the SVM recognize rotated coins, even if i only trained it on non-rotated coins?
Again it depends on your final decision about the features(not SVM which is the learning algorithm) you're going to choose. HOG features are not rotation invariant but there are features like SIFT or SURF which are.
4-One of my picture-requirements above ("DinA4") limits the size of the coin to a certain size, e.g. 1/12 of the picture-height. Should i learn on coins of roughly the same size or different sizes? I think, that different sizes would result in different features, which would not help the learning process, what do you think?
Again, choose your algorithm , some of them ask you for a fixed/similar width/height ratio. You can find out about the specific requirements in related papers.
If you decide to use SVM take a look at this and also if you feel ok with Neural Network, using Tensorflow is a good idea.
I'm searching for algorithms/methods that are used to classify or differentiate between two outdoor environments. Given an image with vehicles, I need to be able to detect whether the vehicles are in a natural desert landscape, or whether they're in the city.
I've searched but can't seem to find relevant work on this. Perhaps because I'm new at computer vision, I'm using the wrong search terms.
Any ideas? Is there any work (or related) available in this direction?
I'd suggest reading Prince's Computer Vision: Models, Learning, and Inference (free PDF available). It covers image classification, as well as many other areas of CV. I was fortunate enough to take the Machine Vision course at UCL which the book was designed for and it's an excellent reference.
Addressing your problem specifically, a simple MAP or MLE model on pixel colours will probably provide a reasonable benchmark. From there you could look at more involved models and feature engineering.
Seemingly complex classifications similar to "civilization" vs "nature" might be able to be solved simply with the help of certain heuristics along with classification based on color. Like Gilevi said, city scenes are sure to contain many flat lines and right angles, while desert scenes are dominated by rolling dunes and so on.
To address this directly, you could use OpenCV's hough - lines algorithm on the images (tuned for this problem of course) and look at:
a) how many lines are fit to the image at a given threshold
b) of the lines that are fit what is the expected angle between two of them; if the angles are uniformly distributed then chances are its nature, but if the angles are clumped up around multiples of pi/2 (more right angles and straight lines) then it is more likely to be a cityscape.
Color components, textures, and degree of smoothness(variation or gradient of image) may differentiate the desert and city background. You may also try Hough transform, which is used for line detection that can be viewed as city feature (building, road, bridge, cars,,,etc).
I would recommend you this research very similar with your project. This article presents a comparison of different classification techniques to obtain the scene classifier (urban, highway, and rural) based on images.
See my answer here: How to match texture similarity in images?
You can use the same method. I already solved in the past problems like the one you described with this method.
The problem you are describing is that of scene categorization. Search for works that use the SUN database.
However, you only working with two relatively different categories, so I don't think you need to kill yourself implementing state-of-the-art algorithms. I think taking GIST features + color features and training a non-linear SVM would do the trick.
Urban environments is usually characterized with a lot of horizontal and vertical lines, GIST captures that information.
At the end of the introduction to this instructive kaggle competition, they state that the methods used in "Viola and Jones' seminal paper works quite well". However, that paper describes a system for binary facial recognition, and the problem being addressed is the classification of keypoints, not entire images. I am having a hard time figuring out how, exactly, I would go about adjusting the Viola/Jones system for keypoint recognition.
I assume I should train a separate classifier for each keypoint, and some ideas I have are:
iterate over sub-images of a fixed size and classify each one, where an image with a keypoint as center pixel is a positive example. In this case I'm not sure what I would do with pixels close to the edge of the image.
instead of training binary classifiers, train classifiers with l*w possible classes (one for each pixel). The big problem with this is that I suspect it will be prohibitively slow, as every weak classifier suddenly has to do l*w*original operations
the third idea I have isn't totally hashed out in my mind, but since the keypoints are each parts of a greater part of a face (left, right center of an eye, for example), maybe I could try to classify sub-images as just an eye, and then use the left, right, and center pixels (centered in the y coordinate) of the best-fit subimage for each face-part
Is there any merit to these ideas, and are there methods I haven't thought of?
however, that paper describes a system for binary facial recognition
No, read the paper carefully. What they describe is not face specific, face detection was the motivating problem. The Viola Jones paper introduced a new strategy for binary object recognition.
You could train a Viola Jones style Cascade for eyes, another for a nose, and one for each keypoint you are interested in.
Then, when you run the code - you should (hopefully) get 2 eyes, 1 nose, etc, for each face.
Provided you get the number of items you expected, you can then say "here are the key points!" What takes more work is getting enough data to build a good detector for each thing you want to detect, and gracefully handling false positives / negatives.
I ended up working on this problem extensively. I used "deep learning," aka several layers of neural networks. I used convolutional networks. You can learn more about them by checking out these demos:
http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
http://deeplearning.net/tutorial/lenet.html#lenet
I made the following changes to a typical convolutional network:
I did not do any down-sampling, as any loss of precision directly translates to a decrease in the model's score
I did n-way binary classification, with each pixel being classified as a keypoint or non-keypoint (#2 in the things I listed in my original post). As I suspected, computational complexity was the primary barrier here. I tried to use my GPU to overcome these issues, but the number of parameters in the neural network were too large to fit in GPU memory, so I ended up using an xl amazon instance for training.
Here's a github repo with some of the work I did:
https://github.com/cowpig/deep_keypoints
Anyway, given that deep learning has blown up in popularity, there are surely people who have done this much better than I did, and published papers about it. Here's a write-up that looks pretty good:
http://danielnouri.org/notes/2014/12/17/using-convolutional-neural-nets-to-detect-facial-keypoints-tutorial/
Information:
I would like to use OpenCV's HOG detection to identify objects that can be seen in a variety of orientations. The only problem is, I can't seem to find a reasonable feature detector or classifier to detect this in a rotation and scale invaraint way (as is needed by objects such as forearms).
Prior Work:
Lets focus on forearms for this discussion. A forearm can have multiple orientations, the primary distinct features probably being its contour edges. It is possible to have images of forearms that are pointing in any direction in an image, thus the complexity. So far I have done some in depth research on using HOG descriptors to solve this problem, but I am finding that the variety of poses produced by forearms in my positives training set is producing very low detection scores in actual images. I suspect the issue is that the gradients produced by each positive image do not produce very consistent results when saved into the Histogram. I have reviewed many research papers on the topic trying to resolve or improvie this, including the original from Dalal & Triggs [Link]: http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf It also seems that the assumptions made for detecting whole humans do not necessary apply to detecting individual features (particularly the assumption that all humans are standing up seems to suggest HOG is not a good route for rotation invariant detection like that of forearms).
Note:
If possible, I would like to steer clear of any non-free solutions such as those pertaining to Sift, Surf, or Haar.
Question:
What is a good solution to detecting rotation and scale invariant objects in an image? Particularly for this example, what would be a good solution to detecting all orientations of forearms in an image?
I use hog to detect human heads and shoulders. To train particular part you have to give the location of it. If you use opencv, you can clip samples containing only the training part you want, and make sure all training samples share the same size. For example, I clip images to contain only head and shoulder and resize all them to 64x64. Other opensource codes may require you to pass the location as the input parameter, essentially the same.
Are you trying the Discriminatively trained deformable part model ?http://www.cs.berkeley.edu/~rbg/latent/
you may find answers there.
Algorithm for a drawing and painting robot -
Hello
I want to write a piece of software which analyses an image, and then produces an image which captures what a human eye perceives in the original image, using a minimum of bezier path objects of varying of colour and opacity.
Unlike the recent twitter super compression contest (see: stackoverflow.com/questions/891643/twitter-image-encoding-challenge), my goal is not to create a replica which is faithful to the image, but instead to replicate the human experience of looking at the image.
As an example, if the original image shows a red balloon in the top left corner, and the reproduction has something that looks like a red balloon in the top left corner then I will have achieved my goal, even if the balloon in the reproduction is not quite in the same position and not quite the same size or colour.
When I say "as perceived by a human", I mean this in a very limited sense. i am not attempting to analyse the meaning of an image, I don't need to know what an image is of, i am only interested in the key visual features a human eye would notice, to the extent that this can be automated by an algorithm which has no capacity to conceptualise what it is actually observing.
Why this unusual criteria of human perception over photographic accuracy?
This software would be used to drive a drawing and painting robot, which will be collaborating with a human artist (see: video.google.com/videosearch?q=mr%20squiggle).
Rather than treating marks made by the human which are not photographically perfect as necessarily being mistakes, The algorithm should seek to incorporate what is already on the canvas into the final image.
So relative brightness, hue, saturation, size and position are much more important than being photographically identical to the original. The maintaining the topology of the features, block of colour, gradients, convex and concave curve will be more important the exact size shape and colour of those features
Still with me?
My problem is that I suffering a little from the "when you have a hammer everything looks like a nail" syndrome. To me it seems the way to do this is using a genetic algorithm with something like the comparison of wavelet transforms (see: grail.cs.washington.edu/projects/query/) used by retrievr (see: labs.systemone.at/retrievr/) to select fit solutions.
But the main reason I see this as the answer, is that these are these are the techniques I know, there are probably much more elegant solutions using techniques I don't now anything about.
It would be especially interesting to take into account the ways the human vision system analyses an image, so perhaps special attention needs to be paid to straight lines, and angles, high contrast borders and large blocks of similar colours.
Do you have any suggestions for things I should read on vision, image algorithms, genetic algorithms or similar projects?
Thank you
Mat
PS. Some of the spelling above may appear wrong to you and your spellcheck. It's just international spelling variations which may differ from the standard in your country: e.g. Australian standard: colour vs American standard: color
There is an model that can implemented as an algorithm to calculate a saliency map for an image, determining which parts of the image would get the most attention from a human.
The model is called itti koch model
You can find a startin paper here
And more resources and c++ sourcecode here
I cannot answer your question directly, but you should really take a look at artist/programmer (Lisp) Harold Cohen's painting machine Aaron.
That's quite a big task. You might be interested in image vectorizing (don't know what it's called officially), which is used to take in rasterized images (such as pictures you take with a camera) and outputs a set of bezier lines (i think) that approximate the image you put in. Since good algorithms often output very high quality (read: complex) line sets you'd also be interested in simplification algorithms which can help enormously.
Unfortunately I am not next to my library, or I could reccomend a number of books on perceptual psychology.
The first thing you must consider is the physiology of the human eye is such that when we examine an image or scene, we are only capturing very small bits at a time, as our eyes dart around rapidly. Our mind peices the different parts together to try and form a whole.
You might start by finding an algorithm for the path of an eyeball as it darts around. Perhaps it is attracted to contrast?
Next is that our eyes adjust the "exposure" depending on the context. It's like those high dynamic range images, if they were peiced together not by multiple exposures of a whole scene, but by many small images, each balanced on its own, but blended into its surroundings to form a high dynamic range.
Now there was a finding in a monkey brain that there is a single neuron that lights up if there's a diagonal line in the upper left of its field of vision. Similar neurons can be found for vertical lines, and horizontal lines in various areas of that monkey's field of vision. The "diagonalness" determines the frequency with which that neuron fires.
one might speculated that other neurons might be found and mapped to other qualities such as redness, or texturedness, and other things.
There's something humans can do that I've not seen a computer program ever able to do. it's something called "closure", where a human is able to fill in information about something that they are seeing, that doesn't actually exist in the image. an example:
*
* *
is that a triangle? If you knew that it was in advance, then you could probably make a program to connect the dots. But what if it's just dots? How can you know? I wouldn't attempt this one unless I had some really clever way of dealing with that one.
There are many other facts about human perception you might be able to use. Good luck, you've not picked a straightforward task.
i think a thing that could help you in this enormous task is human involvement. i mean data. like you could have many people sitting staring at random dots (like from the previous post) and connect them as they see right. you could harness that data.