Let's consider some 2D signals (amplitude over time).
I have a library of snippets that have specific 'shapes'. For the sake of the example, let's assume a square wave, a triangle wave and a sawtooth wave, but in practice, they're a lot more complex.
And I have a complex signal, such as an audio recording.
What would be the best way to train a system to spot elements from the library inside the complex signal, knowing that:
The library shape may be present at a different frequency.
The library shape may be present at a different amplitude.
Multiple shapes can be overlapping.
There is noise all over.
What I would like to recover is:
Which shape was recognized.
How close is it to the reference signal.
Where does that shape fit in the complex signal (position, frequency, amplitude range).
Bonus problem: since I'm looking for the shape itself, it may be stretched over time in a non linear fashion.
I draw a quick picture to illustrate:
As an example here, I drew 3 basic shapes for my library and I overlapped a few at different places on an audio signal.
What would be the best approach to solve this problem?
I would lean toward training a classifier to recognize the shapes, but I am not sure this is the right approach, nor really how practical it would be with this kind of data which has frequencies well distributed all across a wide range (50hz to 15khz).
If you do not have a labeled data set, I would probably try a simplified convolutional autoencoder.
If the base shapes that you want to recognize are fixed, I would use
a single hidden layer (the bottleneck) where the kernel functions are set to base shapes. The values of the neurons at the bottleneck would tell you which shapes are detected and where.
Related
I am new to AI/ML and am trying to use the same for solving the following problem.
I have a set of (custom) images which while having common characteristics also will have a unique pattern/signature and color value. What set of algorithms should I use to have the pass in following manner:
1. Recognize the common characteristic (like presence of a triangle at any position in a 10x10mm image). If present, proceed, else exit.
2. Identify the unique pattern/signature to identify each image individually. The pattern/signature could be shape (visible to human eye or hidden like using an overlay shape using background image with no boundaries).
3. Store color tone/hue/saturation to determine any loss/difference (maybe because the capture source is different from the original one).
While this is in way similar to face recognition algo, for me saturation/shadow will matter while being direction independent.
I figure that using CNN may be the way to go for step#2 and SVN for step#1, any input on training, specifics will be appreciated. What about step#3, use BGR2HSV? The objective is to use ML/AI and not get into machine-vision.
Recognize the common characteristic (like presence of a triangle at any position in a 10x10mm image). If present, proceed, else exit.
In a sense, what you want is a classifier that can detect patterns in an image. However, we can train classifiers to detect certain types of patterns in images.
For example, I can train a classifier to recognise squares and circles, but if I show it an image with a triangle in it, I cannot expect it to tell me it is a triangle, because it has never seen it before. The downside is, your classifier will end up misclassifying it as one of the shapes it knows to exist: either square or circle. The upside is, you can prevent this.
Identify the unique pattern/signature to identify each image individually.
What you want to do is train a classifier on a large amount of labelled data. If you want the classifier to detect squares, circles, or triangles in an image, you must train it with a large amount of labelled images of squares, circles and triangles.
Store color tone/hue/saturation to determine any loss/difference (maybe because the capture source is different from the original one).
Now, you are leaving the territory of simple image labelling and entering the world of computer vision. This is not as simple as a vanilla image classifier, but it is possible and there are a lot of online tools to help you do this. For example, you may take a look at OpenCV. They have an implementation in python and C++.
I figure that using CNN may be the way to go for step#2 and SVN for
step#1
You can combine step 1 and step 2 with a Convolutional Neural Network (CNN). You do not need to use a two step prediction process. However, beware, if you pass the CNN an image of a car, it will still label it as a shape. You can, again circumvent this by training it on a million positive samples of shapes, and a million negative samples of random other images with the class "Other". This way, anything that is not a shape will get classified into "Other". This is one possibility.
What about step#3, use BGR2HSV? The objective is to use ML/AI and not
get into machine-vision.
With the inclusion of this step, there is no option but to get into computer vision. I am not exactly sure how to go about this, but I can guarantee OpenCV will provide you a way to do this. In fact, with OpenCV, you will no longer need to implement your own CNN, because OpenCV has its own image labelling libraries.
I have a problem. I have read many papers about video stabilization. Almost papers mention about smoothing motion by using Kalman Filter, so it's strong and run in real-time applications.
But there is also another filter strongly, that is particle filter.
But why dont we use Partilce filter in smoothing motion to create stabilized video?
Some papers only use particle filter in estimating global motion between frames (motion estimation part).
It is hard to understand them.
Can anyone explain them for me, please?
Thank you so much.
A Kalman Filter is uni-modal. That means it has one belief along with an error covariance matrix to represent the confidence in that belief as a normal distribution. If you are going to smooth some process, you want to get out a single, smoothed result. This is consistent with a KF. It's like using least squares regression to fit a line to data. You are simplifying the input to one result.
A particle filter is multi-modal by its very nature. Where a Kalman Filter represents belief as a central value and a variance around that central value, a particle filter just has many particles whose values are clustered around regions that are more likely. A particle filter can represent essentially the same state as a KF (imagine a histogram of the particles that looks like the classic bell curve of the normal distribution). But a particle filter can also have multiple humps or really any shape at all. This ability to have multiple simultaneous modes is ideally suited to handle problems like estimating motion, because one mode (cluster of particles) can represent one move, and another mode represents a different move. When presented with this ambiguity, a KF would have to abandon one of the possibilities altogether, but a particle filter can keep on believing both things at the same time until the ambiguity is resolved by more data.
I have images of mosquitos similar to these ones and I would like to automatically circle around the head of each mosquito in the images. They are obviously in different orientations and there are random number of them in different images. some error is fine. Any ideas of algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to recreate your training set. For this you would like to extract small images with examples of what is a mosquito head or what is not.
Then you can use those images to train a classification algorithm, be careful to have a balanced training set, since if your data is skewed to one class it would hit the performance of the algorithm. Since images are 2D and algorithms usually just take 1D arrays as input, you will need to arrange your images to that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could make the trick too. If you decide to use support vector machines I strongly recommend you to check libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. Also they have a very easy to follow guide targeted to beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, you should be able to avoid tolerance to orientation. If you don't have enough data, then you could create more training rows with some samples rotated, so you would have a more representative training set.
As for the prediction what you could do is given an image, cut it using a grid where each cell has the same dimension that the ones you used on your training set. Then you pass each of this image to the classifier and mark those squares where the classifier gave you a positive output. If you really need circles then take the center of the given square and the radius would be the half of the square side size (sorry for stating the obvious).
So after you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others) , since we are not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos in the same scale, we still might miss some of them just because they didn't fit in our grid perfectly. To address this, we will need to repeat this procedure (grid cut and predict) rescaling the given image to different sizes. How many sizes? well here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" that you are using, that is also something I would recommend you to experiment with.
There are some research may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a white unmarked underground, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example here i did a quick try taking the red channel, then substracting the blue channel*5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be moquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out, and apply some more dilation/erosion to it.
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.
I've been working on a project for some time, to detect and track (moving) vehicles in video captured from UAV's, currently I am using an SVM trained on bag-of-feature representations of local features extracted from vehicle and background images. I am then using a sliding window detection approach to try and localise vehicles in the images, which I would then like to track. The problem is that this approach is far to slow and my detector isn't as reliable as I would like so I'm getting quite a few false positives.
So I have been considering attempting to segment the cars from the background to find the approximate position so to reduce the search space before applying my classifier, but I am not sure how to go about this, and was hoping someone could help?
Additionally, I have been reading about motion segmentation with layers, using optical flow to segment the frame by flow model, does anyone have any experience with this method, if so could you offer some input to as whether you think this method would be applicable for my problem.
Below is two frames from a sample video
frame 0:
frame 5:
Assumimg your cars are moving, you could try to estimate the ground plane (road).
You may get a descent ground plane estimate by extracting features (SURF rather than SIFT, for speed), matching them over frame pairs, and solving for a homography using RANSAC, since plane in 3d moves according to a homography between two camera frames.
Once you have your ground plane you can identify the cars by looking at clusters of pixels that don't move according to the estimated homography.
A more sophisticated approach would be to do Structure from Motion on the terrain. This only presupposes that it is rigid, and not that it it planar.
Update
I was wondering if you could expand on how you would go about looking for clusters of pixels that don't move according to the estimated homography?
Sure. Say I and K are two video frames and H is the homography mapping features in I to features in K. First you warp I onto K according to H, i.e. you compute the warped image Iw as Iw( [x y]' )=I( inv(H)[x y]' ) (roughly Matlab notation). Then you look at the squared or absolute difference image Diff=(Iw-K)*(Iw-K). Image content that moves according to the homography H should give small differences (assuming constant illumination and exposure between the images). Image content that violates H such as moving cars should stand out.
For clustering high-error pixel groups in Diff I would start with simple thresholding ("every pixel difference in Diff larger than X is relevant", maybe using an adaptive threshold). The thresholded image can be cleaned up with morphological operations (dilation, erosion) and clustered with connected components. This may be too simplistic, but its easy to implement for a first try, and it should be fast. For something more fancy look at Clustering in Wikipedia. A 2D Gaussian Mixture Model may be interesting; when you initialize it with the detection result from the previous frame it should be pretty fast.
I did a little experiment with the two frames you provided, and I have to say I am somewhat surprised myself how well it works. :-) Left image: Difference (color coded) between the two frames you posted. Right image: Difference between the frames after matching them with a homography. The remaining differences clearly are the moving cars, and they are sufficiently strong for simple thresholding.
Thinking of the approach you currently use, it may be intersting combining it with my proposal:
You could try to learn and classify the cars in the difference image D instead of the original image. This would amount to learning what a car motion pattern looks like rather than what a car looks like, which could be more reliable.
You could get rid of the expensive window search and run the classifier only on regions of D with sufficiently high value.
Some additional remarks:
In theory, the cars should even stand out if they are not moving since they are not flat, but given your distance to the scene and camera resolution this effect may be too subtle.
You can replace the feature extraction / matching part of my proposal with Optical Flow, if you like. This amounts to identifying flow vectors that "stick out" from a consistent frame-to-frame motion of the ground. It may be prone to outliers in the optical flow, however. You can also try to get the homography from the flow vectors.
This is important: Regardless of which method you use, once you have found cars in one frame you should use this information to robustify your search of these cars in consecutive frame, giving a higher likelyhood to detections close to the old ones (Kalman filter, etc). That's what tracking is all about!
If the number of cars in your field of view always remain the same but move around then you can use optical flow...it will give you good results against a still background...if the number of cars are changing then you need to call goodFeaturestoTrack function in OpenCV after certain number of frames and again track the cars using optical flow.
You can use background modelling to model the background and hence the cars are always your foreground.The simplest example is frame differentiation...subtract the previous frame current frame. diff(x,y,k) = I(x,y,k) - I(x,y,k-1) .As your cars are moving in each frame you will get their position..
Both the process will work fine since you have a still background I presume..check this link to find what Optical flow can do.
Have OpenCV implementation of shape context matching? I've found only matchShapes() function which do not work for me. I want to get from shape context matching set of corresponding features. Is it good idea to compare and find rotation and displacement of detected contour on two different images.
Also some example code will be very helpfull for me.
I want to detect for example pink square, and in the second case pen. Other examples could be squares with some holes, stars etc.
The basic steps of Image Processing is
Image Acquisition > Preprocessing > Segmentation > Representation > Recognition
And what you are asking for seems to lie within the representation part os this general algorithm. You want some features that descripes the objects you are interested in, right? Before sharing what I've done for simple hand-gesture recognition, I would like you to consider what you actually need. A lot of times simplicity will make it a lot easier. Consider a fixed color on your objects, consider background subtraction (these two main ties to preprocessing and segmentation). As for representation, what features are you interested in? and can you exclude the need of some of these features.
My project group and I have taken a simple approach to preprocessing and segmentation, choosing a green glove for our hand. Here's and example of the glove, camera and detection on the screen:
We have used a threshold on defects, and specified it to find defects from fingers, and we have calculated the ratio of a rotated rectangular boundingbox, to see how quadratic our blod is. With only four different hand gestures chosen, we are able to distinguish these with only these two features.
The functions we have used, and the measurements are all available in the documentation on structural analysis for OpenCV, and for acces of values in vectors (which we've used a lot), can be found in the documentation for vectors in c++
I hope you can use the train of thought put into this; if you want more specific info I'll be happy to comment, Enjoy.