I want to recognize these geometric shapes which are related to each other. For instance, looking at the image of the roof below, just by knowing the existence of the ridges in RED, I know that the ridge in BLUE should also exist (even if it is not visible in the image). If I have thousands of such labelled images, a ML model should be able to learn this as well. However, I can't figure out how to represent this problem?
Label(s) : C, Z
Label(s) : D
Label(s) : C, Z
Label(s) : E, G
Let's call these ridges as lines, as in the 1st example, we have X and Y lines being detected by simple edge detection but not Z because it's not visible. The same way line D is not detected but lines A, B, C are, from example 2.
What I want is that I formulate a ML model that learns from the X and Y that there should be a Z and subsequently D from A, B, C.
I have a data set of such examples where ridges are labeled (the red and blue is just to distinguish, all ridges are labeled with same color).
There are some important things to keep in mind.
The brightness of the image could vary a lot.
The ridge could have any scale or even orientation (within reasonable limits).
The input image is almost always very noisy.
I can think of two approaches.
Use a network similar to those used in edge detection problems. These networks output probability for each pixel of Input to contain an edge. Your problem is similar, just that you don't need all the edges. But this may need some significant post processing as you may get a lot of close lying lines, you will have to collapse them into a single line using non maximal suppression or some morphological operation. For training the ground truth values can be binary masks containing true location of ridges or you can use some small gaussian over the actual ridge location so as to make loss function more stable.
Second method can be regression. You can have an output vector containing coordinates of end point of ridge as a flat vector. But that would require for you to fix maximum number of ridges that can be there. This method would not probably work on its own as you may get a lot of false positives due to bigger output vector, but this can be combined with first method and you can choose to keep a key point only if it's significantly close to an edge location obtained from first method.
I would use a CNN to do the detection of the roof. If color is not important, you can either make the image grayscale / other color channel models (e.g. HSV and remove the H-channel). Alternatively, you can augment your dataset by automatically changing the hue of any image and feeding this edited image to the CNN as well.
Related
I am trying to compute the blurriness of an image by using LaplacianFilter.
According to this article: https://www.pyimagesearch.com/2015/09/07/blur-detection-with-opencv/ I have to compute the variance of the output image. The problem is I don't understand conceptually how do I compute variance of an image.
Every pixel has 4 values for every color channel, therefore I can compute the variance of every channel, but then I get 4 values, or even 16 by computing variance-covariance matrix, but according to the OpenCV example, they have only 1 number.
After computing that number, they just play with the threshold in order to make a binary decision, whether the image is blurry or not.
PS. by no means I am an expert on this topic, therefore my statements can make no sense. If so, please be nice to edit the question.
On sentence description:
The blured image's edge is smoothed, so the variance is small.
1. How variance is calculated.
The core function of the post is:
def variance_of_laplacian(image):
# compute the Laplacian of the image and then return the focus
# measure, which is simply the variance of the Laplacian
return cv2.Laplacian(image, cv2.CV_64F).var()
As Opencv-Python use numpy.ndarray to represent the image, then we have a look on the numpy.var:
Help on function var in module numpy.core.fromnumeric:
var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=<class 'numpy._globals$
Compute the variance along the specified axis.
Returns the variance of the array elements, a measure of the spread of a distribution.
The variance is computed for the flattened array by default, otherwise over the specified axis.
2. Using for picture
This to say, the var is calculated on the flatten laplacian image, or the flatted 1-D array.
To calculate variance of array x, it is:
var = mean(abs(x - x.mean())**2)
For example:
>>> x = np.array([[1, 2], [3, 4]])
>>> x.var()
1.25
>>> np.mean(np.abs(x - x.mean())**2)
1.25
For the laplacian image, it is edged image. Make images using GaussianBlur with different r, then do laplacian filter on them, and calculate the vars:
The blured image's edge is smoothed, so the variance is little.
First thing first, if you see the tutorial you gave, they convert the image to a greyscale, thus it will have only 1 channel and 1 variance. You can do it for each channel and try to compute a more complicated formula with it, or just use the variance over all the numbers... However I think the author also converts it to greyscale since it is a nice way of fusing the information and in one of the papers that the author supplies actually says that
A well focused image is expected to have a high variation in grey
levels.
The author of the tutorial actually explains it in a simple way. First, think what the laplacian filter does. It will show the well define edges here is an example using the grid of pictures he had. (click on it to see better the details)
As you can see the blurry images barely have any edges, while the focused ones have a lot of responses. Now, what would happen if you calculate the variance. let's imagine the case where white is 255 and black is 0. If everything is black... then the variance is low (cases of the blurry ones), but if they have like half and half then the variance is high.
However, as the author already said, this threshold is dependent on the domain, if you take a picture of a sky even if it is focus it may have low variance, since it is quite similar and does not have very well define edges...
I hope this answer your doubts :)
Currently I am learning dense optical flow by myself. To understand it, I conduct one experiment. I produce one image using Matlab. One box with a given grays value is placed under one uniform background and the box is translated two pixels in x and y directions in another image. The two images are input into the implementation of the algorithm called TV-L1. The generated motion vector outer of the box is not zero. Is the reason that the gradient outer of the box is zero? Is the values filled in from the values with large gradient value?
In Horn and Schunck's paper, it reads
In parts of the image where the brightness gradient is zero, the velocity
estimates will simply be averages of the neighboring velocity estimates. There
is no local information to constrain the apparent velocity of motion of the
brightness pattern in these areas.
The progress of this filling-in phenomena is similar to the propagation effects
in the solution of the heat equation for a uniform flat plate, where the time rate of change of temperature is proportional to the Laplacian.
Is it not possible to obtain correct motion vectors for pixels with small gradients? Or the experiment is not practical. In practical applications, this doesn't happen.
Yes, in so called homogenous image regions with very small gradients no information where a motion can dervided from exists. That's why the motion from your rectangle is propagated outer the border. If you give your background a texture this effect will be less dominant. I know such problem when it comes to estimate the ego-motion of a car. Then the streat makes a lot of problems cause of here homogenoutiy.
Two pioneers in this field Lukas&Kanade (LK) and Horn&Schunch (HS) are developed methods for computing Optical Flow (OF). Both rely on brightness constancy assumption which feature location pixel values between two sequence frames not change. This constraint may be expressed as two equations: I(x+dx,y+dy,t+dt)=I(x,y,t) and ∂I/∂x dx+∂I/∂y dy+∂I/∂t dt=0 by using a Taylor series expansion I(x+dx,y+dy,t+dt) , we get (x+dx,y+dy,t+dt)=I(x,y,t)+∂I/∂x dx+∂I/∂y dy+∂I/∂t dt… letting ∂x/∂t=u and ∂y/∂t=v and combining these equations we get the OF constraint equation: ∂I/∂t=∂I/∂t u+∂I/∂t v . The OF equation has more than one solution, so the different techniques diverge here. LK equations are derived assuming that pixels in a neighborhood of each tracked feature move with the same velocity as the feature. In OpenCV, to catch large motions with a small window size (to keep the “same local velocity” assumption).
I have a bunch of gray-scale images decomposed into superpixels. Each superpixel in these images have a label in the rage of [0-1]. You can see one sample of images below.
Here is the challenge: I want the spatially (locally) neighboring superpixels to have consistent labels (close in value).
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested. I have also heard about Conditional Random Field (CRF). Is it helpful?
Any suggestion would be welcome.
I'm kind of interested in smoothing local labels but do not want to apply Gaussian smoothing functions or whatever, as some colleagues suggested.
And why is that? Why do you not consider helpful advice of your colleagues, which are actually right. Applying smoothing function is the most reasonable way to go.
I have also heard about Conditional Random Field (CRF). Is it helpful?
This also suggests, that you should rather go with collegues advice, as CRF has nothing to do with your problem. CRF is a classifier, sequence classifier to be exact, requiring labeled examples to learn from and has nothing to do with the setting presented.
What are typical approaches?
The exact thing proposed by your collegues, you should define a smoothing function and apply it to your function values (I will not use a term "labels" as it is missleading, you do have values in [0,1], continuous values, "label" denotes categorical variable in machine learning) and its neighbourhood.
Another approach would be to define some optimization problem, where your current assignment of values is one goal, and the second one is "closeness", for example:
Let us assume that you have points with values {(x_i, y_i)}_{i=1}^N and that n(x) returns indices of neighbouring points of x.
Consequently you are trying to find {a_i}_{i=1}^N such that they minimize
SUM_{i=1}^N (y_i - a_i)^2 + C * SUM_{i=1}^N SUM_{j \in n(x_i)} (a_i - a_j)^2
------------------------- - --------------------------------------------
closeness to current constant to closeness to neighbouring values
values weight each part
You can solve the above optimization problem using many techniques, for example through scipy.optimize.minimize module.
I am not sure that your request makes any sense.
Having close label values for nearby superpixels is trivial: take some smooth function of (X, Y), such as constant or affine, taking values in the range [0,1], and assign the function value to the superpixel centered at (X, Y).
You could also take the distance function from any point in the plane.
But this is of no use as it is unrelated to the image content.
I need to be able to determine if a shape was drawn correctly or incorrectly,
I have sample data for the shape, that holds the shape and the order of pixels (denoted by the color of the pixel)
for example, you can see of the downsampled image and color variation
I'm having trouble figuring out the network I need to define that will accept this kind of input for training.
should I convert the sampledown image to a matrix and input it? let's say my image is 64x64, I would need 64x64 input neurons (and that's if I ignore the color of the pixels, I think) is that feasible solution?
If you have any guidance, I could use it :)
I gave you an example as below.
It is a binarized 4x4 image of letter c. You can either concatenate the rows or columns. I am concatenating by columns as shown in the figure. Then each pixel is mapped to an input neuron (totally 16 input neurons). In the output layer, I have 26 outputs, the letters a to z.
Note, in the figure, I did not connect all nodes from layer i to layer i+1 for simplicity, which you probably should connect all.
At the output layer, I highlight the node of c to indicate that for this training instance, c is the target label. The expected input and output vector are listed in the bottom of the figure.
If you want to keep the intensity of color, e.g., R/G/B, then you have to triple the number of inputs. Each single pixel is replaced with three neurons.
Hope this helps more. For a further reading, I strongly suggest the deep learning tutorial by Andrew Ng at here - UFLDL. It's the state of art of such image recognition problem. In the exercise with the tutorial, you will be intensively trained to preprocess the images and work with a lot of engineering tricks for image processing, together with the interesting deep learning algorithm end-to-end.
I have a few doubts about how to approach my goal. I have an outside camera who is recording people and I want to draw an ellipse on every person.
Right now what I do is get the feature points of the people from the frame (I get them using a mask to only have the feature points on the people), set a EM algorithm and train it with my samples (the feature points extracted). The number of clusters is twice the number of people from the image (I get it before start the EM algorithm using other methods such as pixel counting with a codebook).
My question is
(a) Do I have to just train it only for the first frame and then use predict in the following frames? or,
(b) use train with the feature points in every frame?
Right now I am doing the option b) (I don't use predict) because I don't really know how to use the predict.
If I do a), can you help me with it and after that how to draw the ellipses?. If I do b), can you help me drawing an ellipse for every person? Since right know I got different ellipses for the same person using the cov, mean, etc (one for the arm, for example).
What I want to achieve is this paper using the Gaussian model: Link
If you would draw bounding boxes, rather then ellipses, you could use the function groupRectanlges to merge the different bounding boxes.
But, more important - for people detection, you can simply use openCV's person detector (based on HOG) or latent svm detector with the person model.
You should do b) anyway because, otherwise you'll try to match the keypoints to the clusters (persons) in the first frame. After a few seconds this would not be relevant.
It seems reasonable to assume that from frame to frame change is not going to be overwhelming, so reusing the results of the training on frame N-1 is a good seed to train on frame N, likely to converge faster that running EM from scratch on each frame.
in order to draw the ellipses you can leverage from the mixture of gaussian example in the python bindings:
https://github.com/opencv/opencv/blob/master/samples/python/gaussian_mix.py
Note if you use a diagonal covariance matrix, your ellipses are going to be aligned "straight", their own axis aligned with the X and Y axis of the frame, you can skip the calculation of the angle of the ellipse