I have a very simple two-dimensional software simulation of a ball that can roll down a hill (see image).
The ball starts at the top of the hill (at x-position 0 and y-position 50)
and can roll down either the left or the right side of the hill. The ball's and hill's properties (mass, material, etc.) stay constant but are unknown. The force that pushes the ball down the hill is also unknown.
Moreover, there is wind that either accelerates or decelerates the ball. The wind speed is known and can change at any time.
I collected 1000 trials (i.e. rolled the ball down the hill 1000 times).
For each trial, I sampled the ball's position and the wind speed every 10 milliseconds and saved the data to a CSV file.
The file looks like this:
trial_nr,direction,time(ms),wind(km/h),ball_position
0, None, 0, 10, 0/50
0, right, 10, 10, 5/45
0, right, 20, 4, 8/30
0, right, 30, 6, 10/25
0, right, 40, 7, 15/15 <- stop position (label)
1, None, 0, 7, 0/50
1, right, 10, 8, 7/43
...
1, right, 50, 3, 20/15 <- stop position (label)
2, None, 0, 8, 0/50
2, left, 10, 1, -3/48
2, left, 20, 0, -4/47
...
2, left, 60, 3, -17/15 <- stop position (label)
...more trials...
What I want to do now is predict the stop position of the ball for future trials.
Concretely, I want to train a machine learning model with this CSV file.
Once trained, it gets as input the direction/wind/ball_position of the first 30 ms of a trial, and based on this data it should predict an estimate of where the ball will stop after it has rolled down the hill. This is a supervised learning problem, since the label is the stop position of the ball.
I was thinking of using a neural network that gets the feature vectors in the CSV file as input. But the problem is that not every feature vector has a label associated with it. Only the last feature vector of every trial has a label (the stop position). So how can I solve the given problem?
Any help is very much appreciated.
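For concreteness, here is a minimal sketch of how the data could be reshaped into a standard supervised setup: one feature vector per trial built from the first 30 ms, with the trial's last row as the label. The file name "trials.csv", the use of pandas/scikit-learn, and the choice of regressor are purely illustrative assumptions, not a recommendation.

# A minimal, illustrative sketch (not a full solution): build one training example
# per trial by flattening the first 30 ms of samples into a feature vector and
# using the trial's last row as the label. The file name "trials.csv" and the
# regressor choice are assumptions; any model taking fixed-length vectors works.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("trials.csv", skipinitialspace=True)          # hypothetical file name
pos = df["ball_position"].str.split("/", expand=True).astype(float)
df["ball_x"], df["ball_y"] = pos[0], pos[1]

X, y = [], []
for _, trial in df.groupby("trial_nr"):
    trial = trial.sort_values("time(ms)")
    early = trial[trial["time(ms)"] <= 30]                     # first 30 ms (4 samples)
    X.append(early[["wind(km/h)", "ball_x", "ball_y"]].to_numpy().ravel())
    y.append(trial[["ball_x", "ball_y"]].iloc[-1].to_numpy())  # label: stop position
    # ("direction" could also be added as a one-hot encoded feature.)

model = RandomForestRegressor().fit(X, y)                      # or a neural network
print(model.predict([X[0]]))                                   # predicted stop position

A neural network (or any other regressor that accepts a fixed-length feature vector) could be swapped in for the random forest here; the key point is the per-trial reshaping.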
I am new to deep learning and am attempting to understand how a CNN performs image classification.
I have gone through multiple YouTube videos, blogs, and papers, and they all mention roughly the same thing:
add filters to get feature maps
perform pooling
introduce non-linearity using ReLU
send to a fully connected network.
While this is all fine and dandy, I don't really understand how convolution works in essence. For example, edge detection:
for instance, [[-1, 1], [-1, 1]] detects a vertical edge.
How? Why? How do we know for sure that this will detect a vertical edge?
Similarly, for the matrices used for blurring/sharpening, how do we actually know that they will do what they are intended to do?
Do I simply take people's word for it?
Please help; I feel helpless since I am not able to understand convolution and how these matrices detect edges or shapes.
Filters detect spatial patterns such as edges in an image by responding to changes in the image's intensity values.
A quick recap: in terms of an image, a high-frequency image is one where the intensity of the pixels changes by a large amount, while a low-frequency image is one where the intensity is almost uniform. An image has both high- and low-frequency components. The high-frequency components correspond to the edges of an object, because at the edges the rate of change of pixel intensity is high.
High pass filters are used to enhance the high-frequency parts of an image.
Let's take an example where a part of your image has pixel values [[10, 10, 0], [10, 10, 0], [10, 10, 0]], indicating that the pixel values decrease toward the right, i.e. the image changes from light on the left to dark on the right. The filter used here is [[1, 0, -1], [1, 0, -1], [1, 0, -1]].
Now, we take the element-wise product of these two matrices, which gives the output [[10, 0, 0], [10, 0, 0], [10, 0, 0]]. Finally, these values are summed up to give a value of 30, which captures the variation in pixel values as we move from left to right. Similarly, we compute the subsequent output values as the filter slides across the image.
Here, you will notice that the rate of change of pixel values varies a lot from left to right, so a vertical edge has been detected. Had you used the filter [[1, 1, 1], [0, 0, 0], [-1, -1, -1]], the convolution output would consist of 0s only, i.e. no horizontal edge is present. In a similar way, [[-1, 1], [-1, 1]] detects a vertical edge.
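To make the arithmetic above concrete, here is a small NumPy/SciPy sketch (my own illustration, assuming scipy is available) that applies both filters to the 3x3 patch from the example. Like a CNN, it slides the filter without flipping it (cross-correlation), which is exactly the multiply-and-sum described in the text.

import numpy as np
from scipy.signal import correlate2d

patch = np.array([[10, 10, 0],
                  [10, 10, 0],
                  [10, 10, 0]])

vertical_filter = np.array([[1, 0, -1],
                            [1, 0, -1],
                            [1, 0, -1]])       # responds to left/right intensity changes
horizontal_filter = np.array([[1, 1, 1],
                              [0, 0, 0],
                              [-1, -1, -1]])   # responds to top/bottom intensity changes

# mode="valid" keeps only the position where the filter fully overlaps the patch,
# i.e. the single element-wise multiply-and-sum worked out above.
print(correlate2d(patch, vertical_filter, mode="valid"))    # [[30]] -> vertical edge found
print(correlate2d(patch, horizontal_filter, mode="valid"))  # [[0]]  -> no horizontal edge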
You can check more here in a lecture by Andrew Ng.
Edit: Usually, a vertical edge detection filter has positive values on the left and negative values on the right (or vice versa), so it responds to regions that are bright on one side and dark on the other. The sum of the filter's values should be 0, otherwise the resulting image will become brighter or darker overall. Also, in convolutional neural networks the filter values are not hand-designed: they are learned as ordinary parameters (weights) through backpropagation during the training process.
In the RNN world, does it matter which end of the word vectors is padded so that they all have the same length?
Example
pad_left = [0, 0, 0, 0, 5, 4, 3, 2]
pad_right = [5, 4, 3, 2, 0, 0, 0, 0]
The authors of the paper Effects of Padding on LSTMs and CNNs ran experiments on pre-padding and post-padding. Here is their conclusion:
For LSTMs, the accuracy of post-padding (50.117%) is far lower than that of pre-padding (80.321%).
Pre-padding and post-padding don't matter much to CNNs because, unlike LSTMs, CNNs don't try to remember things from previous outputs, but instead try to find patterns in the given data.
I never expected the padding position to have such a big effect, so I suggest you verify it yourself.
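If you want to check it yourself, the two padding schemes are easy to construct. Here is a small Python sketch; the pad() helper is purely illustrative, and e.g. Keras' pad_sequences exposes the same choice via its padding='pre'/'post' argument.

import numpy as np

def pad(seq, maxlen, where="pre", value=0):
    """Pad seq with value up to maxlen, either at the front ('pre') or the back ('post')."""
    fill = [value] * (maxlen - len(seq))
    return np.array(fill + list(seq) if where == "pre" else list(seq) + fill)

seq = [5, 4, 3, 2]
print(pad(seq, 8, "pre"))   # [0 0 0 0 5 4 3 2]
print(pad(seq, 8, "post"))  # [5 4 3 2 0 0 0 0]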
Please correct me if I am wrong: if we have x=10, y=20, when we apply a transform to these coordinates (let's say scaling x and y by 10), the new coordinates will be x=100 and y=200.
So, if we apply a scaling of x by -1, we get x=-10, y=20. But why does this action cause the view to be mirrored? Shouldn't the view just be redrawn at its new coordinates?
What am I missing here?
Don't think about a single coordinate, think about a range of coordinates.
If you take the coords (x-value only here) of... 0, 1, 2, 3, 4 and scale them by 10 then they will map to 0, 10, 20, 30, 40 respectively. This will stretch out the x axis and so the view will look 10 times bigger than it did originally.
If you take those same x coords and scale them by -1 then they will map to 0, -1, -2, -3, -4 respectively.
That is, the pixel that is furthest away from the origin (4) is still furthest away from the origin but now at -4.
Each pixel is mirrored through the origin.
That's how scaling works in iOS, Android and general mathematics.
If you just want to slide the view around without changing the size of it at all then you can use a translation instead.
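Here is a quick NumPy sketch (my own illustration, using the same sample coordinates as above) contrasting scaling by 10, scaling by -1, and a translation:

import numpy as np

xs = np.array([0, 1, 2, 3, 4])

print(xs * 10)   # [ 0 10 20 30 40] -> stretched: the view looks 10x wider
print(xs * -1)   # [ 0 -1 -2 -3 -4] -> every point mirrored through the origin
print(xs + 7)    # [ 7  8  9 10 11] -> translation: same size, just slid along x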
Take sample points (10,10), (20,0), (20,40), (20,20).
In MATLAB, polyfit returns a slope of 1, but for the same data OpenCV's fitLine returns a slope of 10.7. From hand calculations, the near-vertical line (slope 10.7) is a much better least-squares fit.
How come we’re getting different lines from the two libraries?
OpenCV code - (on iOS)
vector<cv::Point> vTestPoints;
vTestPoints.push_back(cv::Point( 10, 10 ));
vTestPoints.push_back(cv::Point( 20, 0 ));
vTestPoints.push_back(cv::Point( 20, 40 ));
vTestPoints.push_back(cv::Point( 20, 20 ));
Mat cvTest = Mat(vTestPoints);
cv::Vec4f testWeight;
fitLine( cvTest, testWeight, CV_DIST_L2, 0, 0.01, 0.01);
NSLog(@"Slope: %.2f", testWeight[1]/testWeight[0]);
Xcode log shows:
2014-02-12 16:14:28.109 Application[3801:70b] Slope: 10.76
Matlab code
>> px
px = 10 20 20 20
>> py
py = 10 0 20 40
>> polyfit(px,py,1)
ans = 1.0000e+000 -2.7733e-014
MATLAB is trying to minimise the error in y for a given input x (i.e. as if x is your independent and y your dependent variable).
In this case, the line that goes through the points (10,10) and (20,20) is probably the best bet. A near-vertical line that goes close to all three points with x=20 would have a very large error if you tried to calculate a value of y given x=10.
Although I don't recognise the OpenCV syntax, I'd guess that CV_DIST_L2 is a distance metric that means you're trying to minimise the overall (perpendicular) distance between the line and each point in the x-y plane. In that case a more vertical line, which passes through the middle of the point set, would be the closest.
Which one is "correct" depends on what your points represent.
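To see the two objectives side by side, here is a small NumPy sketch (my own illustration, not the original iOS/MATLAB code) that reproduces both slopes: ordinary least squares gives the slope-1 line, while a total least-squares fit via the first principal component gives the near-vertical line.

import numpy as np

x = np.array([10.0, 20.0, 20.0, 20.0])
y = np.array([10.0,  0.0, 20.0, 40.0])

# Ordinary least squares (what polyfit does): minimise vertical errors in y.
slope, intercept = np.polyfit(x, y, 1)
print(slope)                      # ~1.0

# Total least squares (apparently what fitLine is doing here): minimise perpendicular
# distances. The best direction is the first principal component of the centred points.
pts = np.column_stack([x, y]) - [x.mean(), y.mean()]
_, _, vt = np.linalg.svd(pts)
vx, vy = vt[0]                    # direction of greatest variance
print(vy / vx)                    # ~10.76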
I am trying to use FFTW for image convolution.
At first, just to test that the system is working properly, I performed the FFT and then the inverse FFT, and I got the exact same image back.
Then, as a small step forward, I used the identity kernel (i.e. kernel[0][0] = 1 and all the other components equal 0). I took the component-wise product of the image and the kernel (both in the frequency domain), then did the inverse FFT. Theoretically I should get the identical image back, but the result I got is not even close to the original image. I suspect this has something to do with where I centered my kernel before transforming it into the frequency domain (since I put the "1" at kernel[0][0], I basically centered the positive part at the top left). Could anyone enlighten me about what is going wrong here?
For each dimension, the sample indices should run from roughly -n/2 ... 0 ... n/2 - 1; if the dimension is odd, center around the middle, and if it is even, center so that there is one more sample before the new 0 than after it.
E.g. -4, -3, -2, -1, 0, 1, 2, 3 for a width/height of 8 or -3, -2, -1, 0, 1, 2, 3 for a width/height of 7.
The FFT is defined relative to the middle; on its scale there are negative indices.
In memory the points are 0 ... n-1, but the FFT treats them as -floor(n/2) ... ceil(n/2)-1, where memory index 0 corresponds to -floor(n/2) and n-1 corresponds to ceil(n/2)-1.
The identity kernel is an array of zeros with a 1 at the (0,0) location (the center, according to the numbering above). (In the spatial domain.)
In the frequency domain the identity kernel should be a constant (all real values 1, or 1/(N*M), and all imaginary values 0).
If you do not get this result, then the identity kernel might need to be padded differently (to the left and down instead of around all sides); this may depend on the FFT implementation.
Center each dimension separately (this is an index centering, no change in actual memory).
You will probably need to pad the image (after centering) to a whole power of 2 in each dimension (2^n * 2^m where n doesn't have to equal m).
Pad relative to the FFT's (0,0) location (the center, not a corner) by copying the existing pixels into a new, larger image, using center-based indices in both the source and destination images (e.g. (0,0) to (0,0), (0,1) to (0,1), (1,-2) to (1,-2)).
Assuming your FFT uses regular floating-point cells and not complex cells, the complex image has to be of size 2*ceil(n/2) * 2*ceil(m/2) even if you don't need a whole power of 2 (since it has half the samples, but the samples are complex).
If your image has more than one color channel, you will first have to reshape it so that the channel index is the most significant part of the pixel ordering (planar layout) instead of the least significant (interleaved). You can reshape and pad in one go to save time and space.
Don't forget the FFTSHIFT after the IFFT. (To swap the quadrants.)
The result of the IFFT is laid out as 0 ... n-1. You have to take pixels ceil(n/2) ... n-1 and move them before pixels 0 ... ceil(n/2)-1.
This is done by copying pixels to a new image: copy pixel ceil(n/2) to memory location 0, ceil(n/2)+1 to memory location 1, ..., n-1 to memory location floor(n/2)-1, then 0 to memory location floor(n/2), 1 to memory location floor(n/2)+1, ..., ceil(n/2)-1 to memory location n-1.
When you multiply in the frequency domain, remember that the samples are complex (one cell real then one cell imaginary) so you have to use a complex multiplication.
The result might need dividing by N*M or N^2*M^2, depending on how your FFT normalises the forward and inverse transforms, where N and M are the padded sizes of the n and m dimensions. You can tell which by (a) looking at the frequency-domain values of the identity kernel, or (b) comparing the result to the input.
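The question uses FFTW in C, but the kernel-placement bookkeeping above is easy to sanity-check with a small NumPy sketch (my own illustration): with the identity kernel's 1 at memory index (0,0), or built centred and then un-centred with ifftshift, the frequency-domain product gives the original image back.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))

kernel = np.zeros_like(image)
kernel[0, 0] = 1.0                      # identity kernel, 1 at memory index (0, 0)
result = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))
print(np.allclose(result, image))       # True

centred = np.zeros_like(image)
centred[4, 4] = 1.0                     # 1 at the "new 0" of an 8x8 array (see above)
result2 = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(np.fft.ifftshift(centred))))
print(np.allclose(result2, image))      # True: ifftshift moves the centre back to (0, 0)

# Note: NumPy's ifft2 already divides by N*M, so no extra normalisation is needed here;
# with unnormalised transforms (as in FFTW) you would have to rescale as discussed above.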
I think that your understanding of the identity kernel may be off. An identity kernel should have the 1 at the center of the 2D kernel, not at the (0, 0) position.
For example, for a 3 x 3 kernel, you have yours set up as follows:
1, 0, 0
0, 0, 0
0, 0, 0
It should be
0, 0, 0
0, 1, 0
0, 0, 0
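As a quick check of this spatial-domain picture (my own illustration, assuming SciPy is available): convolving with the centred 3x3 identity kernel leaves the image untouched, while the kernel with the 1 in the corner shifts it.

import numpy as np
from scipy.ndimage import convolve

image = np.arange(25, dtype=float).reshape(5, 5)

centred_identity = np.array([[0, 0, 0],
                             [0, 1, 0],
                             [0, 0, 0]], dtype=float)
corner_one = np.array([[1, 0, 0],
                       [0, 0, 0],
                       [0, 0, 0]], dtype=float)

print(np.array_equal(convolve(image, centred_identity), image))  # True: "do-nothing" kernel
print(np.array_equal(convolve(image, corner_one), image))        # False: image is shifted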
Check this out also
What is the "do-nothing" convolution kernel
Also look here, at the bottom of page 3:
http://www.fmwconcepts.com/imagemagick/digital_image_filtering.pdf
I took the component-wise product between the image and the kernel in the frequency domain, then did the inverse FFT. Theoretically I should be able to get the identical image back.
I don't think that doing a forward transform, multiplying by a kernel that hasn't itself been transformed correctly, and then doing an inverse FFT should lead to any expectation of getting the original image back, but perhaps I'm just misunderstanding what you were trying to say there...