I'm new to all the stuff I'm going to talking about so that the questions may be too simple.
Thanks in advance for your answers!
My questions cames from the following image:
To be more clear:
For the first Convolution, from 1 x 28 x28 to 25 x 26 x26, the input (1 layer) goes through the filter (25 layers). So, one layer was filtered 25 times ( right ? ).
But for the second Convolution, from 25 x 13 x 13 to 50 x 11 x 11, what's the operation of the filter 50 x 3 x 3 applied on the input 25 x 13 x 13? I confused about the operation. Because the output should be 1250 x 11 x 11 if each layer of the input 25 x 13 x 13 goes through the filter 50 x 3 x 3. Why is the output still 50 layers?
For the second Max Pooling, how does MaxPooling2D() deal with a layer with odd size? The remainder of (11 mod 2) is 1. In the above image, from 11 to 5, what happened on the 1?
In addition, What's the common operation for Max Pooling an odd-size input layer?
Each convolution is applied to all the channels of the input(output of the previous layer), in this case, one filter( from the 50x3x3 Conv2d) is applied to all 25(from 25x3x3 Conv2d) input then the results will be summed up to give one output feature map of 50Conv2d, this will be done for 50 times. here is a link about how filters are applied to features maps. The rule of thumb is, if the next convolution have N filters, its output should also have N features maps.
For maxPooling, the default value for padding in MaxPooling2D; which is applied in your case, is "valid|", it means that the pooling function will not include values that can not be contained in kernel size. in your example, kernel is 2, that means the 11th element was not included in the operation. Here is a good link about the padding="valid" flag, the second answer has a good visual on how some elements of the input are left out during this operation.
Additionally, it may be good idea to do strides instead of max pooling.
You can easily find folks discussing and comparing it on the net. As you're stumbling with dimensionality, using strides is less mind-boggling.
https://stats.stackexchange.com/questions/387482/pooling-vs-stride-for-downsampling
https://www.pyimagesearch.com/2018/12/31/keras-conv2d-and-convolutional-layers/
https://machinelearningmastery.com/padding-and-stride-for-convolutional-neural-networks/
Hope it helps.
Related
Given input signal x (e.g. a voltage, sampled thousand times per second couple of minutes long), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for variable window length and weight coefficients. How can I do it in q? I'm aware of mavg and signal processing in q and moving sum qidiom
In the DSP world it's called applying filter kernel by doing convolution. Weight coefficients define the kernel, which makes a high- or low-pass filter. The example above calculates the slope from last four points, placing the straight line via least squares method.
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing)
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 1 -3] x
4612
q)\t:100000 g x
1791
There's a kx white paper of signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
This may be a bit old but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within KDB, dependent on the signal sizes you are using, you will see much better performance with a FFT based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my github repo I do have the untested work for a more flexible Bluestein algorithm which will allow for more variable signal length. https://github.com/callumjbiggs/q-signals/blob/master/signal.q
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method would be to break it up into blocks equal to the kernel/window size (which was based on some work Arthur W did many years ago)
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count v
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list not big then you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
That uses 'scan' adverb. As that process creates multiple lists which might be inefficient for big lists.
Other solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists so again might not be very efficient for big lists.
Other option is to use indices to fetch target items from the input list and perform the calculation. This will operate only on input list.
q) f:{[l;w;i]sum w*l i+til 4} / w- weight, l- input list, i-current index
q) f[x;-3 -1 1 3]#'til count x
This is a very basic function. You can add more variables to it as per your requirements.
for my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employee at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what is the problem and write me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what may be the problem. I believe it is caused by the data not being in the correct format, with two potential issues. Firstly, some values have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two datasets:
A dataframe which has been aggregated grouped into consistently sized bins.
A list of values X which in the data
I prefer the second technique. As originally shown here The expandRows() function in the package splitstackshape can be used to repeat the number of rows in the dataframe by the number of observations:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)
I have a dataset of time series that I use as input to an LSTM-RNN for action anticipation. The time series comprises a time of 5 seconds at 30 fps (i.e. 150 data points), and the data represents the position/movement of facial features.
I sample additional sub-sequences of smaller length from my dataset in order to add redundancy in the dataset and reduce overfitting. In this case I know the starting and ending frame of the sub-sequences.
In order to train the model in batches, all time series need to have the same length, and according to many papers in the literature padding should not affect the performance of the network.
Example:
Original sequence:
1 2 3 4 5 6 7 8 9 10
Subsequences:
4 5 6 7
8 9 10
2 3 4 5 6
considering that my network is trying to anticipate an action (meaning that as soon as P(action) > threshold as it goes from t = 0 to T = tmax, it will predict that action) will it matter where the padding goes?
Option 1: Zeros go to substitute original values
0 0 0 4 5 6 7 0 0 0
0 0 0 0 0 0 0 8 9 10
0 2 3 4 5 6 0 0 0 0
Option 2: all zeros at the end
4 5 6 7 0 0 0 0 0 0
8 9 10 0 0 0 0 0 0 0
2 3 4 5 0 0 0 0 0 0
Moreover, some of the time series are missing a number of frames, but it is not known which ones they are - meaning that if we only have 60 frames, we don't know whether they are taken from 0 to 2 seconds, from 1 to 3s, etc. These need to be padded before the subsequences are even taken. What is the best practice for padding in this case?
Thank you in advance.
The most powerful attribute of LSTMs and RNNs in general is that their parameters are shared along the time frames(Parameters recur over time frames) but the parameter sharing relies upon the assumption that the same parameters can be used for different time steps i.e. the relationship between the previous time step and the next time step does not depend on t as explained here in page 388, 2nd paragraph.
In short, padding zeros at the end, theoretically should not change the accuracy of the model. I used the adverb theoretically because at each time step LSTM's decision depends on its cell state among other factors and this cell state is kind of a short summary of the past frames. As far as I understood, that past frames may be missing in your case. I think what you have here is a little trade-off.
I would rather pad zeros at the end because it doesn't completely conflict with the underlying assumption of RNNs and it's more convenient to implement and keep track of.
On the implementation side, I know tensorflow calculates the loss function once you give it the sequences and the actual sequence size of each sample(e.g. for 4 5 6 7 0 0 0 0 0 0 you also need to give it the actual size which is 4 here) assuming you're implementing the option 2. I don't know whether there is an implementation for option 1, though.
Better go for padding zeroes in the beginning, as this paper suggests Effects of padding on LSTMs and CNNs,
Though post padding model peaked it’s efficiency at 6 epochs and started to overfit after that, it’s accuracy is way less than pre-padding.
Check table 1, where the accuracy of pre-padding(padding zeroes in the beginning) is around 80%, but for post-padding(padding zeroes in the end), it is only around 50%
In case you have sequences of variable length, pytorch provides a utility function torch.nn.utils.rnn.pack_padded_sequence. The general workflow with this function is
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
embedding = nn.Embedding(4, 5)
rnn = nn.GRU(5, 5)
sequences = torch.tensor([[1,2,0], [3,0,0], [2,1,3]])
lens = [2, 1, 3] # indicating the actual length of each sequence
embeddings = embedding(sequences)
packed_seq = pack_padded_sequence(embeddings, lens, batch_first=True, enforce_sorted=False)
e, hn = rnn(packed_seq)
One can collect the embedding of each token by
e = pad_packed_sequence(e, batch_first=True)
Using this function is better than padding by yourself, because torch will limit RNN to only inspecting the actual sequence and stop before the padded token.
I am trying to compute the similarity between n entities that are being described by entity_id, type_of_order, total_value.
An example of the data might look like:
NR entity_id type_of_order total_value
1 1 A 10
2 1 B 90
3 1 C 70
4 2 B 20
5 2 C 40
6 3 A 10
7 3 B 50
8 3 C 20
9 4 B 50
10 4 C 80
My question would be what is a god way of measuring the similarity between entity_id 1 and 2 for example with regards to the type_of_order and the total_value for that type of order.
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.
Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question:
I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for.
I'm assuming x_2^(2) is the value 5184, unless I am adding the x_0 column of 1's, which they don't mention but he certainly mentions in the lectures when talking about creating the design matrix X. In which case x_2^(2) would be the value 72. Assuming one or the other is right (I'm playing a guessing game), what should I use to normalize it? He talks about 3 different ways to normalize in the lectures: one using the maximum value, another with the range/difference between max and mins, and another the standard deviation -- they want an answer correct to the hundredths. Which one am I to use? This is so confusing.
...use both feature scaling (dividing by the
"max-min", or range, of a feature) and mean normalization.
So for any individual feature f:
f_norm = (f - f_mean) / (f_max - f_min)
e.g. for x2,(midterm exam)^2 = {7921, 5184, 8836, 4761}
> x2 <- c(7921, 5184, 8836, 4761)
> mean(x2)
6676
> max(x2) - min(x2)
4075
> (x2 - mean(x2)) / (max(x2) - min(x2))
0.306 -0.366 0.530 -0.470
Hence norm(5184) = 0.366
(using R language, which is great at vectorizing expressions like this)
I agree it's confusing they used the notation x2 (2) to mean x2 (norm) or x2'
EDIT: in practice everyone calls the builtin scale(...) function, which does the same thing.
It's asking to normalize the second feature under second column using both feature scaling and mean normalization. Therefore,
(5184 - 6675.5) / 4075 = -0.366
Usually we normalize all of them to have zero mean and go between [-1, 1].
You can do that easily by dividing by the maximum of the absolute value and then remove the mean of the samples.
"I'm assuming x_2^(2) is the value 5184" is this because it's the second item in the list and using the subscript _2? x_2 is just a variable identity in maths, it applies to all rows in the list. Note that the highest raw mid-term exam result (i.e. that which is not squared) goes down on the final test and the lowest raw mid-term result increases the most for the final exam result. Theta is a fixed value, a coefficient, so somewhere your normalisation of x_1 and x_2 values must become (EDIT: not negative, less than 1) in order to allow for this behaviour. That should hopefully give you a starting basis, by identifying where the pivot point is.
I had the same problem, in my case the thing was that I was using as average the maximum x2 value (8836) minus minimum x2 value (4761) divided by two, instead of the sum of each x2 value divided by the number of examples.
For the same training set, I got the question as
Q. What is the normalized feature x^(3)_1?
Thus, 3rd training ex and 1st feature makes out to 94 in above table.
Now, normalized form is
x = (x - mean(x's)) / range(x)
Values are :
x = 94
mean(89+72+94+69) / 4 = 81
range = 94 - 69 = 25
Normalized x = (94 - 81) / 25 = 0.52
I'm taking this course at the moment and a really trivial mistake I made first time I answered this question was using comma instead of dot in the answer, since I did by hand and in my country we use comma to denote decimals. Ex:(0,52 instead of 0.52)
So in the second time I tried I used dot and works fine.