What is a full working example (not snippets) of variable-length sequence inputs into recurrent neural networks (RNNs)?
For example, PyTorch supposedly supports variable-length sequences as input to RNNs, but there do not seem to be any examples of full working code.
Relevant:
https://github.com/pytorch/pytorch/releases/tag/v0.1.10
https://discuss.pytorch.org/t/about-the-variable-length-input-in-rnn-scenario/345
Sadly, there is no such thing as a 'variable-length' neural network. This is because there is no way a network can 'know' which weights to use for extra input nodes that it wasn't trained on.
However, the reason you are seeing 'variable length' on that page is that they process:
a b c d e
a b c d e f g h
a b c d
a b
as
a b c d e 0 0 0
a b c d e f g h
a b c d 0 0 0 0
a b 0 0 0 0 0 0
They convert all 'empty' positions to 0, which makes sense, as 0 does not add anything to the network's hidden layers regardless of the weights, since anything * 0 = 0.
So basically, you can have 'variable length' inputs, but you have to define some kind of maximum size; all inputs that are smaller than that size should be padded with zeros.
If you are classifying sentences on the other hand, you could use LSTM/GRU networks to handle the input sequentially.
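To make this concrete, here is a minimal sketch of the padding-plus-packing approach in current PyTorch (the utility functions below postdate the 0.1.10 release linked in the question; the hidden size and dummy data are arbitrary placeholders):
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# three sequences of different lengths, each timestep a 1-dimensional feature
seqs = [torch.randn(5, 1), torch.randn(8, 1), torch.randn(2, 1)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)    # zero-pad to the longest length: shape (3, 8, 1)
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = nn.GRU(input_size=1, hidden_size=4, batch_first=True)
packed_out, h_n = rnn(packed)                    # the RNN skips the padded positions
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)                                 # (3, 8, 4), zero-padded beyond each true length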
As I understand it, in a deep neural network we apply the weights (w) and bias (b) and then an activation function (g): z := w * X + b, a := g(z). So there is a composition (g o z), and the activation function is what lets our model learn functions other than linear ones. I can see that the Sigmoid and Tanh activation functions make our model non-linear, but I have some trouble seeing how ReLU (which just takes the max of 0 and z) can make a model non-linear...
If every z were always positive, then it would be as if there were no activation function at all...
So why does ReLU make a neural network model non-linear?
Deciding if a function is linear or not is of course not a matter of opinion or debate; there is a very simple definition of a linear function, which is roughly:
f(a*x + b*y) = a*f(x) + b*f(y)
for every x & y in the function domain and a & b constants.
The requirement "for every" means that, if we are able to find even a single example where the above condition does not hold, then the function is nonlinear.
Assuming for simplicity that a = b = 1, let's try x=-5, y=1 with f being the ReLU function:
f(-5 + 1) = f(-4) = 0
f(-5) + f(1) = 0 + 1 = 1
so, for these x & y (in fact for every x & y with x*y < 0) the condition f(x + y) = f(x) + f(y) does not hold, hence the function is nonlinear...
The fact that we may be able to find subdomains (e.g. both x and y being either negative or positive here) where the linearity condition holds is what defines some functions (such as ReLU) as piecewise-linear, which are still nonlinear nevertheless.
Now, to be fair to your question, if in a particular application the inputs happened to be always either all positive or all negative, then yes, in this case the ReLU would in practice end up behaving like a linear function. But for neural networks this is not the case, hence we can rely on it indeed to provide our necessary non-linearity...
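For what it's worth, the counterexample above is easy to check numerically (a throwaway sketch, not tied to any particular framework):
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x, y = -5.0, 1.0
print(relu(x + y))           # 0.0
print(relu(x) + relu(y))     # 1.0, so f(x + y) != f(x) + f(y) and ReLU is not linear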
When there is only one input variable, I can use an LSTM to make the forecast. But in the following two cases I am confused and do not know how to build the neural network:
The data format is shown in the picture.
The first case:
Use a, b, c, d to predict d (t + 1)
The second case:
d = f(a, b, c), where f is an unknown nonlinear function; use a, b, c, d to predict d(t + 1)
Simply concatenate the inputs in an array with the following dimensions:
(number_of_samples, timesteps, number_of_features)
where number_of_features in your case is 4, as you have a, b, c, d. The input_shape of your first layer will be (timesteps, number_of_features).
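Since input_shape suggests Keras, here is a minimal sketch of that layout (the hidden size, number of epochs and random dummy data are arbitrary placeholders):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

timesteps, n_features = 10, 4                      # the 4 features are a, b, c, d
X = np.random.rand(100, timesteps, n_features)     # (number_of_samples, timesteps, number_of_features)
y = np.random.rand(100, 1)                         # target: d at time t+1

model = Sequential([
    LSTM(32, input_shape=(timesteps, n_features)),
    Dense(1)                                       # single output, the forecast of d(t+1)
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=16, verbose=0)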
I'm currently studying Machine Learning but I don't have a statistics background. Everywhere I've seen the logistic function, it has always been:
wx + b
but this example in Theano documentation used:
wx - b
Please which one is it? I'm new to this and I don't want to get confused.
The example on your linked page is not using wx - b. Here is the formula I assume you are referencing:
p_1 = 1 / (1 + T.exp(-T.dot(x, w) - b))
You can break this up into the sigmoid argument and the sigmoid function:
arg = T.dot(x, w) + b # sigmoid argument
p_1 = 1 / (1 + T.exp(-arg)) # sigmoid function
So there are two issues. The first is that you didn't factor the sign of the b variable properly (the formula is indeed using wx + b). The second is that the formula you quoted isn't actually the sigmoid function; rather, it is the argument (a linear weighted sum of the input variables) that is passed to the sigmoid function.
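You can convince yourself numerically that the two ways of writing it agree (a quick NumPy check with made-up numbers):
import numpy as np

x = np.array([0.5, -1.2, 2.0])
w = np.array([0.3, 0.7, -0.1])
b = 0.4

arg = np.dot(x, w) + b                              # the linear part really is wx + b
p_1 = 1 / (1 + np.exp(-arg))
p_1_docs = 1 / (1 + np.exp(-np.dot(x, w) - b))      # the form from the Theano docs
print(np.isclose(p_1, p_1_docs))                    # True, since -(wx + b) = -wx - b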
I was studying probabilistic PCA from Bishop's book, where an EM algorithm is provided to compute the principal subspace.
Here M is an MxM matrix, W is a DxM matrix and (xn − x) is a Dx1 vector.
Later in the book there is statement regarding the time complexity:
"Instead, the most computationally demanding steps are those involving sums over
the data set that are O(NDM)."
I was wondering if anyone can help me understanding the time complexity of the algorithm. Thanks in advance.
Let us go through it one term at a time.
E[zn] = M^-1 W' (xn - x)
M^-1 can be precomputed, so you do not pay O(M^3) every time you need this kind of value, but rather a single O(M^3) cost per EM iteration.
Apart from that, it is a multiplication of matrices of sizes MxM, MxD and Dx1, which is O(M^2 D)
result is of size Mx1
E[zn zn'] = sigma^2 M^-1 + E[zn]E[zn]'
sigma^2 M^-1 is just multiplication by a constant, thus linear in the size of the matrix, O(M^2)
the second operation is an outer product of Mx1 and 1xM vectors, so the result is MxM again and it also takes O(M^2)
result is M x M matrix
Wnew = [SUM (xn-x) E[zn]'] [SUM E[zn zn']]^-1
The first part is an N-times repeated (summed) multiplication of a Dx1 matrix by a 1xM one, so the complexity is O(NDM); the result is of size D x M
The second part is again a sum of N elements, each being an M x M matrix, thus O(NM^2) in total
Finally we invert the M x M matrix, which is O(M^3), and compute the product of the D x M and M x M matrices, which is O(DM^2) and again results in a D x M matrix
sigma^2_new = 1/(ND) SUM[ ||xn-x||^2 - 2 E[zn]' Wnew' (xn-x) + Tr(E[zn zn'] Wnew' Wnew) ]
Again we sum N times, this time over a three-term expression:
the first term is just a squared norm, which we compute in O(D) (linear in the size of the vectors)
the second term is a multiplication of 1 x M, M x D and D x 1 matrices, costing O(MD) per sample, thus O(NMD) in total
the last term multiplies matrices of sizes M x M, M x D and D x M, which done naively is roughly O(M^2 D) per sample; but you only need the trace, and you can precompute Wnew'Wnew once, so per sample this reduces to the trace of a product of two MxM matrices, which is O(M^2), thus O(NM^2) in total
In total you get O(M^3) + O(NMD) + O(M^2 D) + O(NM^2), and I suppose there is an assumption that M <= D <= N, hence the dominant term is O(NDM)
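To make the bookkeeping concrete, here is a minimal NumPy sketch of one EM iteration, assuming Bishop's notation (X is N x D with rows xn, W is D x M, x is the sample mean); the comments mark where each of the costs above shows up:
import numpy as np

def ppca_em_step(X, W, sigma2):
    # One EM iteration of probabilistic PCA; the dominant work is the O(NDM) products.
    N, D = X.shape
    M_dim = W.shape[1]
    Xc = X - X.mean(axis=0)                           # centred data (xn - x), shape (N, D)

    # E-step: M = W'W + sigma^2 I is M x M, inverted once -> O(M^3)
    M = W.T @ W + sigma2 * np.eye(M_dim)
    Minv = np.linalg.inv(M)
    Ez = Xc @ W @ Minv                                # rows are E[zn]'; costs O(NDM) + O(NM^2)
    sum_Ezz = N * sigma2 * Minv + Ez.T @ Ez           # SUM E[zn zn']; costs O(NM^2)

    # M-step
    W_new = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)      # O(NDM) + O(M^3) + O(DM^2)
    WtW = W_new.T @ W_new                             # precomputed once, O(DM^2)
    sigma2_new = (np.sum(Xc ** 2)                     # SUM ||xn - x||^2, O(ND)
                  - 2 * np.sum((Ez @ W_new.T) * Xc)   # SUM E[zn]' Wnew' (xn - x), O(NDM)
                  + np.sum(sum_Ezz * WtW)             # SUM Tr(E[zn zn'] Wnew'Wnew), a single O(M^2) trace
                  ) / (N * D)
    return W_new, sigma2_new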
I've constructed a decision tree that treats every sample as equally weighted. Now I want to construct a decision tree that gives different weights to different samples. Is the only change I need to make in computing the expected entropy before calculating the information gain? I'm a little confused about how to proceed; please explain.
For example: consider a node containing p positive samples and n negative samples, so the node's entropy will be -p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n)). Now suppose a split is found that divides the parent node into two child nodes, with child 1 containing p' positives and n' negatives (so child 2 contains p-p' and n-n'). For child 1 we calculate the entropy just as for the parent, and take the probability of reaching it, i.e. (p'+n')/(p+n). The expected reduction in entropy is then entropy(parent) - (prob of reaching child1 * entropy(child1) + prob of reaching child2 * entropy(child2)), and the split with the maximum information gain is chosen.
Now, how do I do this same procedure when we have weights available for each sample? What changes need to be made? And what changes are needed specifically for AdaBoost (using stumps only)?
(I guess this is the same idea as in some of the comments, e.g. @Alleo's.)
Suppose you have p positive examples and n negative examples. Let's denote the weights of examples to be:
a1, a2, ..., ap ---------- weights of the p positive examples
b1, b2, ..., bn ---------- weights of the n negative examples
Suppose
a1 + a2 + ... + ap = A
b1 + b2 + ... + bn = B
As you pointed out, if the examples have unit weights, the entropy would be:
- p/(p+n) log(p/(p+n)) - n/(p+n) log(n/(p+n))
Now you only need to replace p with A and replace n with B and then you can obtain the new instance-weighted entropy.
- A/(A+B) log(A/(A+B)) - B/(A+B) log(B/(A+B))
Note: nothing fancy here. All we did was figure out the weighted importance of the groups of positive and negative examples. When examples are equally weighted, the importance of the positive examples is proportional to the ratio of the number of positives to the number of all examples. When examples are unequally weighted, we just take a weighted average to get the importance of the positive examples.
Then you follow the same logic to choose the attribute with largest Information Gain by comparing entropy before splitting and after splitting on an attribute.
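A small sketch of the weighted version, if it helps (the function names and the 0/1 label encoding are my own choices, not from any particular library):
import numpy as np

def weighted_entropy(weights, labels):
    # A = total weight of the positive examples, B = total weight of the negative examples
    w = np.asarray(weights, dtype=float)
    y = np.asarray(labels)
    A = w[y == 1].sum()
    B = w[y == 0].sum()
    if A == 0 or B == 0:
        return 0.0                      # a pure node has zero entropy
    pA, pB = A / (A + B), B / (A + B)
    return -pA * np.log2(pA) - pB * np.log2(pB)

def information_gain(weights, labels, in_child1):
    # in_child1 is a boolean mask selecting the samples routed to child 1
    w = np.asarray(weights, dtype=float)
    y = np.asarray(labels)
    parent = weighted_entropy(w, y)
    p1 = w[in_child1].sum() / w.sum()   # "probability" of reaching child 1, measured by weight
    p2 = 1.0 - p1
    return parent - (p1 * weighted_entropy(w[in_child1], y[in_child1])
                     + p2 * weighted_entropy(w[~in_child1], y[~in_child1]))
For AdaBoost with stumps, you would recompute these gains with the current boosting weights at each round and pick the single best split.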