Unexpected behavior at torch7

I was generating an XOR gate dataset with torch7, but when I printed the dataset I saw that the data was wrong, and I could not find the bug. There seems to be nothing wrong with the code, but I'm new to torch, so I may have made a mistake.
Here is my code:
input = torch.Tensor(4,2)
input:random(0,1)
output = torch.Tensor(1)
dataset={};
function dataset:size() return 4 end
for i=1,dataset:size() do
    if input[i][1]==input[i][2] then
        output[1] = 0
    else
        output[1] = 1
    end
    print("original")
    print(input[i][1].." "..input[i][2].." "..output[1]) -- the values that are going to dataset
    dataset[i] = {input[i], output}
    print("dataset")
    print(dataset[i][1][1].." "..dataset[i][1][2].." "..dataset[i][2][1]) -- for double checking i read from dataset again
end
print("Why dataset is different now?")
for i=1,4 do
    print(dataset[i][1][1].." "..dataset[i][1][2].." "..dataset[i][2][1]) -- So, why this is different?
end
As you can see, I print the values as they are inserted into the dataset table and, as a double check, immediately read them back from the dataset.
Finally, I read the dataset again after all insertions. Somehow the dataset was different. I ran the code a couple of times, and every time the output column differed from the original, as if it were stuck on 1 or 0.
Here is my output:
original
1 0 1
dataset
1 0 1
original
0 0 0
dataset
0 0 0
original
1 1 0
dataset
1 1 0
original
0 0 0
dataset
0 0 0
Why dataset is different now?
1 0 0
0 0 0
1 1 0
0 0 0
The format of each line is:
input input output
The lines under "original" are read from input[i] and output.
The lines under "dataset" are read back from the dataset right after insertion.
You can see that the first row is different in the final printout: it should be 1 0 1, but it is 1 0 0.
I could not find the bug in my code. Can anyone help? If the question is not clear, please let me know.

The problem is here: dataset[i] = {input[i], output}
You are not saving the calculated result; you are saving a reference to the single output tensor, whose content is overwritten by each subsequent XOR calculation. Naturally, when you read the results back afterwards, every entry shows the same number: the last value written to output[1].
To fix it, either make output hold the actual temporary value (a plain number rather than a tensor), or at least read the actual value out of output when saving to dataset[i]. Do not just store a reference to the same object; you won't get a deep copy that way.
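The same aliasing pitfall can be sketched outside torch7. The snippet below is Python rather than Lua, and it uses a one-element list as a stand-in for the one-element output tensor; the function names build_dataset_shared and build_dataset_copied are made up for this illustration. The inputs are the four rows from the question's printout.

```python
def build_dataset_shared(inputs):
    """Reproduces the bug: every entry points at the same mutable container."""
    output = [0]  # stand-in for torch.Tensor(1), created once and reused
    dataset = []
    for a, b in inputs:
        output[0] = 0 if a == b else 1
        dataset.append(((a, b), output))  # stores a reference, not a copy
    return dataset


def build_dataset_copied(inputs):
    """One possible fix: create a fresh container for every entry."""
    dataset = []
    for a, b in inputs:
        dataset.append(((a, b), [0 if a == b else 1]))
    return dataset


inputs = [(1, 0), (0, 0), (1, 1), (0, 0)]
shared = [entry[1][0] for entry in build_dataset_shared(inputs)]
copied = [entry[1][0] for entry in build_dataset_copied(inputs)]
print(shared)  # [0, 0, 0, 0]  every row shows the last value written
print(copied)  # [1, 0, 0, 0]  each row keeps its own XOR result
```

In torch7 itself, an analogous fix would be to store output:clone() in dataset[i], or to create a new tensor inside the loop.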

Related

Padding time-series subsequences for LSTM-RNN training

I have a dataset of time series that I use as input to an LSTM-RNN for action anticipation. The time series comprises a time of 5 seconds at 30 fps (i.e. 150 data points), and the data represents the position/movement of facial features.
I sample additional sub-sequences of smaller length from my dataset in order to add redundancy in the dataset and reduce overfitting. In this case I know the starting and ending frame of the sub-sequences.
In order to train the model in batches, all time series need to have the same length, and according to many papers in the literature padding should not affect the performance of the network.
Example:
Original sequence:
1 2 3 4 5 6 7 8 9 10
Subsequences:
4 5 6 7
8 9 10
2 3 4 5 6
My network is trying to anticipate an action, meaning that as it goes from t = 0 to T = tmax, it predicts the action as soon as P(action) > threshold. Given that, will it matter where the padding goes?
Option 1: Zeros go to substitute original values
0 0 0 4 5 6 7 0 0 0
0 0 0 0 0 0 0 8 9 10
0 2 3 4 5 6 0 0 0 0
Option 2: all zeros at the end
4 5 6 7 0 0 0 0 0 0
8 9 10 0 0 0 0 0 0 0
2 3 4 5 6 0 0 0 0 0
Moreover, some of the time series are missing a number of frames, but it is not known which ones they are - meaning that if we only have 60 frames, we don't know whether they are taken from 0 to 2 seconds, from 1 to 3s, etc. These need to be padded before the subsequences are even taken. What is the best practice for padding in this case?
Thank you in advance.

The most powerful attribute of LSTMs, and of RNNs in general, is that their parameters are shared across time steps (the parameters recur over time). This sharing relies on the assumption that the same parameters can be used at every time step, i.e. that the relationship between the previous time step and the next one does not depend on t, as explained here in page 388, 2nd paragraph.
In short, padding zeros at the end should theoretically not change the accuracy of the model. I say theoretically because at each time step the LSTM's decision depends, among other factors, on its cell state, which is a kind of short summary of the past frames. As far as I understand, those past frames may be missing in your case, so what you have here is a small trade-off.
I would rather pad zeros at the end, because it doesn't completely conflict with the underlying assumption of RNNs and it is more convenient to implement and keep track of.
On the implementation side, assuming you implement option 2, I know TensorFlow calculates the loss function once you give it the sequences and the actual length of each sample (e.g. for 4 5 6 7 0 0 0 0 0 0 you also need to give it the actual length, which is 4 here). I don't know whether there is an implementation for option 1, though.
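As a rough illustration of option 2, the sketch below pads each subsequence at the end and records its true length alongside it, which is the extra information a framework needs in order to ignore the padded steps. It is plain Python, not TensorFlow code; the helper name pad_sequences and its parameters are assumptions made for this example.

```python
def pad_sequences(sequences, max_len, side="post", pad_value=0):
    """Pad every sequence to max_len and return (padded, true_lengths).

    side="post" appends the padding (option 2 in the question);
    side="pre" puts the padding in front of the data instead.
    """
    if side not in ("post", "pre"):
        raise ValueError("side must be 'post' or 'pre'")
    padded, lengths = [], []
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError("sequence is longer than max_len")
        fill = [pad_value] * (max_len - len(seq))
        padded.append(list(seq) + fill if side == "post" else fill + list(seq))
        lengths.append(len(seq))
    return padded, lengths


subsequences = [[4, 5, 6, 7], [8, 9, 10], [2, 3, 4, 5, 6]]
post_padded, true_lengths = pad_sequences(subsequences, max_len=10)
print(post_padded[0])  # [4, 5, 6, 7, 0, 0, 0, 0, 0, 0]
print(true_lengths)    # [4, 3, 5]
```

Keeping the lengths separate from the padded data means the padding value itself never has to be distinguishable from a genuine zero in the signal.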

It is better to pad zeros at the beginning, as the paper Effects of padding on LSTMs and CNNs suggests:
"Though post padding model peaked it’s efficiency at 6 epochs and started to overfit after that, it’s accuracy is way less than pre-padding."
Check Table 1 of the paper, where the accuracy with pre-padding (zeros at the beginning) is around 80%, while with post-padding (zeros at the end) it is only around 50%.

In case you have sequences of variable length, PyTorch provides the utility function torch.nn.utils.rnn.pack_padded_sequence. The general workflow with this function is
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
embedding = nn.Embedding(4, 5)
rnn = nn.GRU(5, 5)
sequences = torch.tensor([[1,2,0], [3,0,0], [2,1,3]]) # each row is zero-padded to length 3
lens = [2, 1, 3] # indicating the actual length of each sequence
embeddings = embedding(sequences)
packed_seq = pack_padded_sequence(embeddings, lens, batch_first=True, enforce_sorted=False)
e, hn = rnn(packed_seq)
One can recover the padded per-token outputs of the GRU by
e, out_lens = pad_packed_sequence(e, batch_first=True) # returns the padded outputs and the sequence lengths
Using this function is better than handling the padding yourself, because torch will limit the RNN to inspecting only the actual sequence and stop before the padded tokens.

Minimize the maximum contiguous subarray in an array of 0/1

Algorithm question:
A binary array of 0/1 is given.
In one operation I can flip any element array[index], i.e. 0->1 or 1->0.
The aim is to minimize the maximum length of a contiguous run of 1's or 0's using at most k flips.
E.g. if the array is 11111 and k=1, the best is to make the array 11011,
and the minimized maximum length of contiguous 1's or 0's is 2.
For 111110111111 and k=3 the answer is 2.
I tried brute force (trying various flip positions), but it is not efficient.
I think a greedy approach could work, but I cannot figure it out exactly.
Can you please help me with an algorithm, O(n) or similar?

A solution could be devised by reading each bit in order and recording the size of each continuous group of 1 into a list A.
Once you are done filling A, you can follow the algorithm narrated by the pseudocode below:
result = N
for i = 1 to N
    flips_needed = 0
    for a in A:
        flips_needed += <number of flips needed to make sure the largest group remaining in a is of size at most i>
    if k >= flips_needed:
        result = i
        break
return result
N is the number of bits in the entire initial sequence.
The algorithm above works by dividing the groups of 1 into groups of size at most i. Because i starts from 1 and goes up, the first i for which this requires <= k flips is the result we are looking for (i.e. once we find flips_needed <= k, we know the groups of 1 are as small as they can get).
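A possible Python rendering of the idea above is sketched below. Two details are my own additions and not part of the answer: runs of 0 are counted as well as runs of 1 (the question asks about both), and the target length 1 is checked separately against the two alternating patterns, because a flipped bit inside a run can merge with a neighbouring run when the limit is 1. For a limit of at least 2, the sketch assumes that a run of length r needs r // (limit + 1) flips. The function name min_longest_run is hypothetical.

```python
from itertools import groupby


def min_longest_run(bits: str, k: int) -> int:
    """Smallest achievable length of the longest run of equal bits
    after at most k flips (sketch, see the stated assumptions)."""
    n = len(bits)
    if n == 0:
        return 0

    # Limit 1: the result must be 0101... or 1010...; count the mismatches.
    mismatches = sum(1 for idx, b in enumerate(bits) if b != "01"[idx % 2])
    if min(mismatches, n - mismatches) <= k:
        return 1

    # Lengths of the maximal runs of equal bits, e.g. "11011" -> [2, 1, 2].
    runs = [len(list(group)) for _, group in groupby(bits)]

    # Try limits in increasing order; the first feasible one is the answer.
    for limit in range(2, n + 1):
        flips_needed = sum(r // (limit + 1) for r in runs)
        if flips_needed <= k:
            return limit
    return n


print(min_longest_run("11111", 1))         # 2
print(min_longest_run("111110111111", 3))  # 2
```

The loop over limits costs O(N) per limit; since feasibility is monotone in the limit, a binary search over the limit would reduce the total work to O(N log N).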

Neural Network Character Recognition

Suppose I'm trying to create a Neural Network to recognize characters on a simple 5x5 grid of pixels. I have only 6 possible characters (symbols) - X,+,/,\,|
At the moment I have a Feedforward Neural Network - with 25 input nodes, 6 hidden nodes and a single output node (between 0 and 1 - sigmoid).
The output corresponds to a symbol. Such as 'X' = 0.125, '+' = 0.275, '/' = 0.425 etc.
On testing, whatever the network outputs is mapped to the character whose value is numerically closest, e.g. 0.13 = 'X'.
On Input, 0.1 means the pixel is not shaded at all, 0.9 means fully shaded.
After training the network on the 6 symbols I test it by adding some noise.
Unfortunately, if I add a tiny bit of noise to '/', the network thinks it's '\'.
I thought maybe the ordering of the 6 symbols (i.e. which numeric representation each one corresponds to) might make a difference.
Maybe the number of hidden nodes is causing this problem.
Maybe my general concept of mapping characters to numbers is causing the problem.
Any help would be hugely appreciated to make the network more accurate.

The output encoding is the biggest problem. You should use a one-hot encoding for the output instead, so that you have six output nodes, one per symbol.
For example,
- 1 0 0 0 0 0
X 0 1 0 0 0 0
+ 0 0 1 0 0 0
/ 0 0 0 1 0 0
\ 0 0 0 0 1 0
| 0 0 0 0 0 1
This is much easier for the neural network to learn. At prediction time, pick the node that has the highest value as your prediction. For example, if you have the following output values at the output nodes:
- 0.01
X 0.5
+ 0.2
/ 0.1
\ 0.2
| 0.1
Predict the character as "X".
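The encoding and the prediction rule described above can be sketched in a few lines of Python. The names SYMBOLS, one_hot and predict are made up for this illustration, and the symbol order simply follows the table above.

```python
SYMBOLS = ["-", "X", "+", "/", "\\", "|"]


def one_hot(symbol):
    """Target vector for training: 1 at the symbol's position, 0 elsewhere."""
    return [1 if s == symbol else 0 for s in SYMBOLS]


def predict(outputs):
    """Return the symbol whose output node has the highest value."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return SYMBOLS[best]


print(one_hot("/"))                             # [0, 0, 0, 1, 0, 0]
print(predict([0.01, 0.5, 0.2, 0.1, 0.2, 0.1]))  # X
```

With this encoding no symbol is numerically "next to" another, so a slightly noisy '/' no longer drifts into the value range of '\'.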

Handling features not correlated with output prediction?

I am doing regression analysis with multiple features (20-23 of them). For now, I check each feature's correlation with the output variable. Some features show a correlation coefficient close to 1 or -1 (highly correlated), and some show a coefficient near 0. My question is: do I have to remove a feature if its correlation coefficient is close to 0? Or can I keep it, with the only drawback being that the feature will have no noticeable effect, or only a faint effect, on the regression model? Or is removing such features obligatory?

In short
High (absolute) correlation between a feature and the output implies that this feature should be valuable as a predictor.
Lack of correlation between a feature and the output implies nothing.
More details
Pair-wise correlation only shows you how one variable relates (linearly) to another; it says nothing at all about how useful the feature is in combination with others. So if your model is not trivial, you should not drop variables just because they are not correlated with the output. Here is an example that shows why.
Consider the following sample: we have 2 features (X, Y) and one output value (Z; say red is 1, black is 0).
X Y Z
1 1 1
1 2 0
1 3 0
2 1 0
2 2 1
2 3 0
3 1 0
3 2 0
3 3 1
Let us compute the correlations:
CORREL(X, Z) = 0
CORREL(Y, Z) = 0
So... should we drop all the variables? One of them? If we drop either variable, our problem becomes completely impossible to model! The "magic" lies in the fact that there is actually a "hidden" relation in the data.
|X-Y|
0
1
2
1
0
1
2
1
0
And
CORREL(|X-Y|, Z) = -0.8528028654
Now this is a good predictor!
You can actually get a perfect regressor (interpolator) through
Z = 1 - sign(|X-Y|)
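The numbers in this example can be checked with a short script. The sketch below implements the Pearson correlation by hand (the helper names pearson and sign are made up for this illustration) and applies it to the nine rows of the table above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def sign(v):
    """-1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


X = [1, 1, 1, 2, 2, 2, 3, 3, 3]
Y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
Z = [1, 0, 0, 0, 1, 0, 0, 0, 1]
D = [abs(x - y) for x, y in zip(X, Y)]  # the "hidden" feature |X-Y|

print(round(pearson(X, Z), 10))  # zero correlation (up to rounding)
print(round(pearson(Y, Z), 10))  # zero correlation (up to rounding)
print(round(pearson(D, Z), 10))  # -0.8528028654

# The perfect interpolator Z = 1 - sign(|X-Y|) reproduces every row.
print([1 - sign(d) for d in D] == Z)  # True
```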

Difference-in-difference analysis in SPSS

I am trying to compare means of the two groups 'single mothers with one child' and 'single mothers with more than one child' before and after the reform of the EITC system in 1993.
Through the procedure T-test in SPSS, I can get the difference between groups before and after the reform. But how do I get the difference of the difference (I still want standard errors)?
I found these methods for STATA and R (http://thetarzan.wordpress.com/2011/06/20/differences-in-differences-estimation-in-r-and-stata/), but I can't seem to figure it out in SPSS.
Hope someone will be able to help.
All the best,
Anne

This can be done with the GENLIN procedure. Here's some random data I generated to show how:
data list list /after oneChild value.
begin data.
0 1 12
0 1 12
0 1 11
0 1 13
0 1 11
1 1 10
1 1 9
1 1 8
1 1 9
1 1 7
0 0 16
0 0 16
0 0 18
0 0 15
0 0 17
1 0 6
1 0 6
1 0 5
1 0 5
1 0 4
end data.
dataset name exampleData WINDOW=front.
EXECUTE.
value labels after 0 'before' 1 'after'.
value labels oneChild 0 '>1 child' 1 '1 child'.
The group means (before I truncated the values to integers) are 17 (>1 child, before), 6 (>1 child, after), 12 (1 child, before), and 9 (1 child, after). So our GENLIN procedure should generate values of -11 (the after-before difference in the >1 child group), -5 (the 1 child minus >1 child difference before the reform), and 8 (the difference between the two groups' after-before differences).
To graph the data, just so you can see what we're expecting:
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=after value oneChild MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: after=col(source(s), name("after"), unit.category())
DATA: value=col(source(s), name("value"))
DATA: oneChild=col(source(s), name("oneChild"), unit.category())
GUIDE: axis(dim(2), label("value"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label(""))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(smooth.linear(after*value)), color.interior(oneChild))
ELEMENT: point.dodge.symmetric(position(after*value), color.interior(oneChild))
END GPL.
Now, for the GENLIN:
* Generalized Linear Models.
GENLIN value BY after oneChild (ORDER=DESCENDING)
/MODEL after oneChild after*oneChild INTERCEPT=YES
DISTRIBUTION=NORMAL LINK=IDENTITY
/CRITERIA SCALE=MLE COVB=MODEL PCONVERGE=1E-006(ABSOLUTE) SINGULAR=1E-012 ANALYSISTYPE=3(WALD)
CILEVEL=95 CITYPE=WALD LIKELIHOOD=FULL
/MISSING CLASSMISSING=EXCLUDE
/PRINT CPS DESCRIPTIVES MODELINFO FIT SUMMARY SOLUTION.
The results table shows just what we expect.
The >1 child group is between 10.1 and 12.3 lower after vs. before. This 95% CI contains the "real" value of 11.
The before difference between >1 child and 1 child is between 3.5 and 5.7, containing the real value of 5.
The difference-of-differences is between 6.4 and 9.6, containing the real value of (17-6) - (12-9) = 8.
Std. errors, p values, and the other hypothesis testing values are all reported as well. Hope that helps.
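The point estimates themselves are plain arithmetic on the four cell means, which can be checked by hand. The sketch below does this in Python rather than SPSS, using the integer example data listed above; it computes only the estimates, not the standard errors, and the variable names are made up for this illustration.

```python
from statistics import mean

# Example data from above, keyed by (after, oneChild).
values = {
    (0, 1): [12, 12, 11, 13, 11],  # before, 1 child
    (1, 1): [10, 9, 8, 9, 7],      # after, 1 child
    (0, 0): [16, 16, 18, 15, 17],  # before, >1 child
    (1, 0): [6, 6, 5, 5, 4],       # after, >1 child
}
cell_mean = {key: mean(vals) for key, vals in values.items()}

# After-before change within each group.
change_many = cell_mean[(1, 0)] - cell_mean[(0, 0)]  # >1 child group
change_one = cell_mean[(1, 1)] - cell_mean[(0, 1)]   # 1 child group

# Group difference before the reform (1 child minus >1 child).
before_gap = cell_mean[(0, 1)] - cell_mean[(0, 0)]

# Difference-in-differences: how much more the 1 child group changed.
did = change_one - change_many

print(round(change_many, 1))  # -11.2
print(round(before_gap, 1))   # -4.6
print(round(did, 1))          # 8.0
```

In magnitude, these estimates (11.2, 4.6 and 8.0) sit at the midpoints of the confidence intervals quoted above; they differ slightly from 11 and 5 because the data were truncated to integers.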
EDIT: this can be done with less "complicated" syntax by computing the interaction term yourself and doing simple linear regression:
compute interaction = after*onechild.
execute.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS CI(95) R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT value
/METHOD=ENTER after oneChild interaction.
Note that the resulting standard errors and confidence intervals are actually different from the previous method. I don't know enough about SPSS's GENLIN and REGRESSION procedures to tell you why that's the case. In this contrived example, the conclusion you'd draw from your data would be approximately the same. In real life, the data aren't likely to be this clean, so I don't know which method is "better".

As I understand it, the General Linear Model is an 'ANOVA' model,
so use the corresponding module in SPSS's Analyze menu.
After the t-test, you also need to check the equality of variances (sigma) across the groups.

Regarding the first answer above:
* Note that GENLIN uses maximum likelihood estimation (MLE) whereas REGRESSION
* uses ordinary least squares (OLS). Therefore, GENLIN reports z- and Chi-square tests
* where REGRESSION reports t- and F-tests. Rather than using GENLIN, use UNIANOVA
* to get the same results as REGRESSION, but without the need to compute your own
* product term.
UNIANOVA value BY after oneChild
/PLOT=PROFILE(after*oneChild)
/PLOT=PROFILE(oneChild*after)
/PRINT PARAMETER
/EMMEANS=TABLES(after*oneChild) COMPARE(after)
/EMMEANS=TABLES(after*oneChild) COMPARE(oneChild)
/DESIGN=after oneChild after*oneChild.
HTH.
