How to interpret GAN training and improve output?

After a few tries, I trained a GAN to produce semi-sensible output. In this model, it almost instantly found a solution and got stuck there. The loss for both the discriminator and the generator was 0.68 (I used a BCE loss), and the accuracy of both settled around 50%. At first glance the generator's output looked good enough to be real data, but after analysing it I could see it was still not very good.
My solution was to increase the power of the discriminator (I increased its size) and re-train, hoping that a larger discriminator would force the generator to create better samples. I got the following output.
It seems that as the generator's loss increases and it produces worse samples, the discriminator can pick them out more easily.
When I check the output from the trained generator, it follows some of the basic rules the real data follows, but under closer scrutiny the samples fail more complex tests that the real data would pass. I would like to improve this.
My questions are:
Is my above interpretation of the plots correct?
For this run, have I made the discriminator too powerful? Should I increase the power of the generator instead?
Is there another technique I should investigate to stop this form of mode collapse?
EDIT: The architecture I am using is a form of Graph GAN. The generator is just a series of linear layers. The discriminator is 3 graph convolution layers followed by some linear layers, loosely similar to this paper. Two potentially unconventional things I am doing:
There is no batch normalisation; I have found it has a very negative effect on training, though I could try to persevere with it.
I am using StandardScaler to scale my data. I chose it because it makes unscaling easy: I can take the output of the generator and transform it back to the original scale (see the sketch below). However, StandardScaler does not scale values to between -1 and 1, so I cannot use tanh as the final activation function of my generator; instead, the final layer of the generator is just linear.
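For reference, a minimal sketch of the scaling round trip I mean, where real_data and generated are placeholders for my flattened real samples and the generator's output (both shaped (n_samples, n_features)):

from sklearn.preprocessing import StandardScaler

# real_data / generated are placeholders, shaped (n_samples, n_features)
scaler = StandardScaler()
real_scaled = scaler.fit_transform(real_data)                    # zero mean, unit variance per feature
# ... train the GAN on real_scaled ...
generated_original_scale = scaler.inverse_transform(generated)   # back to the original scale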
The outputs of the GAN (once unscaled and reshaped) are similar to:
[[ 46.09169 -25.462175 20.705683 -31.696495 ]
[ 35.10637 -18.956036 15.20579 -24.803787 ]
[ 10.253135 -5.759581 5.9068713 -6.3003526]]
An example of the truth is:
[[ 45.6 30.294546 -17.218746 -29.41284 ]
[ 1.8186008 1.7064333 0.5984112 0.19312467]
[ 44.31433 28.234058 -17.615921 -29.262213 ]]
Notably, the top-left value in the matrix will always be 45.6. My Generator does not even consistently produce this.

What is the purpose of having the same input and output in PyTorch nn.Linear function?

I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As Wikipedia puts it:
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. The encoding is validated and refined by attempting to regenerate the input from the encoding.
In other words, the idea of the autoencoder is to learn an identity function. This identity function will be learned only for particular inputs (i.e. those without anomalies). From this, two points follow:
1. The input will have the same dimensions as the output.
2. Autoencoders are (generally) built to learn the essential features of the input.
Because of point (1), the autoencoder is built as a series of layers (e.g. nn.Linear() or convolutional layers) whose final output has the same dimensions as the input.
Because of point (2), you generally have an Encoder that compresses the information (in your code snippet, from 28x28 down to 10) and a Decoder that decompresses it (10 -> 28x28). In most implementations of this architecture, the latent-space dimensionality (10) is much smaller than the input (28x28). With the end goal of the Encoder clear, you can see that an intermediate layer may temporarily expand the representation (nn.Linear(28*28, 512)) before the series of layers compresses it down to the final output (10).
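As a minimal sketch of such an encoder/decoder pair (this is not the tutorial's code; the 512 hidden size and the latent size of 10 are illustrative assumptions):

import torch
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 10),       # latent representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(10, 512),
            nn.ReLU(),
            nn.Linear(512, 28 * 28),  # reconstruct the flattened input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training minimises a reconstruction loss, e.g. nn.MSELoss()(model(x), x).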
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
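A tiny sketch of that smaller model, just to make it concrete (the size of 4 is an arbitrary assumption):

import torch
from torch import nn

# Linear --> Sigmoid --> Linear: the hidden activations are squashed into (0, 1),
# and the affine output layer cannot exactly "unsquash" them for arbitrary inputs.
model = nn.Sequential(
    nn.Linear(4, 4),
    nn.Sigmoid(),
    nn.Linear(4, 4),
)
x = torch.randn(8, 4)
print(model(x).shape)  # torch.Size([8, 4]): same shape as the input, but not the identity map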
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.

Keras Conv1D on ECG Signal

I am trying to classify different ECG signals. I am using Keras' Conv1D, but am not getting any good results.
I have tried changing the number of layers, window size, etc., but every time I run this I get predictions that are all the same class (the classes are 0, 1, 2, so I get a prediction output like [1,1,1,1,1,1,1,1,1,1,1,1,1,1], though the predicted class changes each time I run the script).
The ECG signals are 1000-point numpy arrays.
Are there any glaringly obvious things I am doing wrong here? I thought a few convolutional layers would work well for classifying three different ECG signal types.
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dense
from keras.utils import to_categorical

#arrange and randomize data
y1=[[0]]*len(lead1)
y2=[[1]]*len(lead2)
y3=[[2]]*len(lead3)
y=np.concatenate((y1,y2,y3))
data=np.concatenate((lead1,lead2,lead3))
data = keras.utils.normalize(data)
data=np.concatenate((data,y),axis=1)
data=np.random.permutation((data))
print(data)
#separate data and create categories
Xtrain=data[0:130,0:-1]
Xtrain=np.reshape(Xtrain,(len(Xtrain),1000,1))
Xpred=data[130:,0:-1]
Xpred=np.reshape(Xpred,(len(Xpred),1000,1))
Ytrain=data[0:130,-1]
Yt=to_categorical(Ytrain)
Ypred=data[130:,-1]
Yp=to_categorical(Ypred)
#create CNN model
model = Sequential()
model.add(Conv1D(20,20,activation='relu',input_shape=(1000,1)))
model.add(MaxPooling1D(3))
model.add(Conv1D(20,10,activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(20,10,activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dense(3,activation='relu',use_bias=False))
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(Xtrain,Yt)
#test model
print(model.evaluate(Xpred,Yp))
print(model.predict_classes(Xpred,verbose=1))
Are there any glaringly obvious things I am doing wrong here?
Indeed there is: the output you report is not surprising, given that you are currently using the ReLU as activation for your last layer, which does not make any sense.
In multi-class settings, such as yours, the activation of the last layer must be the softmax, and certainly not the ReLU; change your last layer to:
model.add(Dense(3, activation='softmax'))
Not quite sure why you ask for use_bias=False, but you can try both with and without it and experiment...

Scikit_learn's PolynomialFeatures with logistic regression resulting in lower scores

I have a dataset X whose shape is (1741, 61). Using logistic regression with cross-validation I was getting around 62-65% accuracy for each split (cv=5).
I thought that if I added polynomial features, the accuracy was supposed to increase. However, I'm getting the opposite effect (each cross-validation split is now in the 40s, percentage-wise), so I'm presuming I'm doing something wrong when trying to make the data quadratic?
Here is the code I'm using,
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
poly_x =poly.fit_transform(X_scaled)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(penalty='l2', max_iter=200)
from sklearn.cross_validation import cross_val_score
cross_val_score(classifier, poly_x, y, cv=5)
array([ 0.46418338, 0.4269341 , 0.49425287, 0.58908046, 0.60518732])
Which makes me suspect, I'm doing something wrong.
I tried generating the polynomial features from the raw data first and then using preprocessing.scale to scale them, but that produced the following warning:
UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
So I didn't bother going this route.
The other thing that's bothering me is the speed of the polynomial computations. cross_val_score takes around a couple of hours to output the score when using polynomial features. Is there any way to speed this up? I have an Intel i5-6500 CPU with 16 GB of RAM, running Windows 7.
Thank you.
Have you tried using MinMaxScaler instead of standard scaling? Standard scaling outputs values both above and below 0, so you run into situations where a scaled value of -0.1 and a scaled value of 0.1 have the same squared value, despite not really being similar at all. Intuitively this seems like something that would lower the score of a polynomial fit. That said, I haven't tested this; it's just my intuition. Furthermore, be careful with polynomial fits. I suggest reading this answer to "Why use regularization in polynomial regression instead of lowering the degree?". It's a great explanation and will likely introduce you to some new techniques. As an aside, @MatthewDrury is an excellent teacher and I recommend reading all of his answers and blog posts.
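A minimal sketch of that suggestion (untested on your data; X and y are the arrays from your question, and I'm using the modern sklearn.model_selection module):

from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Scale to [0, 1] before expanding, so squared terms keep their ordering,
# then fit an L2-regularised logistic regression on the expanded features.
pipeline = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(penalty='l2', max_iter=200),
)
print(cross_val_score(pipeline, X, y, cv=5))

Fitting the scaler and the polynomial expansion inside the pipeline also keeps each cross-validation split free of leakage from the held-out fold.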
You state that "the accuracy is supposed to increase" with polynomial features. That is true only if the polynomial features bring the model closer to the original data-generating process. Polynomial features, especially when every feature is interacted and raised to a power, may move the model further from the data-generating process; hence worse results may be entirely expected.
By using a degree-3 polynomial in scikit-learn, the X matrix went from (1741, 61) to (1741, 41664), which is significantly more columns than rows.
41k+ columns will take much longer to solve. You should be looking at feature selection methods. As Grr says, investigate lowering the polynomial degree. Try L1 regularisation, group lasso, RFE, or Bayesian methods. Ask SMEs (subject-matter experts who may be able to identify specific features that genuinely behave polynomially). Plot the data to see which features may interact or benefit from polynomial terms.
I have not looked at it for a while, but I recall discussions of hierarchically well-formulated models (can you remove x1 but keep the x1 * x2 interaction?). That is probably worth investigating if your model turns out to behave best with an ill-formulated hierarchy.
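A hedged sketch of one of those options, L1-based feature selection on the expanded features (the pipeline pieces and parameters here are illustrative assumptions, not tuned values):

from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

# An L1-penalised logistic regression drives many coefficients to zero;
# SelectFromModel keeps only the surviving polynomial/interaction terms
# before the final L2-regularised classifier is fitted.
selection = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
)
pipeline = make_pipeline(
    MinMaxScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    selection,
    LogisticRegression(penalty='l2', max_iter=200),
)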

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note: I'm aware it's probably a bad choice, but I wanted to start with linear regression to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve:
This result is particularly surprising to me, so I have some questions about it:
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot:
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve:
S = ['Reg with '];
for i = startPoint:stepSize:m
    temp_X = X(1:i,:);
    temp_y = y(1:i);
    % Initialize Theta
    initial_theta = zeros(size(X, 2), 1);
    % Create "short hand" for the cost function to be minimized
    costFunction = @(t) linearRegCostFunction(X, y, t, lambda);
    % Now, costFunction is a function that takes in only one argument
    options = optimset('MaxIter', 50, 'GradObj', 'on');
    % Minimize using fmincg
    theta = fmincg(costFunction, initial_theta, options);
    [J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
    error_train = [error_train; [i J]];
    [J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
    error_val = [error_val; [i J]];
    fprintf('%s %6i examples \r', S, i);
    fflush(stdout);
end
Edit: if I shuffle the whole dataset before splitting into train/validation sets and computing the learning curve, I get very different results, like the three following:
Note: the training set size is always around 24k examples, and the validation set around 8k examples.
Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be consistently picking the hard-to-predict instances for the training set and the easy ones for the test set. Make sure you shuffle your data, and use 10-fold cross-validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd-degree polynomial, and you're using linear regression. This means that the more data you add, the more obvious it will be that your model is inadequate (higher training error). Now, if you choose only a few instances for the test set, the error will be smaller, because linear vs. 3rd-degree might not show a big difference on so few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough, or your train and test sets might not be properly randomized. You should shuffle the data and use 10-fold cross-validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you use 10-fold cross-validation for your learning curves. Think of it as averaging all of your current plots into one. Also, shuffle the data before running the process.
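If it helps, here is a minimal sketch of that idea in Python/scikit-learn rather than Octave (X and y stand in for the dataset; none of this is the asker's code):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Ten shuffled 80/20 splits; averaging over them smooths out
# "easy" vs "hard" subsets, which is the effect described above.
cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=cv,
    scoring='neg_mean_squared_error',
)
train_error = -train_scores.mean(axis=1)   # average training error per training-set size
val_error = -val_scores.mean(axis=1)       # average validation error per training-set size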

Backpropagation neural network - error not converging

I am using the backpropagation algorithm for my model. It works perfectly fine for a simple XOR case and when I tested it on a smaller subset of my actual data.
There are 3 inputs in total and a single output (0, 1, or 2).
I have split the data set into a training set (80%, approximately 5.5k examples) and the remaining 20% as validation data.
I use trainingRate and momentum for calculating the delta weights.
I have normalized the input as below:
from sklearn import preprocessing
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(input_array)
I use 1 hidden layer, with a sigmoid activation between input and hidden and a linear activation between hidden and output.
I train with trainingRate = 0.0005, momentum = 0.6, and Epochs = 100,000. Any higher trainingRate shoots the error up to NaN. Momentum values between 0.5 and 0.9 work fine; any other value makes the error NaN.
I tried various numbers of nodes in the hidden layer, such as 3, 6, 9, and 10, and the error converged to 4140.327574 in each case. I am not sure how to reduce this. Changing the activation functions doesn't help. I even tried adding another hidden layer with a Gaussian activation function, but I cannot reduce the error at all.
Is it because of outliers? Do I need to clean those values from the training data?
Any suggestion would be of great help, be it about the activation functions, hidden layers, etc. I have been trying to get this working for quite some time and I am rather stuck now.
Well, I'm having a similar kind of problem and still haven't fixed it, but I can tell you a couple of things I have found. I think my net is overfitting: at some point the error goes down and then starts going up again, on the validation set as well. Is this your case too?
Check whether you are implementing the "early stopping" algorithm correctly; most of the time the problem is not the backpropagation itself, but the error analysis or the validation analysis.
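As a minimal sketch of early stopping (train_one_epoch, validation_error, and weights are hypothetical stand-ins for your own training code, not anything from your post):

import copy

best_error = float('inf')
best_weights = None
patience, bad_epochs = 20, 0          # stop after 20 epochs with no improvement

for epoch in range(100000):
    train_one_epoch()                          # hypothetical: one backprop pass over the training set
    val_error = validation_error(weights)      # hypothetical: error on the held-out 20%
    if val_error < best_error:
        best_error = val_error
        best_weights = copy.deepcopy(weights)  # remember the best weights seen so far
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

weights = best_weights                         # roll back to the best checkpoint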
Hope this helps!
