I have written a neural network program in Python with the help of neuralnetworksanddeeplearning.com. I randomly initialize the hidden-layer weights with shape (784,100) and the output-layer weights with shape (100,10). The algorithm uses mini-batch gradient descent with L2 regularization on the mnist.pkl.gz data set. I am using a mini-batch size of 10, learning rate (eta) = 3, and regularization parameter = 2.5. When I run the program, the accuracy increases at first and then decreases. Please help me understand how I can improve it to get higher accuracy. The iterations of the algorithm are shown below. Thanks in advance.
>>> stochastic(training_data,10,20,hiddenW,outW,hiddenB,outB,3,test_data,2.5)
Epoch 0 correct data: 9100.0/10000
Total cost of test data [ 307.75991542]
Epoch 1 correct data: 9136.0/10000
Total cost of test data [ 260.61199829]
Epoch 2 correct data: 9233.0/10000
Total cost of test data [ 244.9429907]
Epoch 3 correct data: 9149.0/10000
Total cost of test data [ 237.08391208]
Epoch 4 correct data: 9012.0/10000
Total cost of test data [ 227.14709858]
Epoch 5 correct data: 8714.0/10000
Total cost of test data [ 215.23668711]
Epoch 6 correct data: 8694.0/10000
Total cost of test data [ 201.79958056]
Epoch 7 correct data: 8224.0/10000
Total cost of test data [ 193.37639124]
Epoch 8 correct data: 7915.0/10000
Total cost of test data [ 183.83249811]
Epoch 9 correct data: 7615.0/10000
Total cost of test data [ 166.59631548]
import numpy as np
import random

# forward propagation with bias (three parameters)
def forward(weight, inp, b):
    val = np.dot(weight.T, inp) + b
    return val

# sigmoid activation function
def sigmoid(x):
    val = 1.0/(1.0 + np.exp(-x))
    return val
# Backpropagation: returns the gradients of the weights and biases
def backpropagation(x, weight1, weight2, bais1, bais2, yTarget):
    hh = forward(weight1, x, bais1)
    hhout = sigmoid(hh)
    oo = forward(weight2, hhout, bais2)
    oout = sigmoid(oo)
    ooe = -(yTarget - oout)*(oout*(1 - oout))        # output-layer error
    hhe = np.dot(weight2, ooe)*(hhout*(1 - hhout))   # hidden-layer error
    a2 = np.dot(hhout, ooe.T)                        # gradient of the output weights
    a1 = np.dot(x, hhe.T)                            # gradient of the hidden weights
    b1 = hhe
    b2 = ooe
    return a1, a2, b1, b2
# Cross-entropy cost over a data set, plus the L2 regularization term
def totalCost(data, weight1, weight2, bais1, bais2, lmbda):
    m = len(data)
    cost = 0.0
    for x, y in data:
        hh = forward(weight1, x, bais1)
        hhout = sigmoid(hh)
        oo = forward(weight2, hhout, bais2)
        oout = sigmoid(oo)
        c = sum(-y*np.log(oout) - (1 - y)*np.log(1 - oout))
        cost = cost + c/m
    cost = cost + 0.5*(lmbda/m)*(sum(map(sum, (weight1**2))) + sum(map(sum, (weight2**2))))
    return cost
# Mini-batch stochastic gradient descent for the given number of epochs
def stochastic(tdata, batch_size, epoch, w1, w2, b1, b2, eta, testdata, lmbda):
    n = len(tdata)
    for j in xrange(epoch):
        random.shuffle(tdata)
        mini_batches = [tdata[k:k+batch_size] for k in xrange(0, n, batch_size)]
        for minibatch in mini_batches:
            w1, w2, b1, b2 = updateminibatch(minibatch, w1, w2, b1, b2, eta, lmbda)
        print 'Epoch {0} correct data: {1}/{2}'.format(j, evaluate(testdata, w1, w2, b1, b2), len(testdata))
        print 'Total cost of test data {0}'.format(totalCost(testdata, w1, w2, b1, b2, lmbda))
    return w1, w2, b1, b2
# Update weights and biases from one mini-batch, with L2 weight decay
def updateminibatch(data, w1, w2, b1, b2, eta, lmbda):
    n = len(training_data)
    q1 = np.zeros(w1.shape)
    q2 = np.zeros(w2.shape)
    q3 = np.zeros(b1.shape)
    q4 = np.zeros(b2.shape)
    for xin, yout in data:
        delW1, delW2, delB1, delB2 = backpropagation(xin, w1, w2, b1, b2, yout)
        q1 = q1 + delW1
        q2 = q2 + delW2
        q3 = q3 + delB1
        q4 = q4 + delB2
    w1 = (1 - eta*(lmbda/n))*w1 - (eta/len(data))*q1
    w2 = (1 - eta*(lmbda/n))*w2 - (eta/len(data))*q2
    b1 = b1 - (eta/len(data))*q3
    b2 = b2 - (eta/len(data))*q4
    return w1, w2, b1, b2
# Count how many test examples are classified correctly
def evaluate(testdata, w1, w2, b1, b2):
    i = 0
    z = np.zeros(len(testdata))
    for x, y in testdata:
        h = forward(w1, x, b1)
        hout = sigmoid(h)
        o = forward(w2, hout, b2)
        out = sigmoid(o)
        p = np.argmax(out)
        if (p == y):
            z[i] = int(p == y)
        i = i + 1
    return sum(z)
When you train a machine learning model, you must take care not to overfit your training data.
To understand whether you are overfitting the data, it is useful to work with 3 different sets of data during training:
a training set, which you should use to train the model;
a validation set, which you can use during training to check whether you are fitting the data accurately (clearly you must not use this set to train the model, only as a check during training);
and a test set, as the final test of your model.
The validation set in particular is very useful. If you are overfitting the data, you may see very good performance on the training set but low accuracy on this set (in that case your model is too specialized on the training data and will probably have low accuracy when predicting new data).
So when the accuracy on the validation set starts to decrease, that is the moment to stop your training, because you have reached the best accuracy you are likely to get.
If you want to improve your model's accuracy, you could use more data for training or, if you have no more data or the accuracy still does not increase, you should change your model, for example by adding more layers to the neural network.
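As a minimal sketch of that stopping rule (train_one_epoch and evaluate here are hypothetical stand-ins for your own epoch update and accuracy functions, not the exact functions from the question):

# Illustrative early stopping on validation accuracy.
# train_one_epoch() and evaluate() are hypothetical placeholders.
def train_with_early_stopping(model, training_data, validation_data,
                              max_epochs=100, patience=5):
    best_acc = 0.0
    best_model = model
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        model = train_one_epoch(model, training_data)   # one full pass over the training set
        acc = evaluate(model, validation_data)          # accuracy on held-out data
        if acc > best_acc:
            best_acc, best_model = acc, model
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:      # validation accuracy stopped improving
            break
    return best_model, best_acc

The idea is simply to keep the model from the epoch with the best validation accuracy and stop once that accuracy has not improved for a few epochs.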
Related
I have a dataset of around 1M rows with a high imbalance (743 / 1072780). I am training xgboost model in h2o with the following parameters and it looks like it is overfitting
H2OXGBoostEstimator(max_depth=10,
subsample=0.7,
ntrees=200,
learn_rate=0.5,
min_rows=3,
col_sample_rate_per_tree = .75,
reg_lambda=2.0,
reg_alpha=2.0,
sample_rate = .5,
booster='gbtree',
nfolds=10,
keep_cross_validation_predictions = True,
stopping_metric = 'AUCPR',
min_split_improvement= 1e-5,
categorical_encoding = 'OneHotExplicit',
weights_column = "Products"
)
The output is:
Training data AUCPR: 0.6878932664592388 Validation data AUCPR: 0.04033158660014747
Training data AUC: 0.9992170372214433 Validation data AUC: 0.7000804189162043
Training data MSE: 0.0005722912424124134 Validation data MSE: 0.0010002949568585474
Training data RMSE: 0.023922609439866994 Validation data RMSE: 0.03162743993526108
Training data Gini: 0.9984340744428866 Validation data Gini: 0.40016083783240863
Confusion Matrix for Training Data:
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15900755567210062:
0 1 Error Rate
----- ------ --- ------- ----------------
0 709201 337 0.0005 (337.0/709538.0)
1 189 516 0.2681 (189.0/705.0)
Total 709390 853 0.0007 (526.0/710243.0)
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.047459165255228676:
0 1 Error Rate
----- ------ --- ------- ----------------
0 202084 365 0.0018 (365.0/202449.0)
1 140 52 0.7292 (140.0/192.0)
Total 202224 417 0.0025 (505.0/202641.0)
{'train': , 'valid': }
I am using h2o version 3.32.0.1 (since it's a requirement); H2O's XGBoost does not support the balance_classes or scale_pos_weight hyperparameters.
What can cause such performance? Also, what can be changed here for such an imbalanced dataset that might improve the performance?
Training with such a severely imbalanced data set is pointless. I would try a combination of up-sampling and down-sampling to get a more balanced data set that does not get too small.
This may be the worst class imbalance I have ever seen in a problem.
If you can subset your majority class - not until the point that it is balanced, but until the imbalance is less severe while still being representative (e.g., 15/85% minority/majority) - you'll have more luck with other conventional techniques, or a mixture (e.g., up-sampling and augmentation); see the sketch below. Can the data logically be subset to help with the imbalance? For example, if the data ranges back several years, you could use only the last year's worth of data. I'd also manually optimize the threshold against a metric on the minority class, like the true positive rate.
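As a rough pandas sketch of that resampling idea (the DataFrame df, the column name target, and the exact ratios are illustrative assumptions, not from the original post):

import pandas as pd

# Illustrative resampling: down-sample the majority class and up-sample the
# minority class to roughly a 15/85 split, instead of 743 / 1072780.
minority = df[df["target"] == 1]
majority = df[df["target"] == 0]

majority_down = majority.sample(n=len(minority) * 20, random_state=42)                    # keep ~20x the minority
minority_up = minority.sample(n=len(minority) * 4, replace=True, random_state=42)         # replicate minority rows

balanced = pd.concat([majority_down, minority_up]).sample(frac=1.0, random_state=42)      # shuffle
print(balanced["target"].value_counts(normalize=True))

The resulting frame is still representative of the majority class but small and balanced enough for conventional techniques to have a chance.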
The goal is to build a model that can rank the stocks based on their Target Return.
The dataset I am using is structured the following way:
Date         Stock_Id   Volume   Open   Close   High   Low    Target
2022-01-04   8341       103422   2734   2742    2755   2730   0.0007304601899196
The data is chronological, and all the stocks are grouped by day. There are only 2000 stocks.
Using the Open, High, Low, Close, and Volume features, I was able to generate 65+ new features using the talib library.
The Neural Net takes in all 74 total features and is meant to predict the Target number of each stock. The Target represents the rate of change of the stock between t+2 and t+1.
What I've done:
I have normalized the data across time: for any given Stock_Id I take all of its time-domain data, compute the mean and std, and use them to normalize that stock's features (see the sketch after this list).
The dataloader indexes into the dataset by day. This acts as a mini-batch, since when __getitem__ is called a (2000, 74) tensor is returned.
The train dataloader is set to shuffle, so the indexing is random.
The loss function is meant to optimize for the highest returns while keeping the standard deviation of those returns in check.
Because of the non-chronological nature of the dataloader and the fact that I need all outputs to compute the mean and std, the training loop uses optim.SGD but effectively performs full-batch gradient descent (one optimizer step per epoch).
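A minimal sketch of that per-stock normalization across time, assuming the raw data sits in a pandas DataFrame df with the columns from the table above (the helper name and column list are illustrative):

import pandas as pd

# Illustrative per-stock normalization: mean and std are computed over the
# full time history of each Stock_Id, then applied to that stock's rows.
feature_cols = ["Volume", "Open", "Close", "High", "Low"]

def normalize_per_stock(df, feature_cols):
    out = df.copy()
    out[feature_cols] = df.groupby("Stock_Id")[feature_cols].transform(
        lambda s: (s - s.mean()) / s.std())
    return out

df_norm = normalize_per_stock(df, feature_cols)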
During training, the loss converges very well, but the predictions are very off.
After training, I use the test set to view the accuracy of the prediction. This is the distribution of the predictions. And this is the distribution of the true targets.
The Target percentages mostly range between -15 and 15, whereas the predictions fall in a much smaller range, about -0.07 to -0.05.
I have tried running the model using different learning rates, changing the hidden layer sizes in the network, multiplying the targets by 100 to represent them as percentages, changing the timeperiod argument in many of the talib features. But I always get a model that poorly represents the data.
More Info:
The data is from 1300 trading days with each day containing data for 2000 stocks.
I have adjusted the Open, Close, High, Low prices before normalizing.
I have tried different neural network sizes. Having input layer length 74, and output being 1, with any arbitrary number of layers and sizes in between.
This is what the training loop looks like.
loss_lst = []
for epoch in range(50):  # loop over the dataset multiple times
    optimizer.zero_grad()
    running_error = torch.tensor(0.0)
    running_return = []
    running_loss = torch.tensor(0.0)
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels, date = data
        inputs = inputs.cuda().squeeze()
        labels = labels.cuda().squeeze()
        # forward pass
        outputs = net(inputs).squeeze()
        error = error_criterion(outputs, labels)  # Error between prediction and label
        running_error += error.cpu()
        day_return = return_criterion(outputs, labels).cpu()
        running_return.append(day_return)  # Array of daily returns used to compute the std for the loss function.
        del outputs
        del inputs
        del labels
    avg_error = running_error / len(trainloader)
    std_return = torch.stack(running_return).cuda().std()
    loss = (mean_hp * avg_error) + (std_hp * std_return)
    loss.backward()
    optimizer.step()
    running_loss += loss.item()
    loss_lst.append(loss.item())
    print(f'[{epoch + 1}] loss: {running_loss:.10f}')
    # loss_lst.append(sum(running_loss)/len(trainloader))
    running_loss = 0.0

print('Finished Training')
Let me know if any additional clarifications are needed.
What is the standard way to detect if a model has converged? I was going to record 5 losses, each with a 95% confidence interval, and halt the script if they all agreed. I assume training until convergence must already be implemented in PyTorch or PyTorch Lightning somewhere. I don't need a perfect solution, just the standard way to do this automatically - i.e. halt when converged.
My solution is easy to implement. Create a criterion and change the reduction to 'none', so that it outputs a tensor of size [B]. Every time you log, record that loss together with its 95% confidence interval (or the std if you prefer, but that is much less accurate). Then, every time you add a new loss with its confidence interval, keep only the most recent 5 (or 10) and check whether those losses are within a 95% CI of each other. If that is true, halt.
You can compute the CI with this:
from typing import Tuple

import scipy.stats
from torch import Tensor

def torch_compute_confidence_interval(data: Tensor,
                                      confidence: float = 0.95,
                                      ) -> Tuple[Tensor, Tensor]:
    """
    Computes the confidence interval for a given survey of a data set.
    """
    n = len(data)
    mean: Tensor = data.mean()
    # se: Tensor = scipy.stats.sem(data)  # compute standard error
    # se, mean: Tensor = torch.std_mean(data, unbiased=True)  # compute standard error
    se: Tensor = data.std(unbiased=True) / (n ** 0.5)
    t_p: float = float(scipy.stats.t.ppf((1 + confidence) / 2., n - 1))
    ci = t_p * se
    return mean, ci
and you can create the criterion as follows:
loss: nn.Module = nn.CrossEntropyLoss(reduction='none')
so the train loss is now of size [B].
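A minimal sketch of the halting check itself, using the CI helper above (the window size of 5 and the pairwise overlap test are my reading of "within a 95 CI of each other", not a standard API):

from collections import deque

# Illustrative halting check: keep the last 5 (mean, CI) pairs and stop once
# every recorded mean lies within every other mean's confidence interval.
recent = deque(maxlen=5)

def should_halt(loss_batch):
    # loss_batch: per-sample losses of size [B] from the reduction='none' criterion
    mean, ci = torch_compute_confidence_interval(loss_batch)
    recent.append((mean.item(), ci.item()))
    if len(recent) < recent.maxlen:
        return False
    return all(abs(m_i - m_j) <= max(c_i, c_j)
               for (m_i, c_i) in recent
               for (m_j, c_j) in recent)

In the training loop you would call should_halt(per_sample_loss) at each logging step and break out of the loop when it returns True.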
Note that I know how to train with a fixed number of epochs, so I am not really looking for that - just the halting criterion for when to stop once the model looks converged, i.e. what a person would do by eye when looking at the learning curve, but automatically.
ref:
https://forums.pytorchlightning.ai/t/what-is-the-standard-way-to-halt-a-script-when-it-has-converged/1415
Set an EarlyStopping (https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.EarlyStopping.html#pytorch_lightning.callbacks.EarlyStopping) callback in your trainer by
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

callbacks = [
    EarlyStopping(
        monitor="val_f1_score",
        min_delta=0.01,
        patience=10,   # NOTE no. val epochs, not train epochs
        verbose=False,
        mode="max",    # val_f1_score is a score, so an improvement means it increases
    ),
]
trainer = pl.Trainer(callbacks=callbacks)
This will monitor changes in val_f1_score during training (note that you have to log this value with self.log("val_f1_score", val_f1) in your pl.LightningModule), and it will stop training once the monitored quantity has failed to improve by at least min_delta for more than the number of validation epochs specified as patience. A minimal sketch of the logging side follows.
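Here is one way the logging could look inside the LightningModule (the model structure and the hand-rolled binary F1 are illustrative assumptions; only the self.log call and its key matter for EarlyStopping):

import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    # __init__, forward, training_step, configure_optimizers omitted for brevity

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)                          # assumes forward() returns class logits
        preds = logits.argmax(dim=1)
        # simple binary F1 computed by hand (illustrative; a metrics library works too)
        tp = ((preds == 1) & (y == 1)).sum().float()
        fp = ((preds == 1) & (y == 0)).sum().float()
        fn = ((preds == 0) & (y == 1)).sum().float()
        val_f1 = 2 * tp / (2 * tp + fp + fn + 1e-8)
        self.log("val_f1_score", val_f1)          # this logged key is what EarlyStopping monitors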
I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and test sets. The training set contains the first 8 years of data; the test set contains the remaining 2 years.
Normalise the training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise the test set analogously using the mean and variance from the training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
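A minimal numpy sketch of that cutting step (this is not the stateful_cut() function itself, just an illustration of the reshape it produces for these shapes, assuming the series length divides evenly by T_after_cut):

import numpy as np

def cut_series(X, T_after_cut):
    # X has shape (1, T_total, features); returns (T_total // T_after_cut, T_after_cut, features)
    _, T_total, features = X.shape
    nb_cuts = T_total // T_after_cut
    return X[:, :nb_cuts * T_after_cut, :].reshape(nb_cuts, T_after_cut, features)

# e.g. (1, 61320, 12) -> (2555, 24, 12) and (1, 17520, 12) -> (730, 24, 12) for T_after_cut = 24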
Next, I build and train the LSTM model as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True,
               ))
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but reset the states after each epoch as well (see the code below). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Use CuDNNLSTM instead of LSTM; this sped up training by about 4x!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that the GRU is as accurate as the LSTM, though it converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is built and trained as follows:
from keras.models import Sequential
from keras import layers
from keras.callbacks import Callback, EarlyStopping
from keras.optimizers import RMSprop

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            self.counter = 0
        def on_batch_begin(self, batch, logs={}):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1
        def on_epoch_end(self, epoch, logs={}):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback

model = Sequential()
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)

earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.
How do you compute the training accuracy for SGD? Do you compute it using the batch data you trained your network with, or using the entire dataset (for each batch optimization iteration)?
I tried computing the training accuracy for each iteration using the batch data I trained my network with, and it almost always gives me 100% training accuracy (sometimes 100%, 90%, 80%, always multiples of 10%, but the very first iteration already gave me 100%). Is this because I am computing the accuracy on the same batch data I trained on for that iteration? Or is my model overfitting, so that it reaches 100% instantly while the validation accuracy stays low? (This is the main question here: whether this is acceptable, or whether something is wrong with the model.)
Here are the hyperparameters I used.
batch_size = 64
kernel_size = 60 #from 60 #optimal 2
depth = 15 #from 60 #optimal 15
num_hidden = 1000 #from 1000 #optimal 80
learning_rate = 0.0001
training_epochs = 8
total_batches = train_x.shape[0] // batch_size
Calculating the training accuracy on the batch data during the training process is correct. If the accuracy is always a multiple of 10%, it is most likely because the batch you are evaluating on contains 10 samples: for example, if 8 of the 10 training outputs match the labels, the training accuracy will be 80%. If the training accuracy goes up and down, there are two main possibilities:
1. If you print the accuracy multiple times within one epoch, this is normal, especially at the early stage of training, because the model is predicting on different data samples each time;
2. If you print the accuracy once per epoch and you see the training accuracy go up and down during the later stage of training, your learning rate is too big. You need to decrease it over time during training.
If these do not answer your question, please provide more details so that we can help.
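As a minimal sketch of tracking both per-batch and per-epoch training accuracy (the batches iterator and the predict function are hypothetical stand-ins, not from the original post):

# Illustrative accuracy bookkeeping: per-batch accuracy is noisy and quantized
# by the batch size, while the epoch-level accuracy averages over all batches.
correct_total = 0
seen_total = 0
for batch_x, batch_y in batches:                # hypothetical iterator of mini-batches
    preds = predict(model, batch_x)             # hypothetical forward pass returning predicted labels
    correct = sum(int(p == y) for p, y in zip(preds, batch_y))
    batch_acc = correct / len(batch_y)          # steps of 1/len(batch_y), e.g. 10% for a batch of 10
    correct_total += correct
    seen_total += len(batch_y)
epoch_acc = correct_total / seen_total          # training accuracy over the whole epoch

Reporting epoch_acc alongside a separate validation accuracy makes it much easier to tell quantization noise apart from genuine overfitting.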