Keras + Tensorflow optimization stalls - machine-learning

I installed Theano (TH), TensorFlow (TF) and Keras.
Basic testing seems to indicate that they work with the GPU (GTX 1070), CUDA 8.0 and cuDNN 5.1.
If I run the cifar10_cnn.py Keras example with TH as the backend, it seems to work OK, taking ~18s/epoch.
If I run it with TF then, almost every time (it has worked occasionally, but I can't reproduce that), the optimization stalls with acc=0.1 after every epoch. It is as if the weights were not being updated.
This is a shame because the TF backend takes ~10s/epoch (on the few occasions it worked). I'm using Conda and I am very new to Python. If it helps, "conda list" seems to show two versions of some of the packages.
If you have any clues, please let me know. Thanks. Output below:
python cifar10_cnn.py
Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so locally
X_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples
Using real-time data augmentation.
Epoch 1/200
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7845
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.60GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
50000/50000 [==============================] - 11s - loss: 2.3029 - acc: 0.0999 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 2/200
50000/50000 [==============================] - 10s - loss: 2.3028 - acc: 0.0980 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 3/200
50000/50000 [==============================] - 10s - loss: 2.3028 - acc: 0.0992 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 4/200
50000/50000 [==============================] - 10s - loss: 2.3028 - acc: 0.0980 - val_loss: 2.3026 - val_acc: 0.1000
Epoch 5/200
13184/50000 [======>.......................] - ETA: 7s - loss: 2.3026 - acc: 0.1044^CTraceback (most recent call last):

It looks to me like it is just guessing randomly, since there are 10 possible classes and it is right 10% of the time. The only thing I can think of is that your learning rate is a bit too high. I have seen that with a high learning rate a model will sometimes converge and sometimes not. As for the backend, I think Theano currently performs more optimizations, so maybe this is slightly affecting something. Try lowering the learning rate by a factor of 10 and see if it converges.
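The "random guessing" reading can be checked against the log with a little arithmetic. The sketch below assumes the example trains with categorical cross-entropy over 10 classes; it computes the loss of a model that assigns equal probability to every class, which should match the stalled loss of about 2.3026.

```python
import math

num_classes = 10  # CIFAR-10 has 10 classes

# Cross-entropy of a model that gives every class probability 1/10,
# regardless of the input image.
chance_loss = -math.log(1.0 / num_classes)
chance_acc = 1.0 / num_classes

print(f"loss at chance level:     {chance_loss:.4f}")
print(f"accuracy at chance level: {chance_acc:.2f}")
```

A loss that sits at ln(10) with accuracy at 0.10 means the network is outputting a uniform distribution, which fits the suggestion to retry with a learning rate that is 10 times smaller.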

Related

Why I am retrieving high value loss for neural network regression

I have data in the following format, consisting of 80 instances. I need to predict two parameters, latency and accuracy.
   No    Model    Technique  Latency  Accuracy
0   1      Net  Repartition  31308.4      0.99
1   2      Net   Connection  30338.2      0.79
2   3  MobiNet  Repartition  20360.1      0.89
predictors=data.drop(['Latency','Accuracy'], axis = 1)
target=data[['Latency', 'Accuracy']]
predictors_cat_converted=pd.get_dummies(predictors, prefix=['Model', 'Technique'])
pre_norms = (predictors_cat_converted-predictors_cat_converted.mean()/predictors_cat_converted.std())
def regression():
    model=Sequential()
    model.add(Dense(50, activation= 'relu',input_shape=(n_cols,)))
    model.add(Dense(50, activation='relu'))#hidden layer
    model.add(Dense(2))#output
    model.compile(optimizer='adam',loss='mean_squared_error')
    return model
model=regression()
model.fit(pre_norms, target,validation_split=.3,epochs=100,verbose=1)
The output shows a very high loss:
Epoch 1/100
2/2 [==============================] - 1s 275ms/step - loss: 256321162.6667 - val_loss: 262150224.0000
Epoch 2/100
2/2 [==============================] - 0s 23ms/step - loss: 246612645.3333 - val_loss: 262146176.0000
Epoch 3/100
2/2 [==============================] - 0s 22ms/step - loss: 251778928.0000 - val_loss: 262142000.0000
Epoch 4/100
2/2 [==============================] - 0s 26ms/step - loss: 252470826.6667 - val_loss: 262137664.0000
Epoch 5/100
2/2 [==============================] - 0s 25ms/step - loss: 255799392.0000 - val_loss: 262133200.0000
Epoch 6/100
You have very little data: just 2 input columns, 80 rows and 2 target variables. All you can do is:
Add more data.
Normalize your data before feeding it to the neural network.
If the neural network does not give good accuracy, try Random Forest or XGBoost.
I also want to add that your neural network architecture is wrong. A Dense layer with 2 outputs and a softmax activation isn't going to give you good results here. You have to use TensorFlow's Functional API and build a neural network architecture with 1 input and 2 outputs.
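On the normalization point, note that the pre_norms line in the question subtracts mean/std from each value rather than computing (value - mean)/std, because division binds tighter than subtraction. The sketch below shows the difference in plain Python. It uses the three latency values from the table purely as an illustrative numeric column (the question applies the expression to the dummy-encoded predictors), and it uses the population standard deviation, whereas pandas' .std() defaults to the sample version.

```python
from statistics import mean, pstdev

# Illustrative numeric column (latency values copied from the question's table).
column = [31308.4, 30338.2, 20360.1]
mu, sigma = mean(column), pstdev(column)

# Same operator precedence as the question's line: x - (mean / std).
# This only shifts every value by a small constant.
as_written = [x - mu / sigma for x in column]

# Standard z-score: (x - mean) / std, giving zero mean and unit variance.
z_scores = [(x - mu) / sigma for x in column]

print("as written:", as_written)
print("z-scores:  ", z_scores)
```

With pandas the same fix would presumably be a matter of moving the closing parenthesis so that the subtraction is grouped before the division.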
One of your target variables reaches quite large values. As shown in the excerpt of your data, "Latency" takes values around 30,000 and 20,000.
If your model makes badly wrong predictions in the beginning, for example if it predicts "1" for your Latency, the MSE will be extremely high.
You could normalize your targets as you did with your inputs, to make it easier for your network to learn them. Your MSE, and hence your loss, should then be much smaller.
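A minimal sketch of that effect, using only the three latency values shown in the question. The helper mse and the "predict 1" and "predict 0" baselines are illustrative assumptions, not the asker's model.

```python
from statistics import mean, pstdev

latency = [31308.4, 30338.2, 20360.1]  # values from the question's table

def mse(y_true, y_pred):
    """Mean squared error between two equal-length lists."""
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])

# An untrained network typically outputs values near 0 or 1.
raw_mse = mse(latency, [1.0] * len(latency))

# Standardize the target, then score an equally naive prediction of 0.
mu, sigma = mean(latency), pstdev(latency)
scaled = [(y - mu) / sigma for y in latency]
scaled_mse = mse(scaled, [0.0] * len(scaled))

# Predictions made in the scaled space map back to the original units.
restored = [z * sigma + mu for z in scaled]

print(f"MSE on raw latency:    {raw_mse:.1f}")
print(f"MSE on scaled latency: {scaled_mse:.4f}")
```

The scaled loss starts at about 1 instead of hundreds of millions, so the optimizer's steps are no longer dominated by the latency column. Predictions have to be mapped back with z * sigma + mu before they are reported.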

LSTM accuracy unchanged while loss decreases

We installed a sensor to detect anomalies in accelerometer readings.
There is only one sensor, so my data is a 1-D array.
I tried to use an LSTM autoencoder for anomaly detection.
But my model didn't work: the losses on the training and validation sets were decreasing, but the accuracy stayed unchanged.
Here is my code and training log:
dim = 1
timesteps = 32
data.shape = (-1,timesteps,dim)
model = Sequential()
model.add(LSTM(50,input_shape=(timesteps,dim),return_sequences=True))
model.add(Dense(dim))
lr = 0.00001
Nadam = optimizers.Nadam(lr=lr)
model.compile(loss='mae', optimizer=Nadam ,metrics=['accuracy'])
EStop = EarlyStopping(monitor='val_loss', min_delta=0.001,patience=150, verbose=2, mode='auto',restore_best_weights=True)
history = model.fit(data,data,validation_data=(data,data),epochs=2000,batch_size=64,verbose=2,shuffle=False,callbacks=[EStop]).history
Training log
Train on 4320 samples, validate on 4320 samples
Epoch 1/2000
- 3s - loss: 0.3855 - acc: 7.2338e-06 - val_loss: 0.3760 - val_acc: 7.2338e-06
Epoch 2/2000
- 2s - loss: 0.3666 - acc: 7.2338e-06 - val_loss: 0.3567 - val_acc: 7.2338e-06
Epoch 3/2000
- 2s - loss: 0.3470 - acc: 7.2338e-06 - val_loss: 0.3367 - val_acc: 7.2338e-06
...
Epoch 746/2000
- 2s - loss: 0.0021 - acc: 1.4468e-05 - val_loss: 0.0021 - val_acc: 1.4468e-05
Epoch 747/2000
- 2s - loss: 0.0021 - acc: 1.4468e-05 - val_loss: 0.0021 - val_acc: 1.4468e-05
Epoch 748/2000
- 2s - loss: 0.0021 - acc: 1.4468e-05 - val_loss: 0.0021 - val_acc: 1.4468e-05
Restoring model weights from the end of the best epoch
Epoch 00748: early stopping
A couple of things:
As Matias pointed out in the comments, you're doing regression, not classification. Accuracy will not give meaningful values for regression. That said, you can see that the accuracy did improve (from 0.0000072 to 0.0000145). Inspect the direct output of your model to check how well it approximates the original time series.
You can safely omit the validation data when it is the same as the training data.
With autoencoders, you generally want to compress the data so that it can be represented in a lower dimension, which is easier to analyze (for anomalies or otherwise). In your case, you are expanding the dimensionality instead of reducing it, meaning the optimal strategy for your autoencoder is to pass through the same values it receives (each value of your time series is sent to 50 LSTM units, which send their result to 1 Dense unit). You might be able to combat this by setting return_sequences to False (i.e. only the result from the last timestep is returned), preferably into more than one unit, and then trying to rebuild the time series from that. It might fail, but it is still likely to lead to a better model.
As @MatiasValdenegro said, you shouldn't use accuracy when you want to do regression.
You can see that your model might be fine, because your loss decreases over the epochs and is very low at early stopping.
In regression problems, these metrics are normally used:
Mean Squared Error: mean_squared_error, MSE or mse
Mean Absolute Error: mean_absolute_error, MAE, mae
Mean Absolute Percentage Error: mean_absolute_percentage_error, MAPE, mape
Cosine Proximity: cosine_proximity, cosine
To get the right metrics, change the compile call like this (e.g. for "Mean Squared Error"):
model.compile(loss='mae', optimizer=Nadam ,metrics=['mse'])
As already said your model seems to be fine, you are just looking at the wrong metrics.
Hope this helps and feel free to ask.
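To illustrate why accuracy stays near zero while the loss keeps falling, here is a small sketch in plain Python. It assumes an equality-based notion of accuracy (how Keras resolves 'accuracy' depends on the version and output shape), and the helper names and sample values are made up for illustration.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions equal to the target (hypothetical helper)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between target and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up accelerometer readings and a reconstruction that is very
# close to them but never exactly identical.
signal = [0.12, -0.40, 0.33, 0.05, -0.27]
reconstruction = [0.121, -0.398, 0.331, 0.049, -0.271]

acc = exact_match_accuracy(signal, reconstruction)
mae = mean_absolute_error(signal, reconstruction)

print("accuracy:", acc)
print("MAE:     ", mae)
```

A continuous reconstruction almost never hits the target exactly, so an equality-based metric says nothing about quality, while the MAE reflects how close the reconstruction really is.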
Early stopping is not the best regularization technique while you are facing this problem. At least while you are still struggling to fix it, I would take it out or replace it with another regularization method, to figure out what is happening.
Another suggestion: can you change the validation set a bit and see how the behavior changes? How did you build the validation set?
Did you normalize/standardize the data? Please note that normalization is even more important for LSTMs.
The metric is definitely a problem. The above suggestions are good.

Validation loss decreases and validation accuracy decreases in CNN classification

I'm training a classifier on 2 classes (spawned fish or not, from an image of a scale). The dataset is unbalanced: only 5% of the scales are spawned.
I haven't checked how many spawned fish are in each of the train/validation/test sets, but there are 9073 images, split 70/15/15%. In epoch 2 I observe that val_loss decreases while val_acc also decreases. How is that possible?
I'm using Keras. The network is EfficientNetB4 from github.com/qubvel.
1600/1600 [==============================] - 1557s 973ms/step - loss: 1.3353 - acc: 0.6474 - val_loss: 0.8055 - val_acc: 0.7046
Epoch 00001: val_loss improved from inf to 0.80548, saving model to ./checkpoints_missing_loss2/salmon_scale_inception.001-0.81.hdf5
Epoch 2/150
1600/1600 [==============================] - 1508s 943ms/step - loss: 0.8013 - acc: 0.7084 - val_loss: 0.6816 - val_acc: 0.6973
Epoch 00002: val_loss improved from 0.80548 to 0.68164, saving model to ./checkpoints_missing_loss2/salmon_scale_inception.002-0.68.hdf5
Edit: here is another example, with only 1010 images, but it is balanced (50/50).
Epoch 5/150
1600/1600 [==============================] - 1562s 976ms/step - loss: 0.0219 - acc: 0.9933 - val_loss: 0.2639 - val_acc: 0.9605
Epoch 00005: val_loss improved from 0.28715 to 0.26390, saving model to ./checkpoints_missing_loss2/salmon_scale_inception.005-0.26.hdf5
Epoch 6/150
1600/1600 [==============================] - 1565s 978ms/step - loss: 0.0059 - acc: 0.9982 - val_loss: 0.4140 - val_acc: 0.9276
Epoch 00006: val_loss did not improve from 0.26390
Epoch 7/150
1600/1600 [==============================] - 1561s 976ms/step - loss: 0.0180 - acc: 0.9941 - val_loss: 0.2379 - val_acc: 0.9276
Here val_loss decreases as well as val_acc.
If you have such an unbalanced dataset, the model first classifies everything as the majority class which gets relatively high accuracy, but all probability is distributed to the majority class. The reason is that the final bias can be learned very quickly because the back-propagation path is very short.
In the later stages of the training, the model basically finds reasons not to classify the input with the majority class. At this point, the model starts to make mistakes, the accuracy goes down, but the probability is more evenly distributed, so from the loss perspective, the error is smaller.
With such an imbalanced dataset, I would rather track F-measure instead of accuracy.
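The mechanism described above can be reproduced with a toy example. The sketch below uses invented probabilities for 20 samples with 5% positives. The "early" model pushes everything to the majority class, while the "later" model spreads probability more sensibly but misclassifies two negatives. All helper names and numbers are assumptions for illustration.

```python
import math

def log_loss(y_true, p_pos):
    """Mean binary cross-entropy given predicted P(positive)."""
    terms = [-math.log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, p_pos)]
    return sum(terms) / len(terms)

def accuracy(y_true, p_pos):
    """Fraction of correct predictions at a 0.5 threshold."""
    return sum((p >= 0.5) == (y == 1) for y, p in zip(y_true, p_pos)) / len(y_true)

def f1_score(y_true, p_pos):
    """F-measure of the positive class at a 0.5 threshold."""
    pred = [p >= 0.5 for p in p_pos]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and not q)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

labels = [1] + [0] * 19                   # 5% positives
early = [0.01] * 20                       # everything is "majority class"
later = [0.6] + [0.55] * 2 + [0.1] * 17   # finds the positive, 2 false alarms

for name, probs in (("early", early), ("later", later)):
    print(name, log_loss(labels, probs), accuracy(labels, probs), f1_score(labels, probs))
```

In this toy case the later model has a lower loss and a lower accuracy at the same time, while its F-measure rises from zero, which is why the F-measure is the more informative quantity to track here.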

What exactly do these accuracy and loss values mean? How do they relate to the input data?

When the model was trained, I got the accuracy below.
Using TensorFlow backend.
Found 2 images belonging to 2 classes.
Epoch 1/1
3/3 [==============================] - 0s - loss: 5.3142 - acc: 0.6667
What exactly do 5.3142 and 0.6667 mean? How can I relate them to the input data?

Keras NoteBook GPU Timeout

I am trying to run Keras with TensorFlow on a Windows 10 machine with my GTX 980 GPU in a Jupyter notebook. If I run TensorFlow alone with my GPU, it works perfectly fine without any issues. But problems arise with the Keras interface for a high number of epochs.
The Keras model uses the GPU and gives output if my number of epochs is low, as in the following:
with tf.device('/gpu:0'):
    model.compile('adam', 'categorical_crossentropy', ['accuracy'])
    history = model.fit(X_normalized,y_one_hot,batch_size=128,nb_epoch=2,validation_split=0.2)
The output is as follows:
Train on 31367 samples, validate on 7842 samples
Epoch 1/2
31367/31367 [==============================] - 3s - loss: 1.7640 - acc: 0.5438 - val_loss: 1.2872 - val_acc: 0.6486 - ETA: 0s - loss: 1.8827 - acc: 0.5145 - ETA: 0s - loss: 1.7732 - acc: 0.5416
Epoch 2/2
31367/31367 [==============================] - 2s - loss: 0.8539 - acc: 0.7765 - val_loss: 0.7958 - val_acc: 0.7615
If the number of epochs is high, it times out with the following error and the web page says busy:
WebSocket ping timeout after 119999 ms.
How do I fix this error?
I guess this issue is related to TDR (Timeout Detection and Recovery) on Windows.
Basically, the OS thinks the GPU has hung and is no longer responding, so it resets the graphics card. You can try to disable TDR or raise the TdrDelay limit. More details can be found at https://learn.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys.
