I'm somewhat new to machine learning, and I wanted to run a simple experiment to get more familiar with neural network autoencoders: build an extremely basic autoencoder that learns the identity function.
I'm using Keras to make life easier, so I did this first to make sure it works:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Weights are given as [weights, biases], so we pass
# the identity matrix for the weights and a vector of zeros for the biases
weights = [np.diag(np.ones(84)), np.zeros(84)]
model = Sequential([Dense(84, input_dim=84, weights=weights)])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(X, X, nb_epoch=10, batch_size=8, validation_split=0.3)
As expected, the loss is zero, both in train and validation data:
Epoch 1/10
97535/97535 [==============================] - 27s - loss: 0.0000e+00 - val_loss: 0.0000e+00
Epoch 2/10
97535/97535 [==============================] - 28s - loss: 0.0000e+00 - val_loss: 0.0000e+00
Then I tried the same thing without initializing the weights to the identity, expecting that after a while of training the network would learn it. It didn't. I've let it run for 200 epochs several times in different configurations, playing with different optimizers and loss functions, and adding L1 and L2 activity regularizers. The results vary, but the best I've gotten is still really bad: the output looks nothing like the original data, it's just in roughly the same numeric range.
The data is simply some numbers oscillating around 1.1. I don't know if an activation layer makes sense for this problem; should I be using one?
If this one-layer "neural network" can't learn something as simple as the identity function, how can I expect it to learn anything more complex? What am I doing wrong?
EDIT
To have better context, here's a way to generate a dataset very similar to the one I'm using:
X = np.random.normal(1.1090579, 0.0012380764, (139336, 84))
I suspect the variations between the values might be too small. The loss function ends up at decent values (around 1e-6), but that's not enough precision for the result to have a shape similar to the original data. Maybe I should scale/normalize it somehow? Thanks for any advice!
UPDATE
In the end, as suggested, the issue was that the dataset had too-small variations between its 84 values, so the resulting prediction was actually pretty good in absolute terms (loss function), but compared to the original data the variations were far off. I solved it by normalizing the 84 values in each sample around the sample's mean and dividing by the sample's standard deviation, then using the original mean and standard deviation to denormalize the predictions at the other end. This could be done in a few different ways, but I did it by building the normalization/denormalization into the model itself with Lambda layers that operate on the tensors. That way all the data processing is incorporated into the model, which makes it nicer to work with. Let me know if you would like to see the actual code.
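For illustration, here is a minimal sketch of that idea (this is not the original code; the single Dense layer and the epsilon guard are placeholders):
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense, Lambda
from keras import backend as K

inp = Input(shape=(84,))

# Normalize each sample around its own mean and standard deviation
normalized = Lambda(
    lambda x: (x - K.mean(x, axis=-1, keepdims=True))
              / (K.std(x, axis=-1, keepdims=True) + K.epsilon()))(inp)

hidden = Dense(84)(normalized)

# Denormalize using the statistics of the original input sample
denormalized = Lambda(
    lambda t: t[0] * (K.std(t[1], axis=-1, keepdims=True) + K.epsilon())
              + K.mean(t[1], axis=-1, keepdims=True))([hidden, inp])

model = Model(inp, denormalized)
model.compile(optimizer='sgd', loss='mean_squared_error')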
I believe the problem could be either the number of epochs or the way you initialize X.
I ran your code with an X of my own for 100 epochs, printed the argmax() and max values of the weights, and it gets really close to the identity function.
Here is the code snippet that I used:
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import random

X = np.array([[random.random() for r in range(84)] for i in range(1, 100000)])

model = Sequential([Dense(84, input_dim=84)], name="layer1")
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(X, X, nb_epoch=100, batch_size=80, validation_split=0.3)

l_weights = np.round(model.layers[0].get_weights()[0], 3)
print(l_weights.argmax(axis=0))
print(l_weights.max(axis=0))
And I'm getting:
Train on 69999 samples, validate on 30000 samples
Epoch 1/100
69999/69999 [==============================] - 1s - loss: 0.2092 - val_loss: 0.1564
Epoch 2/100
69999/69999 [==============================] - 1s - loss: 0.1536 - val_loss: 0.1510
Epoch 3/100
69999/69999 [==============================] - 1s - loss: 0.1484 - val_loss: 0.1459
...
Epoch 98/100
69999/69999 [==============================] - 1s - loss: 0.0055 - val_loss: 0.0054
Epoch 99/100
69999/69999 [==============================] - 1s - loss: 0.0053 - val_loss: 0.0053
Epoch 100/100
69999/69999 [==============================] - 1s - loss: 0.0051 - val_loss: 0.0051
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83]
[ 0.85000002 0.85100001 0.79799998 0.80500001 0.82700002 0.81900001
0.792 0.829 0.81099999 0.80800003 0.84899998 0.829 0.852
0.79500002 0.84100002 0.81099999 0.792 0.80800003 0.85399997
0.82999998 0.85100001 0.84500003 0.847 0.79699999 0.81400001
0.84100002 0.81 0.85100001 0.80599999 0.84500003 0.824
0.81999999 0.82999998 0.79100001 0.81199998 0.829 0.85600001
0.84100002 0.792 0.847 0.82499999 0.84500003 0.796
0.82099998 0.81900001 0.84200001 0.83999997 0.815 0.79500002
0.85100001 0.83700001 0.85000002 0.79900002 0.84100002 0.79699999
0.838 0.847 0.84899998 0.83700001 0.80299997 0.85399997
0.84500003 0.83399999 0.83200002 0.80900002 0.85500002 0.83899999
0.79900002 0.83399999 0.81 0.79100001 0.81800002 0.82200003
0.79100001 0.83700001 0.83600003 0.824 0.829 0.82800001
0.83700001 0.85799998 0.81999999 0.84299999 0.83999997]
When I used only 5 numbers as input and printed the actual weights, I got this:
array([[ 1., 0., -0., 0., 0.],
[ 0., 1., 0., -0., -0.],
[-0., 0., 1., 0., 0.],
[ 0., -0., 0., 1., -0.],
[ 0., -0., 0., -0., 1.]], dtype=float32)
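As a quick sanity check (a hypothetical addition, not part of the runs above), you can quantify how close the learned weight matrix l_weights is to the identity:
# Maximum absolute deviation of the learned weights from the identity matrix
identity = np.eye(l_weights.shape[0])
print(np.abs(l_weights - identity).max())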
After some amount of training on a custom multi-agent environment using RLlib's (1.4.0) PPO network, I found that my continuous actions turn into nan (explode?). This is probably caused by a bad gradient update, which in turn depends on the loss/objective function.
As I understand it, PPO's loss function relies on three terms:
1. The PPO gradient objective [depends on the outputs of the old and new policies, the advantage, and the "clip" parameter (= 0.3, say)]
2. The value function (VF) loss
3. The entropy loss [mainly there to encourage exploration]
Total loss = PPO gradient objective (clipped) - vf_loss_coeff * VF loss + entropy_coeff * entropy.
I have set the entropy coefficient to 0, so I am focusing on the other two terms contributing to the total loss. As seen in the progress table below, the relevant portion is where the total loss becomes inf. The only change I found is that the policy loss was all negative until row #445.
So my question is: can anyone explain what the policy loss is supposed to look like, and whether this is normal? How do I resolve this issue with continuous actions becoming nan after a while? Is it just a question of lowering the learning rate?
EDIT
Here's a link to the related question (if you need more context)
END OF EDIT
I would really appreciate any tips! Thank you!
       Total loss    policy loss      VF loss
430    6.068537      -0.053691726     6.102932
431    5.9919114     -0.046943977     6.0161843
432    8.134636      -0.05247503      8.164852
433    4.2227306     -0.048518334     4.2523246
434    6.563492      -0.05237444      6.594456
435    8.171029      -0.048245672     8.198223
436    8.948264      -0.048484523     8.976327
437    7.556602      -0.054372005     7.5880575
438    6.124418      -0.05249534      6.155609
439    4.267647      -0.052565258     4.2978816
440    4.9129577     -0.054498855     4.9448576
441    16.630293     -0.043477766     16.656229
442    6.3149705     -0.057527818     6.349852
443    4.2269225     -0.054469086     4.2607937
444    9.503102      -0.052135203     9.53277
445    inf            0.2436709       4.410831
446    nan           -0.00029848056   22.596403
447    nan            0.00013323531   0.00043436908
448    nan            1.5656527e-05   0.0002645221
449    nan            1.3344318e-05   0.0003139485
450    nan            6.941917e-05    0.00025863337
451    nan            0.00015686743   0.00013607396
452    nan           -5.0206604e-06   0.00027541115
453    nan           -4.5543664e-05   0.0004247162
454    nan            8.841757e-05    0.0002027839
455    nan           -8.465959e-05    9.261127e-05
456    nan            3.868079e-05    0.00032097593
457    nan            2.7373153e-06   0.0005146417
458    nan           -6.271608e-06    0.0013273798
459    nan           -0.00013192794   0.00030621013
460    nan            0.00038987884   0.0003801983
461    nan           -3.2747878e-06   0.00031471922
462    nan           -6.9349815e-05   0.00038836736
463    nan           -4.666238e-05    0.0002851575
464    nan           -3.7067155e-05   0.00020161088
465    nan            3.0623291e-06   0.00019258814
466    nan           -8.599938e-06    0.00036465342
467    nan           -1.1529375e-05   0.00016500981
468    nan           -3.0851965e-07   0.00022042097
469    nan           -0.0001133984    0.00030230958
470    nan           -1.0735256e-05   0.00034000343
It appears that RLlib's PPO configuration of grad_clip is way too big (grad_clip=40). I changed it to grad_clip=4 and it worked.
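A minimal sketch of that change in an RLlib 1.x dict-style config (the environment name is a placeholder; only grad_clip is the actual fix):
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()
config["env"] = "MyCustomEnv-v0"  # placeholder for your environment
config["grad_clip"] = 4           # default is 40, which was far too loose here

trainer = PPOTrainer(config=config)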
I met the same problem when running the rllib example, and I also posted it in this issue. I am likewise running PPO in a continuous and bounded action space. PPO outputs actions that are quite large and finally crashes due to a NaN-related error.
For me, it seems that when the log_std of the action's normal distribution is too large, very large actions (around 1e20) appear. I copied the loss-calculation code from RLlib's (v1.10.0) ppo_torch_policy.py and pasted it below:
logp_ratio = torch.exp(
    curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) -
    train_batch[SampleBatch.ACTION_LOGP])
action_kl = prev_action_dist.kl(curr_action_dist)
mean_kl_loss = reduce_mean_valid(action_kl)
curr_entropy = curr_action_dist.entropy()
mean_entropy = reduce_mean_valid(curr_entropy)
surrogate_loss = torch.min(
    train_batch[Postprocessing.ADVANTAGES] * logp_ratio,
    train_batch[Postprocessing.ADVANTAGES] * torch.clamp(
        logp_ratio, 1 - self.config["clip_param"],
        1 + self.config["clip_param"]))
For those large actions, the log probability curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) computed by <class 'torch.distributions.normal.Normal'> will be -inf, and then curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) - train_batch[SampleBatch.ACTION_LOGP] returns NaN. torch.min and torch.clamp still keep the NaN output (refer to the docs).
So in conclusion, I guess the NaN is caused by the -inf log probability of very large actions, and torch fails to clip it according to the "clip" parameter.
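Here is a small standalone demonstration of this failure mode (my own illustration, not RLlib code):
import torch

# In float32, the log-prob of an absurdly large action under a unit Normal
# overflows to -inf
dist = torch.distributions.Normal(torch.tensor(0.0), torch.tensor(1.0))
logp = dist.log_prob(torch.tensor(1e20))
print(logp)                          # tensor(-inf)

# -inf minus -inf is nan, so the importance ratio becomes nan
ratio = torch.exp(logp - logp)
print(ratio)                         # tensor(nan)

# clamp does not remove nan, so the clipped surrogate stays nan
print(torch.clamp(ratio, 0.7, 1.3))  # tensor(nan)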
The difference is that I do not set entropy_coeff to zero. In my case, the std is encouraged to be as large as possible, since the entropy is computed for the full normal distribution rather than the distribution restricted to the action space. I am not sure whether you get a large σ as I do. In addition, I am using PyTorch; things may be different for TF.
I'm building a Keras model to predict whether a user will select a certain product (binary classification).
The model seems to make progress on a validation set held out during training, but its predictions on the test set are all 0s.
My dataset looks something like this:
train_dataset
customer_id id target customer_num_id
0 TCHWPBT 4 0 1
1 TCHWPBT 13 0 1
2 TCHWPBT 20 0 1
3 TCHWPBT 23 0 1
4 TCHWPBT 28 0 1
... ... ... ... ...
1631695 D4Q7TMM 849 0 7417
1631696 D4Q7TMM 855 0 7417
1631697 D4Q7TMM 856 0 7417
1631698 D4Q7TMM 858 0 7417
1631699 D4Q7TMM 907 0 7417
I split it into Train/Val sets using:
from sklearn.model_selection import train_test_split
Train, Val = train_test_split(train_dataset, test_size=0.1, random_state=42, shuffle=False)
After I split the dataset, I select the features that are used when training and validating the model:
train_customer_id = Train['customer_num_id']
train_vendor_id = Train['id']
train_target = Train['target']
val_customer_id = Val['customer_num_id']
val_vendor_id = Val['id']
val_target = Val['target']
... And run the model:
from sklearn.metrics import f1_score

epochs = 2
for e in range(epochs):
    print('EPOCH: ', e)
    model.fit([train_customer_id, train_vendor_id], train_target, epochs=1, verbose=1, batch_size=384)
    prediction = model.predict(x=[train_customer_id, train_vendor_id], verbose=1, batch_size=384)
    train_f1 = f1_score(y_true=train_target.astype('float32'), y_pred=prediction.round())
    print('TRAIN F1: ', train_f1)
    val_prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
    val_f1 = f1_score(y_true=val_target.astype('float32'), y_pred=val_prediction.round())
    print('VAL F1: ', val_f1)
EPOCH: 0
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0891
TRAIN F1: 0.1537511577647422
VAL F1: 0.09745762711864409
EPOCH: 1
1468530/1468530 [==============================] - 19s 13us/step - loss: 0.0691
TRAIN F1: 0.308748569645272
VAL F1: 0.2076433121019108
The validation F1 seems to improve over time, and the model predicts both 1s and 0s:
prediction = model.predict(x=[val_customer_id, val_vendor_id], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0., 1.], dtype=float32)
But when I try to predict on the test set, the model predicts 0 for every sample:
prediction = model.predict(x=[test_dataset['customer_num_id'], test_dataset['id']], verbose=1, batch_size=384)
np.unique(prediction.round())
array([0.], dtype=float32)
The test dataset looks similar to the training and validation sets, and it was held out during training just like the validation set, yet the model can't output anything other than 0.
Here's what the test dataset looks like:
test_dataset
customer_id id customer_num_id
0 Z59FTQD 243 7418
1 0JP29SK 243 7419
... ... ... ...
1671995 L9G4OFV 907 17414
1671996 L9G4OFV 907 17414
1671997 FDZFYBA 907 17415
Does anyone know what might be the issue here?
EDIT: made dataset text more readable
Please take a look at the distribution of your data. In the sample data you've shown, target is all 0s. Consider that if most users don't select the product, then a model that always predicts 0 will be right most of the time. So it could be improving its accuracy by over-fitting to the majority class (0).
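One common way to counteract this is to weight the minority class more heavily. A hedged sketch, assuming your labels are the 0/1 values in train_target and mirroring the fit call from the question:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight each class inversely to its frequency
classes = np.unique(train_target)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=train_target)
class_weight = dict(zip(classes, weights))

model.fit([train_customer_id, train_vendor_id], train_target,
          epochs=1, verbose=1, batch_size=384,
          class_weight=class_weight)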
You can also reduce over-fitting by adjusting parameters such as the learning rate, or by changing the model architecture, e.g. adding dropout layers.
Also, I'm not sure what your model looks like, but you're only training for 2 epochs, so it may not have had enough time to generalize; depending on how deep your model is, it could need a lot more training time.
I'm trying to apply a pretrained HuggingFace ALBERT transformer model to my own text classification task, but the loss is not decreasing beyond a certain point.
Here's my code:
There are four labels in my text classification dataset: 0, 1, 2, 3.
Define the tokenizer
maxlen = 25
albert_path = 'albert-large-v1'

from transformers import AlbertTokenizer, TFAlbertModel, AlbertConfig

tokenizer = AlbertTokenizer.from_pretrained(albert_path, do_lower_case=True, add_special_tokens=True,
                                            max_length=maxlen, pad_to_max_length=True)
Encode all sentences in text using the tokenizer
encodings = []
for t in text:
    encodings.append(tokenizer.encode(t, max_length=maxlen, pad_to_max_length=True, add_special_tokens=True))
Define the pretrained transformer model and add Dense layer on top
import tensorflow as tf
from tensorflow.keras.layers import Input, Flatten, Dropout, Dense
from tensorflow.keras import Model

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
token_inputs = Input((maxlen,), dtype=tf.int32, name='input_word_ids')
config = AlbertConfig(num_labels=4, dropout=0.2, attention_dropout=0.2)
albert_model = TFAlbertModel.from_pretrained(pretrained_model_name_or_path=albert_path, config=config)

X = albert_model(token_inputs)[1]
X = Dropout(0.2)(X)
output_ = Dense(4, activation='softmax', name='output')(X)

bert_model2 = Model(token_inputs, output_)
print(bert_model2.summary())
bert_model2.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
Finally, feed the encoded text and labels to the model:
import numpy as np

encodings = np.asarray(encodings)
labels = np.asarray(labels)
bert_model2.fit(x=encodings, y=labels, epochs=20, batch_size=128)
Epoch 11/20
5/5 [==============================] - 2s 320ms/step - loss: 1.2923
Epoch 12/20
5/5 [==============================] - 2s 319ms/step - loss: 1.2412
Epoch 13/20
5/5 [==============================] - 2s 322ms/step - loss: 1.3118
Epoch 14/20
5/5 [==============================] - 2s 319ms/step - loss: 1.2531
Epoch 15/20
5/5 [==============================] - 2s 318ms/step - loss: 1.2825
Epoch 16/20
5/5 [==============================] - 2s 322ms/step - loss: 1.2479
Epoch 17/20
5/5 [==============================] - 2s 321ms/step - loss: 1.2623
Epoch 18/20
5/5 [==============================] - 2s 319ms/step - loss: 1.2576
Epoch 19/20
5/5 [==============================] - 2s 321ms/step - loss: 1.3143
Epoch 20/20
5/5 [==============================] - 2s 319ms/step - loss: 1.2716
The loss has decreased from 6 to around 1.23, but it doesn't seem to decrease any further, even after 30+ epochs.
What am I doing wrong?
All advice is greatly appreciated!
A few things you can try:
- Use the SGD optimizer.
- Introduce batch normalization.
- Add a few (non-pretrained) layers on top of the ALBERT layer, as sketched below.
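A hedged sketch combining those suggestions, reusing maxlen, albert_path, and albert_model from the question (the layer sizes and learning rate are illustrative, not tuned values):
import tensorflow as tf
from tensorflow.keras.layers import Input, Dropout, Dense, BatchNormalization
from tensorflow.keras import Model

token_inputs = Input((maxlen,), dtype=tf.int32, name='input_word_ids')
X = albert_model(token_inputs)[1]          # pooled output, as in the question
X = Dense(256, activation='relu')(X)       # new, not pretrained
X = BatchNormalization()(X)
X = Dropout(0.2)(X)
output_ = Dense(4, activation='softmax', name='output')(X)

bert_model2 = Model(token_inputs, output_)
bert_model2.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                    loss='sparse_categorical_crossentropy')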
I'm trying to train a model on MNIST.
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
print(x_train.shape)
What I got is (60000, 28, 28), so there are 60,000 items in the dataset.
Then, I create the model with the following code.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
However, each epoch shows only 1875 steps:
2020-06-02 04:33:45.706474: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_101.dll'; dlerror: cudart64_101.dll not found
2020-06-02 04:33:45.706617: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-06-02 04:33:47.437837: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2020-06-02 04:33:47.437955: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
2020-06-02 04:33:47.441329: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-H3BEO7F
2020-06-02 04:33:47.441480: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-H3BEO7F
2020-06-02 04:33:47.441876: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2020-06-02 04:33:47.448274: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x27fc6b2c210 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-02 04:33:47.448427: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Epoch 1/5
1875/1875 [==============================] - 1s 664us/step - loss: 0.2971 - accuracy: 0.9140
Epoch 2/5
1875/1875 [==============================] - 1s 661us/step - loss: 0.1421 - accuracy: 0.9582
Epoch 3/5
1875/1875 [==============================] - 1s 684us/step - loss: 0.1068 - accuracy: 0.9675
Epoch 4/5
1875/1875 [==============================] - 1s 695us/step - loss: 0.0868 - accuracy: 0.9731
Epoch 5/5
1875/1875 [==============================] - 1s 682us/step - loss: 0.0764 - accuracy: 0.9762
Process finished with exit code 0
You are using the whole dataset, no worries!
According to the Keras documentation (https://github.com/keras-team/keras/blob/master/keras/engine/training.py), when you call model.fit without specifying the batch size, it defaults to 32:
batch_size: Integer or None. Number of samples per gradient update. If
unspecified, batch_size will default to 32.
That means each epoch consists of 1875 steps, and in each step your model processes 32 examples. And sure enough, 1875 * 32 = 60,000.
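To make this concrete (a small illustration using the question's own variables; the batch size of 100 is just an example):
import math

# 60,000 samples at the default batch size of 32 -> 1875 steps per epoch
print(math.ceil(60000 / 32))  # 1875

# Setting the batch size explicitly changes the step count accordingly,
# e.g. batch_size=100 shows 600 steps per epoch
model.fit(x_train, y_train, epochs=5, batch_size=100)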
When I run the below code :
from keras.models import Sequential
from keras.layers import Dense
import numpy
import time
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("C:/Users/AQader/Desktop/Keraslearn/mammm.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:5]
Y = dataset[:,5]
# create model
model = Sequential()
model.add(Dense(50, input_dim=5, init='uniform', activation='relu'))
model.add(Dense(25, init='uniform', activation='tanh'))
model.add(Dense(15, init='uniform', activation='tanh'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X, Y, nb_epoch=200, batch_size=20, verbose = 0)
time.sleep(0.1)
# evaluate the model
scores = model.evaluate(X, Y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
I end up receiving the following:
32/829 [>.............................] - ETA: 0sacc: 84.20%
That's it: only one line, which shows up after about half a minute of training. Looking through other questions, the usual output looks like:
Epoch 1/20
1213/1213 [==============================] - 0s - loss: 0.1760
Epoch 2/20
1213/1213 [==============================] - 0s - loss: 0.1840
Epoch 3/20
1213/1213 [==============================] - 0s - loss: 0.1816
Epoch 4/20
1213/1213 [==============================] - 0s - loss: 0.1915
Epoch 5/20
1213/1213 [==============================] - 0s - loss: 0.1928
Epoch 6/20
1213/1213 [==============================] - 0s - loss: 0.1964
Epoch 7/20
1213/1213 [==============================] - 0s - loss: 0.1948
Epoch 8/20
1213/1213 [==============================] - 0s - loss: 0.1971
Epoch 9/20
1213/1213 [==============================] - 0s - loss: 0.1899
Epoch 10/20
1213/1213 [==============================] - 0s - loss: 0.1957
Can anyone tell me what may be wrong here? I'm a beginner at this, but it doesn't seem normal. Note that there are no errors in the code itself; what I mean is that the 0sacc line is all that shows up. I'm running this in an Anaconda environment with Python 2.7 on a 64-bit Windows 7 machine with 8 GB RAM and a 5th-gen Core i5.
By calling model.fit with verbose=0 you have suppressed the detailed output. Try setting verbose=1.
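For example, change the fit call from the question to:
# verbose=1 prints the per-epoch progress bars shown in the expected output
model.fit(X, Y, nb_epoch=200, batch_size=20, verbose=1)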