Why does MITIE get stuck on segment classifier?

I'm building a model using MITIE with a training dataset of 1,400 sentences, each 3-10 words long, paired with around 120 intents. Model training gets stuck at Part II: train segment classifier. I let it run for 14 hours before terminating it.
My machine has a 2.4 GHz Intel Core i7 and 8 GB of 1600 MHz DDR3 RAM. The segment classifier uses all available memory (around 7 GB) and eventually relies on compressed memory; at the end of the last session, Activity Monitor showed 32 GB used and 27 GB compressed. The segment classifier has never completed.
My current output is below:
INFO:rasa_nlu.model:Starting to train component nlp_mitie
INFO:rasa_nlu.model:Finished training component.
INFO:rasa_nlu.model:Starting to train component tokenizer_mitie
INFO:rasa_nlu.model:Finished training component.
INFO:rasa_nlu.model:Starting to train component ner_mitie
Training to recognize 20 labels: 'pet', 'room_number', 'broken_things', '#sys.ignore', 'climate', 'facility', 'gym', 'medicine', 'item', 'exercise_equipment', 'service', 'number', 'electronic_device', 'charger', 'toiletries', 'time', 'date', 'facility_hours', 'cost_inquiry', 'tv channel'
Part I: train segmenter
words in dictionary: 200000
num features: 271
now do training
C: 20
epsilon: 0.01
num threads: 1
cache size: 5
max iterations: 2000
loss per missed segment: 3
C: 20 loss: 3 0.669591
C: 35 loss: 3 0.690058
C: 20 loss: 4.5 0.701754
C: 5 loss: 3 0.616959
C: 20 loss: 1.5 0.634503
C: 28.3003 loss: 5.74942 0.71345
C: 25.9529 loss: 5.72171 0.707602
C: 27.7407 loss: 5.97907 0.707602
C: 30.2561 loss: 5.61669 0.701754
C: 27.747 loss: 5.66612 0.710526
C: 28.9754 loss: 5.82319 0.707602
best C: 28.3003
best loss: 5.74942
num feats in chunker model: 4095
train: precision, recall, f1-score: 0.805851 0.885965 0.844011
Part I: elapsed time: 180 seconds.
Part II: train segment classifier
now do training
num training samples: 415
I understand this could be an issue caused by redundant labels (as explained here); however, all of my labels are unique. My understanding is that training shouldn't take this long or use this much memory. I've seen others post similar issues with no solution provided yet. What is causing this high memory usage and extreme training time, and how can it be fixed?

Related

Fixing seeds in keras tanks accuracy

I'm following this example on how to use Keras to build a CNN trained on MNIST. I modified it so that the batch size is 20, the number of classes is 10, and the number of epochs is 3. I also only use the first 1000 images from the dataset. This is all to decrease the time spent training and to lower the accuracy a bit (for separate reasons).
Of course training as-is is not fully reproducible. Here are 3 results from training:
Test loss: 0.30017868733406067
Test accuracy: 0.901
Test loss: 0.30246967363357546
Test accuracy: 0.894
Test loss: 0.355930775642395
Test accuracy: 0.887
They're not super far apart I suppose, but I'd like to have full reproducibility. So, I tried to fix the seeds and reduce the multithreading using the following lines of code:
import os
import random
import numpy as np
import tensorflow as tf
from keras import backend as K

seed = 124129

# Seed every source of randomness (Python hashing, random, NumPy, TensorFlow)
os.environ['PYTHONHASHSEED'] = str(seed)
random.seed(seed)
np.random.seed(seed)
tf.set_random_seed(seed)

# Restrict TensorFlow to a single thread to avoid non-deterministic op ordering
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)
I'm running this on my laptop which only has a CPU, so fixing the seeds should have been sufficient. However, the results seem to be even worse:
Test loss: 14.74805696105957
Test accuracy: 0.085
Test loss: 12.71728395843506
Test accuracy: 0.211
Test loss: 12.340894721984863
Test accuracy: 0.232
It's not just the fact that the accuracy is so low that is concerning, but also the fact that it's still not reproducible. What other sources of irreproducibility could there be?
For reference, here is the full code: https://pastebin.com/XixmKUC6

Compute the number of epochs from iterations in training?

I have a Caffe prototxt as follows:
stepsize: 20000
iter_size: 4
batch_size: 10
gamma: 0.1
The dataset has 40,000 images, which means that after 20,000 iterations the learning rate is decreased by a factor of 10. In PyTorch, I want to reproduce the same learning-rate behavior as in Caffe. After how many epochs should I decrease the learning rate by a factor of 10 (note that we have iter_size=4 and batch_size=10)? Thanks
Ref: Epoch vs Iteration when training neural networks
My answer: if you have 40,000 training examples and the batch size is 10, then it takes 40,000/10 = 4,000 iterations to complete 1 epoch. Hence, 20,000 iterations to reduce the learning rate in Caffe would be the same as 5 epochs in PyTorch.
You did not take into account iter_size: 4: when the batch is too large to fit into memory, you can "split" it across several iterations.
In your example, the actual batch size is batch_size * iter_size = 10 * 4 = 40. Therefore, an epoch takes only 1,000 iterations, and you need to decrease the learning rate after 20 epochs.
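A minimal PyTorch sketch of that schedule (the model and optimizer below are hypothetical placeholders): with an effective batch size of 40 and 40,000 images, one epoch is 1,000 iterations, so Caffe's stepsize: 20000 with gamma: 0.1 becomes step_size=20 epochs with gamma=0.1:
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Placeholder model and optimizer; substitute your own network and settings.
model = nn.Linear(10, 2)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Caffe: stepsize=20000 iterations, effective batch = batch_size * iter_size = 40,
# 40,000 images -> 1,000 iterations per epoch -> decay every 20 epochs by 10x.
scheduler = StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(60):
    # ... run the usual training loop over your DataLoader here ...
    scheduler.step()  # multiplies the learning rate by 0.1 every 20 epochs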

ResNet: 100% accuracy during training, but 33% prediction accuracy with the same data

I am new to machine learning and deep learning, and for learning purposes I tried to play with ResNet. I tried to overfit on small data (3 different images) to see if I could get almost 0 loss and 1.0 accuracy - and I did.
The problem is that predictions on the training images (i.e. the same 3 images used for training) are not correct.
Training Images
Image labels
[1,0,0], [0,1,0], [0,0,1]
My python code
import os
import numpy as np
from PIL import Image
from keras.applications.resnet50 import ResNet50

# loading 3 images and resizing them
imgs = np.array([np.array(Image.open("./Images/train/" + fname)
                          .resize((197, 197), Image.ANTIALIAS))
                 for fname in os.listdir("./Images/train/")]).reshape(-1, 197, 197, 1)
# creating labels
y = np.array([[1,0,0],[0,1,0],[0,0,1]])
# create resnet model
model = ResNet50(input_shape=(197, 197, 1), classes=3, weights=None)
# compile & fit model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(imgs, y, epochs=5, shuffle=True)
# predict on training data
print(model.predict(imgs))
The model does overfit the data:
3/3 [==============================] - 22s - loss: 1.3229 - acc: 0.0000e+00
Epoch 2/5
3/3 [==============================] - 0s - loss: 0.1474 - acc: 1.0000
Epoch 3/5
3/3 [==============================] - 0s - loss: 0.0057 - acc: 1.0000
Epoch 4/5
3/3 [==============================] - 0s - loss: 0.0107 - acc: 1.0000
Epoch 5/5
3/3 [==============================] - 0s - loss: 1.3815e-04 - acc: 1.0000
but predictions are:
[[ 1.05677405e-08 9.99999642e-01 3.95520459e-07]
[ 1.11955103e-08 9.99999642e-01 4.14905685e-07]
[ 1.02637095e-07 9.99997497e-01 2.43751242e-06]]
which means that all images got label=[0,1,0]
Why, and how can that happen?
It's because of the batch normalization layers.
During the training phase, the batch is normalized w.r.t. its own mean and variance. However, during the testing phase, the batch is normalized w.r.t. the moving averages of previously observed means and variances.
Now this is a problem when the number of observed batches is small (e.g., 5 in your example) because in the BatchNormalization layer, by default moving_mean is initialized to be 0 and moving_variance is initialized to be 1.
Given also that the default momentum is 0.99, you'll need to update the moving averages quite a lot of times before they converge to the "real" mean and variance.
That's why the prediction is wrong in the early stage, but is correct after 1000 epochs.
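As a rough standalone illustration (not part of the model code), Keras' BatchNormalization updates its running statistics as moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean; with momentum = 0.99 a handful of updates barely moves the estimate, while around 1000 updates get it close to the true value:
def simulate_moving_mean(batch_mean, momentum, n_updates, init=0.0):
    # Repeatedly apply the BatchNormalization moving-average update rule.
    moving_mean = init
    for _ in range(n_updates):
        moving_mean = momentum * moving_mean + (1.0 - momentum) * batch_mean
    return moving_mean

true_mean = 5.0  # pretend the batch mean is constant at 5
for momentum in (0.99, 0.01):
    for n_updates in (5, 100, 1000):
        estimate = simulate_moving_mean(true_mean, momentum, n_updates)
        print("momentum=%.2f updates=%4d moving_mean=%.4f" % (momentum, n_updates, estimate))
# momentum=0.99: after 5 updates the moving mean is still ~0.25, far from 5;
# momentum=0.01: after 5 updates it is already ~5.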
You can verify it by forcing the BatchNormalization layers to operate in "training mode".
During training, the accuracy is 1 and the loss is close to zero:
model.fit(imgs,y,epochs=5,shuffle=True)
Epoch 1/5
3/3 [==============================] - 19s 6s/step - loss: 1.4624 - acc: 0.3333
Epoch 2/5
3/3 [==============================] - 0s 63ms/step - loss: 0.6051 - acc: 0.6667
Epoch 3/5
3/3 [==============================] - 0s 57ms/step - loss: 0.2168 - acc: 1.0000
Epoch 4/5
3/3 [==============================] - 0s 56ms/step - loss: 1.1921e-07 - acc: 1.0000
Epoch 5/5
3/3 [==============================] - 0s 53ms/step - loss: 1.1921e-07 - acc: 1.0000
Now if we evaluate the model, we'll observe high loss and low accuracy because after 5 updates, the moving averages are still pretty close to the initial values:
model.evaluate(imgs,y)
3/3 [==============================] - 3s 890ms/step
[10.745396614074707, 0.3333333432674408]
However, if we manually specify the "learning phase" variable and let the BatchNormalization layers use the "real" batch mean and variance, the result becomes the same as what's observed in fit().
sample_weights = np.ones(3)
learning_phase = 1 # 1 means "training"
ins = [imgs, y, sample_weights, learning_phase]
model.test_function(ins)
[1.192093e-07, 1.0]
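(In more recent TensorFlow 2 / Keras versions the same check can be done, as a sketch reusing the model and imgs from the question's code, by calling the model directly with training=True so the BatchNormalization layers use the batch statistics instead of the moving averages:)
# Sketch: assumes `model` and `imgs` from the question's code and eager execution (TF2).
preds_inference = model(imgs, training=False)  # uses moving averages -> poor predictions
preds_training = model(imgs, training=True)    # uses batch statistics -> matches fit()
print(preds_training.numpy())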
It's also possible to verify it by changing the momentum to a smaller value.
For example, by adding momentum=0.01 to all the batch norm layers in ResNet50, the prediction after 20 epochs is:
model.predict(imgs)
array([[ 1.00000000e+00, 1.34882026e-08, 3.92139575e-22],
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00],
[ 8.70998792e-06, 5.31159838e-10, 9.99991298e-01]], dtype=float32)
ResNet50V2 (the second version) has much higher accuracy than ResNet50 in predicting a given image, such as the classic Egyptian cat.
Predicted: [[('n02124075', 'Egyptian_cat', 0.8233388), ('n02123159', 'tiger_cat', 0.103765756), ('n02123045', 'tabby', 0.07267675), ('n03958227', 'plastic_bag', 3.6531426e-05), ('n02127052', 'lynx', 3.647774e-05)]]
Compared with EfficientNet (90% accuracy), ResNet50/101/152 gives quite poor results (15~50% accuracy) when using the pretrained weights provided by François Chollet. This is not related to the weights themselves but to the inherent complexity of the model. In other words, it is necessary to re-train the model to predict a given image, whereas EfficientNet does not need such training.
For instance, given a classic cat image, the final results are as follows.
1. Using decode_predictions
from keras.applications.imagenet_utils import decode_predictions
Predicted: [[('n01930112', 'nematode', 0.122968934), ('n03041632', 'cleaver', 0.04236396), ('n03838899', 'oboe', 0.03846453), ('n02783161', 'ballpoint', 0.027445247), ('n04270147', 'spatula', 0.024508419)]]
2. Using cv2
import cv2
import numpy as np

img = cv2.resize(cv2.imread('/home/mike/Documents/keras_resnet_common/images/cat.jpg'), (224, 224)).astype(np.float32)
# Subtract the per-channel (BGR) ImageNet mean used during training
img[:,:,0] -= 103.939
img[:,:,1] -= 116.779
img[:,:,2] -= 123.68
Predicted: [[('n04065272', 'recreational_vehicle', 0.46529356), ('n01819313', 'sulphur-crested_cockatoo', 0.31684962), ('n04074963', 'remote_control', 0.051597465), ('n02111889', 'Samoyed', 0.040776145), ('n04548362', 'wallet', 0.029898684)]]
Therefore, the ResNet50/101/152 models are not suitable for predicting an image without training, even when provided with pretrained weights. Users can see their value after 100~1000 epochs of training, because that yields better moving averages. If users want an easy out-of-the-box prediction, EfficientNet is a good choice with the provided weights.
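For reference, a minimal sketch of the canonical Keras ResNet50 prediction path, using the bundled preprocess_input and decode_predictions helpers (the image path is a placeholder):
import numpy as np
from keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from keras.preprocessing import image

model = ResNet50(weights='imagenet')  # pretrained ImageNet weights

img = image.load_img('cat.jpg', target_size=(224, 224))  # placeholder path
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)  # same per-channel mean subtraction ResNet50 was trained with

preds = model.predict(x)
print('Predicted:', decode_predictions(preds, top=5)[0])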
It seems that predicting with a batch of images will not work correctly in Keras. It is better to do prediction for each image individually and then calculate the accuracy manually.
As an example, the following code uses individual image prediction instead of batch prediction.
import os
from PIL import Image
import keras
import numpy

###
# I am not including code to load models or train model
###

print("Prediction result:")
dir = "/path/to/test/images"
files = os.listdir(dir)
correct = 0
total = 0

# dictionary mapping class indices to labels
classes = {
    0: 'This is Cat',
    1: 'This is Dog',
}

for file_name in files:
    total += 1
    image = Image.open(dir + "/" + file_name).convert('RGB')
    image = image.resize((100, 100))
    image = numpy.expand_dims(image, axis=0)
    image = numpy.array(image)
    image = image / 255
    pred = model.predict_classes([image])[0]
    sign = classes[pred]
    # compare case-insensitively so 'cat' matches 'This is Cat'
    if ("cat" in file_name.lower()) and ("cat" in sign.lower()):
        correct += 1
        print(correct, ". ", file_name, sign)
    elif ("dog" in file_name.lower()) and ("dog" in sign.lower()):
        correct += 1
        print(correct, ". ", file_name, sign)

print("accuracy: ", (correct / total))
What happens is basically that keras fit(), i.e. your
model.fit()
achieves the best fit while precision is lost. Because that precision is lost, the fitted model gives problems and varied results; keras fit() only produces a good fit, not the required precision.

Training accuracy increases aggressively, test accuracy settles

While training a convolutional neural network following this article, the accuracy of the training set increases too much while the accuracy on the test set settles.
Below is an example with 6,400 training examples, randomly chosen at each epoch (so some examples may have been seen in previous epochs and some may be new), and the same 6,400 test examples.
For a bigger dataset (64,000 or 100,000 training examples), the increase in training accuracy is even more abrupt, reaching 98% by the third epoch.
I also tried using the same 6400 training examples each epoch, just randomly shuffled. As expected, the result is worse.
epoch 3 loss 0.54871 acc 79.01
learning rate 0.1
nr_test_examples 6400
TEST epoch 3 loss 0.60812 acc 68.48
nr_training_examples 6400
tb 91
epoch 4 loss 0.51283 acc 83.52
learning rate 0.1
nr_test_examples 6400
TEST epoch 4 loss 0.60494 acc 68.68
nr_training_examples 6400
tb 91
epoch 5 loss 0.47531 acc 86.91
learning rate 0.05
nr_test_examples 6400
TEST epoch 5 loss 0.59846 acc 68.98
nr_training_examples 6400
tb 91
epoch 6 loss 0.42325 acc 92.17
learning rate 0.05
nr_test_examples 6400
TEST epoch 6 loss 0.60667 acc 68.10
nr_training_examples 6400
tb 91
epoch 7 loss 0.38460 acc 95.84
learning rate 0.05
nr_test_examples 6400
TEST epoch 7 loss 0.59695 acc 69.92
nr_training_examples 6400
tb 91
epoch 8 loss 0.35238 acc 97.58
learning rate 0.05
nr_test_examples 6400
TEST epoch 8 loss 0.60952 acc 68.21
This is my model (I'm using a ReLU activation after each convolution):
conv 5x5 (1, 64)
max-pooling 2x2
dropout
conv 3x3 (64, 128)
max-pooling 2x2
dropout
conv 3x3 (128, 256)
max-pooling 2x2
dropout
conv 3x3 (256, 128)
dropout
fully_connected(18*18*128, 128)
dropout
output(128, 128)
What could be the cause?
I'm using Momentum Optimizer with learning rate decay:
batch = tf.Variable(0, trainable=False)
train_size = 6400
learning_rate = tf.train.exponential_decay(
    0.1,                 # Base learning rate.
    batch * batch_size,  # Current index into the dataset.
    train_size * 5,      # Decay step.
    0.5,                 # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(cost, global_step=batch)
This is very much expected. This problem is called over-fitting: your model starts "memorizing" the training examples without actually learning anything useful for the test set. In fact, this is exactly why we use a test set in the first place. If we have a complex enough model, we can always fit the data perfectly, even if not meaningfully. The test set is what tells us what the model has actually learned.
It's also useful to use a validation set, which is like a test set, but you use it to find out when to stop training: when the validation error stops decreasing, you stop training. Why not use the test set for this? The test set is there to tell you how well your model would do in the real world. If you start using information from the test set to make choices about your training process, then it's like cheating, and you will be punished by your test error no longer representing your real-world error.
Lastly, convolutional neural networks are notorious for their ability to over-fit. It has been shown that conv-nets can reach zero training error even if you shuffle the labels, and even on random pixels. That means there doesn't have to be a real pattern for the conv-net to learn to represent it. This means you have to regularize a conv-net, i.e. use things like dropout, batch normalization, and early stopping.
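As a concrete Keras sketch of the validation-set and early-stopping idea (x_train and y_train are placeholders for your data, and the callback arguments assume a reasonably recent Keras version):
from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 5 epochs and
# roll back to the weights from the best epoch.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(x_train, y_train,
          validation_split=0.2,  # hold out 20% of the training data as a validation set
          epochs=100,
          batch_size=64,
          callbacks=[early_stop])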
I'll leave a few links if you want to read more:
Over-fitting, validation, early stopping
https://elitedatascience.com/overfitting-in-machine-learning
Conv-nets fitting random labels:
https://arxiv.org/pdf/1611.03530.pdf
(this paper is a bit advanced, but it's interesting to skim through)
P.S. To actually improve your test accuracy, you will need to change your model or train with data augmentation. You might want to try transfer learning as well.
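A minimal data-augmentation sketch with Keras' ImageDataGenerator (x_train, y_train, x_test, y_test and the model are placeholders):
from keras.preprocessing.image import ImageDataGenerator

# Randomly rotate, shift and flip the training images each epoch so the
# network cannot simply memorize them.
datagen = ImageDataGenerator(rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    steps_per_epoch=len(x_train) // 64,
                    epochs=50,
                    validation_data=(x_test, y_test))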

How to visualize this kind of information

I am training a logistic regression model and it returns the following information for each iteration. I am collecting these values as arrays over the entire run.
Can you suggest some ways to visualize this? For example, is it appropriate to plot loss vs. accuracy? Or what kind of chart should I use?
***** Iteration #74 *****
Loss: 170.07
Feature L2-norm: 12.5714
Learning rate (eta): 0.00778819
Total number of feature updates: 236800
Loss variance: 5.01839
Seconds required for this iteration: 0.01
Accuracy: 0.9800 (784/800)
Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796
***** Iteration #75 *****
Loss: 166.81
Feature L2-norm: 12.4385
Learning rate (eta): 0.00769234
Total number of feature updates: 240000
Loss variance: 4.68113
Seconds required for this iteration: 0.01
Accuracy: 0.9800 (784/800)
Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796
I don't think you should visualize this information as-is. All you would see is that the L2 norm decreases over time (since it is the target of the minimization) and that accuracy increases. But since the F1 score is so high, I think these metrics are evaluated on the training data.
So I would recommend producing the same report (Micro P, R, F1) on test data (data which is not used for training) and plotting iteration vs. F1. Then you will see when you actually start overfitting the data, from the peak in the plot.
For your own analysis you should plot accuracy vs. time, so you know when you start to overfit.
For publication, you can pick the metrics others have reported, so you can compare to them.
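A minimal matplotlib sketch of the iteration-vs-metric plot; the train_f1 and test_f1 lists below are dummy values standing in for the metrics collected at each iteration:
import matplotlib.pyplot as plt

# Replace these dummy lists with the per-iteration metrics you collected.
train_f1 = [0.90, 0.93, 0.95, 0.96, 0.97, 0.98]
test_f1 = [0.88, 0.90, 0.91, 0.91, 0.90, 0.89]
iterations = range(1, len(train_f1) + 1)

plt.plot(iterations, train_f1, label='train F1')
plt.plot(iterations, test_f1, label='test F1')
plt.xlabel('iteration')
plt.ylabel('micro F1')
plt.legend()
plt.title('Train vs. test F1 per iteration')
plt.show()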
