How to Use Keras Model to Predict Output After Unpacking the Model - machine-learning

I can unpack my RNN model onto my website, but I am having trouble getting it to return a numpy array of predictions from a list as input (the list contains only one string called text, but it needs to be a list for preprocessing, from what I've gathered). I am running into this error:
ValueError: Error when checking : expected embedding_1_input to have shape (None, 72) but got array with shape (1, 690)
Here is how I am currently preprocessing and predicting with the model:
tokenizer = Tokenizer(num_words = 5000, split=' ')
tokenizer.fit_on_texts([text])
X = tokenizer.texts_to_sequences([text])
X = pad_sequences(X)
prediction = loadedModel.predict(X)
print(prediction)
And this is how I trained my model:
HIDDEN_LAYER_SIZE = 195 # Number of nodes in the hidden layer.
TOP_WORDS = 5000 # Most-used words in the dataset.
MAX_REVIEW_LENGTH = 500 # Maximum length of each text being sent in (necessary).
EMBEDDING_VECTOR_LENGTH = 128 # The Embedding layer will use 128-length vectors to
                              # represent each word.
BATCH_SIZE = 32 # Takes 32 sentences at a time and continually retrains the RNN.
NUMBER_OF_EPOCHS = 10 # Fits the RNN to more accurately guess the data's political bias.
DROPOUT = 0.2 # Helps slow down overfitting (slower convergence rate).
# Define the model
model = Sequential()
model.add(Embedding(TOP_WORDS, EMBEDDING_VECTOR_LENGTH,
                    input_length=X.shape[1]))
model.add(SpatialDropout1D(DROPOUT))
model.add(LSTM(HIDDEN_LAYER_SIZE))
model.add(Dropout(DROPOUT))
model.add(Dense(2, activation='softmax'))
# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# printModelSummary(model)
# Fit the model
model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
          epochs=NUMBER_OF_EPOCHS, batch_size=BATCH_SIZE)
How can I fix my preprocessing code in the codebox starting with "tokenizer" to stop getting the ValueError?
Thank you, and I can definitely provide more code or expand upon the purpose of the project.

So there are two problems here:
Set maxlen in pad_sequences: it seems that all of your training sequences were padded to have length 72, so you need to change the following line (note that the keyword argument is maxlen, not max_len):
X = pad_sequences(X, maxlen=72)
Use the training Tokenizer: this is a subtle problem - you are creating and fitting a totally new Tokenizer, so it could differ from the one you used for training. Because different words could then get different indexes, this would make your model perform terribly. Pickle your training Tokenizer and load it during deployment so that sentences are transformed into data points exactly the way your model expects.
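For example, a minimal sketch of the pickle round trip (the file name tokenizer.pickle is arbitrary, and the import path may be tensorflow.keras.preprocessing.sequence in newer versions):
import pickle
from keras.preprocessing.sequence import pad_sequences

# At training time: persist the fitted Tokenizer alongside the model.
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle)

# At deployment time: load the exact same Tokenizer instead of fitting a new one.
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

X = tokenizer.texts_to_sequences([text])
X = pad_sequences(X, maxlen=72)  # pad/truncate to the length used in training
prediction = loadedModel.predict(X)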

Related

Is there a way to increase the variance of model's prediction?

I created randomly generated data (using numpy, in the range 30 to 60) of about 12000 points, to simulate an artificial time series spanning more than a year.
Now I am trying to fit an LSTM model to those data points and forecast based on them.
Here is the LSTM model I applied. The data is a single series, so n_features = 1; steps-in and steps-out are parameters of the sequence-generation function for the time series, and I took both equal to 5. For the activation functions I tried all combinations: both relu, both tanh, and 1st tanh with 2nd relu (as shown here).
X, y = split_sequences(data, n_steps_in, n_steps_out)
n_features = X.shape[2]

model = Sequential()
model.add(LSTM(200, activation='tanh', input_shape=(n_steps_in, n_features)))
model.add(RepeatVector(n_steps_out))
model.add(LSTM(200, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(n_features)))
opt = keras.optimizers.Adam(learning_rate=0.05)
model.compile(optimizer=opt, loss='mse')
model.fit(X, y, epochs=n, batch_size=10, verbose=1,
          workers=4, use_multiprocessing=True, initial_epoch=0)
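(split_sequences itself is not shown in the question; a common multi-step univariate version, in the style of the well-known Machine Learning Mastery tutorials, looks roughly like the sketch below - an assumption, not necessarily the asker's exact helper.)
import numpy as np

def split_sequences(sequence, n_steps_in, n_steps_out):
    # Slice a 1-D series into overlapping (input window, output window) pairs.
    X, y = [], []
    for i in range(len(sequence)):
        in_end = i + n_steps_in
        out_end = in_end + n_steps_out
        if out_end > len(sequence):
            break
        X.append(sequence[i:in_end])
        y.append(sequence[in_end:out_end])
    # Reshape to (samples, timesteps, features) with n_features = 1.
    X = np.array(X).reshape(-1, n_steps_in, 1)
    y = np.array(y).reshape(-1, n_steps_out, 1)
    return X, y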
I also tried smoothing the data points, since they are randomly distributed (within the predefined boundaries), and then applied the model to the smoothed data, but I am still getting similar results.
For example, this image shows both the smoothed training data and the forecasted prediction from the model:
plt.plot(Training_data, 'g')
plt.plot(Pred_Forecasts, 'r')
Every time, the models give straight lines as the prediction.
This is perhaps to be expected, since the data is a set of random numbers and the model tends toward the mean value between the upper and lower limits of the data. Still, is there any way to generate a somewhat real-looking forecast?
P.S. 1 - I have also tried applying different models such as Prophet, SARIMA, and ARIMA.
I think I need a way to increase the variance of the prediction, which I am unable to find.
P.S. 2 - Sorry for the long question; I am new to deep learning, so I tried to explain as much as possible.

How to Fine-tune HuggingFace BERT model for Text Classification [closed]

Is there a Step by step explanation on how to Fine-tune HuggingFace BERT model for text classification?
Fine Tuning Approach
There are multiple approaches to fine-tune BERT for the target tasks.
Further Pre-training the base BERT model
Custom classification layer(s) on top of the base BERT model being trainable
Custom classification layer(s) on top of the base BERT model being non-trainable (frozen)
Note that the BERT base model has been pre-trained only for two tasks as in the original paper.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
3.1 Pre-training BERT ...we pre-train BERT using two unsupervised tasks
Task #1: Masked LM
Task #2: Next Sentence Prediction (NSP)
Hence, the base BERT model is half-baked, so to speak, and can be fully baked for the target domain (1st approach). Alternatively, we can use it as part of our custom model training with the base trainable (2nd approach) or non-trainable/frozen (3rd approach).
1st approach
How to Fine-Tune BERT for Text Classification? demonstrated the 1st approach of further pre-training, and pointed out that the learning rate is the key to avoiding catastrophic forgetting, where the pre-trained knowledge is erased while learning new knowledge.
We find that a lower learning rate, such as 2e-5,
is necessary to make BERT overcome the catastrophic forgetting problem. With an aggressive learn rate of 4e-4, the training set fails to converge.
Probably this is the reason why the BERT paper used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.
We use a batch size of 32 and fine-tune for 3 epochs over the data for all GLUE tasks. For each task, we selected the best fine-tuning learning rate (among 5e-5, 4e-5, 3e-5, and 2e-5) on the Dev set
Note that the base model pre-training itself used a higher learning rate.
bert-base-uncased - pretraining
The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer used is Adam with a learning rate of 1e-4, β1=0.9 and β2=0.999, a weight decay of 0.01, learning rate warmup for 10,000 steps and linear decay of the learning rate after.
The 1st way is described as part of the 3rd approach below.
FYI:
TFDistilBertModel is the bare base model with the name distilbert.
Model: "tf_distil_bert_model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
=================================================================
Total params: 66,362,880
Trainable params: 66,362,880
Non-trainable params: 0
2nd approach
Huggingface takes the 2nd approach, as in Fine-tuning with native PyTorch/TensorFlow, where TFDistilBertForSequenceClassification has added a custom classification layer classifier on top of the base distilbert model, with the base trainable. The small learning rate requirement applies here as well, to avoid catastrophic forgetting.
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
Model: "tf_distil_bert_for_sequence_classification_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
distilbert (TFDistilBertMain multiple 66362880
_________________________________________________________________
pre_classifier (Dense) multiple 590592
_________________________________________________________________
classifier (Dense) multiple 1538
_________________________________________________________________
dropout_59 (Dropout) multiple 0
=================================================================
Total params: 66,955,010
Trainable params: 66,955,010 <--- All parameters are trainable
Non-trainable params: 0
Implementation of the 2nd approach
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForSequenceClassification,
)
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512
LEARNING_RATE = 5e-5
BATCH_SIZE = 16
NUM_EPOCHS = 3
NUM_LABELS = 2  # number of target classes; missing in the original snippet, set to match your data
# --------------------------------------------------------------------------------
# Tokenizer
# --------------------------------------------------------------------------------
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of strings to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
# --------------------------------------------------------------------------------
# Load data
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)
# --------------------------------------------------------------------------------
# Prepare TF dataset
# --------------------------------------------------------------------------------
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).shuffle(1000).batch(BATCH_SIZE).prefetch(1)
validation_dataset = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),
    validation_label
)).batch(BATCH_SIZE).prefetch(1)
# --------------------------------------------------------------------------------
# training
# --------------------------------------------------------------------------------
model = TFDistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=NUM_LABELS
)
optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(
    x=train_dataset,
    y=None,
    validation_data=validation_dataset,
    epochs=NUM_EPOCHS,
    # batch_size must not be passed when x is a tf.data.Dataset;
    # batching is already done on the dataset itself.
)
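Once trained, inference reuses the same tokenize helper and model defined above; a minimal sketch (the sentence is a placeholder):
import numpy as np

inputs = tokenize(["an example sentence to classify"])
outputs = model(dict(inputs))                    # TFSequenceClassifierOutput
predicted = np.argmax(outputs.logits, axis=-1)   # class index per sentence
print(predicted)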
3rd approach
Basics
Please note that the images are taken from A Visual Guide to Using BERT for the First Time and modified.
Tokenizer
Tokenizer generates an instance of BatchEncoding, which can be used like a Python dictionary and serves as the input to the BERT model.
BatchEncoding
Holds the output of the encode_plus() and batch_encode() methods (tokens, attention_masks, etc).
This class is derived from a python dictionary and can be used as a dictionary. In addition, this class exposes utility methods to map from word/character space to token space.
Parameters
data (dict) – Dictionary of lists/arrays/tensors returned by the encode/batch_encode methods (‘input_ids’, ‘attention_mask’, etc.).
The data attribute of the class holds the generated tokens, which include the input_ids and attention_mask elements.
input_ids
The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.
attention_mask
This argument indicates to the model which tokens should be attended to, and which should not.
If the attention_mask is 0, the token id is ignored. For instance, if a sequence is padded to adjust its length, the padded tokens should be ignored; hence their attention_mask is 0.
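A quick illustration of both elements (the exact token ids depend on the vocabulary; the values shown are indicative):
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
encoding = tokenizer(["hello world"], padding='max_length', max_length=8)
print(encoding['input_ids'])       # e.g. [[101, 7592, 2088, 102, 0, 0, 0, 0]]
print(encoding['attention_mask'])  # e.g. [[1, 1, 1, 1, 0, 0, 0, 0]]  padding masked out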
Special Tokens
BertTokenizer adds special tokens, enclosing a sequence with [CLS] and [SEP]. [CLS] represents Classification and [SEP] separates sequences. For Question Answering or Paraphrase tasks, [SEP] separates the two sentences to compare.
BertTokenizer
cls_token (str, optional, defaults to "[CLS]") - The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
sep_token (str, optional, defaults to "[SEP]") - The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
A Visual Guide to Using BERT for the First Time shows the tokenization.
[CLS]
The embedding vector for [CLS] in the output of the base model's final layer represents the classification that the base model has learned. Hence, feed the embedding vector of the [CLS] token into the classification layer added on top of the base model.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.
The model structure is illustrated below.
Vector size
In the model distilbert-base-uncased, each token is embedded into a vector of size 768. The shape of the output from the base model is (batch_size, max_sequence_length, embedding_vector_size=768). This accords with the BERT paper about the BERT/BASE model (as indicated in distilbert-base-uncased).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT/BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT/LARGE (L=24, H=1024, A=16, Total Parameters=340M).
Base Model - TFDistilBertModel
Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks
TFDistilBertModel class to instantiate the base DistilBERT model without any specific head on top (as opposed to other classes such as TFDistilBertForSequenceClassification that do have an added classification head).
We do not want any task-specific head attached because we simply want the pre-trained weights of the base model to provide a general understanding of the English language, and it will be our job to add our own classification head during the fine-tuning process in order to help the model distinguish between toxic comments.
TFDistilBertModel generates an instance of TFBaseModelOutput whose last_hidden_state parameter is the output from the model last layer.
TFBaseModelOutput([(
    'last_hidden_state',
    <tf.Tensor: shape=(batch_size, sequence_length, 768), dtype=float32, numpy=array([[[...]]], dtype=float32)>
)])
TFBaseModelOutput
Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
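A self-contained check of that shape (a sketch; it downloads the pre-trained weights on first run):
from transformers import DistilBertTokenizerFast, TFDistilBertModel

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
base = TFDistilBertModel.from_pretrained('distilbert-base-uncased')

inputs = tokenizer(["hello world"], return_tensors="tf")
outputs = base(inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)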
Implementation
Python modules
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from transformers import (
DistilBertTokenizerFast,
TFDistilBertModel,
)
Configuration
import datetime

TIMESTAMP = datetime.datetime.now().strftime("%Y%b%d%H%M").upper()
DATA_COLUMN = 'text'
LABEL_COLUMN = 'category_index'
MAX_SEQUENCE_LENGTH = 512  # Max length allowed for BERT is 512.
NUM_LABELS = len(raw_train[LABEL_COLUMN].unique())  # requires raw_train to be loaded first (see Data allocation)
MODEL_NAME = 'distilbert-base-uncased'
NUM_BASE_MODEL_OUTPUT = 768

# Flag to freeze base model
FREEZE_BASE = True

# Flag to add custom classification heads
USE_CUSTOM_HEAD = True
if not USE_CUSTOM_HEAD:
    # Make the base trainable when no classification head exists.
    FREEZE_BASE = False

BATCH_SIZE = 16
LEARNING_RATE = 1e-2 if FREEZE_BASE else 5e-5
L2 = 0.01
Tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
def tokenize(sentences, max_length=MAX_SEQUENCE_LENGTH, padding='max_length'):
    """Tokenize using the Huggingface tokenizer
    Args:
        sentences: String or list of strings to tokenize
        padding: Padding method ['do_not_pad'|'longest'|'max_length']
    """
    return tokenizer(
        sentences,
        truncation=True,
        padding=padding,
        max_length=max_length,
        return_tensors="tf"
    )
Input layer
The base model expects input_ids and attention_mask, each of shape (max_sequence_length,). Generate a Keras tensor for each of them with an Input layer.
# Inputs for token indices and attention masks
input_ids = tf.keras.layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.layers.Input((MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name='attention_mask')
Base model layer
Generate the output from the base model. The base model generates TFBaseModelOutput. Feed the embedding of [CLS] to the next layer.
base = TFDistilBertModel.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS
)
# Freeze the base model weights.
if FREEZE_BASE:
    for layer in base.layers:
        layer.trainable = False

base.summary()

# [CLS] embedding is last_hidden_state[:, 0, :]
output = base([input_ids, attention_mask]).last_hidden_state[:, 0, :]
Classification layers
if USE_CUSTOM_HEAD:
    # --------------------------------------------------------------------------------
    # Classification layer 01
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dropout(
        rate=0.15,
        name="01_dropout",
    )(output)
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="01_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="01_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="01_relu"
    )(output)

    # --------------------------------------------------------------------------------
    # Classification layer 02
    # --------------------------------------------------------------------------------
    output = tf.keras.layers.Dense(
        units=NUM_BASE_MODEL_OUTPUT,
        kernel_initializer='glorot_uniform',
        activation=None,
        name="02_dense_relu_no_regularizer",
    )(output)
    output = tf.keras.layers.BatchNormalization(
        name="02_bn"
    )(output)
    output = tf.keras.layers.Activation(
        "relu",
        name="02_relu"
    )(output)
Softmax Layer
output = tf.keras.layers.Dense(
    units=NUM_LABELS,
    kernel_initializer='glorot_uniform',
    kernel_regularizer=tf.keras.regularizers.l2(l2=L2),
    activation='softmax',
    name="softmax"
)(output)
Final Custom Model
name = f"{TIMESTAMP}_{MODEL_NAME.upper()}"
model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output, name=name)
model.compile(
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
metrics=['accuracy']
)
model.summary()
---
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_ids (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
attention_mask (InputLayer) [(None, 256)] 0
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB TFBaseModelOutput(la 66362880 input_ids[0][0]
attention_mask[0][0]
__________________________________________________________________________________________________
tf.__operators__.getitem_1 (Sli (None, 768) 0 tf_distil_bert_model[1][0]
__________________________________________________________________________________________________
01_dropout (Dropout) (None, 768) 0 tf.__operators__.getitem_1[0][0]
__________________________________________________________________________________________________
01_dense_relu_no_regularizer (D (None, 768) 590592 01_dropout[0][0]
__________________________________________________________________________________________________
01_bn (BatchNormalization) (None, 768) 3072 01_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
01_relu (Activation) (None, 768) 0 01_bn[0][0]
__________________________________________________________________________________________________
02_dense_relu_no_regularizer (D (None, 768) 590592 01_relu[0][0]
__________________________________________________________________________________________________
02_bn (BatchNormalization) (None, 768) 3072 02_dense_relu_no_regularizer[0][0
__________________________________________________________________________________________________
02_relu (Activation) (None, 768) 0 02_bn[0][0]
__________________________________________________________________________________________________
softmax (Dense) (None, 2) 1538 02_relu[0][0]
==================================================================================================
Total params: 67,551,746
Trainable params: 1,185,794
Non-trainable params: 66,365,952 <--- Base BERT model is frozen
Data allocation
# --------------------------------------------------------------------------------
# Split data into training and validation
# --------------------------------------------------------------------------------
raw_train = pd.read_csv("./train.csv")
train_data, validation_data, train_label, validation_label = train_test_split(
    raw_train[DATA_COLUMN].tolist(),
    raw_train[LABEL_COLUMN].tolist(),
    test_size=.2,
    shuffle=True
)

# X = dict(tokenize(train_data))
# Y = tf.convert_to_tensor(train_label)
X = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(train_data)),  # Convert BatchEncoding instance to dictionary
    train_label
)).batch(BATCH_SIZE).prefetch(1)

V = tf.data.Dataset.from_tensor_slices((
    dict(tokenize(validation_data)),  # Convert BatchEncoding instance to dictionary
    validation_label
)).batch(BATCH_SIZE).prefetch(1)
Train
# --------------------------------------------------------------------------------
# Train the model
# https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
# Input data x can be a dict mapping input names to the corresponding array/tensors,
# if the model has named inputs. Beware of the "names". y should be consistent with x
# (you cannot have Numpy inputs and tensor targets, or inversely).
# --------------------------------------------------------------------------------
history = model.fit(
    x=X,              # tf.data.Dataset yielding (features dict, labels)
    y=None,           # y must be None when x is a dataset
    epochs=NUM_EPOCHS,
    validation_data=V,
    # batch_size must not be passed when x is a tf.data.Dataset;
    # batching is already done on the dataset itself.
)
To implement the 1st approach, change the configuration as below.
USE_CUSTOM_HEAD = False
Then FREEZE_BASE changes to False and LEARNING_RATE changes to 5e-5, which runs further pre-training on the base BERT model.
Saving the model
For the 3rd approach, saving the model causes issues. The save_pretrained method of the Huggingface model cannot be used, because the model is not a direct subclass of Huggingface PreTrainedModel.
Keras save_model causes an error with the default save_traces=True, and causes a different error with save_traces=False when loading the model with Keras load_model.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-01d66991d115> in <module>()
----> 1 tf.keras.models.load_model(MODEL_DIRECTORY)
11 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in _unable_to_call_layer_due_to_serialization_issue(layer, *unused_args, **unused_kwargs)
865 'recorded when the object is called, and used when saving. To manually '
866 'specify the input shape/dtype, decorate the call function with '
--> 867 '`@tf.function(input_signature=...)`.'.format(layer.name, type(layer)))
868
869
ValueError: Cannot call custom layer tf_distil_bert_model of type <class 'tensorflow.python.keras.saving.saved_model.load.TFDistilBertModel'>, because the call function was not serialized to the SavedModel. Please try one of the following methods to fix this issue:
(1) Implement `get_config` and `from_config` in the layer/model class, and pass the object to the `custom_objects` argument when loading the model. For more details, see: https://www.tensorflow.org/guide/keras/save_and_serialize
(2) Ensure that the subclassed model or layer overwrites `call` and not `__call__`. The input shape and dtype will be automatically recorded when the object is called, and used when saving. To manually specify the input shape/dtype, decorate the call function with `@tf.function(input_signature=...)`.
Only Keras Model save_weights worked, as far as I tested.
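A minimal sketch of that workaround (MODEL_DIRECTORY is the same path variable as in the traceback above; the architecture must be rebuilt with the same code before restoring):
# Save only the weights after training.
model.save_weights(MODEL_DIRECTORY)

# Later: re-run the model-building code above to recreate the identical
# architecture, then load the weights into it.
model.load_weights(MODEL_DIRECTORY)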
Experiments
As far as I tested with the Toxic Comment Classification Challenge, the 1st approach gave better recall (identifying true toxic comments and true non-toxic comments). Code can be accessed below. Please provide corrections/suggestions if you spot anything.
Code for 1st and 3rd approach
Related
BERT Document Classification Tutorial with Code - Fine tuning using TFDistilBertForSequenceClassification and Pytorch
Hugging Face Transformers: Fine-tuning DistilBERT for Binary Classification Tasks - Fine tuning using TFDistilBertModel

Validation accuracy fluctuating while training accuracy increase?

I have a multiclass classification problem that depends on historical data. I am trying an LSTM using loss='sparse_categorical_crossentropy'. The train accuracy and loss increase and decrease, respectively. But my test accuracy starts to fluctuate wildly.
What am I doing wrong?
Input data:
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
X.shape
(200146, 13, 1)
My model
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
# define 10-fold cross validation test harness
# (note: random_state only has an effect when shuffle=True)
kfold = StratifiedKFold(n_splits=10, shuffle=False)
cvscores = []
for train, test in kfold.split(X, y):
    regressor = Sequential()
    # Units = the number of LSTM units we want in this first layer -> we want very
    # high dimensionality, so we need a high number
    # return_sequences = True because we are adding another layer after this
    # input shape = the last two dimensions and the indicator
    regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X[train].shape[1], 1)))
    regressor.add(Dropout(0.2))
    # Extra LSTM layer
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(0.2))
    # 3rd
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(0.2))
    # 4th
    regressor.add(LSTM(units=50))
    regressor.add(Dropout(0.2))
    # output layer
    regressor.add(Dense(4, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
    # Compile the RNN
    regressor.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # Set callback functions to early stop training and save the best model so far
    callbacks = [EarlyStopping(monitor='val_loss', patience=9),
                 ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
    history = regressor.fit(X[train], y[train], epochs=250, callbacks=callbacks,
                            validation_data=(X[test], y[test]))
    # plot train and validation loss
    pyplot.plot(history.history['loss'])
    pyplot.plot(history.history['val_loss'])
    pyplot.title('model train vs validation loss')
    pyplot.ylabel('loss')
    pyplot.xlabel('epoch')
    pyplot.legend(['train', 'validation'], loc='upper right')
    pyplot.show()
    # evaluate the model
    scores = regressor.evaluate(X[test], y[test], verbose=0)
    print("%s: %.2f%%" % (regressor.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
Results:
[training-accuracy and train-vs-validation-loss plots from the original post]
What you are describing here is overfitting. This means your model keeps learning your training data and doesn't generalize; put differently, it is learning the exact features of your training set. This is the main problem you have to deal with in deep learning. There is no solution per se. You have to try out different architectures, different hyperparameters, and so on.
You can start with a small model that underfits (that is, both train and validation accuracy are low) and keep increasing the model until it overfits. Then you can play around with the optimizer and other hyperparameters.
By a smaller model I mean one with fewer hidden units or fewer layers.
You seem to have too many LSTM layers stacked over and over again, which eventually leads to overfitting. You should probably decrease the number of layers.
Your model seems to be overfitting, since the training error keeps decreasing while the validation error fails to. Overall, it fails to generalize.
You should try reducing the model complexity by removing some of the LSTM layers. Also, try varying the batch size; it will reduce the number of fluctuations in the loss, as in the sketch below.
You can also consider varying the learning rate.
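For illustration, a trimmed-down version of the model along the lines suggested above (two LSTM layers instead of four; the hyperparameters are starting points, not tuned values):
from keras.models import Sequential
from keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(13, 1)))  # matches X shape (N, 13, 1)
model.add(Dropout(0.2))
model.add(LSTM(units=50))  # last recurrent layer: no return_sequences
model.add(Dropout(0.2))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# Then experiment with larger batch sizes (e.g. 64 or 128) to damp the fluctuations.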

How to interpret MSE in Keras Regressor

I am trying to build a model to predict house prices.
I have some features X (no. of bathrooms, etc.) and a target Y (ranging from about $300,000 to $800,000).
I have used sklearn's StandardScaler to standardize Y before fitting the model.
Here is my Keras model:
def build_model():
    model = Sequential()
    model.add(Dense(36, input_dim=36, activation='relu'))
    model.add(Dense(18, input_dim=36, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='mse', optimizer='sgd', metrics=['mae', 'mse'])
    return model
I am having trouble interpreting the results -- what does an MSE of 0.617454319755 mean?
Do I have to inverse-transform this number and take the square root, getting an error rate of 741.55 dollars?
math.sqrt(sc.inverse_transform([mse]))
I apologise for sounding silly as I am starting out!
I apologise for sounding silly as I am starting out!
Do not; this is a subtle issue of great importance, which is usually (and regrettably) omitted in tutorials and introductory expositions.
Unfortunately, it is not as simple as taking the square root of the inverse-transformed MSE, but it is not that complicated either; essentially, what you have to do is:
Transform your predictions back to the initial scale of the original data
Get the MSE between these inverse-transformed predictions and the original data
Take the square root of the result
in order to get a performance indicator of your model that is meaningful in the business context of your problem (e.g. US dollars here).
Let's see a quick example with toy data, omitting the model itself (which is irrelevant here, and in fact can be any regression model - not only a Keras one):
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
import numpy as np
# toy data
X = np.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
Y = np.array([3, 4, 5, 6, 7])
# feature scaling
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X)
# outcome scaling:
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y.reshape(-1, 1))
Y_train
# array([[-1.41421356],
# [-0.70710678],
# [ 0. ],
# [ 0.70710678],
# [ 1.41421356]])
Now, let's say that we fit our Keras model (not shown here) using the scaled sets X_train and Y_train, and get predictions on the training set:
prediction = model.predict(X_train) # scaled inputs here
print(prediction)
# [-1.4687586 -0.6596055 0.14954728 0.95870024 1.001172 ]
The MSE reported by Keras is actually the scaled MSE, i.e.:
MSE_scaled = mean_squared_error(Y_train, prediction)
MSE_scaled
# 0.052299712818541934
while the 3 steps I have described above are simply:
MSE = mean_squared_error(Y, sc_Y.inverse_transform(prediction)) # first 2 steps, combined
MSE
# 0.10459946572909758
np.sqrt(MSE) # 3rd step
# 0.323418406602187
So, in our case, if our initial Y were US dollars, the actual error in the same units (dollars) would be 0.32 (dollars).
Notice how the naive approach of inverse-transforming the scaled MSE would give a very different (and incorrect) result:
np.sqrt(sc_Y.inverse_transform([MSE_scaled]))
# array([2.25254588])
MSE is the mean squared error: MSE = (1/n) * sum((y_i - yhat_i)^2), i.e. the mean of the squared differences between the expected outputs and the predictions. Taking the square root of the scaled MSE will not give you the error in the original units, but it is useful for training.
Currently you have built a model.
If you want to train the model, use this function:
model.fit(x=input_x_array, y=input_y_array, batch_size=None, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, shuffle=True, class_weight=None, sample_weight=None, initial_epoch=0, steps_per_epoch=None, validation_steps=None)
If you want to predict the output, you should use the following code:
prediction = model.predict(np.array(input_x_array))
print(prediction)
You can find more details here.
https://keras.io/models/about-keras-models/
https://keras.io/models/sequential/

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there is a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have the same input data shape and the same "encoding" convolutional layers, but would split into two heads (fork-style): a classification head and a decoding head, such that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that, as we have full control over the computational graph.
But with Keras, things are more high-level, and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie the classification head and the autoencoding head together into one time step).
One way in Keras to almost do that goes something like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)

model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and fear that if I just try to fit things in that way, it will fail and ask for the missing head:
for epoch in range(10):
    # classification step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit"
    # to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
Sometimes, when you don't have a label, you can pass a zero vector instead of a one-hot encoded vector. This should not change your result, because a zero vector produces no error signal with the categorical cross-entropy loss.
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains the labels, and translator_dict is a Python dictionary mapping labels to their unique indices, like this:
{'unisex': 2, 'female': 1, 'male': 0}
If an unknown (UNK) label can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see real accuracy numbers: you have to filter out all zero vectors from the metrics.
def tricky_accuracy(y_true, y_pred):
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero-vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())
Note: you have to use larger batches (e.g. 32) in order to prevent zero-matrix updates, because they can make your accuracy metrics go crazy; I don't know why.
Alternative solution
Use Pseudo Labeling :)
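Pseudo-labeling is not spelled out here; the usual recipe is roughly the sketch below (x_labeled, y_labeled, x_unlabeled, and the 0.95 threshold are illustrative assumptions, with integer class labels and a sparse categorical loss):
import numpy as np

# 1. Train on the labeled data first.
model.fit(x_labeled, y_labeled, epochs=10)

# 2. Predict on the unlabeled data and keep only confident predictions.
probs = model.predict(x_unlabeled)
confident = probs.max(axis=1) > 0.95             # arbitrary confidence threshold
pseudo_labels = probs.argmax(axis=1)[confident]

# 3. Retrain on the labeled + pseudo-labeled data combined.
x_combined = np.concatenate([x_labeled, x_unlabeled[confident]])
y_combined = np.concatenate([y_labeled, pseudo_labels])
model.fit(x_combined, y_combined, epochs=10)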
You can train jointly; you have to pass an array instead of a single label.
I used fit_generator, e.g.
model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) / batch_size,
    epochs=epochs)

def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        yield batch_x, [gender_label_batch, category_label_batch]
