How to feed key-value features (aggregated data) to an LSTM?

I have the following time-series aggregated input for an LSTM-based model:
x(0): {y(0,0): {a(0,0), b(0,0)}, y(0,1): {a(0,1), b(0,1)}, ..., y(0,n): {a(0,n), b(0,n)}}
x(1): {y(1,0): {a(1,0), b(1,0)}, y(1,1): {a(1,1), b(1,1)}, ..., y(1,n): {a(1,n), b(1,n)}}
...
x(m): {y(m,0): {a(m,0), b(m,0)}, y(m,1): {a(m,1), b(m,1)}, ..., y(m,n): {a(m,n), b(m,n)}}
where x(m) is a timestep, and a(m,n) and b(m,n) are features aggregated by the non-temporal sequential key y(m,n), which can range from 0 to 1,000.
Example:
0: {90: {4, 4.2}, 91: {6, 0.2}, 92: {1, 0.4}, 93: {12, 11.2}}
1: {103: {1, 0.2}}
2: {100: {3, 0.1}, 101: {0.4, 4}}
Where 90-93, 103, and 100-101 are aggregation keys.
How can I feed this kind of input to an LSTM?
Another approach would be to use non-aggregated data. In that case, I'd get the proper input for an LSTM. Example:
Aggregated input:
0: {100: {3, 0.1}, 101: {0.4, 4}}
Original input:
0: 100, 1, 0.05
1: 101, 0.2, 2
2: 100, 1, 0
3: 100, 1, 0.05
4: 101, 0.2, 2
But in that case, the aggregation would be lost, and the whole purpose of aggregation is to minimize the number of timesteps, so that I get 500 timesteps instead of e.g. 40,000, which is impossible to feed to an LSTM. If you have any ideas, I'd appreciate it.
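One possible way to feed this to an LSTM (a sketch only, with made-up names such as NUM_KEYS and encode_timestep) is to expand each timestep's key-value map into a fixed-width vector indexed by the aggregation key, so every timestep has the same shape regardless of which keys are present:

import numpy as np

NUM_KEYS = 1001        # aggregation keys 0..1000
FEATS_PER_KEY = 2      # the features a and b

def encode_timestep(agg):
    """agg: dict mapping aggregation key -> (a, b) for one timestep."""
    vec = np.zeros(NUM_KEYS * FEATS_PER_KEY, dtype=np.float32)
    for key, (a, b) in agg.items():
        vec[key * FEATS_PER_KEY] = a
        vec[key * FEATS_PER_KEY + 1] = b
    return vec

# The example timesteps from above
series = [
    {90: (4, 4.2), 91: (6, 0.2), 92: (1, 0.4), 93: (12, 11.2)},
    {103: (1, 0.2)},
    {100: (3, 0.1), 101: (0.4, 4)},
]
X = np.stack([encode_timestep(t) for t in series])  # shape (timesteps, 2002)
X = X[np.newaxis, ...]                              # (batch=1, timesteps, 2002) for an LSTM

The resulting per-timestep vector is wide and sparse; if that is a problem, an alternative is to embed the key and attach (a, b) as extra features, but that is a design choice rather than a requirement.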

Related

Plotting decision boundary for a multiclass Random Forest model

I am cross-posting from Cross Validated and Data Science Stack Exchange, since I was told my question is code-heavy. I will delete it if the rules disallow this - I don't know what the policy is.
I am using the MNIST dataset with 10 classes (the digits 0 to 9). I am using a compressed version with 49 predictor variables (x1, x2, ..., x49). I have trained a Random Forest model and created a test data set, which is a grid, on which I have used the trained model to generate predictions, both as class probabilities and as classes. I am trying to generalise the code here that generates a decision boundary when there are only two outcome classes:
Variation on "How to plot decision boundary of a k-nearest neighbor classifier from Elements of Statistical Learning?"
and here:
https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o
and here:
Decision boundary plots in ggplot2
I have tried to visualise the boundary using the first 2 predictors (x1 and x2), though predictions have been made with all 49.
Here is my code:
## Create a grid of data which is the Test data...
## traindat is the dataset that the model was trained on
data <- traindat
resolution <- 50  # there will be 50 rows in the grid
## Extract the 49 predictor variables and drop the outcome variable
data <- data[, 2:50]
head(data)
## Get the variable names in a list
ln <- vector(mode="list", length=49)
ln<-as.list(names(data))
data_mat<-matrix(0,50,49)
r <- sapply(data, range, na.rm = TRUE)
for (i in 1:49){
data_mat[,i]<- seq(r[1,i], r[2,i], length.out = resolution)
}
data_mat
mat<-as.matrix(data_mat)
m<-as.data.frame(mat)
## Create test data grid
fn<-function(x) seq(min(x)+1, max(x) + 1, length.out=50)
test2<-apply(m, 2, fn)
test2<-as.data.frame(test2)
colnames(test2)<-unlist(ln)
test2<-as.data.frame(test2)
## label is a column that should contain the Predicted class labels
test2$label<-"-1"
test2<-test2 %>%
relocate(label, .before = x1)
## finalModel is the model obtained from training the Random Forest on traindat
prob=predict(rf_gridsearch$finalModel,test2,type="prob")
prob2=predict(rf_gridsearch$finalModel,test2,type="response")
prob2<-as.data.frame(prob)
head(prob2)
## Create predicted classes 0 to 9 and probabilities for the Test data
fn<-function(x) which.max(x)-1
outCls<-apply(prob2, 1, fn)
outCls
fn<-function(x) max(x)
outProb<-apply(prob2, 1, fn)
outProb
##Data structure for plotting
require(dplyr)
dataf2 <- bind_rows(lapply(0:9, function(k)
  mutate(test2,
         prob = outProb,
         cls = k,
         prob_cls = ifelse(outCls == k, 1, 0))))
## Solution from Stackexchange based on only two outcome classes
ggplot()+
geom_raster(data= dataf2, aes(x= x1, y=x2, fill=dataf2$cls ), interpolate = TRUE)+
geom_contour(data= NULL, aes(x= dataf2$x1, y=dataf2$x2, z= dataf2$prob), breaks=c(1.5), color="black", size=1)+
theme_few()+
scale_colour_manual(values = cols)+
labs(colour = "", fill="")+
scale_fill_gradient2(low="#338cea", mid="white", high="#dd7e7e",
midpoint=0.5, limits=range(dataf2$prob))+
theme(legend.position = "none")
My output doesn't look right - what does it mean? Also, why does the contour plot have to be based on the predicted probability? What is the idea behind the code to generate a decision boundary for any classifier? What am I doing wrong?

How to compute mean/max of HuggingFace Transformers BERT token embeddings with attention mask?

I'm using the HuggingFace Transformers BERT model, and I want to compute a summary vector (a.k.a. embedding) over the tokens in a sentence, using either the mean or max function. The complication is that some tokens are [PAD], so I want to ignore the vectors for those tokens when computing the average or max.
Here's an example. I initially instantiate a BertTokenizer and a BertModel:
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
transformer_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(transformer_name, use_fast=True)
model = AutoModel.from_pretrained(transformer_name)
I then input some sentences into the tokenizer and get out input_ids and attention_mask. Notably, an attention_mask value of 0 means that the token was a [PAD] that I can ignore.
sentences = ['Deep learning is difficult yet very rewarding.',
'Deep learning is not easy.',
'But is rewarding if done right.']
tokenizer_result = tokenizer(sentences, max_length=32, padding=True, return_attention_mask=True, return_tensors='pt')
input_ids = tokenizer_result.input_ids
attention_mask = tokenizer_result.attention_mask
print(input_ids.shape) # torch.Size([3, 11])
print(input_ids)
# tensor([[ 101, 2784, 4083, 2003, 3697, 2664, 2200, 10377, 2075, 1012, 102],
# [ 101, 2784, 4083, 2003, 2025, 3733, 1012, 102, 0, 0, 0],
# [ 101, 2021, 2003, 10377, 2075, 2065, 2589, 2157, 1012, 102, 0]])
print(attention_mask.shape) # torch.Size([3, 11])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Now, I call the BERT model to get the 768-D token embeddings (the top-layer hidden states).
model_result = model(input_ids, attention_mask=attention_mask, return_dict=True)
token_embeddings = model_result.last_hidden_state
print(token_embeddings.shape) # torch.Size([3, 11, 768])
So at this point, I have:
token embeddings in a [3, 11, 768] matrix: 3 sentences, 11 tokens, 768-D vector for each token.
attention mask in a [3, 11] matrix: 3 sentences, 11 tokens. A 1 value indicates non-[PAD].
How do I compute the mean / max over the vectors for the valid, non-[PAD] tokens?
I tried using the attention mask as a mask and then called torch.max(), but I don't get the right dimensions:
masked_token_embeddings = token_embeddings[attention_mask==1]
print(masked_token_embeddings.shape) # torch.Size([29, 768]) <-- WRONG. SHOULD BE [3, 11, 768]
pooled = torch.max(masked_token_embeddings, 1)
print(pooled.values.shape) # torch.Size([29]) <-- WRONG. SHOULD BE [3, 768]
What I really want is a tensor of shape [3, 768]. That is, a 768-D vector for each of the 3 sentences.
For max, you can multiply with attention_mask:
pooled = torch.max((token_embeddings * attention_mask.unsqueeze(-1)), axis=1)
For mean, you can sum along the axis and divide by attention_mask along that axis:
mean_pooled = token_embeddings.sum(axis=1) / attention_mask.sum(axis=-1).unsqueeze(-1)
In addition to @Quang's answer, you can have a look at the sentence_transformers Pooling layer.
For max pooling, they do this:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
pooled = torch.max(token_embeddings, 1)[0]
And for mean pooling they do the following:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
pooled = sum_embeddings / sum_mask
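For reference, applying that mean pooling to the tensors from the question yields the desired per-sentence shape (a quick sanity check, reusing token_embeddings and attention_mask as defined above):

input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
mean_pooled = sum_embeddings / sum_mask
print(mean_pooled.shape)  # torch.Size([3, 768]) -- one 768-D vector per sentence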
The max pooling presented in the accepted answer will suffer when the maximum is negative, and the implementation from sentence_transformers modifies token_embeddings in place, which throws an error when you want to use the embeddings for backpropagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
If you're interested in anything backprop-related, you can do something like this:
input_mask_expanded = torch.where(attention_mask == 0, -1e9, 0.).unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.max(token_embeddings + input_mask_expanded, 1)[0]  # padding positions get a large negative offset
It's the same idea of making all masked tokens very negative, but it doesn't modify token_embeddings in place.
Alex is right.
Look at the hidden states for the strings that go into the tokenizer: for different strings, the padding positions will have different embeddings.
So, in order to pool the embeddings properly, you need to ignore those padding vectors.
Let's say you want to get embeddings out of the last 4 layers of BERT (as it yields the best classification results):
# iterate over the last 4 layers and collect embeddings for each string,
# excluding the positions of [PAD] tokens
m = []
for i in range(len(hidden_states[0])):
    m.append([hidden_states[j + 9][i, :, :][tokens["attention_mask"][i] != 0] for j in range(4)])

# average over all token embeddings
means = []
for i in range(len(hidden_states[0])):
    means.append(torch.stack(m[i]).mean(dim=1))

# stack the embeddings for all strings
pooled = torch.stack(means).reshape(-1, 1, 3072)

Data shuffling for Image Classification

I want to develop a CNN model to identify 24 hand signs in American Sign Language. I created a custom dataset that contains 3000 images for each hand sign i.e. 72000 images in the entire dataset.
For training the model, I would be using 80-20 dataset split (2400 images/hand sign in the training set and 600 images/hand sign in the validation set).
My question is:
Should I randomly shuffle the images when creating the dataset? And why?
Based on my previous experience, shuffling led to the validation loss being lower than the training loss, and the validation accuracy higher than the training accuracy. Check this link.
Random shuffling of data is a standard procedure in all machine learning pipelines, and image classification is not an exception; its purpose is to break possible biases during data preparation - e.g. putting all the cat images first and then the dog ones in a cat/dog classification dataset.
Take for example the famous iris dataset:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
y
# result:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
As you can clearly see, the dataset has been prepared in such a way that the first 50 samples are all of label 0, the next 50 of label 1, and the last 50 of label 2. Try to perform a 5-fold cross validation in such a dataset without shuffling and you'll find most of your folds containing only a single label; try a 3-fold CV, and all your folds will include only one label. Bad... BTW, it's not just a theoretical possibility, it has actually happened.
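To see this concretely, here is a small sketch with scikit-learn's KFold (illustrative only) showing which labels end up in each test fold with and without shuffling:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
import numpy as np

X, y = load_iris(return_X_y=True)

for shuffle in (False, True):
    kf = KFold(n_splits=3, shuffle=shuffle, random_state=0 if shuffle else None)
    # which class labels appear in each test fold?
    folds = [np.unique(y[test_idx]) for _, test_idx in kf.split(X)]
    print(f"shuffle={shuffle}: {folds}")

# shuffle=False: each fold contains a single class -> [0], [1], [2]
# shuffle=True: every fold contains a mix of all three classes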
Even if no such bias exists, shuffling never hurts, so we do it always just to be on the safe side (you never know...).
Based on my previous experience, shuffling led to the validation loss being lower than the training loss, and the validation accuracy higher than the training accuracy. Check this link.
As noted in the answer there, it is highly unlikely that this was due to shuffling. Data shuffling is not anything sophisticated - essentially, it is just the equivalent of shuffling a deck of cards; it may have happened once that you insisted on "better" shuffling and subsequently you ended up with a straight flush hand, but obviously this was not due to the "better" shuffling of the cards.
Here are my two cents on the topic.
First of all, make sure to extract a test set that has an equal number of samples for each hand sign (hand sign #1: 500 samples, hand sign #2: 500 samples, and so on).
I think this is referred to as stratified sampling.
When it comes to the training set, there is no huge mistake in shuffling the entire set. However, when splitting the training set into training and validation sets, make sure that the validation set is a good enough representation of the test set.
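A hedged sketch of such a stratified split with scikit-learn's train_test_split; the images and labels arrays below are placeholders standing in for the real dataset:

import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for the 72,000 hand-sign images and their 24 class labels
images = np.zeros((72000, 32, 32, 1), dtype=np.uint8)
labels = np.repeat(np.arange(24), 3000)

X_train, X_val, y_train, y_val = train_test_split(
    images, labels,
    test_size=0.2,      # the 80-20 split described in the question
    stratify=labels,    # keep every hand sign equally represented in both splits
    shuffle=True,
    random_state=42,
)
print(X_train.shape, X_val.shape)  # (57600, 32, 32, 1) (14400, 32, 32, 1)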
One of my personal experiences with shuffling:
After splitting the training set into training and validation sets, the validation set turned out to be very easy to predict. Therefore, I saw good learning metric values. However, the performance of the model on the test set was horrible.

multilayer_perceptron: "ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet" warning?

I have written a basic program to understand what's happening in the MLP classifier.
from sklearn.neural_network import MLPClassifier
Data: a dataset of body metrics (height, weight, and shoe size) labeled male or female:
X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
[190, 90, 47], [175, 64, 39],
[177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
'female', 'male', 'male']
prepare the model:
clf= MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
solver='adam', alpha=0.0001,learning_rate='constant',
learning_rate_init=0.001)
train
clf= clf.fit(X, y)
attributes of the learned classifier:
print('current loss computed with the loss function: ',clf.loss_)
print('coefs: ', clf.coefs_)
print('intercepts: ',clf.intercepts_)
print(' number of iterations the solver: ', clf.n_iter_)
print('num of layers: ', clf.n_layers_)
print('Num of o/p: ', clf.n_outputs_)
test
print('prediction: ', clf.predict([ [179, 69, 40],[175, 72, 45] ]))
calc. accuracy
print( 'accuracy: ',clf.score( [ [179, 69, 40],[175, 72, 45] ], ['female','male'], sample_weight=None ))
RUN1
current loss computed with the loss function: 0.617580287851
coefs: [array([[ 0.17222046, -0.02541928, 0.02743722],
[-0.19425909, 0.14586716, 0.17447281],
[-0.4063903 , 0.148889 , 0.02523247]]), array([[-0.66332919],
[ 0.04249613],
[-0.10474769]])]
intercepts: [array([-0.05611057, 0.32634023, 0.51251098]), array([ 0.17996649])]
number of iterations the solver: 200
num of layers: 3
Num of o/p: 1
prediction: ['female' 'male']
accuracy: 1.0
/home/anubhav/anaconda3/envs/mytf/lib/python3.6/site-packages/sklearn/neural_network/multilayer_perceptron.py:563: ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.
% (), ConvergenceWarning)
RUN2
current loss computed with the loss function: 0.639478303643
coefs: [array([[ 0.02300866, 0.21547873, -0.1272455 ],
[-0.2859666 , 0.40159542, 0.55881399],
[ 0.39902066, -0.02792529, -0.04498812]]), array([[-0.64446013],
[ 0.60580985],
[-0.22001532]])]
intercepts: [array([-0.10482234, 0.0281211 , -0.16791644]), array([-0.19614561])]
number of iterations the solver: 39
num of layers: 3
Num of o/p: 1
prediction: ['female' 'female']
accuracy: 0.5
RUN3
current loss computed with the loss function: 0.691966937074
coefs: [array([[ 0.21882191, -0.48037975, -0.11774392],
[-0.15890357, 0.06887471, -0.03684797],
[-0.28321762, 0.48392007, 0.34104955]]), array([[ 0.08672174],
[ 0.1071615 ],
[-0.46085333]])]
intercepts: [array([-0.36606747, 0.21969636, 0.10138625]), array([-0.05670653])]
number of iterations the solver: 4
num of layers: 3
Num of o/p: 1
prediction: ['male' 'male']
accuracy: 0.5
RUN4:
current loss computed with the loss function: 0.697102567593
coefs: [array([[ 0.32489731, -0.18529689, -0.08712877],
[-0.35425908, 0.04214241, 0.41249622],
[-0.19993622, -0.38873908, -0.33057999]]), array([[ 0.43304555],
[ 0.37959392],
[ 0.55998979]])]
intercepts: [array([ 0.11555407, -0.3473817 , -0.16852093]), array([ 0.31326347])]
number of iterations the solver: 158
num of layers: 3
Num of o/p: 1
prediction: ['male' 'male']
accuracy: 0.5
-----------------------------------------------------------------
I have the following questions:
1. Why did the optimizer not converge in RUN1?
2. Why did the number of iterations suddenly become so low in RUN3 and so high in RUN4?
3. What else can be done to increase the accuracy I got in RUN1?
1: Your MLP didn't converge:
The algorithm optimizes step by step towards a minimum, and in RUN1 that minimum wasn't found within the allowed number of iterations.
2: Differences between runs:
Your MLP starts from random initial weights, so you don't get the same results from run to run, as you can see in your output. It seems you started very close to a minimum in your fourth run. You can set the random_state parameter of your MLP to a constant, e.g. random_state=0, to get the same result over and over.
3 is the most difficult point.
You can optimize hyperparameters with
from sklearn.model_selection import GridSearchCV
Grid search splits the data you pass to fit into equally sized parts, uses one part as test data and the rest as training data, so it trains as many classifiers as the number of parts you split your data into.
You need to specify the number of parts (your data is small, so I suggest 2 or 3), a classifier (your MLP), and a grid of parameters you want to optimize, like this:
param_grid = [
    {
        'activation': ['identity', 'logistic', 'tanh', 'relu'],
        'solver': ['lbfgs', 'sgd', 'adam'],
        'hidden_layer_sizes': [
            (1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,), (11,),
            (12,), (13,), (14,), (15,), (16,), (17,), (18,), (19,), (20,), (21,)
        ]
    }
]
Because you once got 100 percent accuracy with a hidden layer of three neurons, you can try to optimize parameters like the learning rate and momentum instead of the hidden layer sizes (a sketch of such a grid follows the example below).
Use GridSearchCV like this:
clf = GridSearchCV(MLPClassifier(), param_grid, cv=3, scoring='accuracy')
clf.fit(X, y)
print("Best parameters set found on development set:")
print(clf.best_params_)
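For example, here is a sketch of such a grid that keeps the hidden layer fixed at three neurons and tunes the learning rate and momentum instead; momentum only applies to the 'sgd' solver, so that solver is fixed here, and the values, max_iter and random_state are illustrative assumptions rather than a recommendation:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = [
    {
        'learning_rate_init': [0.0001, 0.001, 0.01, 0.1],
        'momentum': [0.5, 0.9, 0.99],
    }
]

clf = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
                  solver='sgd', max_iter=2000, random_state=0),
    param_grid, cv=3, scoring='accuracy')
clf.fit(X, y)
print(clf.best_params_)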
You can also consider increasing the number of iterations, e.g.:
clf = MLPClassifier(max_iter=500)
This cleared the warning when I did the same.

Keras model predict train/test shape

I am training a CNN with Keras, but on 30x30 patches from an image. I want to test the network on a full image, but I get the following error:
ValueError: GpuElemwise. Input dimension mis-match. Input 2 (indices start at 0) has shape[1] == 30, but the output's size on that axis is 100.
Apply node that caused the error: GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)](GpuDimShuffle{0,2,3,1}.0, GpuReshape{4}.0, GpuFromHost.0)
Toposort index: 79
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, (True, True, True, False)), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(10, 100, 100, 3), (1, 1, 1, 3), (10, 30, 30, 3)]
Inputs strides: [(30000, 100, 1, 10000), (0, 0, 0, 1), (2700, 90, 3, 1)]
Inputs values: ['not shown', CudaNdarray([[[[ 0.01060364 0.00988821 0.00741314]]]]), 'not shown']
Outputs clients: [[GpuCAReduce{pre=sqr,red=add}{0,1,1,1}(GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)].0)]]
This is my model.predict:
predict_image = model.predict(np.array([test_images[1]]), batch_size=1)[0]
It seems like the issue is that the input size cannot be anything other than 30x30, but the input shape for the first layer of my network is (None, None, 3):
model.add(Convolution2D(n1, f1, f1, border_mode='same', input_shape=(None, None, 3), activation='relu'))
Is it simply not possible to test an image with different dimensions to the ones I trained with?
As fchollet himself described here, you should be able to define the input like so:
input_shape=(1, None, None)
However, this will fail if you have layers that use the Flatten operation.
This suggests that you should be able to accomplish your goal with a fully convolutional network.
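Here is a minimal sketch of such a fully convolutional model in the current tf.keras API; the layer sizes and the 3-channel output are made up for illustration, and the point is only that with no Flatten or Dense layers the spatial dimensions can stay None:

from tensorflow.keras import layers, models

# Every layer is convolutional, so the spatial dimensions can stay unspecified
model = models.Sequential([
    layers.Conv2D(64, 3, padding='same', activation='relu', input_shape=(None, None, 3)),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.Conv2D(3, 3, padding='same'),   # e.g. a 3-channel output the same size as the input
])
model.compile(optimizer='adam', loss='mse')
model.summary()  # output shape is (None, None, None, 3): batch, height, width, channels

With this setup, a call like the question's model.predict(np.array([test_images[1]]), batch_size=1) should accept any image height and width, not just 30x30 patches.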
