Plotting decision boundary for a multiclass Random Forest model - machine-learning

I am cross-posting from Cross Validated and Data Science Stack Exchange, since I was told my question is code heavy. I will delete this if the rules disallow it - I don't know what the position is.
I am using the MNIST dataset with 10 classes (the digits 0 to 9), in a compressed version with 49 predictor variables (x1, x2, ..., x49). I have trained a Random Forest model and have created a test data set, which is a grid, on which I have used the trained model to generate predictions as class probabilities as well as classes. I am trying to generalise the code here, which generates a decision boundary when there are only two outcome classes:
Variation on "How to plot decision boundary of a k-nearest neighbor classifier from Elements of Statistical Learning?"
and here:
https://stats.stackexchange.com/questions/21572/how-to-plot-decision-boundary-of-a-k-nearest-neighbor-classifier-from-elements-o
and here:
Decision boundary plots in ggplot2
I have tried to visualise the boundary using the first two predictors (x1 and x2), though predictions have been made with all 49.
Here is my code:
## Create a grid of data which is the Test data...
## traindat is the dataset that the model was trained on
data <- traindat
resolution <- 50  ## there will be 50 rows
## Extract the 49 predictor variables and drop the outcome variable
data <- data[, 2:50]
head(data)
## Get the variable names in a list
ln <- vector(mode = "list", length = 49)
ln <- as.list(names(data))
## For each predictor, build a sequence of 50 values spanning its range
data_mat <- matrix(0, 50, 49)
r <- sapply(data, range, na.rm = TRUE)
for (i in 1:49) {
  data_mat[, i] <- seq(r[1, i], r[2, i], length.out = resolution)
}
data_mat
mat <- as.matrix(data_mat)
m <- as.data.frame(mat)
## Create test data grid
fn <- function(x) seq(min(x) + 1, max(x) + 1, length.out = 50)
test2 <- apply(m, 2, fn)
test2 <- as.data.frame(test2)
colnames(test2) <- unlist(ln)
test2 <- as.data.frame(test2)
## label is a column that should contain the predicted class labels
test2$label <- "-1"
test2 <- test2 %>%
  relocate(label, .before = x1)
## finalModel is the model obtained from training the Random Forest on traindat
prob  <- predict(rf_gridsearch$finalModel, test2, type = "prob")
prob2 <- predict(rf_gridsearch$finalModel, test2, type = "response")
prob2 <- as.data.frame(prob)
head(prob2)
## Create predicted classes 0 to 9 and probabilities for the Test data
fn <- function(x) which.max(x) - 1  ## column with the highest probability, shifted to 0-9
outCls <- apply(prob2, 1, fn)
outCls
fn <- function(x) max(x)  ## probability of the winning class
outProb <- apply(prob2, 1, fn)
outProb
## Data structure for plotting: one copy of the grid per class 0 to 9
require(dplyr)
dataf2 <- bind_rows(lapply(0:9, function(k)
  mutate(test2,
         prob     = outProb,
         cls      = k,
         prob_cls = ifelse(outCls == k, 1, 0))))
## Solution from Stack Exchange based on only two outcome classes
ggplot() +
  geom_raster(data = dataf2, aes(x = x1, y = x2, fill = dataf2$cls), interpolate = TRUE) +
  geom_contour(data = NULL, aes(x = dataf2$x1, y = dataf2$x2, z = dataf2$prob),
               breaks = c(1.5), color = "black", size = 1) +
  theme_few() +
  scale_colour_manual(values = cols) +
  labs(colour = "", fill = "") +
  scale_fill_gradient2(low = "#338cea", mid = "white", high = "#dd7e7e",
                       midpoint = 0.5, limits = range(dataf2$prob)) +
  theme(legend.position = "none")
My output doesn't look right - what does it mean? Also, why does the contour plot have to be based on the predicted probability? What is the idea behind the code to generate a decision boundary for any classifier? What am I doing wrong?
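For context, the recipe behind the linked two-class answers is: fit a model on exactly the predictors you intend to plot (or hold the remaining predictors fixed at constant values), predict the class at every point of a dense 2-D grid over those two predictors, and display the predicted class as a fill; the contour at a break of 1.5 is just a way of tracing the boundary when the two classes are coded 1 and 2. Below is a minimal sketch of that recipe with randomForest and ggplot2 on the built-in iris data, purely for illustration - it is not the MNIST/caret pipeline above, and the predictors and grid resolution are arbitrary choices.
library(randomForest)
library(ggplot2)

set.seed(1)
## Fit on the two predictors that will be plotted
fit <- randomForest(Species ~ Sepal.Length + Sepal.Width, data = iris)

## Dense 2-D grid spanning those two predictors
grid <- expand.grid(
  Sepal.Length = seq(min(iris$Sepal.Length), max(iris$Sepal.Length), length.out = 200),
  Sepal.Width  = seq(min(iris$Sepal.Width),  max(iris$Sepal.Width),  length.out = 200)
)
grid$pred <- predict(fit, grid)  ## predicted class at every grid point

## Filled regions are the decision regions; boundaries are where the fill changes
ggplot() +
  geom_raster(data = grid, aes(x = Sepal.Length, y = Sepal.Width, fill = pred), alpha = 0.4) +
  geom_point(data = iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_contour(data = grid,
               aes(x = Sepal.Length, y = Sepal.Width, z = as.integer(pred)),
               breaks = c(1.5, 2.5), colour = "black", size = 0.4) +
  theme(legend.position = "none")
The fill by predicted class is the part that carries over to ten classes; a single contour at breaks = 1.5 only makes sense for a two-class 1/2 coding, so applied to probabilities in [0, 1] it draws nothing.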

Related

How can I do caret training with cross-validation on predefined (grouped) splits in the training data?

I would like to train an ML model in caret based on training data.
I have training data with the following structure:
df <- data.frame(
  Label      = c("A","A","A","B","A", "A","A","B","B","A", "B","B","A","A","A"),
  EXPERIMENT = c("X","X","X","X","X", "Y","Y","Y","Y","Y", "Z","Z","Z","Z","Z"),
  VALUE1     = c(1, 2, 1, 5, 1,  3, 1, 5, 6, 1,  7, 5, 1, 2, 2),
  VALUE2     = c(9, 7, 8, 1, 8,  2, 1, 9, 8, 2,  7, 7, 2, 1, 1)
)
I want to use train and split the data according to experiments for cross-validation (here, 3 cross-validation splits), that is:
Split 1: training = X,Y and validation = Z
Split 2: training = X,Z and validation = Y
Split 3: training = Y,Z and validation = X
How can I do that? With trainControl?
I found an index option in trainControl, but did not understand whether it can do this.
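For reference, trainControl()'s index argument expects a list with one element per resample, each holding the row numbers to train on (the rows left out form the validation set), and caret::groupKFold() builds exactly these leave-one-group-out splits. A rough sketch using the df above - method = "rf" is only a placeholder model, and Label is converted to a factor for classification:
library(caret)

df$Label <- factor(df$Label)

## One resample per EXPERIMENT: train on two experiments, validate on the third
folds <- groupKFold(df$EXPERIMENT, k = length(unique(df$EXPERIMENT)))

ctrl <- trainControl(method = "cv", index = folds)

fit <- train(Label ~ VALUE1 + VALUE2, data = df,
             method = "rf", trControl = ctrl)
fit$resample  ## one row of held-out performance per experiment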

How to compute mean/max of HuggingFace Transformers BERT token embeddings with attention mask?

I'm using the HuggingFace Transformers BERT model, and I want to compute a summary vector (a.k.a. embedding) over the tokens in a sentence, using either the mean or max function. The complication is that some tokens are [PAD], so I want to ignore the vectors for those tokens when computing the average or max.
Here's an example. I initially instantiate a BertTokenizer and a BertModel:
import torch
import transformers
from transformers import AutoTokenizer, AutoModel
transformer_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(transformer_name, use_fast=True)
model = AutoModel.from_pretrained(transformer_name)
I then input some sentences into the tokenizer and get out input_ids and attention_mask. Notably, an attention_mask value of 0 means that the token was a [PAD] that I can ignore.
sentences = ['Deep learning is difficult yet very rewarding.',
             'Deep learning is not easy.',
             'But is rewarding if done right.']
tokenizer_result = tokenizer(sentences, max_length=32, padding=True, return_attention_mask=True, return_tensors='pt')
input_ids = tokenizer_result.input_ids
attention_mask = tokenizer_result.attention_mask
print(input_ids.shape) # torch.Size([3, 11])
print(input_ids)
# tensor([[ 101, 2784, 4083, 2003, 3697, 2664, 2200, 10377, 2075, 1012, 102],
# [ 101, 2784, 4083, 2003, 2025, 3733, 1012, 102, 0, 0, 0],
# [ 101, 2021, 2003, 10377, 2075, 2065, 2589, 2157, 1012, 102, 0]])
print(attention_mask.shape) # torch.Size([3, 11])
print(attention_mask)
# tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
# [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])
Now, I call the BERT model to get the 768-D token embeddings (the top-layer hidden states).
model_result = model(input_ids, attention_mask=attention_mask, return_dict=True)
token_embeddings = model_result.last_hidden_state
print(token_embeddings.shape) # torch.Size([3, 11, 768])
So at this point, I have:
token embeddings in a [3, 11, 768] matrix: 3 sentences, 11 tokens, 768-D vector for each token.
attention mask in a [3, 11] matrix: 3 sentences, 11 tokens. A 1 value indicates non-[PAD].
How do I compute the mean / max over the vectors for the valid, non-[PAD] tokens?
I tried using the attention mask as a mask and then called torch.max(), but I don't get the right dimensions:
masked_token_embeddings = token_embeddings[attention_mask==1]
print(masked_token_embeddings.shape) # torch.Size([29, 768]) <-- WRONG. SHOULD BE [3, 11, 768]
pooled = torch.max(masked_token_embeddings, 1)
print(pooled.values.shape) # torch.Size([29]) <-- WRONG. SHOULD BE [3, 768]
What I really want is a tensor of shape [3, 768]. That is, a 768-D vector for each of the 3 sentences.
For max, you can multiply with attention_mask:
pooled = torch.max((token_embeddings * attention_mask.unsqueeze(-1)), axis=1)
For mean, you can mask the embeddings, sum along the token axis, and divide by the number of non-pad tokens:
mean_pooled = (token_embeddings * attention_mask.unsqueeze(-1)).sum(axis=1) / attention_mask.sum(axis=-1).unsqueeze(-1)
In addition to @Quang's answer, you can have a look at the sentence_transformers Pooling layer.
For max pooling, they do this:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
token_embeddings[input_mask_expanded == 0] = -1e9 # Set padding tokens to large negative value
pooled = torch.max(token_embeddings, 1)[0]
And for mean pooling they do the following:
input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
sum_mask = input_mask_expanded.sum(1)
sum_mask = torch.clamp(sum_mask, min=1e-9)
pooled = sum_embeddings / sum_mask
The max pooling presented in the accepted answer will suffer when the max is negative, and the implementation from sentence_transformers modifies token_embeddings, which throws an error when you want to use the embeddings for backpropagation:
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
If you're interested in anything backprop-related, you can do something like this:
input_mask_expanded = torch.where(attention_mask == 0, 1e9, 0.).unsqueeze(-1).expand(token_embeddings.size()).float()
pooled = torch.max(token_embeddings - input_mask_expanded, 1)[0]  # padding positions end up at a large negative value
It's the same idea of pushing all masked tokens to a very small value, but it doesn't modify token_embeddings in place.
Alex is right.
Look at the hidden states for the strings that go into the tokenizer: for different strings, the padding positions will have different embeddings.
So, in order to pool embeddings properly, you need to ignore those padding vectors.
Let's say you want to get embeddings out of the last 4 layers of BERT (as that yields the best classification results):
# iterate over the last 4 layers and collect, per string, the embeddings of its
# non-[PAD] tokens; this assumes the model was called with output_hidden_states=True
# (so hidden_states[9:13] are the last 4 layers) and that 'tokens' is the tokenizer output
m = []
for i in range(len(hidden_states[0])):
    m.append([hidden_states[j + 9][i, :, :][tokens["attention_mask"][i] != 0] for j in range(4)])
# average over all token embeddings
means = []
for i in range(len(hidden_states[0])):
    means.append(torch.stack(m[i]).mean(dim=1))
# stack embeddings for all strings
pooled = torch.stack(means).reshape(-1, 1, 3072)

Naive Bayes model from scikit learn

What is the difference between the two gnb.fit() calls?
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train.ravel())
gnb.fit(X_train, y_train).predict(X_test)
I see two differences:
1. np.ravel flattens arrays, e.g.
x = np.array([[1, 2, 3], [4, 5, 6]])
np.ravel(x)
array([1, 2, 3, 4, 5, 6])
2. gnb.fit returns the model itself, so it can be used to predict immediately. Predicting does not modify the model.
If your y_train is one-dimensional, you'll get the same model regardless of whether you call gnb.fit(x, y) or gnb.fit(x, y.ravel()), because y == y.ravel().

How to feed key-value features (aggregated data) to LSTM?

I have the following time-series aggregated input for an LSTM-based model:
x(0): {y(0,0): {a(0,0), b(0,0)}, y(0,1): {a(0,1), b(0,1)}, ..., y(0,n): {a(0,n), b(0,n)}}
x(1): {y(1,0): {a(1,0), b(1,0)}, y(1,1): {a(1,1), b(1,1)}, ..., y(1,n): {a(1,n), b(1,n)}}
...
x(m): {y(m,0): {a(m,0), b(m,0)}, y(m,1): {a(m,1), b(m,1)}, ..., y(m,n): {a(m,n), b(m,n)}}
where x(m) is a timestep, and a(m,n) and b(m,n) are features aggregated by the non-temporal sequential key y(m,n), which can range from 0 to 1,000.
Example:
0: {90: {4, 4.2}, 91: {6, 0.2}, 92: {1, 0.4}, 93: {12, 11.2}}
1: {103: {1, 0.2}}
2: {100: {3, 0.1}, 101: {0.4, 4}}
Where 90-93, 103, and 100-101 are aggregation keys.
How can I feed this kind of input to an LSTM?
Another approach would be to use the non-aggregated data; in that case, I'd have proper input for an LSTM. Example:
Aggregated input:
0: {100: {3, 0.1}, 101: {0.4, 4}}
Original input:
0: 100, 1, 0.05
1: 101, 0.2, 2
2: 100, 1, 0
3: 100, 1, 0.05
4: 101, 0.2, 2
But in that case, the aggregation would be lost, and the whole purpose of aggregation is to minimize the number of timesteps so that I get 500 timesteps instead of e.g. 40,000, which is impossible to feed to an LSTM. If you have any ideas, I'd appreciate them.

Keras model predict train/test shape

I am training a CNN with Keras on 30x30 patches from an image. I want to test the network on a full image, but I get the following error:
ValueError: GpuElemwise. Input dimension mis-match. Input 2 (indices start at 0) has shape[1] == 30, but the output's size on that axis is 100.
Apply node that caused the error: GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)](GpuDimShuffle{0,2,3,1}.0, GpuReshape{4}.0, GpuFromHost.0)
Toposort index: 79
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, (True, True, True, False)), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(10, 100, 100, 3), (1, 1, 1, 3), (10, 30, 30, 3)]
Inputs strides: [(30000, 100, 1, 10000), (0, 0, 0, 1), (2700, 90, 3, 1)]
Inputs values: ['not shown', CudaNdarray([[[[ 0.01060364  0.00988821  0.00741314]]]]), 'not shown']
Outputs clients: [[GpuCAReduce{pre=sqr,red=add}{0,1,1,1}(GpuElemwise{Composite{((i0 + i1) - i2)}}[(0, 0)].0)]]
This is my model.predict:
predict_image = model.predict(np.array([test_images[1]]), batch_size=1)[0]
It seems like the issue is that the input size cannot be anything other than 30x30, but the input shape for the first layer of my network is (None, None, 3).
model.add(Convolution2D(n1, f1, f1, border_mode='same', input_shape=(None, None, 3), activation='relu'))
Is it simply not possible to test an image with different dimensions to the ones I trained with?
As fchollet himself described here, you should be able to define the input as follows:
input_shape=(1, None, None)
However this will fail if you have layers that use the Flatten operation.
This suggests that you should be able to accomplish your goal with a fully convolutional NN.
