How is the max_features hyperparameter used in a Random Forest?

I am working with Python and I created a random forest model that looks like this:
from sklearn.ensemble import RandomForestClassifier as RFC
clf = RFC(n_estimators=100, random_state=0, max_depth=7, max_features=2)
As you can see, my max_features is set to 2. I have a doubt about how this works. Does this mean that, for the creation of one individual decision tree in my random forest, the model picks two random features and builds the whole tree with them, or does it mean that every time a new node is created within a decision tree the model randomly selects 2 features and then only uses the one that best reduces entropy/impurity?
To put it more simply, imagine that I have feature1, feature2, ......, feature10.
The first option:
max_features = 2
randomly selected features to create the root = feature3 and feature7
(sample decision tree)
feature3 <= 5.35 (sample = 200)
        /                    \
       /                      \
feature3 <= 3.35         feature7 <= 6.05
 (sample = 80)            (sample = 120)
etc...
and it continues using feature3 and feature7 in the best way until reaching purity or until reaching max_depth.
The second option (the one that I thought until now was the correct one):
max_features = 2
randomly selected features to create the root = feature3 and feature7
(sample decision tree)
(the model selects feature3 because it is the one that best reduces entropy/impurity)
feature3 <= 5.35 (sample = 200)
        /                    \
       /        ***           \
feature5 <= 3.35         feature2 <= 6.05
 (sample = 80)            (sample = 120)
*** now, for the left node, it randomly draws 2 features (let's say feature1 and feature5) and uses the one that best reduces entropy (feature5); and for the right node, it shuffles the features again, draws two new features (let's say feature2 and feature3), and selects the one that best reduces entropy (feature2).
Which is correct? Does it pick two features at the beginning and keep using them until the tree is finished, or does it select two new features each and every time a node is created?
Thanks.
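For what it's worth, scikit-learn documents max_features as the number of features to consider when looking for the best split, i.e. the candidate subset is redrawn at every node, which matches the second option. Below is a minimal, purely illustrative toy tree builder showing that per-split behaviour; it is not scikit-learn's implementation, and all helper names are made up.
# Illustrative sketch only -- NOT scikit-learn's code. It shows the per-split
# behaviour of the second option: at every node, `max_features` candidate
# features are drawn at random and only the best candidate is used to split.
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, feature_ids):
    """Best (feature, threshold) among the candidate features only."""
    best = (None, None, np.inf)
    for f in feature_ids:
        for t in np.unique(X[:, f]):
            left, right = y[X[:, f] <= t], y[X[:, f] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best

def grow_tree(X, y, max_features, max_depth, rng, depth=0):
    if depth == max_depth or gini(y) == 0.0:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}
    # key point: a *new* random subset of features is drawn at every node
    candidates = rng.choice(X.shape[1], size=max_features, replace=False)
    f, t, _ = best_split(X, y, candidates)
    if f is None:
        return {"leaf": True, "prediction": np.bincount(y).argmax()}
    mask = X[:, f] <= t
    return {"leaf": False, "feature": f, "threshold": t,
            "left": grow_tree(X[mask], y[mask], max_features, max_depth, rng, depth + 1),
            "right": grow_tree(X[~mask], y[~mask], max_features, max_depth, rng, depth + 1)}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 2] + X[:, 6] > 0).astype(int)
tree = grow_tree(X, y, max_features=2, max_depth=7, rng=rng)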

Related

How to get the RMSE for a Random Survival Forest in R

I need the RMSE for three models in order to compare them and say which one is better than the others. The models I have to run are a survival decision tree, a random survival forest, and bagging. I have run my models, but in the end I only get some predictions. I have included the random survival forest result below. What should I do to get the RMSE?
library(survival)
library(randomForestSRC)
dataset<-data.frame(data)
dataset
n.sample=round(0.5*nrow(dataset))
dataset1 = sample(1:nrow(dataset), n.sample)
train = data[dataset1, ]
test = data[-dataset1, ]
set.seed(1369)
rsf0 = rfsrc(Surv(time, status) ~ ., train, importance = TRUE, forest = T,
             ensemble = "oob", mtry = NULL, block.size = 1, splitrule = "logrank")
print(rsf0)
Results:
Sample size: 821
Number of deaths: 209
Number of trees: 1000
Forest terminal node size: 15
Average no. of terminal nodes: 38.62
No. of variables tried at each split: 4
Total no. of variables: 14
Resampling used to grow trees: swor
Resample size used to grow trees: 519
Analysis: RSF
Family: surv
Splitting rule: logrank random
Number of random split points: 10
Error rate: 36.15%
I think you slightly misunderstand what survival analysis models are usually used for. Normally we want to predict the distribution of the survival time and not the survival time itself. The RMSE can only be used when the actual survival time is predicted. In your example, the models you discuss make a distribution prediction.
So firstly I've cleaned up your code slightly and added an example dataset to make it reproducible:
library(survival)
library(randomForestSRC)
# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)
# note that you need to set.seed before you use `sample`
set.seed(1369)
# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)
# train the random forest model on the training data
rsf0 = rfsrc(Surv(time, status) ~ ., dataset[train, ], importance = TRUE, forest = T,
             ensemble = "oob", mtry = NULL, block.size = 1, splitrule = "logrank")
# now make predictions
predictions = predict(rsf0, newdata = dataset[-train, ])
# view the predicted survival probabilities
predictions$survival
With these probabilities, you have to make a decision about how to convert them to survival time predictions, and then you have to manually compute the RMSE after first removing all censored observations. Common conversions to survival time are to take the mean of the predicted individual distributions or the median.
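To make that manual computation concrete, here is a small language-agnostic sketch of the arithmetic in Python/numpy (not the randomForestSRC or mlr3 API; the array names and numbers are made-up placeholders, the assumption being that you first export the predicted survival matrix, its time grid, and the observed test times/status from R):
# Sketch: turn each predicted survival curve into a single survival time via the
# area under the curve (a restricted-mean-style estimate over the time grid),
# drop censored rows, and compute the RMSE. All inputs below are placeholders.
import numpy as np

surv_probs = np.array([[0.95, 0.80, 0.55, 0.30],   # (n_test, n_times) survival probabilities
                       [0.99, 0.90, 0.75, 0.60]])
time_grid = np.array([30.0, 60.0, 90.0, 120.0])    # evaluation time points of the curves
obs_time = np.array([85.0, 120.0])                 # observed times of the test set
status = np.array([1, 0])                          # 1 = event, 0 = censored

# expected survival time = trapezoidal area under each survival curve
pred_time = np.sum(0.5 * (surv_probs[:, 1:] + surv_probs[:, :-1]) * np.diff(time_grid), axis=1)

# alternative: median survival time = first time point where S(t) drops below 0.5
# (only meaningful if the curve actually crosses 0.5)
# pred_time = time_grid[np.argmax(surv_probs < 0.5, axis=1)]

# RMSE only over uncensored (event) observations
event = status == 1
rmse = np.sqrt(np.mean((pred_time[event] - obs_time[event]) ** 2))
print(rmse)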
As an alternative, and plugging my own package here, you could use {mlr3proba} which does this for you:
# load required packages
library(mlr3); library(mlr3proba); library(mlr3extralearners); library(mlr3pipelines)
# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)
# note that you need to set.seed before you use `sample`
set.seed(1369)
# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)
# select the random forest model and use the `crankcompositor` to automatically
# create survival time predictions
learn = ppl("crankcompositor", lrn("surv.rfsrc"), response = TRUE, graph_learner = TRUE)
# create a task which stores your dataset
task = TaskSurv$new("data", backend = dataset, time = "time", event = "status")
# train your learner on training data
learn$train(task, row_ids = train)
# make predictions on test data
predictions = learn$predict(task, row_ids = test)
# view your survival time predictions
predictions$response
# calculate RMSE
predictions$score(msr("surv.rmse"))
This second option is more complicated if you're not used to R6, but I suspect that in your use-case it will benefit you as you can also compare multiple models at the same time with this.

How to update the vocabulary of a pre-trained BERT model for my own training task?

I am now working on a task of predicting a masked word using a BERT model. Unlike the usual setting, the answer needs to be chosen from specific options.
For instance:
sentence: "In my daily [MASKED], ..."
options: A.word1 B.word2 C.word3 D.word4
the predicted word will be chosen from the four given words
I use Hugging Face's BertForMaskedLM to do this task. The model gives me a probability matrix representing every word's probability of appearing at the [MASK] position, and I just need to compare the probabilities of the words in the options to select the answer.
# Predict all tokens
with torch.no_grad():
    predictions = model(tokens_tensor, segments_tensors)
#predicted_index = torch.argmax(predictions[0, masked_index]).item()
#predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
A = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option1])]
B = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option2])]
C = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option3])]
D = predictions[0, masked_pos][tokenizer.convert_tokens_to_ids([option4])]
# and then select from A, B, C, D
But the problem is:
If the options are not in the "bert-vocabulary.txt", the above method is not going to work, since the output matrix does not give their probabilities. The same problem also appears if an option is not a single word.
Should I update the vocabulary, and how do I do that? Or how can I train the model to add new words on top of the pre-trained ones?
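One common approach, sketched below under the assumption that you are using the transformers library (the added words are placeholders), is to add the missing words to the tokenizer and resize the model's embedding matrix, then fine-tune so the new embeddings actually learn something:
# Sketch: extend the tokenizer vocabulary and resize the embedding matrix.
# The option words are placeholders; fine-tuning afterwards is needed because
# the newly added embeddings are randomly initialized.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

new_words = ["word1", "word2"]                 # options missing from the vocabulary
num_added = tokenizer.add_tokens(new_words)
print(f"added {num_added} tokens")

# make room for the new ids at the end of the embedding matrix
model.resize_token_embeddings(len(tokenizer))

# after this, tokenizer.convert_tokens_to_ids(["word1"]) returns a valid id and
# the logits at the [MASK] position include a score for it -- but that score is
# only meaningful after fine-tuning on data that contains the new words.
An alternative that is sometimes used and avoids touching the vocabulary at all is to score each (possibly multi-word) option by the summed log-probabilities of its WordPiece tokens, masking as many positions as the option has tokens.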

How to do link prediction with node embeddings?

I am currently working on an item embedding task in a recommendation system, and I want to compare the performance of a new embedding algorithm with the old ones. I have read some papers about graph embedding, and almost every paper mentions a standard method to evaluate the embeddings, which is link prediction. But none of these papers describes exactly how to do it. So my question is: how do you evaluate embeddings using link prediction?
The algorithm I am trying to apply is:
First, a directed graph is built from user click sequences: each node in the graph represents an item, and if a user clicked item A and then clicked item B, there should be two nodes A and B and an edge A→B with a weight of 1. When another user clicks A and then B, the weight of edge A→B is increased by 1.
Then a new sequence dataset is generated by random-walking the graph, using the outbound weights as the transition probabilities.
Finally, SkipGram is run on the new sequences to generate the node embeddings (a minimal sketch of the first two steps follows below).
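For concreteness, a minimal sketch of the first two steps with toy click data and made-up names (plain dictionaries hold the weighted directed graph; outbound weights are used as transition probabilities for the walks):
# Minimal sketch (illustrative data/names) of the graph construction and
# weighted random walk described above.
import random
from collections import defaultdict

click_sequences = [["A", "B", "C"], ["A", "B"], ["B", "C", "A"]]  # toy data

# adjacency: weights[u][v] = number of times v was clicked right after u
weights = defaultdict(lambda: defaultdict(int))
for seq in click_sequences:
    for u, v in zip(seq, seq[1:]):
        weights[u][v] += 1

def random_walk(start, length, rng=random):
    walk = [start]
    for _ in range(length - 1):
        out = weights[walk[-1]]
        if not out:                     # dead end: stop the walk
            break
        nxt = rng.choices(list(out.keys()), weights=list(out.values()), k=1)[0]
        walk.append(nxt)
    return walk

# generate the new "sequence dataset" that SkipGram is then trained on
walks = [random_walk(node, length=5) for node in list(weights) for _ in range(10)]
The resulting walks can then be fed to any SkipGram implementation, e.g. gensim's Word2Vec with sg=1, to obtain the node embeddings.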
As many papers mention, I removed a certain proportion of the edges in the graph (e.g. 0.25) as the positive samples of the test set and randomly generated some fake edges as the negative ones. So what's next? Should I simply generate fake edges for the real edges in the training set, concatenate the embeddings of the two nodes on each edge, build a common classifier such as logistic regression, and test it on the test set? Or should I calculate the AUC on the test set with the cosine similarity of the two nodes and a 0/1 label indicating whether the two nodes are really connected? Or should I calculate the AUC with the sigmoided dot product of the embeddings of the two nodes and a 0/1 label indicating whether the two nodes are really connected, since this is how the probability is computed in the last layer?
# these are examples describing the three methods above
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

item_emb = np.random.random(400).reshape(100, 4)  # assume we have 100 items and have embedded them into a 4-dimensional vector space
test_node = np.random.randint(0, 100, size=200).reshape(100, 2)  # assume we have 100 pairs of nodes
test_label = np.random.randint(0, 2, size=100)  # assume this is the label indicating if the pair of nodes is really connected

def test_A():
    # use logistic regression
    train_node = ...   # generate true and fake node pairs in a similar way
    train_label = ...  # generate true and fake node pairs in a similar way
    train_feat = np.hstack(
        (item_emb[train_node[:, 0]],
         item_emb[train_node[:, 1]]))  # concatenate the embeddings
    test_feat = np.hstack(
        (item_emb[test_node[:, 0]],
         item_emb[test_node[:, 1]]))   # concatenate the embeddings
    lr = LogisticRegression().fit(train_feat, train_label)
    auc = roc_auc_score(test_label, lr.predict_proba(test_feat)[:, 1])
    return auc

def test_B():
    # use cosine similarity
    emb1 = item_emb[test_node[:, 0]]
    emb2 = item_emb[test_node[:, 1]]
    cosine_sim = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    auc = roc_auc_score(test_label, cosine_sim)
    return auc

def test_C():
    # use dot product
    # here we extract the softmax (output) weights and biases from the training network
    softmax_weights = ...  # same shape as item_emb
    softmax_biases = ...   # shape of (item_emb.shape[0], 1)
    embedded_item = item_emb[test_node[:, 0]]              # target item embedding
    softmaxed_context = softmax_weights[test_node[:, 1]]   # context item output vectors
    dot_prod = np.sum(embedded_item * softmaxed_context, axis=1) + softmax_biases[test_node[:, 1]].ravel()
    auc = roc_auc_score(test_label, dot_prod)
    return auc
I have tried the three methods in several tests, and they do not always tell the same thing. Some parameter combinations perform better with test_A() and badly on the other metrics, some the opposite, etc. Sadly, there is no parameter combination that outperforms the others on all three metrics... The question is: which metric should I use?
You should investigate some implementations:
StellarGraph: Link prediction with node2vec+Logistic regression
AmpliGraph: Link prediction with ComplEx
Briefly, one should sample edges (not nodes!) from the original graph, remove them, and learn embeddings on the truncated graph. The evaluation is then performed on the removed edges.
Also, there are two possible cases:
All possible edges between any pair of nodes are labeled. In this case the evaluation metric is ROC AUC, since we learn a classifier to distinguish positive and negative edges.
Only positive (real) edges are observed. We don't know whether the remaining pairs are connected in the real world. Here we generate negative (fake) edges for every positive one. The task is treated as entity ranking, with the following evaluation metrics:
Rank
Mean Rank (MR)
Mean Reciprocal Rank (MRR)
Hits@N
An example can be found in the paper, sections 5.1-5.3.
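For the ranking case, a small numpy sketch (made-up scores, not tied to any particular library) of how MR, MRR and Hits@N are typically computed from the score of each positive edge and its corrupted counterparts:
# Sketch of the ranking metrics above: for every positive test edge we score it
# together with its fake (negative) counterparts, rank the positive within that
# candidate set, and aggregate the ranks. Scores below are illustrative only.
import numpy as np

pos_scores = np.array([0.9, 0.4, 0.7])            # e.g. dot products or cosine similarities
neg_scores = np.array([[0.2, 0.8, 0.1],           # k = 3 corrupted edges per positive edge
                       [0.5, 0.3, 0.6],
                       [0.1, 0.2, 0.3]])

# rank of each positive edge among its own candidates (1 = best)
ranks = 1 + np.sum(neg_scores >= pos_scores[:, None], axis=1)

mr = ranks.mean()                    # Mean Rank
mrr = (1.0 / ranks).mean()           # Mean Reciprocal Rank
hits_at_1 = np.mean(ranks <= 1)      # Hits@1
hits_at_3 = np.mean(ranks <= 3)      # Hits@3
print(mr, mrr, hits_at_1, hits_at_3)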

Predictive modelling

How do I perform regression (Random Forest, Neural Networks) for this kind of data?
The data contains features, and we need to predict the sales quantity based on the week and the other attributes.
I am attaching the sample data here.
We are trying to predict the sales quantity based on the other attributes.
Multivariate linear regression
Assuming
input variables x[][] (each row corresponds to a sample, each column corresponds to a variable such as week, season, ...),
expected output y[] (as many rows as x),
parameters being learned theta[] (as many as there are input variables + 1, the extra one being a bias/intercept term),
you are minimizing a cost function h:
h(theta) = sum over all samples j of ( sum over all variables i of x[j][i] * theta[i] - y[j] )^2
This can easily be achieved through gradient descent.
You can also include combinations of parameters (and simply include more thetas for those pseudo-parameters)
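A minimal numpy sketch of that gradient descent, with made-up data and a bias column of ones appended so that theta gets the one extra entry mentioned above:
# Gradient descent for multivariate linear regression, minimizing
# h(theta) = sum_j (x[j] . theta - y[j])^2 on illustrative data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # e.g. week, season, price
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=200)

Xb = np.hstack([X, np.ones((len(X), 1))])      # append bias column -> one extra theta
theta = np.zeros(Xb.shape[1])
lr = 0.01

for _ in range(2000):
    residual = Xb @ theta - y                  # x[j] . theta - y[j]
    gradient = 2 * Xb.T @ residual / len(y)    # gradient of the (mean) squared error
    theta -= lr * gradient

print(theta)  # roughly [2.0, -1.0, 0.5, 4.0]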
I have some code lying around in a GitHub repository that performs basic multivariate linear regression (for a course I sometimes teach).
https://github.com/jorisschellekens/ml/tree/master/linear_regression

Parameter selection and k-fold cross-validation

I have one dataset and need to do cross-validation, for example a 10-fold cross-validation, on the entire dataset. I would like to use a radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of an SVM using a dev set, and then apply the best hyperparameters found on the dev set to the test set for evaluation. However, in my case, the original dataset is partitioned into 10 subsets; sequentially, one subset is tested using the classifier trained on the remaining 9 subsets. Obviously, we do not have fixed training and test data. How should I do hyperparameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not, you could concatenate/shuffle it together again and then do regular (repeated) cross-validation to perform a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. These are then used to train and evaluate all parameter sets, so you get 100 results per parameter set you tried; the average performance per parameter set can be computed from those 100 results.
This process is built into most ML tools already, as in this short example in R using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)
model <- train(x = iris[, 1:4],
               y = iris[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C = 3**(-3:3), sigma = 3**(-3:3)), # all combinations of these parameters get evaluated
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        returnResamp = 'all', # store results of all parameter sets on all partitions and repeats
                                        allowParallel = T))
# performance of the different parameter sets (e.g. average and standard deviation of performance)
print(model$results)
# visualization of the above
levelplot(x = Accuracy ~ C * sigma, data = model$results, col.regions = gray(100:0/100), scales = list(log = 3))
# results of all parameter sets over all partitions and repeats; the metrics above are calculated from these
str(model$resample)
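For reference, roughly the same repeated grid search can be sketched in Python with scikit-learn (a loose equivalent of the caret call above, not a line-by-line translation):
# Repeated 10-fold CV grid search over C and gamma for an RBF-kernel SVM
# (rough scikit-learn analogue of the caret example above).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()),        # center/scale, like preProcess above
                 ("svm", SVC(kernel="rbf"))])
param_grid = {"svm__C": 3.0 ** np.arange(-3, 4),
              "svm__gamma": 3.0 ** np.arange(-3, 4)}

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)  # 100 train/eval sets
search = GridSearchCV(pipe, param_grid, cv=cv, n_jobs=-1)
search.fit(X, y)

print(search.best_params_)                           # chosen parameter set
print(search.cv_results_["mean_test_score"])         # average accuracy per parameter set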
Once you have evaluated a grid of hyperparameters, you can choose a reasonable parameter set ("model selection", e.g. by choosing a well-performing yet not overly complex model).
BTW: I would recommend repeated cross-validation over plain cross-validation if possible (possibly using more than 10 repeats, but the details depend on your problem); and as @christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.
