Parameterization of a random forest learner in mlr3

I am struggling to train a random forest via the ranger library while using mlr3. I set 3 parameters but don't know how to initialize the training. Can someone help me debug this code in R?
pension_trunc<-subset(pension,select=-c(1:8,10:11,18,20,22,24:31,33:44))
names(pension_trunc)
str(pension_trunc)
#setting up the training and test sets
library(mlr3)
#
#define task
task<-TaskRegr$new(id="pension",backend=pension_trunc,target="net_tfa")
print(task)
#splitting the truncated dataset into training and test samples
trn_trunc<-sample(task$nrow,0.8*task$nrow)
test_trunc<-setdiff(seq_len(task$nrow),trn_trunc)
str(trn_trunc)
str(test_trunc)
#------------------------------------------------------------------------
#Set parameters
PS = ParamSet$new(list(
  ParamInt$new(id = "mtry", default = 3L, lower = 1L, upper = 5L, tags = "train"),
  ParamInt$new(id = "max.depth", default = 5L, lower = 1L, upper = 30L, tags = "train"),
  ParamInt$new(id = "min.node.size", default = 10L, lower = 1L, upper = 30L, tags = "train")))
PS
#define learner for random forest using ranger library
learner4 = lrn("regr.ranger")
#num.trees = 500 set as default value
#------------------------------------------------
#--------------------------------------------
#train random forest learner
lp_rf<-learner4$train(task=task,row_ids=trn_trunc,paramSet=PS)
lp_rf
#predict rf on training sample
pp_rf=learner4$predict(task=task,row_ids=trn_trunc)
pp_rf
library(mlr3viz)  # needed for autoplot()
autoplot(pp_rf)
coef(learner4$model,newdata=trn_trunc)
#predict rf on test sample
pp_rf_test=learner4$predict(task=task,row_ids=test_trunc)
pp_rf_test
#
measure=msr("regr.mse")
pp_rf_test$score(measure)

Related

Is it possible to obtain predictions on the training data from resample results?

After executing the code below, it is possible to obtain predictions on the testing partition by using rr$predictions()[[1]]. But is it possible to obtain the predictions on the training partition?
task = tsk("penguins")
learner = lrn("classif.rpart")
resampling = rsmp("holdout")
rr = resample(task, learner, resampling)
Thanks!
You need to set the predict_sets field of the learner to include both "train" and "test" before calling resample(), like this:
learner$predict_sets = c("train", "test")
Keep everything else the same, run the resampling again, and get the training-set predictions with
rr$prediction("train")

Should I use a MinMaxScaler which was fit on the train dataset to transform the test dataset, or use a separate MinMaxScaler to fit and transform the test dataset?

Assume that I have 3 datasets in an ML problem.
train dataset: used to estimate ML model parameters (training)
test dataset: used to evaluate the trained model and calculate its accuracy
prediction dataset: used only for prediction after model deployment
I don't have a separate evaluation dataset, and I use Grid Search with k-fold cross-validation to find the best model.
Also, I have two Python scripts as follows:
train.py: used to train and test the ML model; it loads the train and test datasets, finds the best model via Grid Search, and saves the trained model.
predict.py: used to load the pre-trained model and the prediction dataset, predict the model output, and calculate accuracy.
Before starting the training process in train.py, I use MinMaxScaler as follows:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_train) # fit only on train dataset
x_train_norm = scaler.transform(x_train)
x_test_norm = scaler.transform(x_test)
In predict.py, after loading the prediction dataset, I need to apply the same data pre-processing, as below:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(x_predict)
x_predict_norm = scaler.transform(x_predict)
As you can see above, both fit and transform are done on the prediction dataset. However, in train.py, fit is done on the train dataset, and the same MinMaxScaler is applied to transform the test dataset.
My understanding is that the test dataset is a simulation of the real data the model is supposed to predict after deployment. Therefore, the data pre-processing of the test and prediction datasets should be the same.
I think separate MinMaxScalers should be used in train.py for the train and test datasets, as follows:
from sklearn.preprocessing import MinMaxScaler
scaler_train = MinMaxScaler()
scaler_test = MinMaxScaler()
scaler_train.fit(x_train) # fit only on train dataset
x_train_norm = scaler_train.transform(x_train)
scaler_test.fit(x_test) # fit only on test dataset
x_test_norm = scaler_test.transform(x_test)
What is the difference?
The value of x_test_norm will be different if I use a separate MinMaxScaler as explained above. In that case, the values of x_test_norm lie in the range [0, 1]. However, if I transform the test dataset with a MinMaxScaler that was fit on the train dataset, the values of x_test_norm can fall outside the range [0, 1].
Please let me know your idea about it.
When you run .transform(), MinMax scaling does something like (value - min) / (max - min). The values of min and max are defined when you run .fit(). So the answer is yes: you should fit the MinMaxScaler on the training dataset and then use it on the test dataset.
Just imagine a situation where the training dataset has some feature with max=100 and min=10, while in the test dataset that feature has max=10 and min=1. If you fit a separate MinMaxScaler on the test subset, yes, it will scale the feature into the range [0, 1], but relative to the training dataset the scaled values should actually be lower.
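To make that concrete, here is a small sketch with made-up numbers matching that example (one feature, default MinMaxScaler range of [0, 1]):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# train feature spans [10, 100], test feature spans [1, 10]
x_train = np.array([[10.0], [55.0], [100.0]])
x_test = np.array([[1.0], [5.0], [10.0]])

scaler = MinMaxScaler().fit(x_train)           # min=10, max=100 learned from the training data
print(scaler.transform(x_test).ravel())        # [-0.1, -0.0556, 0.0] -> correctly lower than the training range

bad_scaler = MinMaxScaler().fit(x_test)        # refitting on the test subset hides that shift
print(bad_scaler.transform(x_test).ravel())    # [0.0, 0.4444, 1.0] -> looks "normal" even though it isn't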
Also, regarding Grid Search with k-fold cross-validation, you should use a Pipeline. In that case, Grid Search will automatically fit the MinMaxScaler on the k-1 training folds only. Here is a good example of how to organize a pipeline with mixed types.
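A minimal sketch of that pattern, assuming the x_train/y_train/x_test variables from the question and an arbitrary SVR estimator with a hypothetical parameter grid, just for illustration:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", SVR()),
])
param_grid = {"model__C": [0.1, 1, 10]}   # hypothetical grid

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x_train, y_train)       # the scaler is re-fit on each set of k-1 training folds
y_pred = search.predict(x_test)    # test data is only transformed with the already-fitted scaler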

Why is test data also involved in lightGBM train() and also used to calculate the prediction error?

I would like to use lightGBM to train a machine learning model.
I checked the example at https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/advanced_example.py
I have some questions about the correctness of the code.
(1) What kinds of models can be created by lightgbm.train()?
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html
Is it a regressor or a classifier?
(2) Why is the test dataset also used during training? How can this ensure that the test results are still valid?
# line 31
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)
# line 52
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data with test data !!!
                feature_name=feature_name,
                categorical_feature=[21])
# line 84
y_pred = bst.predict(X_test)  # why x_test is also used to predict y? X_test has been involved in training the model !!!
Thanks
You can train both regression and classification models using lgb.train. It depends on the parameters you define, namely objective.
The test set (valid_sets) is used only for validation; it isn't used for fitting the model.
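A minimal sketch of that split, with synthetic data and a regression objective just for illustration (swap objective to e.g. "binary" to get a classifier):
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 5)
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * np.random.randn(500)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_valid = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

params = {"objective": "regression"}     # the objective decides regressor vs. classifier
gbm = lgb.train(params,
                lgb_train,               # only this set is used to fit the trees
                num_boost_round=50,
                valid_sets=[lgb_valid])  # only monitored for evaluation, never fit on

y_pred = gbm.predict(X_valid)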

How to load unlabelled data for sentiment classification after training an SVM model?

I am trying to do sentiment classification and I used the sklearn SVM model. I used the labeled data to train the model and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? And after classification of the unlabeled data, how can I see whether it is classified as positive or negative?
I used Python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning)
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mentioned, ConvergenceWarning means that max_iter was hit before the solver converged; you can suppress the warning as described here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You will do it with the command pred_y = clf.predict(test_x). The only things you will adjust are pred_y (the name is your free choice) and test_x, which should be your new unseen data; it has to have the same form as your existing test_x and train_x.
In your case as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
You are forming a list of (article, sentiment) tuples.
Then you shuffle it and unzip the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here your train_x is the column data['Articles'], so all you have to do if you have new data is:
new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can then inspect pred_y: there will be either a 1 or a 0 for each sample. Normally 0 should be negative, but it depends on how your dataset is set up.
Check out this page about model persistence. You just dump the fitted model, load it later, and call its predict method; the model will return the predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
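For example, a minimal persistence sketch with joblib (the file name is arbitrary), reusing the clf pipeline from the question:
import joblib

# after clf.fit(train_x, train_y) in the training script
joblib.dump(clf, "sentiment_clf.joblib")

# later, in a prediction script
clf = joblib.load("sentiment_clf.joblib")
print(clf.predict(["The product arrived on time and works great."]))  # e.g. array([1])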
If I were you, I'd rather take a fully data-driven approach and use some pretrained embedder. It'll also work for dozens of languages out of the box, which is quite neat.
There's LASER from Facebook. There's also a PyPI package, though unofficial. It works just fine.
Nowadays there are a lot of pretrained models, so it shouldn't be that hard to reach near state-of-the-art scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you prepare the unlabeled data in the same way train_x and test_x are generated. Probably it's a 2D matrix of shape n_samples x 1, which you then pass to clf.predict to obtain predictions. clf.predict outputs the most probable class. In your case 0 is presumably negative and 1 positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
The LinearSVC model is optimized using an iterative algorithm. There is an argument max_iter (1000 by default) that controls the maximum number of iterations. If the stopping criterion isn't met within that many iterations, you will get a ConvergenceWarning. It shouldn't bother you much as long as you have acceptable performance in terms of accuracy or other metrics.
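If the warning bothers you, a common remedy (a sketch, reusing the imports and variables from the question's code) is simply to raise max_iter on the classifier step of the pipeline:
from sklearn.svm import LinearSVC
# same pipeline as in the question, only the classifier step changes
clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word", tokenizer=word_tokenize)),
    ('classifier', LinearSVC(max_iter=10000)),  # default max_iter is 1000
])
clf.fit(train_x, train_y)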

The proper way of using IsolationForest to detect outliers in a high-dimensional dataset

I use the simple IsolationForest algorithm to detect the outliers of a given dataset X of 20K samples and 16 features. I run the following:
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=.8)
clf = IsolationForest()
clf.fit(X)  # Notice I am using the entire dataset X when fitting!!
print(clf.predict(X))
I get the result:
[ 1 1 1 -1 ... 1 1 1 -1 1]
The question is: is it logically correct to use the entire dataset X when fitting the IsolationForest, or only train_X?
Yes, it is logically correct to ultimately train on the entire dataset.
With that in mind, you could measure the test set performance against the training set's performance. This could tell you if the test set is from a similar distribution as your training set.
If the test set scores as anomalous compared to the training set, then you can expect future data to be similarly different. In that case, I would want more data to get a more complete view of what is 'normal'.
If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
Perhaps you could use sklearn TimeSeriesSplit CV in this fashion to get a sense for how much data is enough for your problem?
Since this is unlabeled data to the anomaly detector, the more data the better when defining 'normal'.
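A sketch of that train-versus-test comparison, with random data standing in for the real 20K x 16 matrix (score_samples returns higher values for more 'normal' points):
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

X = np.random.rand(20000, 16)   # stand-in for the real dataset
train_X, test_X = train_test_split(X, train_size=0.8)

clf = IsolationForest(random_state=0).fit(train_X)
print(clf.score_samples(train_X).mean(), clf.score_samples(test_X).mean())

# if the two score distributions look similar, refit on everything for the final model
final_clf = IsolationForest(random_state=0).fit(X)
print(final_clf.predict(X))   # 1 = inlier, -1 = outlier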
