Prediction using sklearn's RandomForestRegressor - machine-learning

Probably a very dumb question so be easy on me, but here I go.
So here's what my data looks like...
date,locale,category,site,alexa_rank,sessions,user_logins
20170110,US,1,google,1,500,5000
20170110,EU,1,google,2,400,2000
20170111,US,2,facebook,2,400,2000
... and so on. This is just a toy dataset I came up with, but which resembles the original data.
I'm trying to build a model to predict how many user logins and sessions a particular site would have, using sklearn's RandomForestRegressor.
I do the usual stuff, encoding categories to labels and I've trained my model on the first eight months of the year and now I'd like to predict logins and sessions for the ninth month. I've created one model trained on logins and another one trained on sessions.
My test dataset is of the same form:
date,locale,category,site,alexa_rank,sessions,user_logins
20170910,US,1,google,1,500,5000
20170910,EU,1,google,2,400,2000
20170911,US,2,facebook,2,400,2000
Ideally I'd like to pass in the test dataset without the columns I need predicted, but RandomForestRegressor complains about the dimensions being different between the training and test set.
When I pass the test dataset in its current form, the model predicts the exact values in the sessions and user_logins columns in most cases and values with tiny variations otherwise.
I zeroed out the sessions and user_logins columns in the test data and passed it to the model but the model predicted nearly all zeroes.
Is my workflow correct? Am I using RandomForestRegressor correctly?
How am I getting so close to the actual values when my test dataset does contain actual values? Are the actual values in the test data being used in the prediction?
If the model works correctly, shouldn't I be getting the same values predicted if I zero out the columns I'm looking to predict (sessions and user_logins)?

You shouldn't pass the columns you want to predict in the test data; your workflow is not correct.
If X is the set of columns with the information you have,
and y is the set of columns you want to predict,
then you should pass (X_train, y_train) during training (using the fit method), and only X_test during testing (using the predict method). You will obtain y_pred, which you can compare with y_test if you have it.
In your example, if you want to predict user_logins:
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Categorical columns must already be label-encoded to numbers (as you do),
# e.g. locale: US=0, EU=1 and site: google=0, facebook=1.
X_train = np.array([[20170110, 0, 1, 0, 1, 500],
                    [20170110, 1, 1, 0, 2, 400],
                    [20170111, 0, 2, 1, 2, 400]], dtype=float)
y_train = np.array([5000, 2000, 2000], dtype=float)
X_test = np.array([[20170112, 1, 2, 0, 1, 500],
                   [20170113, 0, 1, 1, 2, 400],
                   [20170114, 0, 2, 0, 1, 500]], dtype=float)

estimator = RandomForestRegressor().fit(X_train, y_train)
y_pred = estimator.predict(X_test)
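For completeness, here is a rough end-to-end sketch starting from the CSV, with one model per target; the file names and the n_estimators value are placeholders, and pandas' category codes stand in for whatever label encoding you already use:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

train = pd.read_csv('train.csv')  # first eight months (placeholder file name)
test = pd.read_csv('test.csv')    # ninth month (placeholder file name)

# Encode the categorical columns with a shared mapping so train and test agree.
for col in ['locale', 'site']:
    dtype = pd.api.types.CategoricalDtype(train[col].unique())
    train[col] = train[col].astype(dtype).cat.codes
    test[col] = test[col].astype(dtype).cat.codes

features = ['date', 'locale', 'category', 'site', 'alexa_rank']
targets = ['sessions', 'user_logins']

# One regressor per target; the test features never include the target columns.
predictions = {}
for target in targets:
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train[features], train[target])
    predictions[target] = model.predict(test[features])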
Take a look at the documentation for more examples, or at the tutorials.

Related

Is it fair enough to make model evaluation based on just "train_test_split"?

I'm absolutely confused about model evaluation, interpreting its results, and using cross_val_score. I don't understand why evaluation on a test set is usually considered a final and solid result, while if we just choose another split, we'll get a different value which could be far worse (or far better) than the previous one. Below, I'll illustrate what I'm talking about with an example, and after that I'll ask some more precise questions.
*I used a dataset from Jason Brownlee: https://github.com/jbrownlee/Datasets/blob/master/pima-indians-diabetes.data.csv
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
df = pd.read_csv(url, names=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'Target'], header=None)
X, y = df.drop('Target', axis=1), df['Target']
Here is our target distribution:
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.3,
                                                  random_state=777, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                  random_state=777, stratify=y_rest)
Checking sample's sizes:
Train size: 52.3 %
Val size: 17.6 %
Test size: 30.1 %
Tuning the hyper-parameters:
base_model = LogisticRegression(random_state=777, max_iter=2000)
params = {
    'C': np.arange(0, 5, 0.1)
}
grdSrch = GridSearchCV(base_model, params, scoring='average_precision', cv=5)
grdSrch.fit(X_val, y_val)
print(f'Best params: {grdSrch.best_params_}')
Best params: {'C': 2.5}
Training model with the best parameter and getting an average_precision_score value:
model = LogisticRegression(C=grdSrch.best_params_['C'], random_state=777, max_iter=2000)
model.fit(X_train, y_train)
y_pred_scores = model.predict_proba(X_test)[:, 1]
print(f'Avg. precision: {average_precision_score(y_test, y_pred_scores)}')
Avg. precision: 0.7067839537770597
Now, I want to be sure that this result isn't just due to some unexpectedly lucky train/test split. And I use cross_val_score for that purpose:
res_ = cross_val_score(LogisticRegression(C=grdSrch.best_params_['C'], random_state=777, max_iter=2000),
                       X,
                       y,
                       scoring='average_precision',
                       cv=StratifiedKFold(n_splits=15, shuffle=True))
print(res_)
print()
print(f'Mean score: {np.round(res_.mean(), 4)}')
Then I get:
[0.7779402 0.69200873 0.63972188 0.82368544 0.6044146 0.70668374
0.85022563 0.79848536 0.60740097 0.68802039 0.92567494 0.84554528
0.61855088 0.78731357 0.79852637]
Mean score: 0.7443
And what do we see here? We get pretty high variance among those results, plus a higher overall mean value. So, at this point I'm totally lost. My questions are:
Can we use cross_val_score on a whole dataset to assess a fairness (?) of our final evaluation result?
If we can, why do we even use train_test_split with just one score when the cross_val_score gives us more clear picture about actual scores?
If we cannot, then for what reason?
It seems like we actually don't have any "final" result for any metric, because we can always get a pool of different scores depending on the train/test split. So, how can we make real business decisions in such circumstances?
It depends on the dataset, but cross-validation is usually the preferred method because it gives your model the opportunity to be trained and tested on multiple train/test splits, which gives you a better indication of how well it will perform on unseen data. For example, with 10-fold cross-validation the dataset is split into 10 groups and the model is trained and tested 10 separate times, so each group gets a chance to be the test set; with train_test_split there is only one such split.
The train/test split (holdout) method is good to use when you have a very large dataset or when you are building an initial model in your data science project. Keep in mind that because cross-validation uses multiple train/test splits, it takes more computational power and time to run than the holdout method.
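As a rough illustration of the difference (a sketch reusing the Pima dataset and logistic regression from the question; the random_state values and fold count are arbitrary):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.metrics import average_precision_score

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
df = pd.read_csv(url, names=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'Target'], header=None)
X, y = df.drop('Target', axis=1), df['Target']
model = LogisticRegression(max_iter=2000)

# Holdout: a single number that depends entirely on which split you happened to draw.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model.fit(X_tr, y_tr)
holdout = average_precision_score(y_te, model.predict_proba(X_te)[:, 1])

# Cross-validation: a distribution of scores, so you can report a mean and a spread.
scores = cross_val_score(model, X, y, scoring='average_precision',
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print(f'holdout: {holdout:.3f}   cv: {scores.mean():.3f} +/- {scores.std():.3f}')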

Calculating Probability of a Classification Model Prediction

I have a classification task. The training data has 50 different labels. The customer wants to differentiate the low probability predictions, meaning that, I have to classify some test data as Unclassified / Other depending on the probability (certainty?) of the model.
When I test my code, the prediction result is a numpy array (I'm using different models; this one is a pre-trained BertTransformer). The prediction array doesn't contain probabilities like those returned by Keras' predict_proba() method. These are the numbers generated by the prediction method of the pretrained BertTransformer model.
[[-1.7862008 -0.7037363 0.09885322 1.5318055 2.1137428 -0.2216074
0.18905772 -0.32575375 1.0748093 -0.06001111 0.01083148 0.47495762
0.27160102 0.13852511 -0.68440574 0.6773654 -2.2712054 -0.2864312
-0.8428862 -2.1132915 -1.0157436 -1.0340284 -0.35126117 -1.0333195
9.149789 -0.21288703 0.11455813 -0.32903734 0.10503325 -0.3004114
-1.3854568 -0.01692022 -0.4388664 -0.42163098 -0.09182278 -0.28269592
-0.33082992 -1.147654 -0.6703184 0.33038092 -0.50087476 1.1643585
0.96983343 1.3400391 1.0692116 -0.7623776 -0.6083422 -0.91371405
0.10002492]]
I'm using numpy.argmax() to identify the correct label. The prediction works just fine. However, since these are not probabilities, I cannot compare the best result with a threshold value.
My question is, how can I define a threshold (say, 0.6), and then compare the probability of the argmax() element of the BertTransformer prediction array so that I can classify the prediction as "Other" if the probability is less than the threshold value?
Edit 1:
We are using 2 different models. One is Keras, and the other is BertTransformer. We have no problem with the Keras model since it gives probabilities, so I'm skipping it.
The Bert model is pretrained. Here is how it is generated:
def model(self, data):
    number_of_categories = len(data['encoded_categories'].unique())
    model = BertForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-128k-uncased",
        num_labels=number_of_categories,
        output_attentions=False,
        output_hidden_states=False,
    )
    # model.cuda()
    return model
The output given above is the result of the model.predict() method. We compared both models; Bert is slightly ahead, so we know the prediction works just fine. However, we are not sure what those numbers signify or represent.
Here is the Bert documentation.
BertForSequenceClassification returns logits, i.e., the classification scores before normalization. You can normalize the scores by calling F.softmax(output, dim=-1) where torch.nn.functional was imported as F.
With thousands of labels, the normalization can be costly and you do not need it when you are only interested in argmax. This is probably why the models return the raw scores only.
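A minimal sketch of that thresholding step, assuming output holds the raw prediction array shown above; the 0.6 threshold from the question is just an illustrative value:
import torch
import torch.nn.functional as F

logits = torch.tensor(output)        # output: raw scores from the model, shape (1, 50)
probs = F.softmax(logits, dim=-1)    # normalize the logits into probabilities that sum to 1

best_prob, best_idx = probs.max(dim=-1)
threshold = 0.6                      # illustrative; pick it on validation data
if best_prob.item() < threshold:
    label = 'Unclassified / Other'
else:
    label = best_idx.item()          # map this index back to your label names
print(label, best_prob.item())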

Train Test Valid data sets... General question about fitting the models

So I was given Xtrain, ytrain, Xtest, ytest, Xvalid, yvalid data for a HW assignment. This assignment is for a Random Forest but I think my question can apply to any/most models.
So my understanding is that you use Xtrain and ytrain to fit the model, e.g. clf.fit(Xtrain, ytrain), and this creates the model, which can then provide a score and predictions for your training data.
So when I move on to the Test and Valid data sets, I only use ytest and yvalid to see how they predict and score. My professor provided us with three X datasets (Xtrain, Xtest, Xvalid), but to me it seems I only need Xtrain to train the model initially and then test the model on the different y data sets.
If i did .fit() for each pair of X,y I would create/fit three different models from completely different data so the models are not comparable from my perspective.
Am I wrong?
Training step:
Assuming you are using sklearn, the clf.fit(Xtrain, ytrain) method enables you to train your model (clf) to best fit the training data Xtrain and labels ytrain. At this stage, you can compute a score to evaluate your model on the training data, as you said.
#train step
clf = your_classifier
clf.fit(Xtrain, ytrain)
Test step:
Then, you feed the test data Xtest to the previously trained model in order to generate new labels ypred.
#test step
ypred = clf.predict(Xtest)
Finally, you have to compare these generated labels ypred with the true labels ytest to provide a robust evaluation of the model performance on unknown data (data not used during training), with tools like a confusion matrix and other metrics:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
test_cm = confusion_matrix(ytest, ypred)
test_report = classification_report(ytest, ypred)
test_accuracy = accuracy_score(ytest, ypred)
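As for the validation pair, a common pattern (a sketch, not necessarily what your professor expects) is to use (Xvalid, yvalid) to compare candidate models or settings, and only score the final choice once on (Xtest, ytest):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Compare a few candidate settings on the validation set...
best_model, best_val_acc = None, -1.0
for n in [50, 100, 200]:
    clf = RandomForestClassifier(n_estimators=n, random_state=0)
    clf.fit(Xtrain, ytrain)
    val_acc = accuracy_score(yvalid, clf.predict(Xvalid))
    if val_acc > best_val_acc:
        best_model, best_val_acc = clf, val_acc

# ...then evaluate the chosen model once on the held-out test set.
test_acc = accuracy_score(ytest, best_model.predict(Xtest))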

Classification using H2O.ai H2O-3 Automl Algorithm on AWS SageMaker: Categorical Columns

I'm trying to train a model using H2O.ai's H2O-3 Automl Algorithm on AWS SageMaker using the console.
My model's goal is to predict if an arrest will be made based upon the year, type of crime, and location.
My data has 8 columns:
primary_type: enum
description: enum
location_description: enum
arrest: enum (true/false), this is the target column
domestic: enum (true/false)
year: number
latitude: number
longitude: number
When I use the SageMaker console on AWS and create a new training job using the H2O-3 Automl Algorithm, I specify the primary_type, description, location_description, and domestic columns as categorical.
However in the logs of the training job I always see the following two lines:
Converting specified columns to categorical values:
[]
This leads me to believe the categorical_columns attribute in the training hyperparameter is not being taken into account.
I have tried the following hyperparameters with the same output in the logs each time:
{'classification': 'true', 'categorical_columns':'primary_type,description,location_description,domestic', 'target': 'arrest'}
{'classification': 'true', 'categorical_columns':['primary_type','description','location_description','domestic'], 'target': 'arrest'}
I thought the list of categorical columns was supposed to be delimited by comma, which would then be split into a list.
I expected the list of categorical column names to be output in the logs instead of an empty list, like so:
Converting specified columns to categorical values:
['primary_type','description','location_description','domestic']
Can anyone help me figure out how to get these categorical columns to apply to the training of my model?
Also-
I think this is the code that's running when I train my model but I have yet to confirm that: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L93-L151
This seems to be a bug in the h2o3-sagemaker package. The code at https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106 shows that it reads categorical_columns directly from the hyperparameters, not nested under the training field. However, even when the categorical_columns field is moved up a level, the algorithm doesn't recognize it. So there is no solution for this.
It seems, based on the code here: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106,
that the parameter is looking for a comma-separated string, e.g. "cat,dog,bird".
I would try "primary_type,description,location_description,domestic" as the input parameter, rather than ['primary_type', 'description', ...etc.].
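For reference, this is roughly how those hyperparameters would be passed through the SageMaker Python SDK instead of the console; the image URI, role, instance type, channel name, and S3 paths are all placeholders, so treat this as a sketch rather than a verified setup:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='<account>.dkr.ecr.<region>.amazonaws.com/h2o-automl:latest',  # placeholder
    role='<your-sagemaker-execution-role>',                                  # placeholder
    instance_count=1,
    instance_type='ml.m5.xlarge',
    hyperparameters={
        'classification': 'true',
        'target': 'arrest',
        # comma-separated string, which is what the training script appears to expect
        'categorical_columns': 'primary_type,description,location_description,domestic',
    },
)
estimator.fit({'training': 's3://<bucket>/<training-data-prefix>/'})  # channel name and path are placeholders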

Logistic Regression with Non-Integer feature value

Hi, I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, especially logistic regression, they have used integer values for the features, which can be plotted in a graph. But there are so many use cases where the feature values may not be integers.
Let's consider the following example:
I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.
Name of the person, Day of the week, Number of leaves left for them till now (which may be a continuously decreasing variable), etc.
So here are my questions based on the above:
How do I go about designing the training set for my logistic regression model?
In my training set, I find some variables are continuously decreasing (e.g. number of leaves left). Would that create any problem? I ask because I know continuously increasing or decreasing variables are used in linear regression. Is that true?
Any help is really appreciated. Thanks!
Well, there is a lot of missing information in your question; for example, it would be much clearer if you had provided all the features you have, but let me dare to make some assumptions!
Classification modeling always requires numerical inputs, and you can easily map each unique input to an integer, especially the classes!
Now let me try to answer your questions:
How do I go about designing the training set for my logistic regression model?
As I see it, you have two options (not necessarily both practical; you should decide according to the dataset you have and the problem). Either you predict the probability of each employee in the company being off on a certain day according to the historical data you have (i.e. previous observations); in this case, each employee represents a class (an integer from 0 to the number of employees you want to include). Or you create a model for each employee; in this case the classes will be either off (i.e. Leave) or on (i.e. Present).
Example 1
I created an example dataset of 70 cases and 4 employees, where each name is associated with the day and month they took off, together with how many annual leaves they had left at that point.
The implementation (using Scikit-Learn) would be something like this (N.B. the date contains only day and month):
import math
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)
y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0, solver='liblinear')  # liblinear supports both 'l1' and 'l2' penalties
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})
Result
{'Ruby': 0.27545, 'Oliver': 0.15032,
'Emily': 0.28201, 'Jack': 0.29219}
N.B.
To make this work reasonably well you need a really big dataset!
Also, this can be better than the second option if there are other informative features in the dataset (e.g. the health status of the employee on that day, etc.).
The second option is to create a model for each employee; here the result would be more accurate and more reliable. However, it's almost a nightmare if you have too many employees!
For each employee, you collect all their leaves in past years and concatenate them into one file. In this case you have to complete all the days in the year; in other words, every day the employee did not take off should be labeled as on (numerically speaking, 1), and the days off should be labeled as off (numerically speaking, 0).
Obviously, in this case, the classes for each employee's model will be 0 and 1 (i.e. off and on)!
For example, consider a similar dataset (Example 2) for the particular employee Jack. Then you can do, for example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)
y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
lr = LogisticRegression(random_state=0, solver='liblinear')  # liblinear supports both 'l1' and 'l2' penalties
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
#print(clf.best_estimator_)
#print(clf.best_score_)
# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})
Result
{'On': 0.33348, 'Off': 0.66651}
N.B. In this case you have to create a dataset for each employee, train a separate model for each, and fill in all the days they never took off in past years (labeled as on)!
In my training set, I find some variables are continuously decreasing (e.g. number of leaves left). Would that create any problem? I ask because I know continuously increasing or decreasing variables are used in linear regression. Is that true?
Well, there is nothing preventing you from using continuous values as features (e.g. number of leaves left) in Logistic Regression; actually it makes no difference whether such a feature is used in Linear or Logistic Regression. I believe you are confusing the features with the response:
The point is that discrete values should be used as the response of Logistic Regression, and continuous values should be used as the response of Linear Regression (a.k.a. the dependent variable or y).
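To make that last point concrete, a toy sketch (made-up numbers) where the same continuous feature is used in both models and only the type of response differs:
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

leaves_left = np.array([[20], [15], [10], [5], [2], [1]])   # continuous feature

# Logistic Regression: discrete response (1 = took leave that day, 0 = did not)
took_leave = np.array([0, 0, 0, 1, 1, 1])
LogisticRegression().fit(leaves_left, took_leave)

# Linear Regression: continuous response (e.g. hours worked that day)
hours_worked = np.array([8.0, 8.5, 7.5, 4.0, 2.0, 0.0])
LinearRegression().fit(leaves_left, hours_worked)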
