How does RandomizedSearchCV decide what the best parameters are? - machine-learning

I understand that there is really no "best model" because being the best depends on what evaluation metrics you want to have the best values on. So my question is, what is the metric that RandomizedSearchCV uses to decide which are the best parameters?

I assume you are referring to scikit-learn's RandomizedSearchCV. By default it ranks the candidates using the given estimator's score method, and you can change that with the scoring parameter.
From Documentation:
scoring str, callable, list/tuple or dict, default=None.
A single str (see The scoring parameter: defining model evaluation rules) or a callable (see Defining your scoring strategy from metric functions) to evaluate the predictions on the test set.
For evaluating multiple metrics, either give a list of (unique) strings or a dict with names as keys and callables as values.
NOTE that when using custom scorers, each scorer should return a single value. Metric functions returning a list/array of values can be wrapped into multiple scorers that return one value each.
See Specifying multiple metrics for evaluation for an example.
If None, the estimator’s score method is used.
Scikit-learn's default score for a classifier is accuracy, and for a regressor it is the R² score.
For example, you can see that for LinearRegression it is the R² score - see here.
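To make the difference concrete, here is a minimal sketch that runs the same search twice, once with the default scoring and once with scoring="roc_auc". The synthetic data, the random forest, and the parameter ranges are assumptions picked only for illustration; they are not from the question.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification data (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 6),
}

# scoring=None (the default): candidates are ranked by the estimator's own
# score method, which is mean accuracy for a classifier.
default_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
default_search.fit(X, y)

# scoring="roc_auc": candidates are ranked by cross-validated ROC AUC instead.
auc_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
auc_search.fit(X, y)

print("best params by accuracy:", default_search.best_params_)
print("best params by ROC AUC: ", auc_search.best_params_)

# With the default scoring, search.score is the classifier's accuracy.
same = default_search.score(X, y) == accuracy_score(y, default_search.predict(X))
print("default score equals accuracy:", same)
```

The two searches may or may not agree on the best parameters; which one is "best" depends only on the metric you asked for.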

Related

Hyperparameter optimisation in Python with a separate validation set

I am trying to optimise the hyper parameters of a random forest regressor in Python.
I have 3 separate datasets: train/validate/test. Therefore, rather than using a cross validation method I want to use the specific validation set to tune the hyperparameters, i.e. the "First Approach" described in this stackoverflow post.
Now, sklearn has some nice inbuilt methods for hyperparameter optimisation using cross validation (e.g. this tutorial), but what about if I want to tune my hyperparameters with a specific validation set? Is it still possible to use a method like RandomizedSearchCV?
It is indeed possible with the cv option. As the documentation suggests, one of the possible inputs is an iterable of train/test index tuples:
An iterable yielding (train, test) splits as arrays of indices.
So, a list of size one with train and validation indices packed as a tuple would be ok.
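A minimal sketch of that idea, assuming synthetic regression data in place of your real train/validation sets (the variable names and parameter ranges are made up for illustration). The two sets are stacked into one array and the single split is described by row indices:

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical data standing in for the separate train and validation sets.
X, y = make_regression(n_samples=200, n_features=6, noise=0.5, random_state=0)
X_train, y_train = X[:150], y[:150]
X_val, y_val = X[150:], y[150:]

# Stack both sets and describe the one split by row indices.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
train_idx = np.arange(len(X_train))
val_idx = np.arange(len(X_train), len(X_all))

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": randint(10, 50), "max_depth": randint(2, 8)},
    n_iter=5,
    cv=[(train_idx, val_idx)],  # a list of size one: fit on train, score on validation
    refit=False,                # skip the final refit on train + validation combined
    random_state=0,
)
search.fit(X_all, y_all)

print("best params:", search.best_params_)
print("validation score (R^2):", search.best_score_)
```

Setting refit=False is a design choice here: with the default refit=True the best candidate would be refitted on all rows passed to fit, i.e. train plus validation. sklearn.model_selection.PredefinedSplit is another way to express the same single split.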
I think we should just have some wording clarified:
'Validation set'
A validation set is used to evaluate your model on unseen data, i.e. data not used for training. This simulates how your model would behave on new data. We use the validation set to tune hyperparameters such as the number of trees, the maximum depth, etc., and choose the hyperparameters that work best on the validation set.
'Cross-validate'
When you cross-validate (CV) with, say, 5 folds, you divide your data into 5 sets, where sets [1,2,3,4] are used for training and set 5 is used for validation. Then you use sets [2,3,4,5] for training and set 1 for validation. You repeat this until every set has been used as the validation set (i.e. 5 times for 5-fold CV), and then you average the 5 validation scores (e.g. accuracy) to get one score, which you (usually) want to maximize.
Answer
So, to answer your question: yes, you can use GridSearchCV with your validation set, but that is not the usual approach. You would typically do one of the following:
a) Use one validation set to tune your hyperparameters against, as explained in "Validation set"
b) Use all your data, i.e. train+validation, as one data set and then run, say, a 5-fold grid search with CV as explained in "Cross-validate"

How to use ordered categorical variables in building ML models?

I am trying to build a logistic regression model, and many of my features are ordered categorical variables. I think dummy variables may not be useful, as they treat each category with equal weight. So, do I need to treat an ordered categorical variable like a numerical one?
Thanks in advance.
Ordered categorical values are called "ordinal" attributes in data mining: one value is less than or greater than another. You can treat these values either as nominal values or as continuous values (numbers).
Some of the pros and cons of treating them as numbers (continuous) are:
Pros:
This gives you a lot of flexibility in your choice of analysis and
preserves the information in the ordering. More importantly to many
analysts, it allows you to analyze the data easily.
Cons:
This approach requires the assumption that the numerical distance
between each pair of consecutive categories is equal. If that does not
hold, you can, depending on the domain, make some intervals larger.
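As a rough sketch of the "treat them as numbers" option: the feature name, its three levels, the target, and the custom spacing below are all hypothetical, chosen only to show equally spaced codes next to domain-specific ones.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature ("education") and binary target.
education = np.array(
    ["low", "medium", "high", "medium", "low", "high", "high", "low"]
).reshape(-1, 1)
target = np.array([0, 0, 1, 1, 0, 1, 1, 0])

# Option 1: equally spaced codes (low=0, medium=1, high=2).
# The explicit category list fixes the order instead of sorting alphabetically.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
equal_codes = encoder.fit_transform(education)

# Option 2: unequal spacing, for when the gap between "medium" and "high"
# is believed to be larger than the gap between "low" and "medium".
custom_spacing = {"low": 0.0, "medium": 1.0, "high": 3.0}
custom_codes = np.array([[custom_spacing[v]] for v in education.ravel()])

# Either numeric column can be fed to logistic regression as a single feature.
model = LogisticRegression().fit(equal_codes, target)
print("equal codes: ", equal_codes.ravel())
print("custom codes:", custom_codes.ravel())
print("coefficient on the ordinal feature:", model.coef_[0][0])
```

With a single numeric column the model learns one coefficient for the whole ordering, whereas dummy variables would give one coefficient per category and ignore the order.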

Why does K-fold cross validation build K+1 models?

I have read the general steps for K-fold cross validation at
https://machinelearningmastery.com/k-fold-cross-validation/
It describes the general procedure as follows:
1. Shuffle the dataset randomly.
2. Split the dataset into k groups (folds)
3. For each unique group:
   a) Take the group as a hold out or test data set
   b) Take the remaining groups as a training data set
   c) Fit a model on the training set and evaluate it on the test set
   d) Retain the evaluation score and discard the model
4. Summarize the skill of the model using the sample of model evaluation scores
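The steps above can be sketched roughly as follows. The synthetic data and the logistic regression model are assumptions made for illustration; any estimator would do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data (illustrative assumption).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

k = 5
# Shuffle the dataset and split it into k folds.
kfold = KFold(n_splits=k, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kfold.split(X):
    # A fresh model per fold: fit on the k-1 training folds ...
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    # ... evaluate on the held-out fold and keep only the score.
    scores.append(model.score(X[test_idx], y[test_idx]))

# Summarize the k evaluation scores.
print("models fitted:", len(scores))
print("mean accuracy:", np.mean(scores), "std:", np.std(scores))
```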
So if it is K-fold, then K models will be built, right? But why does the following link from H2O say that it builds K+1 models?
https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/gbm/gbmTuning.ipynb
Arguably, "I read somewhere else" is too vague a statement (where?), because context does matter.
Most probably, such statements refer to some libraries which, by default, after finishing the CV proper procedure, go on to build a model on the whole training data using the hyperparameters found by CV to give best performance; see for example the relevant train function of the caret R package, which, apart from performing CV (if requested), returns also the finalModel:
finalModel
A fit object using the best parameters
Similarly, scikit-learn GridSearchCV has also a relevant parameter refit:
refit : boolean, or string, default=True
Refit an estimator using the best found parameters on the whole dataset.
[...]
The refitted estimator is made available at the best_estimator_ attribute and permits using predict directly on this GridSearchCV instance.
But even then, the number of models fitted is almost never just K+1: when you use CV in practice for hyperparameter tuning (and keep in mind that there are other uses for CV, too), you will end up fitting m*K models, where m is the number of hyperparameter combinations in your search set (all K folds in a single round are run with one single set of hyperparameters).
In other words, if your hyperparameter search grid consists of, say, 3 values for the no. of trees and 2 values for the tree depth, you will fit 2*3*K = 6*K models during the CV procedure, and possibly +1 for fitting your model at the end to the whole data with the best hyperparameters found.
So, to summarize:
By definition, each K-fold CV procedure consists of fitting just K models, one for each fold, with fixed hyperparameters across all folds
In case of CV for hyperparameter search, this procedure will be repeated for each hyperparameter combination of the search grid, leading to m*K fits
Having found the best hyperparameters, you may want to use them for fitting the final model, i.e. 1 more fit
leading to a total of m*K + 1 model fits.
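If you want to check the m*K + 1 count yourself, here is a small sketch. CountingForest is a hypothetical helper class written only for this illustration (it is not part of scikit-learn); it counts how many times fit is called during a GridSearchCV run with 3 values for the number of trees and 2 for the depth.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid


class CountingForest(RandomForestClassifier):
    """Random forest that counts every call to fit (illustration only)."""

    n_fits = 0  # shared across all clones created by the search

    def fit(self, X, y, **kwargs):
        CountingForest.n_fits += 1
        return super().fit(X, y, **kwargs)


# Synthetic data (illustrative assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {"n_estimators": [5, 10, 20], "max_depth": [2, 4]}
K = 5

# n_jobs is left at its default so all fits run in this process and the
# class-level counter sees every one of them.
search = GridSearchCV(CountingForest(random_state=0), param_grid, cv=K, refit=True)
search.fit(X, y)

m = len(ParameterGrid(param_grid))  # number of hyperparameter combinations
print("m =", m, "K =", K)
print("fits counted:", CountingForest.n_fits, "expected m*K + 1 =", m * K + 1)
```

With refit=False the final "+1" fit is skipped and only the m*K cross-validation fits remain.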
Hope this helps...

What splitting criterion does Random Tree in Weka 3.7.11 use for numerical attributes?

I'm using RandomForest from Weka 3.7.11 which in turn is bagging Weka's RandomTree. My input attributes are numerical and the output attribute(label) is also numerical.
When training the RandomTree, K attributes are chosen at random for each node of the tree. Several splits based on those attributes are attempted and the "best" one is chosen. How does Weka determine what split is best in this (numerical) case?
For nominal attributes I believe Weka is using the information gain criterion which is based on conditional entropy.
IG(T|a) = H(T) - H(T|a)
Is something similar used for numerical attributes? Maybe differential entropy?
When a tree splits on a numerical attribute, it splits on a condition like a>5. This condition effectively becomes a binary variable, so the criterion (information gain) is exactly the same.
P.S. For regression, the commonly used criterion is the sum of squared errors (computed for each leaf, then summed over leaves). But I do not know the specifics for Weka.
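To illustrate both criteria, here is a generic sketch. It is not Weka's implementation; the toy data and the choice of midpoints as candidate thresholds are assumptions. It scores each threshold on a numerical attribute a by information gain for class labels, and by the sum of squared errors for a numerical target.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy H(T) of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(values, labels, threshold):
    """IG(T|a) = H(T) - H(T|a) for the binary condition values > threshold."""
    mask = values > threshold
    n = len(labels)
    h_conditional = 0.0
    for side in (labels[mask], labels[~mask]):
        if len(side) > 0:
            h_conditional += len(side) / n * entropy(side)
    return entropy(labels) - h_conditional


def sse_after_split(values, targets, threshold):
    """Sum of squared errors around each side's mean, summed over both sides."""
    mask = values > threshold
    total = 0.0
    for side in (targets[mask], targets[~mask]):
        if len(side) > 0:
            total += float(((side - side.mean()) ** 2).sum())
    return total


# Toy data (illustrative assumption).
a = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 1, 1, 1])                   # classification target
targets = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])      # regression target

# Candidate thresholds: midpoints between consecutive sorted values.
sorted_a = np.sort(a)
candidates = (sorted_a[:-1] + sorted_a[1:]) / 2

best_by_ig = max(candidates, key=lambda t: information_gain(a, labels, t))
best_by_sse = min(candidates, key=lambda t: sse_after_split(a, targets, t))
print("best threshold by information gain:", best_by_ig)
print("best threshold by SSE:", best_by_sse)
```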

How to deal with missing attribute values in C4.5 (J48) decision tree?

What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification.
If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature?
Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. During classification time, if I am trying to classify a new instance, what value do I put for features that have missing values? How would I descend the tree past a decision node for which I have an unknown value?
Would using Naive Bayes be better for handling missing values? I would just assign a very small non-zero probability for them, right?
From Pedro Domingos' ML course at the University of Washington:
Here are three approaches that Pedro suggests for a missing value of A:
Assign most common value of A among other examples sorted to node n
Assign most common value of A among other examples with same target value
Assign probability p_i to each possible value v_i of A; Assign fraction p_i of example to each descendant in tree.
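For the question of how to descend past a decision node with an unknown value, the third approach can be sketched as below. This is only an illustration of the fractional idea applied at classification time, not Weka's J48 code. The toy tree, its attribute, the branch weights p_i, and the leaf distributions are all made up.

```python
# Hypothetical toy tree. "weights" holds p_i, the fraction of training
# examples that went down each branch of the node; each leaf stores a
# class distribution.
tree = {
    "attribute": "outlook",
    "weights": {"sunny": 0.4, "rainy": 0.6},
    "branches": {
        "sunny": {"leaf": {"yes": 0.2, "no": 0.8}},
        "rainy": {"leaf": {"yes": 0.9, "no": 0.1}},
    },
}


def class_distribution(node, instance):
    """Return class probabilities for an instance, splitting it on missing values."""
    if "leaf" in node:
        return dict(node["leaf"])

    value = instance.get(node["attribute"])  # None means the value is missing
    if value is not None:
        return class_distribution(node["branches"][value], instance)

    # Missing value: send fraction p_i of the instance down every branch
    # and combine the resulting class distributions.
    combined = {}
    for branch_value, p in node["weights"].items():
        child = class_distribution(node["branches"][branch_value], instance)
        for label, prob in child.items():
            combined[label] = combined.get(label, 0.0) + p * prob
    return combined


known = class_distribution(tree, {"outlook": "sunny"})
missing = class_distribution(tree, {"outlook": None})
print("known outlook:  ", known)
print("missing outlook:", missing)
print("predicted class with missing outlook:", max(missing, key=missing.get))
```

The predicted class is simply the label with the highest combined probability, so no value has to be invented for the missing attribute.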
The slides and video are now viewable here.
An alternative approach is to leave the missing value as '?' and not use it in the information gain calculation. No node should have an unknown value during classification, because you ignored it during the information gain step. For classifying, I believe you simply treat the missing value as unknown and do not delete it during classification on that specific attribute.

Resources