RandomCutForest hyperparameter value limit is too small in Sagemaker

I'm trying to use RandomCutForest in Sagemaker with the data as below:
Number of rows: 420000
Feature dimension: 30
The problem is that RandomCutForest hyperparameters have the following restrictions (https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html).
num_samples_per_tree: min: 1, max: 2048
num_trees: min: 50, max: 1000
I think RandomCutForest is not suitable for a large dataset like the one described above because of these hyperparameter restrictions.
Even if you set those hyperparameters to their max values, a num_samples_per_tree of 2048 is too small in comparison with 420000 rows of data.
I wonder why Sagemaker's RandomCutForest has such a restriction (due to performance issue, hardware capability or any other reason?), even though IsolationForest in sklearn has no such restrictions.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
If there's any workaround on this problem, please let me know.

Choosing an optimal value for num_samples_per_tree is dependent on your application and data set. This parameter is related to the expected density of anomalies that lie in your data set. Specifically, you should choose num_samples_per_tree such that 1/num_samples_per_tree roughly approximates the ratio of anomalous data points in your data. To illustrate this with an example, if 10 samples are used in each tree then you should expect that your data set contains an anomaly 1/10 of the time. Note that in most applications the range covered by the min and max value of this parameter should suffice to yield optimal performance of this algorithm.
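As a rough illustration of the rule above, here is a minimal sketch that turns an expected anomaly ratio into a value for num_samples_per_tree, clamped to the documented range of 1 to 2048. The function name and the example ratios are made up for illustration; they are not part of the SageMaker API.

```python
# Documented limits for num_samples_per_tree (taken from the question).
MIN_SAMPLES_PER_TREE = 1
MAX_SAMPLES_PER_TREE = 2048


def suggest_num_samples_per_tree(expected_anomaly_ratio):
    """Hypothetical helper: pick n so that 1/n roughly matches the anomaly ratio."""
    if not 0 < expected_anomaly_ratio <= 1:
        raise ValueError("expected_anomaly_ratio must be in (0, 1]")
    raw = round(1 / expected_anomaly_ratio)
    # Clamp to the range the algorithm accepts.
    return max(MIN_SAMPLES_PER_TREE, min(MAX_SAMPLES_PER_TREE, raw))


print(suggest_num_samples_per_tree(0.1))     # 10
print(suggest_num_samples_per_tree(0.005))   # 200
print(suggest_num_samples_per_tree(0.0001))  # 2048 (clamped, 1/0.0001 would be 10000)
```

Under this rule the value depends on the anomaly ratio, not on the number of rows, and the maximum of 2048 corresponds to an anomaly ratio of roughly 0.05%.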

Related

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup where there are 3 parameters, and for each, we have only 2 choices. Together, it produces 8 points as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
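To make the scoring scheme above concrete, here is a small sketch of it. The (TPR, FPR) numbers are invented for illustration, and the code only reproduces the heuristic described in this question; it is not a claim that this is the standard approach.

```python
def parameter_score(tpr, fpr, cost_ratio=1.0):
    """Cost-weighted difference of TPR and FPR (with cost_ratio=1 this is Youden's J)."""
    return tpr - fpr * cost_ratio


# Hypothetical results: 3 binary parameters -> 8 (TPR, FPR) points.
results = {
    (0, 0, 0): (0.20, 0.05),
    (0, 0, 1): (0.45, 0.10),
    (0, 1, 0): (0.60, 0.30),
    (0, 1, 1): (0.70, 0.20),
    (1, 0, 0): (0.55, 0.15),
    (1, 0, 1): (0.85, 0.40),
    (1, 1, 0): (0.90, 0.60),
    (1, 1, 1): (0.95, 0.80),
}

scores = {params: parameter_score(tpr, fpr) for params, (tpr, fpr) in results.items()}

# Optimal parameter set: the combination with the highest score.
best_params = max(scores, key=scores.get)
# Overall performance: the average of all parameter scores.
overall = sum(scores.values()) / len(scores)

print(best_params)        # (0, 1, 1)
print(f"{overall:.3f}")   # 0.325
```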
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to determine the performance, the ones mentioned in this question are definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) Provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, (2) Provide a reference that claims the suggested methods to be a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum, but didn't get any responses, in fact, got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any tuning of a model, it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another portion of the data, so GridSearchCV is best for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about what parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example, though you can choose from many default metrics in scikit-learn or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for all possible combinations of parameters that I have given it, using 5-fold cross-validation to evaluate how well those parameters perform in terms of ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
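Now at this point you may want to retrain your classifier on all of the training data, since during the search each candidate was fit only on cross-validation folds. (With GridSearchCV's default refit=True, best_estimator_ has already been refit on the full training set, so the explicit call below is only needed when refit=False.) Some people prefer not to, but I'm a retrainer!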
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
             precision    recall  f1-score   support
          0       1.00      1.00      1.00      1707
          1       1.00      1.00      1.00      1793
avg / total       1.00      1.00      1.00      3500
             precision    recall  f1-score   support
          0       0.51      0.46      0.48       780
          1       0.47      0.52      0.50       720
avg / total       0.49      0.49      0.49      1500
We can see from the poor score on the test set that this model has overfit. But this is not surprising, as the data is just random noise! Hopefully, when performing these methods on data with a signal, you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feeeeel like we're talking about Grid Search in scikit-learn. It (1) provides methods to evaluate optimal (hyper)parameters and (2) is implemented in a massively popular and well-referenced statistical software package.

Parameter tuning/model selection using resampling

I have been trying to get into more details of resampling methods and implemented them on a small data set of 1000 rows. The data was split into 800 training set and 200 validation set. I used K-fold cross validation and repeated K-fold cross validation to train the KNN using the training set. Based on my understanding I have done some interpretations of the results - however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
1. While selecting the parameter, k = 9 is the optimal value for highest accuracy. However, I don't understand how to take Kappa into consideration when finally choosing the parameter value.
2. The number of repeats has to be increased until we get a stabilised result. The accuracy changes when the repeats are increased from 10 to 1000; however, the results are similar for 1000 and 2000 repeats. Is it right to consider the results of 1000/2000 repeats a stabilised performance estimate?
3. Is there any rule of thumb for the number of repeats?
4. Finally, should I train the model on my complete training data (800 rows) and then test the accuracy on the validation set?
1. Accuracy and Kappa are just different classification performance metrics. In a nutshell, their difference is that Accuracy does not take possible class imbalance into account when calculating the metric, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the metric argument of caret::train.
2. You would see a similar effect of slightly different performance results when running e.g. the 10-fold CV with 10 repeats multiple times: you will just get slightly different results each time. Something you should look out for is the variance of classification performance over your partitions and repeats. If you obtain a small variance, you can conclude that by training on all your data you will likely obtain a model that gives similar (hence stable) results on new data. But if you obtain a huge variance, you can conclude that just by chance (being lucky or unlucky) you might instead obtain a model that gives either rather good or rather bad performance on new data. BTW: the prediction performance variance is something that e.g. R's caret::train will give you automatically, hence I'd advise using it.
3. See above: look at the variance and increase the repeats until you can e.g. repeat the whole process and obtain a similar average performance and variance of performance.
4. Yes. CV and resampling methods exist to give you information about how well your model will perform on new data. So, after performing CV and resampling and obtaining this information, you will usually use all your data (this includes both the train and test partitions!) to train a final model that you then use in e.g. your application scenario.
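To make the difference between Accuracy and Kappa described above concrete, here is a minimal pure-Python sketch (the question uses R caret, so this is only an illustration of the two formulas). The data is hypothetical: 90 negatives, 10 positives, and a "classifier" that always predicts the majority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), i.e. agreement corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_o = accuracy(y_true, y_pred)
    # Agreement expected by chance, from the marginal label frequencies.
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical imbalanced data and a trivial majority-class predictor.
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100

print(accuracy(y_true, always_negative))     # 0.9
print(cohen_kappa(y_true, always_negative))  # 0.0
```

The trivial predictor looks good by Accuracy but gets a Kappa of zero, which is why Kappa can be the more informative metric when the classes are imbalanced.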

Normalizing feature values for SVM

I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 - 5.
0.02 - 0.05
10-15.
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets, I stumble upon values as high as 7? Then, in the converted range, it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has "seen" during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization is the method most often used in the machine learning field. Please note that when your test data comes in, it makes more sense to use the mean and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is often reasonable to assume they roughly obey the normal distribution, so the probability that new test data is out of range won't be that high. Refer to this post for more details.
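A minimal sketch of that idea, using only the standard library: the mean and standard deviation are estimated on the training values and then reused unchanged for new data. The feature values and helper names are made up for illustration.

```python
from statistics import mean, pstdev


def fit_standardizer(train_values):
    """Estimate mean and standard deviation from the training values only."""
    return mean(train_values), pstdev(train_values)


def standardize(values, mu, sigma):
    """Apply the training statistics to any values, seen or unseen."""
    return [(v - mu) / sigma for v in values]


# Hypothetical training values for feature 1 (range 3 to 5, as in the question).
train_feature = [3.0, 3.5, 4.0, 4.5, 5.0]
mu, sigma = fit_standardizer(train_feature)

# New data "in the wild", including a value (7) never seen during training.
z = standardize([4.0, 7.0], mu, sigma)
print(z)
```

The unseen value 7 simply maps to a larger z-score (about 4.2 here) instead of breaking out of a fixed [0, 1] range.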
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200], it will normalise to the same unit vector as above. The magnitudes of the features are "cancelled out" by the normalisation, and we are left with relative values between 0 and 1.
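The arithmetic above can be checked with a few lines of Python. This is only a sketch of the unit-vector normalisation described in this answer; the function name is made up.

```python
import math


def to_unit_vector(vec):
    """Divide each component by the Euclidean length of the vector."""
    length = math.sqrt(sum(x * x for x in vec))
    if length == 0:
        raise ValueError("cannot normalise a zero vector")
    return [x / length for x in vec]


a = to_unit_vector([4, 0.02, 12])
b = to_unit_vector([400, 2, 1200])

print([round(x, 4) for x in a])                       # [0.3162, 0.0016, 0.9487]
print(all(math.isclose(x, y) for x, y in zip(a, b)))  # True
```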

svm scaling input values

I am using libSVM.
Say my feature values are in the following format:
instance1 : f11, f12, f13, f14
instance2 : f21, f22, f23, f24
instance3 : f31, f32, f33, f34
instance4 : f41, f42, f43, f44
..............................
instanceN : fN1, fN2, fN3, fN4
I think there are two scalings that can be applied:
(1) Scale each instance vector such that each vector has zero mean and unit variance.
((f11, f12, f13, f14) - mean((f11, f12, f13, f14))) ./ std((f11, f12, f13, f14))
(2) Scale each column of the above matrix to a range, for example [-1, 1].
According to my experiments with the RBF kernel (libSVM), the second scaling (2) improves the results by about 10%. I do not understand why (2) gives me improved results.
Could anybody explain the reason for applying scaling and why the second option gives me improved results?
The standard thing to do is to make each dimension (or attribute, or column (in your example)) have zero mean and unit variance.
This brings each dimension of the SVM into the same magnitude. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:
The main advantage of scaling is to avoid attributes in greater numeric ranges dominating those in smaller numeric ranges. Another advantage is to avoid numerical difficulties during the calculation. Because kernel values usually depend on the inner products of feature vectors, e.g. the linear kernel and the polynomial kernel, large attribute values might cause numerical problems. We recommend linearly scaling each attribute to the range [-1,+1] or [0,1].
I believe that it comes down to your original data a lot.
If your original data has SOME extreme values for some columns, then in my opinion you lose some definition when scaling linearly, for example in the range [-1,1].
Let's say that you have a column where 90% of values are between 100-500 and in the remaining 10% the values are as low as -2000 and as high as +2500.
If you scale this data linearly, then you'll have:
-2000 -> -1 ## <- The min in your scaled data
+2500 -> +1 ## <- The max in your scaled data
100 -> -0.06666666666666665
234 -> -0.007111111111111068
500 -> 0.11111111111111116
You could argue that the discernibility between what was originally 100 and 500 is smaller in the scaled data in comparison to what it was in the original data.
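The numbers above can be reproduced with a short sketch of linear (min-max) scaling to [-1, 1]. The function name is made up, and the column contains only the five values used in the example.

```python
def scale_linear(values, lo=-1.0, hi=1.0):
    """Map min(values) to lo and max(values) to hi, linearly in between."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    if span == 0:
        raise ValueError("all values are identical, nothing to scale")
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


column = [-2000, 100, 234, 500, 2500]
scaled = scale_linear(column)

for original, value in zip(column, scaled):
    print(f"{original:>6} -> {value:+.4f}")
# -2000 -> -1.0000
#    100 -> -0.0667
#    234 -> -0.0071
#    500 -> +0.1111
#   2500 -> +1.0000
```

The band from 100 to 500, which holds 90% of the values in this example, ends up occupying only about 9% of the scaled range (roughly 0.18 out of 2).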
In the end, I believe it very much comes down to the specifics of your data, and I believe the 10% improved performance is very coincidental; you will certainly not see a difference of this magnitude in every dataset you try both scaling methods on.
At the same time, in the paper in the link listed in the other answer, you can clearly see that the authors recommend data to be scaled linearly.
I hope someone finds this useful!
The accepted answer speaks of "Standard Scaling", which is not efficient for high-dimensional data stored in sparse matrices (text data is a use case); in such cases, you may resort to "Max Scaling" and its variants, which work with sparse matrices.
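A small sketch of why this matters, assuming "Max Scaling" here means dividing each column by its largest absolute value (the idea behind scikit-learn's MaxAbsScaler). The column is hypothetical and stored as a plain list for simplicity; the point is only which entries stay zero.

```python
from statistics import mean, pstdev


def max_abs_scale(column):
    """Divide by the largest absolute value, so zeros stay exactly zero."""
    peak = max(abs(v) for v in column)
    if peak == 0:
        return list(column)
    return [v / peak for v in column]


def standard_scale(column):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def count_zeros(column):
    return sum(1 for v in column if v == 0)


# Hypothetical mostly-zero column, as in a bag-of-words feature.
sparse_column = [0, 0, 0, 4, 0, 0, -2, 0]

print(count_zeros(max_abs_scale(sparse_column)))   # 6: zeros are preserved
print(count_zeros(standard_scale(sparse_column)))  # 0: centering makes every entry non-zero
```

Because centering turns every zero into a non-zero value, standard scaling would force a sparse matrix to become dense, while max-abs scaling keeps the sparsity pattern intact.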

scikit-learn RandomForestClassifier produces 'unexpected' results

I'm trying to use sk-learn's RandomForestClassifier for a binary classification task (positive and negative examples). My training data contains 1.177.245 examples with 40 features, in SVM-light format (sparse vectors), which I load using load_svmlight_file from sklearn.datasets. It produces a sparse matrix of 'feature values' (1.177.245 * 40) and one array of 'target classes' (1s and 0s, 1.177.245 of them). I don't know whether this is worrisome, but the training data has 3552 positives and the rest are all negative.
As sk-learn's RFC doesn't accept sparse matrices, I convert the sparse matrix to a dense array (if I'm saying that right? Lots of 0s for absent features) using .toarray(). I print the matrix before and after converting to arrays and that seems to be going all right.
When I initiate the classifier and start fitting it to the data, it takes this long:
[Parallel(n_jobs=40)]: Done 1 out of 40 | elapsed: 24.7min remaining: 963.3min
[Parallel(n_jobs=40)]: Done 40 out of 40 | elapsed: 27.2min finished
(is that output right? Those 963 minutes take about 2 and a half...)
I then dump it using joblib.dump.
When I re-load it:
RandomForestClassifier: RandomForestClassifier(bootstrap=True, compute_importances=True,
criterion=gini, max_depth=None, max_features=auto,
min_density=0.1, min_samples_leaf=1, min_samples_split=1,
n_estimators=1500, n_jobs=40, oob_score=False,
random_state=<mtrand.RandomState object at 0x2b2d076fa300>,
verbose=1)
And test it on real test data (consisting of 750.709 examples, in exactly the same format as the training data), I get "unexpected" results. To be exact: only one of the examples in the test data is classified as true. When I train on half the initial training data and test on the other half, I get no positives at all.
Now I have no reason to believe anything is wrong with what's happening, it's just that I get weird results, and furthermore I think it's all done awfully quick. It's probably impossible to make a comparison, but training a RFClassifier on the same data using rt-rank (also with 1500 iterations, but with half the cores) takes over 12 hours...
Can anyone enlighten me whether I have any reason to believe something is not working the way it's supposed to? Could it be the ratio of positives to negatives in the training data? Cheers.
Indeed, this dataset is very, very imbalanced. I would advise you to subsample the negative examples (e.g. pick n_positive_samples of them at random) or to oversample the positive examples (the latter is more expensive but might yield better models).
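A minimal sketch of the subsampling suggestion, using only the standard library. The helper name, the seed and the toy data are made up; it assumes labels are 1 for positive and 0 for negative, as in the question.

```python
import random


def undersample_negatives(X, y, seed=0):
    """Keep all positives and an equally sized random sample of negatives."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep = pos + rng.sample(neg, min(len(pos), len(neg)))
    rng.shuffle(keep)  # avoid having all positives first
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 5 positives among 1000 rows.
y = [1] * 5 + [0] * 995
X = [[float(i)] for i in range(1000)]

X_bal, y_bal = undersample_negatives(X, y)
print(len(y_bal), sum(y_bal))  # 10 5
```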
Also, are you sure that all your features are numerical features (where larger values mean something in real life)? If some of them are categorical integer markers, those features should be exploded into one-of-k boolean encodings instead, as scikit-learn's implementation of random forests cannot directly deal with categorical data.
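For illustration, a one-of-k encoding of a single categorical integer column could look like the sketch below. The helper and the category values are hypothetical; in practice a library encoder would normally be used instead.

```python
def one_hot(values):
    """Turn a list of categorical markers into one boolean column per category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = [
        [1 if index[v] == j else 0 for j in range(len(categories))]
        for v in values
    ]
    return categories, rows


# Hypothetical column where 1, 3 and 7 are category codes, not magnitudes.
categories, encoded = one_hot([3, 1, 3, 7])
print(categories)  # [1, 3, 7]
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```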
