I am trying to change the default cutoff of randomForest in R. I'm using the form cutoff=c(0.7,0.3) and get an "Incorrect cutoff specified" error for any value used. What is the proper way to change the cutoff?
If it's R's randomForest, the correct format for a two-class problem is:
cutoff = c(k, 1 - k)
e.g.
cutoff = c(0.7, 0.3)
The values must be positive and sum to 1; the "Incorrect cutoff specified" error typically means the vector's length does not match the number of classes, or the values violate those constraints. Note also that cutoff applies to classification only, not regression.
EDIT
The cutoff vector should have length equal to the number of classes. So if you have 3 classes you must have
cutoff = c(a, b, c)
where a + b + c = 1.
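For intuition, the package documents the rule as: the winning class for an observation is the one with the maximum ratio of its proportion of votes to its cutoff. Here is a minimal Python sketch of that rule with made-up vote proportions (illustrative only, not the R package's API):

import numpy as np

# hypothetical per-class vote proportions for one observation (3 classes)
votes = np.array([0.50, 0.30, 0.20])
# per-class cutoffs; positive values summing to 1
cutoff = np.array([0.70, 0.20, 0.10])

# randomForest's rule: predict the class maximizing votes / cutoff
winner = np.argmax(votes / cutoff)
print(winner)  # class index 2 wins here, even though class 0 has the most votes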
The expanding mean is a way to prevent overfitting when performing target encoding. What I do not understand is how to use this technique to fit on the train set and transform the test set, since it encodes the features dynamically: the encoding value for a given feature level varies row by row, because it depends on a cumulative sum.
# expanding sum of the target within each category, excluding the
# current row (hence the subtraction) to avoid leakage
cumulative_sum = training.groupby(column)["target"].cumsum() - training["target"]
cumulative_count = training.groupby(column).cumcount()
# note: the first row of each category gives 0/0 = NaN; it is common
# to fill these with the global target mean
train_new[column + "_mean_target"] = cumulative_sum / cumulative_count
Shouldn't you simply map the mean values of the target variable calculated for different categories to the corresponding categories in your test set? The cumulative means are needed only for the training part for regularization purposes.
I would also be interested to know how to compute the mean encoding for the test set.
For now, I am recomputing the mean over the train set and assigning the values to the test set.
# map each category's train-set target mean onto the test rows
test = test.merge(training.groupby(column)["target"].mean().reset_index(), on=column, how="left")
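Putting the two parts together, here is a minimal self-contained sketch (the column names and the fallback to the global mean are my own illustrative choices): the expanding mean is used only on the train set, while the test set gets plain per-category means computed from the whole train set.

import pandas as pd

train = pd.DataFrame({"cat": ["a", "a", "b", "a", "b"],
                      "target": [1, 0, 1, 1, 0]})
test = pd.DataFrame({"cat": ["a", "b", "c"]})

col = "cat"
global_mean = train["target"].mean()

# train: expanding mean per category, excluding the current row
csum = train.groupby(col)["target"].cumsum() - train["target"]
ccount = train.groupby(col).cumcount()
train[col + "_enc"] = (csum / ccount).fillna(global_mean)

# test: plain per-category means learned from the whole train set;
# categories unseen in training fall back to the global mean
means = train.groupby(col)["target"].mean()
test[col + "_enc"] = test[col].map(means).fillna(global_mean)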
When setting up a neural network, or any numeric optimization system using gradient descent, it's necessary to provide initial values for the weights (or whatever the system parameters are to be called).
One strategy is to initialize them to random values (setting the random number seed to a known value; change it for a different starting point). But this isn't always desirable: for example, right now I'm comparing accuracy in single versus double precision, and the TensorFlow random number generator outputs different values for each dtype. So I'm talking about a scenario where the initial values will be nonrandom.
Some initial value must be provided. In the absence of any information to specify a value, what should it be? The most obvious values are 0.0 and 1.0. Is there a reason to prefer one of those over the other? Or is there some other value that tends to be preferable for some reason?
As sascha observes, constant initial weights aren't a solution in general anyway, because you have to break symmetry. A better solution for the particular context in which I came across the problem is a random number generator that gives the same sequence regardless of type:
import numpy as np
import tensorflow as tf

dtype = np.float64

# Random number generator that returns the requested type and the same
# sequence regardless of type: always draw in float64, then downcast
# to float32 only if that is what was asked for.
def rnd(shape=(), **kwargs):
    if isinstance(shape, (int, float)):
        shape = (int(shape),)
    x = tf.random_normal(shape, dtype=np.float64, **kwargs)
    if dtype == np.float32:
        x = tf.to_float(x)
    return x.eval()  # requires a default session
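A minimal usage sketch, assuming TensorFlow 1.x (tf.random_normal, tf.to_float, and .eval() against a default session) and that this runs at module level next to rnd, so reassigning dtype is visible to it; the seed value is arbitrary:

with tf.Session():               # the with-block installs a default session for .eval()
    w64 = rnd((2, 3), seed=42)   # drawn in float64

    dtype = np.float32           # flip the module-level switch that rnd reads
    w32 = rnd((2, 3), seed=42)   # same draws, downcast to float32

    print(np.allclose(w64, w32)) # True, up to float32 rounding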
I am trying randomForest with the 'caret' package. When I run the basic command without providing any controls, it shows that caret used mtry=5 in the final model, i.e., it used 5 predictors.
However, my data has 4 predictors. Can anyone explain why it shows mtry=5?
Here is my code:
library(caret)
data(iris)
set.seed(100)
model.rf = train(Petal.Length~., data=iris, method="rf")
print(model.rf$finalModel)
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 0.06799251
% Var explained: 97.8
If you do not specify a tuning grid, then the model info for method = "rf" will by default use var_seq(p = ncol(x)), where in this case x is the dataset iris. If you run var_seq(ncol(iris)) it returns 2, 3, and 5. These values are used in the default grid search for the mtry parameter. This fits 3 rf models, and the one with the lowest RMSE is chosen as the final model. You can see this by just typing model.rf.
The reason you see 5 has to do with your seed. If you set the seed to 99, the chosen model will have an mtry of 3.
Of course, an mtry of 5 does not mean there is suddenly an extra variable to choose from; it just means all available variables are candidates at each split.
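For readers more familiar with Python, the same idea (a small grid over the number of candidate variables per split, scored by RMSE, keeping the best) can be sketched with scikit-learn, where the parameter is called max_features. This is an illustrative analogue, not what caret runs internally:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

iris = load_iris()
X = iris.data[:, [0, 1, 3]]   # predict petal length from the other 3 numeric columns
y = iris.data[:, 2]

# grid-search the number of candidate variables per split, like caret does for mtry
grid = GridSearchCV(RandomForestRegressor(n_estimators=500, random_state=0),
                    param_grid={"max_features": [1, 2, 3]},
                    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X, y)
print(grid.best_params_)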
@phiver, thank you for explaining var_seq. I am afraid it does not provide the full answer to my question. I found out that the following function provides the answer.
predictors(model.rf)
#[1] "Sepal.Length" "Sepal.Width" "Petal.Width"
#[4] "Speciesversicolor" "Speciesvirginica"
We see that caret replaces the categorical predictor 'Species' with 2 dummy variables. This is why we see 5 predictors, though there are 4 actual predictors for predicting Petal.Length. (I assume that you are familiar with the iris data, hence I am not giving the details of its structure.)
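For intuition, the same factor-to-dummy expansion can be reproduced in Python with pandas (purely illustrative; this is not what caret runs internally):

import pandas as pd

species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# drop_first mimics R's treatment contrasts: a 3-level factor becomes
# 2 dummy columns, with the first level (setosa) as the reference
dummies = pd.get_dummies(species, prefix="Species", drop_first=True)
print(dummies.columns.tolist())
# ['Species_versicolor', 'Species_virginica']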
Note that mtry is not the number of forests or trees in your model; it is the number of variables randomly sampled as candidates at each split.
What is the difference between an unknown value and an omitted value for an attribute in WEKA?
I learned that for a missing value we put a ? mark as the value for the corresponding attribute, and 0 for an omitted value. What is the difference?
Suppose we were to plot the data in an n-dimensional space: how would the unknown values be represented along their axes, given that they are not zero?
Unknown values are dealt with differently by each classifier. For example, some will assign the mean value of that feature to each unknown value. This way the unknown values can be plotted.
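For example, that kind of mean imputation can be sketched in a few lines of Python (illustrative only, not WEKA's internal code):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 6.0]])
# the NaN (unknown) in column 1 is replaced by that column's mean, 5.0
print(SimpleImputer(strategy="mean").fit_transform(X))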
Omitted values are only used in sparse ARFF files. These files are useful if your dataset is sparse (i.e. most values are 0). Instead of writing all the 0s in the file, you only write the non-zero values and their corresponding locations. All values that are not represented are then assumed to be 0.
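For example, the data section of a sparse ARFF file might look like this (a toy example; attribute indices are 0-based, omitted attributes are taken to be 0, and a ? still marks an unknown value):

@data
{1 4, 3 'class A'}
{0 2, 2 ?, 3 'class B'}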
Basically: if you don't know a value, you assign the unknown value ?; if a value is 0 in a sparse file, you can simply omit it.
The OpenCV documentation says: "If true and the problem is 2-class classification then the method returns the decision function value that is signed distance to the margin".
Does this mean that if the sample belongs to the class it will return a positive number, and otherwise a negative number?
I wrote a test project to check what the return value means (returnDFVal = true), and found that the return value is the signed distance of the test sample to the margin.
In a 2-class problem you have 2 labels (e.g. 1 and -1).
If the test sample belongs to the smaller label (-1), the distance is positive.
If the test sample belongs to the bigger label (1), the distance is negative.
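A small sketch of how one might check this, assuming OpenCV 2.4's old Python API (cv2.SVM, whose predict takes an optional returnDFVal flag); the toy data is made up:

import cv2
import numpy as np

# toy 2-class problem with labels -1 and 1
samples = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]],
                   dtype=np.float32)
labels = np.array([-1, 1, -1, 1], dtype=np.float32)

svm = cv2.SVM()
svm.train(samples, labels,
          params=dict(svm_type=cv2.SVM_C_SVC,
                      kernel_type=cv2.SVM_LINEAR, C=1.0))

test = np.array([0.15, 0.15], dtype=np.float32)
label = svm.predict(test)                    # predicted class label
dfval = svm.predict(test, returnDFVal=True)  # signed distance to the margin
print(label, dfval)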
Yes, but I don't know what the MAXIMUM limit (the range) of the decision function value is!
I have a 2-class problem [-1, 1] where the feature values are in the range [0, 1].
When I predict classes with OpenCV on a test set, I get various values (in absolute terms), for example 0.22, ..., 1.75, ..., 3.75 (I don't know the absolute maximum value on the decision-function scale, only the largest I observed, 3.75).
Thank you very much.