How do you apply hypothesis testing to the features in an ML model? Let's say, for example, that I am doing a regression task and, once I have trained my model, I want to cut some features to improve performance. How do I apply hypothesis testing to decide whether a feature is useful or not? I am just a bit confused about what my null hypothesis would be, what level of significance to use, and how to run the experiment to get the p-value of a feature (I have heard that a significance level of 0.15 is a good threshold, but I am not sure).
For example: I am doing a regression task to predict the cost of my factory from the production of three machines (A, B, C). I fit a linear regression to the data and find that the p-value of machine A is greater than my significance level; hence it is not statistically significant, and I decide to drop that feature from my model.
I have taken this example from a YouTube video; the link is below.
The relevant bit runs from minute 4:00 to 7:00:
https://www.youtube.com/watch?v=HgfHefwK7VQ
I have tried reading about it, but I haven't been able to understand how he decided on that level of significance or how he applied hypothesis testing in this case.
The data looks something like this:
import pandas as pd

d = {'Cost': [44439, 43936, 44464, 41533, 46343],
     'A': [515, 929, 800, 979, 1165],
     'B': [541, 710, 675, 1147, 939],
     'C': [928, 711, 824, 758, 635]}   # all columns need the same length; 'C' trimmed to five values
df = pd.DataFrame(data=d)
After the model has been fit, the weights are as follows:
Bias (intercept): 35102
Machine A: 2.066
Machine B: 4.17
Machine C: 4.79
Now, the issue is that the p-value for Machine A is 0.23, which was considered too high, and therefore this feature was excluded from the predictive model.
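For context, I believe p-values like this come straight out of an ordinary least squares summary. Below is a sketch of how I would reproduce them with statsmodels (I am not sure this is what the video actually uses), where, as far as I understand, the null hypothesis for each coefficient is that its true value is zero:
import statsmodels.api as sm

# df is the DataFrame built above
X = sm.add_constant(df[['A', 'B', 'C']])   # add_constant supplies the bias/intercept term
ols = sm.OLS(df['Cost'], X).fit()
# Each p-value tests the null hypothesis that the true coefficient is 0,
# i.e. that the feature adds nothing once the other features are in the model.
print(ols.summary())
print(ols.pvalues)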
I am training a random forest model for the first time, and I have run into the following situation.
My accuracy on the training set, with the default parameters (as in https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), is very high, 0.95 or more, which looks a lot like overfitting. On the test set, accuracy drops to 0.66. My goal is to make the model overfit less, hoping to improve performance on the test set.
I tried to perform 5-fold cross-validation using a random grid search, as done here (https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74), with the following grid:
n_estimators = [16,32,64,128]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
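For reference, the grid above is passed to scikit-learn's RandomizedSearchCV roughly like this (a sketch following the linked tutorial's pattern; X_train and y_train are assumed to be the training split):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier(random_state=42)
search = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                            n_iter=50, cv=5, scoring='accuracy',
                            random_state=42, n_jobs=-1)
search.fit(X_train, y_train)        # X_train, y_train: the training split described above
print(search.best_params_)
print(search.best_score_)           # mean cross-validated accuracy of the best combination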
The best model had an accuracy of 0.7 across the folds.
I then used the best parameters found in step 2 on the training set and test set, but again accuracy was 0.95 on the training set and 0.66 on the test set.
Any suggestions? What do you think is going on here? How can I avoid overfitting (and maybe improve model performance)?
Over here someone had the same question and received some helpful answers:
https://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting
Your approach of using 5-fold cross-validation is already very good and could perhaps be improved by using 10-fold cross-validation instead.
Another question you can ask yourself concerns the quality of your dataset. Are your classes balanced? If they aren't, you could try to address the class imbalance, because imbalance usually introduces a bias towards the majority class.
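If the classes do turn out to be imbalanced, one simple first step in scikit-learn is the class_weight parameter of the random forest (a sketch of one option among several, such as resampling):
from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights samples inversely proportional to class frequencies,
# so the minority class contributes as much to the splits as the majority class.
clf = RandomForestClassifier(n_estimators=128, class_weight='balanced', random_state=42)
# clf.fit(X_train, y_train)   # assuming the usual train/test split from the question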
It is also possible that the dataset is simply not big enough, and increasing it could boost your performance as well.
I hope this helps a bit.
Adding this late comment in case it helps others.
In addition to the parameters mentioned above (n_estimators, max_features, max_depth, and min_samples_leaf), consider setting 'min_impurity_decrease'.
You can use 'gini' or 'entropy' for the criterion; however, I recommend sticking with 'gini', the default. In the majority of cases they produce the same result, but 'entropy' is more computationally expensive to compute.
Max depth works well and is an intuitive way to stop a tree from growing; however, just because a node is shallower than the max depth doesn't always mean it should split. If the information gained from splitting only addresses a single misclassification (or a few), then splitting that node may be supporting overfitting. You may or may not find this parameter useful, depending on the size of your dataset and the size and complexity of your feature space, but it is worth considering while tuning your parameters.
You do not describe how you split your dataset, so consider using a slightly smaller training set. Also make sure you do not have categorical variables in your feature space. If you do, use OneHotEncoder or pd.get_dummies to break those out.
I'm not sure how large your feature space is, but you may want to use a smaller subset of your features (depending on how many noise variables you have). You may also want to look at a smaller max_depth: your grid tests depths all the way up to 110, which is very large. Again, I do not know your feature space, but start at the lower end of your range and expand from there, i.e. try [5, 7, 9]; if 9 is optimal, then adjust to, say, [9, 11, 13], etc. Even a depth of 9 can cause overfitting (depending on the data), so be careful not to grow this too much. Possibly pair this with the gini criterion.
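Putting those suggestions together, here is a sketch of a more constrained forest (the specific values are just starting points for tuning, not recommendations):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=128,
    criterion='gini',             # the default; usually matches 'entropy' and is cheaper
    max_depth=7,                  # start shallow and only grow if cross-validation says so
    min_samples_leaf=4,           # larger leaves discourage splits that fix single samples
    min_impurity_decrease=1e-3,   # a split must reduce impurity by at least this much
    random_state=42)
# clf.fit(X_train, y_train)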
I have 38 variables, like oxygen, temperature, pressure, etc., and the task is to determine the total yield produced every day from these variables. When I calculate the regression coefficients and the intercept, they seem abnormal and very high (impractical). For example, the 'temperature' coefficient came out as +375.456, and I cannot give it a meaning such as "an increase of one unit in temperature increases yield by 375.456 g"; that is impractical in my scenario. However, the prediction accuracy seems right. I would like to know how to interpret this huge intercept (-5341.27355) and the huge beta values shown below. One other important point: I removed multicollinear columns, and I am not scaling/normalizing the variables, because I need the beta coefficients to have a meaning such that I can say "an increase in temperature by one unit increases yield by 10 g", or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly related to the target, so I would suggest that you have a look at simple non-linear regression techniques, such as decision trees or kernel ridge regression. These are, however, more difficult to interpret.
Going back to your issue, these high weights might well be due to a high degree of correlation between the variables, or simply to your not having very much training data.
If, instead of plain linear regression, you use lasso regression, the solution is biased away from large regression coefficients, and the fit will likely improve as well.
A small example of how to do this in scikit-learn, including cross-validation of the regularization hyper-parameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y depend linearly on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

# Cross-validate the regularization strength over 100 candidate alphas
model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)   # coefficients shrunk towards zero by the L1 penalty
If you have a linear regression, the formula looks like this (y = target, x = feature inputs):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_ and c is the intercept. As you already realized, one of your biggest coefficients is 3.319e+02 ≈ 332, and the intercept is also quite large at about -5341.
As you already mentioned, a coefficient tells you how much the target variable changes when the corresponding feature changes by one unit and all other features are held constant.
So, for your interpretation: the larger the absolute coefficient, the larger that feature's influence. But it is important to note that the model uses many large coefficients, which means it does not depend on only one variable.
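To make the formula concrete, the prediction for a single row is just this weighted sum (a sketch assuming the fitted scikit-learn model is called modl, as in the question):
import numpy as np

def predict_one(modl, x):
    # x: one row of features, in the same order and units used to fit the model
    # y_hat = c + b1*x1 + b2*x2 + ... + b38*x38
    return modl.intercept_ + np.dot(modl.coef_, x)

For a plain linear regression this should match what modl.predict(x.reshape(1, -1)) returns.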
The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup, where there are 3 parameters and, for each, we have only 2 choices. Together, they produce 8 points, as depicted on the graph. In practice, I intend to have thousands of possible combinations of hundreds of parameters, but the concept remains the same in this scaled-down case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
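In code, the scoring I am proposing is only a few lines (a sketch; the results list is made-up data standing in for the evaluated combinations):
# Each entry: (parameter_combination, true_positive_rate, false_positive_rate)
results = [({'p1': 0, 'p2': 1, 'p3': 0}, 0.81, 0.22),
           ({'p1': 1, 'p2': 1, 'p3': 0}, 0.75, 0.10),
           ({'p1': 1, 'p2': 0, 'p3': 1}, 0.90, 0.35)]
cost_ratio = 1.0   # relative cost attached to false positives

scores = [tpr - fpr * cost_ratio for _, tpr, fpr in results]
best_params = results[scores.index(max(scores))][0]   # optimal parameter set
overall_performance = sum(scores) / len(scores)       # average of all parameter scores
print(best_params, overall_performance)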
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to determine performance, the ones mentioned in this question are definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) suggest methods to evaluate the optimal parameter set and overall performance in my problem scenario, and (2) provide references showing that the suggested methods are a standard approach for this scenario.
P.S.: I first posted this question on the Cross Validated forum but didn't get any responses; in fact, it got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any tuning of a model, it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters on another portion, so GridSearchCV is well suited for this purpose.
First I'll create some data and split it into training and test sets:
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about which parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example, though you can choose from many of scikit-learn's built-in metrics or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for every possible combination of the parameters I have given it, using 5-fold cross-validation to evaluate how well each combination performs according to the ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best-performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1707
          1       1.00      1.00      1.00      1793
avg / total       1.00      1.00      1.00      3500

             precision    recall  f1-score   support

          0       0.51      0.46      0.48       780
          1       0.47      0.52      0.50       720
avg / total       0.49      0.49      0.49      1500
We can see from the poor score on the test set that this model has overfit. But that is not surprising, as the data is just random noise! Hopefully, when applying these methods to data with a real signal, you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no clear reference saying it is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on, for example using Google Scholar to search for "grid search" "SVM" "gene expression".
I feel like we're talking about grid search in scikit-learn. It (1) provides methods to evaluate optimal (hyper)parameters and (2) is implemented in a massively popular and well-referenced statistical software package.
I have time series data consisting of a vector
v=(x_1,…, x_n)
of binary categorical variables and the probabilities for four outcomes
p_1, p_2, p_3, p_4.
Given a new vector of categorical variables I want to predict the probabilities
p_1,…,p_4
The probabilities are very unbalanced with
p_1>.99 and p_2, p_3, p_4 < .01.
For example
v_1= (1,0,0,0,1,0,0,0) , p_1=.99, p_2=.005, p_3=.0035, p_4= .0015
v_2=(0,0,1,0,0,0,0,1), p_1=.99, p_2=.006, p_3=.0035, p_4= .0005
v_3=(0,1,0,0,1,1,1,0), p_1=.99, p_2=.005, p_3=.003, p_4= .002
v_4=(0,0,1,0,1,0,0,1), p_1=.99, p_2=.0075, p_3=.002, p_4= .0005
Given a new vector
v_5= (0,0,1,0,1,1,0,0)
I want to predict
p_1, p_2, p_3, p_4.
I should also note that the new vector could be identical to one of the input vectors, i.e.,
v_5=(0,0,1,0,1,0,0,1)= v_4.
My initial approach is to turn this into 4 regression problems.
The first would predict p_1, the second would predict p_2, the third would predict p_3, and the fourth would predict p_4. The problem with this is that I need
p_1+p_2+p_3+p_4=1
I'm not classifying, but should I also be worried about the unbalanced probabilities? Any ideas would be welcome.
Your suggestion of treating this as multiple separate regression problems plus a final normalization makes some sense, but it is known to be problematic in many cases (see, e.g., the problem of masking).
What you're describing here is multiclass (soft) classification, and there are many, many known techniques for doing so. You didn't specify which language/tool/library you're using, or whether you're planning on rolling your own (which only makes sense for didactic purposes). I'd suggest starting with Linear Discriminant Analysis, which is very simple to understand and implement and, despite its strong assumptions, is known to often work well in practice (see the classic book by Hastie & Tibshirani).
Irrespective of the underlying algorithm you use for soft classification (LDA or anything else), it is not very difficult to transform this aggregate input into labeled input.
Consider for example the instance
v_1= (1,0,0,0,1,0,0,0) , p_1=.99, p_2=.005, p_3=.0035, p_4= .0015
If your classifier supports instance weights, feed it 4 instances, labeled 1, 2, ..., with weights given by p_1, p_2, ..., respectively.
If it does not support instance weights, simply simulate what the law of large numbers says would happen: generate some large number n of instances from this input, and for each new instance choose a label at random in proportion to its probability.
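Here is a sketch of both variants with scikit-learn's LogisticRegression (my choice of classifier; any soft classifier with predict_proba, and sample_weight support for the first variant, would do), using the toy data from the question:
import numpy as np
from sklearn.linear_model import LogisticRegression

V = np.array([[1,0,0,0,1,0,0,0],
              [0,0,1,0,0,0,0,1],
              [0,1,0,0,1,1,1,0],
              [0,0,1,0,1,0,0,1]])
P = np.array([[.99, .005,  .0035, .0015],
              [.99, .006,  .0035, .0005],
              [.99, .005,  .003,  .002 ],
              [.99, .0075, .002,  .0005]])
n, k = P.shape

# Variant 1: the classifier supports instance weights -- repeat every vector
# once per class and use the class probability as the sample weight.
X = np.repeat(V, k, axis=0)
y = np.tile(np.arange(k), n)
w = P.ravel()
clf = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)

# Variant 2: no weight support -- simulate the law of large numbers by drawing
# many labels per vector in proportion to the given probabilities.
rng = np.random.default_rng(0)
m = 1000
Xs = np.repeat(V, m, axis=0)
ys = np.concatenate([rng.choice(k, size=m, p=p) for p in P])
clf2 = LogisticRegression(max_iter=1000).fit(Xs, ys)

# Predicted probabilities for a new vector; each row of predict_proba sums to 1.
v5 = np.array([[0, 0, 1, 0, 1, 1, 0, 0]])
print(clf.predict_proba(v5))
print(clf2.predict_proba(v5))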
This is regarding a 3-layer MLP (Input, Hidden, Output) in Ward Systems NeuroShell 2
I would prefer to split these input layer classes (PR & F) into 2 separate nets with their own hidden layers that then feed a single output layer - this would be a 3 layer network. There could be a 4 layer version using a new hidden layer to combine the 2 nets:
1) Inputs (partitioned into F and PR classes)
2) Hiddens (partitioned into F and PR classes)
3) Hiddens (fully connected "mixing" layer)
4) Output
These structures would be trained at once as opposed to training the two networks, getting the output/prediction, and then averaging those 2 numbers.
I've found that while averaging outputs works, "letting a net do it" works even better. But this requires layer partitioning, which my platform (NeuroShell 2) cannot do. And I've never read a paper where anyone attempts to do better than averaging.
FYI the ratio of PR to F inputs is 10:1.
Most discussion of nets for forecasting usually involves on the order of 10 inputs. Pattern recognition involves orders of magnitude more, hundreds to thousands and beyond.
In fact, when searching the research, the two types of problem seem virtually mutually exclusive.
So my conclusion is having both types of structure in a single network is probably a very bad idea.
Agreed?
Not a bad idea at all! In fact, this approach is a very common one that is used very frequently; you just missed out on some of the lingo.
Basically, what you're trying to do here is an ensemble prediction. The best way to approach this is to train two entirely separate nets for both halves of your problem. Then use those outputs as inputs to a new neural network.
The field is known as ensemble learning and the results are often quite good.
As far as your question about blending pattern recognition and forecasting goes, it's really impossible to make a call on that without knowing more specifics about the data you're working with, but just because people haven't tried it doesn't mean you shouldn't.
Forecasting with time series creates a sliding window of data sequences for input into the network. I don't think you should mix forecasting with classification. The output for classification can be either a softmax or a binary result, whereas the output of a forecast will be a dense output with one neuron.
https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/
[10, 20, 30, 40, 50, 60, 70, 80, 90]
X            y
10, 20, 30   40
20, 30, 40   50
30, 40, 50   60
The inputs X and targets y are then fit by the multilayer perceptron.
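For completeness, here is a sketch of how such a sliding window is usually built before fitting the MLP (the function name and window length are my own choices; the pattern follows the linked tutorial):
import numpy as np

def make_windows(series, n_steps):
    # Each input row is n_steps consecutive values; the target is the value that follows.
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        y.append(series[i + n_steps])
    return np.array(X), np.array(y)

series = [10, 20, 30, 40, 50, 60, 70, 80, 90]
X, y = make_windows(series, n_steps=3)
print(X[:3])   # [[10 20 30] [20 30 40] [30 40 50]]
print(y[:3])   # [40 50 60]

X and y in this shape can then be passed to a multilayer perceptron regressor with a single output neuron, matching the forecasting setup described above.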