Different scaling methods affect SVM regression performance way too much - machine-learning

I have a dataset with 1000 rows and 8 columns. I am doing regression with SVM, using scikit's svm.SVR. Then I do a grid search with cross-validation on some of the parameters.
My grid looks like this:
grid = [
    {'C': [0.1, 1, 10, 50, 100, 500, 1000], 'kernel': ['linear']},
    {'C': [0.1, 1, 10, 50, 100, 500, 1000], 'gamma': [0.001, 0.0001, 'auto'],
     'kernel': ['rbf']},
    {'C': [0.1, 1, 10, 50, 100, 500, 1000], 'gamma': [0.001, 0.0001, 'auto'],
     'kernel': ['sigmoid']},
    {'C': [0.1, 1, 10, 50, 100, 500, 1000], 'gamma': [0.001, 0.0001, 'auto'],
     'kernel': ['poly'], 'degree': [2, 3, 4]},
]
I know this grid seems large, but I think trying different kernels with SVM is important.
When I scale my data to zero mean and unit variance, 2 features remain somewhat skewed and have minimum values around -14, so this may not be the best scaling method for my data.
If I use scikit's MinMaxScaler instead, the data is squeezed between -1 and +1, and the histograms of those 2 skewed features look fairly normal as well.
When I run the grid search with MinMax scaled data, I get the results within a few seconds.
When I run the grid with standard-scaled data, the cross-validation never ends and I have to stop the execution; the longest I've let it run is about 1.5 hours.
I know the kernel trick maps the dataset into a higher-dimensional space, that SVM works by computing distances (inner products) between those data points, and that finding the margin can be hard as well. But WHY would this small difference in scaling cause such a big difference in performance?
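Not from the original post, but a minimal sketch of how the two scalers can be compared fairly on the same grid, using a scikit-learn Pipeline so the scaler is fit only on the training folds. The reduced grid, the step name 'svr' (which produces the 'svr__' parameter prefixes) and the X, y variables are assumptions for illustration.

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVR

# Reduced grid for illustration; 'svr__' targets the SVR step of the pipeline.
grid = [
    {'svr__C': [0.1, 1, 10, 100], 'svr__kernel': ['linear']},
    {'svr__C': [0.1, 1, 10, 100], 'svr__gamma': [0.001, 0.0001, 'auto'],
     'svr__kernel': ['rbf']},
]

for scaler in (StandardScaler(), MinMaxScaler(feature_range=(-1, 1))):
    pipe = Pipeline([('scale', scaler), ('svr', SVR())])
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X, y)  # X, y: the 1000 x 8 dataset from the question
    print(type(scaler).__name__, search.best_params_, search.best_score_)

Timing the two loop iterations separately would also make the speed gap reproducible and easier to report.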

Related

Avoiding overfitting with random forest

I am training a random forest model for the first time and I have run into the following situation.
My accuracy on the training set, with the default parameters (as in
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html ) is very high, 0.95 or more, which looks a lot like overfitting. On the test set, accuracy drops to 0.66. My goal is to make the model overfit less, hoping to improve performance on the test set.
I tried to perform 5-fold cross-validation using a random grid search as described here ( https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74 ) with the following grid:
import numpy as np

n_estimators = [16, 32, 64, 128]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
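For context, a minimal sketch of how a grid like this is typically plugged into a randomized search with 5-fold cross-validation. The classifier, the search settings and the X_train / y_train names are assumptions for illustration, not taken from the post.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

clf = RandomForestClassifier(random_state=42)
search = RandomizedSearchCV(clf, param_distributions=random_grid,
                            n_iter=50, cv=5, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)  # X_train, y_train: the training split described above
print(search.best_params_, search.best_score_)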
The best model had an accuracy of 0.7 across the folds.
I used the best parameters selected in step 2 on the training and test sets, but again accuracy was 0.95 on the training set and 0.66 on the test set.
Any suggestions? What do you think is going on here? How can I reduce the overfitting (and maybe improve model performance)?
Over here someone had the same question and received some helpful answers:
https://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting
Your approach of using 5-fold cross-validation is already very good and could perhaps be improved by using 10-fold cross-validation instead.
Another question to ask yourself concerns the quality of your dataset. Are your classes balanced? If they aren't, you could try to address the class imbalance, because imbalance usually introduces a bias towards the majority class.
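If imbalance does turn out to be part of the problem, one low-effort check (a sketch, not part of the original answer) is to reweight the classes inversely to their frequency:

from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights samples inversely proportional to class frequencies,
# counteracting the bias towards the majority class mentioned above.
clf = RandomForestClassifier(n_estimators=128, class_weight='balanced',
                             random_state=42)
clf.fit(X_train, y_train)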
It is also possible that the dataset is simply not big enough, and collecting more data could boost performance as well.
I hope this helps a bit.
Adding this late comment in case it helps others.
In addition to the parameters mentioned above (n_estimators, max_features, max_depth, and min_samples_leaf) consider setting 'min_impurity_decrease'.
You can use 'gini' or 'entropy' for the criterion; however, I recommend sticking with 'gini', the default. In the majority of cases they produce the same result, but 'entropy' is more computationally expensive.
Max depth works well and is an intuitive way to stop a tree from growing, but just because a node is shallower than the max depth doesn't always mean it should split. If the information gained from a split only resolves a single misclassification (or a few), splitting that node may be encouraging overfitting. You may or may not find this parameter useful, depending on the size of your dataset and/or the size and complexity of your feature space, but it is worth considering while tuning your parameters.
You do not describe how you split your dataset, so consider using a slightly smaller training set. Also make sure you do not have categorical variables in your feature space. If you do, use OneHotEncoder or pd.get_dummies to break those out.
I'm not sure how large your feature space is, but you may want to use a smaller subset of your features (depending on how many noise variables you have). You may also want to look at a smaller max_depth. Your depths go all the way up to 110, which is very large. Again, I do not know your feature space, but start at the lower end of your range and expand from there, i.e. try [5, 7, 9]; if 9 is optimal, then adjust to, say, [9, 11, 13], etc. Even a depth of 9 can cause overfitting (depending on the data), so be careful not to grow this too much. Possibly pair it with the gini criterion.
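Putting these suggestions together, a hedged sketch of a more conservatively regularized forest; the specific values below are placeholders to tune via cross-validation, not recommendations from the answer.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=128,
    criterion='gini',            # the default; 'entropy' is usually no better and slower
    max_depth=9,                 # start small, grow only if CV scores improve
    min_samples_leaf=4,          # forbid leaves that fit only a handful of points
    min_impurity_decrease=1e-4,  # skip splits that barely reduce impurity
    random_state=42,
)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train), clf.score(X_test, y_test))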

Layers for predicting financial data using Tensorflow/tflearn

I'd like to predict the interest rate, and I have some relevant factors such as a stock index and the money supply number. The number of factors may be up to 200.
For example, the training data looks like this: X contains the factors and y is the interest rate I want to train on and predict.
      factor1  factor2  factor3  ...  factor176  factor177  factor178
X = [[ 2.1428   6.1557   5.4101  ...,  5.86       6.0735     6.191  ]
     [ 2.168    6.1533   5.2315  ...,  5.8185     6.0591     6.189  ]
     [ 2.125    4.7965   3.9443  ...,  5.7845     5.9873     6.1283 ] ...]

y = [[ 3.5593]
     [ 3.014 ]
     [ 2.7125] ...]
So I want to use tensorflow/tflearn to train this model but I don't really know what method exactly I should choose to do regression. I have tried LinearRegression from tflearn before, but the result is not so great.
For now, I just use the code I found online.
net = tflearn.input_data([None, 178])
net = tflearn.fully_connected(net, 64, activation='linear',
                              weight_decay=0.0005)
net = tflearn.fully_connected(net, 1, activation='linear')
net = tflearn.regression(net,
                         optimizer=tflearn.optimizers.AdaGrad(
                             learning_rate=0.01,
                             initial_accumulator_value=0.01),
                         loss='mean_square', learning_rate=0.05)
model = tflearn.DNN(net, tensorboard_verbose=0, checkpoint_path='tmp/')
model.fit(X, y, show_metric=True, batch_size=1, n_epoch=100)
The result is roughly 50% accuracy when the error range is ±10%.
I have tried using a 7-day window but the result is still bad. So I want to know what additional layers I can use to make this network better.
First of all, this network makes no sense. If you do not have any nonlinear activations on your hidden units, your network is equivalent to linear regression.
So start by changing

net = tflearn.fully_connected(net, 64, activation='linear',
                              weight_decay=0.0005)

to

net = tflearn.fully_connected(net, 64, activation='relu',
                              weight_decay=0.0005)
Another general point: always normalise your data. Your X values are big and your y values are big as well; make sure they aren't, for example by whitening them (making them zero mean and unit standard deviation).
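A minimal sketch of that normalisation step with scikit-learn, assuming X and y are NumPy arrays shaped as in the question (the scaler choice is an illustration, not part of the original answer):

from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
y_scaler = StandardScaler()

X_std = x_scaler.fit_transform(X)  # zero mean, unit variance per factor
y_std = y_scaler.fit_transform(y)  # y is already 2-D with shape (n, 1)

# Train on X_std, y_std. Predictions come back in the scaled space,
# so invert the transform afterwards, e.g.:
# y_pred = y_scaler.inverse_transform(model.predict(X_std))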
Finding the right architecture is a hard problem and you will not find any "magical recipes" for it. Start by understanding what you are doing: log your training and check whether the training loss converges to small values. If it does not, either you are not training long enough, the network is too small, or the training hyperparameters are off (e.g. too large a learning rate, too much regularisation, etc.).

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup, where there are 3 parameters and, for each, we have only 2 choices. Together they produce the 8 points depicted on the graph. In practice I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods (a short sketch follows the list):
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
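For the single-threshold case, both quantities are straightforward to compute; a minimal sketch with scikit-learn, where y_true and y_score are assumed to be the labels and the classifier scores:

import numpy as np
from sklearn.metrics import auc, roc_curve

fpr, tpr, thresholds = roc_curve(y_true, y_score)

j = tpr - fpr                               # Youden's J statistic at each threshold
best_threshold = thresholds[np.argmax(j)]   # optimal operating point
overall_performance = auc(fpr, tpr)         # area under the curve

print(best_threshold, overall_performance)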
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard, principled methods that are normally used (a sketch of this scoring follows the list):
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
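A hedged sketch of how that scoring could be computed over a finite set of parameter combinations; the results dictionary and the cost ratio are made-up placeholders for illustration:

import numpy as np

# One (TPR, FPR) pair per parameter combination, e.g. the 8 points
# from the 3-parameter / 2-choice example (values are hypothetical).
results = {
    ('a1', 'b1', 'c1'): (0.90, 0.30),
    ('a1', 'b1', 'c2'): (0.80, 0.15),
    # ... one entry per combination ...
}
cost_ratio = 1.0  # relative cost of false positives (assumed)

scores = {p: tpr - fpr * cost_ratio for p, (tpr, fpr) in results.items()}
best_params = max(scores, key=scores.get)    # optimal parameter set
overall = np.mean(list(scores.values()))     # overall performance score

print(best_params, overall)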
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to determine performance, the ones mentioned in this question are definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, and (2) provide references that establish the suggested methods as a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum but didn't get any responses; in fact, it got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any model tuning, it's best to optimise hyper-parameters on one portion of the data and evaluate them on another portion, so GridSearchCV is well suited for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about which parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example. Though you can choose from many default metrics in scikit or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for every possible combination of the parameters I have given it, using 5-fold cross-validation to evaluate how well each set of parameters performs according to the ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best-performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      1707
          1       1.00      1.00      1.00      1793
avg / total       1.00      1.00      1.00      3500

             precision    recall  f1-score   support

          0       0.51      0.46      0.48       780
          1       0.47      0.52      0.50       720
avg / total       0.49      0.49      0.49      1500
We can see from the poor score on the test set that this model has overfit. But that is not surprising, as the data is just random noise! Hopefully, when performing these methods on data with a signal, you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feel like we're talking about Grid Search in scikit-learn. It (1) provides methods to evaluate optimal (hyper)parameters and (2) is implemented in a massively popular and well-referenced statistical software package.

how to train data with large differences between values

I'm currently working on recurrent neural networks for text-to-speech but I'm stuck at one point.
I have some input files containing characteristic features of the text (phonemes etc.) with dimension 490. The output files are mgc (60-d), bap (25-d) and lf0 (1-d). The mgc and bap files are fine because there are no big gaps between values; I can train on them with reasonable time and accuracy. Inputs and outputs are sequential and properly aligned, e.g. if an input has shape (300, 490), then the shapes of mgc, bap and lf0 are (300, 60), (300, 25) and (300, 1), respectively.
My problem is with lf0 (the log of the fundamental frequency, I suppose). The values look like, say, [0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25]. I tried to train it using MSE, but the error is huge and not decreasing at all.
I'd like to hear any suggestion for this problem. I'm open to anything.
PS: I'm using 2 GRU layers with 256 or 512 units each.
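As a rough worked illustration of why a plain MSE loss is dominated by those values (the vector below reuses the example from the question; the constant prediction is a made-up placeholder):

import numpy as np

# Example lf0 targets from the question; -10e9 marks the problematic frames.
lf0 = np.array([0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25])
pred = np.zeros_like(lf0)  # hypothetical constant prediction of 0

mse_all = np.mean((lf0 - pred) ** 2)  # ~3.3e19, dominated by the -10e9 entries
mask = lf0 > -1e9                     # crude mask excluding those entries
mse_rest = np.mean((lf0[mask] - pred[mask]) ** 2)  # ~3.9

print(mse_all, mse_rest)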

Convolution neural network prediction results are same

I am running a simple convolutional neural network, doing regression and predicting the results. It predicts 30 outputs (floats).
The prediction results are almost the same irrespective of the input (they converge to the mean of the training outputs).
After 1000 iterations the training converges to a maximum loss of 0.0107 (which is a good one) on this dataset.
What is causing this?
I tried setting the bias to 1.0; it brings a little variation, but the results are still essentially the same as below. When I set the bias to 0, the results are far worse: all outputs are 100% identical. I am already using max(0, x) as regularisation, with no improvement in the results.
The outputs are below. As you can see, the first, second and third arrays are almost the same.
[[ 66.60850525 37.19641876 29.36295891 ..., 71.91300964 47.92261505
85.02180481]
[ 66.4874115 37.09647369 29.23101997 ..., 71.90777588 47.74259186
85.10979462]
[ 66.54870605 37.19485474 29.36085892 ..., 71.84892273 47.8970108
85.05699921]
...,
[ 65.7435379 36.78604889 28.57537079 ..., 71.98916626 47.03699493
85.88017273]
[ 65.7435379 36.78604889 28.57537079 ..., 71.98916626 47.03699493
85.88017273]
[ 65.7435379 36.78604889 28.57537079 ..., 71.98916626 47.03699493
85.88017273]]
The network model runs with this parameters
base_lr: 0.001
lr_policy: "fixed"
display: 100
max_iter: 1000
momentum: 0.9
Judging by the output, and by the fact that the bias affects the result so strongly, I have a feeling that you didn't normalize your input and output.
Try to normalize them between -1 and +1.
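A minimal sketch of that normalisation, assuming the inputs and the 30 regression targets are NumPy arrays; scikit-learn is used here purely for illustration, since the post doesn't say which framework handles preprocessing:

from sklearn.preprocessing import MinMaxScaler

x_scaler = MinMaxScaler(feature_range=(-1, 1))
y_scaler = MinMaxScaler(feature_range=(-1, 1))

X_scaled = x_scaler.fit_transform(X_train)  # inputs squashed into [-1, 1]
y_scaled = y_scaler.fit_transform(y_train)  # the 30 target columns, same treatment

# Train on the scaled data, then map predictions back to the original range:
# y_pred = y_scaler.inverse_transform(net.predict(X_scaled))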
