I tuned the alpha hyperparameter with GridSearchCV and plugged the returned value back into my model, but it resulted in worse accuracy.
import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': np.arange(1, 10) / 10}  # alpha in 0.1 .. 0.9
ridge_cv = GridSearchCV(ridge, param_grid, cv=5)
ridge_cv.fit(X, y)
print("Tuned Ridge Regression Parameters: {}".format(ridge_cv.best_params_))
print("Best score is {}".format(ridge_cv.best_score_))
Output:
Tuned Ridge Regression Parameters: {'alpha': 0.1}
Best score is 0.6913858638355904
Then I plugged the tuned alpha back into my Ridge regression.
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(rescaledX_train, y_train)
y_pred_r = ridge.predict(rescaledX_test)
print('MSE for Ridge: {}'.format(mean_squared_error(y_test, y_pred_r)))
# 4.771925701539793 - previous MSE, before hyperparameter tuning
Output:
MSE for Ridge: 5.81212747265781
What have I done wrong?
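One possible explanation (only a guess, since the full preprocessing code is not shown): the grid search above was fit on the full, apparently unscaled X and y, while the final Ridge was trained on rescaledX_train, so the alpha was selected under different conditions than the ones it is used in. A minimal sketch of running the search on the same rescaled training split instead:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

# rescaledX_train / rescaledX_test / y_train / y_test as in the question
param_grid = {'alpha': np.arange(1, 10) / 10}
ridge_cv = GridSearchCV(Ridge(normalize=True), param_grid, cv=5,
                        scoring='neg_mean_squared_error')
ridge_cv.fit(rescaledX_train, y_train)  # search on the data the final model sees
best = ridge_cv.best_estimator_
print('CV-chosen alpha:', ridge_cv.best_params_)
print('Test MSE:', mean_squared_error(y_test, best.predict(rescaledX_test)))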
I am trying to train an SVM model on the Iris dataset. The aim is to distinguish Iris virginica flowers from the other types of flowers. Here is the code:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2,3)] # petal length, petal width
y = (iris["target"]==2).astype(np.float64) # Iris virginica
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=False))
])
svm_clf.fit(X, y)
My book, which is Aurelien Geron's "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow", 2nd edition, says on page 156:
For better performance, you should set the dual hyperparameter to
False, unless there are more features than training instances
But if I set the dual hyperparameter to False, I get the following error:
ValueError: Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
It instead works if I set the dual hyperparameter to True.
Why is this set of hyperparameters not supported?
An L2-regularized SVM with L1 loss (hinge) cannot be solved in its primal form; only its dual form can be solved effectively. This is a limitation of the LIBLINEAR library that sklearn uses under the hood. If you want to solve the primal form of the L2-regularized SVM, you will have to use the L2 loss (squared hinge) instead.
LinearSVC(C=1, loss='squared_hinge', dual=False).fit(X, y)
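For comparison, here is a quick sketch (reusing the X and y from the question) of the two combinations that LIBLINEAR does support:

from sklearn.svm import LinearSVC

# Hinge (L1) loss: only the dual form is supported
LinearSVC(C=1, loss='hinge', dual=True).fit(X, y)
# Squared hinge (L2) loss: the primal form works, so dual=False is fine
LinearSVC(C=1, loss='squared_hinge', dual=False).fit(X, y)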
For more details: Link 1
I have to train a model with logistic regression in sklearn. Everywhere I read, the outcome has to be binary, but my label is good, bad, or normal. I have 12 features and I don't know how I can deal with three labels. I am very thankful for every answer.
You can use Multinomial Logistic Regression.
In Python, you can modify your logistic regression code as:

from sklearn.linear_model import LogisticRegression

LogisticRegression(multi_class='multinomial').fit(X_train, y_train)
You can see the Logistic Regression documentation in Scikit-Learn for more details.
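If it helps, here is a self-contained sketch (the data is synthetic and hypothetical, standing in for your 12 features and three good/bad/normal labels) showing the fit plus class predictions and per-class probabilities:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical 3-class data with 12 features, as in the question
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(multi_class='multinomial', solver='lbfgs',
                         max_iter=1000)
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))        # one of the three classes per sample
print(clf.predict_proba(X_test[:5]))  # one probability per class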
This is called one-vs-all (one-vs-rest) classification, or more generally multiclass classification.
From sklearn.linear_model.LogisticRegression:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
Code example:
# Authors: Tom Dupre la Tour <tom.dupre-la-tour#m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# make 3-class dataset for classification
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
transformation = [[0.4, 0.2], [-0.4, 1.2]]
X = np.dot(X, transformation)
for multi_class in ('multinomial', 'ovr'):
    clf = LogisticRegression(solver='sag', max_iter=100, random_state=42,
                             multi_class=multi_class).fit(X, y)
    # print the training scores
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))
For the full code example, see: Plot multinomial and One-vs-Rest Logistic Regression
Could you please tell me how I can use Elastic-Net, Lasso and Ridge regression in Pyspark? I chose Linear, Elastic-Net, Lasso and Ridge regression, these 4 algorithms, according to a machine learning cheatsheet. However, I don't know how to import Elastic-Net, Lasso and Ridge regression in Pyspark and cannot google the right answers. I only know how to use Linear Regression in Pyspark.
Have a look at https://spark.apache.org/docs/1.5.2/ml-linear-methods.html
You can use something like:
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
# Load training data
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the weights and intercept for linear regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
If you read further into that page, you'll find that:
By setting α properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a linear regression model is trained with the elastic net parameter α set to 1, it is equivalent to a Lasso model. On the other hand, if α is set to 0, the trained model reduces to a ridge regression model.
Where:
elasticNetParam corresponds to α and regParam corresponds to λ.
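So, as a sketch (reusing the training DataFrame loaded above), the three models differ only in elasticNetParam:

from pyspark.ml.regression import LinearRegression

# Ridge: pure L2 penalty (elasticNetParam = 0.0)
ridge = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.0)
# Lasso: pure L1 penalty (elasticNetParam = 1.0)
lasso = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=1.0)
# Elastic-Net: a mix of both (0 < elasticNetParam < 1)
enet = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.5)

ridgeModel = ridge.fit(training)
lassoModel = lasso.fit(training)
enetModel = enet.fit(training)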
After average-ensemble classification, I am getting a weird confusion matrix and even weirder metric scores.
Code:
x = data_train[categorical_columns + numerical_columns]
y = data_train['target']
from imblearn.over_sampling import SMOTE
x_sample, y_sample = SMOTE().fit_sample(x, y.values.ravel())
x_sample = pd.DataFrame(x_sample)
y_sample = pd.DataFrame(y_sample)
# checking the sizes of the sample data
print("Size of x-sample :", x_sample.shape)
print("Size of y-sample :", y_sample.shape)
# Train-Test split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_sample, y_sample,
                                                    test_size=0.40,
                                                    shuffle=False)
Accuracy is 99.9%, but recall, f1-score and precision are 0. I have never faced this problem before. I have used the AdaBoost classifier.
Confusion Matrix for ADB:
[[46399 25]
[ 0 0]]
Accuracy for ADB:
0.9994614854385663
Precision for ADB:
0.0
Recall for ADB:
0.0
f1_score for ADB:
0.0
Since it is an imbalanced dataset, I have used SMOTE, and now I am getting the following results:
Confusion Matrix for ETC:
[[ 0 0]
[ 336 92002]]
Accuracy for ETC:
0.99636119474106
Precision for ETC:
1.0
Recall for ETC:
0.99636119474106
f1_score for ETC:
0.9981772811109906
This is happening because you have an imbalanced dataset (99.9% 0's and only 0.1% 1's). In such scenarios, using accuracy as a metric can be misleading.
You can read more about which metrics to use in such scenarios here.
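As a short sketch (y_pred here is a hypothetical name for your classifier's predictions on X_test, which the question doesn't show), these metrics stay informative under heavy imbalance:

from sklearn.metrics import classification_report, balanced_accuracy_score

# Per-class precision/recall/f1 makes an ignored minority class obvious
print(classification_report(y_test, y_pred))
# Balanced accuracy averages recall across classes, so the majority
# class cannot inflate it the way it inflates plain accuracy
print(balanced_accuracy_score(y_test, y_pred))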
Hi, as the above answers mentioned, it is because of skewed (unbalanced) data. However, I would like to give a simpler solution: use SVMs.
import sklearn.svm

model = sklearn.svm.SVC(class_weight='balanced')
model.fit(X_train, y_train)
Using class_weight='balanced' automatically gives equal importance to all classes, irrespective of how many data points each class has in the dataset. Also, using the 'rbf' kernel (the default for SVC) can give really good accuracy.
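For intuition, here is a sketch of the weights that 'balanced' computes internally (this assumes non-negative integer class labels, since np.bincount requires them):

import numpy as np

# sklearn's documented formula: n_samples / (n_classes * np.bincount(y))
classes = np.unique(y_train)
weights = len(y_train) / (len(classes) * np.bincount(y_train))
print(dict(zip(classes, weights)))  # rarer classes get larger weights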
I'm dealing with a simple logistic regression problem. Each sample contains 7423 features. There are 4000 training samples and 1000 testing samples in total. Sklearn takes 0.01s to train the model and achieves 97% accuracy, but Keras (TensorFlow backend) takes 10s to reach the same accuracy after 50 epochs (even one epoch is 20x slower than sklearn). Can anyone shed light on this huge gap?
Samples:
X_train: matrix of 4000*7423, 0.0 <= value <= 1.0
y_train: matrix of 4000*1, value = 0.0 or 1.0
X_test: matrix of 1000*7423, 0.0 <= value <= 1.0
y_test: matrix of 1000*1, value = 0.0 or 1.0
Sklearn code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
classifier = LogisticRegression()
# Finished in 0.01s
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('test accuracy = %.2f' % accuracy_score(predictions, y_test))
[output]: test accuracy = 0.97
Keras code:
# Using TensorFlow as backend
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
# Finished in 10s
model.fit(X_train, y_train, batch_size=64, nb_epoch=50, verbose=0)
result = model.evaluate(X_test, y_test, verbose=0)
print('test accuracy = %.2f' % result[1])
[output]: test accuracy = 0.97
It might be the optimizer or the loss. You use a non-linearity. Sklearn also probably uses a different batch size under the hood.
But the way I see it, you have a specific task: one of the tools is tailored and made to solve it, while the other is a more complex structure that can solve it but is not optimized to do so, and it probably does a lot of things that are not needed for this problem, which slows everything down.
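If you still want the Keras run itself to be faster, one sketch (the batch size and epoch count are illustrative guesses, reusing the model and data from the question): fewer, larger batches mean far fewer gradient updates per epoch.

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# batch_size=1024 gives ~4 updates per epoch instead of ~63 at batch_size=64
model.fit(X_train, y_train, batch_size=1024, nb_epoch=20, verbose=0)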