Decision tree regression producing multiple lines - machine-learning

I'm trying to make a single variable regression using decision tree regression. However when I'm plotting the results. Multiple lines show in the plot just like the photo below. I didn't encounter this problem when I used linear regression.
https://snipboard.io/v9QaoC.jpg - I can't post images since i have less than 10 reputation
My code:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
regr_2.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
# Predict
y_1 = regr_1.predict(X_test.values.reshape(-1, 1))
y_2 = regr_2.predict(X_test.values.reshape(-1, 1))
# Plot the results
plt.figure()
plt.scatter(X_train, y_train, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

Your plot is likely unattractive because your test samples aren't sorted, so you are 'connecting the dots' between different test datapoints randomly. This was unclear for your linear regression solution because the lines were overlapping.
You can get the plot you expect by sorting your test data:
# Sort
X_test = np.sort(X_test) # Need to specify axis=0 if X_test has shape (n_samples, 0)
# Predict
y_1 = regr_1.predict(X_test.values.reshape(-1, 1))
y_2 = regr_2.predict(X_test.values.reshape(-1, 1))
# Plot the results
plt.figure()
plt.scatter(X_train, y_train, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

Related

After hyperparameter tuning accuracy remains the same

I was trying to hyper tune param but after I did it, the accuracy score has not changed at all, what I do wrong?
# Log reg
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.3326530612244898,max_iter=100,tol=0.01)
logreg.fit(X_train,y_train)
from sklearn.metrics import confusion_matrix
​
y_pred = logreg.predict(X_test)
​
print('Accuracy of log reg is: ', logreg.score(X_test,y_test))
​
confusion_matrix(y_test,y_pred)
# 0.9181286549707602 - acurracy before tunning
Output:
Accuracy of log reg is: 0.9181286549707602
array([[ 54, 9],
[ 5, 103]])
Here is me Using Grid Search CV:
from sklearn.model_selection import GridSearchCV
params ={'tol':[0.01,0.001,0.0001],
'max_iter':[100,150,200],
'C':np.linspace(1,20)/10}
grid_model = GridSearchCV(logreg,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
print(grid_model_result.best_score_,grid_model_result.best_params_)
Output:
0.8867405063291139 {'C': 0.3326530612244898, 'max_iter': 100, 'tol': 0.01}
The problem was that in the first chunk you evaluate the model's performance on the test set, while in the GridSearchCV you only looked at the performance on the training set after hyperparameter optimization.
The code below shows that both procedures, when used to predict the test set labels, perform equally well in terms of accuracy (~0.93).
Note, you might want to consider using a hyperparameter grid with other solvers and a larger range of max_iter because I obtained convergence warnings.
# Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Load the dataset and split in X and y
df = pd.read_csv('Breast_cancer_data.csv')
X = df.iloc[:, 0:5]
y = df.iloc[:, 5]
# Perform train and test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a model
Log = LogisticRegression(n_jobs=-1)
# Initialize a parameter grid
params = [{'tol':[0.01,0.001,0.0001],
'max_iter':[100,150,200],
'C':np.linspace(1,20)/10}]
# Perform GridSearchCV and store the best parameters
grid_model = GridSearchCV(Log,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
best_param = grid_model_result.best_params_
# This step is only to prove that both procedures actually result in the same accuracy score
Log2 = LogisticRegression(C=best_param['C'], max_iter=best_param['max_iter'], tol=best_param['tol'], n_jobs=-1)
Log2.fit(X_train, y_train)
# Perform two predictions one straight from the GridSearch and the other one with manually inputting the best params
y_pred1 = grid_model_result.best_estimator_.predict(X_test)
y_pred2 = Log2.predict(X_test)
# Compare the accuracy scores and see that both are the same
print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))

Precision recall curve when results of estimator known

I have the results of an estimator running on X, as well as the ground truth, and I want to use plot_precision_recall_curve, but that requires passing in the estimator and X - which I can't do, the estimator is very complex and resides in another system... What should I do? (it would be nice to have a version of plot_precision_recall_curve that takes in y_pred and y_true ...).
You can use precision_recall_curve which accepts y_true and y_pred, and returns precision, recall, and thresholds, to be used further to find f1_score and auc, the latter can let you plot it manually.
This is an example:
# precision-recall curve and f1
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()

How to overfit data with Keras?

I'm trying to build a simple regression model using keras and tensorflow. In my problem I have data in the form (x, y), where x and y are simply numbers. I'd like to build a keras model in order to predict y using x as an input.
Since I think images better explains thing, these are my data:
We may discuss if they are good or not, but in my problem I cannot really cheat them.
My keras model is the following (data are splitted 30% test (X_test, y_test) and 70% training (X_train, y_train)):
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=() activation="relu", name="first_layer"))
model.add(tf.keras.layers.Dense(16, activation="relu", name="second_layer"))
model.add(tf.keras.layers.Dense(1, name="output_layer"))
model.compile(loss = "mean_squared_error", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=500, batch_size=1, verbose=0, shuffle=False)
eval_result = model.evaluate(X_test, y_test)
print("\n\nTest loss:", eval_result, "\n")
predict_Y = model.predict(X)
note: X contains both X_test and X_train.
Plotting the prediction I get (blue squares are the prediction predict_Y)
I'm playing a lot with layers, activation funztions and other parameters. My goal is to find the best parameters to train the model, but the actual question, here, is slightly different: in fact I have hard times to force the model to overfit the data (as you can see from the above results).
Does anyone have some sort of idea about how to reproduce overfitting?
This is the outcome I would like to get:
(red dots are under blue squares!)
EDIT:
Here I provide you the data used in the example above: you can copy paste directly to a python interpreter:
X_train = [0.704619794270697, 0.6779457393024553, 0.8207082120250023, 0.8588819357831449, 0.8692320257603844, 0.6878750931810429, 0.9556331888763945, 0.77677964510883, 0.7211381534179618, 0.6438319113259414, 0.6478339581502052, 0.9710222750072649, 0.8952188423349681, 0.6303124926673513, 0.9640316662124185, 0.869691568491902, 0.8320164648420931, 0.8236399177660375, 0.8877334038470911, 0.8084042532069621, 0.8045680821762038]
y_train = [0.7766424210611557, 0.8210846773655833, 0.9996114311913593, 0.8041331063189883, 0.9980525368790883, 0.8164056182686034, 0.8925487603333683, 0.7758207470960685, 0.37345286573743475, 0.9325789202459493, 0.6060269037514895, 0.9319771743389491, 0.9990691225991941, 0.9320002808310418, 0.9992560731072977, 0.9980241561997089, 0.8882905258641204, 0.4678339275898943, 0.9312152374846061, 0.9542371205095945, 0.8885893668675711]
X_test = [0.9749191829308574, 0.8735366740730178, 0.8882783211709133, 0.8022891400991644, 0.8650601322313454, 0.8697902997857514, 1.0, 0.8165876695985228, 0.8923841531760973]
y_test = [0.975653685270635, 0.9096752789481569, 0.6653736469114154, 0.46367666660348744, 0.9991817903431941, 1.0, 0.9111205717076893, 0.5264993912088891, 0.9989199241685126]
X = [0.704619794270697, 0.77677964510883, 0.7211381534179618, 0.6478339581502052, 0.6779457393024553, 0.8588819357831449, 0.8045680821762038, 0.8320164648420931, 0.8650601322313454, 0.8697902997857514, 0.8236399177660375, 0.6878750931810429, 0.8923841531760973, 0.8692320257603844, 0.8877334038470911, 0.8735366740730178, 0.8207082120250023, 0.8022891400991644, 0.6303124926673513, 0.8084042532069621, 0.869691568491902, 0.9710222750072649, 0.9556331888763945, 0.8882783211709133, 0.8165876695985228, 0.6438319113259414, 0.8952188423349681, 0.9749191829308574, 1.0, 0.9640316662124185]
Y = [0.7766424210611557, 0.7758207470960685, 0.37345286573743475, 0.6060269037514895, 0.8210846773655833, 0.8041331063189883, 0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0, 0.4678339275898943, 0.8164056182686034, 0.9989199241685126, 0.9980525368790883, 0.9312152374846061, 0.9096752789481569, 0.9996114311913593, 0.46367666660348744, 0.9320002808310418, 0.9542371205095945, 0.9980241561997089, 0.9319771743389491, 0.8925487603333683, 0.6653736469114154, 0.5264993912088891, 0.9325789202459493, 0.9990691225991941, 0.975653685270635, 0.9111205717076893, 0.9992560731072977]
Where X contains the list of the x values and Y the corresponding y value. (X_test, y_test) and (X_train, y_train) are two (non overlapping) subset of (X, Y).
To predict and show the model results I simply use matplotlib (imported as plt):
predict_Y = model.predict(X)
plt.plot(X, Y, "ro", X, predict_Y, "bs")
plt.show()
Overfitted models are rarely useful in real life. It appears to me that OP is well aware of that but wants to see if NNs are indeed capable of fitting (bounded) arbitrary functions or not. On one hand, the input-output data in the example seems to obey no discernible pattern. On the other hand, both input and output are scalars in [0, 1] and there are only 21 data points in the training set.
Based on my experiments and results, we can indeed overfit as requested. See the image below.
Numerical results:
x y_true y_pred error
0 0.704620 0.776642 0.773753 -0.002889
1 0.677946 0.821085 0.819597 -0.001488
2 0.820708 0.999611 0.999813 0.000202
3 0.858882 0.804133 0.805160 0.001026
4 0.869232 0.998053 0.997862 -0.000190
5 0.687875 0.816406 0.814692 -0.001714
6 0.955633 0.892549 0.893117 0.000569
7 0.776780 0.775821 0.779289 0.003469
8 0.721138 0.373453 0.374007 0.000554
9 0.643832 0.932579 0.912565 -0.020014
10 0.647834 0.606027 0.607253 0.001226
11 0.971022 0.931977 0.931549 -0.000428
12 0.895219 0.999069 0.999051 -0.000018
13 0.630312 0.932000 0.930252 -0.001748
14 0.964032 0.999256 0.999204 -0.000052
15 0.869692 0.998024 0.997859 -0.000165
16 0.832016 0.888291 0.887883 -0.000407
17 0.823640 0.467834 0.460728 -0.007106
18 0.887733 0.931215 0.932790 0.001575
19 0.808404 0.954237 0.960282 0.006045
20 0.804568 0.888589 0.906829 0.018240
{'me': -0.00015776709314323828,
'mae': 0.00329163070145315,
'mse': 4.0713782563067185e-05,
'rmse': 0.006380735268216915}
OP's code seems good to me. My changes were minor:
Use deeper networks. It may not actually be necessary to use a depth of 30 layers but since we just want to overfit, I didn't experiment too much with what's the minimum depth needed.
Each Dense layer has 50 units. Again, this may be overkill.
Added batch normalization layer every 5th dense layer.
Decreased learning rate by half.
Ran optimization for longer using the all 21 training examples in a batch.
Used MAE as objective function. MSE is good but since we want to overfit, I want to penalize small errors the same way as large errors.
Random numbers are more important here because data appears to be arbitrary. Though, you should get similar results if you change random number seed and let the optimizer run long enough. In some cases, optimization does get stuck in a local minima and it would not produce overfitting (as requested by OP).
The code is below.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
# Set seed just to have reproducible results
np.random.seed(84)
tf.random.set_seed(84)
# Load data from the post
# https://stackoverflow.com/questions/61252785/how-to-overfit-data-with-keras
X_train = np.array([0.704619794270697, 0.6779457393024553, 0.8207082120250023,
0.8588819357831449, 0.8692320257603844, 0.6878750931810429,
0.9556331888763945, 0.77677964510883, 0.7211381534179618,
0.6438319113259414, 0.6478339581502052, 0.9710222750072649,
0.8952188423349681, 0.6303124926673513, 0.9640316662124185,
0.869691568491902, 0.8320164648420931, 0.8236399177660375,
0.8877334038470911, 0.8084042532069621,
0.8045680821762038])
Y_train = np.array([0.7766424210611557, 0.8210846773655833, 0.9996114311913593,
0.8041331063189883, 0.9980525368790883, 0.8164056182686034,
0.8925487603333683, 0.7758207470960685,
0.37345286573743475, 0.9325789202459493,
0.6060269037514895, 0.9319771743389491, 0.9990691225991941,
0.9320002808310418, 0.9992560731072977, 0.9980241561997089,
0.8882905258641204, 0.4678339275898943, 0.9312152374846061,
0.9542371205095945, 0.8885893668675711])
X_test = np.array([0.9749191829308574, 0.8735366740730178, 0.8882783211709133,
0.8022891400991644, 0.8650601322313454, 0.8697902997857514,
1.0, 0.8165876695985228, 0.8923841531760973])
Y_test = np.array([0.975653685270635, 0.9096752789481569, 0.6653736469114154,
0.46367666660348744, 0.9991817903431941, 1.0,
0.9111205717076893, 0.5264993912088891, 0.9989199241685126])
X = np.array([0.704619794270697, 0.77677964510883, 0.7211381534179618,
0.6478339581502052, 0.6779457393024553, 0.8588819357831449,
0.8045680821762038, 0.8320164648420931, 0.8650601322313454,
0.8697902997857514, 0.8236399177660375, 0.6878750931810429,
0.8923841531760973, 0.8692320257603844, 0.8877334038470911,
0.8735366740730178, 0.8207082120250023, 0.8022891400991644,
0.6303124926673513, 0.8084042532069621, 0.869691568491902,
0.9710222750072649, 0.9556331888763945, 0.8882783211709133,
0.8165876695985228, 0.6438319113259414, 0.8952188423349681,
0.9749191829308574, 1.0, 0.9640316662124185])
Y = np.array([0.7766424210611557, 0.7758207470960685, 0.37345286573743475,
0.6060269037514895, 0.8210846773655833, 0.8041331063189883,
0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0,
0.4678339275898943, 0.8164056182686034, 0.9989199241685126,
0.9980525368790883, 0.9312152374846061, 0.9096752789481569,
0.9996114311913593, 0.46367666660348744, 0.9320002808310418,
0.9542371205095945, 0.9980241561997089, 0.9319771743389491,
0.8925487603333683, 0.6653736469114154, 0.5264993912088891,
0.9325789202459493, 0.9990691225991941, 0.975653685270635,
0.9111205717076893, 0.9992560731072977])
# Reshape all data to be of the shape (batch_size, 1)
X_train = X_train.reshape((-1, 1))
Y_train = Y_train.reshape((-1, 1))
X_test = X_test.reshape((-1, 1))
Y_test = Y_test.reshape((-1, 1))
X = X.reshape((-1, 1))
Y = Y.reshape((-1, 1))
# Is data scaled? NNs do well with bounded data.
assert np.all(X_train >= 0) and np.all(X_train <= 1)
assert np.all(Y_train >= 0) and np.all(Y_train <= 1)
assert np.all(X_test >= 0) and np.all(X_test <= 1)
assert np.all(Y_test >= 0) and np.all(Y_test <= 1)
assert np.all(X >= 0) and np.all(X <= 1)
assert np.all(Y >= 0) and np.all(Y <= 1)
# Build a model with variable number of hidden layers.
# We will use Keras functional API.
# https://www.perfectlyrandom.org/2019/06/24/a-guide-to-keras-functional-api/
n_dense_layers = 30 # increase this to get more complicated models
# Define the layers first.
input_tensor = Input(shape=(1,), name='input')
layers = []
for i in range(n_dense_layers):
layers += [Dense(units=50, activation='relu', name=f'dense_layer_{i}')]
if (i > 0) & (i % 5 == 0):
# avg over batches not features
layers += [BatchNormalization(axis=1)]
sigmoid_layer = Dense(units=1, activation='sigmoid', name='sigmoid_layer')
# Connect the layers using Keras Functional API
mid_layer = input_tensor
for dense_layer in layers:
mid_layer = dense_layer(mid_layer)
output_tensor = sigmoid_layer(mid_layer)
model = Model(inputs=[input_tensor], outputs=[output_tensor])
optimizer = Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mae', metrics=['mae'])
model.fit(x=[X_train], y=[Y_train], epochs=40000, batch_size=21)
# Predict on various datasets
Y_train_pred = model.predict(X_train)
# Create a dataframe to inspect results manually
train_df = pd.DataFrame({
'x': X_train.reshape((-1)),
'y_true': Y_train.reshape((-1)),
'y_pred': Y_train_pred.reshape((-1))
})
train_df['error'] = train_df['y_pred'] - train_df['y_true']
print(train_df)
# A dictionary to store all the errors in one place.
train_errors = {
'me': np.mean(train_df['error']),
'mae': np.mean(np.abs(train_df['error'])),
'mse': np.mean(np.square(train_df['error'])),
'rmse': np.sqrt(np.mean(np.square(train_df['error']))),
}
print(train_errors)
# Make a plot to visualize true vs predicted
plt.figure(1)
plt.clf()
plt.plot(train_df['x'], train_df['y_true'], 'r.', label='y_true')
plt.plot(train_df['x'], train_df['y_pred'], 'bo', alpha=0.25, label='y_pred')
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'Train data. MSE={np.round(train_errors["mse"], 5)}.')
plt.legend()
plt.show(block=False)
plt.savefig('true_vs_pred.png')
A problem you may encountering is that you don't have enough training data for the model to be able to fit well. In your example, you only have 21 training instances, each with only 1 feature. Broadly speaking with neural network models, you need on the order of 10K or more training instances to produce a decent model.
Consider the following code that generates a noisy sine wave and tries to train a densely-connected feed-forward neural network to fit the data. My model has two linear layers, each with 50 hidden units and a ReLU activation function. The experiments are parameterized with the variable num_points which I will increase.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(7)
def generate_data(num_points=100):
X = np.linspace(0.0 , 2.0 * np.pi, num_points).reshape(-1, 1)
noise = np.random.normal(0, 1, num_points).reshape(-1, 1)
y = 3 * np.sin(X) + noise
return X, y
def run_experiment(X_train, y_train, X_test, batch_size=64):
num_points = X_train.shape[0]
model = keras.Sequential()
model.add(layers.Dense(50, input_shape=(1, ), activation='relu'))
model.add(layers.Dense(50, activation='relu'))
model.add(layers.Dense(1, activation='linear'))
model.compile(loss = "mse", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=10,
batch_size=batch_size, verbose=0)
yhat = model.predict(X_test, batch_size=batch_size)
plt.figure(figsize=(5, 5))
plt.plot(X_train, y_train, "ro", markersize=2, label='True')
plt.plot(X_train, yhat, "bo", markersize=1, label='Predicted')
plt.ylim(-5, 5)
plt.title('N=%d points' % (num_points))
plt.legend()
plt.grid()
plt.show()
Here is how I invoke the code:
num_points = 100
X, y = generate_data(num_points)
run_experiment(X, y, X)
Now, if I run the experiment with num_points = 100, the model predictions (in blue) do a terrible job at fitting the true noisy sine wave (in red).
Now, here is num_points = 1000:
Here is num_points = 10000:
And here is num_points = 100000:
As you can see, for my chosen NN architecture, adding more training instances allows the neural network to better (over)fit the data.
If you do have a lot of training instances, then if you want to purposefully overfit your data, you can either increase the neural network capacity or reduce regularization. Specifically, you can control the following knobs:
increase the number of layers
increase the number of hidden units
increase the number of features per data instance
reduce regularization (e.g. by removing dropout layers)
use a more complex neural network architecture (e.g. transformer blocks instead of RNN)
You may be wondering if neural networks can fit arbitrary data rather than just a noisy sine wave as in my example. Previous research says that, yes, a big enough neural network can fit any data. See:
Universal approximation theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem
Zhang 2016, "Understanding deep learning requires rethinking generalization". https://arxiv.org/abs/1611.03530
As discussed in the comments, you should make a Python array (with NumPy) like this:-
Myarray = [[0.65, 1], [0.85, 0.5], ....]
Then you would just call those specific parts of the array whom you need to predict. Here the first value is the x-axis value. So you would call it to obtain the corresponding pair stored in Myarray
There are many resources to learn these types of things. some of them are ===>
https://www.geeksforgeeks.org/python-using-2d-arrays-lists-the-right-way/
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=video&cd=2&cad=rja&uact=8&ved=0ahUKEwjGs-Oxne3oAhVlwTgGHfHnDp4QtwIILTAB&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQgfUT7i4yrc&usg=AOvVaw3LympYRszIYi6_OijMXH72

scikit learn: elastic net approaching ridge

So elastic net is supposed to be a hybrid between ridge regression (L2 regularization) and lasso (L1 regularization). However, it seems that even if l1_ratio is 0 I don't get the same result as ridge. I know ridge using gradient descent and elastic net uses coordinate descent, but the optima should be the same, no? Moreover, I've found that elasticnet often throws ConvergenceWarnings for no obvious reason, while lasso and ridge don't. Here's a snippet:
from sklearn.datasets import load_boston
from sklearn.utils import shuffle
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.model_selection import train_test_split
data = load_boston()
X, y = shuffle(data.data, data.target, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)
alpha = 1
en = ElasticNet(alpha=alpha, l1_ratio=0)
en.fit(X_train, y_train)
print('en train score: ', en.score(X_train, y_train))
rr = Ridge(alpha=alpha)
rr.fit(X_train, y_train)
print('rr train score: ', rr.score(X_train, y_train))
lr = Lasso(alpha=alpha)
lr.fit(X_train, y_train)
print('lr train score: ', lr.score(X_train, y_train))
print('---')
print('en test score: ', en.score(X_test, y_test))
print('rr test score: ', rr.score(X_test, y_test))
print('lr test score: ', lr.score(X_test, y_test))
print('---')
print('en coef: ', en.coef_)
print('rr coef: ', rr.coef_)
print('lr coef: ', lr.coef_)
Even though l1_ratio is 0, the train and test scores of elastic net are close to the lasso scores (and not ridge as you would expect). Moreover, elastic net seems to throw a ConvergenceWarning, even if I increase max_iter (even up to 1000000 there seems to be no effect) and tol (0.1 still throws an error, but 0.2 doesn't). Increasing alpha (as the warning suggests) also has no effect.
Based on the answer by #sascha, one can match the results between the two models:
import sklearn
print(sklearn.__version__)
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.datasets import load_boston
dataset = load_boston()
X = dataset.data
y = dataset.target
f = Ridge(alpha=1,
fit_intercept=True, normalize=False,
copy_X=True, max_iter=1000, tol=1e-4, random_state=42,
solver='auto')
g = ElasticNet(alpha=1/X.shape[0], l1_ratio=1e-16,
fit_intercept=True, normalize=False,
copy_X=True, max_iter=1000, tol=1e-4, random_state=42,
precompute=False, warm_start=False,
positive=False, selection='cyclic')
f.fit(X, y)
g.fit(X, y)
print(abs(f.coef_ - g.coef_) / abs(f.coef_))
Output:
0.19.2
[1.19195623e-14 1.17076625e-15 3.25973465e-13 1.61694280e-14
4.77274767e-15 4.15332538e-15 6.15640568e-14 1.61772832e-15
4.56125088e-14 5.44320605e-14 8.99189018e-15 2.31213025e-15
3.74181954e-15]
Just read the docs. Then you will find out that none of these is using Gradient-descent and more important:
Ridge
Elastic Net
which shows, when substituting a=1, p=0, that:
ElasticNet has one more sample-dependent factor on top of the loss not found in Ridge
ElasticNet has one more 1/2 factor in the l2-term
Why different models? Probably because sklearn follows the canonical/original R-based implementation glmnet.
Furthermore i would not be surprised to see numerical-issues when doing mixed-norm optimization while i'm forcing a non-mixed norm like l1=0, especially when there are specialized solvers for both non-mixed optimization-problems.
Luckily, sklearn even has to say something about it:
Currently, l1_ratio <= 0.01 is not reliable, unless you supply your own sequence of alpha.

Under what parameters are SVC and LinearSVC in scikit-learn equivalent?

I read this thread about the difference between SVC() and LinearSVC() in scikit-learn.
Now I have a data set of binary classification problem(For such a problem, the one-to-one/one-to-rest strategy difference between both functions could be ignore.)
I want to try under what parameters would these 2 functions give me the same result. First of all, of course, we should set kernel='linear' for SVC()
However, I just could not get the same result from both functions. I could not find the answer from the documents, could anybody help me to find the equivalent parameter set I am looking for?
Updated:
I modified the following code from an example of the scikit-learn website, and apparently they are not the same:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
for i in range(len(y)):
if (y[i]==2):
y[i] = 1
h = .02 # step size in the mesh
# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C, dual = True, loss = 'hinge').fit(X, y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'LinearSVC (linear kernel)']
for i, clf in enumerate((svc, lin_svc)):
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
plt.subplot(1, 2, i + 1)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[i])
plt.show()
Result:
Output Figure from previous code
In mathematical sense you need to set:
SVC(kernel='linear', **kwargs) # by default it uses RBF kernel
and
LinearSVC(loss='hinge', **kwargs) # by default it uses squared hinge loss
Another element, which cannot be easily fixed is increasing intercept_scaling in LinearSVC, as in this implementation bias is regularized (which is not true in SVC nor should be true in SVM - thus this is not SVM) - consequently they will never be exactly equal (unless bias=0 for your problem), as they assume two different models
SVC : 1/2||w||^2 + C SUM xi_i
LinearSVC: 1/2||[w b]||^2 + C SUM xi_i
Personally I consider LinearSVC one of the mistakes of sklearn developers - this class is simply not a linear SVM.
After increasing intercept scaling (to 10.0)
However, if you scale it up too much - it will also fail, as now tolerance and number of iterations are crucial.
To sum up: LinearSVC is not linear SVM, do not use it if do not have to.

Resources