How to overfit data with Keras? - machine-learning

I'm trying to build a simple regression model using keras and tensorflow. In my problem I have data in the form (x, y), where x and y are simply numbers. I'd like to build a keras model in order to predict y using x as an input.
Since I think images better explains thing, these are my data:
We may discuss if they are good or not, but in my problem I cannot really cheat them.
My keras model is the following (data are splitted 30% test (X_test, y_test) and 70% training (X_train, y_train)):
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=() activation="relu", name="first_layer"))
model.add(tf.keras.layers.Dense(16, activation="relu", name="second_layer"))
model.add(tf.keras.layers.Dense(1, name="output_layer"))
model.compile(loss = "mean_squared_error", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=500, batch_size=1, verbose=0, shuffle=False)
eval_result = model.evaluate(X_test, y_test)
print("\n\nTest loss:", eval_result, "\n")
predict_Y = model.predict(X)
note: X contains both X_test and X_train.
Plotting the prediction I get (blue squares are the prediction predict_Y)
I'm playing a lot with layers, activation funztions and other parameters. My goal is to find the best parameters to train the model, but the actual question, here, is slightly different: in fact I have hard times to force the model to overfit the data (as you can see from the above results).
Does anyone have some sort of idea about how to reproduce overfitting?
This is the outcome I would like to get:
(red dots are under blue squares!)
EDIT:
Here I provide you the data used in the example above: you can copy paste directly to a python interpreter:
X_train = [0.704619794270697, 0.6779457393024553, 0.8207082120250023, 0.8588819357831449, 0.8692320257603844, 0.6878750931810429, 0.9556331888763945, 0.77677964510883, 0.7211381534179618, 0.6438319113259414, 0.6478339581502052, 0.9710222750072649, 0.8952188423349681, 0.6303124926673513, 0.9640316662124185, 0.869691568491902, 0.8320164648420931, 0.8236399177660375, 0.8877334038470911, 0.8084042532069621, 0.8045680821762038]
y_train = [0.7766424210611557, 0.8210846773655833, 0.9996114311913593, 0.8041331063189883, 0.9980525368790883, 0.8164056182686034, 0.8925487603333683, 0.7758207470960685, 0.37345286573743475, 0.9325789202459493, 0.6060269037514895, 0.9319771743389491, 0.9990691225991941, 0.9320002808310418, 0.9992560731072977, 0.9980241561997089, 0.8882905258641204, 0.4678339275898943, 0.9312152374846061, 0.9542371205095945, 0.8885893668675711]
X_test = [0.9749191829308574, 0.8735366740730178, 0.8882783211709133, 0.8022891400991644, 0.8650601322313454, 0.8697902997857514, 1.0, 0.8165876695985228, 0.8923841531760973]
y_test = [0.975653685270635, 0.9096752789481569, 0.6653736469114154, 0.46367666660348744, 0.9991817903431941, 1.0, 0.9111205717076893, 0.5264993912088891, 0.9989199241685126]
X = [0.704619794270697, 0.77677964510883, 0.7211381534179618, 0.6478339581502052, 0.6779457393024553, 0.8588819357831449, 0.8045680821762038, 0.8320164648420931, 0.8650601322313454, 0.8697902997857514, 0.8236399177660375, 0.6878750931810429, 0.8923841531760973, 0.8692320257603844, 0.8877334038470911, 0.8735366740730178, 0.8207082120250023, 0.8022891400991644, 0.6303124926673513, 0.8084042532069621, 0.869691568491902, 0.9710222750072649, 0.9556331888763945, 0.8882783211709133, 0.8165876695985228, 0.6438319113259414, 0.8952188423349681, 0.9749191829308574, 1.0, 0.9640316662124185]
Y = [0.7766424210611557, 0.7758207470960685, 0.37345286573743475, 0.6060269037514895, 0.8210846773655833, 0.8041331063189883, 0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0, 0.4678339275898943, 0.8164056182686034, 0.9989199241685126, 0.9980525368790883, 0.9312152374846061, 0.9096752789481569, 0.9996114311913593, 0.46367666660348744, 0.9320002808310418, 0.9542371205095945, 0.9980241561997089, 0.9319771743389491, 0.8925487603333683, 0.6653736469114154, 0.5264993912088891, 0.9325789202459493, 0.9990691225991941, 0.975653685270635, 0.9111205717076893, 0.9992560731072977]
Where X contains the list of the x values and Y the corresponding y value. (X_test, y_test) and (X_train, y_train) are two (non overlapping) subset of (X, Y).
To predict and show the model results I simply use matplotlib (imported as plt):
predict_Y = model.predict(X)
plt.plot(X, Y, "ro", X, predict_Y, "bs")
plt.show()

Overfitted models are rarely useful in real life. It appears to me that OP is well aware of that but wants to see if NNs are indeed capable of fitting (bounded) arbitrary functions or not. On one hand, the input-output data in the example seems to obey no discernible pattern. On the other hand, both input and output are scalars in [0, 1] and there are only 21 data points in the training set.
Based on my experiments and results, we can indeed overfit as requested. See the image below.
Numerical results:
x y_true y_pred error
0 0.704620 0.776642 0.773753 -0.002889
1 0.677946 0.821085 0.819597 -0.001488
2 0.820708 0.999611 0.999813 0.000202
3 0.858882 0.804133 0.805160 0.001026
4 0.869232 0.998053 0.997862 -0.000190
5 0.687875 0.816406 0.814692 -0.001714
6 0.955633 0.892549 0.893117 0.000569
7 0.776780 0.775821 0.779289 0.003469
8 0.721138 0.373453 0.374007 0.000554
9 0.643832 0.932579 0.912565 -0.020014
10 0.647834 0.606027 0.607253 0.001226
11 0.971022 0.931977 0.931549 -0.000428
12 0.895219 0.999069 0.999051 -0.000018
13 0.630312 0.932000 0.930252 -0.001748
14 0.964032 0.999256 0.999204 -0.000052
15 0.869692 0.998024 0.997859 -0.000165
16 0.832016 0.888291 0.887883 -0.000407
17 0.823640 0.467834 0.460728 -0.007106
18 0.887733 0.931215 0.932790 0.001575
19 0.808404 0.954237 0.960282 0.006045
20 0.804568 0.888589 0.906829 0.018240
{'me': -0.00015776709314323828,
'mae': 0.00329163070145315,
'mse': 4.0713782563067185e-05,
'rmse': 0.006380735268216915}
OP's code seems good to me. My changes were minor:
Use deeper networks. It may not actually be necessary to use a depth of 30 layers but since we just want to overfit, I didn't experiment too much with what's the minimum depth needed.
Each Dense layer has 50 units. Again, this may be overkill.
Added batch normalization layer every 5th dense layer.
Decreased learning rate by half.
Ran optimization for longer using the all 21 training examples in a batch.
Used MAE as objective function. MSE is good but since we want to overfit, I want to penalize small errors the same way as large errors.
Random numbers are more important here because data appears to be arbitrary. Though, you should get similar results if you change random number seed and let the optimizer run long enough. In some cases, optimization does get stuck in a local minima and it would not produce overfitting (as requested by OP).
The code is below.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
# Set seed just to have reproducible results
np.random.seed(84)
tf.random.set_seed(84)
# Load data from the post
# https://stackoverflow.com/questions/61252785/how-to-overfit-data-with-keras
X_train = np.array([0.704619794270697, 0.6779457393024553, 0.8207082120250023,
0.8588819357831449, 0.8692320257603844, 0.6878750931810429,
0.9556331888763945, 0.77677964510883, 0.7211381534179618,
0.6438319113259414, 0.6478339581502052, 0.9710222750072649,
0.8952188423349681, 0.6303124926673513, 0.9640316662124185,
0.869691568491902, 0.8320164648420931, 0.8236399177660375,
0.8877334038470911, 0.8084042532069621,
0.8045680821762038])
Y_train = np.array([0.7766424210611557, 0.8210846773655833, 0.9996114311913593,
0.8041331063189883, 0.9980525368790883, 0.8164056182686034,
0.8925487603333683, 0.7758207470960685,
0.37345286573743475, 0.9325789202459493,
0.6060269037514895, 0.9319771743389491, 0.9990691225991941,
0.9320002808310418, 0.9992560731072977, 0.9980241561997089,
0.8882905258641204, 0.4678339275898943, 0.9312152374846061,
0.9542371205095945, 0.8885893668675711])
X_test = np.array([0.9749191829308574, 0.8735366740730178, 0.8882783211709133,
0.8022891400991644, 0.8650601322313454, 0.8697902997857514,
1.0, 0.8165876695985228, 0.8923841531760973])
Y_test = np.array([0.975653685270635, 0.9096752789481569, 0.6653736469114154,
0.46367666660348744, 0.9991817903431941, 1.0,
0.9111205717076893, 0.5264993912088891, 0.9989199241685126])
X = np.array([0.704619794270697, 0.77677964510883, 0.7211381534179618,
0.6478339581502052, 0.6779457393024553, 0.8588819357831449,
0.8045680821762038, 0.8320164648420931, 0.8650601322313454,
0.8697902997857514, 0.8236399177660375, 0.6878750931810429,
0.8923841531760973, 0.8692320257603844, 0.8877334038470911,
0.8735366740730178, 0.8207082120250023, 0.8022891400991644,
0.6303124926673513, 0.8084042532069621, 0.869691568491902,
0.9710222750072649, 0.9556331888763945, 0.8882783211709133,
0.8165876695985228, 0.6438319113259414, 0.8952188423349681,
0.9749191829308574, 1.0, 0.9640316662124185])
Y = np.array([0.7766424210611557, 0.7758207470960685, 0.37345286573743475,
0.6060269037514895, 0.8210846773655833, 0.8041331063189883,
0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0,
0.4678339275898943, 0.8164056182686034, 0.9989199241685126,
0.9980525368790883, 0.9312152374846061, 0.9096752789481569,
0.9996114311913593, 0.46367666660348744, 0.9320002808310418,
0.9542371205095945, 0.9980241561997089, 0.9319771743389491,
0.8925487603333683, 0.6653736469114154, 0.5264993912088891,
0.9325789202459493, 0.9990691225991941, 0.975653685270635,
0.9111205717076893, 0.9992560731072977])
# Reshape all data to be of the shape (batch_size, 1)
X_train = X_train.reshape((-1, 1))
Y_train = Y_train.reshape((-1, 1))
X_test = X_test.reshape((-1, 1))
Y_test = Y_test.reshape((-1, 1))
X = X.reshape((-1, 1))
Y = Y.reshape((-1, 1))
# Is data scaled? NNs do well with bounded data.
assert np.all(X_train >= 0) and np.all(X_train <= 1)
assert np.all(Y_train >= 0) and np.all(Y_train <= 1)
assert np.all(X_test >= 0) and np.all(X_test <= 1)
assert np.all(Y_test >= 0) and np.all(Y_test <= 1)
assert np.all(X >= 0) and np.all(X <= 1)
assert np.all(Y >= 0) and np.all(Y <= 1)
# Build a model with variable number of hidden layers.
# We will use Keras functional API.
# https://www.perfectlyrandom.org/2019/06/24/a-guide-to-keras-functional-api/
n_dense_layers = 30 # increase this to get more complicated models
# Define the layers first.
input_tensor = Input(shape=(1,), name='input')
layers = []
for i in range(n_dense_layers):
layers += [Dense(units=50, activation='relu', name=f'dense_layer_{i}')]
if (i > 0) & (i % 5 == 0):
# avg over batches not features
layers += [BatchNormalization(axis=1)]
sigmoid_layer = Dense(units=1, activation='sigmoid', name='sigmoid_layer')
# Connect the layers using Keras Functional API
mid_layer = input_tensor
for dense_layer in layers:
mid_layer = dense_layer(mid_layer)
output_tensor = sigmoid_layer(mid_layer)
model = Model(inputs=[input_tensor], outputs=[output_tensor])
optimizer = Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mae', metrics=['mae'])
model.fit(x=[X_train], y=[Y_train], epochs=40000, batch_size=21)
# Predict on various datasets
Y_train_pred = model.predict(X_train)
# Create a dataframe to inspect results manually
train_df = pd.DataFrame({
'x': X_train.reshape((-1)),
'y_true': Y_train.reshape((-1)),
'y_pred': Y_train_pred.reshape((-1))
})
train_df['error'] = train_df['y_pred'] - train_df['y_true']
print(train_df)
# A dictionary to store all the errors in one place.
train_errors = {
'me': np.mean(train_df['error']),
'mae': np.mean(np.abs(train_df['error'])),
'mse': np.mean(np.square(train_df['error'])),
'rmse': np.sqrt(np.mean(np.square(train_df['error']))),
}
print(train_errors)
# Make a plot to visualize true vs predicted
plt.figure(1)
plt.clf()
plt.plot(train_df['x'], train_df['y_true'], 'r.', label='y_true')
plt.plot(train_df['x'], train_df['y_pred'], 'bo', alpha=0.25, label='y_pred')
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'Train data. MSE={np.round(train_errors["mse"], 5)}.')
plt.legend()
plt.show(block=False)
plt.savefig('true_vs_pred.png')

A problem you may encountering is that you don't have enough training data for the model to be able to fit well. In your example, you only have 21 training instances, each with only 1 feature. Broadly speaking with neural network models, you need on the order of 10K or more training instances to produce a decent model.
Consider the following code that generates a noisy sine wave and tries to train a densely-connected feed-forward neural network to fit the data. My model has two linear layers, each with 50 hidden units and a ReLU activation function. The experiments are parameterized with the variable num_points which I will increase.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(7)
def generate_data(num_points=100):
X = np.linspace(0.0 , 2.0 * np.pi, num_points).reshape(-1, 1)
noise = np.random.normal(0, 1, num_points).reshape(-1, 1)
y = 3 * np.sin(X) + noise
return X, y
def run_experiment(X_train, y_train, X_test, batch_size=64):
num_points = X_train.shape[0]
model = keras.Sequential()
model.add(layers.Dense(50, input_shape=(1, ), activation='relu'))
model.add(layers.Dense(50, activation='relu'))
model.add(layers.Dense(1, activation='linear'))
model.compile(loss = "mse", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=10,
batch_size=batch_size, verbose=0)
yhat = model.predict(X_test, batch_size=batch_size)
plt.figure(figsize=(5, 5))
plt.plot(X_train, y_train, "ro", markersize=2, label='True')
plt.plot(X_train, yhat, "bo", markersize=1, label='Predicted')
plt.ylim(-5, 5)
plt.title('N=%d points' % (num_points))
plt.legend()
plt.grid()
plt.show()
Here is how I invoke the code:
num_points = 100
X, y = generate_data(num_points)
run_experiment(X, y, X)
Now, if I run the experiment with num_points = 100, the model predictions (in blue) do a terrible job at fitting the true noisy sine wave (in red).
Now, here is num_points = 1000:
Here is num_points = 10000:
And here is num_points = 100000:
As you can see, for my chosen NN architecture, adding more training instances allows the neural network to better (over)fit the data.
If you do have a lot of training instances, then if you want to purposefully overfit your data, you can either increase the neural network capacity or reduce regularization. Specifically, you can control the following knobs:
increase the number of layers
increase the number of hidden units
increase the number of features per data instance
reduce regularization (e.g. by removing dropout layers)
use a more complex neural network architecture (e.g. transformer blocks instead of RNN)
You may be wondering if neural networks can fit arbitrary data rather than just a noisy sine wave as in my example. Previous research says that, yes, a big enough neural network can fit any data. See:
Universal approximation theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem
Zhang 2016, "Understanding deep learning requires rethinking generalization". https://arxiv.org/abs/1611.03530

As discussed in the comments, you should make a Python array (with NumPy) like this:-
Myarray = [[0.65, 1], [0.85, 0.5], ....]
Then you would just call those specific parts of the array whom you need to predict. Here the first value is the x-axis value. So you would call it to obtain the corresponding pair stored in Myarray
There are many resources to learn these types of things. some of them are ===>
https://www.geeksforgeeks.org/python-using-2d-arrays-lists-the-right-way/
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=video&cd=2&cad=rja&uact=8&ved=0ahUKEwjGs-Oxne3oAhVlwTgGHfHnDp4QtwIILTAB&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQgfUT7i4yrc&usg=AOvVaw3LympYRszIYi6_OijMXH72

Related

Decision tree regression producing multiple lines

I'm trying to make a single variable regression using decision tree regression. However when I'm plotting the results. Multiple lines show in the plot just like the photo below. I didn't encounter this problem when I used linear regression.
https://snipboard.io/v9QaoC.jpg - I can't post images since i have less than 10 reputation
My code:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
regr_2.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))
# Predict
y_1 = regr_1.predict(X_test.values.reshape(-1, 1))
y_2 = regr_2.predict(X_test.values.reshape(-1, 1))
# Plot the results
plt.figure()
plt.scatter(X_train, y_train, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
Your plot is likely unattractive because your test samples aren't sorted, so you are 'connecting the dots' between different test datapoints randomly. This was unclear for your linear regression solution because the lines were overlapping.
You can get the plot you expect by sorting your test data:
# Sort
X_test = np.sort(X_test) # Need to specify axis=0 if X_test has shape (n_samples, 0)
# Predict
y_1 = regr_1.predict(X_test.values.reshape(-1, 1))
y_2 = regr_2.predict(X_test.values.reshape(-1, 1))
# Plot the results
plt.figure()
plt.scatter(X_train, y_train, s=20, edgecolor="black", c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()

NLP CNN with Embedding, predicting 5 values from Twitter text

I have a bunch of twitter texts (around 70K) that are around 10K words. Some are less and some are more. I have created a Keras architecture to predict 5 values for each Twitter texts and I have trained on those 70K. However, the accuracy (which is defined as follows: a match of pred1 and pred2 happens when all respective 5 values are with no more than 10 difference) is 21% (21% of the test data comply to the mentioned condition). I am not sure that the architecture, the tokenizer and the parameters are appropriate for this problem, but I will provide the code and ask for help. I would appreciate if someone could help me figure out why the accuracy is so low. Here is the model:
class NeuralNetMulti(Regressor):
def __init__(self):
self.name = 'keras-sequential'
self.model = Sequential()
self.num_words = 35000
self.tokenizer = Tokenizer(num_words=self.num_words, lower=True)
# self.earlystopping = callbacks.EarlyStopping(monitor="mae",
# mode="min", patience=5,
# restore_best_weights=True)
def fit(self, X, y):
print('Fitting into the neural net...')
#n_inputs = X.shape[1]
n_outputs = y.shape[1]
self.tokenizer.fit_on_texts(X)
encoded_docs = self.tokenizer.texts_to_sequences(X)
max_length = max([len(s.split()) for s in X])
self.max_length = max_length
X_train = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
vocab_size = len(self.tokenizer.word_index) + 1
print(max_length)
self.model.add(Embedding(self.num_words, 512, input_length=max_length))
self.model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
self.model.add(MaxPooling1D(pool_size=2))
self.model.add(Conv1D(filters=16, kernel_size=4, activation='relu'))
self.model.add(MaxPooling1D(pool_size=2))
self.model.add(Conv1D(filters=8, kernel_size=4, activation='relu'))
self.model.add(MaxPooling1D(pool_size=2))
self.model.add(Flatten())
self.model.add(Dense(1024, activation='relu'))
self.model.add(Dense(512, activation='relu'))
self.model.add(Dense(256, activation='relu'))
self.model.add(Dense(n_outputs))
self.model.summary()
self.model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])
history = self.model.fit(X_train, y, verbose=1, epochs=5, validation_split=0.1, batch_size=16)
def predict(self, X):
print('Predicting...')
encoded_docs = self.tokenizer.texts_to_sequences(X)
X_test = pad_sequences(encoded_docs, maxlen=self.max_length, padding='post')
predictions = self.model.predict(X_test, verbose=1)
print('Predicted!')
return predictions
X in this case is just an array of strings (the texts). They could be 1000 words, but most of them are around 10K words. y is array of arrays with 5 values that I mentioned. Each of them is between 0 and 100. This model achieves 21% accuracy, but previously I used a TfIdf + PCA and basic Dense network and I achieved 62% accuracy on the same data. I would appreciate anyone experience in this field to give a professional advice. Thank you in advance!
One option is to try using pre-trained word embeddings instead of random embeddings since you are dealing with lots of text here. You can start with static embeddings like GloVe and then try out contextualized embeddings like BERT. Please refer to this Keras documentation to see how you can add pre-trained word vectors - link.

high variance with Randomforest learner

I'm using Random Forest Regressor to fit a 10-dimensional regression problem with around 300 thousand samples. Although not necessary when dealing with Random Forest I started by putting the data on the same scale (by using preprocessing of sklearn) and then I did a randomised search over the following parameter space:
n_estimators=[int(x) for x in linspace (start=100, stop= 2000, num=11)]
max_features= auto, sqrt
max_depth= from 1- to 150 with step =11
min_sampl_split=2,5,10,12
min_samples_leaf=1,2,4,6
Bootstrap true or false
Moreover, after getting the best parameters I did a second narrower search.
Though I am using a 10-Fold cross validation scheme with the random search I'm still getting a serious overfitting problem!
Moreover, I have also tried using DBSCAN algorithm to check for outliers. After excluding some parts of the dataset I got even worse results!
Should I include other parameters of the Random Forest in the randomised search? or should I apply some more preprocessing techniques on the data set before fitting?
For convenience, this is my implementation I wrote:
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start = 1, stop =
15, num = 15)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
min_samples_split = [2, 5, 10,12]
min_samples_leaf = [1, 2, 4,6]
bootstrap = [True, False]
cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions
= random_grid, n_iter = 50, cv = cv, verbose=2, random_state=42,
n_jobs = 32)
rf_random.fit(x_train, y_train)
the best parameters returned by the randomizedsearch function:
bootstrap: Fasle. Min_samples_leaf=2. n_estimators= 1647. Max_features: sqrt. min_samples_split=3. Max_depth: None.
The range of the target is from 0 to 10000 [unit]. This model is resulting in 6.98 [unit] RMSE accuracy on the training set and and average of 67.54 [unit] RMSE accuracy on the test sets.
that line
max_depth= from 1- to 150 with step =11
For a 10 feature problem, the optimum depth is under 10. You are overfitting like crazy beacause of that. consider putting max_depth from 1 to 15 with step 1
min_sampl_split=2,5,10,12
min_samples_leaf=1,2,4,6
This should help reduce the variance, however, the step of 11 for max_depth is killing all the efforts you could possibly make

Keras convolutional network scoring low on CIFAR-10 Dataset

I'm trying to train a CNN on the CIFAR-10 Dataset in Keras, but I'm only getting around 10% accuracy, essentially random. I'm training over 50 epochs, with a batch size of 32 and learning rate of 0.01. Is there anything in particular that I am doing wrong?
import os
import numpy as np
import pandas as pd
from PIL import Image
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, MaxPool2D, Dropout, Flatten
from keras.optimizers import SGD
from keras.utils import np_utils
# trainingData = np.array([np.array(Image.open("train/" + f)) for f in os.listdir("train")]) #shape: 50k, 32, 32, 3
# testingData = np.array([np.array(Image.open("test/" + f)) for f in os.listdir("test")]) #shape: same as training
#
# trainingLabels = np.array(pd.read_csv("trainLabels.csv"))[:,1] #categorical labels ["dog", "cat", "etc"....]
# listOfLabels = sorted(list(set(trainingLabels)))
# trainingOutput = np.array([np.array([1.0 if label == ind else 0.0 for ind in listOfLabels]) for label in trainingLabels]) #converted to output
# #for example: training output for dog =
# #[1.0, 0.0, 0.0, ...]
# np.save("trainingInput.np", trainingData)
# np.save("testingInput.np", testingData)
# np.save("trainingOutput.np", trainingOutput)
trainingInput = np.load("trainingInput.npy") #shape = 50k, 32, 32, 3
testingInput = np.load("testingInput.npy") #shape = 10k, 32, 32, 3
listOfLabels = sorted(list(set(np.array(pd.read_csv("trainLabels.csv"))[:,1]))) #categorical list of labels as strings
trainingOutput = np.load("trainingOutput.npy") #shape = 50k, 10
#looks like [0.0, 1.0, 0.0 ... 0.0, 0.0]
print(listOfLabels)
print("Data loaded\n______________\n")
inp = Input(shape=(32, 32, 3))
conva1 = Conv2D(64, (3, 3), padding='same', activation='relu')(inp)
conva2 = Conv2D(64, (3, 3), padding='same', activation='relu')(conva1)
poola = MaxPool2D(pool_size=(3, 3))(conva2)
dropa = Dropout(0.1)(poola)
convb1 = Conv2D(128, (5, 5), padding='same', activation='relu')(dropa)
convb2 = Conv2D(128, (5, 5), padding='same', activation='relu')(convb1)
poolb = MaxPool2D(pool_size=(3, 3))(convb2)
dropb = Dropout(0.1)(poolb)
flat = Flatten()(dropb)
dropc = Dropout(0.5)(flat)
out = Dense(len(listOfLabels), activation='softmax')(dropc)
print(out.shape)
model = Model(inputs=inp, outputs=out)
lrSet = SGD(lr=0.01, clipvalue=0.5)
model.compile(loss='categorical_crossentropy', optimizer=lrSet, metrics=['accuracy'])
model.fit(trainingInput, trainingOutput, batch_size=32, epochs=50, verbose=1, validation_split=0.1)
print(model.predict(testingInput))
Is there anything in particular that I am doing wrong?
Not necessarily "wrong", but some pointers I can suggest are:
It is important that you rescale your data, in case you are not doing so. Instead of handling values ranging from [0,255] it is better to divide all by 255 and handle data with ranges [0,1]. This helps your model's weights converge faster, as each gradient update will be more significant compared to it's unscaled version.
I think that your dropout may be affecting your performance. Even more seeing that you are using CNNs and a strong (0.5) Dropout when passing data to your output. Quoting this great answer:
In the original paper that proposed dropout layers, by Hinton (2012), dropout (with p=0.5) was used on each of the fully connected (dense) layers before the output; it was not used on the convolutional layers. This became the most commonly used configuration.
More recent research has shown some value in applying dropout also to convolutional layers, although at much lower levels: p=0.1 or 0.2.
So perhaps reducing your dropout or playing with it a bit will yield better results. Do notice that you are doing consecutive dropouts on your data, which doesn't seem quite helpful in my opinion and could also be causing problem, so consider redesigning that part:
dropb = Dropout(0.1)(poolb) #drop
flat = Flatten()(dropb) #flatten
dropc = Dropout(0.5)(flat) #then drop again?
Your learning rate may be higher than what is normally used. Although that is SGD's default learning rate, with higher learning values you may be "rushing" your training and failing to find better minima that could yield better performance. Consider using a lower learning rate (0.001 or lower, adjust epochs as needed), or well adding weight decay on your SGD instance. This will prevent your model from getting stuck on local minima that give sub-optimal results.

Under what parameters are SVC and LinearSVC in scikit-learn equivalent?

I read this thread about the difference between SVC() and LinearSVC() in scikit-learn.
Now I have a data set of binary classification problem(For such a problem, the one-to-one/one-to-rest strategy difference between both functions could be ignore.)
I want to try under what parameters would these 2 functions give me the same result. First of all, of course, we should set kernel='linear' for SVC()
However, I just could not get the same result from both functions. I could not find the answer from the documents, could anybody help me to find the equivalent parameter set I am looking for?
Updated:
I modified the following code from an example of the scikit-learn website, and apparently they are not the same:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2] # we only take the first two features. We could
# avoid this ugly slicing by using a two-dim dataset
y = iris.target
for i in range(len(y)):
if (y[i]==2):
y[i] = 1
h = .02 # step size in the mesh
# we create an instance of SVM and fit out data. We do not scale our
# data since we want to plot the support vectors
C = 1.0 # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C, dual = True, loss = 'hinge').fit(X, y)
# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
# title for the plots
titles = ['SVC with linear kernel',
'LinearSVC (linear kernel)']
for i, clf in enumerate((svc, lin_svc)):
# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, m_max]x[y_min, y_max].
plt.subplot(1, 2, i + 1)
plt.subplots_adjust(wspace=0.4, hspace=0.4)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)
# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())
plt.title(titles[i])
plt.show()
Result:
Output Figure from previous code
In mathematical sense you need to set:
SVC(kernel='linear', **kwargs) # by default it uses RBF kernel
and
LinearSVC(loss='hinge', **kwargs) # by default it uses squared hinge loss
Another element, which cannot be easily fixed is increasing intercept_scaling in LinearSVC, as in this implementation bias is regularized (which is not true in SVC nor should be true in SVM - thus this is not SVM) - consequently they will never be exactly equal (unless bias=0 for your problem), as they assume two different models
SVC : 1/2||w||^2 + C SUM xi_i
LinearSVC: 1/2||[w b]||^2 + C SUM xi_i
Personally I consider LinearSVC one of the mistakes of sklearn developers - this class is simply not a linear SVM.
After increasing intercept scaling (to 10.0)
However, if you scale it up too much - it will also fail, as now tolerance and number of iterations are crucial.
To sum up: LinearSVC is not linear SVM, do not use it if do not have to.

Resources