Mean centering with incomplete data - machine-learning

In my Kaggle competitions I have always mean-centered my input data (centered it close to zero). Whenever I skip that step, my neural network doesn't converge well.
I'm using sklearn.preprocessing.MinMaxScaler:
import numpy as np
from sklearn import preprocessing
samples = np.array([0, 1, 2, 3]).reshape(-1, 1)
scaler = preprocessing.MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(samples))
[[-1.        ]
 [-0.33333333]
 [ 0.33333333]
 [ 1.        ]]
However, I'm currently working on an RL project where the data come from self-play. That means my scaler can't be fit on all the data up front. My NN predicts Q-values and learns from the actual Q-learning results. These Q-values differ widely across learning episodes, e.g. 0.024, 24e-10, 24e-20, 24e-40.
Is there a different approach for incomplete data?
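One possible direction, sketched under assumptions rather than as a tested RL recipe: scikit-learn scalers such as StandardScaler and MinMaxScaler expose partial_fit, which updates the fitted statistics one batch at a time. The episode loop and the random batches below are made up for illustration and stand in for whatever self-play produces.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()

batches = []
for episode in range(5):
    # Hypothetical stand-in for the data produced by one self-play episode.
    batch = rng.normal(loc=episode, scale=2.0, size=(32, 3))
    batches.append(batch)
    scaler.partial_fit(batch)          # update the running mean and variance
    scaled = scaler.transform(batch)   # scale with the statistics seen so far

all_data = np.vstack(batches)
print(scaler.mean_, all_data.mean(axis=0))
```

Note that earlier batches were scaled with older statistics, so whether this is stable enough for training would need to be checked on the actual project.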


Accuracy score in K-nearest Neighbour Classifier not matching with GridSearchCV

I'm learning Machine Learning and I'm facing a mismatch I can't explain.
I have a grid to compute the best model, according to the accuracy returned by GridSearchCV.
n_neighbors=[3, 4, 5, 6, 7, 8, 9]
param_grid = dict(n_neighbors=n_neighbors, weights=weights, algorithm=algorithm, leaf_size=leaf_size, p=p)
grid = sklearn.model_selection.GridSearchCV(estimator=model, param_grid=param_grid, cv = 5, n_jobs=1)
SGDgrid = grid.fit(data1, targetd_simp['VALUES'])
print("SGD Classifier: ")
print("Best: ")
print("Best estimator:")
The results I get are the following:
SGD Classifier:
{'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 8, 'p': 1, 'weights': 'distance'}
Best estimator:
KNeighborsClassifier(leaf_size=20, n_neighbors=8, p=1, weights='distance')
[[4962 0 0]
[ 0 4802 0]
[ 0 0 4853]]
This model is probably highly overfitted. I still need to check that, but it's not the question here.
So, basically, if I understand correctly, GridSearchCV is finding a best accuracy score of 0.3869 (quite poor) for one of the chunks in the cross-validation, but the final confusion matrix is perfect, and so is the accuracy computed from it. That doesn't make much sense to me. How can a model that is bad in theory perform so well?
I also added scoring = 'accuracy' in GridSearchCV to be sure that the returned value is actually accuracy, and it returns exactly the same value.
What am I missing here?
The behavior you are describing is normal and to be expected. GridSearchCV has a parameter refit, which is set to True by default. It triggers the following:
Refit an estimator using the best found parameters on the whole dataset.
This means that the estimator returned by best_estimator_ has been refit on your whole dataset (data1 in your case). It is therefore data that the estimator has already seen during training and, expectedly, performs especially well on it. You can easily reproduce this with the following example:
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
X, y = make_classification(random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
search = GridSearchCV(KNeighborsClassifier(), param_grid={'n_neighbors': [3, 4, 5]})
search.fit(X_train, y_train)
print(search.best_score_)
>>> 0.8533333333333333
print(accuracy_score(y_train, search.predict(X_train)))
>>> 0.9066666666666666
While this is not as impressive as in your case, it is still a clear result. During cross-validation, the model is validated against one fold that was not used for training the model, and thus, against data the model has not seen before. In the second case, however, the model already saw all data during training and it is to be expected that the model will perform better on them.
To get a better feeling of the true model performance, you should use a holdout set with data the model has not seen before:
print(accuracy_score(y_test, search.predict(X_test)))
>>> 0.76
As you can see, the model performs considerably worse on this data and shows us that the former metrics were all a bit too optimistic. The model did in fact not generalize that well.
In conclusion, your result is not surprising and has an easy explanation. The high discrepancy in scores is impressive but still follows the same logic and is actually just a clear indicator of overfitting.
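As a rough sketch of how to get a confusion matrix that is comparable to the cross-validation score, cross_val_predict returns out-of-fold predictions. The parameters n_neighbors=8 and weights='distance' are taken from the question; the synthetic dataset is an assumption. With distance weighting, each training point is its own nearest neighbour at distance 0, which may also explain why the matrix in the question is exactly diagonal (assuming there are no duplicate rows with conflicting labels).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(random_state=7)
knn = KNeighborsClassifier(n_neighbors=8, weights='distance')

# Predictions on the training data itself: every point finds itself at
# distance 0, so distance weighting simply reproduces its own label.
train_pred = knn.fit(X, y).predict(X)
train_acc = accuracy_score(y, train_pred)

# Out-of-fold predictions: every sample is predicted by a model that
# did not see it during fitting.
oof_pred = cross_val_predict(knn, X, y, cv=5)
oof_acc = accuracy_score(y, oof_pred)

print(train_acc, oof_acc)
print(confusion_matrix(y, oof_pred))
```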

How to load unlabelled data for sentiment classification after training SVM model?

I am trying to do sentiment classification and I used sklearn SVM model. I used the labeled data to train the model and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
I used python 3.7. Below is the code.
import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   preprocessor=lambda text: text.replace("<br />", " "))),
    ('classifier', LinearSVC())
])
clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))
When I run this code, I get the output:
ConvergenceWarning: Liblinear failed to converge, increase the number of iterations. "the number of iterations.", ConvergenceWarning)
Accuracy : 0.8977272727272727
Precision : 0.8604651162790697
Recall : 0.925
What is the meaning of ConvergenceWarning?
Thanks in Advance!
What is the meaning of ConvergenceWarning?
As Pavel already mentioned, ConvergenceWarning means that max_iter was hit. You can suppress the warning as described here: How to disable ConvergenceWarning using sklearn?
Now I want to use the model to predict the sentiment of unlabeled
data. How can I do that?
You do it with the same command: pred_y = clf.predict(test_x). The only things you change are pred_y (the name is your free choice) and test_x, which should be your new unseen data. It has to have the same format as test_x and train_x.
In your case as you are doing:
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
You are forming a list of tuples (check this out),
then you are shuffling it and unzipping the first 350 rows:
train_x, train_y = zip(*sentiment_data[:350])
Here your train_x is the column data['Articles'], so if you have new data, all you have to do is:
new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])
how to see whether it is classified as positive or negative?
You can then print the predictions (pred_y, or new_y above), and each entry will be either a 1 or a 0. Normally 0 should be negative, but it depends on how your dataset is set up.
Check out this site about model's persistence. Then you just load it and call predict method. Model will return predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
If I were you, I'd rather take a fully data-driven approach and use some pretrained embedder. It'll also work for dozens of languages out of the box, which is quite neat.
There's LASER from Facebook. There's also a PyPI package, though it is unofficial. It works just fine.
Nowadays there are a lot of pretrained models, so it shouldn't be that hard to reach near-seminal scores.
Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?
Basically, you prepare the unlabeled data the same way train_x or test_x is generated. Here that is a sequence of n_samples article strings, which you then pass to clf.predict to obtain predictions. clf.predict outputs the most probable class. In your case 0 is probably negative and 1 positive, but it's hard to tell without the dataset.
What is the meaning of ConvergenceWarning?
The LinearSVC model is optimized using an iterative algorithm. The max_iter argument (1000 by default) controls the maximum number of iterations. If the stopping criterion isn't met within that budget, you get a ConvergenceWarning. It shouldn't bother you much as long as you have acceptable performance in terms of accuracy or other metrics.
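To make both points concrete, here is a minimal sketch on a tiny made-up corpus (the texts, labels and the label_names mapping are assumptions, not the asker's data). It raises max_iter to give the solver more room and then maps the predicted 0/1 labels to readable names.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled data: 1 = positive, 0 = negative.
train_texts = ["great movie, loved it", "wonderful and fun",
               "terrible plot, hated it", "boring and awful"]
train_labels = [1, 1, 0, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC(max_iter=10000)),  # more iterations than the default 1000
])
clf.fit(train_texts, train_labels)

# Unlabeled data is passed in the same raw-text form as the training data.
unlabeled_texts = ["loved this wonderful film", "awful and boring"]
predictions = clf.predict(unlabeled_texts)

label_names = {0: "negative", 1: "positive"}
for text, label in zip(unlabeled_texts, predictions):
    print(f"{label_names[int(label)]}: {text}")
```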

Why should we normalize data for deep learning in Keras?

I was testing some network architectures in Keras for classifying the MNIST dataset. I have implemented one that is similar to the LeNet.
I have seen that in the examples that I have found on the internet, there is a step of data normalization. For example:
X_train /= 255
I ran a test without this normalization and saw that the performance (accuracy) of the network decreased (keeping the same number of epochs). Why does this happen?
If I increase the number of epochs, can the accuracy reach the same level as the model trained with normalization?
So, does normalization affect the accuracy, or only the training speed?
The complete source code of my training script is below:
from keras.models import Sequential
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.core import Activation
from keras.layers.core import Flatten
from keras.layers.core import Dense
from keras.datasets import mnist
from keras.utils import np_utils
from keras.optimizers import SGD, RMSprop, Adam
import numpy as np
import matplotlib.pyplot as plt
from keras import backend as k
def build(input_shape, classes):
    model = Sequential()
    model.add(Conv2D(20, kernel_size=5, padding="same", activation='relu', input_shape=input_shape))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Conv2D(50, kernel_size=5, padding="same", activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    return model
NB_EPOCH = 4 # number of epochs
BATCH_SIZE = 128 # size of the batch
VERBOSE = 1 # set the training phase as verbose
OPTIMIZER = Adam() # optimizer
VALIDATION_SPLIT = 0.2 # percentage of the training data used for evaluating the loss function
IMG_ROWS, IMG_COLS = 28, 28 # input image dimensions
NB_CLASSES = 10 # number of outputs = number of digits
INPUT_SHAPE = (1, IMG_ROWS, IMG_COLS) # shape of the input
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
X_train = X_train[:, np.newaxis, :, :]
X_test = X_test[:, np.newaxis, :, :]
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
y_train = np_utils.to_categorical(y_train, NB_CLASSES)
y_test = np_utils.to_categorical(y_test, NB_CLASSES)
model = build(input_shape=INPUT_SHAPE, classes=NB_CLASSES)
history = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=NB_EPOCH, verbose=VERBOSE, validation_split=VALIDATION_SPLIT)
model.save("model2")
score = model.evaluate(X_test, y_test, verbose=VERBOSE)
print('Test accuracy:', score[1])
Normalization is a generic concept not limited only to deep learning or to Keras.
Why to normalize?
Let me take a simple logistic regression example which will be easy to understand and to explain normalization.
Assume we are trying to predict if a customer should be given loan or not. Among many available independent variables lets just consider Age and Income.
Let the equation be of the form:
Y = weight_1 * (Age) + weight_2 * (Income) + some_constant
Just for the sake of explanation, let Age usually be in the range [0, 120] and let Income be in the range [10000, 100000]. The scales of Age and Income are very different. If you use them as is, weight_1 and weight_2 may end up biased: weight_2 might give Income more importance as a feature than weight_1 gives Age. To bring them to a common scale, we can normalize them. For example, we can map all ages to the range [0, 1] and all incomes to the range [0, 1]. Now Age and Income are given equal importance as features.
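A small sketch of that rescaling, using made-up customers (the numbers are assumptions for illustration only). MinMaxScaler maps each column to [0, 1] independently, so Age and Income end up on the same scale.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up customers: columns are Age and Income.
customers = np.array([
    [22, 18000],
    [35, 52000],
    [47, 75000],
    [63, 100000],
], dtype=float)

scaled = MinMaxScaler().fit_transform(customers)
print(scaled)
```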
Does Normalization always increase the accuracy?
No. Normalization does not necessarily increase accuracy. It may or may not; you never really know until you try it. It also depends on the stage of training at which you apply normalization, on whether you apply normalization after every activation, etc.
Because normalization narrows the feature values down to a particular range, it's easier to perform computations over that smaller range of values. So the model usually trains a bit faster.
Regarding the number of epochs, accuracy usually increases with number of epochs provided that your model doesn't start over-fitting.
A very good explanation for Normalization/Standardization and related terms is here.
In a nutshell, normalization reduces the complexity of the problem your network is trying to solve. This can potentially increase the accuracy of your model and speed up the training. You bring the data on the same scale and reduce variance. None of the weights in the network are wasted on doing a normalization for you, meaning that they can be used more efficiently to solve the actual task at hand.
As @Shridhar R Kulkarni says, normalization is a general concept and doesn’t only apply to Keras.
It’s often applied as part of data preparation for ML learning models to change numeric values in the dataset to fit a standard scale without distorting the differences in their ranges. As such, normalization enhances the cohesion of entity types within a model by reducing the probability of inconsistent data.
However, not every dataset and use case requires normalization. It’s primarily necessary when features have different ranges. You may use it when:
- You want to improve your model’s convergence efficiency and make optimization feasible
- You want to make training less sensitive to the scale of features, so that coefficients can be solved for better
- You want to improve analysis from multiple models
Normalization is not recommended when:
- Using decision tree models or ensembles based on them
- Your data is not normally distributed; you may have to use other data pre-processing techniques
- Your dataset already comprises scaled variables
In some cases, normalization can improve performance. However, it is not always necessary.
The critical thing is to understand your dataset and scenario first, then you’ll know whether you need it or not. Sometimes, you can experiment to see if it gives you good performance or not.
Check out deepchecks and see how to deal with important data-related checks you come across in ML.
For example, to check for duplicated data in your set, you can use the following code:
from deepchecks.checks.integrity.data_duplicates import DataDuplicates
from deepchecks.base import Dataset, Suite
from datetime import datetime
import pandas as pd
I think there are also some issues with the convergence of the optimizer. Here I show a simple linear regression with three examples:
The first uses an array with small values, and it works as expected.
The second uses an array with bigger values, and the loss explodes toward infinity, suggesting the need to normalize. Finally, model 3 uses the same array as case two, but normalized, and we get convergence.
github colab enabled ipython notebook
I used MSE as the loss function; I don't know whether other losses or optimizers suffer from the same issue.
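The effect can be reproduced without Keras. The following is a minimal NumPy sketch, not the notebook's code: plain gradient descent on the mean squared error, with a learning rate, step count and blow-up threshold chosen here as assumptions. The same data is fitted once raw and once standardized.

```python
import numpy as np

def fit_line(x, y, lr=0.1, steps=100):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    loss = float(np.mean((w * x + b - y) ** 2))
    for _ in range(steps):
        err = w * x + b - y
        w -= lr * 2 * np.mean(err * x)   # gradient of the MSE w.r.t. w
        b -= lr * 2 * np.mean(err)       # gradient of the MSE w.r.t. b
        loss = float(np.mean((w * x + b - y) ** 2))
        if loss > 1e12:
            break  # the loss has blown up; stop before it overflows
    return w, b, loss

x = np.linspace(0, 100, 50)   # "bigger values"
y = 3 * x + 2

_, _, raw_loss = fit_line(x, y)             # unnormalized input
x_scaled = (x - x.mean()) / x.std()         # zero mean, unit variance
_, _, scaled_loss = fit_line(x_scaled, y)   # same data, normalized input

print(raw_loss, scaled_loss)
```

With the raw input the loss grows at every step until the loop bails out, while the standardized input lets the same learning rate converge.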

neural network produces similar pattern for all inputs

I am attempting to train an ANN on time series data in Keras. I have three vectors of data that are broken into scrolling window sequences (i.e. for vector l).
np.array([l[i:i+window_size] for i in range( len(l) - window_size)])
The target vector is similarly windowed so the neural net output is a prediction of the target vector for the next window_size number of time steps. All the data is normalized with a min-max scaler. It is fed into the neural network as a shape=(nb_samples, window_size, 3). Here is a plot of the 3 input vectors.
The only output I've managed to muster from the ANN is the following plot. Target vector in blue, predictions in red (plot is zoomed in to make the prediction pattern legible). Prediction vectors are plotted at window_size intervals so each one of the repeated patterns is one prediction from the net.
I've tried many different model architectures, number of epochs, activation functions, short and fat networks, skinny, tall. This is my current one (it's a little out there).
Conv1D(64,4, input_shape=(None,3)) ->
Conv1D(32,4) ->
Dropout(24) ->
LSTM(32) ->
But nothing I try will affect the neural net from outputting this repeated pattern. I must be misunderstanding something about time-series or LSTMs in Keras. But I'm very lost at this point so any help is greatly appreciated. I've attached the full code at this repository.
I played with your code a little and I think I have a few suggestions for getting you on the right track. The code doesn't seem to match your graphs exactly, but I assume you've tweaked it a bit since then. Anyway, there are two main problems:
The biggest problem is in your data preparation step. You basically have the data shapes backwards, in that you have a single timestep of input for X and a timeseries for Y. Your input shape is (18830, 1, 8), when what you really want is (18830, 30, 8) so that the full 30 timesteps are fed into the LSTM. Otherwise the LSTM is only operating on one timestep and isn't really useful. To fix this, I changed the line from
X = X.reshape(X.shape[0], 1, X.shape[1])
to
X = windowfy(X, winsize)
Similarly, the output data should probably be only 1 value, from what I've gathered of your goals from the plotting function. There are certainly some situations where you want to predict a whole timeseries, but I don't know if that's what you want in this case. I changed Y_train to use fuels instead of fuels_w so that it only had to predict one step of the timeseries.
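The repository's windowfy is not shown in this thread, so the version below is a hypothetical stand-in based on the list comprehension in the question. The array sizes and the fuels values are made up; the point is only the resulting shapes, with each window paired with the single value that follows it.

```python
import numpy as np

def windowfy(data, winsize):
    """Stack sliding windows of length `winsize` along a new leading axis."""
    return np.array([data[i:i + winsize] for i in range(len(data) - winsize)])

# Made-up data: 100 timesteps with 8 features, plus one target series.
features = np.arange(100 * 8, dtype=float).reshape(100, 8)
fuels = np.arange(100, dtype=float)
winsize = 30

X = windowfy(features, winsize)   # (samples, timesteps, features)
y = fuels[winsize:]               # the value right after each window

print(X.shape, y.shape)
```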
Training for 100 epochs might be way too much for this simple network architecture. In some cases when I ran it, it looked like there was some overfitting going on. Observing the decrease of loss in the network, it seems like maybe only 3-4 epochs are needed.
Here is the graph of predictions after 3 training epochs with the adjustments I mentioned. It's not a great prediction, but it looks like it's on the right track now at least. Good luck to you!
EDIT: Example predicting multiple output timesteps:
from sklearn import datasets, preprocessing
import numpy as np
from scipy import stats
from keras import models, layers
INPUT_WINDOW = 10 # Feed 10 steps of the input variables.
OUTPUT_WINDOW = 5 # Predict 5 steps of the output variable.
# Randomly generate some regression data (not true sequential data; samples are independent).
X, y = datasets.make_regression(n_samples=1000, n_features=4, noise=.1)
# Rescale 0-1 and convert into windowed sequences.
X = preprocessing.MinMaxScaler().fit_transform(X)
y = preprocessing.MinMaxScaler().fit_transform(y.reshape(-1, 1))
X = np.array([X[i:i + INPUT_WINDOW] for i in range(len(X) - INPUT_WINDOW)])
y = np.array([y[i:i + OUTPUT_WINDOW] for i in range(INPUT_WINDOW - OUTPUT_WINDOW,
len(y) - OUTPUT_WINDOW)])
print(np.shape(X)) # (990, 10, 4) - Ten timesteps of four features
print(np.shape(y)) # (990, 5, 1) - Five timesteps of one feature
# Construct a simple model predicting output sequences.
m = models.Sequential()
m.add(layers.LSTM(20, activation='relu', return_sequences=True, input_shape=(INPUT_WINDOW, 4)))
m.add(layers.LSTM(20, activation='relu'))
m.add(layers.RepeatVector(OUTPUT_WINDOW))
m.add(layers.LSTM(20, activation='relu', return_sequences=True))
m.add(layers.wrappers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
m.compile(optimizer='adam', loss='mse')
m.fit(X[:800], y[:800], batch_size=10, epochs=60) # Train on first 800 sequences.
preds = m.predict(X[800:], batch_size=10) # Predict the remaining sequences.
print('Prediction:\n' + str(preds[0]))
print('Actual:\n' + str(y[800]))
# Correlation should be around r = .98, essentially perfect.
print('Correlation: ' + str(stats.pearsonr(y[800:].flatten(), preds.flatten())[0]))

Keras LSTM Time Series

I have a problem and at this point I'm completely lost as to how to solve it. I'm using Keras with an LSTM layer to project a time series. I'm trying to use the previous 10 data points to predict the 11th.
Here's the code:
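import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
def _load_data(data):
    """
    data should be pd.DataFrame()
    """
    n_prev = 10
    docX, docY = [], []
    for i in range(len(data)-n_prev):
        docX.append(data.iloc[i:i+n_prev].values)  # window of n_prev rows
        docY.append(data.iloc[i+n_prev].values)    # the row right after the window
    if not docX:
        pass
    else:
        alsX = np.array(docX)
        alsY = np.array(docY)
        return alsX, alsY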
X, y = _load_data(df_test)
X_train = X[:25]
X_test = X[25:]
y_train = y[:25]
y_test = y[25:]
in_out_neurons = 2
hidden_neurons = 300
model = Sequential()
model.add(LSTM(in_out_neurons, hidden_neurons, return_sequences=False))
model.add(Dense(hidden_neurons, in_out_neurons))
model.compile(loss="mean_squared_error", optimizer="rmsprop")
model.fit(X_train, y_train, nb_epoch=10, validation_split=0.05)
predicted = model.predict(X_test)
So I'm taking the input data (a two-column dataframe) and creating X, which is an n by 10 by 2 array, and y, which is an n by 2 array that is one step ahead of the last row in each array of X (labeling the data with the point directly ahead of it).
predicted is returning
[[ 7.56940445, 5.61719704],
[ 7.57328357, 5.62709032],
[ 7.56728049, 5.61216415],
[ 7.55060187, 5.60573629],
[ 7.56717342, 5.61548522],
[ 7.55866942, 5.59696181],
[ 7.57325984, 5.63150951]]
but I should be getting
[[ 73, 48],
[ 74, 42],
[ 91, 51],
[102, 64],
[109, 63],
[ 93, 65],
[ 92, 58]]
The original data set only has 42 rows, so I'm wondering if there just isn't enough there to work with? Or am I missing a key step in the modeling process maybe? I've seen some examples using Embedding layers etc, is that something I should be looking at?
Thanks in advance for any help!
Hey Ryan!
I know it's late, but I just came across your question. I hope it's not too late and that you still find something useful here.
First of all, Stack Overflow may not be the best place for this kind of question. The first reason is that you have a conceptual question, which is not this site's purpose. Moreover, your code runs, so it's not even a matter of general programming. Have a look at stats.
Second, from what I see, there is no conceptual error. You're using everything necessary, that is:
LSTM with proper dimensions
return_sequences=False just before your Dense layer
linear activation for your output
mse cost/loss/objective function
Third, I find it extremely unlikely that your network learns anything with so few pieces of data. You have to understand that you have less data than parameters here! For the great majority of supervised learning algorithms, the first thing you need is not a good model, it's good data. You cannot learn from so few examples, especially not with a complex model such as an LSTM network.
Fourth, it seems like your target data is made of relatively high values. A first pre-processing step here could be to standardize the data: center it around zero (that is, translate your data by its mean) and rescale it by its standard deviation. This really helps learning!
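A minimal sketch of that standardization, using the target values quoted in the question. The pred_scaled line is a placeholder standing in for the network's output, and in practice the scaler should probably be fitted on the training targets only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Target values quoted in the question.
y = np.array([[73, 48], [74, 42], [91, 51], [102, 64],
              [109, 63], [93, 65], [92, 58]], dtype=float)

scaler = StandardScaler()
y_scaled = scaler.fit_transform(y)   # zero mean, unit variance per column

# ... train the network on y_scaled and predict in the scaled space ...
pred_scaled = y_scaled               # placeholder for model.predict(X_test)

# Map predictions back to the original units.
pred = scaler.inverse_transform(pred_scaled)

print(y_scaled.mean(axis=0), y_scaled.std(axis=0))
```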
Fifth, in general, here are a few things you should look into to improve learning and reduce overfitting:
Batch Normalization
Other optimizer (such as Adam)
Gradient clipping
Random hyper parameter search
(This is not exhaustive, if you're reading this and think something should be added, comment it so it's useful for future readers!)
Last but NOT least I suggest you look at this tutorial on Github, especially the recurrent tutorial for time series with keras.
PS: Daniel Hnyk updated his post ;)
