Why is my training model not predicting correct result? - machine-learning

I have taken a stock price and a day number as the input data.
There are about 1365 data points, but my model is not able to predict the correct values of m ( slope ) and b ( intercept ) for my regression problem, using a gradient descent optimizer in TensorFlow.
I have also tried different values for the learning rate ( 0.0000000001, .., 0.1 ), but none of them worked.
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import numpy as np
y_model=m*x_act + b
with tf.Session() as sess:
    for i in range(batches):
        rand_ind = np.random.randint(len(x_data),size=batch_size)
        feed = {x_act:x_data[rand_ind],y_act:y_true[rand_ind]}
y_ans=(model_m*len(x_data)+1) + model_b

Having spent more than 20 years in trading, quant-modeling and Machine Learning augmented decision support for FX-trading, I can help you understand a few things before you start investing your time and efforts in a completely wrong direction. The linear regression model reported above has flaws, detailed below, which will not be salvaged by moving to any of the more complex auto-regressive models ( ARMA / ARIMA ), and similarly even the LSTM-tools will not save a naive, skipped or underestimated system identification ( as it is commonly called in technical cybernetics ). Simply put, any Model setups that try to impose some Model-behaviour and abstract from non-TA behaviour-mode-switching are principally blind and ill-equipped for handling a complex ( almost hyperchaotic, in the extended Lyapunov sense ) multi-agent ecosystem.
Why is my training model not predicting correct result?
Because your assumption is straight wrong.
There are no such stocks, that behave as a linear model, whereas your instructions are strictly opposite,
you ask your linear model yPREDICTED = m.X + b
to find such m and b
so that the overall sum of penalty-errors is minimal.
Having found such m and b, for which the sum of penalty-errors is minimal, the learner that you have pre-selected to use has finished its role.
Right, that means, you can be mathematically sure, there is no such other m and b, that would yield lesser sum of penalty-errors, computed as per your selected method, on the available ( and the same used part thereof ) of the observed examples.
While all was done according to an agreed plan, that still does NOT make The Market to start to "obey" the m.X + b linear behaviour...
If you forget to realise this cast-iron irony, you have just started to blindly believe that a linear model rules the real world ( which, as we witness second by second, it indeed does not ).
So YGWYT -- You Get What You Train
If you train a linear model m.X + b, you cannot be surprised to get nothing else but a least-wrong linear model m.X + b.
Predictions simply have to systematically follow the Model
which means, all your predictions have to systematically be wrong, just by sticking to the least-wrong linear model m.X + b
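To make the "least-wrong linear model" point tangible, here is a minimal sketch. It assumes a made-up random-walk series in place of real stock prices (the names `days`, `price` and `sse` are mine, not from the question) and uses the closed-form least-squares fit from `np.polyfit`, which is the optimum a gradient-descent learner is trying to reach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "price" series: a random walk, which has no reason to follow a line.
days = np.arange(1, 1366, dtype=float)
price = 100.0 + np.cumsum(rng.normal(0.0, 1.0, size=days.size))

def sse(m, b):
    """Sum of squared errors of the line m*x + b over the whole series."""
    return float(np.sum((price - (m * days + b)) ** 2))

# Closed-form least-squares line: the m and b with the minimal sum of penalty-errors.
m, b = np.polyfit(days, price, deg=1)
best = sse(m, b)

# Nudging either parameter away from the optimum only increases the penalty.
perturbed = [sse(m + 1e-3, b), sse(m - 1e-3, b), sse(m, b + 1.0), sse(m, b - 1.0)]

print("least-squares m, b:", m, b)
print("SSE at the optimum:", best)
print("SSE after nudging m or b:", perturbed)
```

No other m and b does better on this data, yet the remaining error is still far from zero: the fit is optimal among lines, not a description of how the series behaves.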


Logistic Regression with Non-Integer feature value

Hi I was following the Machine Learning course by Andrew Ng.
I found that in regression problems, especially logistic regression, they have used integer values for the features which could be plotted in a graph. But there are so many use cases where the feature values may not be integers.
Let's consider the following example:
I want to build a model to predict if any particular person will take a leave today or not. From my historical data I may find the following features helpful to build the training set.
Name of the person, Day of the week, Number of Leaves left for him till now (which maybe a continuous decreasing variable), etc.
So here are the following questions based on above
How do I go about designing the training set for my logistic regression model.
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem, because I know continuously increasing or decreasing variables are used in linear regression. Is that true ?
Any help is really appreciated. Thanks !
Well, a lot of information is missing from your question; for example, it would be much clearer if you had provided all the features you have. But let me dare to make some assumptions!
ML modeling in classification always requires numerical inputs, and you can easily encode each unique input value as an integer, especially the classes!
Now let me try to answer your questions:
How do I go about designing the training set for my logistic regression model.
As I see it, you have two options (not necessarily both practical; you should decide according to the dataset you have and the problem). Either you predict, for all employees in the company, the probability of each being off on a certain day according to the historical data you have (i.e. previous observations); in this case, each employee represents a class (an integer label, starting from 0, for each employee you want to include). Or you create a model for each employee; in this case the classes are either off (i.e. Leave) or on (i.e. Present).
Example 1
I created a dataset example of 70 cases and 4 employees which looks like this:
Here each name is associated with the day and month they took off, along with how many annual leaves they had left!
The implementation (using Scikit-Learn) would be something like this (N.B. the date contains only day and month):
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset.csv')
# assign unique integer to every employee (i.e. a class label)
mapping = {'Jack': 0, 'Oliver': 1, 'Ruby': 2, 'Emily': 3}
df.replace(mapping, inplace=True)
y = np.array(df[['Name']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
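# 'saga' supports both the 'l1' and 'l2' penalties searched below (the current default solver does not support 'l1')
lr = LogisticRegression(solver='saga', max_iter=10000, random_state=0)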
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
# Example: probability of all employees who have 10 days left today
# warning: date must be same format
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Jack': prob[0,0], 'Oliver': prob[0,1], 'Ruby': prob[0,2], 'Emily': prob[0,3]})
{'Ruby': 0.27545, 'Oliver': 0.15032,
'Emily': 0.28201, 'Jack': 0.29219}
To make this work reasonably well you need a really big dataset!
Also this can be better than the second one if there are other informative features in the dataset (e.g. the health status of the employee at that day..etc).
The second option is to create a model for each employee, here the result would be more accurate and more reliable, however, it's almost a nightmare if you have too many employees!
For each employee, you collect all their leaves from past years and concatenate them into one file. In this case you have to fill in every day of the year; in other words, every day that the employee did not take off should be labeled as on (numerically, 1), and every day off should be labeled as off (numerically, 0).
Obviously, in this case, the classes will be 0 and 1 (i.e. on and off) for each employee's model!
For example, consider this dataset example for the particular employee Jack:
Example 2
Then you can do for example:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
# read dataset example
df = pd.read_csv('leaves_dataset2.csv')
# assign unique integer to every on and off (i.e. a class label)
mapping = {'off': 0, 'on': 1}
df.replace(mapping, inplace=True)
y = np.array(df[['Type']]).reshape(-1)
X = np.array(df[['Leaves Left', 'Day', 'Month']])
# create the model
parameters = {'penalty': ['l1', 'l2'], 'C': [0.1, 0.5, 1.0, 10, 100, 1000]}
# 'liblinear' supports both the 'l1' and 'l2' penalties searched below
lr = LogisticRegression(solver='liblinear', random_state=0)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=2, random_state=0)
clf = GridSearchCV(lr, parameters, cv=cv)
clf.fit(X, y)
# Example: probability of the employee "Jack" who has 10 days left today
prob = clf.best_estimator_.predict_proba([[10, 9, 11]])
print({'Off': prob[0,0], 'On': prob[0,1]})
{'On': 0.33348, 'Off': 0.66651}
N.B. in this case you have to create a dataset for each employee, train a separate model for each of them, and fill in all the days they never took off in past years as on!
In my training set, I find some variables are continuously decreasing (ex no of leaves left). Would that create any problem,
because I know continuously increasing or decreasing variables are
used in linear regression. Is that true ?
Well, there is nothing preventing you from using continuous values as features (e.g. number of leaves) in Logistic Regression; actually it doesn't make any difference whether they are used in Linear or Logistic Regression, but I believe you got confused between the features and the response:
The thing is, discrete values should be used in the response of Logistic Regression and Continuous values should be used in the response of the Linear Regression (a.k.a dependent variable or y).
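For a non-numeric feature such as the day of the week, an alternative to mapping each value to a single integer is one-hot encoding. Here is a rough sketch on made-up data; the column names, the labelling rule and all numbers are assumptions for illustration only, not taken from the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
days = np.array(["Mon", "Tue", "Wed", "Thu", "Fri"])

# Hypothetical data: one row per employee-day.
df = pd.DataFrame({
    "day_of_week": rng.choice(days, size=n),
    "leaves_left": rng.integers(0, 21, size=n),  # numeric feature, used as-is
})
# Made-up labelling rule, purely for the demo: leave is more likely on Fridays
# and when many leaves remain.
p_leave = 0.05 + 0.4 * (df["day_of_week"] == "Fri") + 0.02 * df["leaves_left"]
df["on_leave"] = (rng.random(n) < p_leave).astype(int)

# One-hot encode the categorical column; the numeric column passes through unchanged.
X = pd.get_dummies(df[["day_of_week", "leaves_left"]], columns=["day_of_week"], dtype=float)
y = df["on_leave"]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability of leave on a Friday with 10 leaves left.
query = pd.DataFrame([{"day_of_week": "Fri", "leaves_left": 10}])
query_X = pd.get_dummies(query, columns=["day_of_week"], dtype=float)
query_X = query_X.reindex(columns=X.columns, fill_value=0.0)  # same columns as in training
proba = clf.predict_proba(query_X)[0, 1]

print(list(X.columns))
print("P(leave):", proba)
```

The response stays discrete (0 or 1), while the features are a mix of a continuous count and 0/1 indicator columns.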

Can any machine learning algorithm find this pattern: x1 < x2 without generating a new feature (e.g. x1-x2) first?

If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried MLP and SVM with different kernels, including svg kernel and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC
# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000,2)
# Label each sample. If feature "x1" is less than feature "x2" then label as 1, otherwise label is 0.
y_train = X_train[:,0] < X_train[:,1]
y_train = y_train.astype(int) # convert boolean to 0 and 1
svc = SVC(kernel = "rbf", C = 0.9) # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with it would help a lot to know how to solve for this simple pattern.
To understand this, you must go into the settings of all the parameters provided by sklearn, and C in particular. It also helps to understand how the value of C influences the classifier's training procedure.
If you look at the equation in the User Guide for SVC, there are two main parts to the equation - the first part tries to find a small set of weights that solves the problem, and the second part tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, then you reduce the penalty (lower training accuracy but better generalization to test) and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy. The classifier has learnt the pattern x1 < x2. But it figures that a 99.2% accuracy is enough when you look at another parameter called tol. This controls how much error is negligible for you and by default it is set to 1e-3. If you reduce the tolerance, you can also expect to get similar results.
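A quick way to check the effect of C is to fit the same data twice. This is only a sketch: it reuses the data generation from the question with a fixed seed (my addition), and the exact scores will vary with the data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((1000, 2))
y_train = (X_train[:, 0] < X_train[:, 1]).astype(int)

scores = {}
for C in (0.9, 1e6):
    # Larger C means a heavier penalty on misclassified training points.
    svc = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    scores[C] = svc.score(X_train, y_train)
    print("C = %g -> training accuracy %.4f" % (C, scores[C]))
```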
In general, I would suggest you use something like GridSearchCV to find the optimal values of hyperparameters like C, as this internally splits the dataset into train and validation. This helps you to ensure that you are not just tweaking the hyperparameters to get a good training accuracy but you are also making sure that the classifier will do well in practice.
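A minimal sketch of such a search on the toy x1 < x2 problem could look like this. The grid of C values and the 5-fold split are arbitrary choices of mine, not recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((400, 2))
y = (X[:, 0] < X[:, 1]).astype(int)

param_grid = {"C": [0.1, 1, 10, 100, 1000]}  # hypothetical grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

# best_score_ is the mean accuracy on the held-out folds, not the training accuracy.
print(search.best_params_, search.best_score_)
```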

Deep Learning an Imbalanced data set

I have two data sets that looks like this:
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 12)
Training (Class 0: 8982, Class 1: 380)
Testing (Class 0: 574, Class 1: 8)
I am trying to build a deep feedforward neural net in Tensorflow. I get accuracies in the 90s and AUC scores in the 80s. Of course, the data set is heavily imbalanced so those metrics are useless. My emphasis is on getting a good recall value and I do not want to oversample the Class 1. I have toyed with the complexity of the model to no avail, the best model predicted only 25% of the positive class correctly.
My question is: considering the distribution of these data sets, is it futile to build models without getting more data (I can't get more data), or is there a way to work with data that is this imbalanced?
Can I use tensorflow to learn imbalance classification with a ratio of about 30:1?
Yes, and I have. Specifically Tensorflow provides the ability to feed in a weight matrix. Look at tf.losses.sigmoid_cross_entropy, there is a weights parameter. You can feed in a matrix that matches Y in shape and for each value of Y provide the relative weight that training example should have.
One way to find the correct weights is to start with different balances, run your training, and then look at your confusion matrix and a rundown of precision vs accuracy for each class. Once you get both classes to have about the same precision to accuracy ratio, they are balanced.
Example Implementation
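Here is an example implementation that converts a Y into a weight matrix, which has performed very well for me:
import numpy as np

def weightMatrix( matrix , most=0.9 ) :
    # b: fraction of positives per column, clipped to the range [1 - most, most]
    b = np.maximum( np.minimum( most , matrix.mean(0) ) , 1. - most )
    a = 1./( b * 2. )
    # positives get weight a, negatives get weight a * b / ( 1 - b )
    weights = a * ( matrix + ( 1 - matrix ) * b / ( 1 - b ) )
    return weights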
The most parameter represents the largest fractional difference to consider. 0.9 equates to .1:.9 = 1:9, whereas .5 equates to 1:1. Values below .5 don't work.
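To see what the function produces, here is a small usage sketch. The function is repeated so the snippet runs on its own, and the label arrays are made up; the second one only mimics the 380 vs 8982 class counts from the question.

```python
import numpy as np

def weightMatrix(matrix, most=0.9):
    b = np.maximum(np.minimum(most, matrix.mean(0)), 1. - most)
    a = 1. / (b * 2.)
    return a * (matrix + (1 - matrix) * b / (1 - b))

# Hypothetical labels: 1 positive out of 4 (25% positives, inside the clip range).
Y = np.array([[1.], [0.], [0.], [0.]])
W = weightMatrix(Y)
print(W.ravel())  # the positive gets 1/(2*0.25) = 2, each negative 1/(2*0.75)

# Same class counts as the training set in the question.
Y_q = np.concatenate([np.ones((380, 1)), np.zeros((8982, 1))])
W_q = weightMatrix(Y_q)
print(W_q[0, 0], W_q[-1, 0])  # positive fraction is clipped up to 0.1
```

When no clipping happens, the weights of the two classes sum to the same total, which is the sense in which the classes become balanced. With clipping (as in the second case) the ratio is capped at 9:1.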
You might be interested to have a look at this question and its answer. Its scope is a priori more restricted than yours, as it addresses specifically weights for classification, but it seems very relevant to your case.
Also, AUC is definitely not irrelevant: it is actually independent of your data imbalance.

How does having smaller values for parameters help in preventing over-fitting?

To reduce the problem of over-fitting in linear regression in machine learning, it is suggested to modify the cost function by including the squares of the parameters. This results in smaller values of the parameters.
This is not at all intuitive to me. How can having smaller values for parameters result in a simpler hypothesis and help prevent over-fitting?
I put together a rather contrived example, but hopefully it helps.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
First build a linear dataset, with a training and test split of 5 points each.
X,y, c = datasets.make_regression(10,1, noise=5, coef=True, shuffle=True, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=5)
Fit the data with a fifth order polynomial with no regularization.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
    ('poly', PolynomialFeatures(5)),
    ('model', Ridge(alpha=0.))  # alpha=0 indicates 0 regularization.
])
pipeline.fit(X_train, y_train)
Looking at the coefficients
# y_pred = -12.82 + 33.59 x + 292.32 x^2 - 193.29 x^3 - 119.64 x^4 + 78.87 x^5
Here the model touches all the training points, but has high coefficients and does not touch the test points.
Let's try again, but change our L2 regularization
# y_pred = 6.88 + 26.13 x + 16.58 x^2 + 12.47 x^3 + 5.86 x^4 - 5.20 x^5
Here we see a smoother shape, with less wiggling around. It no longer touches all the training points, but it is a much smoother curve. The coefficients are smaller due to the regularization being added.
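For anyone who wants to reproduce the effect, here is a self-contained sketch along the same lines. Note the assumptions: I use `LinearRegression` for the unregularized fit, `alpha=1.0` is an arbitrary regularization strength rather than the value behind the numbers above, and the random split is seeded, so the coefficients will not match the ones quoted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_regression(n_samples=10, n_features=1, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=5, random_state=0)

def fit_poly(model):
    """Fit a degree-5 polynomial with the given linear model; return its coefficients."""
    pipe = make_pipeline(PolynomialFeatures(5, include_bias=False), model)
    pipe.fit(X_train, y_train)
    return pipe.steps[-1][1].coef_

coef_plain = fit_poly(LinearRegression())  # no regularization
coef_ridge = fit_poly(Ridge(alpha=1.0))    # alpha=1.0 is an assumed value

print("no regularization:", np.round(coef_plain, 2), "norm", np.linalg.norm(coef_plain))
print("ridge, alpha=1.0 :", np.round(coef_ridge, 2), "norm", np.linalg.norm(coef_ridge))
```

The L2 penalty shrinks the coefficient vector, which is what flattens out the wiggles between the training points.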
This is a bit more complicated. It depends very much on the algorithm you are using.
To give an easy but slightly silly example: instead of optimising the parameters of the function
y = a*x1 + b*x2
you could also optimise the parameters of
y = 1/a * x1 + 1/b * x2
Obviously, if you minimise the parameters in the former case, then you need to maximise them in the latter case.
The reason most algorithms minimise the square of the parameters comes from computational learning theory.
Let's assume for the following you want to learn a function
f(x) = a + bx + c * x^2 + d * x^3 +....
One can argue that a function where only a is different from zero is more likely than a function where a and b are different from zero, and so on.
Following Occam's razor (if you have two hypotheses explaining your data, the simpler one is more likely the right one), you should prefer a hypothesis where more of your parameters are zero.
To give an example, let's say your data points are (x,y) = {(-1,0),(1,0)}
Which function would you prefer?
f(x) = 0
f(x) = -1 + 1*x^2
Extending this a bit you can go from parameters which are zero to parameters which are small.
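The two-point example can be written out as a short worked calculation. This assumes a squared-error cost with an L2 penalty of strength \lambda > 0 applied to all parameters (the symbols J, \theta and \lambda are my notation, not from the text above):

```latex
\begin{align*}
J(\theta) &= \sum_i \bigl(f_\theta(x_i) - y_i\bigr)^2 + \lambda \sum_j \theta_j^2 \\
f_1(x) = 0:\quad J &= 0 + \lambda \cdot 0 = 0 \\
f_2(x) = -1 + x^2:\quad J &= 0 + \lambda \bigl((-1)^2 + 1^2\bigr) = 2\lambda
\end{align*}
```

Both functions pass through (-1,0) and (1,0), so the error term is zero for both, and only the penalty term separates them in favour of f(x) = 0.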
If you want to try it out you can sample some data points from a linear function and add a bit of gaussian noise. If you want to find a perfect polynomial fit you need a pretty complicated function with typically pretty large weights. However, if you apply regularisation you will come close to your data generating function.
But if you want to put your reasoning on rock-solid theoretical foundations, I would recommend applying Bayesian statistics. The idea there is that you define a probability distribution over regression functions. That way you can define for yourself what a "probable" regression function is.
(Actually Machine Learning by Tom Mitchell contains a pretty good and more detailed explanation)
Adding the squares to your function (so going from linear to polynomial) means that you can draw a curve instead of just a straight line.
Example of polynomial function:
Adding this, however, can lead to a result that follows the training data too closely, with the consequence that new data is matched too closely to the training data. The same applies when adding more and more polynomial terms (3rd, 4th order). So when adding polynomial terms you always have to watch out that the model does not become overfitted.
To get more insight in this, draw some curves in a spreadsheet and see how the curves change following your data.

Ensemble of different kinds of regressors using scikit-learn (or any other python framework)

I am trying to solve the regression task. I found out that 3 models are working nicely for different subsets of data: LassoLARS, SVR and Gradient Tree Boosting. I noticed that when I make predictions using all these 3 models and then make a table of 'true output' and outputs of my 3 models I see that each time at least one of the models is really close to the true output, though 2 others could be relatively far away.
When I compute the minimal possible error (if I take the prediction from the 'best' predictor for each test example) I get an error which is much smaller than the error of any model alone. So I thought about trying to combine predictions from these 3 different models into some kind of ensemble. The question is, how to do this properly? All my 3 models are built and tuned using scikit-learn; does it provide some kind of method which could be used to pack models into an ensemble? The problem here is that I don't want to just average predictions from all three models, I want to do this with weighting, where the weighting should be determined based on properties of the specific example.
Even if scikit-learn does not provide such functionality, it would be nice if someone knows how to properly address this task of figuring out the weighting of each model for each example in the data. I think that it might be done by a separate regressor built on top of all these 3 models, which will try to output optimal weights for each of the 3 models, but I am not sure if this is the best way of doing this.
This is a known interesting (and often painful!) problem with hierarchical predictions. A problem with training a number of predictors over the train data, then training a higher predictor over them, again using the train data - has to do with the bias-variance decomposition.
Suppose you have two predictors, one essentially an overfitting version of the other, then the former will appear over the train set to be better than latter. The combining predictor will favor the former for no true reason, just because it cannot distinguish overfitting from true high-quality prediction.
The known way of dealing with this is to prepare, for each row in the train data, for each of the predictors, a prediction for the row, based on a model not fit for this row. For the overfitting version, e.g., this won't produce a good result for the row, on average. The combining predictor will then be able to better assess a fair model for combining the lower-level predictors.
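As a rough illustration of that idea (not the transformer shown in this answer), scikit-learn's `cross_val_predict` can produce such "prediction from a model not fitted on this row" columns. The data, the two base models and the 5-fold split below are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
y = rng.normal(size=100)
X = (y + 0.1 * rng.normal(size=100)).reshape(-1, 1)  # noisy copy of y

base_models = [GradientBoostingRegressor(random_state=0), LinearRegression()]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Column j holds, for every row, a prediction from a copy of model j
# that was fitted without the fold containing that row.
oof = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in base_models])

# The combining predictor only ever sees out-of-fold predictions.
combiner = LinearRegression().fit(oof, y)
print("combiner weights:", combiner.coef_)
```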
Shahar Azulay & I wrote a transformer stage for dealing with this:
import warnings

import numpy as np
import sklearn.exceptions
import sklearn.model_selection


class Stacker(object):
    """
    A transformer applying fitting a predictor `pred` to data in a way
    that will allow a higher-up predictor to build a model utilizing both this
    and other predictors correctly.

    The fit_transform(self, x, y) of this class will create a column matrix, whose
    each row contains the prediction of `pred` fitted on other rows than this one.
    This allows a higher-level predictor to correctly fit a model on this, and other
    column matrices obtained from other lower-level predictors.

    The fit(self, x, y) and transform(self, x_) methods, will fit `pred` on all
    of `x`, and transform the output of `x_` (which is either `x` or not) using the fitted
    `pred`.

    Arguments:
        pred: A lower-level predictor to stack.

        cv_fn: Function taking `x`, and returning an iterable of (train, test) index pairs.
            In `fit_transform` the train and test indices will be iterated over. For each
            iteration, `pred` will be fitted to the `x` and `y` with rows corresponding to
            the train indices, and the test indices of the output will be obtained
            by predicting on the corresponding indices of `x`.
    """
    # sklearn.cross_validation was removed from scikit-learn; model_selection replaces it.
    def __init__(self, pred, cv_fn=lambda x: sklearn.model_selection.LeaveOneOut().split(x)):
        self._pred, self._cv_fn = pred, cv_fn

    def fit_transform(self, x, y):
        x_trans = self._train_transform(x, y)
        self.fit(x, y)
        return x_trans

    def fit(self, x, y):
        """
        Same signature as any sklearn transformer.
        """
        self._pred.fit(x, y)
        return self

    def transform(self, x):
        """
        Same signature as any sklearn transformer.
        """
        return self._test_transform(x)

    def _train_transform(self, x, y):
        x_trans = np.nan * np.ones((x.shape[0], 1))
        all_te = set()
        for tr, te in self._cv_fn(x):
            all_te = all_te | set(te)
            x_trans[te, 0] = self._pred.fit(x[tr, :], y[tr]).predict(x[te, :])
        if all_te != set(range(x.shape[0])):
            warnings.warn('Not all indices covered by Stacker', sklearn.exceptions.FitFailedWarning)
        return x_trans

    def _test_transform(self, x):
        return self._pred.predict(x)
Here is an example of the improvement for the setting described in @MaximHaytovich's answer.
First, some setup:
import numpy as np
from sklearn import linear_model
from sklearn import model_selection
from sklearn import ensemble
from sklearn import metrics
y = np.random.randn(100)
x0 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x1 = (y + 0.1 * np.random.randn(100)).reshape((100, 1))
x = np.zeros((100, 2))
Note that x0 and x1 are just noisy versions of y. We'll use the first 80 rows for train, and the last 20 for test.
These are the two predictors: a higher-variance gradient booster, and a linear predictor:
g = ensemble.GradientBoostingRegressor()
l = linear_model.LinearRegression()
Here is the methodology suggested in the answer:
g.fit(x0[: 80, :], y[: 80])
l.fit(x1[: 80, :], y[: 80])
x[:, 0] = g.predict(x0)
x[:, 1] = l.predict(x1)
>>> metrics.r2_score(
y[80: ],
linear_model.LinearRegression().fit(x[: 80, :], y[: 80]).predict(x[80: , :]))
Now, using stacking:
x[: 80, 0] = Stacker(g).fit_transform(x0[: 80, :], y[: 80])[:, 0]
x[: 80, 1] = Stacker(l).fit_transform(x1[: 80, :], y[: 80])[:, 0]
u = linear_model.LinearRegression().fit(x[: 80, :], y[: 80])
x[80: , 0] = Stacker(g).fit(x0[: 80, :], y[: 80]).transform(x0[80:, :])
x[80: , 1] = Stacker(l).fit(x1[: 80, :], y[: 80]).transform(x1[80:, :])
>>> metrics.r2_score(
y[80: ],
u.predict(x[80:, :]))
The stacking prediction does better. It realizes that the gradient booster is not that great.
Ok, after spending some time googling 'stacking' (as mentioned by @andreas earlier) I found out how I could do the weighting in python even with scikit-learn. Consider the below:
I train a set of my regression models (as mentioned: SVR, LassoLars and GradientBoostingRegressor). Then I run all of them on the training data (the same data which was used for training each of these 3 regressors). I get predictions for the examples with each of my algorithms and save these 3 results into a pandas dataframe with columns 'predictedSVR', 'predictedLASSO' and 'predictedGBR'. And I add a final column to this dataframe, which I call 'predicted', holding the true output value.
Then I just train a linear regression on this new dataframe:
#df - dataframe with results of 3 regressors and true output
from sklearn import linear_model
stacker= linear_model.LinearRegression()
stacker.fit(df[['predictedSVR', 'predictedLASSO', 'predictedGBR']], df['predicted'])
So when I want to make a prediction for a new example, I just run each of my 3 regressors separately and then call stacker.predict() on the outputs of my 3 regressors to get a result.
The problem here is that I am finding optimal weights for the regressors 'on average': the weights will be the same for each example on which I try to make a prediction.
What you describe is called "stacking" which is not implemented in scikit-learn yet, but I think contributions would be welcome. An ensemble that just averages will be in pretty soon: https://github.com/scikit-learn/scikit-learn/pull/4161
Late response, but I wanted to add one practical point for this sort of stacked regression approach (which I use frequently in my work).
You may want to choose an algorithm for the stacker which allows positive=True (for example, ElasticNet). I have found that, when you have one relatively stronger model, the unconstrained LinearRegression() model will often fit a larger positive coefficient to the stronger and a negative coefficient to the weaker model.
Unless you actually believe that your weaker model has negative predictive power, this is not a helpful outcome. It is very similar to having high multicollinearity between the features of a regular regression model, and it causes all sorts of edge effects.
This comment applies most significantly to noisy data situations. If you're aiming to get RSQ of 0.9-0.95-0.99, you'd probably want to throw out the model which was getting a negative weighting.
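A small sketch of the positive-weights idea is below. The two "model outputs" are synthetic stand-ins that I made up (a strong model and a weaker one with correlated errors), and the `alpha` and `l1_ratio` values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
y = rng.normal(size=200)

# Hypothetical base-model outputs: a strong model and a weaker one whose error
# is correlated with the strong model's error.
shared = rng.normal(size=200)
strong = y + 0.3 * shared
weak = y + 0.6 * shared + 0.3 * rng.normal(size=200)
P = np.column_stack([strong, weak])

# The unconstrained stacker is free to assign a negative weight to a model.
free = LinearRegression().fit(P, y)
# positive=True restricts every stacking weight to be >= 0.
nonneg = ElasticNet(alpha=0.01, l1_ratio=0.5, positive=True).fit(P, y)

print("unconstrained weights:", free.coef_)
print("positive=True weights:", nonneg.coef_)
```

With the constraint in place a weak model can be pushed to a weight of zero, but it can no longer be used with a negative sign to cancel out errors of the stronger one.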
