Say I have a machine learning model trained using features F1, F2 and F3. This model is then pickled and used on another project (imported using Joblib).
When using the trained model, do the inputs need to be in the same order (F1, F2, F3)?
For simplicity, assume you are fitting a linear regression model (but this generalizes to the other model families). If F1, F2, F3 are your features, then the model finds the weights w1, w2, w3 and a bias such that the error made by w1*F1 + w2*F2 + w3*F3 + bias is minimal. This is called a linear combination of the weights and features.
So when making a prediction the model computes the value w1*F1 + w2*F2 + w3*F3 + bias, and therefore the order of the features matters.
Yes, they must be in exactly the same order. And preprocessed in exactly the same way.
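As a minimal sketch of why (synthetic data and hypothetical feature names, not your actual model), a model trained on columns ordered F1, F2, F3 will silently give wrong predictions if the same values are passed in a different column order; pickling/joblib does not change this:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_train = rng.rand(100, 3)                     # columns are F1, F2, F3
y_train = 1.0 * X_train[:, 0] + 2.0 * X_train[:, 1] + 3.0 * X_train[:, 2]

model = LinearRegression().fit(X_train, y_train)

x_new = np.array([[0.1, 0.2, 0.3]])            # ordered as F1, F2, F3
x_shuffled = x_new[:, [2, 0, 1]]               # same values, wrong order

print(model.predict(x_new))        # correct: ~1.4
print(model.predict(x_shuffled))   # different (wrong) prediction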
If I had 2 features x1 and x2 where I know that the pattern is:
if x1 < x2 then
    class1
else
    class2
Can any machine learning algorithm find such a pattern? What algorithm would that be?
I know that I could create a third feature x3 = x1-x2. Then feature x3 can easily be used by some machine learning algorithms. For example a decision tree can solve the problem 100% using x3 and just 3 nodes (1 decision and 2 leaf nodes).
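For reference, here is a small sketch of that engineered-feature approach (synthetic data; the single split the tree learns sits at roughly x3 = 0):
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 2)
y = (X[:, 0] < X[:, 1]).astype(int)

x3 = (X[:, 0] - X[:, 1]).reshape(-1, 1)   # engineered feature x3 = x1 - x2
tree = DecisionTreeClassifier(max_depth=1).fit(x3, y)

print(tree.score(x3, y))        # 1.0: one decision node, two leaves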
But, is it possible to solve this without creating new features? This seems like a problem that should be easily solved 100% if a machine learning algorithm could only find such a pattern.
I tried an MLP and an SVM with different kernels, including the rbf kernel, and the results are not great. As an example of what I tried, here is the scikit-learn code where the SVM could only get a score of 0.992:
import numpy as np
from sklearn.svm import SVC

# Generate 1000 samples with 2 features with random values
X_train = np.random.rand(1000, 2)

# Label each sample: if feature "x1" is less than feature "x2" then label as 1, otherwise 0
y_train = X_train[:, 0] < X_train[:, 1]
y_train = y_train.astype(int)  # convert boolean to 0 and 1

svc = SVC(kernel="rbf", C=0.9)  # tried all kernels and C values from 0.1 to 1.0
svc.fit(X_train, y_train)
print("SVC score: %f" % svc.score(X_train, y_train))
Output running the code:
SVC score: 0.992000
This is an oversimplification of my problem. The real problem may have hundreds of features and different patterns, not just x1 < x2. However, to start with, it would help a lot to know how to solve this simple pattern.
To understand this, you need to look into the parameters that sklearn's SVC exposes, and C in particular. It also helps to understand how the value of C influences the classifier's training procedure.
If you look at the equation in the User Guide for SVC, there are two main parts: the first part tries to keep the weights small, and the second part tries to minimize the classification errors.
C is the penalty multiplier associated with misclassifications. If you decrease C, you reduce that penalty (lower training accuracy but better generalization to the test set), and vice versa.
Try setting C to 1e+6. You will see that you almost always get 100% accuracy; the classifier has learnt the pattern x1 < x2. With the default settings it decides that 99.2% accuracy is enough, which is also tied to another parameter called tol. This controls how much error is negligible for you and by default it is set to 1e-3. If you reduce the tolerance, you can also expect to get similar results.
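For example (a hedged sketch on the same kind of synthetic data as in the question; a huge C is an illustration of the effect, not a recommended production setting):
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(1000, 2)                       # same set-up as the question
y_train = (X_train[:, 0] < X_train[:, 1]).astype(int)

svc_hard = SVC(kernel="rbf", C=1e6)     # much larger misclassification penalty
svc_hard.fit(X_train, y_train)
print("SVC score: %f" % svc_hard.score(X_train, y_train))  # usually 1.000000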
In general, I would suggest using something like GridSearchCV to find the optimal values of hyperparameters like C, as it internally splits the dataset into train and validation folds. This helps ensure that you are not just tweaking the hyperparameters to get a good training accuracy but also that the classifier will do well in practice.
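A minimal GridSearchCV sketch over C (the grid values are arbitrary examples; cv=5 provides the internal train/validation split mentioned above):
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train = np.random.rand(1000, 2)
y_train = (X_train[:, 0] < X_train[:, 1]).astype(int)

param_grid = {"C": [0.1, 1.0, 10.0, 1e3, 1e6]}          # arbitrary example grid
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)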
Suppose we have a set of inputs (named x1, x2, ..., xn) that give us the output y. The goal is to predict y from values of x1, ..., xn that have not been seen yet. It's clear to me that this problem can be modelled as a regression problem in the realm of machine learning.
However, let's say data keep coming in. I'm able to predict y from x1, ..., xn, and I'm able to check afterwards whether or not that prediction was a good one. If it was, everything is fine. On the other hand, I would like to update my model in case the prediction deviates a lot from the real y. The only way I can see to do this is to insert the new data into my training set and train the regression algorithm again. Two problems arise from that. First, it may cost more than I can afford to recompute my model from scratch from time to time. Second, I may already have so much data in my training set that the newly arriving data is negligible; however, the new data might be more important than the older data due to the nature of my problem.
It seems that a good solution would be some kind of continuous regression that gives more weight to new data than to older data. I have searched for such an approach but have not found anything relevant. Perhaps I'm looking in the wrong direction. Does anyone have a clue on how to do this?
If you want to consider the newer data more important you have to use weights. In scikit-learn this is usually the
sample_weight
argument of the fit() method.
The weights can be defined, for example, as 1 / (time elapsed since that observation).
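For instance, a minimal sketch with scikit-learn's LinearRegression, assuming a hypothetical age array that holds the time elapsed since each observation (synthetic data throughout):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(500, 3)                       # features x1..x3
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=500)

age = np.arange(500, 0, -1)                # oldest sample has the largest age
weights = 1.0 / age                        # newer observations weigh more

model = LinearRegression()
model.fit(X, y, sample_weight=weights)
print(model.coef_)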
Now about the second problem. If the recalculation takes too much time you can truncate your observations and use only the latest ones. Fit your model on the whole data and on the fresh data plus some part of the old data, and check how much the weights change. I suspect that if there really is a dependence between {x_i} and {y} you don't need the whole dataset.
Otherwise you can use weights again, but this time you weight the models themselves:
model for old data: w1*x1 + w2*x2 + ...
model for new data: ~w1*x1 + ~w2*x2 + ...
common model: (w1*a1_1 + ~w1*a1_2)*x1 + (w2*a2_1 + ~w2*a2_2)*x2 + ...
Here a1_1, a2_1 are the weights for the 'old' model and a1_2, a2_2 for the new one; w1, w2 are the coefficients of the old model and ~w1, ~w2 of the new one.
The parameters {a} can be estimated as in the first bullet (by hand), but you could also fit another linear model to estimate them. My advice, though: don't use non-linear regression for {a}, or you will overfit.
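A hedged sketch of that 'common model' idea, using for simplicity a single hand-picked blend factor per model (rather than one per coefficient) and synthetic old/new data:
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X_old, X_new = rng.rand(800, 2), rng.rand(200, 2)
y_old = X_old @ np.array([1.0, 2.0])                 # old relationship
y_new = X_new @ np.array([1.5, 1.5])                 # drifted, newer relationship

m_old = LinearRegression().fit(X_old, y_old)
m_new = LinearRegression().fit(X_new, y_new)

a_old, a_new = 0.3, 0.7                              # trust new data more
w_common = a_old * m_old.coef_ + a_new * m_new.coef_
b_common = a_old * m_old.intercept_ + a_new * m_new.intercept_

x = np.array([[0.4, 0.6]])
print(x @ w_common + b_common)                       # blended prediction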
I have taken a stock price and a day number as the input data.
There are about 1365 input data points, but my model is not able to predict the correct values of m (slope) and b for my regression problem, using a gradient descent optimizer in TensorFlow.
I have also tried different values for the learning rate (0.0000000001, ..., 0.1), but none of them worked.
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf
import numpy as np
batch_size = 8

ds = pd.read_csv("FB.csv", sep=",", header=None)
x_data = np.array(ds[0].values)
y_true = ds[1].values
x_data = np.array(x_data)

m = tf.Variable(2.2)
b = tf.Variable(0.5)

x_act = tf.placeholder(tf.float32, [batch_size])
y_act = tf.placeholder(tf.float32, [batch_size])

y_model = m * x_act + b
error = tf.reduce_sum(tf.square(y_act - y_model))

optim = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
train = optim.minimize(error)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    batches = 1700
    for i in range(batches):
        rand_ind = np.random.randint(len(x_data), size=batch_size)
        sess.run(init)
        feed = {x_act: x_data[rand_ind], y_act: y_true[rand_ind]}
        sess.run(train, feed_dict=feed)
    model_m, model_b = sess.run([m, b])

model_m
model_b

y_ans = (model_m * len(x_data) + 1) + model_b
y_ans
Having spent more than 20 years in trading, quant modelling and Machine-Learning-augmented decision support for FX trading, there are a few things I can help you understand before you start investing your time and effort in a completely wrong direction. The linear regression model reported above has flaws, detailed below, which will not be salvaged by moving to any of the more complex auto-regressive models ( ARMA / ARIMA ), and similarly even LSTM tools will not save a naive, skipped or underestimated system identification ( as it is commonly called in technical cybernetics ). Simply put, any model setup that tries to indoctrinate some model behaviour and abstracts away from non-TA behaviour-mode switching is principally blind and single-handed for handling a complex ( almost hyperchaotic, in the extended Lyapunov sense ) multi-agent ecosystem.
Why is my training model not predicting the correct result?
Because your assumption is straight wrong.
There are no stocks that behave like a linear model, whereas your instructions are strictly the opposite:
you ask your linear model yPREDICTED = m.X + b
to find such m and b
so that the overall sum of penalty-errors is minimal.
Having found such m and b, for which the sum of penalty-errors is minimal, the learner that you pre-selected to use has finished its role.
Right, that means you can be mathematically sure there are no other m and b that would yield a smaller sum of penalty-errors, computed as per your selected method, on the available observed examples ( and the same used part thereof ).
While all of this was done according to the agreed plan, that still does NOT make The Market start to "obey" the m.X + b linear behaviour...
If you fail to realise this cast-iron irony, you have just started to blindly believe that a linear model rules the real world ( which, as we witness second by second, it indeed does not ).
So YGWYT -- You Get What You Train
If you train a linear model m.X + b, you cannot be surprised to get nothing else but a least-wrong linear model m.X + b.
Predictions simply have to systematically follow the Model
which means, all your predictions have to systematically be wrong, just by sticking to the least-wrong linear model m.X + b
Q.E.D.
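To illustrate the point numerically, here is a small hedged sketch (synthetic, deliberately non-linear "price" data; np.polyfit is used as a stand-in for the gradient-descent fit): ordinary least squares still returns some m and b, the least-wrong straight line, while the residual error stays large.
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 200)
price = 50 + 10 * np.sin(x) + rng.normal(scale=1.0, size=x.size)  # non-linear "price"

m, b = np.polyfit(x, price, deg=1)          # ordinary least squares, degree 1
residual = price - (m * x + b)

print("m=%.3f  b=%.3f  RMSE=%.3f" % (m, b, np.sqrt(np.mean(residual ** 2))))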
I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with Python's OpenCV module. Train error ~ 3% and test error ~ 12%.
In order to improve the accuracy/generalisation of my NN I decided to proceed by implementing model selection (of #hidden units and learning rate) to get accurate values of the hyperparameters, and plotting learning curves to determine if more data is needed (I currently have 2.5k samples).
Having read some sources regarding NN training and model selection, I'm very confused about the following matters:
1) In order to perform model selection, I know the following needs to be done:
create set possibleHiddenUnits {4, 8, 16, 32, 64}
randomly select Tr & Va sets from the total Tr + Va data with some split, e.g. 80/20
foreach ele in possibleHiddenUnits
    (*) compute weights for the NN using backpropagation and an iterative
        optimisation algorithm like Gradient Descent (where we provide the
        termination criteria as a number of iterations / epsilon)
    compute the Validation set error using these trained weights
select the number of hidden units which minimises the Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. How do you decide what the number of iterations / epsilon for GD should be?
b. Does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of the cost w.r.t. the weights through backprop) constitute an 'epoch'?
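To make step 1) concrete, here is a hedged sketch using scikit-learn's MLPClassifier and a single 80/20 split (the original post used OpenCV's ANN; the data, hidden-unit grid and max_iter here are placeholders):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.rand(2500, 20)                          # placeholder data: 2.5k samples
y = (X.sum(axis=1) > 10).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

best_units, best_acc = None, -np.inf
for units in [4, 8, 16, 32, 64]:
    clf = MLPClassifier(hidden_layer_sizes=(units,), max_iter=500,
                        random_state=0).fit(X_tr, y_tr)
    acc = clf.score(X_va, y_va)                 # validation accuracy
    if acc > best_acc:
        best_units, best_acc = units, acc

print(best_units, best_acc)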
2) Sources ("What is the difference between train, validation and test set, in neural networks?" and "How to use k-fold cross validation in a neural network") mention that the training for a NN is done in the following way, as it prevents over-fitting:
for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting of the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitutes one iteration of training where the weights are calculated using the entire Tr set through GD + backprop, and that GD involves x (> 1) iterations over the entire Tr set to calculate the weights?
Also, out of 1b and 2b, which is correct?
This is more of a comment, but since I can't make comments yet I'll write it here. Have you tried other methods like L2 regularization or dropout? I don't know a lot about model selection, but dropout has a very similar effect to taking lots of models and averaging them. Normally dropout should do the trick and you won't have problems with overfitting anymore.
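As a hedged sketch of the L2 suggestion: scikit-learn's MLPClassifier exposes an L2 penalty via its alpha parameter (OpenCV's ANN, which the question used, has no direct equivalent, and dropout would need a framework such as TensorFlow/Keras; data and values here are placeholders):
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
X = rng.rand(2500, 20)                          # placeholder data, as above
y = (X.sum(axis=1) > 10).astype(int)

# alpha is the L2 penalty strength; larger values shrink the weights harder
clf = MLPClassifier(hidden_layer_sizes=(64,), alpha=1e-2, max_iter=500,
                    random_state=0).fit(X, y)
print(clf.score(X, y))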
I am working on a data mining project and I have to design the following model.
I am given 4 features x1, x2, x3 and x4, and three functions defined on these features such that each function depends on some subset of the available features.
e.g.
F1(x1, x2) = x1^2 + 2x2^2
F2(x2, x3) = 2x2^2 + 3x3^3
F3(x3, x4) = 3x3^3 + 4x4^4
This implies F1 is some function which depends on features x1, x2; F2 is some function which depends on x2, x3; and so on.
Now a training data set is available where the values of x1, x2, x3, x4 are known together with sum(F1 + F2 + F3) (I know the total sum but not the value of each individual function).
Now, using this training data, I have to build a model which can correctly predict the total sum of all the functions, i.e. (F1 + F2 + F3).
I am new to the data mining and machine learning field, so I apologize in advance if this question is too trivial or wrong. I have tried to model it in many ways but I have no clear idea how to approach it. I will appreciate any help regarding this.
Your problem is non-linear regression. You have
features x1, x2, x3, x4 and a target S, where S = sum(F1 + F2 + F3).
You would like to predict S from the x's, but S is a non-linear function of them.
Since your function S is non-linear, you need to use a non-linear regression algorithm for this problem. Ordinary non-linear regression may solve your problem, or you may choose other approaches. You may, for example, try Tree Regression or MARS (Multivariate Adaptive Regression Splines). They are well-known algorithms and you can find both commercial and open-source implementations.
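For example, a hedged sketch of a tree regressor on this setup, using the example functions from the question to generate synthetic data (your real F1..F3 are of course unknown, and the tree depth is an arbitrary choice):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(5000, 4)                          # columns are x1, x2, x3, x4
x1, x2, x3, x4 = X[:, 0], X[:, 1], X[:, 2], X[:, 3]
S = (x1**2 + 2*x2**2) + (2*x2**2 + 3*x3**3) + (3*x3**3 + 4*x4**4)  # F1 + F2 + F3

X_tr, X_te, S_tr, S_te = train_test_split(X, S, test_size=0.2, random_state=0)
reg = DecisionTreeRegressor(max_depth=8, random_state=0).fit(X_tr, S_tr)
print("R^2 on held-out data:", reg.score(X_te, S_te))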