Output of the logit function as additional input to a Neural Network - machine-learning

This concerns a hybrid of an ANN and logistic regression in a binary classification problem. For example, one of the papers I came across states: "A hybrid model type is constructed by using the logistic regression model to calculate the probability of failure and then adding that value as an additional input variable into the ANN. This type of model is defined as a Plogit-ANN model".
So, for n input variables, I'm trying to understand how the additional input n+1 to an ANN is treated by the activation function (e.g. a logit function) and in the summation of weights multiplied by inputs. Do we treat this probability variable n+1 as a standalone term, like a special type of b0 that we add to the sum of weights multiplied by inputs, e.g. Summation for each neuron = (Sum(Wi*Xi)) + additional variable?
Thank you for your assistance.

According to the description provided, the easiest way is to treat this as an additional feature of your data. You have a model that predicts something about your original dataset (the probability of some additional thing), so you get x -> f(x). You simply concatenate it to your feature vector, so x' = [x1 x2 ... xk f(x)], and push it through the network. The extra input therefore gets its own weight in every first-layer neuron, like any other feature; it is not a special bias term.
However, the described approach is quite naive, since you are doing these two things (training f and training the neural net) completely independently. It might be more beneficial to treat fitting f as an auxiliary loss and train your model jointly.
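A minimal sketch of the "extra feature" approach, assuming scikit-learn's LogisticRegression and MLPClassifier stand in for the paper's logistic model and ANN. The synthetic data, layer size, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic binary-classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Step 1: fit the logistic regression and compute f(x) = P(y = 1 | x).
logit = LogisticRegression().fit(X, y)
p = logit.predict_proba(X)[:, 1]

# Step 2: append f(x) as column n+1, so x' = [x1 ... xn f(x)].
X_aug = np.column_stack([X, p])

# Step 3: train the network on the augmented features. The extra column
# is an ordinary input with its own learned weight per hidden neuron.
ann = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
ann.fit(X_aug, y)

first_layer_weights = ann.coefs_[0]  # shape: (n + 1, number of hidden units)
```

The last row of first_layer_weights holds the weights the network learned for the probability input, one per hidden neuron.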

Related

Regularization Coefficient in Polynomial Regression

Regularization Term
Overfitting in Polynomial Regression, Comparing the training set's root mean squared error and the validation set's root-mean-squared-error.
Graph of the root-mean-square-error vs lnλ for the M=9 polynomial
I didn't understand this graph properly. While training the model to learn the parameters, we have to set λ = 0, since it doesn't make sense to select the value of λ in advance and then proceed with the training. So how does the training error vary as we vary the value of λ? We divided the dataset into training and validation sets so that we train the model on the training set and then validate it on the validation set.
You may be confusing some concepts here.
Why the loss function?
You apply a shrinkage penalty to your loss function.
This will nudge your model towards finding weights closer to zero, which can be helpful as regularization by trading variance for bias (see e.g. ridge regression in “An Introduction to Statistical Learning” by James, Witten, Hastie, and Tibshirani).
How to train?
Setting the term λ after training should not affect your trained model. Think about it in terms of linear regression: as soon as you have fitted the linear regression function, your error function does not matter anymore. You just apply the linear regression function.
Thus, you have to set the λ parameter during training.
Otherwise, your model will optimize the parameters without the regularization term and will thus not shrink the sum of weights. Hence, you are actually training your model without regularization.
How to find a good value for λ?
You have to distinguish multiple steps:
Training: You train a model on your training data with λ set to a fixed value. λ does not change during training, but it must not be zero (if you want regularization).
This training will yield a model; let’s call it Model λ1.
Single Validation Run: You validate how well Model λ1 performs on the validation set. This tells you how and if λ improved your model, but not if there is a better λ.
Cross-Validation Idea: Train multiple models and use validation runs to evaluate their performance to find the best λ. In other words, you train multiple models (e.g. Model λ2 .. λ10) with different λ values. Then you can compare their validation set performance among each other, to see which value for λ is best (on the validation set).
Why have a three-way split (train / validation / test): If you pick a final model with this procedure (e.g. Model λ3), you still don’t know how well your model generalizes, because you have been using the validation set to find a good value for λ. Thus, you would expect your model to perform rather well on the validation set.
Hence, you evaluate your final model on the test set, which your model has never seen and on which you never performed any kind of parameter optimization. The performance you measure there is the final performance of your model. It is important that you do not evaluate multiple models on the test set and then select the best one, because then you would again be optimizing the performance on the test set.
How to interpret the plot?
This is actually hard without some more knowledge about the problem you are tackling.
At first glance, it seems like your model is overfitting for small values of λ and improves its performance on the validation set for larger values due to regularization.
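The train / validate / test procedure above can be sketched as follows. This is an illustration under assumptions: synthetic noisy sine data, scikit-learn's Ridge (whose alpha plays the role of λ) on degree-9 polynomial features, and an arbitrary grid of ln λ values. Each λ is fixed before fitting, and for ridge regression the training RMSE is expected not to decrease as λ grows, which is how a training-error curve over λ comes about.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.2, size=60)

# Three-way split: train / validation / test.
x_tmp, x_test, y_tmp, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=0.25, random_state=0)

def rmse(model, xs, ys):
    return float(np.sqrt(mean_squared_error(ys, model.predict(xs))))

# Train one model per fixed lambda and record train/validation RMSE.
results = {}
for ln_lam in (-20, -15, -10, -5, 0):
    lam = float(np.exp(ln_lam))
    model = make_pipeline(PolynomialFeatures(degree=9, include_bias=False), Ridge(alpha=lam))
    model.fit(x_train, y_train)
    results[ln_lam] = (rmse(model, x_train, y_train), rmse(model, x_val, y_val), model)

# Pick lambda on the validation set, then report once on the test set.
best_ln_lam = min(results, key=lambda k: results[k][1])
test_rmse = rmse(results[best_ln_lam][2], x_test, y_test)
```

Plotting the first two entries of each results value against ln λ gives a graph of the same kind as the one in the question.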

What is 'fit' in Machine learning?

What is 'fit' in machine learning? I noticed in some cases it is a synonym for training.
Can someone please explain in layman's term?
A machine learning model is typically specified with some functional form that includes parameters.
An example is a line intended to model data that has an outcome variable y that can be described in terms of a feature x. In that case, the functional form would be:
y = mx + b
Fitting the model means finding values for m and b that are in accordance with the training data, which is a set of points (x1, y1), (x2, y2), ..., (xN, yN). It may not be possible to set m and b such that the line passes through all training data points, but some loss function can be defined to describe a well-fit line. The fitting algorithm's purpose is then to minimize that loss function. In the case of line fitting, the loss could be the total distance of the training data points to the line, but it may be more mathematically convenient to set the loss to the total squared distance of the training data points to the line.
In general, a model can be more complex than a line and include many parameters. For some models, the number of parameters is not fixed and can change as part of the fitting process. The features and the outcome variable can be discrete, continuous, and/or multidimensional. For unsupervised problems, there is no outcome variable.
In all these cases, fitting is still analogous to the line example above, where an algorithm is run to find model parameters that in some sense explain the training data. This often involves running some optimization procedure.
A model that is well-fit to the training data may not be well-fit to other non-training data, even if the other data is sampled from the same distribution as the training data. A technique called regularization can be used to address this issue.
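As a small illustration of the line example, here is a sketch that fits y = mx + b with NumPy's polyfit. The five data points are made up, and the loss used is the squared vertical distance to the line, which is an assumption about what "squared distance" means here.

```python
import numpy as np

# Made-up training data: points (x_i, y_i).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

def squared_loss(m, b):
    # Total squared vertical distance between the line and the points.
    return float(np.sum((m * x + b - y) ** 2))

# "Fitting": find the m and b that minimize the squared loss.
m, b = np.polyfit(x, y, deg=1)
```

Any other choice of m and b gives a squared loss at least as large as squared_loss(m, b).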

difference between LinearRegression and svm.SVR(kernel="linear")

First there are questions on this forum very similar to this one but trust me none matches so no duplicating please.
I have encountered two methods of linear regression using scikit-learn and I am failing to understand the difference between the two, especially since the first snippet calls train_test_split() while the other one calls the fit method directly.
I am studying with multiple resources and this single issue is very confusing to me.
First which uses SVR
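import numpy as np
from sklearn import model_selection, preprocessing, svm
# df is a pandas DataFrame with a 'label' column
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
y = np.array(df['label'])
# cross_validation was removed in scikit-learn 0.20; model_selection replaces it
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2)
clf = svm.SVR(kernel='linear')
clf.fit(X_train, y_train)
confidence = clf.score(X_test, y_test)  # R^2 on the held-out split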
And second is this one
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)
So my main focus is the difference between using svr(kernel="linear") and using LinearRegression()
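train_test_split (in cross_validation in older scikit-learn releases, in model_selection in current ones): Splits arrays or matrices into random train and test subsets.
In the second snippet, the split is not random: the last 20 samples are held out for testing.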
svm.SVR: Support Vector Regression (SVR) uses the same principles as the SVM for classification, with a few differences. Because the output is a real number with infinitely many possible values, exact prediction is not expected; instead, a margin of tolerance (epsilon) is set, and errors within that margin are not penalized. The algorithm is also more involved than in the classification case. The main idea stays the same, though: minimize the error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
Linear Regression: In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. The case of one explanatory variable is called simple linear regression.
Reference:
https://cs.adelaide.edu.au/~chhshen/teaching/ML_SVR.pdf
This is what I found:
Intuitively, like all regressors, it tries to fit a line to the data by minimising a cost function. The interesting part about SVR, however, is that you can deploy a non-linear kernel. In that case you end up doing non-linear regression, i.e. fitting a curve rather than a line.
This process is based on the kernel trick and on representing the solution/model in the dual rather than in the primal. That is, the model is represented as a combination of the training points rather than as a function of the features and some weights. The basic algorithm remains the same: the only real change in going non-linear is the kernel function, which changes from a simple inner product to some non-linear function.
So SVR also handles non-linear fitting problems, while LinearRegression() only fits a linear function of the features (both can use any number of features).
The main difference between these methods is in the mathematical background.
We have samples X and want to predict the target Y.
The Linear Regression method just minimizes the least squares error:
for one object the prediction is y = x^T * w, where w is the model's weight vector.
Loss(w) = Sum_{n=1..N} (x_n^T * w - y_n)^2 --> min over w
As this is a convex function, any minimum found is the global minimum.
After taking the derivative of Loss with respect to w, setting it to zero, and rewriting the sums as matrix products, you get:
w = (X^T * X)^(-1) * (X^T * Y)
So w can be computed directly from this formula, without iterating (scikit-learn's LinearRegression likewise solves the least-squares problem directly rather than by iterative descent).
X is the training samples passed when you call the fit method.
In predict, these weights are just multiplied by X_test.
So the solution is explicit and faster than iterative methods such as SVM (except for very large datasets, where inverting the matrix becomes expensive).
In addition: Lasso and Ridge solve the same task but add a regularization term on the weights to their losses.
For Ridge you can still calculate the weights explicitly; Lasso has no closed-form solution in general and is solved iteratively.
The linear SVM does almost the same thing, except that its optimization task also maximizes the margin (I apologize, but it is difficult to write down because I didn't find out how to write in TeX format here).
It has no closed-form solution, so it is solved with an iterative numerical optimizer (scikit-learn's SVR wraps libsvm).
Scikit-learn's SVM classes even have a max_iter parameter, which limits the number of solver iterations.
To sum up: Linear Regression has an explicit solution, while SVM finds an approximation of the true solution because it is solved numerically.
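To make the comparison concrete, here is a sketch that computes the closed-form weights and compares them with scikit-learn's LinearRegression and a linear-kernel SVR. The synthetic data, the true weights, and the epsilon and C values are assumptions chosen for illustration; the intercept is switched off in LinearRegression so that it matches the formula above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Synthetic linear data with a little noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Closed form w = (X^T X)^(-1) X^T y, solved without forming the inverse.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# LinearRegression without an intercept solves the same least-squares problem.
w_lr = LinearRegression(fit_intercept=False).fit(X, y).coef_

# Linear SVR minimizes an epsilon-insensitive loss plus a penalty on the
# weights, using an iterative solver, so its weights are close but not identical.
svr = SVR(kernel="linear", epsilon=0.1, C=1.0).fit(X, y)
w_svr = svr.coef_.ravel()
```

On data like this, w_closed and w_lr agree to numerical precision, while w_svr ends up near the same values because the two models optimize different losses.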

Can a machine learning model provide information about mean and standard deviation of data on which it was trained?

Consider a parametric binary classifier (such as Logistic Regression, SVM etc.) trained on a dataset (say containing two features for e.g. Blood Pressure and Cholesterol level). The dataset is thrown away and the trained model can only be used as a black box (no tweaks and inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how? And if not, why not?
Thank you for your response! :)
SVM does not provide any information about the data statistics, it is a maximum margin classifier and it finds the best separating hyperplane between two datasets in the feature space, as a linear combination of "support vectors". If you use kernel functions, then this combination is in the kernel space, it is not even in the original feature space. SVM does not have a straightforward probabilistic interpretation whatsoever.
Logistic regression is a discriminative classifier and models the conditional probability p(y|x,w), where y is your label, x is your data and w are the weights. After maximum likelihood training you are left with w, which again defines a discriminator (hyperplane) in the feature space, so you do not recover the statistics of the features either.
The following can be considered. Use a Gaussian classifier. Assume that your class is produced by the prior class probability p(y). Then a class conditional density p(x|y,w) produces your data. By Bayes' rule, you then have: p(y|x,w) = (p(y) p(x|y,w)) / p(x). If you define the class conditional density p(x|y,w) as Gaussian, its parameter set w will consist of the mean vector m and covariance matrix C of x, assuming it is produced by the class y. But remember that this works only under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector would be E[x|w]. This is the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
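For reference, the weighted-average statement can be written out using the standard mixture identities (law of total expectation and law of total covariance). The symbols m_k and C_k, denoting the class-conditional mean and covariance for class k, are notation introduced here and not part of the answer above.

```latex
\begin{aligned}
\bar{m} = \mathbb{E}[x \mid w] &= \sum_{k \in \{0,1\}} p(y = k)\, m_k \\
\operatorname{Cov}[x \mid w] &= \sum_{k \in \{0,1\}} p(y = k)\,\bigl( C_k + (m_k - \bar{m})(m_k - \bar{m})^{\top} \bigr)
\end{aligned}
```

So the overall covariance is not just the weighted average of the class covariances; it also includes a term for how far each class mean lies from the overall mean.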

What is a loss function in simple words?

Can anyone please explain in simple words and possibly with some examples what is a loss function in the field of machine learning/neural networks?
This came out while I was following a Tensorflow tutorial:
https://www.tensorflow.org/get_started/get_started
It describes how far off the result your network produced is from the expected result: it indicates the magnitude of the error your model made on its prediction.
You can then take that error and 'backpropagate' it through your model, adjusting its weights and making it get closer to the truth the next time around.
The loss function is how you're penalizing your output.
The following example is for a supervised setting, i.e. when you know what the correct result should be, although loss functions can be applied even in unsupervised settings.
Suppose you have a model that always predicts 1. Just the scalar value 1.
You can have many loss functions applied to this model. The L2 loss is the squared Euclidean distance.
If I pass in some value, say 2, and I want my model to learn the x**2 function, then the result should be 4 (because 2*2 = 4). If we apply the L2 loss, then it is computed as ||4 - 1||^2 = 9.
We can also make up our own loss function. We can say the loss function is always 10. So no matter what our model outputs the loss will be constant.
Why do we care about loss functions? Well, they determine how poorly the model did, and in the context of backpropagation and neural networks they also determine the gradients from the final layer to be propagated so the model can learn.
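The worked example above can be written as a few lines of Python. The function names are hypothetical and chosen only for this illustration.

```python
def model(x):
    # A model that ignores its input and always predicts 1.
    return 1.0

def l2_loss(prediction, target):
    # Squared difference between the target and the prediction.
    return (target - prediction) ** 2

def constant_loss(prediction, target):
    # A made-up loss that is always 10, whatever the model outputs.
    return 10.0

x = 2.0
target = x ** 2                    # we want the model to learn x**2, so 4
loss = l2_loss(model(x), target)   # (4 - 1)^2 = 9
```

The constant loss carries no information about how wrong the prediction is, so its gradient is zero and a model could not learn from it.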
As other comments have suggested I think you should start with basic material. Here's a good link to start off with http://neuralnetworksanddeeplearning.com/
It is worth noting that we can speak of different kinds of loss functions:
regression loss functions and classification loss functions.
Regression loss function describes the difference between the values that a model is predicting and the actual values of the labels.
So the loss function has a meaning on a labeled data when we compare the prediction to the label at a single point of time.
This loss function is often called the error function or the error formula.
Typical error functions we use for regression models are L1 and L2, Huber loss, Quantile loss, log cosh loss.
Note: L1 loss is also known as Mean Absolute Error. L2 loss is also known as Mean Square Error or Quadratic loss.
Loss functions for classification represent the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).
To name a few: log loss, focal loss, exponential loss, hinge loss, relative entropy loss, and others.
Note: While more commonly used in regression, the square loss function can be re-written and utilized for classification.
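A few of the losses named above, sketched with NumPy under their usual textbook definitions. The function names and the label conventions (0/1 for log loss, -1/+1 for hinge loss) are assumptions made for this illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    # L1 / mean absolute error (regression).
    return float(np.mean(np.abs(y_true - y_pred)))

def mse(y_true, y_pred):
    # L2 / mean squared error (regression).
    return float(np.mean((y_true - y_pred) ** 2))

def log_loss(y_true, prob):
    # Binary cross-entropy; y_true in {0, 1}, prob = predicted P(y = 1).
    return float(-np.mean(y_true * np.log(prob) + (1 - y_true) * np.log(1 - prob)))

def hinge_loss(y_true, score):
    # Hinge loss; y_true in {-1, +1}, score = raw classifier output.
    return float(np.mean(np.maximum(0.0, 1.0 - y_true * score)))

y = np.array([1.0, 2.0, 3.0])
p = np.array([1.0, 2.0, 5.0])
regression_losses = (mae(y, p), mse(y, p))
```

The squared error punishes the single large miss in this example more heavily than the absolute error does.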

Resources