What's the difference between objective functions and xgboost models? - machine-learning

I know that GBtree uses decision trees for classification and regression,
but how can we use gblinear for a classification problem? Doesn't it give us a continuous prediction?
I think I am confused between the parameters "booster" and "objective" in xgboost.
What can GBtree give us compared to GBLinear?
What is the difference between "Objective Function" and "Booster"
in xgboost?

GBLinear gives a "linear" model of your problem. Linear
regression is a linear model that predicts a continuous value, as you
mentioned. But there are other linear models, like logistic
regression, which predicts a value between 0 and 1 (the probability of
a class in a classification problem). So if you use a booster of the GBLinear
type for binary classification, you should use the binary:logistic objective function. GBtree gives a decision-tree model of your problem.
The objective function is the function you try to minimize (it does not
directly relate to the model). Mostly, objective functions define
some kind of error. For example, in linear regression you have a
hypothesis that looks as follows: Hw = w0 + w1*x1 + w2*x2 + ... +
wn*xn (this hypothesis is actually a way to model your
problem). The "objective", also called the "cost" function,
is similar to this: COST = (Hw - y)^2, where y is the true target value. Your objective is to
find the w0, ..., wn that minimize that error, so that you get
a "model" that is "fit" to "solve" your problem.
GBtree / GBlinear are models: ways to model your problem. The model
is worth nothing without "tuning" its "weights". With the "objective"
function, you "tune" your "weights".
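To make the split between "model" and "objective" concrete, here is a minimal pure-Python sketch. It is not how xgboost is implemented; the names linear_score and fit_logistic and the toy data are made up for illustration. The model is a plain linear function (the role gblinear plays), and the objective is the log loss (the role binary:logistic plays), minimised with ordinary gradient descent. The output is a probability, not an unbounded continuous value.

```python
import math

def linear_score(w, b, x):
    # The "model": a plain linear function of the inputs.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    # The "objective": log loss, minimised here by batch gradient descent.
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(linear_score(w, b, xi)) - yi  # d(log loss)/d(score)
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy one-feature data: small x belongs to class 0, large x to class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
probs = [sigmoid(linear_score(w, b, xi)) for xi in X]
print([round(p, 3) for p in probs])
```

In xgboost itself the same choice is expressed through the parameters, e.g. booster set to gblinear together with objective set to binary:logistic.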

Related

Can gradient descent itself solve non-linear problem in ANN?

I've recently been studying the theory of neural networks, and I'm a little confused about the roles of gradient descent and the activation function in an ANN.
From what I understand, the activation function is used to turn the model into a non-linear one, so that it can solve problems that are not linearly separable. Gradient descent is the tool that helps the model learn.
So my questions are:
If I use an activation function such as sigmoid for the model, but instead of using gradient descent to improve the model I use the classic perceptron learning rule, Wj = Wj + a*(y-h(x)), where h(x) is the sigmoid function applied to the net input, can the model learn a non-linearly separable problem?
If I do not include a non-linear activation function in the model, just the simple net input h(x) = w0 + w1*x1 + ... + wj*xj, and use gradient descent to improve the model, can the model learn a non-linearly separable problem?
I'm really confused about which of the two is the main reason the model can learn a non-linearly separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you use determines a family of functions h_w(x), and each value of w represents one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) is often infinitely large.
The goal of learning will then be to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w, and thus select a particular model h(x).
The problem is, the actual function f you are trying to approximate may not be part of the family h_w you decided to pick, and so you are limited to the closest approximation that family contains.
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single-layer ANN will not help it learn a non-linear function. Indeed, a single-layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into logistic regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts as a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break e.g. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of functions you're optimising over consists only of linear functions of their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1], so the output will not be simple to interpret as a class probability, but the final result will be the same. See the Playground for a visual proof.
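The point about the linear decision boundary can be checked on XOR, the textbook example of a problem that is not linearly separable. The sketch below is only an illustration (the function names and hyperparameters are made up): it trains a single sigmoid unit with per-sample gradient updates and then classifies by the sign of the weighted sum. Since no straight line classifies more than three of the four XOR points correctly, the accuracy cannot exceed 0.75, whatever learning rule is used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_single_layer(X, y, lr=0.5, epochs=5000):
    # One layer: a weighted sum followed by a sigmoid (logistic regression).
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in zip(X, y):
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - target
            w[0] -= lr * err * x1
            w[1] -= lr * err * x2
            b -= lr * err
    return w, b

def accuracy(w, b, X, y):
    hits = 0
    for (x1, x2), target in zip(X, y):
        # The class decision depends only on the sign of the weighted sum.
        predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        hits += int(predicted == target)
    return hits / len(y)

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_labels = [0, 1, 1, 0]
w, b = train_single_layer(X, xor_labels)
xor_acc = accuracy(w, b, X, xor_labels)
print(xor_acc)  # never above 0.75 for a linear boundary
```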
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that a single hidden layer is enough to approximate anything, as long as you make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require fewer neurons overall, even though it makes training more complex.
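The first option (hand-picked input preprocessing) can be sketched as follows. This is an illustration with made-up data, not the Playground's code: points on an inner ring (radius 0.5) get label +1, points on an outer ring (radius 1.5) get label -1, and a plain perceptron is trained once on the raw inputs (x1, x2) and once on the squared inputs (x1^2, x2^2). On the raw inputs no line separates the rings, while on the squared inputs the boundary x1^2 + x2^2 = const is linear, so the perceptron is guaranteed to converge.

```python
import math

def make_ring_data():
    # Eight points on each ring; inner ring is class +1, outer ring is class -1.
    X, y = [], []
    for k in range(8):
        angle = 2 * math.pi * k / 8
        for radius, label in ((0.5, 1), (1.5, -1)):
            X.append((radius * math.cos(angle), radius * math.sin(angle)))
            y.append(label)
    return X, y

def perceptron_accuracy(features, y, epochs=100):
    w = [0.0] * (len(features[0]) + 1)  # last entry is the bias weight
    for _ in range(epochs):
        for f, yi in zip(features, y):
            xs = list(f) + [1.0]
            if yi * sum(wj * xj for wj, xj in zip(w, xs)) <= 0:
                # Classic perceptron update on a misclassified point.
                w = [wj + yi * xj for wj, xj in zip(w, xs)]
    correct = sum(
        1 for f, yi in zip(features, y)
        if yi * sum(wj * xj for wj, xj in zip(w, list(f) + [1.0])) > 0
    )
    return correct / len(y)

X, y = make_ring_data()
raw_acc = perceptron_accuracy(X, y)
squared_acc = perceptron_accuracy([(x1 * x1, x2 * x2) for x1, x2 in X], y)
print(raw_acc, squared_acc)
```

The model is the same linear one in both runs; only the representation of the input changes.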
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
Non-linear transformations give us a better separation to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have a very small quantity of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.
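As a small illustration of why the learning rate matters, here is a sketch on a toy one-dimensional problem, f(w) = w**2 (my own example, not taken from the question). Each step trusts the local slope; a small step keeps shrinking w toward the minimum, while a step that is too large overshoots further every time.

```python
def gradient_descent(lr, steps=50, w=5.0):
    # Minimise f(w) = w**2, whose gradient is 2*w.
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w

w_small = gradient_descent(0.1)  # each step multiplies w by 0.8
w_large = gradient_descent(1.1)  # each step multiplies w by -1.2
print(w_small, w_large)
```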

Linear Regression: Is there a difference in the model between using ML instead of MSE?

We know we need 4 things for building a machine learning algorithm:
A Dataset
A Model
A cost function
An optimization procedure
Taking the example of linear regression (y = m*x + q), we have two common ways of finding the best parameters: using ML (maximum likelihood) or MSE (mean squared error) as the cost function.
When using ML, we hypothesize that the data are Gaussian-distributed.
Is this assumption also part of the model?
If it's not, why? Is it part of the cost function?
I can't see the "edge" of the model, in this case.
Is this assumption part of the model, also?
Yes, it is. The different loss functions derive from the nature of the problem, and consequently from the nature of the model.
MSE by definition computes the mean of the squared errors (an error being the difference between the real y and the predicted y), which in turn will be high if the data are not Gaussian-like distributed. Just imagine a few extreme values among the data: what will happen to the slope of the line and, consequently, to the residual error?
It is worth mentioning the assumptions of Linear Regression:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
If it's not, why? Is it part of the cost function?
As far as I have seen, the assumption is not directly related to the cost function itself; rather, as mentioned above, it is related to the model itself.
For example, the idea behind the Support Vector Machine is the separation of classes, that is, finding a line or hyperplane (in a multidimensional space) that separates the classes; thus its cost function is the hinge loss, which leads to "maximum-margin" classification.
On the other hand, logistic regression uses log loss (related to cross-entropy) because the model is binary and works on the probability of the output (0 or 1). And the list goes on...
The assumption that the data is Gaussian-distributed is part of the model in the sense that, for Gaussian-distributed data, the parameters that minimize the mean squared error are also the maximum likelihood solution. (This is a common proof; you can look it up if you are interested.)
So you could say that the Gaussian distribution assumption justifies the choice of least squares as the loss function.
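For reference, the standard argument can be sketched in a few lines. The assumptions here are mine, stated for the y = m*x + q example from the question: the noise is additive, independent across samples, and Gaussian with zero mean and a fixed variance sigma^2.

```latex
% Assumed model: y_i = m x_i + q + \varepsilon_i, with \varepsilon_i \sim \mathcal{N}(0, \sigma^2) i.i.d.
\begin{align*}
\log L(m, q)
  &= \sum_{i=1}^{N} \log\!\left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{(y_i - m x_i - q)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{N}{2} \log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - m x_i - q)^2 \\[4pt]
\arg\max_{m,\,q} \log L(m, q)
  &= \arg\min_{m,\,q} \frac{1}{N} \sum_{i=1}^{N} (y_i - m x_i - q)^2
\end{align*}
```

The first term does not depend on m or q, and the factor 1/(2 sigma^2) is a positive constant, so maximizing the likelihood and minimizing the MSE pick the same parameters.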

Ordinal logistic regression: how does it differ from logistic regression?

I am sure this question may not be in the brilliant category, but to learn machine learning I may have to start with stupid questions. So, please bear with me.
I partially understand the terminology of regression.
Regression essentially gives an idea of the relationship between the dependent and independent variables.
If the dependent variable is continuous and you see a linear relation between the dependent and independent variables, then linear regression is the way to go.
A slight change now. If the dependent variable is something like a binary value (Y/N), i.e. the output follows a binomial distribution, then logistic regression is the way to go, which implies a non-linear relationship between the dependent and independent variables.
So far, please correct me if I am wrong.
Now my question is with respect to ordinal logistic regression.
I have started looking at the link below for reference
https://statistics.laerd.com/spss-tutorials/ordinal-regression-using-spss-statistics.php
It mentions that "It can be considered as either a generalisation of multiple linear regression or as a generalisation of binomial logistic regression".
Could someone help me understand the above statement with examples?
Logistic regression can be considered an extension of linear regression. But instead of predicting a continuous variable, it predicts a discrete one by applying an activation function to the linear output. So you are asked to produce a discriminant function that, based on X, outputs a class in {1, 2, ..., k}, where k is the number of classes in your problem. X can be composed of features that are continuous or discrete. It does not matter, just make sure you apply appropriate pre-processing to them.
The base case for logistic regression is finding the decision boundary that divides two classes. To handle more classes, you have to use another approach. There are several: softmax (https://en.wikipedia.org/wiki/Softmax_function), one-vs-all (https://en.wikipedia.org/wiki/Multiclass_classification), etc.
Finally, to answer your question: ordinal logistic regression is an extension of logistic regression that also takes into account the order of the output classes, for example in the case of test grades. Take a look online for examples.
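Since no example is given above, here is a minimal sketch of one common formulation, the cumulative-logit (proportional-odds) model. This is my assumption about what "ordinal logistic regression" refers to here; the weights and thresholds are made-up numbers, not fitted values. There is a single linear score, as in linear regression, and k-1 ordered thresholds; each threshold is squashed by a sigmoid, as in binary logistic regression, to give P(class <= j).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_probs(x, w, thresholds):
    # One shared linear score, as in linear regression.
    score = sum(wi * xi for wi, xi in zip(w, x))
    # P(class <= j) for each ordered threshold, plus 1.0 for the last class.
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)  # P(class == j) = P(<= j) - P(<= j-1)
        previous = c
    return probs

# Hypothetical 4-level outcome (e.g. grades D < C < B < A) with 3 thresholds.
probs = ordinal_probs(x=[1.0, 2.0], w=[0.5, -0.25], thresholds=[-1.0, 0.0, 1.0])
print([round(p, 4) for p in probs])
```

With a single threshold this reduces to ordinary binary logistic regression, which is one way to read the "generalisation" statement in the question.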

What is a loss function in simple words?

Can anyone please explain in simple words and possibly with some examples what is a loss function in the field of machine learning/neural networks?
This came out while I was following a Tensorflow tutorial:
https://www.tensorflow.org/get_started/get_started
It describes how far the result your network produced is from the expected result; it indicates the magnitude of the error your model made in its prediction.
You can then take that error and 'backpropagate' it through your model, adjusting its weights and making it get closer to the truth the next time around.
The loss function is how you're penalizing your output.
The following example is for a supervised setting, i.e. when you know what the correct result should be, although loss functions can be applied even in unsupervised settings.
Suppose you have a model that always predicts 1. Just the scalar value 1.
You can have many loss functions applied to this model. L2 is the squared Euclidean distance.
If I pass in some value, say 2, and I want my model to learn the x**2 function, then the result should be 4 (because 2*2 = 4). If we apply the L2 loss, it is computed as ||4 - 1||^2 = 9.
We can also make up our own loss function. We can say the loss function is always 10, so no matter what our model outputs, the loss will be constant.
Why do we care about loss functions? They determine how poorly the model did, and in the context of backpropagation and neural networks they also determine the gradients that are propagated back from the final layer so the model can learn.
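The arithmetic above can be written out in a few lines. The function names are made up for this illustration.

```python
def constant_model(x):
    # A model that ignores its input and always predicts 1.
    return 1.0

def target(x):
    # The function we would like the model to learn.
    return x ** 2

def l2_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

def constant_loss(y_true, y_pred):
    # A made-up loss that is always 10; it carries no information about the error.
    return 10.0

loss = l2_loss(target(2), constant_model(2))
print(loss)  # (4 - 1)^2 = 9.0
```

The constant loss is a valid function, but its gradient with respect to the prediction is zero everywhere, so it gives the model nothing to learn from.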
As other comments have suggested I think you should start with basic material. Here's a good link to start off with http://neuralnetworksanddeeplearning.com/
It is worth noting that we can speak of different kinds of loss functions:
regression loss functions and classification loss functions.
A regression loss function describes the difference between the values that a model predicts and the actual values of the labels.
So the loss function is meaningful on labeled data, where we compare the prediction to the label at a single point in time.
This loss function is often called the error function or the error formula.
Typical error functions used for regression models are L1, L2, Huber loss, quantile loss and log-cosh loss.
Note: L1 loss is also known as Mean Absolute Error. L2 loss is also known as Mean Squared Error or quadratic loss.
Loss functions for classification represent the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).
To name a few: log loss, focal loss, exponential loss, hinge loss, relative entropy loss and others.
Note: While more commonly used in regression, the square loss function can be rewritten and used for classification.
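A few of the losses named above, written per sample in plain Python as a rough sketch (averaging the L1 or L2 values over a dataset gives the Mean Absolute Error or Mean Squared Error). The label conventions are assumptions stated in the comments: log loss takes labels in {0, 1} with a predicted probability, hinge loss takes labels in {-1, +1} with a raw score.

```python
import math

def l1_loss(y_true, y_pred):
    return abs(y_true - y_pred)

def l2_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

def huber_loss(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for large ones.
    err = abs(y_true - y_pred)
    if err <= delta:
        return 0.5 * err ** 2
    return delta * (err - 0.5 * delta)

def log_loss(label, prob):
    # label in {0, 1}; prob is the predicted probability of class 1, strictly in (0, 1).
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def hinge_loss(label, score):
    # label in {-1, +1}; score is the raw, unsquashed model output.
    return max(0.0, 1.0 - label * score)

print(l1_loss(3.0, 1.0), l2_loss(3.0, 1.0), huber_loss(3.0, 1.0))
print(log_loss(1, 0.5), hinge_loss(-1, 0.5))
```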

SVM - Difference between Energy vs Loss vs Regularization vs Cost function

I am reading A Tutorial on Energy Based Learning and I am trying to understand the difference between all those terms stated above in the context of SVMs. This link summarizes the differences between a loss, cost and an objective function. Based on my understanding,
Objective function: Something we want to minimize. For example ||w||^2 for SVM.
Loss function: Penalty between prediction and label which is also equivalent to the regularization term. Example is the hinge loss function in SVM.
Cost function: A general formulation that combines the objective and loss function.
Now, the 1st link states that the hinge function is max(0, m + E(W,Yi,Xi) - E(W,Y,X)), i.e. it is a function of the energy term. Does that mean that the energy function of the SVM is 1 - y(wx + b)? Are energy functions a part of a loss function? And are a loss + objective function a part of the cost function?
A concise summary of the 4 terms would immensely help my understanding. Also, do correct me if my understanding is wrong. The terms sound so confusing. Thanks!
Objective function: Something we want to minimize. For example ||w||^2 for SVM.
The objective function is, as the name suggests, the objective of the optimization. It can be either something we want to minimize (like a cost function) or maximize (like a likelihood). In general, it is a function that measures how good our current solution is (usually by returning a real number).
Loss function: Penalty between prediction and label which is also equivalent to the regularization term. Example is the hinge loss function in SVM.
First of all, loss is not equivalent to regularization, in any sense. A loss function is a penalty between a model and the truth. This can be a prediction of the class-conditional distribution vs. the true label, and it can also be a data distribution vs. an empirical sample, and many more.
Regularization
Regularization is a term, penalty or measure that penalizes an overly complex model. In ML, or generally in statistics when dealing with estimators, you always try to balance two sources of error: variance (coming from too complex models, overfitting) and bias (coming from too simple models, bad learning methods, underfitting). Regularization is a technique for penalizing high-variance models in the optimization process in order to get a less overfitted one. In other words, for techniques which can fit the training set perfectly, it is important to have a measure which prevents that in order to preserve the ability to generalize.
Cost function: A general formulation that combines the objective and loss function.
The cost function is just an objective function that one minimizes. It can be composed of some combination of loss functions and regularizers.
Now, the 1st link states that the hinge function is max(0, m + E(W,Yi,Xi) - E(W,Y,X)), i.e. it is a function of the energy term. Does that mean that the energy function of the SVM is 1 - y(wx + b)? Are energy functions a part of a loss function? And are a loss + objective function a part of the cost function?
The hinge loss is max(0, 1 - y(<w,x> - b)). The one defined here is not really for SVM but for general factor graphs. I would strongly suggest starting to learn ML from the basics and not from advanced techniques. Without a good understanding of the basics of ML, this paper will not be possible to understand.
To show an example with SVM and the naming convention:
C SUM_i=1^N max(0, 1 - y_i(<w, x_i> - b)) + ||w||^2
\_______________________________________/   \_____/
                  loss                  regularization
\_________________________________________________/
             cost / objective function
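The same decomposition as a short Python sketch. The function name svm_cost and the toy numbers are made up, and the sign convention <w, x> - b follows the formula above.

```python
def svm_cost(w, b, X, y, C=1.0):
    # Loss part: C times the sum of hinge losses over the training points.
    hinge_total = 0.0
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) - b)
        hinge_total += max(0.0, 1.0 - margin)
    loss = C * hinge_total
    # Regularization part: squared norm of the weight vector.
    regularization = sum(wj * wj for wj in w)
    # Cost / objective: the quantity that is actually minimised.
    return loss, regularization, loss + regularization

X = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
y = [1, 1, -1]
loss, reg, cost = svm_cost([1.0, 0.0], 0.0, X, y, C=2.0)
print(loss, reg, cost)
```

Only the second point sits inside the margin, so it is the only one that contributes to the loss term.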
