How can I determine the "loss function" for MLPClassifier in scikit-learn? - machine-learning

I want to use MLPClassifier of scikit-learn:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)
I didn't find any parameter for the loss function; I want it to be mean_squared_error. Is it possible to set it for the model?

According to the docs:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Log-loss is basically the same as cross-entropy.
There is no way to pass another loss function to MLPClassifier, so you cannot use MSE. But MLPRegressor uses MSE, if you really want that.
However, the general advice is to stick to cross-entropy loss for classification, since it is said to have some advantages over MSE. So you may just want to use MLPClassifier as-is for your classification problem.
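For reference, here is a minimal sketch (synthetic data for illustration; hyperparameters copied from the question) showing that MLPClassifier exposes no loss parameter and that the value it reports and stores in .loss_ is the log-loss:

# Synthetic data; the printed training loss and the .loss_ attribute are the log-loss.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)
mlp.fit(X, y)
print(mlp.loss_)   # final cross-entropy / log-loss value after training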

Related

Is the loss function='Multiclass' in catboost same as log loss if I am doing a multiclassification problem?

I am making a multiclass prediction model using CatBoost. The final solution should have minimum Logloss error, but Logloss is not present in CatBoost; they have something called 'MultiClass' as the loss function. Are they both the same? If not, how can I measure the accuracy of the CatBoost model in terms of Logloss?
Are they both the same? Effectively, yes.
The CatBoost documentation describes the calculation of 'MultiClass' loss as what is generally considered the multinomial/multiclass cross-entropy loss. That is, effectively, a log-softmax applied to the classifier output 'a' to produce values that can be interpreted as probabilities, followed by the negative log likelihood loss (NLLLoss); see wiki1 & wiki2.
Their documentation also describes the calculation of 'LogLoss', which is again NLLLoss, but applied to 'p', which they describe as the result of applying the sigmoid function to the classifier output. Since NLLLoss is reworked for the binary problem, only a single class probability is calculated, using 'p' and '1-p' for the two classes. In this special (binary) case, using sigmoid and softmax are equivalent.
How can I measure the CatBoost model in terms of Logloss?
They describe a method to produce desired metrics on given data.
Be careful not to confuse the loss/objective function ('loss_function') with the evaluation metric ('eval_metric'); in this instance, however, the same function can be used for both, as listed in their supported metrics, as sketched below.
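A minimal sketch of that distinction (assumes the catboost package is installed and uses synthetic data for illustration):

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_informative=6, n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = CatBoostClassifier(
    loss_function='MultiClass',   # objective optimized during training
    eval_metric='MultiClass',     # metric reported on eval_set (multiclass log-loss)
    iterations=200,
    verbose=50,
)
model.fit(X_train, y_train, eval_set=(X_val, y_val))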
Hope this helps!
Log loss is not a loss function but a metric to measure the performance of a classification model where the prediction is a probability value between 0 and 1.
Learn more here.

Why should we use Lasso over Linear regression for feature selection in machine learning?

While selecting features in machine learning, one can use Lasso regression to figure out the least-needed features by looking for the smallest coefficients, but we can do the same using linear regression.
Linear regression:
Y = x0 + x1*b1 + x2*b2 + ... + xn*bn
Here x1, x2, ..., xn are the coefficients. Using gradient descent we get the best coefficients, and we can remove the features that have the smallest coefficients. If this is possible using linear regression, why should one use Lasso regression?
Am I missing something? Please help.
Lasso is a regularization technique for avoiding overfitting when you train your model. When you do not use any regularization technique, your loss function just tries to minimize the difference between the predicted value and the real value, min |y_pred - y|.
To minimize this loss function, gradient descent changes the coefficients of your model. This step may cause overfitting, because the optimization only wants to minimize the difference between prediction and real value. To address this, regularization techniques add another penalty term to the loss function: the magnitude of the coefficients. In this way, when your model tries to minimize the difference between predicted and real values, it also tries not to let the coefficients grow too large.
As you mentioned, you can select features in both ways; however, the Lasso technique also takes care of the overfitting problem.
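A minimal sketch of the practical difference (synthetic data, not from the question): with plain LinearRegression the coefficients of uninformative features merely shrink, while Lasso's L1 penalty can drive them exactly to zero, which is what makes it convenient for feature selection.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Lasso

# 10 features, only 3 of which actually influence the target.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # small but non-zero everywhere
print("Lasso coefficients:", np.round(lasso.coef_, 2))  # exact zeros on uninformative features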

Activation functions in Neural Networks

I have a few questions related to the usage of various activation functions in neural networks. I would highly appreciate it if someone could give good explanatory answers.
Why is ReLU used only on hidden layers specifically?
Why is sigmoid not used in multi-class classification?
Why do we not use any activation function in regression problems having all negative values?
Why do we use average='micro', 'macro', 'weighted' while calculating performance metrics in multi-class classification?
I'll answer the first two questions to the best of my ability:
ReLU (= max(0, x)) is used to extract feature maps from the data. This is why it is used in the hidden layers, where we are learning what important characteristics or features the data holds that could let the model learn how to classify, for example. In the fully connected layers, it's time to make a decision about the output, so we usually use sigmoid or softmax, which give us numbers between 0 and 1 (probabilities) that can be interpreted as a result.
Sigmoid gives an independent probability for each class. So, if you have 10 classes, you'll have 10 probabilities, and depending on the threshold used, your model might predict, for example, that an image corresponds to two classes, when in multi-class classification you want just one predicted class per image. That's why softmax is used in this context: it normalizes the scores into a single probability distribution, and the class with the maximum probability is predicted, so you get just one class.
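A minimal sketch of that difference in plain NumPy, with illustrative logits (not from the question): sigmoid scores each class independently, so several classes can exceed a 0.5 threshold at once, while softmax normalizes the scores into a single distribution that sums to 1.

import numpy as np

logits = np.array([2.0, 1.5, -0.5])          # hypothetical raw outputs for 3 classes

sigmoid = 1 / (1 + np.exp(-logits))
softmax = np.exp(logits) / np.exp(logits).sum()

print(sigmoid)   # ~[0.88 0.82 0.38] -> two classes above a 0.5 threshold
print(softmax)   # ~[0.59 0.36 0.05] -> sums to 1, one clear winner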

Need help choosing loss function

I have used resnet50 to solve a multi-class classification problem. The model outputs probabilities for each class. Which loss function should I choose for my model?
After choosing binary cross entropy:
After choosing categorical cross entropy:
The above results are for the same model with just different loss functions. This model is supposed to classify images into 26 classes, so categorical cross entropy should work.
Also, in the first case the accuracy is about 96% but the losses are so high. Why?
edit 2:
Model architecture:
You definitely need to use categorical_crossentropy for a multi-class classification problem. binary_crossentropy will reduce your problem to a set of binary decisions in a way that's unclear without looking into it further.
I would say the reason you are seeing high accuracy in the first (and to some extent the second) case is that you are overfitting. The first dense layer you are adding contains 8 million parameters (run model.summary() to see this), and you only have 70k images to train it over 8 epochs. This architectural choice is very demanding both in computing power and in data. You are also using a very basic optimizer (SGD); try the more powerful Adam.
Finally, I am a bit surprised at your choice of a 'sigmoid' activation function in the output layer. Why not the more classic 'softmax'?
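A minimal sketch of that advice in Keras (assumed 224x224 RGB inputs and a global-average-pooled ResNet50 base, not the asker's exact architecture): softmax over the 26 classes, Adam optimizer, categorical_crossentropy loss.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(include_top=False, weights='imagenet',
                input_shape=(224, 224, 3), pooling='avg')

model = models.Sequential([
    base,
    layers.Dense(26, activation='softmax'),   # one probability per class, summing to 1
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # expects one-hot encoded labels
              metrics=['accuracy'])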
For a multi-class classification problem you use the categorical_crossentropy loss, since what it does is match the ground-truth probability distribution with the one predicted by the model.
This is exactly what is used for multi-class classification; you have a misconception if you think you can't use this loss.

How to use Theano/TensorFlow/Keras for running SGD without neural nets?

Given a model equation, a specific loss function and gradients (that I've already derived), how do I use something like Theano/TensorFlow (or Keras since it's more generic) to train the model without using neural nets?
I simply want to use SGD to minimize the regularized logistic loss. Is this a good example: http://www.deeplearning.net/tutorial/logreg.html ?
Equations (1) and (2) of http://arxiv.org/pdf/1510.04935v2.pdf are, for instance, something I'm trying to work with.
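One possible sketch of the idea in modern TensorFlow 2 (rather than Theano), under the assumption of synthetic data and a plain L2-regularized logistic loss, with no neural-network layers at all, just a weight vector and bias updated by the SGD optimizer:

import numpy as np
import tensorflow as tf

# Synthetic binary-classification data (illustrative assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)).astype("float32")
y = (X @ rng.normal(size=10) > 0).astype("float32")

w = tf.Variable(tf.zeros([10]))
b = tf.Variable(0.0)
lam = 1e-3                                  # L2 regularization strength
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

# Full-batch updates for brevity; split X into shuffled mini-batches for true SGD.
for epoch in range(100):
    with tf.GradientTape() as tape:
        logits = tf.linalg.matvec(X, w) + b
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
        ) + lam * tf.reduce_sum(w ** 2)      # regularized logistic loss
    grads = tape.gradient(loss, [w, b])
    opt.apply_gradients(zip(grads, [w, b]))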
