I started machine learning with sci-kit learn and came across various models in machine learning.
In every model, there was a fit() function.
Although I read many blog posts and came to know that fit() helps us to find the parameter of the model.
For example in Linear Regression model, fit() function helps to find the slope and intercept.
But I am still not able to understand the behind logic of fit() function.
In general at least for predictive models, fit() takes data that you want to use to train some model so that it can make predictions about other related data. Each type of model has different constraints and different types of patterns it attempts to extract from the data. In one dimensional linear regression, fit() is looking for a linear (straight line) relationship in the data and finds a linear function (slope and intercept) that minimizes the sum of squared differences between the function and the data points provided.
Related
I have a binary classification problem I'm trying to tackle in Keras. To start, I was following the usual MNIST example, using softmax as the activation function in my output layer.
However, in my problem, the 2 classes are highly unbalanced (1 appears ~10 times more often than the other). And what's even more critical, they are non-symmetrical in the way they may be mistaken.
Mistaking an A for a B is way less severe than mistaking a B for an A. Just like a caveman trying to classify animals into pets and predators: mistaking a pet for a predator is no big deal, but the other way round will be lethal.
So my question is: how would I model something like this with Keras?
thanks a lot
A non-exhaustive list of things you could do:
Generate a balanced data set using data augmentations. If the data are images, you can add image augmentations in a custom data generator that will output balanced amounts of data from each class per batch and save the results to a new data set. If the data are tabular, you can use a library like imbalanced-learn to perform over/under sampling.
As #Daniel said you can use class_weights during training (in the fit method) in a way that mistakes on important class are penalized more. See this tutorial: Classification on imbalanced data. The same idea can be implemented with a custom loss function with/without class_weights during training.
We know we need 4 things for building a machine learning algorithm:
A Dataset
A Model
A cost function
An optimization procedure
Taking the example of linear regression (y = m*x +q) we have two most common way of finding the best parameters: using ML or MSE as cost functions.
We hypotize data are Gaussian-distributed, using ML.
Is this assumption part of the model, also?
It it's not, why? Is it part of the cost function?
I can't see the "edge" of the model, in this case.
Is this assumption part of the model, also?
Yes it is. The ideas of different loss functions derived from the nature of the problem, consequently the nature of the model.
MSE by definition calculates for the mean of the squares of the errors (error means the difference between real y and predicted y) which in its turn will be high if the data is not Gaussian-Like distributed. Just imagine a few extreme values among the data, what will happen to the line slope and consequently the residual error?
It is worth mentioning the assumptions of Linear Regression:
Linear relationship
Multivariate normality
No or little multicollinearity
No auto-correlation
Homoscedasticity
If it's not, why? Is it part of the cost function?
As far I have seen, the assumption is not directly related to the cost function itself, rather related -as above-mentioned- to the model itself.
For example, Support Vector Machine idea is separation of classes. That’s finding out a line/ hyper-plane (in multidimensional space that separate outs classes), thus its cost function is Hinge Loss to "maximum-margin" of classification.
On the other hand, Logistic Regression uses Log-Loss (related to cross-entropy) because the model is binary and works on the probability of the output (0 or 1). And the list goes on...
The assumption that the data is Gaussian-distributed is part of the model in the sense that, for Gaussian distributed data the minimal Mean Squared Error also yields the maximum liklelihood solution for the data, given the model parameters. (Common proof, you can look it up if you are interested).
So you could say that the Gaussian distribution assumption justifies the choice of least squares as the loss function.
I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks and therefore I strongly suggest you start here ( Michael Nielsen's book ).
It is python-oriented book with graphical, textual and formulated explanations - great for beginners. I am confident that you will find this book useful for your understanding. Look for chapters 2 and 3 to address your problems.
Addressing your question about the Sigmoids, it is possible to use it for multiclass predictions, but not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)) where z is the scalar multiplication of the previous hidden layer (or inputs) and a row of the weights matrix, in addition to a bias (reminder: z=w_i . x + b where w_i is the i-th row of the weight matrix ). This activation is independent of the others rows of the matrix.
Classification tasks are regarding categories. Without any prior knowledge ,and even with, most of the times, categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding for categories usually performs better than predicting a category number using a single activation function.
To recap, we want an output layer with number of neurons equals to number of categories, and sigmoids are independent of each other, given the previous layer values. We also would like to predict the most probable category, which implies that we want the activations of the output layer to have a meaning of probability disribution. But Sigmoids are not guaranteed to sum to 1, while softmax activation does.
Using L2-loss function is also problematic due to vanishing gradients issue. Shortly, the derivative of the loss is (sigmoid(z)-y) . sigmoid'(z) (error times the derivative), that makes this quantity small, even more when the sigmoid is closed to saturation. You can choose cross entropy instead, or a log-loss.
EDIT:
Corrected phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we used today as categorical predictions for definite finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, one-hot-encoding and cross entropy is a very common practice. It is reasonable to use that if the aforementioned is correct. However, there are (many) cases it doesn't apply. For instance, when trying to balance the data. For some tasks, e.g. semantic segmentation tasks, categories can have ordering/distance between them (or their embeddings) with meaning. So please, choose wisely the tools for your applications, understanding what their doing mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when the class become 2, the softmax function will be the same as sigmoid, so yes they are related. Cross entropy maybe the best loss function.
For the backpropgation, it is not easy to find the formula...there
are many ways.Since the help of CUDA, I don't think it is necessary to spend much time on it if you just want to use the NN or CNN in the future. Maybe try some framework like Tensorflow or Keras(highly recommand for beginers) will help you.
There is also many other factors like methods of gradient descent, the setting of hyper parameters...
Like I said, the topic is very abroad. Why not trying the machine learning/deep learning courses on Coursera or Stanford online course?
So after you have a machine learning algorithm trained, with your layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using MultiLayer Perceptron (neural networks).
From what I currently understand, you start with your input vector to be predicted. Then you send it to your hidden layer(s) where it adds your bias term to each data point, then adds the sum of the product of each data point and the weight for each node (found in training), then runs that through the same activation function used in training. Repeat for each hidden layer, then does the same for your output layer. Then each node in the output layer is your prediction(s).
Is this correct?
I got confused when using opencv to do this, because in the guide it says when you use the function predict:
If you are using the default cvANN_MLP::SIGMOID_SYM activation
function with the default parameter values fparam1=0 and fparam2=0
then the function used is y = 1.7159*tanh(2/3 * x), so the output
will range from [-1.7159, 1.7159], instead of [0,1].
However, when training it is also stated in the documentation that SIGMOID_SYM uses the activation function:
f(x)= beta*(1-e^{-alpha x})/(1+e^{-alpha x} )
Where alpha and beta are user defined variables.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for the time!
The documentation where this is found is here:
reference to the tanh is under function descriptions predict.
reference to activation function is by the S looking graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.
I would suggest that you read about appropriate algorithm that your are using or plan to use. To be honest there is no one definite algorithm to solve a problem but you can explore what features you got and what you need.
Regarding how an algorithm performs prediction is totally depended on the choice of algorithm. Support Vector Machine (SVM) performs prediction by fitting hyperplanes on the feature space and using some metric such as distance for learning and than the learnt model is used for prediction. KNN on the other than uses simple nearest neighbor measurement for prediction.
Please do more work on what exactly you need and read through the research papers to get proper understanding. There is not magic involved in prediction but rather mathematical formulations.
I've been studying neural networks for a bit and recently learned about the dropout training algorithm. There are excellent papers out there to understand how it works, including the ones from the authors.
So I built a neural network with dropout training (it was fairly easy) but I'm a bit confused about how to perform model selection. From what I understand, looks like dropout is a method to be used when training the final model obtained through model selection.
As for the test part, papers always talk about using the complete network with halved weights, but they do not mention how to use it in the training/validation part (at least the ones I read).
I was thinking about using the network without dropout for the model selection part. Say that makes me find that the net performs well with N neurons. Then, for the final training (the one I use to train the network for the test part) I use 2N neurons with dropout probability p=0.5. That assures me to have exactly N neurons active on average, thus using the network at the right capacity most of the time.
Is this a correct approach?
By the way, I'm aware of the fact that dropout might not be the best choice with small datasets. The project I'm working on has academic purposes, so it's not really needed that I use the best model for the data, as long as I stick with machine learning good practices.
First of all, model selection and the training of a particular model are completely different issues. For model selection, you would usually need a data set that is completely independent of both training set used to build the model and test set used to estimate its performance. So if you're doing for example a cross-validation, you would need an inner cross-validation (to train the models and estimate the performance in general) and an outer cross-validation to do the model selection.
To see why, consider the following thought experiment (shamelessly stolen from this paper). You have a model that makes a completely random prediction. It has a number of parameters that you can set, but have no effect. If you're trying different parameter settings long enough, you'll eventually get a model that has a better performance than all the others simply because you're sampling from a random distribution. If you're using the same data for all of these models, this is the model you will choose. If you have a separate test set, it will quickly tell you that there is no real effect because the performance of this parameter setting that achieves good results during the model-building phase is not better on the separate set.
Now, back to neural networks with dropout. You didn't refer to any particular paper; I'm assuming that you mean Srivastava et. al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". I'm not an expert on the subject, but the method to me seems to be similar to what's used in random forests or bagging to mitigate the flaws an individual learner may exhibit by applying it repeatedly in slightly different contexts. If I understood the method correctly, essentially what you end up with is an average over several possible models, very similar to random forests.
This is a way to make an individual model better, but not for model selection. The dropout is a way of adjusting the learned weights for a single neural network model.
To do model selection on this, you would need to train and test neural networks with different parameters and then evaluate those on completely different sets of data, as described in the paper I've referenced above.