What determines the capacity of a machine learning model? - machine-learning

For a KNN model, is the value of K inversely proportional to its capacity/complexity? How can one determine whether a curve is a training curve or a validation curve based on its Loss Vs Capacity graph? Thanks! ML newbie here.

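K in KNN is a hyperparameter that sets how many of the nearest neighbors get a vote when making a prediction for a sample. What do you mean by the capacity of the model? If you mean how flexibly the model can fit the training data, then the smaller K is, the more complex the model: with K = 1 every training point is its own nearest neighbor, so the training error is zero, while a large K averages over many neighbors and gives a smoother decision boundary.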
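To make the link between K and capacity concrete, here is a minimal sketch in plain Python. The toy 1-D data set, its two deliberately mislabeled points, and the helper names `knn_predict` and `error_rate` are assumptions made for illustration only. It also hints at how to tell the two curves apart on a loss-versus-capacity plot: the training error keeps falling as capacity grows (K shrinks toward 1), whereas the validation error typically falls and then rises again.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D inputs)."""
    neighbors = sorted(train, key=lambda point: abs(point[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def error_rate(train, data, k):
    """Fraction of points in `data` that a KNN model built on `train` gets wrong."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)

# Toy data (an assumption): the true label is 1 for x >= 5,
# but x = 2 and x = 7 carry flipped, noisy labels in the training set.
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]
validation = [(0.4, 0), (1.4, 0), (2.4, 0), (3.4, 0), (4.4, 0),
              (5.4, 1), (6.4, 1), (7.4, 1), (8.4, 1), (9.4, 1)]

for k in (1, 3, 5, 9):
    print(k, error_rate(train, train, k), error_rate(train, validation, k))
```

On this toy data, K = 1 reproduces even the two noisy training labels (zero training error) but repeats those mistakes on nearby validation points, while K = 9 ignores them.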

Related

h2o GLM with regularization

I am using the h2o package in R to fit a GLM via the h2o.glm() function. One reasonable way to assess feature importance in a GLM with the l1 regularization penalty is to monitor the order in which parameters enter the linear predictor (i.e. the model) as the l1 penalty weight decreases over each successive lambda. I cannot find in the h2o documentation whether it is possible to extract this information from a returned model object.
Does anyone know if it is possible to view the fitted model form after each successive lambda?
Thanks,
You can get the full regularization path with h2o.getGLMFullRegularizationPath(my_glm), where my_glm is the GLM you trained. Just remember to set lambda_search to TRUE, i.e. my_glm = h2o.glm(x, y, training_frame, lambda_search = TRUE).

Create negative examples in dataset with only positive ones

Imagine we have a classification problem on a dataset where all examples are positive (or, equivalently, all negative). For instance, consider a problem where the winning class is specified by position (e.g. a tennis dataset in which the first player listed is always the winner). How can we create negative examples in order to train a supervised learning algorithm on this dataset? One idea is to generate negative examples by exchanging the positions of the features tied to each class. Do you think this would give an unbiased dataset? Could we create negative duplicates of our original dataset and train a supervised learning algorithm on the doubled dataset?
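The swapping idea from the question can be sketched as follows. This is only an illustration under assumptions: the function name `mirror_matches` and the feature names `rank` and `aces` are hypothetical, and each match is assumed to be stored as a (winner features, loser features) pair.

```python
def mirror_matches(matches):
    """Return every match twice: as recorded (label 1, first player won)
    and with the two players' features swapped (label 0, first player lost)."""
    augmented = []
    for winner_features, loser_features in matches:
        augmented.append((winner_features, loser_features, 1))
        augmented.append((loser_features, winner_features, 0))
    return augmented

# Hypothetical feature dictionaries, for illustration only.
matches = [
    ({"rank": 3, "aces": 12}, {"rank": 10, "aces": 7}),
    ({"rank": 25, "aces": 9}, {"rank": 8, "aces": 11}),
]
dataset = mirror_matches(matches)
print(len(dataset), sum(label for _, _, label in dataset))
```

The result is balanced by construction, but the two copies of a match are not independent samples, so it seems safer to keep both copies in the same split when dividing the data into training and validation sets.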

How to interpret loss and accuracy for a machine learning model [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
When I train my neural network with Theano or TensorFlow, it reports a variable called "loss" per epoch.
How should I interpret this variable? Is a higher loss better or worse, and what does it mean for the final performance (accuracy) of my neural network?
The lower the loss, the better the model (unless the model has over-fitted to the training data). The loss is calculated on the training and validation sets, and it indicates how well the model is doing on each of them. Unlike accuracy, loss is not a percentage. It is a summation of the errors made for each example in the training or validation set.
In the case of neural networks, the loss is usually negative log-likelihood and residual sum of squares for classification and regression respectively. Then naturally, the main objective in a learning model is to reduce (minimize) the loss function's value with respect to the model's parameters by changing the weight vector values through different optimization methods, such as backpropagation in neural networks.
Loss value implies how well or poorly a certain model behaves after each iteration of optimization. Ideally, one would expect the reduction of loss after each, or several, iteration(s).
The accuracy of a model is usually determined after the model parameters are learned and fixed and no learning is taking place. Then the test samples are fed to the model and the number of mistakes (zero-one loss) the model makes are recorded, after comparison to the true targets. Then the percentage of misclassification is calculated.
For example, if the number of test samples is 1000 and model classifies 952 of those correctly, then the model's accuracy is 95.2%.
There are also some subtleties in reducing the loss value. For instance, you may run into the problem of over-fitting, in which the model "memorizes" the training examples and becomes ineffective on the test set. Over-fitting tends to occur when you do not employ regularization, when you have a very complex model (the number of free parameters W is large), or when the number of data points N is very low.
They are two different metrics to evaluate your model's performance usually being used in different phases.
Loss is often used in the training process to find the "best" parameter values for your model (e.g. weights in neural network). It is what you try to optimize in the training by updating weights.
Accuracy is more from an applied perspective. Once you have found the optimized parameters above, you use this metric to evaluate how accurate your model's predictions are compared to the true data.
Let us use a toy classification example. You want to predict gender from a person's weight and height. You have 3 data points, as follows (0 stands for male, 1 stands for female):
y1 = 0, x1_w = 50kg, x1_h = 160cm;
y2 = 0, x2_w = 60kg, x2_h = 170cm;
y3 = 1, x3_w = 55kg, x3_h = 175cm;
You use a simple logistic regression model, y = 1/(1+exp(-(b1*x_w+b2*x_h))).
How do you find b1 and b2? You define a loss first and then use an optimization method to minimize the loss iteratively by updating b1 and b2.
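In our example, a typical loss for this binary classification problem is the negative log-likelihood (cross-entropy):
loss = -sum_{i=1..3} [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]
where y_hat_i is the model output for data point i.
We don't know what b1 and b2 should be. Let us make a random guess, say b1 = 0.1 and b2 = -0.03. Then what is our loss now?
The model outputs are y_hat_1 ≈ 0.550, y_hat_2 ≈ 0.711 and y_hat_3 ≈ 0.562, so the loss is -[log(1 - 0.550) + log(1 - 0.711) + log(0.562)] ≈ 2.62 (using natural logarithms).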
Then your learning algorithm (e.g. gradient descent) will find a way to update b1 and b2 to decrease the loss.
What if b1 = 0.1 and b2 = -0.03 are the final b1 and b2 (the output of gradient descent)? What is the accuracy now?
Let's assume that if y_hat >= 0.5, we predict female (1); otherwise we predict male (0). Our algorithm therefore predicts y1 = 1, y2 = 1 and y3 = 1. What is our accuracy? We make wrong predictions on y1 and y2 and a correct one on y3, so our accuracy is 1/3 = 33.33%.
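The toy example can be checked with a few lines of Python. This is a sketch that assumes the loss is the binary cross-entropy (negative log-likelihood) with natural logarithms; the function names are made up for illustration.

```python
import math

# (weight in kg, height in cm, label) with 0 = male, 1 = female
data = [(50, 160, 0), (60, 170, 0), (55, 175, 1)]

def predict_proba(b1, b2, weight, height):
    """Logistic regression output y_hat for one person."""
    return 1.0 / (1.0 + math.exp(-(b1 * weight + b2 * height)))

def cross_entropy(b1, b2, data):
    """Negative log-likelihood summed over all data points (assumed loss)."""
    total = 0.0
    for weight, height, y in data:
        p = predict_proba(b1, b2, weight, height)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total

def accuracy(b1, b2, data, threshold=0.5):
    """Fraction of points whose thresholded prediction matches the label."""
    correct = sum(
        (predict_proba(b1, b2, weight, height) >= threshold) == bool(y)
        for weight, height, y in data
    )
    return correct / len(data)

loss = cross_entropy(0.1, -0.03, data)
acc = accuracy(0.1, -0.03, data)
print(f"loss = {loss:.3f}, accuracy = {acc:.2%}")
```

With b1 = 0.1 and b2 = -0.03 this gives a loss of about 2.615 and an accuracy of 1/3, which matches the hand calculation: all three outputs are above 0.5, so only the third point is classified correctly.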
PS: In Amir's answer, back-propagation is said to be an optimization method in NNs. I think it is better described as a way to compute the gradients of the loss with respect to the weights in an NN. Common optimization methods in NNs are gradient descent and Adam.
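The distinction between computing the gradient and using it to update the parameters can be sketched on the same toy data. This assumes a cross-entropy loss; the learning rate of 1e-5 is an arbitrary choice that happens to be small enough for these unscaled features, and the function names are hypothetical.

```python
import math

# (weight in kg, height in cm, label) from the toy example
data = [(50, 160, 0), (60, 170, 0), (55, 175, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(b1, b2):
    """Cross-entropy loss summed over the data (assumed form of the loss)."""
    total = 0.0
    for w, h, y in data:
        p = sigmoid(b1 * w + b2 * h)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gradient(b1, b2):
    """Gradient computation: d(loss)/d(b1, b2) = sum of (y_hat - y) * x.
    In a neural network, back-propagation plays this role."""
    g1 = g2 = 0.0
    for w, h, y in data:
        residual = sigmoid(b1 * w + b2 * h) - y
        g1 += residual * w
        g2 += residual * h
    return g1, g2

def gradient_descent_step(b1, b2, learning_rate=1e-5):
    """Optimizer: uses the gradient to move the parameters downhill."""
    g1, g2 = gradient(b1, b2)
    return b1 - learning_rate * g1, b2 - learning_rate * g2

b1, b2 = 0.1, -0.03
before = loss(b1, b2)
b1, b2 = gradient_descent_step(b1, b2)
after = loss(b1, b2)
print(before, after)  # the loss should be lower after the step
```

Swapping `gradient_descent_step` for another update rule (such as Adam) would change the optimizer while leaving the gradient computation untouched.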
Just to clarify the Training/Validation/Test data sets:
The training set is used to fit the model, i.e. to learn the weights of the neural network.
The validation set is used after the neural network has been trained. It is used for tuning the network's hyperparameters and for comparing how changes to them affect the predictive accuracy of the model. Whereas the training set can be thought of as being used to build the neural network's weights, the validation set allows fine-tuning of the hyperparameters or architecture of the neural network model. It is useful because it allows repeatable comparison of these different hyperparameters/architectures against the same data, to observe how such changes affect the predictive power of the network.
Then the test set is used only to test the predictive accuracy of the trained neural network on previously unseen data, after training and parameter/architecture selection with the training and validation data sets.
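As a rough sketch of the three-way split described above, in plain Python: the 70/15/15 proportions, the fixed seed, and the function name `train_val_test_split` are assumptions, not a recommendation.

```python
import random

def train_val_test_split(samples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once, then cut the data into three disjoint parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]                  # touched only once, at the very end
    val = shuffled[n_test:n_test + n_val]     # used to compare hyperparameters
    train = shuffled[n_test + n_val:]         # used to fit the weights
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The important property is that the three parts do not overlap, so the test set stays unseen until training and hyperparameter selection are finished.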

How to get scores along with Machine Learning results?

I am new to machine learning and I am currently working on a classification problem. I am able to train the model and predict on test data sets. I want to know whether there is some way to get scores along with the predictions. By scores, I mean confidence scores that accompany each prediction. For example, in the standard age-salary-buy classification problem (whether the customer will buy the product, based on age and salary), I want a score out of 100 for how likely he is to buy the product, in addition to the prediction of whether he will buy it or not.
Currently, I am using the LibSVM algorithm. Is there some algorithm that provides the above data?
Thanks.
What you are looking for is the support (confidence) behind the decision. Many classifiers choose the class of x among the labels Y as:
cl(x) = arg max_{y \in Y} p(y|x)
where p(y|x) is their internal estimate of the probability that x has label y. Such classifiers include:
neural networks (with sigmoid output)
logistic regression
naive bayes
voting ensembles (such as RF)
...
The outputs of these methods can easily be converted to your 0-100 scale, since a probability lies on a 0-1 scale.
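A minimal sketch of that conversion, assuming the classifier already returns one probability per label (the label names and the function name `predict_with_score` are hypothetical):

```python
def predict_with_score(class_probabilities):
    """Return (predicted label, score out of 100) from a dict of p(y|x) per label."""
    label = max(class_probabilities, key=class_probabilities.get)  # arg max over labels
    return label, round(100 * class_probabilities[label])

label, score = predict_with_score({"buy": 0.83, "no_buy": 0.17})
print(label, score)
```

The predicted label is the arg max from the formula above, and the score is simply that label's probability rescaled to 0-100.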
Others, such as SVMs, produce an unbounded score (often called the decision function) that grows with the classifier's confidence but is not a probability. You can obtain this value, but you cannot convert it directly to a 0-100 score, as there is no "maximum" value. This is a big drawback, so some modifications have been proposed. In particular, for SVMs there is Platt scaling, which fits a logistic regression on top of the SVM outputs so that you get a probability estimate. In libSVM you can set -b to get probability estimates.
From the libsvm website:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
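To illustrate the idea behind Platt scaling, here is a simplified sketch in plain Python. It is not libSVM's implementation: it fits the two parameters A and B of P(y=1|f) = 1/(1 + exp(A*f + B)) by plain gradient descent on the log-loss, and the decision values, labels, learning rate, and step count are all made-up assumptions.

```python
import math

def platt_probability(f, A, B):
    """Map an unbounded decision value f to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(A * f + B))

def fit_platt(decision_values, labels, learning_rate=0.1, steps=2000):
    """Fit A and B by gradient descent on the average negative log-likelihood."""
    A, B = 0.0, 0.0
    n = len(labels)
    for _ in range(steps):
        grad_A = grad_B = 0.0
        for f, y in zip(decision_values, labels):
            p = platt_probability(f, A, B)
            grad_A += (y - p) * f   # d(NLL)/dA for this sign convention
            grad_B += (y - p)       # d(NLL)/dB
        A -= learning_rate * grad_A / n
        B -= learning_rate * grad_B / n
    return A, B

# Hypothetical SVM decision values and true labels (not linearly separable).
decision_values = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 1, 0, 1, 1]

A, B = fit_platt(decision_values, labels)
scores = [round(100 * platt_probability(f, A, B)) for f in decision_values]
print(A, B, scores)
```

A negative A means that larger decision values map to higher probabilities, which can then be rescaled to a 0-100 score like any other probability.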

Speeding up the classification process - PCA combined with SVM?

I have a cyclic method running which collects a data set of 15,000 feature vectors with 30 dimensions (every 200 ms). My current setup simply feeds all raw feature vectors to an SVM with an RBF (radial basis function) kernel. The classification result is rather unconvincing, and the process is costly in terms of time. I know that the dataset isn't that big, so classification in real time could be possible with the right subsampling of the feature vectors or similar. The goal is to speed up the entire classification process (training/prediction) so that it takes a few milliseconds. To obtain an unsupervised classification approach, I currently run k-means to label the feature vectors. I pick a few of the resulting clusters and assign them class 1 and all others class 0.
The idea now is the following:
collect all 15,000 (N) feature vectors with 30 (D) dimensions
run PCA on all N feature vectors
use the eigenvalues to determine a feature vector with d dimensions (d < D)
feed the new set of n < N feature vectors (or the eigenvectors?) to train the SVM
Maybe a KNN approach instead of the SVM would give a similar result?
Does this approach make sense?
Any ideas to improve the process or change it in order to speed it up?
How do I determine the best value of d?
The classification accuracy shouldn't suffer too much from the time reduction.
EDIT: Data stream mining
I was just reading about Data Stream Mining. I think this topic fits my setup quite well since I have to extract knowledge structures from continuous, rapid data records. Maybe I should replace the SVM with a Gradient Boosted Tree?
Thanks!
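On the question of how to choose d, one common heuristic is to keep the smallest number of principal components whose eigenvalues account for a chosen share of the total variance. A minimal sketch, where the 95% default, the example eigenvalues, and the function name `choose_num_components` are assumptions:

```python
def choose_num_components(eigenvalues, min_explained=0.95):
    """Smallest d such that the d largest eigenvalues explain
    at least `min_explained` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for d, value in enumerate(ordered, start=1):
        running += value
        if running / total >= min_explained:
            return d
    return len(ordered)

# Hypothetical PCA eigenvalues, for illustration only.
eigenvalues = [5.0, 3.0, 1.0, 0.5, 0.5]
d = choose_num_components(eigenvalues, min_explained=0.85)
print(d)
```

Since the stated goal is classification accuracy rather than reconstruction, it may be worth also checking the chosen d against validation accuracy instead of relying on explained variance alone.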

Resources