What does it mean to provide weights to each sample for
classification? How does a classification algorithm like logistic regression or an SVM use weights to emphasize certain examples more than others? I would love to go into the details of how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
This option is meant for imbalanced datasets. Let's take an example: I have a lot of data and some of it is just noise, but other points are really important to me and I'd like my algorithm to consider them much more than the rest. So I assign a weight to those points to make sure they are dealt with properly.
It changes the way the loss is calculated: the error (residual) of each point is multiplied by that point's weight, so the minimum of the objective function is shifted. I hope that's clear enough. I don't know if you're familiar with the math behind it, so here is a short introduction so you have everything at hand (apologies if this wasn't needed):
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
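For example, here is a minimal scikit-learn sketch (the data and weights are invented for illustration): up-weighting a few points makes their residuals count more in the weighted loss, so the fitted decision boundary shifts.

    # Minimal sketch: the weighted loss is sum_i w_i * loss_i, so up-weighting
    # some points pulls the fitted decision boundary toward classifying them correctly.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    X = np.concatenate([rng.normal(-1, 1, 50), rng.normal(1, 1, 50)]).reshape(-1, 1)
    y = np.array([0] * 50 + [1] * 50)

    # Treat the last 10 positive points as 5x more important than the rest.
    weights = np.ones(len(y))
    weights[-10:] = 5.0

    unweighted = LogisticRegression().fit(X, y)
    weighted = LogisticRegression().fit(X, y, sample_weight=weights)

    # The decision boundary is where the predicted probability crosses 0.5:
    # x* = -intercept / coefficient.  Up-weighting shifts it.
    print("unweighted boundary:", -unweighted.intercept_[0] / unweighted.coef_[0, 0])
    print("weighted boundary:  ", -weighted.intercept_[0] / weighted.coef_[0, 0])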
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .
Related
I have a set of labeled training data, and I am training a ML algorithm to predict the label. However, some of my data points are more important than others. Or, analogously, these points have less uncertainty than the others.
Is there a general method to include an importance-representing weight to each training point in the model? Are there instead some specific models which are capable of this while others are not?
I can imagine duplicating these points (and perhaps smearing their features slightly to avoid exact duplicates), or downsampling the less important points. Is there a more elegant way to approach this problem?
Scikit-learn allows you to pass an array of sample weights while fitting the model. Vowpal Wabbit (an online ML library) also has this option.
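As a small illustration of why weights are more elegant than duplication: for loss-based learners, giving a point weight k enters the objective exactly like repeating that point k times. A minimal scikit-learn sketch (toy data invented here):

    # A sample_weight of 3 on one point behaves like repeating that point 3 times,
    # since both multiply its contribution to the loss by 3.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    # Give the last point weight 3 ...
    w = np.array([1.0, 1.0, 1.0, 3.0])
    clf_weighted = LogisticRegression().fit(X, y, sample_weight=w)

    # ... versus physically repeating it twice more.
    X_dup = np.vstack([X, X[[3, 3]]])
    y_dup = np.concatenate([y, y[[3, 3]]])
    clf_dup = LogisticRegression().fit(X_dup, y_dup)

    print(clf_weighted.coef_, clf_weighted.intercept_)
    print(clf_dup.coef_, clf_dup.intercept_)   # essentially identical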
I am new to the concept of machine learning and I am trying to figure out this problem. I am using WEKA for this. I have 4 clusters whose means roughly form a square. The training dataset that I provide to Naive Bayes has 2 classes, where opposite clusters (across the center of the plot) are in the same class. The accuracy for this model is not even 50 percent, but when I change the classes from opposite to the same side, the accuracy becomes 100 percent. Why is this so?
Naive Bayes cannot represent the solution to that problem.
There is more than one form of naive Bayes, but none of them can handle that problem; the kinds of solutions each one can represent differ somewhat.
Try asking yourself which properties of the solution change when you "change the classes from opposite to the same side", and what it would take to represent that solution.
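To make the geometry concrete, here is a small sketch that reproduces the situation with scikit-learn's GaussianNB rather than WEKA (the cluster positions and spread are made up): pairing opposite corners into one class is an XOR-like labelling that naive Bayes cannot represent, while pairing same-side corners can be separated by looking at a single coordinate.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.RandomState(0)
    centers = np.array([[-2, -2], [-2, 2], [2, -2], [2, 2]])
    X = np.vstack([c + rng.normal(0, 0.5, (100, 2)) for c in centers])

    # Opposite corners share a class: (-2,-2) with (2,2) vs (-2,2) with (2,-2).
    y_opposite = np.array([0] * 100 + [1] * 100 + [1] * 100 + [0] * 100)
    # Same-side corners share a class: split by the sign of the first coordinate.
    y_same_side = np.array([0] * 200 + [1] * 200)

    print("opposite corners:", GaussianNB().fit(X, y_opposite).score(X, y_opposite))    # ~0.5
    print("same side:       ", GaussianNB().fit(X, y_same_side).score(X, y_same_side))  # ~1.0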
Start by loading a snapshot of the data into WEKA to understand the data. Then decide whether Naive Bayes is the right algorithm for the problem.
I was learning about different techniques for classification, like probabilistic classifiers, and stumbled upon this question: why can't we implement a binary classifier as a regression function of all the attributes and classify based on the output of the function, say, if the output is less than a certain value it belongs to class A, otherwise to class B? Is there any limitation to this method compared to a probabilistic approach?
You can do this and it is often done in practice, for example in Logistic Regression. It is not even limited to binary classes. There is no inherent limitation compared to a probabilistic approach, although you should keep in mind that both are fundamentally different approaches and hard to compare.
I think you have some misunderstanding in classification. No matter what kind of classifier you are using (svm, or logistic regression), you can always view the output model as
f(x) > b ===> positive
f(x) < b ===> negative
This applies to both probabilistic and non-probabilistic models. In fact, this is related to risk minimization, which yields the cut-off threshold naturally.
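A tiny scikit-learn sketch of this view (data invented for illustration): fit an ordinary linear regression to 0/1 labels and classify with f(x) > 0.5, next to logistic regression, which learns the same kind of linear score but through a probabilistic loss.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.RandomState(0)
    X = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)]).reshape(-1, 1)
    y = np.array([0] * 100 + [1] * 100)

    f = LinearRegression().fit(X, y)
    pred_threshold = (f.predict(X) > 0.5).astype(int)          # f(x) > b with b = 0.5
    pred_logistic = LogisticRegression().fit(X, y).predict(X)  # thresholds p(x) at 0.5

    print("regression + threshold accuracy:", (pred_threshold == y).mean())
    print("logistic regression accuracy:   ", (pred_logistic == y).mean())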
Yes, this is possible. For example, a perceptron does exactly that.
However, its use is limited to linearly separable problems; multiple perceptrons can be combined to solve arbitrarily complex problems, as in general neural networks.
Another machine learning technique, SVM, works in a similar way. It first transforms the input data into some high dimensional space and then separates it via a linear function.
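A quick sketch of the perceptron's limitation and the benefit of combining units (scikit-learn, with AND/XOR as the classic toy labels; the MLP size here is arbitrary):

    import numpy as np
    from sklearn.linear_model import Perceptron
    from sklearn.neural_network import MLPClassifier

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y_and = np.array([0, 0, 0, 1])   # linearly separable
    y_xor = np.array([0, 1, 1, 0])   # not linearly separable

    print("perceptron on AND:", Perceptron().fit(X, y_and).score(X, y_and))  # 1.0
    print("perceptron on XOR:", Perceptron().fit(X, y_xor).score(X, y_xor))  # at most 0.75
    mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs", random_state=0)
    print("small MLP on XOR: ", mlp.fit(X, y_xor).score(X, y_xor))           # typically 1.0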
I would like to classify text documents into four categories. I also have a lot of samples which are already classified and can be used for training. I would like the algorithm to learn on the fly; please suggest an optimal algorithm that works for this requirement.
If by "on the fly" you mean online learning (where training and classification can be interleaved), I suggest the k-nearest neighbor algorithm. It's available in Weka and in the package TiMBL.
A perceptron will also be able to do this.
"Optimal" isn't a well-defined term in this context.
There are several algorithms which can learn on the fly, for example: k-nearest neighbors, naive Bayes, and neural networks. You can try out how appropriate each of these methods is on a sample corpus.
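If you end up in scikit-learn rather than Weka or TiMBL, a minimal online-learning sketch looks like this (the category names and toy documents are made up): HashingVectorizer needs no fitted vocabulary, and SGDClassifier.partial_fit lets you interleave training and classification as new documents arrive.

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    classes = ["sports", "politics", "tech", "finance"]   # the four categories
    vectorizer = HashingVectorizer(n_features=2**18)      # stateless, so no refitting needed
    clf = SGDClassifier()                                 # linear model trained online

    # Initial batch of labelled documents (toy examples).
    docs = ["the team won the match", "parliament passed the bill",
            "the new phone has a faster chip", "the stock market fell today"]
    labels = ["sports", "politics", "tech", "finance"]
    clf.partial_fit(vectorizer.transform(docs), labels, classes=classes)

    # Later: classify a new document, then keep learning from its true label.
    new_doc = ["the striker scored twice"]
    print(clf.predict(vectorizer.transform(new_doc)))
    clf.partial_fit(vectorizer.transform(new_doc), ["sports"])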
Since you have unlabeled data you might want to use a model where this helps. The first thing that comes to my mind is nonlinear NCA: Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure (Salakhutdinov & Hinton).
Well... I have to say that document classification is a bit different from what you are thinking.
Typically in document classification, after preprocessing, the test data is extremely large, for example on the order of O(N^2), so it might be too computationally expensive.
Another typical classifier that comes to mind is a discriminative classifier, which doesn't need a generative model of your dataset. After training, all you have to do is feed a single entry to the algorithm and it will be classified.
Good luck with this. For example, you can check E. Alpaydin's book, Introduction to Machine Learning.
How should I approach a situation where I try to apply some ML algorithm (classification, to be more specific, an SVM in particular) over some high-dimensional input, and the results I get are not quite satisfactory?
1-, 2- or 3-dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on and have some idea how to approach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters, I am not really sure how to attack it.
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
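A one-line way to see why: if a hyperplane w·y + c = 0 separates Y1 from Y2 in R^n, then substituting y = Ax and using w·(Ax) = (Aᵀw)·x shows that (Aᵀw)·x + c = 0 is a hyperplane in R^m separating X1 from X2. Any linear separation available after the projection was therefore already available in the original space.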
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm FAQ, particularly this entry link, contains more helpful tips (a short scikit-learn sketch follows the list). Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
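A compact scikit-learn sketch of those tips (SVC wraps libsvm; the dataset and the grid values here are only illustrative): scale the features, use the RBF kernel, penalize the minority class, and cross-validate a grid of C and gamma.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=50, weights=[0.8, 0.2],
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    pipe = make_pipeline(StandardScaler(),                            # scale your features
                         SVC(kernel="rbf", class_weight="balanced"))  # RBF kernel + imbalance penalty
    param_grid = {"svc__C": [0.1, 1, 10, 100],
                  "svc__gamma": [1e-3, 1e-2, 1e-1, 1]}
    search = GridSearchCV(pipe, param_grid, cv=5)                     # cross-validate the grid
    search.fit(X_train, y_train)

    print("best parameters: ", search.best_params_)
    print("held-out accuracy:", search.score(X_test, y_test))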
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project the data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data); a short sketch of this follows these suggestions.
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
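For the visualization suggestion above, a minimal sketch (using scikit-learn's digits data as a stand-in for your own features):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, y = load_digits(return_X_y=True)              # 64-dimensional stand-in for your data
    X_2d = PCA(n_components=2).fit_transform(X)      # project to 2-D purely for plotting

    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=8, cmap="tab10")
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.title("2-D PCA projection, coloured by class")
    plt.show()

If the classes already overlap heavily in such a projection, that is a hint (though not proof) that the features may not carry enough information for the task.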
You can try reducing the dimensionality of the problem with PCA or a similar technique. Be aware that PCA has two important caveats: (1) it assumes the data it is applied to is normally distributed, and (2) the resulting data loses its natural meaning (you end up with a black box). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVMs were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM), in which they used a linear SVM to pre-select "interesting" features and then used an RBF-based SVM on the selected features. If you are familiar with Orange, a Python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness", might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve the problem with PCA (see above), you might want to turn to heuristic methods that try to select the best possible combinations of predictors.
The main pitfall of this kind of approach is the high potential for overfitting. Make sure you have a set of "virgin" data that is not seen during the entire process of model building. Test your model on that data only once, after you are sure the model is ready. If you fail, don't use this data again to validate another model; you will have to find a new data set. Otherwise you won't be sure that you haven't overfit once more.
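A rough scikit-learn approximation of that "rank features with a linear SVM, then train an RBF SVM on the survivors" idea (not the Orange implementation; the dataset and the choice of 20 features are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC, LinearSVC

    X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                               random_state=0)

    pipe = make_pipeline(
        StandardScaler(),
        SelectFromModel(LinearSVC(C=0.1, dual=False), max_features=20),  # linear-SVM feature ranking
        SVC(kernel="rbf"),                                               # RBF SVM on the selected features
    )
    print(cross_val_score(pipe, X, y, cv=5).mean())

The warning about "virgin" data still applies: the cross-validation here only evaluates the pipeline, it does not replace a final, untouched hold-out set.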
Oh, and one more thing about SVMs: an SVM is a black box. You are better off figuring out what mechanism generates the data and modelling the mechanism rather than the data. On the other hand, if that were possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this stochastic method for determining, in silico, the drug-like character of molecules.
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model (an SVM in this case) is not well suited for this problem. This can be diagnosed by trying other models (AdaBoost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex (too many parameters) and needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods for attacking the problem. To gain more insight into the problem, you can use visualization methods by projecting the data into lower dimensions, or look for models that are better suited to the problem domain as you understand it (for example, if you know the data is normally distributed you can use GMMs to model it).
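One concrete "blind" diagnostic for the training/test checks above is a learning curve: plot training and cross-validation scores as the training set grows. A scikit-learn sketch (the data and SVM parameters are invented): a large, persistent gap between the curves points at overfitting (constrain the model or get more data), while two low, converged curves suggest the model or the features are too weak for the task.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=600, n_features=30, random_state=0)
    sizes, train_scores, cv_scores = learning_curve(
        SVC(kernel="rbf", C=10, gamma=0.1), X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 8))

    plt.plot(sizes, train_scores.mean(axis=1), label="training score")
    plt.plot(sizes, cv_scores.mean(axis=1), label="cross-validation score")
    plt.xlabel("training set size"); plt.ylabel("accuracy"); plt.legend()
    plt.show()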
If I'm not wrong, you are trying to see which parameters to the SVM give you the best result. Your problem is model/curve fitting.
I worked on a similar problem a couple of years ago. There are tons of libraries and algorithms for doing this. I used the Newton-Raphson algorithm and a variation of a genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through a real-world experiment (or, if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algorithms I mentioned earlier iterate this process until the result of your model (the SVM in this case) roughly matches the expected values (note that this process could take some time depending on your problem/data size; it took about 2 months for me on a 140-node Beowulf cluster).
If you choose to go with Newton-Raphson, this might be a good place to start.
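In case it helps, here is a minimal Newton-Raphson iteration in Python (a generic sketch, not the poster's actual setup): it refines a parameter until the model output matches a target, using the local slope to decide how far to move.

    # Newton-Raphson: adjust x until f(x) hits the target, stepping by error / slope.
    def newton_raphson(f, f_prime, target, x0, tol=1e-8, max_iter=100):
        x = x0
        for _ in range(max_iter):
            error = f(x) - target
            if abs(error) < tol:
                break
            x = x - error / f_prime(x)   # Newton step toward f(x) = target
        return x

    # Example: solve x**3 = 10 starting from x = 2.
    root = newton_raphson(lambda x: x**3, lambda x: 3 * x**2, target=10.0, x0=2.0)
    print(root, root**3)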