Can I implement a classifier using a function? - machine-learning

I was learning about different techniques for classification, like probablistic classifiers etc , and stubled upon the question Why cant we implement a binary classifier as a Regression function of all the attributes and classify on the basis of the output of the function , say if the output is less than a certain value it belongs to class A , else in class B . Is there any limitation to this method compared to probablistic approach ?

You can do this and it is often done in practice, for example in Logistic Regression. It is not even limited to binary classes. There is no inherent limitation compared to a probabilistic approach, although you should keep in mind that both are fundamentally different approaches and hard to compare.

I think you have some misunderstanding in classification. No matter what kind of classifier you are using (svm, or logistic regression), you can always view the output model as
f(x)>b ===> positive
f(x) negative
This applies to both probabilistic model and non-probabilistic model. In fact, this is something related to risk minimization which results the cut-off branch naturally.

Yes, this is possible. For example, a perceptron does exactly that.
However, it is limited in its use to linearly separable problems. But multiple of them can be combined to solve arbitrarily complex problems in general neural networks.
Another machine learning technique, SVM, works in a similar way. It first transforms the input data into some high dimensional space and then separates it via a linear function.

Related

Naive Bayes vs. SVM for classifying text data

I'm working on a problem that involves classifying a large database of texts. The texts are very short (think 3-8 words each) and there are 10-12 categories into which I wish to sort them. For the features, I'm simply using the tf–idf frequency of each word. Thus, the number of features is roughly equal to the number of words that appear overall in the texts (I'm removing stop words and some others).
In trying to come up with a model to use, I've had the following two ideas:
Naive Bayes (likely the sklearn multinomial Naive Bayes implementation)
Support vector machine (with stochastic gradient descent used in training, also an sklearn implementation)
I have built both models, and am currently comparing the results.
What are the theoretical pros and cons to each model? Why might one of these be better for this type of problem? I'm new to machine learning, so what I'd like to understand is why one might do better.
Many thanks!
The biggest difference between the models you're building from a "features" point of view is that Naive Bayes treats them as independent, whereas SVM looks at the interactions between them to a certain degree, as long as you're using a non-linear kernel (Gaussian, rbf, poly etc.). So if you have interactions, and, given your problem, you most likely do, an SVM will be better at capturing those, hence better at the classification task you want.
The consensus for ML researchers and practitioners is that in almost all cases, the SVM is better than the Naive Bayes.
From a theoretical point of view, it is a little bit hard to compare the two methods. One is probabilistic in nature, while the second one is geometric. However, it's quite easy to come up with a function where one has dependencies between variables which are not captured by Naive Bayes (y(a,b) = ab), so we know it isn't an universal approximator. SVMs with the proper choice of Kernel are (as are 2/3 layer neural networks) though, so from that point of view, the theory matches the practice.
But in the end it comes down to performance on your problem - you basically want to choose the simplest method which will give good enough results for your problem and have a good enough performance. Spam detection has been famously solvable by just Naive Bayes, for example. Face recognition in images by a similar method enhanced with boosting etc.
Support Vector Machine (SVM) is better at full-length content.
Multinomial Naive Bayes (MNB) is better at snippets.
MNB is stronger for snippets than for longer documents. While (Ng and Jordan,
2002) showed that NB is better than SVM/logistic
regression (LR) with few training cases, MNB is also better with short documents. SVM usually beats NB when it has more than 30–50 training cases, we show that MNB is still better on snippets even with relatively large training sets (9k cases).
Inshort, NBSVM seems to be an appropriate and very strong baseline for sophisticated classification text data.
Source Code: https://github.com/prakhar-agarwal/Naive-Bayes-SVM
Reference: http://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf
Cite: Wang, Sida, and Christopher D. Manning. "Baselines and bigrams:
Simple, good sentiment and topic classification." Proceedings of the
50th Annual Meeting of the Association for Computational Linguistics:
Short Papers-Volume 2. Association for Computational Linguistics,
2012.

Confusion regarding difference of machine learning and statistical learning algorithms

I have read these lines in one of the IEEE Transaction on software learning
"Researchers have adopted a myriad of different techniques to construct software fault prediction models. These include various statistical techniques such as logistic regression and Naive Bayes which explicitly construct an underlying probability model. Furthermore, different machine learning techniques such as decision trees, models based on the notion of perceptrons, support vector machines, and techniques that do not explicitly construct a prediction model but instead look at a set of most similar
known cases have also been investigated.
Can anyone can explain what they are really want to convey.
Please give example.
Thanx in advance.
The authors seem to distinguish probabilistic vs non-probabilistic models, that is models that produce a distribution p(output | data) vs those that just produce an output output = f(data).
The description of the non-probabilistic algorithms is a bits odd to my taste, though. The difference between a (linear) support vector machine, a perceptron and logistic regression from the model and algorithmic perspective is not super large. Implying the former "look at a set of most similar known cases" and the latter doesn't seems strange.
The authors seem to be distinguishing models which compute per-class probabilities (from which you can derive a classification rule to assign an input to the most probable class, or, more complicated, assign an input to the class which has the least misclassification cost) and those which directly assign inputs to classes without passing through the per-class probability as an intermediate result.
A classification task can be viewed as a decision problem; in this case one needs per-class probabilities and a misclassification cost matrix. I think this approach is described in many current texts on machine learning, e.g., Brian Ripley's "Pattern Recognition and Neural Networks" and Hastie, Tibshirani, and Friedman, "Elements of Statistical Learning".
As a meta-comment, you might get more traction for this question on stats.stackexchange.com.

How to choose classifier on specific dataset

When given the dataset, normally m instances by n features matrix, how to choose the classifier that is most appropriate for the dataset.
This is just like what algorithm to solve a prime Number. Not every algorithm solve any problem means each problem assigned which finite no. of algorithm. In machine learning you can apply different algorithm on a type of problem.
If matrix contain real numbered features then you can use KNN algorithm can be used. Or if matrix have words as feature then you can use naive bayes classifier which is one of best for text classification. And Machine learning have tons of algorithm you can read them apply to your problem which fits best. Hope you understand what I said.
An interesting but much more general map I found:
http://scikit-learn.org/stable/tutorial/machine_learning_map/
If you have weka, you can use experimenter and choose different algorithms on same data set to evaluate different models.
This project compares many different classifiers on different typical datasets.
If you have no idea, you could use this simple tool auto-weka which will test all the different classifiers you selected within different constraints. Before using auto-weka, you may need to convert your data to ARFF using Weka or just manually (many tutorial on youtube).
The best classifier depends on your data (binary/string/real/tags, patterns, distribution...), what kind of output to predict (binary class / multi-class / evolving classes / a value from regression ?) and the expected performance (time, memory, accuracy). It would also depend on whether you want to update your model frequently or not (ie. if it is a stream, better use an online classifier).
Please note that the best classifier may not be one but an ensemble of different classifiers.

Which machine learning classifier to choose, in general? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
Suppose I'm working on some classification problem. (Fraud detection and comment spam are two problems I'm working on right now, but I'm curious about any classification task in general.)
How do I know which classifier I should use?
Decision tree
SVM
Bayesian
Neural network
K-nearest neighbors
Q-learning
Genetic algorithm
Markov decision processes
Convolutional neural networks
Linear regression or logistic regression
Boosting, bagging, ensambling
Random hill climbing or simulated annealing
...
In which cases is one of these the "natural" first choice, and what are the principles for choosing that one?
Examples of the type of answers I'm looking for (from Manning et al.'s Introduction to Information Retrieval book):
a. If your data is labeled, but you only have a limited amount, you should use a classifier with high bias (for example, Naive Bayes).
I'm guessing this is because a higher-bias classifier will have lower variance, which is good because of the small amount of data.
b. If you have a ton of data, then the classifier doesn't really matter so much, so you should probably just choose a classifier with good scalability.
What are other guidelines? Even answers like "if you'll have to explain your model to some upper management person, then maybe you should use a decision tree, since the decision rules are fairly transparent" are good. I care less about implementation/library issues, though.
Also, for a somewhat separate question, besides standard Bayesian classifiers, are there 'standard state-of-the-art' methods for comment spam detection (as opposed to email spam)?
First of all, you need to identify your problem. It depends upon what kind of data you have and what your desired task is.
If you are Predicting Category :
You have Labeled Data
You need to follow Classification Approach and its algorithms
You don't have Labeled Data
You need to go for Clustering Approach
If you are Predicting Quantity :
You need to go for Regression Approach
Otherwise
You can go for Dimensionality Reduction Approach
There are different algorithms within each approach mentioned above. The choice of a particular algorithm depends upon the size of the dataset.
Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Model selection using cross validation may be what you need.
Cross validation
What you do is simply to split your dataset into k non-overlapping subsets (folds), train a model using k-1 folds and predict its performance using the fold you left out. This you do for each possible combination of folds (first leave 1st fold out, then 2nd, ... , then kth, and train with the remaining folds). After finishing, you estimate the mean performance of all folds (maybe also the variance/standard deviation of the performance).
How to choose the parameter k depends on the time you have. Usual values for k are 3, 5, 10 or even N, where N is the size of your data (that's the same as leave-one-out cross validation). I prefer 5 or 10.
Model selection
Let's say you have 5 methods (ANN, SVM, KNN, etc) and 10 parameter combinations for each method (depending on the method). You simply have to run cross validation for each method and parameter combination (5 * 10 = 50) and select the best model, method and parameters. Then you re-train with the best method and parameters on all your data and you have your final model.
There are some more things to say. If, for example, you use a lot of methods and parameter combinations for each, it's very likely you will overfit. In cases like these, you have to use nested cross validation.
Nested cross validation
In nested cross validation, you perform cross validation on the model selection algorithm.
Again, you first split your data into k folds. After each step, you choose k-1 as your training data and the remaining one as your test data. Then you run model selection (the procedure I explained above) for each possible combination of those k folds. After finishing this, you will have k models, one for each combination of folds. After that, you test each model with the remaining test data and choose the best one. Again, after having the last model you train a new one with the same method and parameters on all the data you have. That's your final model.
Of course, there are many variations of these methods and other things I didn't mention. If you need more information about these look for some publications about these topics.
The book "OpenCV" has a great two pages on this on pages 462-463. Searching the Amazon preview for the word "discriminative" (probably google books also) will let you see the pages in question. These two pages are the greatest gem I have found in this book.
In short:
Boosting - often effective when a large amount of training data is available.
Random trees - often very effective and can also perform regression.
K-nearest neighbors - simplest thing you can do, often effective but slow and requires lots of memory.
Neural networks - Slow to train but very fast to run, still optimal performer for letter recognition.
SVM - Among the best with limited data, but losing against boosting or random trees only when large data sets are available.
Things you might consider in choosing which algorithm to use would include:
Do you need to train incrementally (as opposed to batched)?
If you need to update your classifier with new data frequently (or you have tons of data), you'll probably want to use Bayesian. Neural nets and SVM need to work on the training data in one go.
Is your data composed of categorical only, or numeric only, or both?
I think Bayesian works best with categorical/binomial data. Decision trees can't predict numerical values.
Does you or your audience need to understand how the classifier works?
Use Bayesian or decision trees, since these can be easily explained to most people. Neural networks and SVM are "black boxes" in the sense that you can't really see how they are classifying data.
How much classification speed do you need?
SVM's are fast when it comes to classifying since they only need to determine which side of the "line" your data is on. Decision trees can be slow especially when they're complex (e.g. lots of branches).
Complexity.
Neural nets and SVMs can handle complex non-linear classification.
As Prof Andrew Ng often states: always begin by implementing a rough, dirty algorithm, and then iteratively refine it.
For classification, Naive Bayes is a good starter, as it has good performances, is highly scalable and can adapt to almost any kind of classification task. Also 1NN (K-Nearest Neighbours with only 1 neighbour) is a no-hassle best fit algorithm (because the data will be the model, and thus you don't have to care about the dimensionality fit of your decision boundary), the only issue is the computation cost (quadratic because you need to compute the distance matrix, so it may not be a good fit for high dimensional data).
Another good starter algorithm is the Random Forests (composed of decision trees), this is highly scalable to any number of dimensions and has generally quite acceptable performances. Then finally, there are genetic algorithms, which scale admirably well to any dimension and any data with minimal knowledge of the data itself, with the most minimal and simplest implementation being the microbial genetic algorithm (only one line of C code! by Inman Harvey in 1996), and one of the most complex being CMA-ES and MOGA/e-MOEA.
And remember that, often, you can't really know what will work best on your data before you try the algorithms for real.
As a side-note, if you want a theoretical framework to test your hypothesis and algorithms theoretical performances for a given problem, you can use the PAC (Probably approximately correct) learning framework (beware: it's very abstract and complex!), but to summary, the gist of PAC learning says that you should use the less complex, but complex enough (complexity being the maximum dimensionality that the algo can fit) algorithm that can fit your data. In other words, use the Occam's razor.
Sam Roweis used to say that you should try naive Bayes, logistic regression, k-nearest neighbour and Fisher's linear discriminant before anything else.
My take on it is that you always run the basic classifiers first to get some sense of your data. More often than not (in my experience at least) they've been good enough.
So, if you have supervised data, train a Naive Bayes classifier. If you have unsupervised data, you can try k-means clustering.
Another resource is one of the lecture videos of the series of videos Stanford Machine Learning, which I watched a while back. In video 4 or 5, I think, the lecturer discusses some generally accepted conventions when training classifiers, advantages/tradeoffs, etc.
You should always keep into account the inference vs prediction trade-off.
If you want to understand the complex relationship that is occurring in your data then you should go with a rich inference algorithm (e.g. linear regression or lasso). On the other hand, if you are only interested in the result you can go with high dimensional and more complex (but less interpretable) algorithms, like neural networks.
Selection of Algorithm is depending upon the scenario and the type and size of data set.
There are many other factors.
This is a brief cheat sheet for basic machine learning.

How to approach machine learning problems with high dimensional input space?

How should I approach a situtation when I try to apply some ML algorithm (classification, to be more specific, SVM in particular) over some high dimensional input, and the results I get are not quite satisfactory?
1, 2 or 3 dimensional data can be visualized, along with the algorithm's results, so you can get the hang of what's going on, and have some idea how to aproach the problem. Once the data is over 3 dimensions, other than intuitively playing around with the parameters I am not really sure how to attack it?
What do you do to the data? My answer: nothing. SVMs are designed to handle high-dimensional data. I'm working on a research problem right now that involves supervised classification using SVMs. Along with finding sources on the Internet, I did my own experiments on the impact of dimensionality reduction prior to classification. Preprocessing the features using PCA/LDA did not significantly increase classification accuracy of the SVM.
To me, this totally makes sense from the way SVMs work. Let x be an m-dimensional feature vector. Let y = Ax where y is in R^n and x is in R^m for n < m, i.e., y is x projected onto a space of lower dimension. If the classes Y1 and Y2 are linearly separable in R^n, then the corresponding classes X1 and X2 are linearly separable in R^m. Therefore, the original subspaces should be "at least" as separable as their projections onto lower dimensions, i.e., PCA should not help, in theory.
Here is one discussion that debates the use of PCA before SVM: link
What you can do is change your SVM parameters. For example, with libsvm link, the parameters C and gamma are crucially important to classification success. The libsvm faq, particularly this entry link, contains more helpful tips. Among them:
Scale your features before classification.
Try to obtain balanced classes. If impossible, then penalize one class more than the other. See more references on SVM imbalance.
Check the SVM parameters. Try many combinations to arrive at the best one.
Use the RBF kernel first. It almost always works best (computationally speaking).
Almost forgot... before testing, cross validate!
EDIT: Let me just add this "data point." I recently did another large-scale experiment using the SVM with PCA preprocessing on four exclusive data sets. PCA did not improve the classification results for any choice of reduced dimensionality. The original data with simple diagonal scaling (for each feature, subtract mean and divide by standard deviation) performed better. I'm not making any broad conclusion -- just sharing this one experiment. Maybe on different data, PCA can help.
Some suggestions:
Project data (just for visualization) to a lower-dimensional space (using PCA or MDS or whatever makes sense for your data)
Try to understand why learning fails. Do you think it overfits? Do you think you have enough data? Is it possible there isn't enough information in your features to solve the task you are trying to solve? There are ways to answer each of these questions without visualizing the data.
Also, if you tell us what the task is and what your SVM output is, there may be more specific suggestions people could make.
You can try reducing the dimensionality of the problem by PCA or the similar technique. Beware that PCA has two important points. (1) It assumes that the data it is applied to is normally distributed and (2) the resulting data looses its natural meaning (resulting in a blackbox). If you can live with that, try it.
Another option is to try several parameter selection algorithms. Since SVM's were already mentioned here, you might try the approach of Chang and Li (Feature Ranking Using Linear SVM) in which they used linear SVM to pre-select "interesting features" and then used RBF - based SVM on the selected features. If you are familiar with Orange, a python data mining library, you will be able to code this method in less than an hour. Note that this is a greedy approach which, due to its "greediness" might fail in cases where the input variables are highly correlated. In that case, and if you cannot solve this problem with PCA (see above), you might want to go to heuristic methods, which try to select best possible combinations of predictors. The main pitfall of this kind of approaches is the high potential of overfitting. Make sure you have a bunch "virgin" data that was not seen during the entire process of model building. Test your model on that data only once, after you are sure that the model is ready. If you fail, don't use this data once more to validate another model, you will have to find a new data set. Otherwise you won't be sure that you didn't overfit once more.
List of selected papers on parameter selection:
Feature selection for high-dimensional genomic microarray data
Oh, and one more thing about SVM. SVM is a black box. You better figure out what is the mechanism that generate the data and model the mechanism and not the data. On the other hand, if this would be possible, most probably you wouldn't be here asking this question (and I wouldn't be so bitter about overfitting).
List of selected papers on parameter selection
Feature selection for high-dimensional genomic microarray data
Wrappers for feature subset selection
Parameter selection in particle swarm optimization
I worked in the laboratory that developed this Stochastic method to determine, in silico, the drug like character of molecules
I would approach the problem as follows:
What do you mean by "the results I get are not quite satisfactory"?
If the classification rate on the training data is unsatisfactory, it implies that either
You have outliers in your training data (data that is misclassified). In this case you can try algorithms such as RANSAC to deal with it.
Your model(SVM in this case) is not well suited for this problem. This can be diagnozed by trying other models (adaboost etc.) or adding more parameters to your current model.
The representation of the data is not well suited for your classification task. In this case preprocessing the data with feature selection or dimensionality reduction techniques would help
If the classification rate on the test data is unsatisfactory, it implies that your model overfits the data:
Either your model is too complex(too many parameters) and it needs to be constrained further,
Or you trained it on a training set which is too small and you need more data
Of course it may be a mixture of the above elements. These are all "blind" methods to attack the problem. In order to gain more insight into the problem you may use visualization methods by projecting the data into lower dimensions or look for models which are suited better to the problem domain as you understand it (for example if you know the data is normally distributed you can use GMMs to model the data ...)
If I'm not wrong, you are trying to see which parameters to the SVM gives you the best result. Your problem is model/curve fitting.
I worked on a similar problem couple of years ago. There are tons of libraries and algos to do the same. I used Newton-Raphson's algorithm and a variation of genetic algorithm to fit the curve.
Generate/guess/get the result you are hoping for, through real world experiment (or if you are doing simple classification, just do it yourself). Compare this with the output of your SVM. The algos I mentioned earlier reiterates this process till the result of your model(SVM in this case) somewhat matches the expected values (note that this process would take some time based your problem/data size.. it took about 2 months for me on a 140 node beowulf cluster).
If you choose to go with Newton-Raphson's, this might be a good place to start.

Resources