I have been looking for a fast linear SVM library and came across two of the most important ones: LIBLINEAR and Pegasos. From the paper presenting LIBLINEAR, it looks like LIBLINEAR outperforms Pegasos; however, Pegasos claims to be fast when the data is sparse.
As Pegasos came earlier, there is no such comparison in its documentation.
So which should I choose for sparse data?
As far as I know, sparse data is handled fine by both. The question is more about the number of data points. Liblinear has solvers for both the primal and the dual, and these solve the problem to high precision without any need to tune parameters. For Pegasos or similar subgradient descent solvers (if you want one of these, I'd recommend Leon Bottou's sgd), the result depends strongly on the initial learning rate and the learning rate schedule, which can be tricky to tune.
As a rule of thumb, if I have fewer than 10k data points, I'd always use liblinear (with the primal solver), maybe even up to 100k. Above that, I'd consider using SGD if I feel liblinear is too slow. Even if liblinear is slightly slower, I prefer it, as it means I don't have to think about the learning rate, learning rate decay, and number of epochs.
Btw, you can very easily compare these different solvers using a framework like scikit-learn, which includes SGD, Liblinear and LibSVM solvers, or lightning, which includes A LOT of solvers.
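For example, here is a minimal sketch of such a comparison using scikit-learn (the dataset and parameter values are just illustrative placeholders; LinearSVC wraps the Liblinear solver, and SGDClassifier with hinge loss optimizes a similar objective):

    from sklearn.datasets import fetch_20newsgroups_vectorized  # sparse text data
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import SGDClassifier

    # Sparse, high-dimensional data: a typical use case for linear SVMs.
    X, y = fetch_20newsgroups_vectorized(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Liblinear-backed solver: no learning rate to tune.
    liblinear = LinearSVC(C=1.0).fit(X_train, y_train)
    print("LinearSVC accuracy:", liblinear.score(X_test, y_test))

    # SGD with hinge loss approximates the same objective, but the result
    # depends on the learning rate schedule and the number of epochs.
    sgd = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20).fit(X_train, y_train)
    print("SGDClassifier accuracy:", sgd.score(X_test, y_test))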
Both LIBLINEAR and Pegasos are linear classification techniques that were specifically developed to deal with large, sparse data with a huge number of instances and features. It is on this kind of data that they are faster than traditional kernel SVMs.
I have never used Pegasos before, but I can assure you that LIBLINEAR is very fast on this kind of data, and the authors say that "it is competitive with or even faster than state of the art linear classifiers such as Pegasos".
I have some questions about SVMs:
1. Why use SVMs? In other words, what led to their development?
2. What is the state of the art (2017)?
3. What improvements have been made?
SVMs work very well; in many applications, they are still among the best-performing algorithms.
We've seen progress in particular on linear SVMs, which can be trained much faster than kernel SVMs.
Read more literature. Don't expect an exhaustive answer in this Q&A format. Show more effort on your part.
SVMs are most commonly used for classification problems where labeled data is available (supervised learning), and they are useful for modeling with limited data. For problems with unlabeled data (unsupervised learning), support vector clustering is a commonly employed algorithm. SVMs tend to perform better on binary classification problems, since the decision boundaries will not overlap. Your 2nd and 3rd questions are very ambiguous (and need a lot of work!), but suffice it to say that SVMs have found wide-ranging applicability in medical data science. Here's a link to explore more about this: Applications of Support Vector Machine (SVM) Learning in Cancer Genomics
What are the advantages and disadvantages of LDA vs Naive Bayes in terms of machine learning classification?
I know some of the differences, like Naive Bayes assuming variables to be independent while LDA assumes Gaussian class-conditional density models, but I don't understand when to use LDA and when to use NB, depending on the situation.
Both methods are pretty simple, so it's hard to say which one is going to work much better. It's often faster to just try both and compute the test accuracy. But here is a list of characteristics that usually indicate whether a certain method is less likely to give good results. It all boils down to the data.
Naive Bayes
The first disadvantage of the Naive Bayes classifier is the feature independence assumption. In practice, data is multi-dimensional and different features do correlate. Because of this, the result can be pretty bad, though not always significantly so. If you know for sure that the features are dependent (e.g. pixels of an image), don't expect Naive Bayes to perform well.
Another problem is data scarcity. The likelihood of each feature value is estimated by a frequentist approach, i.e. by counting occurrences. With little data, some counts will be zero or near zero, producing probability estimates close to 0 or 1, which in turn leads to numerical instabilities and worse results (unless you apply smoothing, such as Laplace smoothing).
A third problem arises with continuous features. The basic Naive Bayes classifier works only with categorical variables, so one has to discretize continuous features, thereby throwing away a lot of information (a Gaussian variant that models continuous features directly is a common workaround). If there's a continuous variable in the data, it's a strong sign against Naive Bayes.
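A minimal sketch of both workarounds in scikit-learn (the dataset, the bin count, and the smoothing value are arbitrary choices for illustration):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import KBinsDiscretizer
    from sklearn.naive_bayes import CategoricalNB, GaussianNB

    X, y = load_iris(return_X_y=True)

    # Discretize continuous features for a categorical Naive Bayes;
    # alpha=1.0 is Laplace smoothing, which avoids zero-count probabilities.
    discrete_nb = make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
        CategoricalNB(alpha=1.0),
    )

    # Gaussian Naive Bayes models the continuous features directly instead.
    gaussian_nb = GaussianNB()

    for name, clf in [("discretized NB", discrete_nb), ("Gaussian NB", gaussian_nb)]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())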
Linear Discriminant Analysis
LDA does not work well if the classes are imbalanced, i.e. the numbers of objects in the various classes are highly different. The solution is to get more data, which can be pretty easy or almost impossible, depending on the task.
Another disadvantage of LDA is that it's not applicable to non-linear problems, e.g. the separation of donut-shaped point clouds, although in high-dimensional spaces this is hard to spot right away. Usually you only realize it after seeing LDA fail, but if the data is known to be very non-linear, that is a strong sign against LDA.
In addition, LDA can be sensitive to overfitting and needs careful validation / testing.
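Since it's often fastest to just try both, here is a minimal sketch of such a comparison (synthetic data with strongly correlated features, chosen deliberately to violate the Naive Bayes independence assumption):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Two Gaussian classes with correlated features: this matches LDA's
    # model assumptions but violates Naive Bayes' independence assumption.
    cov = [[1.0, 0.9], [0.9, 1.0]]
    X = np.vstack([rng.multivariate_normal([0, 0], cov, 200),
                   rng.multivariate_normal([1.5, 1.5], cov, 200)])
    y = np.array([0] * 200 + [1] * 200)

    for name, clf in [("Naive Bayes", GaussianNB()),
                      ("LDA", LinearDiscriminantAnalysis())]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())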
I intend to build a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples; each sample contains 3 features, which are continuous numeric variables. I know the dataset is quite small. I would like to ask you two questions:
A) What would be the best machine learning algorithm for this? An SVM? A neural network? Everything I have read seems to require a big dataset.
B) I could make the dataset a little bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case; is this possible with every machine learning algorithm? (I have seen them used with SVMs.)
Thanks a lot for your help!!!
My recommendation is to use a simple and straightforward algorithm, like a decision tree or logistic regression, although the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
Naive Bayes is a good choice when there are few training examples. Compared to logistic regression, Ng and Jordan showed that Naive Bayes converges towards its optimal performance faster, with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes is a generative model of the joint probability distribution, which tends to perform better in this situation.
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.
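With only 150 samples, cross-validating all the candidates mentioned above is cheap. A minimal sketch, using synthetic stand-in data of the same shape as yours:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in data: 150 samples, 3 continuous features, yes/no target.
    X, y = make_classification(n_samples=150, n_features=3, n_informative=3,
                               n_redundant=0, random_state=0)

    # With so little data, cross-validation gives a more reliable estimate
    # than a single train/test split.
    for name, clf in [("Naive Bayes", GaussianNB()),
                      ("Logistic regression", LogisticRegression()),
                      ("Decision tree", DecisionTreeClassifier())]:
        scores = cross_val_score(clf, X, y, cv=10)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")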
I am a beginner in machine learning and recently read about supervised and unsupervised machine learning. It looks like supervised learning is synonymous with classification and unsupervised learning is synonymous with clustering. Is that so?
No.
Supervised learning is when you know the correct answers (targets). Depending on their type, it might be classification (categorical targets), regression (numerical targets) or learning to rank (ordinal targets). (This list is by no means complete; there might be other types that I either forgot or am unaware of.)
In unsupervised learning, by contrast, we don't know the correct answers, and we try to infer or learn some structure from the data, be it cluster membership or a low-dimensional approximation (dimensionality reduction; in fact, one might think of clustering as an extreme case of dimensionality reduction). Again, this is far from complete, but the general idea is that there is hidden structure that we try to discover from the data.
Supervised learning is when you have labeled training data. In other words, you have a well-defined target to optimize your method for.
Typical supervised learning tasks are classification and regression: learning to predict categorical values (classification), numerical values (regression), or ranks (learning to rank).
Unsupervised learning is an odd term, because most of the time the methods aren't really "learning" anything. What would they learn from? You don't have training data.
There are plenty of unsupervised methods that don't fit the "learning" paradigm well. This includes dimensionality reduction methods such as PCA (which by far predates any "machine learning"; PCA was proposed in 1901, long before the computer!). Many of these are just data-driven statistics (as opposed to parametric statistics). This includes most cluster analysis methods, outlier detection, and so on. To understand these, it's better to step out of the "learning" mindset. Many people have trouble understanding these approaches because they always think in the "minimize objective function f" mindset common in learning.
Consider for example DBSCAN, one of the most popular clustering algorithms. It does not fit the learning paradigm well, but it can nicely be interpreted as a graph-theoretic construct: (density-)connected components. It doesn't optimize any objective function; it computes the transitive closure of a relation, and there is no function maximized or minimized.
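As a concrete illustration, a minimal sketch with scikit-learn's DBSCAN on a toy dataset (eps and min_samples are arbitrary choices here):

    from sklearn.datasets import make_moons
    from sklearn.cluster import DBSCAN

    # Two interleaved half-circles: density-connected, but not separable
    # by any method that assumes convex clusters.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # eps is the neighborhood radius, min_samples the density threshold.
    # No objective function is optimized: points are simply grouped into
    # (density-)connected components, and sparse points are labeled -1.
    labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
    print("clusters found:", set(labels))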
Similarly, APRIORI finds frequent itemsets: combinations of items that occur more than minsupp times, where minsupp is a user parameter. It's an extremely simple definition, but the search space can be painfully large when you have large data; the brute-force approach just doesn't finish in acceptable time. So APRIORI uses a clever search strategy to avoid unnecessary hard disk accesses, computation, and memory use. But there is no "worse" or "better" result as in learning: either the result is correct (complete) or not; there is nothing to optimize on the result (only on the algorithm's runtime).
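The definition is simple enough that a naive version fits in a few lines. This toy sketch rescans the data for every support count, so it shows the Apriori candidate pruning but none of the I/O optimizations that make the real algorithm scale:

    from itertools import combinations

    def apriori(transactions, minsupp):
        # Find all itemsets occurring in at least minsupp transactions.
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}

        def support(itemset):
            return sum(itemset <= t for t in transactions)

        # Level 1: frequent single items.
        frequent = [frozenset([i]) for i in items
                    if support(frozenset([i])) >= minsupp]
        result, k = list(frequent), 2
        while frequent:
            # Join step: combine frequent (k-1)-itemsets into k-candidates;
            # the Apriori property prunes candidates with an infrequent subset.
            candidates = {a | b for a in frequent for b in frequent
                          if len(a | b) == k}
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent
                                 for s in combinations(c, k - 1))}
            frequent = [c for c in candidates if support(c) >= minsupp]
            result += frequent
            k += 1
        return result

    print(apriori([{"a", "b"}, {"a", "b", "c"}, {"a", "c"}], minsupp=2))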
Calling these methods "unsupervised learning" squeezes them into a mindset they don't belong to. They are not "learning" anything: neither optimizes a function, uses labels, nor uses any kind of feedback. They just SELECT a certain set of objects from the database: APRIORI selects columns that frequently have a 1 at the same time; DBSCAN selects connected components in a density graph. Either the result is correct, or it is not.
Some (but by far not all) unsupervised methods can be formalized as an optimization problem, at which point they become similar to popular supervised learning approaches. For example, k-means is a minimization problem. PCA is a minimization problem, too; it is actually closely related to linear regression. But it is the other way around: many machine learning tasks are transformed into an optimization problem and can be solved with general-purpose statistical tools that just happen to be highly popular in machine learning (e.g. linear programming). All the "learning" part is then wrapped into the way the data is transformed prior to feeding it into the optimizer. And in some cases, like for PCA, a non-iterative way to compute the optimum solution was found (in 1901), so in these cases you don't need the usual optimization hammer at all.
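For instance, the PCA optimum can be computed in closed form from an eigendecomposition of the covariance matrix, with no iterative optimizer involved. A minimal sketch on random data (the components may differ from scikit-learn's only in sign):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    Xc = X - X.mean(axis=0)  # center the data

    # Closed-form solution: the principal components are the top
    # eigenvectors of the covariance matrix. Nothing is iterated.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]

    # Same subspace as scikit-learn's PCA, up to the sign of each axis.
    sk = PCA(n_components=2).fit(X).components_
    print(np.allclose(np.abs(sk), np.abs(top2.T), atol=1e-6))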
I have my own Java-based implementation of clustering (knn). However, I am facing scalability issues. I do not plan to use Mahout because my requirements are very simple and Mahout requires a lot of work. I am looking for a Java-based canopy clustering implementation which I can plug into my algorithm and do parallel processing.
Mahout-based canopy libraries are coupled with Vectors and indexes and do not work on plain strings. If you know of a way I can use canopy clustering on strings with a simple library, it would fix my issue.
My requirement is to pass a list of strings (say 10K) to the canopy clustering algorithm, and it should return sublists based on T1 and T2.
Canopy clustering is mostly useful as a preprocessing step for parallelization. I'm not sure how much it will gain you on a single node. I figure you might as well run the actual algorithm right away, or build an index such as an M-tree.
The strength of Canopy clustering is that you can run it independently on a number of nodes and then just overlap their results.
Also check whether it is actually compatible with your approach. I figure that canopy clustering may need metric properties to be correct. Is your string distance a proper metric (i.e. does it satisfy the triangle inequality)?
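For reference, the canopy procedure itself is only a few lines. Here is a minimal sketch in Python (the trigram Jaccard distance is just one illustrative string metric that does satisfy the triangle inequality; the T1/T2 values are placeholders):

    import random

    def canopy_clustering(points, distance, t1, t2):
        # t1 > t2: points within t1 of a center join its canopy;
        # points within t2 leave the candidate pool entirely, so
        # canopies may overlap in the t2..t1 ring.
        assert t1 > t2
        candidates, canopies = list(points), []
        while candidates:
            center = candidates.pop(random.randrange(len(candidates)))
            canopy, remaining = [center], []
            for p in candidates:
                d = distance(center, p)
                if d < t1:
                    canopy.append(p)
                if d >= t2:
                    remaining.append(p)
            candidates = remaining
            canopies.append(canopy)
        return canopies

    def trigrams(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}

    def jaccard_distance(a, b):
        ta, tb = trigrams(a), trigrams(b)
        return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

    strings = ["apple pie", "apple tart", "banana split", "banana bread"]
    for canopy in canopy_clustering(strings, jaccard_distance, t1=0.8, t2=0.5):
        print(canopy)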
10,000 data points, if that's all you're concerned with, should be no problem for standard k-means. I'd look at optimising that before you consider canopy clustering (which is really designed for millions or even billions of examples). Some things you may have missed (see the sketch after this list):
pre-compute the feature vectors for each string; don't do it every time you want to compare s_1 to s_2 or s_1 to a cluster centroid
you only need to keep the summary statistics in memory: the sum of all points assigned to a cluster and the number of points assigned to it. When you're done with an iteration, divide the sums by the counts and you have your new centroids.
what's the dimensionality of your feature space? Be aware that you should use a distance where dimensions in which both vectors are zero have no impact, so you only need to compute over the non-zero dimensions. Store your points as sparse vectors to facilitate this.
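Putting those tips together, a minimal sketch of the k-means loop over pre-computed sparse vectors (character-trigram TF-IDF is just one illustrative featurization; k and the iteration count are placeholders):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Tip 1: pre-compute the (sparse) feature vectors once.
    strings = ["apple pie", "apple tart", "banana split", "banana bread"]
    X = TfidfVectorizer(analyzer="char", ngram_range=(3, 3)).fit_transform(strings)

    k, rng = 2, np.random.default_rng(0)
    centroids = X[rng.choice(X.shape[0], k, replace=False)].toarray()

    for _ in range(10):
        # Tip 3: sparse-dense products only touch non-zero dimensions;
        # dropping the constant ||x||^2 term doesn't change the argmin.
        d = -2 * (X @ centroids.T) + (centroids ** 2).sum(axis=1)
        assign = np.asarray(d.argmin(axis=1)).ravel()
        # Tip 2: per-cluster summary statistics, just sums and counts.
        for j in range(k):
            members = X[assign == j]
            if members.shape[0]:
                sums = np.asarray(members.sum(axis=0)).ravel()
                centroids[j] = sums / members.shape[0]
    print(assign)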
Can you do some analysis and determine where the bottle-neck in your implementation is? I'm a little perplexed by your comment about Mahout not working with plain strings.
You should give the clustering algorithms in ELKI a try. Sorry for so shamelessly promoting a project I'm closely affiliated with. But it is the largest collection of clustering and outlier detection algorithms that are implemented in a comparable fashion. (If you'd take all the clustering algorithms available in some R package, you might end up with more algorithms, but they won't be really comparable because of implementation differences)
And benchmarking showed enormous speed differences with different implementations of the same algorithm. See our benchmarking web site on how much performance can vary even on simple algorithms such as k-means.
We do not yet have canopy clustering. The reason is that it's more of a preprocessing index than an actual clustering algorithm, kind of like a primitive variant of the M-tree, or of DBSCAN clustering. However, we would like to see a contributed canopy clustering as such a preprocessing step.
ELKI's abilities to process strings are also a bit limited so far. You can load typical TF-IDF vectors just fine, and we have somewhat optimized sparse vector classes and similarity functions. They don't fully exploit sparsity for k-means yet, though, and there is no spherical k-means yet either. But there are various reasons why k-means results on sparse vectors cannot be expected to be very meaningful; it's more of a heuristic.
But it would be interesting if you could give it a try for your problem and report back your experiences. Was the performance somewhat competitive with your implementation? And we would love to see contributed modules for text processing, such as e.g. further optimized similarity functions, or a spherical k-means variant.
Update: ELKI now actually includes canopy clustering: CanopyPreClustering (it will be part of 0.6.0). But as of now, it's just another clustering algorithm, not yet used to accelerate other algorithms such as k-means. I need to check how best to use it as a kind of index to accelerate algorithms. I can imagine it also helps speed up DBSCAN if you set T1=epsilon and T2=0.5*T1. The big issue with canopy clustering, IMHO, is how to choose a good radius.