I have a question regarding the current literature in ensemble learning (more specifically in unsupervised learning).
From what I read in the literature, Ensemble Learning applied to Unsupervised Learning basically boils down to clustering problems. However, if I have x unsupervised methods that each output a score (similar to a regression problem), is there an approach that can combine these results into a single one?
On evaluation of outlier rankings and outlier scores. Schubert, E., Wojdanowski, R., Zimek, A., & Kriegel, H. P. (2012, April). In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 1047-1058). Society for Industrial and Applied Mathematics.
In this publication, we don't "just normalize" outlier scores; we also suggest an unsupervised ensemble member selection strategy called the "greedy ensemble".
However, normalization is crucial, and difficult. We published some of our earlier progress on score normalization as
Interpreting and unifying outlier scores. Kriegel, H. P., Kroger, P., Schubert, E., & Zimek, A. (2011, April). In Proceedings of the 2011 SIAM International Conference on Data Mining (pp. 13-24). Society for Industrial and Applied Mathematics.
If you don't normalize your scores (and min-max scaling is not enough), you will usually not be able to combine them in a meaningful way, except under very strong preconditions. Even two different subspaces will usually yield incomparable values, because they have a different number of features and different feature scales.
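As a minimal illustration of why this matters, here is a crude rank-based normalization before averaging. This is not the statistically better-founded transformations from the papers cited here, and it assumes that for every detector a larger score means "more outlying":

```python
import numpy as np
from scipy.stats import rankdata

def normalize_scores(scores):
    """Map raw outlier scores to [0, 1] using their ranks.

    This is only a crude rank-based normalization; the papers cited here
    propose statistically better-founded transformations."""
    ranks = rankdata(scores)                 # 1 .. n, ties averaged
    return (ranks - 1) / (len(scores) - 1)   # 0 = most inlying, 1 = most outlying

def combine(score_lists):
    """Average the normalized scores of several detectors."""
    normalized = np.vstack([normalize_scores(s) for s in score_lists])
    return normalized.mean(axis=0)

# toy example: three detectors scoring the same five points on very different scales
s1 = np.array([0.10, 0.20, 0.15, 5.0, 0.12])   # e.g. a kNN distance
s2 = np.array([1.00, 1.10, 1.05, 9.9, 1.02])   # e.g. LOF-like values
s3 = np.array([10.0, 12.0, 11.0, 400, 10.5])   # e.g. a score from another subspace
print(combine([s1, s2, s3]))                   # the 4th point clearly stands out
```

Averaging the raw s1, s2, s3 directly would let the detector with the largest numeric range dominate the result, which is exactly the problem normalization addresses.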
There is also some work on semi-supervised ensembles, e.g.
Learning Outlier Ensembles: The Best of Both Worlds—Supervised and Unsupervised. Micenková, B., McWilliams, B., & Assent, I. (2014).In Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity (ODD2). New York, NY, USA (pp. 51-54).
Also beware of overfitting. It's quite easy to arrive at a single good result by tweaking parameters and evaluating repeatedly. But this leaks evaluation information into your experiment, i.e. you tend to overfit. Performing well across a large range of parameters and data sets is very hard. One of the key observations of the following study was that for every algorithm, you'll find at least one data set and parameter set where it 'outperforms' the others; but if you change the parameters a little, or use a different data set, the benefits of the "superior" new methods are not reproducible.
On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016). Data Mining and Knowledge Discovery, 30(4), 891-927.
So you will have to work really hard to do a reliable evaluation. Be careful about how you choose parameters.
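To make the parameter-robustness point concrete, here is a hedged sketch of reporting the spread of ROC AUC over a whole parameter grid instead of only the single best value found by tweaking; LOF and its n_neighbors parameter, and the toy data, are just stand-ins:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

# toy data: one dense blob plus a few injected outliers
X_in, _ = make_blobs(n_samples=300, centers=1, random_state=0)
X_out = np.random.RandomState(0).uniform(-10, 10, size=(15, 2))
X = np.vstack([X_in, X_out])
y = np.r_[np.zeros(len(X_in)), np.ones(len(X_out))]  # ground truth, used only for evaluation

aucs = []
for k in range(5, 55, 5):        # sweep the parameter instead of tuning it to the labels
    lof = LocalOutlierFactor(n_neighbors=k).fit(X)
    scores = -lof.negative_outlier_factor_            # larger = more outlying
    aucs.append(roc_auc_score(y, scores))

print("ROC AUC min/median/max over the grid:",
      np.min(aucs), np.median(aucs), np.max(aucs))
```

Reporting the minimum and median over the grid, not just the maximum, is one simple way to avoid fooling yourself with a lucky parameter choice.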
I am new to neural networks. When I read articles, they often say "we noted a 98% accuracy". I carefully read the articles (see the two articles below), but there is no further information on whether the accuracy refers to training or test (validation) data. Please let me know which accuracy the authors are implying.
Grinblat, G. L., Uzal, L. C., Larese, M. G., & Granitto, P. M. (2016). Deep learning for plant identification using vein morphological patterns. Computers and Electronics in Agriculture, 127, 418-424.
Satti, V., Satya, A., & Sharma, S. (2013). An automatic leaf recognition system for plant identification using machine vision technology. International Journal of Engineering Science and Technology, 5(4), 874.
From what I read, the accuracy refers to the test set. When you test with a large amount of data, you give your machine learning model the opportunity to demonstrate a high accuracy. Of course, the test is what determines whether your work gives the expected result.
I have a huge "dynamic" dataset and I'm trying to find interesting clusters on it.
After running a lot of different unsupervised clustering algorithms I have found a configuration of DBSCAN which gives coherent results.
I would like to extrapolate the model that DBSCAN creates on my test data and apply it to other datasets, but without re-running the algorithm. I cannot run the algorithm over the whole dataset because it would run out of memory, and the model might not make sense at a different time since the data is dynamic.
Using sklearn, I have found that other clustering algorithms - like MiniBatchKMeans - have a predict method, but DBSCAN does not.
I understand that for MiniBatchKMeans the centroids uniquely define the model. But such a thing might not exist for DBSCAN.
So my question is: what is the proper way to extrapolate the DBSCAN model? Should I train a supervised learning algorithm using the output that DBSCAN gave on my test dataset? Or is there something intrinsic to the DBSCAN model that can be used to classify new data without re-running the algorithm?
DBSCAN and other 'unsupervised' clustering methods can be used to automatically propagate labels used by classifiers (a 'supervised' machine learning task) in what is known as 'semi-supervised' machine learning. I'll break down the general steps for doing this (a rough code sketch follows the steps) and cite a series of semi-supervised papers that motivated this approach.
1. By some means, label a small portion of your data.
2. Use DBSCAN or another clustering method (e.g. k-nearest neighbors) to cluster your labeled and unlabeled data.
3. For each cluster, determine the most common label (if any) among its members and re-label all members of the cluster with that label. This effectively increases the amount of labeled training data.
4. Train a supervised classifier using the dataset from step 3.
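Here is a rough sketch of steps 2-4 with scikit-learn. The function names, the `y_partial` encoding with -1 for unlabeled rows, and the choice of RandomForestClassifier are illustrative assumptions, not anything prescribed by the papers below:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier

def propagate_labels(X, y_partial, eps=0.5, min_samples=5):
    """Steps 2-3: cluster everything, then give every cluster its majority label.

    `y_partial` is assumed to be an integer array with the few known labels
    and -1 for every unlabeled row (an encoding chosen just for this sketch)."""
    clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    y = y_partial.copy()
    for c in np.unique(clusters):
        if c == -1:                          # skip DBSCAN noise points
            continue
        members = clusters == c
        known = y_partial[members]
        known = known[known != -1]           # labeled members of this cluster
        if len(known) == 0:
            continue                         # no seed label, leave the cluster unlabeled
        y[members] = np.bincount(known).argmax()   # relabel the whole cluster
    return y

def train_classifier(X, y_propagated):
    """Step 4: train a supervised classifier on the enlarged labeled set."""
    mask = y_propagated != -1
    return RandomForestClassifier(random_state=0).fit(X[mask], y_propagated[mask])
```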
The following papers propose some extensions to this general process to improve classification performance. As a note, all of the following papers have found that k-means is a consistent, efficient, and effective clustering method for semi-supervised learning compared to about a dozen other clustering methods. They then use k-nearest neighbors with a large K value for classification. One paper that specifically covers DBSCAN-based clustering is the Erman & Arlitt (2006) paper, the first one listed below.
NOTE: These papers are listed in chronological order and build upon each other. The 2016 Glennan paper is what you should read if you only want to see the most successful/advanced iteration.
Erman, J., & Arlitt, M. (2006). Traffic classification using clustering algorithms. In Proceedings of the 2006 SIGCOMM workshop on Mining network data (pp. 281–286). https://doi.org/10.1145/1162678.1162679
Wang, Y., Xiang, Y., Zhang, J., & Yu, S. (2011). A novel semi-supervised approach for network traffic clustering. In 5th International Conference on Network and System Security (NSS) (pp. 169–175). Milan, Italy: IEEE. https://doi.org/10.1109/ICNSS.2011.6059997
Zhang, J., Chen, C., Xiang, Y., & Zhou, W. (2012). Semi-supervised and compound classification of network traffic. In Proceedings - 32nd IEEE International Conference on Distributed Computing Systems Workshops, ICDCSW 2012 (pp. 617–621). https://doi.org/10.1109/ICDCSW.2012.12
Glennan, T., Leckie, C., & Erfani, S. M. (2016). Improved Classification of Known and Unknown Network Traffic Flows Using Semi-supervised Machine Learning. In J. K. Liu & R. Steinfeld (Eds.), Information Security and Privacy: 21st Australasian Conference (Vol. 2, pp. 493–501). Melbourne: Springer International Publishing. https://doi.org/10.1007/978-3-319-40367-0_33
Train a classifier based on your model.
DBSCAN is not easy to adapt to new objects, because you would eventually need to adjust minPts. Adding points to DBSCAN can cause clusters to merge, which you probably do not want to happen.
If you consider the clusters found by DBSCAN to be useful, train a classifier to put new instances into the same classes. You now want to perform classification, not rediscover structure.
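As a hedged scikit-learn sketch of that idea: fit DBSCAN once on a sample that fits in memory (`X_sample` is a placeholder for your data), keep only the core samples and their cluster labels, and assign a new point to the cluster of its nearest core sample if that core sample lies within eps, otherwise call it noise. This mimics how DBSCAN itself would label the point, without re-running the clustering:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

EPS = 0.5                                            # same eps for fitting and prediction
db = DBSCAN(eps=EPS, min_samples=5).fit(X_sample)    # X_sample: the subset that fits in memory

# keep only what is needed to label new data: the core samples and their cluster ids
core_points = db.components_
core_labels = db.labels_[db.core_sample_indices_]
nn = NearestNeighbors(n_neighbors=1).fit(core_points)

def dbscan_predict(X_new):
    """Assign each new point to the cluster of its nearest core sample,
    or -1 (noise) if no core sample lies within EPS."""
    dist, idx = nn.kneighbors(X_new)
    labels = core_labels[idx.ravel()]
    labels[dist.ravel() > EPS] = -1
    return labels
```

The alternative is exactly what the answer suggests: feed the non-noise points and their DBSCAN labels to any off-the-shelf classifier and use its predict method on new data.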
I am new to machine learning. While reading about supervised learning, unsupervised learning, and reinforcement learning I came across the question below and got confused. Please help me identify which of the three scenarios below is supervised learning, unsupervised learning, and reinforcement learning.
What types of learning, if any, best describe the following three scenarios:
(i) A coin classification system is created for a vending machine. In order to do this, the developers obtain exact coin specifications from the U.S. Mint and derive a statistical model of the size, weight, and denomination, which the vending machine then uses to classify its coins.
(ii) Instead of calling the U.S. Mint to obtain coin information, an algorithm is presented with a large set of labeled coins. The algorithm uses this data to infer decision boundaries which the vending machine then uses to classify its coins.
(iii) A computer develops a strategy for playing Tic-Tac-Toe by playing repeatedly and adjusting its strategy by penalizing moves that eventually lead to losing.
(i) Unsupervised learning - as no labelled data is available.
(ii) Supervised learning - as you already have labelled data available.
(iii) Reinforcement learning - where you learn and relearn based on actions and the effects/rewards from those actions.
Let's say you have a dataset represented as a matrix X. Each row in X is an observation (instance) and each column represents a particular variable (feature).
If you also have (and use) a vector y of labels corresponding to the observations, then this is a supervised learning task. There's a "supervisor" involved that says which observations belong to class #1, which to class #2, etc.
If you don't have labels for the observations, then you have to make decisions based on the dataset X itself. For example, in the coin example you may want to build a model of the normal distribution of coin parameters and create a system that signals when a coin has unusual parameters (and thus may be an attempted fraud). In this case you don't have any kind of supervisor that would say which coins are ok and which represent a fraud attempt. Thus, it is an unsupervised learning task.
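A tiny sketch of that coin idea (all numbers, feature names, and the 1% cutoff are invented for illustration): fit a Gaussian to unlabeled coin measurements and flag coins whose parameters are unlikely under it; no supervisor is involved.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# unlabeled measurements of coins: columns are [diameter_mm, weight_g]
coins = rng.normal(loc=[19.05, 2.50], scale=[0.05, 0.02], size=(500, 2))

# fit a Gaussian to the data itself -- there are no labels, hence "unsupervised"
model = multivariate_normal(mean=coins.mean(axis=0),
                            cov=np.cov(coins, rowvar=False))

new_coins = np.array([[19.04, 2.51],    # looks like a normal coin
                      [21.20, 3.10]])   # unusual parameters -> possible fraud
threshold = np.percentile(model.logpdf(coins), 1)   # arbitrary 1% cutoff for the sketch
print(model.logpdf(new_coins) < threshold)          # -> [False  True]
```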
In the two previous examples you first trained your model and then used it, without any further changes to the model. In reinforcement learning, the model is continuously improved based on the processed data and the result. For example, a robot that seeks to find its way from point A to point B may first compute the parameters of a move, then move based on these parameters, then analyze its new position and update the move parameters, so that the next move is more accurate (repeat until it reaches point B).
Based on this, I'm pretty sure you will be able to find correspondence between these 3 kinds of learning and your items.
I wrote an article on the perceptron for novices. I explained supervised learning in detail with the Delta Rule, and also described unsupervised learning and reinforcement learning (in brief). You may check it out if you are interested.
"An Intuitive Example of Artificial Neural Network (Perceptron) Detecting Cars / Pedestrians from a Self-driven Car"
https://www.spicelogic.com/Blog/Perceptron-Artificial-Neural-Networks-10
Can anyone recommend a strategy for making predictions using a gradient boosting model in the <10-15ms range (the faster the better)?
I have been using R's gbm package, but the first prediction takes ~50ms (subsequent vectorized predictions average to 1ms, so there appears to be overhead, perhaps in the call to the C++ library). As a guideline, there will be ~10-50 inputs and ~50-500 trees. The task is classification and I need access to predicted probabilities.
I know there are a lot of libraries out there, but I've had little luck finding information even on rough prediction times for them. The training will happen offline, so only predictions need to be fast -- also, predictions may come from a piece of code / library that is completely separate from whatever does the training (as long as there is a common format for representing the trees).
I'm the author of the scikit-learn gradient boosting module, a Gradient Boosted Regression Trees implementation in Python. I put some effort into optimizing prediction time, since the method was targeted at low-latency environments (in particular ranking problems). The prediction routine is written in C, but there is still some overhead due to Python function calls. Having said that, prediction time for single data points with ~50 features and about 250 trees should be << 1 ms.
In my use-cases prediction time is often governed by the cost of feature extraction. I strongly recommend profiling to pin-point the source of the overhead (if you use Python, I can recommend line_profiler).
If the source of the overhead is prediction rather than feature extraction, you might check whether it's possible to do batch predictions instead of predicting single data points, thus limiting the overhead due to the Python function call (e.g. in ranking you often need to score the top-K documents, so you can do the feature extraction first and then run predict on the K x n_features matrix).
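As a rough illustration of the batch-versus-single-point difference with scikit-learn's GradientBoostingClassifier (the sizes here are arbitrary and the absolute timings will vary, but the per-row cost of the batch call is typically far lower because the Python call overhead is paid once):

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
clf = GradientBoostingClassifier(n_estimators=250, random_state=0).fit(X, y)

X_query = X[:1000]

t0 = time.perf_counter()
for row in X_query:                      # one Python call per data point
    clf.predict_proba(row.reshape(1, -1))
single = (time.perf_counter() - t0) / len(X_query)

t0 = time.perf_counter()
clf.predict_proba(X_query)               # one call for the whole batch
batch = (time.perf_counter() - t0) / len(X_query)

print(f"per-row latency: single={single * 1e3:.3f} ms, batch={batch * 1e3:.3f} ms")
```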
If this doesn't help either, you should try to limit the number of trees, because the runtime cost of prediction is basically linear in the number of trees.
There are a number of ways to limit the number of trees without affecting the model accuracy:
- Proper tuning of the learning rate; the smaller the learning rate, the more trees are needed and thus the slower the prediction.
- Post-process the GBM with L1 regularization (Lasso); see Elements of Statistical Learning, Section 16.3.1: use the predictions of each tree as new features, run this representation through an L1-regularized linear model, and remove the trees that don't get any weight (see the sketch after this list).
- Fully corrective weight updates; instead of doing the line search/weight update just for the most recent tree, update all trees (see [Warmuth2006] and [Johnson2012]). Better convergence means fewer trees.
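A hedged sketch of the Lasso post-processing idea from the second bullet, using scikit-learn's GradientBoostingRegressor. The dataset and the alpha value are placeholders, and this is an approximation of the ESL 16.3.1 procedure, not a drop-in for R's gbm:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=2000, n_features=50, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                random_state=0).fit(X, y)

# one feature column per tree: the (scaled) prediction of that single tree
tree_preds = np.column_stack([
    gbm.learning_rate * stage[0].predict(X) for stage in gbm.estimators_
])

# L1-regularized refit over the per-tree outputs; larger alpha drops more trees
# (alpha is arbitrary here and should be tuned, e.g. with LassoCV)
lasso = Lasso(alpha=1.0, max_iter=50000).fit(tree_preds, y)
kept = np.flatnonzero(lasso.coef_)
print(f"trees kept: {len(kept)} of {gbm.n_estimators_}")
# at prediction time, only the trees in `kept` need to be evaluated,
# using their new weights lasso.coef_ plus the intercept
```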
If none of the above does the trick, you could investigate cascades or early-exit strategies (see [Chen2012]).
References:
[Warmuth2006] M. Warmuth, J. Liao, and G. Ratsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd international conference on Machine learning, 2006.
[Johnson2012] Rie Johnson, Tong Zhang, Learning Nonlinear Functions Using Regularized Greedy Forest, arxiv, 2012.
[Chen2012] Minmin Chen, Zhixiang Xu, Kilian Weinberger, Olivier Chapelle, Dor Kedem, Classifier Cascade for Minimizing Feature Evaluation Cost, JMLR W&CP 22: 218-226, 2012.
I know SVMs are supposedly 'ANN killers' in that they automatically select representation complexity and find a global optimum (see here for some SVM praising quotes).
But here is where I'm unclear -- do all of these claims of superiority hold for just the case of a 2 class decision problem or do they go further? (I assume they hold for non-linearly separable classes or else no-one would care)
So a sample of some of the cases I'd like to be cleared up:
Are SVMs better than ANNs with many classes?
in an online setting?
What about in a semi-supervised case like reinforcement learning?
Is there a better unsupervised version of SVMs?
I don't expect someone to answer all of these lil' subquestions, but rather to give some general bounds for when SVMs are better than the common ANN equivalents (e.g. FFBP, recurrent BP, Boltzmann machines, SOMs, etc.) in practice, and preferably, in theory as well.
Are SVMs better than ANNs with many classes? You are probably referring to the fact that SVMs are, in essence, either one-class or two-class classifiers. Indeed they are, and there's no way to modify an SVM algorithm to classify more than two classes.
The fundamental feature of an SVM is the separating maximum-margin hyperplane whose position is determined by maximizing its distance from the support vectors. And yet SVMs are routinely used for multi-class classification, which is accomplished with a processing wrapper around multiple SVM classifiers that work in a "one against many" pattern--i.e., the training data is shown to the first SVM which classifies those instances as "Class I" or "not Class I". The data in the second class is then shown to a second SVM which classifies this data as "Class II" or "not Class II", and so on. In practice, this works quite well. So, as you would expect, the superior resolution of SVMs compared to other classifiers is not limited to two-class data.
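For a concrete (if library-specific) illustration, here is a one-against-the-rest wrapper around binary SVMs in scikit-learn; the dataset and parameters are arbitrary:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)        # three classes, so binary SVMs need a wrapper

# one "class i vs. the rest" binary SVM per class, combined by the wrapper
ovr_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
print(cross_val_score(ovr_svm, X, y, cv=5).mean())
```

(For what it's worth, scikit-learn's SVC would otherwise handle multi-class input on its own via a one-vs-one scheme, which is the other common wrapper strategy.)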
As far as I can tell, the studies reported in the literature confirm this. E.g., in the provocatively titled paper Sex with Support Vector Machines, substantially better resolution for sex identification (male/female) in 12-square pixel images was reported for SVM compared with a group of traditional linear classifiers; SVM also outperformed an RBF NN, as well as a large ensemble of RBF NNs. And there seems to be plenty of similar evidence for the superior performance of SVM in multi-class problems: e.g., SVM outperformed NN in protein-fold recognition and in time-series forecasting.
My impression from reading this literature over the past decade or so is that the majority of the carefully designed studies--by persons skilled at configuring and using both techniques, and using data sufficiently resistant to classification to provoke some meaningful difference in resolution--report the superior performance of SVM relative to NN. But as your question suggests, that performance delta seems to be, to a degree, domain specific.
For instance, NN outperformed SVM in a comparative study of author identification from texts in Arabic script; in a study comparing credit rating prediction, there was no discernible difference in resolution between the two classifiers; and a similar result was reported in a study of high-energy particle classification.
I have read, from more than one source in the academic literature, that SVM outperforms NN as the size of the training data decreases.
Finally, the extent to which one can generalize from the results of these comparative studies is probably quite limited. For instance, in one study comparing the accuracy of SVM and NN in time series forecasting, the investigators reported that SVM did indeed outperform a conventional (back-propagating over layered nodes) NN but performance of the SVM was about the same as that of an RBF (radial basis function) NN.
[Are SVMs better than ANN] In an Online setting? SVMs are not used in an online setting (i.e., incremental training). The essence of SVMs is the separating hyperplane whose position is determined by a small number of support vectors. So even a single additional data point could in principle significantly influence the position of this hyperplane.
What about in a semi-supervised case like reinforcement learning? Until the OP's comment on this answer, I was not aware of either neural networks or SVMs being used in this way--but they are.
The most widely used semi-supervised variant of SVM is named the Transductive SVM (TSVM), first mentioned by Vladimir Vapnik (the same guy who discovered/invented the conventional SVM). I know almost nothing about this technique other than what it is called and that it follows the principles of transduction (roughly, lateral reasoning--i.e., reasoning from training data to test data). Apparently the TSVM is a preferred technique in the field of text classification.
Is there a better unsupervised version of SVMs? I don't believe SVMs are suitable for unsupervised learning. Separation is based on the position of the maximum-margin hyperplane determined by the support vectors. This could easily be my own limited understanding, but I don't see how that would happen if those support vectors were unlabeled (i.e., if you didn't know beforehand what you were trying to separate). One crucial use case of unsupervised algorithms is when you don't have labeled data, or you do and it's badly unbalanced. E.g., online fraud: here you might have in your training data only a few data points labeled as "fraudulent accounts" (and usually with questionable accuracy) versus the remaining >99% labeled "not fraud". In this scenario, a one-class classifier, a typical configuration for SVMs, is a good option. In particular, the training data consists of instances labeled "not fraud" and "unk" (or some other label to indicate they are not in the class)--in other words, "inside the decision boundary" and "outside the decision boundary".
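A small sketch of that one-class setup with scikit-learn's OneClassSVM (the data, the features, and the nu value are invented for illustration): train only on instances believed to be "not fraud" and flag anything that falls outside their decision boundary:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# training data: feature vectors for accounts believed to be "not fraud"
X_normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))

# nu roughly bounds the fraction of training points treated as outliers
detector = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale").fit(X_normal)

X_new = np.vstack([rng.normal(0.0, 1.0, size=(5, 4)),    # look like the training data
                   rng.normal(8.0, 1.0, size=(2, 4))])   # far outside the boundary
print(detector.predict(X_new))   # +1 = inside the boundary, -1 = flagged as an outlier
```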
I wanted to conclude by mentioning that, 20 years after their "discovery", the SVM is a firmly entrenched member of the ML library. And indeed, its consistently superior resolution compared with other state-of-the-art classifiers is well documented.
Their pedigree is a function both of their superior performance, documented in numerous rigorously controlled studies, and of their conceptual elegance. W/r/t the latter point, consider that multi-layer perceptrons (MLPs), though they are often excellent classifiers, are driven by a numerical optimization routine which in practice rarely finds the global minimum; moreover, that solution has no conceptual significance. On the other hand, the numerical optimization at the heart of building an SVM classifier does in fact find the global minimum. What's more, that solution is the actual decision boundary.
Still, I think the SVM's reputation has declined a little during the past few years.
The primary reason, I suspect, is the Netflix competition. Netflix emphasized the resolving power of fundamental techniques of matrix decomposition and, even more significantly, the power of combining classifiers. People combined classifiers long before Netflix, but more as a contingent technique than as an attribute of classifier design. Moreover, many of the techniques for combining classifiers are extraordinarily simple to understand and also to implement. By contrast, SVMs are not only very difficult to code (in my opinion, by far the most difficult ML algorithm to implement in code) but also difficult to configure and use as a pre-compiled library--e.g., a kernel must be selected, the results are very sensitive to how the data is re-scaled/normalized, etc.
I loved Doug's answer. I would like to add two comments.
1) Vladimir Vapnik also co-invented the VC dimension, which is important in learning theory.
2) I think that SVMs were the best overall classifiers from 2000 to 2009, but after 2009, I am not sure. I think that neural nets have improved very significantly recently due to the work in Deep Learning and Sparse Denoising Auto-Encoders. I thought I saw a number of benchmarks where they outperformed SVMs. See, for example, slide 31 of
http://deeplearningworkshopnips2010.files.wordpress.com/2010/09/nips10-workshop-tutorial-final.pdf
A few of my friends have been using the sparse auto-encoder technique. The neural nets built with that technique significantly outperformed older back-propagation neural networks. I will try to post some experimental results at artent.net if I get some time.
I'd expect SVMs to be better when you have good features to start with, i.e., your features succinctly capture all the necessary information. You can tell your features are good if instances of the same class "clump together" in the feature space. Then an SVM with a Euclidean kernel should do the trick. Essentially, you can view the SVM as a supercharged nearest-neighbor classifier, so whenever nearest neighbor does well, the SVM should do even better, by adding automatic quality control over the examples in your set. Conversely -- if it's a dataset where nearest neighbor (in feature space) is expected to do badly, the SVM will do badly as well.
- Is there a better unsupervised version of SVMs?
Just answering this question here. Unsupervised learning can be done by so-called one-class support vector machines. Again, similar to normal SVMs, there is an element that promotes sparsity. In normal SVMs only a few points are considered important, the support vectors. In one-class SVMs, again only a few points can be used to either:
"separate" a dataset as far from the origin as possible, or
define a radius as small as possible.
The advantages of normal SVMs carry over to this case: compared to density estimation, only a few points need to be considered. The disadvantages carry over as well.
Are SVMs better than ANNs with many classes?
SVMs were designed for discrete classification. Before moving to ANNs, try ensemble methods like Random Forests, Gradient Boosting, Gaussian probability classification, etc.
What about in a semi-supervised case like reinforcement learning?
Deep Q learning provides better alternatives.
Is there a better unsupervised version of SVMs?
SVMs are not suited for unsupervised learning. You have other alternatives for unsupervised learning: k-means, hierarchical clustering, t-SNE, etc.
From an ANN perspective, you can try autoencoders or generative adversarial networks (GANs).