Gaussian Process Confidence vs Credible Intervals - machine-learning

Since Gaussian Process returns a distribution and not a point estimate, why this example (and actually in every example with GP) talk about Confidence Intervals on the analogues for Bayesian statistics the Credible Intervals?

I was wondering about that, too. My assumption is the following: GaussianProcessRegressor from sklearn implements Algorithm 2.1 from Rasmussen & Williams (2006). Throughout this book, the ±2σ interval around µ is referred to as "95% confidence region". They simply do not make a distinction between "confidence" and "credible" region. I think the authors of sklearn adopted that.
C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006

Related

Which accuracy (training or test) is being reported in journal articles?

I am new to neural network. When I read articles, they often say “we noted a 98% accuracy”. I carefully read the articles (see below two articles), but there is no further information whether the accuracy is referring to training or test (validation). Please let me know which accuracy is the one the authors are implying.
Grinblat, G. L., Uzal, L. C., Larese, M. G., & Granitto, P. M. (2016). Deep learning for plant identification using vein morphological patterns. Computers and Electronics in Agriculture, 127, 418-424.
Satti, V., Satya, A., & Sharma, S. (2013). An automatic leaf recognition system for plant identification using machine vision technology. International journal of engineering science and technology, 5(4), 874.
For what I read, accuracy refer to test, when you test with a large number of data, you give an opportunity to your machine learning to have a high accuracy. Of cause, the test always determine if your work gives the result expected

Ensemble Learning in Unsupervised Learning

I have a question regarding the current literature in ensemble learning (more specifically in unsupervised learning).
For what I read in the literature, Ensemble Learning when applied to Unsupervised Learning resumes basically to Clustering Problems. However, if I have x unsupervised methods that output a score (similar to a regression problem), is there an approach that can combine these results into a single one?
On evaluation of outlier rankings and outlier scores. Schubert, E., Wojdanowski, R., Zimek, A., & Kriegel, H. P. (2012, April). In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 1047-1058). Society for Industrial and Applied Mathematics.
In this publication, we not "just normalize" outlier scores, but we also suggest a unsupervised ensemble member selection strategy called "greedy ensemble".
However, normalization is crucial, and difficult. We published some of the earlier progress with respect to score normalization as
Interpreting and unifying outlier scores. Kriegel, H. P., Kroger, P., Schubert, E., & Zimek, A. (2011, April). In Proceedings of the 2011 SIAM International Conference on Data Mining (pp. 13-24). Society for Industrial and Applied Mathematics.
If you don't normalize your scores (and min-max scaling is not enough), you will usually not be able to combine them in a meaningful way, except with very strong preconditions. Even two different subspaces will usually yield incomparable values because of having a different number of features, and different feature scales.
There is also some work on semi-supervised ensembles, e.g.
Learning Outlier Ensembles: The Best of Both Worlds—Supervised and Unsupervised. Micenková, B., McWilliams, B., & Assent, I. (2014).In Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity (ODD2). New York, NY, USA (pp. 51-54).
Also beware of overfitting. It's quite easy to arrive at a single good result by tweaking parameters and repeated evaluation. But this leaks evaluation information into your experiment, i.e. you tend to overfit. Performing well across a large range of parameters and data sets is very hard. One of the key observations of the following study was that for every algorithm, you'll find at least one data set and parameter set, where it 'outperforms' the others; but if you change parameters a little, or use a different data set, the benefits of the "superior" new methods are not reproducible.
On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016). Data Mining and Knowledge Discovery, 30(4), 891-927.
So you will have to work really hard to do a reliable evaluation. Be careful how to choose parameters.

How could I deal with the sparse feature with high dimension in an SVR task?

I have a twitter-like(another micro blog) data set with 1.6 million datapoints and tried to predict the its retweet numbers based on its content. I extracted its keyword and use the keywords as the bag of words feature. Then I got 1.2 million dimension feature. The feature vector is very sparse,usually only ten dimension in one data point. And I use SVR to do the regression. Now it has taken 2 days. I think the training time might take quite a long time. I don't know if I do this task like this is normal. Is there any way or is it necessary to optimize this problem?
BTW. If in this case , I don't use any kernel and the machine is 32GB RAM and i-7 16 cores. How long the training time will be in estimation? I used the lib pyml.
You need to find a dimensionality reduction approach that works for your problem.
I've worked on a similar problem to yours and I found that Information Gain worked well, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the
ones distributed most differently in the sets of positive and negative examples of
ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem.
I've been successful using Information Gain for feature reduction and found this paper (Entropy based feature selection for text categorization Largeron, Christine and Moulin, Christophe and Géry, Mathias - SAC - Pages 924-928 2011) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj , ck) can be
computed from a contingency table. Let A be the number
of documents in the category containing tj ; B, the number
of documents in the other categories containing tj ; C, the
number of documents of ck which do not contain tj and D,
the number of documents in the other categories which do
not contain tj (with N = A + B + C + D):
Using this contingency table, Information Gain can be estimated by:
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Ter-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
At first you can simply remove all words with high frequency and all words with low frequency, because both of them don't tell you much about content of a text, then you have to do a word-stemming.
After that you can try to reduce dimensionality of your space, with Feature hashing, or some more advance dimensionality reduction trick (PCA, ICA), or even both of them.

What is the appropriate Machine Learning Algorithm for this scenario?

I am working on a Machine Learning problem which looks like this:
Input Variables
Categorical
a
b
c
d
Continuous
e
Output Variables
Discrete(Integers)
v
x
y
Continuous
z
The major issue that I am facing is that Output Variables are not totally independent of each other and there is no relation that can be established between them. That is, there is a dependence but not due to the causality (one value being high doesn't imply that the other will be high too but the chances of other being higher will improve)
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 instances of the problem predicting each of the output variables doesn't make sense to me. Infact there should be some way to predict all 4 together taking care of their implicit dependencies.
But as you can see, there won't be a direct relation, infact there would be a probability that is involved but which can't be worked out manually.
Plus the output variables are not Categorical but are in fact Discrete and Continuous.
Any inputs on how to go about solving this problem. Also guide me to existing implementations for the same and which toolkit to use to quickly implement the solution.
Just a random guess - I think this problem can be targeted by Bayesian Networks. What do you think ?
Bayesian Networks will do fine in your case. Your network won't be that huge either so you can live with exact inference algorithms like graph elimination or junction tree. If you decide to use BNs, then you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance you say that given that today is cloudy, there is a 0.8 probability of rain. You define all probability distributions, where the graph shows the causality relations (i.e. if cloud then rain etc.) Then as query you ask to your inference algorithm questions like, given that grass was wet; was it cloudy, was it raining, was the sprinkler on and so on.
To use BNs one needs a system model that is described in terms of causality relations (Directed Acyclic Graph) and probability transitions. If you wanna learn your system parameters there are techniques like EM algorithm. However, learning the graph structure is a really hard task and supervised machine learning approaches will do better in that case.

Ordinal classification packages and algorithms

I'm attempting to make a classifier that chooses a rating (1-5) for a item i. For each item i, I have a vector x containing about 40 different quantities pertaining to i. I also have a gold standard rating for each item. Based on some function of x, I want to train a classifier to give me a rating 1-5 that closely matches the gold standard.
Most of the information I've seen on classifiers deal with just binary decisions, while I have a rating decision. Are there common techniques or code libraries out there to deal with this sort of problem?
I agree with you that ML problems in which the response variable is on an ordinal scale
require special handling--'machine-mode' (i.e., returning a class label) seems insufficient
because the class labels ignore the relationship among the labels ("1st, 2nd, 3rd");
likewise, 'regression-mode' (i.e., treating the ordinal labels as floats, {1, 2, 3}) because
it ignores the metric distance between the response variables (e.g., 3 - 2 != 1).
R has (at least) several packages directed to ordinal regression. One of these is actually called Ordinal, but i haven't used it. I have used the Design Package in R for ordinal regression and i can certainly recommend it. Design contains a complete set of functions for solution, diagnostics, testing, and results presentation of ordinal regression problems via the Ordinal Logistic Model. Both Packages are available from CRAN) A step-by-step solution of an ordinal regression problem using the Design Package is presented on the UCLA Stats Site.
Also, i recently looked at a paper by a group at Yahoo working on ordinal classification using Support Vector Machines. I have not attempted to apply their technique.
Have you tried using Weka? It supports binary, numerical, and nominal attributes out of the box, the latter two of which might work well enough for your purposes.
Furthermore, it looks like one of the classifiers that's available is a meta-classifier called OrdinalClassClassifier.java, which is the result of this research:
Eibe Frank and Mark Hall, A simple approach to ordinal classification. In Proceedings of the 12th European Conference on Machine Learning, 2001, pp. 145-156.
If you don't need a pre-made approach, then these references (in addition to doug's note about the Yahoo SVM paper) might be useful:
W Chu and Z Ghahramani, Gaussian processes for ordinal regression. Journal of Machine Learning Research, 2006.
Wei Chu and S. Sathiya Keerthi, New approaches to support vector ordinal regression. In Proceedings of the 22nd international conference on Machine Learning, 2005, 145-152.
The problems that dough has raised are all valid. Let me add another one. You didn't say how you would like to measure the agreement between the classification and the "gold standard". You have to formulate the answer to that question as soon as possible, as this will have a huge impact on your next step. In my experience, the most problematic part of any (ok, not any, most) optimization task is the score function. Try asking yourself whether all errors equal? Does miss-classifying the "3" as being "4" has the same impact as classifying "4" as "3"? What about "1" vs "5". Can mistakenly missing one case have disastrous consequences (miss HIV diagnosis, activate pilot ejection in a plane)
The simplest way to measure the agreement between categorical classifiers is Cohen's Kappa. More complicated methods are described in the following links here, here, here, and here
Having said that, sometimes picking a solution that "just works", instead of "the right one" is faster and easier. If I were you I would pick a machine learning library (R, Weka, I personally love Orange) and see what I get. Only if you don't have reasonably good results with that, look for more complex solutions
If not interested in fancy statistics a one hidden layer back propagation neural network with 3 or 5 output nodes will probably do the trick if the training data is sufficiently large. Most NN classifiers try to minimize the mean squared error which is not always desired. Support Vector Machines mentioned earlier is a good alternative.
FANN is a good library for back propagation NNs, it also has some tools to assist in training of the network.
There are two packages in R that might help taming ordinal data
ordinalForest on CRAN
rpartScore on CRAN
I'm working on an OrdinalClassifier that is based on the sklearn framework (specifically the OVR multiclass classifier) and which works well with sklearn workflow such as pipelines, cross validation, and scoring.
Through testing, I'm finding that it performs very well vs. standard non-ordinal multiclass classification using SVC. And it gives much greater control over optimizing for precision and recall on the positive class (in my testing, I used sklearn's diabetes dataset and transformed the disease progression target(y) into a low, medium, high class label. Testing via cross validation is on my repo along with attribution. Scoring is based on weighted f1.
https://github.com/leeprevost/OrdinalClassifier

Resources