What does "Oracle" means in machine learning? - machine-learning

I'm reading "Diversity Creation Methods: A Survey and Categorisation". There is an explanation about the Q-statistic, “Take two classifiers, fi and fj. Over a large set of testing patterns, they will exhibit certain conicident errors, and therefore a probability of error coincidence can be calculated. These are also referred to as the Oracle outputs”

AFAIK "Oracle" doesn't have any established meaning in ML in general. In this article it's just a synonym for an ensemble member. It's not used constantly throughout the article, so I'm guessing it's just a reference to the term used in some earlier formulation of the method discussed.
Sometimes "Oracle" stands for an imaginary source of knowledge about the target function - a source of training/testing samples used in some kind of intuitive proofs or
thought experiments.

Related

What is the difference between probabilistic programming vs. probabilistic machine learning?

I am trying to wrap my head around the concept of probabilistic programming but the more I read, the more I feel confused.
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into programming language for creation of automated inference models?
I have some background in machine learning and I remember some machine learning models also output probabilities and then I come across the term probabilistic machine learning...
Is there a difference between the two? Or are they something similar?
Appreciate anyone that could help clarify.
I guess there is some vagueness between the two terms but my take on them is the following:
Probabilistic Programming is expressing probabilistic models as computer programs that generate data (i.e. simulators).
Probabilistic Models + Programming = Probabilistic Programming
There is no restriction on what comprises a probabilistic model (it may well be a neural network of some sort). Therefore, I view this term as:
More generic
More frequently used in an applied context (with relation to programming)
Probabilistic Machine Learning
is another flavour of ML which deals with the probabilistic aspects of predictions, e.g. the model does not treat input/output values as certain and/or point values, but instead treats them (or some of them) as random variables. A prominent example of such an approach is the Gaussian Process.
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into programming language for creation of automated inference models?
That is correct. A probabilistic program can be viewed as equivalent to a Bayesian network, but is expressed in a much richer language. Probabilistic programming as a field proposes such representations, as well as algorithms that take advantage of those representations, because sometimes a richer representation makes the problem easier.
Consider for example a probabilistic program that models a disease that is more likely to afflict men:
N = 1000000;
for i = 1:N {
    male[i] ~ Bernoulli(0.5);
    disease[i] ~ if male[i] then Bernoulli(0.8) else Bernoulli(0.3);
}
This probabilistic program is equivalent to a Bayesian network, with one male[i] and one disease[i] node per individual, accompanied by the appropriate conditional probability tables.
For highly repetitive networks such as this, authors often use plate notation to make their depiction more succinct.
However, plate notation is a device for human-readable publications, not a formal language in the same sense that a programming language is. Also, for more complex models, plate notation could become harder to understand and maintain. Finally, a programming language brings other benefits, such as primitive operations that make it easier to express conditional probabilities.
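As a concrete illustration of the "programs that generate data" view, here is a minimal sketch that forward-simulates the same model in plain Python/NumPy (this is illustrative code, not a dedicated probabilistic programming language; the variable names simply mirror the pseudocode above):
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

male = rng.random(N) < 0.5               # male[i] ~ Bernoulli(0.5)
p_disease = np.where(male, 0.8, 0.3)     # if male[i] then 0.8 else 0.3
disease = rng.random(N) < p_disease      # disease[i] ~ Bernoulli(p_disease)

print(disease.mean())                    # close to 0.5*0.8 + 0.5*0.3 = 0.55
Running such a simulator forward is easy; the point of probabilistic programming tools is to also answer inference queries about the model, which is what the rest of this answer discusses.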
So, is it just a matter of having a convenient representation? No, because a more abstract representation contains more high-level information that can be used to improve inference performance.
Suppose that we want to compute the probability distribution on the number of people among N individuals with the disease. A straightforward and generic Bayesian network algorithm would have to consider the large number of 2^N combinations of assignments to the disease variables in order to compute that answer.
The probabilistic program representation, however, explicitly indicates that the conditional probabilities for disease[i] and male[i] are identical for all i. An inference algorithm can exploit that to compute the marginal probability of disease[i], which is identical for all i, use the fact that the number of diseased people will therefore be a binomial distribution B(N, P(disease[i])) and provide that as the desired answer, in time constant in N. It would also be able to provide an explanation of this conclusion that would be more understandable and insightful to a user.
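A quick sketch of that computation, just spelling out the arithmetic of the paragraph above with scipy.stats (this snippet is not part of the original answer):
from scipy.stats import binom

N = 1_000_000
p_disease = 0.5 * 0.8 + 0.5 * 0.3        # marginal P(disease[i]) = 0.55

count = binom(N, p_disease)              # distribution of the number of diseased people
print(count.mean(), count.std())         # 550000.0, roughly 497.5
print(count.cdf(551_000))                # e.g. P(at most 551000 people have the disease)
No enumeration over the 2^N joint assignments is needed.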
One could argue that this comparison is unfair because a knowledgeable user would not pose the query as defined for the explicit O(N)-sized Bayesian network, but instead simplify the problem in advance by exploiting its simple structure. However, the user may not be knowledgeable enough for such a simplification, particularly for more complex cases, or may not have the time to do it, or may make a mistake, or may not know in advance what the model will be so she cannot simplify it manually like that. Probabilistic programming offers the possibility that such simplification be made automatically.
To be fair, most current probabilistic programming tools (such as JAGS and Stan) do not perform this more sophisticated mathematical reasoning (often called lifted probabilistic inference) and instead simply perform Markov Chain Monte Carlo (MCMC) sampling over the Bayesian network equivalent to the probabilistic program (but often without having to build the entire network in advance, which is also another possible gain). In any case, this convenience already more than justifies their use.

Q-learning vs dynamic programming

Is the classic Q-learning algorithm, using lookup table (instead of function approximation), equivalent to dynamic programming?
From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically.
So, although both share the same working principles (whether using tabular Reinforcement Learning/Dynamic Programming or approximated RL/DP), the key difference between classic DP and classic RL is that the former assumes the model is known. This basically means knowing the transition probabilities (the probability of moving from state s to state s' given action a) and the expected immediate reward function.
On the contrary, RL methods only require access to a set of samples, either collected online or offline (depending on the algorithm).
Of course, there are hybrid methods that can be placed between RL and DP, for example those that learn a model from the samples and then use that model in the learning process.
NOTE: The term Dynamic Programming, in addition to a set of mathematical optimization techniques related to RL, is also used to refer to a "general algorithmic pattern", as pointed out in some comments. In both cases the fundamentals are the same, but depending on the context the term may have a different meaning.
Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.
Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.
Q-Learning is a model-free reinforcement learning method. It is "model-free" not because it doesn't use a machine learning model or anything like that, but because it doesn't require, and doesn't use, a model of the environment (also known as the MDP) to obtain an optimal policy. You also have "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model, not simply using one. And, unlike model-free methods, they don't throw away samples after estimating values; they instead try to reconstruct the transition and reward functions to get better performance.
Model-based methods combine model-free and planning algorithms to get equally good results with fewer samples than model-free methods (Q-Learning) require, and without needing a model as Dynamic Programming methods (Value/Policy Iteration) do.
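To make the distinction concrete, here is a hedged sketch on a tiny made-up two-state MDP: value iteration (Dynamic Programming) consumes the transition and reward model directly, while tabular Q-learning only sees sampled transitions. All numbers and names below are illustrative placeholders.
import numpy as np

# Tiny made-up 2-state, 2-action MDP.
# P[s, a, s2] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

def value_iteration(P, R, gamma, iters=500):
    # Dynamic programming: needs the full model (P and R).
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * P @ Q.max(axis=1)
    return Q

def q_learning(step, n_states, n_actions, gamma, steps=200_000, alpha=0.05, eps=0.1):
    # Model-free: only needs sampled transitions (s, a, r, s2).
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
        r, s2 = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2
    return Q

def step(s, a):
    # Environment simulator standing in for real experience.
    s2 = np.random.choice(2, p=P[s, a])
    return R[s, a], s2

print(value_iteration(P, R, gamma))   # computed from the model
print(q_learning(step, 2, 2, gamma))  # computed from samples only; should be close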
If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.

The options for the first step of document clustering

I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require representing the documents to be clustered as a document-word matrix, where the rows stand for documents and the columns stand for words appearing in the documents. And the matrix is often very sparse.
I am wondering, are there any other options for representing documents besides using the document-word matrix? Because I believe the way we express a problem has a significant influence on how well we can solve it.
As #ffriend pointed out, you cannot really avoid the term-document-matrix (TDM) paradigm. Clustering methods operate on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:
feature selection and re-weighting attempt to remove or weight down features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or with their counts reduced). You might want to read more about Mutual Information (and its many variants) and TF-IDF.
dimensionality reduction is about encoding the information as accurately as possible in the TDM using fewer columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse.
feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but it often yields improved results.
the distributional hypothesis makes it possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning of individual words) is hard to deal with.
For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning: Vector Space Models of Semantics".
Your question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.
Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means as it's extremely easy and there are many, many implementations of it.
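As a rough sketch of that suggestion (illustrative code only; the toy documents, the TF-IDF features and the choice of 2 clusters are placeholders):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors worry about the market"]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)    # feature-based representation
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                                     # cluster id per document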
OK, this is quite a general question, and many answers are possible, none definitive, because it's an ongoing research area. So far, the answers I have read mainly concern so-called "Vector-Space models", and your question is phrased in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on the same distributional hypothesis but exploits a different theoretical framework: you no longer manipulate raw distances, but rather probability distributions and, most importantly, you can do inference based on them.
You mentioned LDA; I guess you mean Latent Dirichlet Allocation, which is the best-known such Bayesian model for document clustering. It is an alternative paradigm to vector space models, and a winning one: it has been proven to give very good results, which justifies its current success. Of course, one can argue that you still use a kind of term-document matrix through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers rarely (if ever) use that term.
Because of its success, there is a lot of software implementing LDA on the net. Here is one example, but there are many others:
http://jgibblda.sourceforge.net/
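For completeness, a hedged sketch of the LDA route in Python, using scikit-learn's implementation rather than the Java tool linked above (the toy documents and the number of topics are placeholders):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors worry about the market"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)      # per-document topic proportions
print(doc_topics.argmax(axis=1))            # hard "cluster" assignment per document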

Validating Output From a Clustering Algorithm

Is there an objective way to validate the output of a clustering algorithm?
I'm using scikit-learn's affinity propagation clustering against a dataset composed of objects with many attributes. The difference matrix supplied to the clustering algorithm is composed of the weighted difference of these attributes. I'm looking for a way to objectively validate tweaks in the distance weightings as reflected in the resulting clusters. The dataset is large and has enough attributes that manual examination of small examples is not a reasonable way to verify the produced clusters.
Yes:
Give the clusters to a domain expert, and have him analyze whether the structure the algorithm found is sensible. Not so much whether it is new, but whether it is sensible.
... and No:
There is no automatic evaluation available that is fair, in the sense that it takes the objective of unsupervised clustering into account: knowledge discovery, aka learning something new about your data.
There are two common ways of evaluating clusterings automatically:
internal cohesion, i.e. there is some particular property, such as in-cluster variance compared to between-cluster variance, to minimize. The problem is that it's usually fairly trivial to cheat, i.e. to construct a trivial solution that scores really well. So this method must not be used to compare methods based on different assumptions. You can't even fairly compare different types of linkage for hierarchical clustering.
external evaluation. You use a labeled data set, and score algorithms by how well they rediscover existing knowledge. Sometimes this works quite well, so it is an accepted state of the art for evaluation. Yet, any supervised or semi-supervised method will of course score much better on this. As such, it is A) biased towards supervised methods, and B) actually going completely against the knowledge discovery idea of finding something you did not yet know.
If you really mean to use clustering - i.e. learn something about your data - you will at some point have to inspect the clusters, preferably with a completely independent method such as a domain expert. If he can tell you that e.g. the user group identified by the clustering is a non-trivial group not yet investigated closely, then you are a winner.
However, most people want to have a "one click" (and one-score) evaluation, unfortunately.
Oh, and "clustering" is not really a machine learning task. There actually is no learning involved. To the machine learning community, it is the ugly duckling that nobody cares about.
There is another way to evaluate the clustering quality by computing a stability metric on subfolds, a bit like cross validation for supervised models:
Split the dataset into 3 folds A, B and C. Compute two clusterings with your algorithm on A+B and A+C. Compute the Adjusted Rand Index or Adjusted Mutual Information of the 2 labelings on their intersection A and consider this value as an estimate of the stability score of the algorithm.
Rinse-repeat by shuffling the data and splitting it into 3 other folds A', B' and C' and recompute a stability score.
Average the stability scores over 5 or 10 runs to have a rough estimate of the standard error of the stability score.
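A rough sketch of this procedure, assuming k-means as the clustering algorithm and scikit-learn's Adjusted Rand Index (X and k below are placeholders; the answer above does not prescribe a specific implementation):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, k, n_runs=10, seed=0):
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        a, b, c = np.array_split(idx, 3)          # folds A, B and C
        labels_ab = KMeans(n_clusters=k, n_init=10).fit_predict(X[np.concatenate([a, b])])
        labels_ac = KMeans(n_clusters=k, n_init=10).fit_predict(X[np.concatenate([a, c])])
        # Both runs saw fold A; compare their labelings on A only.
        scores.append(adjusted_rand_score(labels_ab[:len(a)], labels_ac[:len(a)]))
    return float(np.mean(scores)), float(np.std(scores))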
As you can guess, this is a very compute-intensive evaluation method.
It is still an open research area to know whether or not this Stability-based evaluation of clustering algorithms is really useful in practice and to identify when it can fail to produce a valid criterion for model selection. Please refer to Clustering Stability: An Overview by Ulrike von Luxburg and references therein for an overview of the state of the art on those matters.
Note: it is important to use adjusted-for-chance metrics such as ARI or AMI if you want to use this strategy to select the best value of k in k-means, for instance. Non-adjusted metrics such as NMI and V-measure will tend to arbitrarily favor models with higher k.

A few implementation details for a Support-Vector Machine (SVM)

In a particular application I was in need of machine learning (I know only the things I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. It's working fine.
Now I need to improve the system. Problems here are
I get additional training examples every week. Right now the system starts training freshly with the updated examples (old examples + new examples). I want to make it incremental learning: use the previous knowledge (instead of the previous examples) together with the new examples to get a new model (knowledge).
Right now my training examples have 3 classes. So, every training example is fitted into one of these 3 classes. I want the functionality of an "Unknown" class. Anything that doesn't fit these 3 classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for it too.
Assuming, the "unknown" class is implemented. When class is "unknown" the user of the application inputs the what he thinks the class might be. Now, I need to incorporate the user input into the learning. I've no idea about how to do this too. Would it make any difference if the user inputs a new class (i.e.. a class that is not already in the training set)?
Do I need to choose a new algorithm, or can Support Vector Machines do this?
PS: I'm using libsvm implementation for SVM.
I just wrote my Answer using the same organization as your Question (1., 2., 3).
Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons of course can--because the subsequent training instances don't affect the basic network architecture, they'll just cause adjustment in the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, I don't know.
I think you can solve this problem quite easily by configuring LIBSVM in one-against-many--i.e., as a one-class classifier. SVMs are one-class classifiers; application of an SVM for multi-class means that it has been coded to perform multiple, step-wise one-against-many classifications, but again the algorithm is trained (and tested) one class at a time. If you do this, then what's left after step-wise execution against the test set, is "unknown"--in other words, whatever data is not classified after performing multiple, sequential one-class classifications, is by definition in that 'unknown' class.
Why not make the user's guess a feature (i.e., just another input variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column to your data matrix, "user class guess", and just populate it with some value most likely to have no effect for those data points not in the 'unknown' category and therefore for which the user will not offer a guess (this value could be '0' or '1', but really it depends on how you have your data scaled and normalized).
Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.
A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).
I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.
For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it should be a good substitute until a proper incremental version is implemented.
I also don't think SVMs can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample ends up being positively classified as something, even if that sample is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, randomly generate samples that exist outside of these ranges, and then add these to your training set.
For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set
[({'color':3},'unknown'),({'color':125},'unknown')]
to give your SVM an idea of what an "unknown" color means.
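A small sketch of that workaround (an illustrative helper, not from the original answer): record the per-feature ranges of the training data and sample points that fall outside them, to be labeled 'unknown'.
import numpy as np

def make_unknown_samples(X, n_samples=50, margin=0.1, seed=0):
    rng = np.random.RandomState(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = hi - lo
    # Sample from a slightly enlarged bounding box, then keep only points
    # that lie outside the box actually observed in the training data.
    candidates = rng.uniform(lo - margin * span, hi + margin * span,
                             size=(n_samples * 10, X.shape[1]))
    outside = np.any((candidates < lo) | (candidates > hi), axis=1)
    return candidates[outside][:n_samples]
The generated rows would then be appended to the training set with the label 'unknown'.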
There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (i.e. after every 100 new examples)?
You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and second most likely class would achieve this.
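A hedged sketch of that thresholding idea, using scikit-learn's SVC (which wraps libSVM) with probability estimates enabled; the 0.2 margin is an arbitrary placeholder you would need to tune.
import numpy as np
from sklearn.svm import SVC

def predict_with_unknown(clf, X, margin=0.2):
    proba = clf.predict_proba(X)               # class-membership probabilities
    top2 = np.sort(proba, axis=1)[:, -2:]      # second-best and best probability per sample
    labels = clf.classes_[proba.argmax(axis=1)].astype(object)
    labels[top2[:, 1] - top2[:, 0] < margin] = "unknown"
    return labels

# Usage with hypothetical data:
# clf = SVC(probability=True).fit(X_train, y_train)
# predictions = predict_with_unknown(clf, X_test)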
I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.
Even though this question is probably out of date, I feel obliged to give some additional thoughts.
Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)
Adding 'Unknown' as a class is not a good idea. Depending on its use, the reasons are different.
If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to class 'A') will probably cause the SVM to build a hyperplane with a very small margin, and it will poorly recognize future instances of A, i.e. its generalization performance will diminish. That's due to the fact that the SVM will try to build a hyperplane which separates most instances of A (those officially labeled as 'A') onto one side of the hyperplane and some instances (those officially labeled as 'Unknown') onto the other side.
Another problem occurs if you are using the 'Unknown' class to store all examples whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes are not known to the SVM, you may want to temporarily store them in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of their features. That will make it very hard to create good separating hyperplanes, and therefore the resulting classifier will poorly recognize new instances of D or E as 'Unknown'. Probably the classification of new instances belonging to A, B or C will be hindered as well.
To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.
I would recommend that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single webpage which shows an image of the object in question and a button for each known class. If the object in question belongs to a class which is not known yet, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and reference which example belongs to which class. I implemented an export function to make the data SVM-ready.)