Is the classic Q-learning algorithm, using lookup table (instead of function approximation), equivalent to dynamic programming?
From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)
The term dynamic programming (DP) refers to a collection of algorithms
that can be used to compute optimal policies given a perfect model of
the environment as a Markov decision process (MDP). Classical DP
algorithms are of limited utility in reinforcement learning both
because of their assumption of a perfect model and because of their
great computational expense, but they are still important
theoretically.
So, although both share the same working principles (whether tabular RL/DP or approximated RL/DP), the key difference between classic DP and classic RL is that the former assumes the model is known. This basically means knowing the transition probabilities (the probability of moving from state s to state s' given action a) and the expected immediate reward function.
RL methods, on the contrary, only require access to a set of samples, collected either online or offline (depending on the algorithm).
Of course, there are hybrid methods that sit between RL and DP, for example those that learn a model from the samples and then use that model in the learning process.
NOTE: the term Dynamic Programming, in addition to a set of mathematical optimization techniques related to RL, is also used to refer to a "general algorithmic pattern", as pointed out in a comment. In both cases the fundamentals are the same, but depending on the context the term may have a different meaning.
Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.
Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.
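For intuition, here is a minimal sketch of tabular value iteration, assuming the model is given as P[s][a] = list of (probability, next_state, reward) tuples; the representation and names are illustrative, not taken from any particular library.

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.95, theta=1e-8):
    # P[s][a] is a list of (prob, next_state, reward) tuples: the known model.
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            delta = max(delta, abs(max(q) - V[s]))
            V[s] = max(q)
        if delta < theta:
            break
    # Greedy policy extracted from the converged value function.
    policy = [max(range(n_actions),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy

Note how the model P is consulted in every backup; nothing is ever sampled from the environment.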
Q-Learning is a model-free reinforcement learning method. It is "model-free" not because it doesn't use a machine learning model or anything like that, but because it neither requires nor uses a model of the environment (i.e., the MDP's transition and reward functions) to obtain an optimal policy. There are also "model-based" methods. These, unlike Dynamic Programming methods, are based on learning a model rather than simply being given one; and, unlike model-free methods, they don't throw away samples after estimating values, but instead try to reconstruct the transition and reward functions to get better performance.
Model-based methods thus combine model-free and planning algorithms to get the same good results with fewer samples than model-free methods (Q-Learning) require, and without needing a model the way Dynamic Programming methods (Value/Policy Iteration) do.
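To make the "model-free" point concrete, here is a minimal sketch of tabular Q-learning; the env object with reset()/step() returning (next_state, reward, done) is an illustrative placeholder, not a specific library API.

import random
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done = env.step(a)  # sampled transition; no model of the MDP is used
            # Update toward the bootstrapped target r + gamma * max_a' Q(s2, a').
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2
    return Q

Unlike the value iteration sketch above, only sampled transitions (s, a, r, s') are ever used.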
If you use Q-learning in an offline setup, as in AlphaGo for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.
I am trying to wrap my head around the concept of probabilistic programming but the more I read, the more I feel confused.
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into a programming language for the creation of automated inference models?
I have some background in machine learning, and I remember that some machine learning models also output probabilities, and then I came across the term probabilistic machine learning...
Is there a difference between the two? Or are they something similar?
I'd appreciate anyone who could help clarify.
I guess there is some vagueness between the two terms but my take on them is the following:
Probabilistic Programming is expressing probabilistic models as computer programs that generate data (i.e. simulators).
Probabilistic Models + Programming = Probabilistic Programming
There is no restriction on what constitutes a probabilistic model (it may well be a neural network of some sort). Therefore, I view this term as:
More generic
More frequently used in an applied context (with relation to programming)
Probabilistic Machine Learning
is another flavour of ML which deals with the probabilistic aspects of predictions, e.g. the model does not treat input/output values as certain and/or point values, but instead treats them (or some of them) as random variables. A prominent example of such an approach is the Gaussian Process.
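As an illustration of what that looks like in practice, here is a minimal scikit-learn Gaussian Process sketch on made-up data; the prediction comes back as a mean and a standard deviation rather than a single point value.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1D regression data.
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(30)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X, y)

# The prediction is a distribution over outputs: mean and standard deviation.
mean, std = gp.predict(np.array([[5.0]]), return_std=True)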
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into a programming language for the creation of automated inference models?
That is correct. A probabilistic program can be viewed as equivalent to a Bayesian network, but is expressed in a much richer language. Probabilistic programming as a field proposes such representations, as well as algorithms that take advantage of those representations, because sometimes a richer representation makes the problem easier.
Consider for example a probabilistic program that models a disease that is more likely to afflict men:
N = 1000000;
for i = 1:N {
    male[i] ~ Bernoulli(0.5);
    disease[i] ~ if male[i] then Bernoulli(0.8) else Bernoulli(0.3);
}
This probabilistic program is equivalent to the following Bayesian network accompanied by the appropriate conditional probability tables:
For highly repetitive networks such as this, authors often use plate notation to make their depiction more succinct:
However, plate notation is a device for human-readable publications, not a formal language in the same sense that a programming language is. Also, for more complex models, plate notation could become harder to understand and maintain. Finally, a programming language brings other benefits, such as primitive operations that make it easier to express conditional probabilities.
So, is it just a matter of having a convenient representation? No, because a more abstract representation contains more high-level information that can be used to improve inference performance.
Suppose that we want to compute the probability distribution on the number of people among N individuals with the disease. A straightforward and generic Bayesian network algorithm would have to consider the large number of 2^N combinations of assignments to the disease variables in order to compute that answer.
The probabilistic program representation, however, explicitly indicates that the conditional probabilities for disease[i] and male[i] are identical for all i. An inference algorithm can exploit that: it computes the marginal probability of disease[i], which is identical for all i, notes that the number of diseased people therefore follows a binomial distribution B(N, P(disease[i])), and returns that as the desired answer, in time constant in N. It would also be able to provide an explanation of this conclusion that is more understandable and insightful to a user.
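To make the lifted shortcut explicit, here is a small sketch; the Binomial parameters follow directly from the program above, and the brute-force simulation is only a sanity check.

import numpy as np
from scipy.stats import binom

N = 1_000_000
p_disease = 0.5 * 0.8 + 0.5 * 0.3   # marginal P(disease[i]) = 0.55, identical for all i

# Lifted answer: the number of diseased people follows Binomial(N, 0.55),
# obtained in time constant in N.
count = binom(N, p_disease)
print(count.mean(), count.std())

# Brute-force sanity check: simulate the probabilistic program directly.
male = np.random.rand(N) < 0.5
disease = np.where(male, np.random.rand(N) < 0.8, np.random.rand(N) < 0.3)
print(disease.sum())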
One could argue that this comparison is unfair because a knowledgeable user would not pose the query as defined for the explicit O(N)-sized Bayesian network, but instead simplify the problem in advance by exploiting its simple structure. However, the user may not be knowledgeable enough for such a simplification, particularly for more complex cases, or may not have the time to do it, or may make a mistake, or may not know in advance what the model will be so she cannot simplify it manually like that. Probabilistic programming offers the possibility that such simplification be made automatically.
To be fair, most current probabilistic programming tools (such as JAGS and Stan) do not perform this more sophisticated mathematical reasoning (often called lifted probabilistic inference) and instead simply perform Markov Chain Monte Carlo (MCMC) sampling over the Bayesian network equivalent to the probabilistic program (but often without having to build the entire network in advance, which is also another possible gain). In any case, this convenience already more than justifies their use.
I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations of DP models (the sheer amount of time and space they require), which will hopefully reduce the computation time (similar research currently takes quite a few hours), while the reduced space requirements will allow adding more complexity to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities - or perhaps I've been reading the wrong books.
PS- Is there a way of knowing if the algorithm is properly implemented? First time working with SARSA.
The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the former assumes the environment's dynamics are known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be noted that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original
form (i.e., tabular form), DP and RL algorithms cannot be implemented
for general problems. They can only be implemented when the state and
action spaces consist of a finite number of discrete elements, because
(among other reasons) they require the exact representation of value
functions or policies, which is generally impossible for state spaces
with an infinite number of elements (or too costly when the number of
states is very high).
Even when the states and actions take finitely many values, the cost
of representing value functions and policies grows exponentially with
the number of state variables (and action variables, for Q-functions).
This problem is called the curse of dimensionality, and makes the
classical DP and RL algorithms impractical when there are many state
and action variables. To cope with these problems, versions of the
classical algorithms that approximately represent value functions
and/or policies must be used. Since most problems of practical
interest have large or continuous state and action spaces,
approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on either DP or RL. In the RL case, the transition probabilities are simply used to sample the next state given an action.
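For example, a minimal sketch of that use of the model (the array layout P[s, a, s'] for transition probabilities and R[s, a] for expected rewards is an assumption for illustration):

import numpy as np

def step(P, R, s, a):
    # The known model simply plays the role of the environment: SARSA itself
    # only ever sees the sampled transition (s, a, r, s'), never P directly.
    n_states = P.shape[-1]
    s_next = np.random.choice(n_states, p=P[s, a])
    return s_next, R[s, a]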
Is it better to use DP or RL? Actually, I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems safer, but maybe a big part of your state space is irrelevant for finding an optimal policy. In such a case, sampling a set of trajectories (RL) may be computationally more effective. In any case, if both methods are applied correctly, they should achieve a similar solution.
NOTE: when employing function approximation, the convergence properties are more fragile and it is not rare to diverge during the iteration process, especially when a nonlinear approximator (such as an artificial neural network) is combined with RL.
If you have access to the transition probabilities, I would suggest not using methods based on a Q-value. This would require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.
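For reference, a minimal sketch of modified policy iteration, again assuming transition probabilities P[s, a, s'] and expected rewards R[s, a] as NumPy arrays; this illustrates the idea, it is not a drop-in implementation for your problem.

import numpy as np

def modified_policy_iteration(P, R, gamma=0.95, m=10, max_iters=100):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        # Partial policy evaluation: m backup sweeps instead of an exact solve.
        for _ in range(m):
            P_pi = P[np.arange(n_states), policy]   # (n_states, n_states)
            R_pi = R[np.arange(n_states), policy]   # (n_states,)
            V = R_pi + gamma * P_pi @ V
        # Greedy policy improvement using the known model.
        Q = R + gamma * np.einsum('sat,t->sa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, V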
I'm trying to find an optimal policy in an environment with continuous states (dim. = 20) and discrete actions (3 possible actions). And there is a specific issue: for the optimal policy, one action (call it "action 0") should be chosen much more frequently than the other two (~100 times more often; these two actions are more risky).
I've tried Q-learning with NN value-function approximation. The results were rather bad: the NN always learns to choose "action 0". I think that policy gradient methods (on the NN weights) may help, but I don't understand how to use them with discrete actions.
Could you give some advice on what to try? (maybe algorithms, papers to read)
What are the state-of-the-art RL algorithms when state space is continuous and action space is discrete?
Thanks.
Applying Q-learning in continuous (state and/or action) spaces is not a trivial task. This is especially true when trying to combine Q-learning with a global function approximator such as a NN (I understand that you refer to the common multilayer perceptron and the backpropagation algorithm). You can read more on Rich Sutton's page. A better (or at least easier) solution is to use local approximators such as Radial Basis Function networks (there is a good explanation of why in Section 4.1 of this paper).
On the other hand, the dimensionality of your state space may be too high for local approximators. Thus, my recommendation is to use other algorithms instead of Q-learning. A very competitive algorithm for continuous states and discrete actions is Fitted Q Iteration, which is usually combined with tree-based methods to approximate the Q-function.
Finally, a common practice when the number of actions is low, as in your case, is to use an independent approximator for each action, i.e., instead of a unique approximator that takes the state-action pair as input and returns a Q-value, use three approximators, one per action, that take only the state as input. You can find an example of this in Example 3.1 of the book Reinforcement Learning and Dynamic Programming Using Function Approximators.
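A rough sketch of that per-action idea, here combined with Fitted Q Iteration and scikit-learn extra-trees regressors; the transition arrays are placeholders assumed to be collected beforehand, and every action is assumed to appear in the batch.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(states, actions, rewards, next_states, dones,
                       n_actions=3, gamma=0.99, n_sweeps=30):
    # One regressor per discrete action, each mapping state -> Q(state, action).
    q_models = [ExtraTreesRegressor(n_estimators=50) for _ in range(n_actions)]
    targets = rewards.copy()               # first sweep: Q is just the immediate reward
    for _ in range(n_sweeps):
        for a in range(n_actions):
            mask = actions == a
            q_models[a].fit(states[mask], targets[mask])
        # Recompute bootstrapped targets with the freshly fitted approximators.
        next_q = np.column_stack([m.predict(next_states) for m in q_models])
        targets = rewards + gamma * (1 - dones) * next_q.max(axis=1)
    return q_models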
Is there an objective way to validate the output of a clustering algorithm?
I'm using scikit-learn's affinity propagation clustering against a dataset composed of objects with many attributes. The difference matrix supplied to the clustering algorithm is composed of the weighted difference of these attributes. I'm looking for a way to objectively validate tweaks in the distance weightings as reflected in the resulting clusters. The dataset is large and has enough attributes that manual examination of small examples is not a reasonable way to verify the produced clusters.
Yes:
Give the clusters to a domain expert, and have him analyze if the structure the algorithm found is sensible. Not so much if it is new, but if it is sensible.
... and No:
There is no automatic evaluation available that is fair, in the sense that it takes the objective of unsupervised clustering into account: knowledge discovery, a.k.a. learning something new about your data.
There are two common ways of evaluating clusterings automatically:
internal cohesion. I.e. there is some particular property, such as in-cluster variance compared to between-cluster variance, to minimize. The problem is that it's usually fairly trivial to cheat, i.e. to construct a trivial solution that scores really well. So this method must not be used to compare methods based on different assumptions. You can't even fairly compare different types of linkage for hierarchical clustering.
external evaluation. You use a labeled data set, and score algorithms by how well they rediscover existing knowledge. Sometimes this works quite well, so it is an accepted state of the art for evaluation. Yet, any supervised or semi-supervised method will of course score much better on this. As such, it is A) biased towards supervised methods, and B) actually goes completely against the knowledge-discovery idea of finding something you did not yet know.
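Both kinds of scores are single calls in scikit-learn, for example (toy data purely for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, labels_true = make_blobs(n_samples=500, centers=4, random_state=0)
labels_pred = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

internal = silhouette_score(X, labels_pred)               # internal cohesion/separation
external = adjusted_rand_score(labels_true, labels_pred)  # agreement with known labels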
If you really mean to use clustering - i.e. learn something about your data - you will at some point have to inspect the clusters, preferably by a completely independent method such as a domain expert. If he can tell you that e.g. the user group identified by the clustering is a non-trivial group not yet investigated closely, then you are a winner.
However, most people want to have a "one click" (and one-score) evaluation, unfortunately.
Oh, and "clustering" is not really a machine learning task. There actually is no learning involved. To the machine learning community, it is the ugly duckling that nobody cares about.
There is another way to evaluate the clustering quality by computing a stability metric on subfolds, a bit like cross validation for supervised models:
Split the dataset in 3 folds A, B and C. Compute two clusterings with your algorithm on A+B and A+C. Compute the Adjusted Rand Index or Adjusted Mutual Information of the 2 labelings on their intersection A and consider this value as an estimate of the stability score of the algorithm.
Rinse and repeat by shuffling the data, splitting it into 3 other folds A', B' and C', and recomputing a stability score.
Average the stability scores over 5 or 10 runs to have a rough estimate of the standard error of the stability score.
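A rough sketch of this procedure, using KMeans as a stand-in for the clustering algorithm (the procedure itself is algorithm-agnostic):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, n_clusters, n_runs=10, seed=0):
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_runs):
        idx = rng.permutation(len(X))
        A, B, C = np.array_split(idx, 3)          # three folds
        km_ab = KMeans(n_clusters=n_clusters, n_init=10).fit(X[np.concatenate([A, B])])
        km_ac = KMeans(n_clusters=n_clusters, n_init=10).fit(X[np.concatenate([A, C])])
        # Agreement of the two clusterings on their common part A.
        scores.append(adjusted_rand_score(km_ab.predict(X[A]), km_ac.predict(X[A])))
    return np.mean(scores), np.std(scores)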
As you can guess, this is a very computationally intensive evaluation method.
It is still an open research area to know whether or not this Stability-based evaluation of clustering algorithms is really useful in practice and to identify when it can fail to produce a valid criterion for model selection. Please refer to Clustering Stability: An Overview by Ulrike von Luxburg and references therein for an overview of the state of the art on those matters.
Note: it is important to use Adjusted for Chance metrics such as ARI or AMI if you want to use this strategy to select the best value of k in k-means for instance. Non adjusted metrics such as NMI and V-measure will tend to favor models with higher k arbitrarily.
I would like to use a supervised machine learning algorithm to predict a binary function (true or false) for a set of sentences based on the presence or absence of words in the sentences.
Ideally, I would like to avoid having to hardcode the set of words used to decide on the output so that the algorithm automatically learns which words are (together ?) most likely to trigger specific outputs.
http://shop.oreilly.com/product/9780596529321.do (Programming Collective Intelligence) has a nice section in chapter 4 titled "Learning From Clicks" which describes how to do this by using 1 layer of hidden nodes in a neural network, with one new hidden node for each new combination of input words.
Similarly, it is possible to create a feature for each word in the training data set and train pretty much any classic machine learning algorithm using these features. Adding new training data will generate new features which will require me to re-train the algorithm from scratch.
Which brings me to my questions:
is it actually a problem if I have to retrain everything from scratch whenever the training data set is extended ?
what kind of algorithm would more experienced machine learning users recommend for this kind of problem?
what criteria should I use in picking an algorithm versus another ? (other than actually trying them all and see which perform better with precision/recall metrics)
if you have worked on similar problems, what about extending the features with 2-grams (1 if a specific 2-gram is present, 0 if not) ? 3-grams ?
You could look into the general area of topic modelling if you want to find words which are generally found together.
The simplest approach would be to use latent semantic analysis ( http://en.wikipedia.org/wiki/Latent_semantic_analysis ), which is just applying SVD to a term-document matrix. You'd then need to do some additional post-hoc analysis to fit this to your particular outcome.
A more involved, and much more complex approach would be to use latent dirichlet allocation ( http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation )
In terms of just adding new features (words), that is fine as long as you are going to retrain. You can also use TF/IDF to give a particular word a value when building the matrix (instead of just a 1 or 0).
I don't know what programming language you are trying to do this in, but I know there are libraries out there in Java and Python that do all of the above.
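For example, in Python with scikit-learn, the word/TF-IDF feature route described above could be sketched as follows (the sentences and labels are placeholders); retraining after the data set is extended is just a matter of calling fit again on the larger corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["the delivery was fast", "package never arrived", "great service"]
labels = [True, False, True]                     # the binary target

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),         # unigram + bigram (2-gram) features
    LogisticRegression(),
)
model.fit(sentences, labels)
print(model.predict(["arrived fast"]))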