Why is RL called 'reinforcement' learning? - machine-learning

I understand why machine learning is named as such, and likewise the nomenclature behind supervised and unsupervised learning. So what is "reinforced" about reinforcement learning?

The “reinforcement” in reinforcement learning refers to how certain behaviors are encouraged and others discouraged. Behaviors are reinforced through rewards gained from experience with the environment.

Modern reinforcement learning is built upon two main threads. One thread concerns learning by trial and error and originated in the psychology of animal learning. The second thread concerns the problem of optimal control and its solution using value functions and dynamic programming (Sutton and Barto, 2018).
Reinforcement learning borrowed its name from the first thread of studies. According to Watkins (1989), when studying an animal's ability to learn, the animal may be automatically provided with reinforcers. In behavioral terms, a positive reinforcer might be a morsel of food for a hungry animal, for instance, or a sip of water for a thirsty animal. Conversely, a negative reinforcer might be an electric shock.
PS. Watkins proposed the Q-learning algorithm.
Edit: (Added more history)
According to Sutton and Barto (2018): "The term “reinforcement” in the context of animal learning came into use well after Thorndike’s expression of the Law of Effect, first appearing in this context (to the best of our knowledge) in the 1927 English translation of Pavlov’s monograph on conditioned reflexes. Pavlov described reinforcement as the strengthening of a pattern of behavior due to an animal receiving a stimulus—a reinforcer—in an appropriate temporal relationship with another stimulus or with a response."
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
Thorndike, E. L. Animal Intelligence. Hafner, Darien, CT, 1911.
Watkins, Christopher J. C. H. Learning from Delayed Rewards. PhD thesis, University of Cambridge, 1989.

In reinforcement learning, behavior is reinforced through trial and error. Outcomes that are incorrect (or less than optimal) do not need to be manually corrected; instead, the focus is on exploration, and feedback (reinforcement) is obtained from those same experiences.

Related

When is a task considered few-shot learning?

When reading about few-shot learning, I can never seem to find an exact definition. When the concept is explained, it is often done by saying something along the lines of 'using few data samples'.
Is there a precise definition of few-shot learning, or when a task is considered few-shot learning? When the term 'N-way-K-shot learning' is used, are there any boundaries on which values N and K can have?
The idea behind few-shot learning is to train a classifier using only a small number of labelled samples. More precisely, given N classes and k labelled samples per class, the aim is to train the classifier using only m samples per class, where m << k. Few-shot learning is a popular option when the number of available labelled samples is limited.
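As an illustration of the N-way/K-shot structure, here is a small, hypothetical sketch of sampling one few-shot "episode" from a labelled dataset (the function and variable names are made up for illustration and are not taken from any particular framework):

import random
from collections import defaultdict

def sample_episode(samples, labels, n_way=5, k_shot=1, n_query=5):
    # Group the labelled samples by class.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    # Pick n_way classes, and for each one k_shot support and n_query query examples.
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = random.sample(by_class[c], k_shot + n_query)
        support += [(x, c) for x in picked[:k_shot]]
        query += [(x, c) for x in picked[k_shot:]]
    return support, query  # learn from the support set, evaluate on the query set

A 5-way-1-shot task, for example, gives the classifier five classes and a single labelled example of each to learn from.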
However, when it comes to implementing few-shot learning, one popular method is to use contrastive learning to fine-tune an existing model to learn the similarities between samples belonging to the same class.
SetFit is a few-shot learning framework that uses contrastive learning to fine-tune sentence transformers for text classification. I suggest you read SetFit's paper (available on arXiv here: https://arxiv.org/abs/2209.11055). I believe it has the technical details you need to answer your question. Moreover, SetFit's implementation is available on GitHub (here: https://github.com/huggingface/setfit).
I hope this helps.

What is the difference between probabilistic programming vs. probabilistic machine learning?

I am trying to wrap my head around the concept of probabilistic programming but the more I read, the more I feel confused.
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into a programming language for the creation of automated inference models?
I have some background in machine learning, and I remember that some machine learning models also output probabilities; then I came across the term probabilistic machine learning...
Is there a difference between the two? Or are they something similar?
Appreciate anyone that could help clarify.
I guess there is some vagueness between the two terms but my take on them is the following:
Probabilistic Programming is expressing probabilistic models as computer programs that generate data (i.e. simulators).
Probabilistic Models + Programming = Probabilistic Programming
There is no restriction on what comprises a probabilistic model (it may well be a neural network of some sort). Therefore, I view this term as:
More generic
More frequently used in an applied context (with relation to programming)
Probabilistic Machine Learning
is another flavour of ML which deals with the probabilistic aspects of predictions, e.g. the model does not treat input/output values as certain and/or point values, but instead treats them (or some of them) as random variables. A prominent example of such an approach is the Gaussian process.
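To make that last point concrete, here is a tiny sketch (using scikit-learn and toy data of my own choosing, purely as an illustration) of a model whose prediction is a distribution rather than a point value:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# A handful of noiseless observations of a toy function.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.sin(X).ravel()

gp = GaussianProcessRegressor().fit(X, y)

# The prediction at a new input comes with an uncertainty estimate:
mean, std = gp.predict([[1.5]], return_std=True)
print(mean, std)  # a mean close to sin(1.5), plus a standard deviation

The output is effectively a random variable (a Gaussian with that mean and standard deviation), which is the sense in which such models are "probabilistic".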
My understanding at this point in time is that probabilistic programming is similar to Bayesian networks, just translated into a programming language for the creation of automated inference models?
That is correct. A probabilistic program can be viewed as equivalent to a Bayesian network, but is expressed in a much richer language. Probabilistic programming as a field proposes such representations, as well as algorithms that take advantage of those representations, because sometimes a richer representation makes the problem easier.
Consider for example a probabilistic program that models a disease that is more likely to afflict men:
N = 1000000;
for i = 1:N {
    male[i] ~ Bernoulli(0.5);
    disease[i] ~ if male[i] then Bernoulli(0.8) else Bernoulli(0.3);
}
This probabilistic program is equivalent to a Bayesian network with one male[i] node and one disease[i] node per individual, accompanied by the appropriate conditional probability tables. For highly repetitive networks such as this, authors often use plate notation to make the depiction more succinct.
However, plate notation is a device for human-readable publications, not a formal language in the same sense that a programming language is. Also, for more complex models, plate notation could become harder to understand and maintain. Finally, a programming language brings other benefits, such as primitive operations that make it easier to express conditional probabilities.
So, is it just a matter of having a convenient representation? No, because a more abstract representation contains more high-level information that can be used to improve inference performance.
Suppose that we want to compute the probability distribution on the number of people among N individuals with the disease. A straightforward and generic Bayesian network algorithm would have to consider the large number of 2^N combinations of assignments to the disease variables in order to compute that answer.
The probabilistic program representation, however, explicitly indicates that the conditional probabilities for disease[i] and male[i] are identical for all i. An inference algorithm can exploit this: it computes the marginal probability of disease[i] (which is identical for all i), uses the fact that the number of diseased people therefore follows a binomial distribution B(N, P(disease[i])), and provides that as the desired answer, in time constant in N. It would also be able to provide an explanation of this conclusion that is more understandable and insightful to a user.
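Here is a minimal sketch of that shortcut in plain Python/NumPy (not an actual probabilistic programming system; the numbers follow the toy program above):

import numpy as np
from scipy.stats import binom

N = 1_000_000
# Marginal P(disease[i]) is the same for every i: 0.5 * 0.8 + 0.5 * 0.3 = 0.55.
p_disease = 0.5 * 0.8 + 0.5 * 0.3

# Lifted answer in time constant in N: the diseased count is Binomial(N, 0.55).
count_dist = binom(N, p_disease)
print("expected number of diseased people:", count_dist.mean())   # 550000.0
print("P(count <= 551000):", count_dist.cdf(551_000))

# Sanity check by brute-force simulation of the original program (O(N) per sample).
rng = np.random.default_rng(0)
male = rng.random(N) < 0.5
disease = rng.random(N) < np.where(male, 0.8, 0.3)
print("one simulated count:", disease.sum())

No enumeration over the 2^N assignments to the disease variables is needed.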
One could argue that this comparison is unfair because a knowledgeable user would not pose the query as defined for the explicit O(N)-sized Bayesian network, but instead simplify the problem in advance by exploiting its simple structure. However, the user may not be knowledgeable enough for such a simplification, particularly for more complex cases, or may not have the time to do it, or may make a mistake, or may not know in advance what the model will be so she cannot simplify it manually like that. Probabilistic programming offers the possibility that such simplification be made automatically.
To be fair, most current probabilistic programming tools (such as JAGS and Stan) do not perform this more sophisticated mathematical reasoning (often called lifted probabilistic inference) and instead simply perform Markov Chain Monte Carlo (MCMC) sampling over the Bayesian network equivalent to the probabilistic program (but often without having to build the entire network in advance, which is also another possible gain). In any case, this convenience already more than justifies their use.

Ensemble Learning in Unsupervised Learning

I have a question regarding the current literature in ensemble learning (more specifically in unsupervised learning).
From what I have read in the literature, ensemble learning applied to unsupervised learning basically comes down to clustering problems. However, if I have x unsupervised methods that each output a score (similar to a regression problem), is there an approach that can combine these results into a single one?
On evaluation of outlier rankings and outlier scores. Schubert, E., Wojdanowski, R., Zimek, A., & Kriegel, H. P. (2012, April). In Proceedings of the 2012 SIAM International Conference on Data Mining (pp. 1047-1058). Society for Industrial and Applied Mathematics.
In this publication, we do not "just normalize" outlier scores; we also suggest an unsupervised ensemble member selection strategy called the "greedy ensemble".
However, normalization is crucial, and difficult. We published some of the earlier progress with respect to score normalization as
Interpreting and unifying outlier scores. Kriegel, H. P., Kroger, P., Schubert, E., & Zimek, A. (2011, April). In Proceedings of the 2011 SIAM International Conference on Data Mining (pp. 13-24). Society for Industrial and Applied Mathematics.
If you don't normalize your scores (and min-max scaling is not enough), you will usually not be able to combine them in a meaningful way, except with very strong preconditions. Even two different subspaces will usually yield incomparable values because of having a different number of features, and different feature scales.
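As a purely illustrative sketch (this is not the normalization method from the papers above, just a simple rank-based stand-in to show why a common scale is needed before combining):

import numpy as np

def normalize_scores(scores):
    # Map raw outlier scores to [0, 1] via their empirical ranks, so that
    # detectors with very different raw scales become comparable.
    ranks = np.argsort(np.argsort(scores))
    return ranks / (len(scores) - 1)

def combine(score_lists):
    # Average the normalized scores of several detectors into one ensemble score.
    normalized = np.vstack([normalize_scores(np.asarray(s, dtype=float)) for s in score_lists])
    return normalized.mean(axis=0)

# Two detectors with incomparable raw scales (e.g., a kNN distance and an LOF-like score):
scores_a = [0.10, 0.20, 0.15, 5.0]
scores_b = [1.01, 1.05, 0.99, 3.2]
print(combine([scores_a, scores_b]))  # the fourth point stands out in both detectors

Averaging the raw scores directly would let the detector with the larger numeric range dominate, which is the comparability problem the cited papers address much more carefully.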
There is also some work on semi-supervised ensembles, e.g.
Learning Outlier Ensembles: The Best of Both Worlds—Supervised and Unsupervised. Micenková, B., McWilliams, B., & Assent, I. (2014). In Proceedings of the ACM SIGKDD 2014 Workshop on Outlier Detection and Description under Data Diversity (ODD2). New York, NY, USA (pp. 51-54).
Also beware of overfitting. It's quite easy to arrive at a single good result by tweaking parameters and repeated evaluation, but this leaks evaluation information into your experiment, i.e. you tend to overfit. Performing well across a large range of parameters and data sets is very hard. One of the key observations of the following study was that for every algorithm you'll find at least one data set and parameter set where it "outperforms" the others; but if you change the parameters a little, or use a different data set, the benefits of the "superior" new methods are not reproducible.
On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Campos, G. O., Zimek, A., Sander, J., Campello, R. J., Micenková, B., Schubert, E., ... & Houle, M. E. (2016). Data Mining and Knowledge Discovery, 30(4), 891-927.
So you will have to work really hard to do a reliable evaluation. Be careful about how you choose parameters.

Q-learning vs dynamic programming

Is the classic Q-learning algorithm, using a lookup table (instead of function approximation), equivalent to dynamic programming?
From Sutton & Barto's book (Reinforcement Learning: An Introduction, chapter 4)
The term dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP). Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically.
So, although both share the same working principles (whether using tabular RL/DP or approximate RL/DP), the key difference between classic DP and classic RL is that the former assumes the model is known. This basically means knowing the transition probabilities (which indicate the probability of moving from state s to state s' given action a) and the expected immediate reward function.
On the contrary, RL methods only require access to a set of samples, collected either online or offline (depending on the algorithm).
Of course, there are hybrid methods that can be placed between RL and DP, for example those that learn a model from the samples and then use that model in the learning process.
NOTE: the term dynamic programming, in addition to denoting a set of mathematical optimization techniques related to RL, is also used to refer to a "general algorithmic pattern", as pointed out in a comment. In both cases the fundamentals are the same, but depending on the context the term may have a different meaning.
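To make the difference concrete, here is a minimal sketch on a made-up three-state MDP (the MDP itself and all numbers are assumptions for illustration only): value iteration uses the full model (P and R), while tabular Q-learning only ever sees sampled transitions from it.

import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# A made-up known model: transition probabilities P[s, a, s'] and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

# Dynamic programming (value iteration): uses P and R directly.
V = np.zeros(n_states)
for _ in range(1000):
    V = np.max(R + gamma * P @ V, axis=1)

# Tabular Q-learning: only uses sampled transitions (s, a, r, s') from the same environment.
Q = np.zeros((n_states, n_actions))
alpha, epsilon = 0.1, 0.2
s = 0
for _ in range(200_000):
    a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next = rng.choice(n_states, p=P[s, a])   # the agent only experiences this transition
    r = R[s, a]
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print("V from value iteration:", V)
print("V implied by Q-learning:", Q.max(axis=1))  # should be close to the values above

With enough samples and exploration the two value estimates come close, which is the sense in which they share the same working principles while differing in whether the model is given or only sampled from.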
Dynamic Programming is an umbrella encompassing many algorithms. Q-Learning is a specific algorithm. So, no, it is not the same.
Also, if you mean Dynamic Programming as in Value Iteration or Policy Iteration, still not the same. These algorithms are "planning" methods. You have to give them a transition and a reward function and they will iteratively compute a value function and an optimal policy.
Q-Learning is a model-free reinforcement learning method. It is called "model-free" not because it doesn't use a machine learning model or anything like that, but because it doesn't require, and doesn't use, a model of the environment (i.e. the MDP's transition and reward functions) to obtain an optimal policy. There are also "model-based" methods. These, unlike dynamic programming methods, are based on learning a model, not simply on being given one; and, unlike model-free methods, they don't throw away samples after estimating values, but instead try to reconstruct the transition and reward functions to get better performance.
Model-based methods combine model-free and planning algorithms to get the same good results with fewer samples than model-free methods (Q-Learning) require, and without needing a model to be given, as dynamic programming methods (Value/Policy Iteration) do.
If you use Q-learning in an offline setup, like AlphaGo, for example, then it is equivalent to dynamic programming. The difference is that it can also be used in an online setup.

(i) Supervised Learning, (ii) Unsupervised Learning, (iii) Reinforcement Learning

I am new to machine learning. While reading about supervised learning, unsupervised learning and reinforcement learning I came across the question below and got confused. Please help me identify which of the three scenarios below is supervised learning, unsupervised learning, or reinforcement learning.
What types of learning, if any, best describe the following three scenarios:
(i) A coin classification system is created for a vending machine. In order to do this, the developers obtain exact coin specifications from the U.S. Mint and derive a statistical model of the size, weight, and denomination, which the vending machine then uses to classify its coins.
(ii) Instead of calling the U.S. Mint to obtain coin information, an algorithm is presented with a large set of labeled coins. The algorithm uses this data to infer decision boundaries, which the vending machine then uses to classify its coins.
(iii) A computer develops a strategy for playing Tic-Tac-Toe by playing repeatedly and adjusting its strategy by penalizing moves that eventually lead to losing.
(i) unsupervised learning, as no labelled data is available;
(ii) supervised learning, as you already have labelled data available;
(iii) reinforcement learning, where you learn and relearn based on actions and the effects/rewards of those actions.
Let's say you have a dataset represented as a matrix X. Each row in X is an observation (instance) and each column represents a particular variable (feature).
If you also have (and use) a vector y of labels corresponding to the observations, then this is a supervised learning task. There's a "supervisor" involved who says which observations belong to class #1, which to class #2, etc.
If you don't have labels for the observations, then you have to make decisions based on the X dataset itself. For example, in the coin example you may want to build a model of the normal distribution of coin parameters and create a system that signals when a coin has unusual parameters (and thus may be an attempted fraud). In this case you don't have any kind of supervisor who would say which coins are ok and which represent a fraud attempt. Thus, it is an unsupervised learning task.
In the two previous examples you first trained your model and then used it, without any further changes to the model. In reinforcement learning the model is continuously improved based on the processed data and its results. For example, a robot that seeks to find its way from point A to point B may first compute the parameters of a move, then move based on these parameters, then analyze its new position and update the move parameters so that the next move is more accurate (repeating until it reaches point B).
Based on this, I'm pretty sure you will be able to find correspondence between these 3 kinds of learning and your items.
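To make the X / y distinction concrete, here is a small illustrative sketch (scikit-learn, with made-up coin measurements; the numbers are assumptions, not actual U.S. Mint specifications):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
# X: each row is one coin observation (size in mm, weight in g); y: its denomination label.
X = np.vstack([rng.normal([19.0, 2.5], 0.1, (50, 2)),   # hypothetical "penny"-like coins
               rng.normal([24.0, 5.7], 0.1, (50, 2))])  # hypothetical "quarter"-like coins
y = np.array([0] * 50 + [1] * 50)

# Supervised: a "supervisor" provides y, and we learn a decision boundary (scenario ii).
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([[19.1, 2.4]]))   # -> [0], classified as the first denomination

# Unsupervised: no labels, just a model of what "normal" coins look like; coins with
# unusual parameters are flagged as possible fraud (the example discussed above).
detector = EllipticEnvelope(contamination=0.01).fit(X)
print(detector.predict([[30.0, 9.0]]))   # -> [-1], an outlier / suspicious coin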
I wrote an article on the perceptron for novices, in which I explain supervised learning in detail with the delta rule, and also describe unsupervised learning and reinforcement learning briefly. You may check it out if you are interested:
"An Intuitive Example of Artificial Neural Network (Perceptron) Detecting Cars / Pedestrians from a Self-driven Car"
https://www.spicelogic.com/Blog/Perceptron-Artificial-Neural-Networks-10

Resources