Stochastic state transitions in MDP: How does Q-learning estimate that? - machine-learning

I am implementing Q-learning on a grid-world to find the optimal policy. One thing that is bugging me is that the state transitions are stochastic. For example, if I am in state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm? As I have read so far, Q-learning is "model-free" learning: it does not need to know the state-transition probabilities. I am not convinced how the algorithm will automatically account for these transition probabilities during the training process. If someone can clear things up, I shall be grateful.

Let's look at what Q-learning guarantees to see why it handles transition probabilities.
Let's call q* the optimal action-value function. This is the function that returns the correct value of taking some action in some state. The value of a state-action pair is the expected cumulative reward of taking the action, then following an optimal policy afterwards. An optimal policy is simply a way of choosing actions that achieves the maximum possible expected cumulative reward. Once we have q*, it's easy to find an optimal policy: from each state s you find yourself in, greedily choose the action a that maximizes q*(s,a). Q-learning learns q*, given that it visits all states and actions infinitely many times.
For example, if I am in the state (3,2) and take an action 'north', I would land-up at (3,1) with probability 0.8, to (2,2) with probability 0.1 and to (4,2) with probability 0.1. How do I fit this information in the algorithm?
Because the algorithm visits all states and actions infinitely many times, averaging q-values, it learns an expectation of the value of attempting to go north. We go north so many times that the value converges to the sum of each possible outcome weighted by its transition probability. Say we knew all of the values on the gridworld except for the value of going north from (3,2), and assume there is no reward for any transition from (3,2). After sampling north from (3,2) infinitely many times, the algorithm converges to the value 0.8 * q(3,1) + 0.1 * q(2,2) + 0.1 * q(4,2). With this value in hand, a greedy action choice from (3,2) will now be properly informed of the true expected value of attempting to go north. The transition probability gets baked right into the value!
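To make this concrete, here is a minimal sketch of a tabular Q-learning agent for a grid-world like the one described. The action set, hyperparameters, and the hypothetical step(state, action) environment function mentioned in the comments are illustrative assumptions, not part of the original question:

import random
from collections import defaultdict

# A minimal sketch of tabular Q-learning on a stochastic grid-world.
# A hypothetical step(state, action) environment function would sample
# the next state and reward according to the (unknown) transition probabilities.

alpha = 0.1      # learning rate
gamma = 0.9      # discount factor
epsilon = 0.1    # exploration rate
actions = ['north', 'south', 'east', 'west']
Q = defaultdict(float)   # Q[(state, action)] -> estimated value

def choose_action(state):
    # epsilon-greedy: explore sometimes, otherwise act greedily w.r.t. Q
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # The sampled next_state already reflects the stochastic transition,
    # so averaging many such updates converges to the expectation over the
    # 0.8 / 0.1 / 0.1 outcomes without ever knowing those probabilities.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Notice that the update only ever uses the sampled next_state; the 0.8/0.1/0.1 probabilities never appear in the code, yet repeated updates make the value of 'north' from (3,2) converge to the probability-weighted value described above.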

Related

How is the cost function curve actually calculated for gradient descent, i.e. how many times does the model choose the weights randomly?

As far as I know, to calculate the weights and bias for simple linear regression, we follow the gradient descent algorithm, which finds the global minimum of the (convex) cost function. That cost function is calculated by randomly choosing a set of weights and then calculating the mean error over all the records. In that way we get one point on the cost curve. Then another set of weights is chosen and the mean error is calculated again, so all these points make up the cost curve.
My doubt is: how many times are the weights randomly chosen to get these points before the cost curve is determined?
Thanks in advance.
The gradient descent algorithm iterates until convergence.
By convergence we mean that the global minimum of the convex cost function has been found.
There are basically two ways people check for convergence.
Automatic convergence test: declare convergence if the cost function decreases by less than e in an iteration, where e is some small value such as 10^-3. However, it is difficult to choose this threshold value in practice.
Plot the cost function against iterations: plotting the cost function against the iteration count gives you a fair idea about convergence. It can also be used for debugging (the cost function must decrease on every iteration).
For example, from such a plot you might deduce that roughly 300-400 iterations of gradient descent are needed.
This also lets you compare different learning rates (alpha) against iterations.
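As a rough illustration of both convergence checks, here is a sketch of gradient descent for simple linear regression that records the cost at every iteration; the learning rate, tolerance, and loop structure are placeholder choices, not a prescribed recipe:

import numpy as np

# A minimal sketch of gradient descent for simple linear regression,
# recording the cost at every iteration so it can be plotted afterwards.
def gradient_descent(x, y, alpha=0.01, tol=1e-3, max_iters=10000):
    w, b = 0.0, 0.0
    costs = []
    for i in range(max_iters):
        pred = w * x + b
        cost = np.mean((pred - y) ** 2)          # mean squared error
        costs.append(cost)
        # automatic convergence test: stop when the improvement is below tol
        if i > 0 and costs[-2] - costs[-1] < tol:
            break
        # gradients of the MSE cost w.r.t. w and b
        dw = 2 * np.mean((pred - y) * x)
        db = 2 * np.mean(pred - y)
        w -= alpha * dw
        b -= alpha * db
    return w, b, costs

# Plotting `costs` against the iteration index gives the cost-vs-iterations curve
# used to judge convergence and to compare learning rates.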

Relationship between the Bellman optimality equation and Q-learning

The optimal state-action value given by the Bellman optimality equation (page 63 of Sutton & Barto 2018) is
q*(s,a) = sum over s',r of p(s',r|s,a) * [r + gamma * max_a' q*(s',a')]
and the Q-learning update is
Q(S_t,A_t) <- Q(S_t,A_t) + alpha * [R_{t+1} + gamma * max_a Q(S_{t+1},a) - Q(S_t,A_t)]
I know that Q-learning is model-free, so it doesn't need the transition probabilities for the next state.
However, p(s',r|s,a) in the Bellman equation is the probability of transitioning to the next state s' with reward r when s and a are given, so I think that to get Q(s,a) it needs the transition probabilities.
Are the Q of the Bellman equation and the Q of Q-learning different?
If they are the same, how can Q-learning work model-free?
Is there any way to get Q(s,a) regardless of the transition probabilities in Q-learning?
Or am I confusing something?
Q-learning is an instance of the Bellman equation applied to a state-action value function. It is "model-free" in the sense that you don't need a transition function that determines, for a given decision, which state is next.
However, there are several formulations of Q-Learning which differ in the information that is known. In particular, when you know the transition function, you can and should use it in your Bellman equation. This results in the equation you cited.
On the other hand, if you don't know the transition function, Q-learning works as well, but you have to sample the impact of the transition function through simulations.
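To illustrate the difference, here is a small sketch in which the three next states, their 0.8/0.1/0.1 probabilities, and their values are made up purely for illustration. It compares the model-based Bellman backup, which uses p(s',r|s,a) directly, with the model-free estimate that only averages sampled transitions:

import numpy as np

gamma = 0.9
next_states = ['s1', 's2', 's3']
probs = [0.8, 0.1, 0.1]
reward = 0.0
V_next = {'s1': 1.0, 's2': 0.5, 's3': -0.2}   # stands in for max_a Q(s', a)

# Model-based (Bellman optimality backup): uses the transition probabilities.
q_model_based = sum(p * (reward + gamma * V_next[s]) for p, s in zip(probs, next_states))

# Model-free (Q-learning style): only sees sampled next states; the average of
# the sampled targets converges to the same expectation without knowing `probs`.
rng = np.random.default_rng(0)
samples = rng.choice(next_states, size=10000, p=probs)
q_sampled = np.mean([reward + gamma * V_next[s] for s in samples])

print(q_model_based, q_sampled)   # the two estimates should be close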

Computing ROC curve for K-NN classifier

As you probably know, in K-NN, the decision is usually taken according to the "majority vote", and not according to some threshold - i.e. there is no parameter to base a ROC curve on.
Note that in my implementation of the K-NN classifier, the votes don't have equal weights. That is, the weight for each "neighbor" is e^(-d), where d is the distance between the tested sample and the neighbor. This measure gives higher weight to the votes of the nearer neighbors among the K neighbors.
My current decision rule is that if the sum of the scores of the positive neighbors is higher than the sum of the scores of the negative neighbors, my classifier says POSITIVE; otherwise it says NEGATIVE.
But - There is no threshold.
Then, I thought about the following idea:
Deciding on the class with the higher sum of votes can be described more generally as using a threshold of 0 on the score (POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES).
So I thought of changing my decision rule to use a threshold on that measure, and plotting a ROC curve based on thresholds over the values of
(POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
Does it sound like a good approach for this task?
Yes, it is more or less what is typically used. If you take a look at scikit-learn, its k-NN classifier supports weighted votes, and it also has predict_proba, which gives you a clear decision threshold. Typically, however, you do not want to threshold the difference, but rather the ratio:
votes positive / (votes negative + votes positive) >= T
This way you know you just have to move the threshold T from 0 to 1, not over arbitrary values. It also now has a clear interpretation, as an internal probability estimate that you consider "sure enough". By default T = 0.5: if the probability is above 50% you classify as positive, but as said before, you can do anything with it.
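As a sketch of what this looks like in practice (the synthetic dataset, k=5, and the exp(-d) weight function are assumptions mirroring the question's setup, not a prescribed recipe):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc

# Distance-weighted k-NN with exp(-d) weights via a custom weight function,
# scored with predict_proba so the positive-class probability can be thresholded.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, weights=lambda d: np.exp(-d))
knn.fit(X_train, y_train)

# probability of the positive class = weighted positive votes / total weighted votes
scores = knn.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores)
print("AUC:", auc(fpr, tpr))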

receiver operating characteristic (ROC) on a test set

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (B not much better than random guessing etc. ...) and a test set composed of n test samples to go with all those classifiers. Since Precision and Recall are computed for all n samples, those dots corresponding to classifiers make sense.
Now sometimes people talk about ROC curves, and I understand that precision is expressed as a function of recall, or simply plotted as Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and find precision and recall on them, hence the many discrete values (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
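If you want to see that example numerically, here is a quick simulation sketch; the unit standard deviation and the sample sizes are assumptions, since the quote only gives the means:

import numpy as np

# Simulate the Wikipedia example: "diseased" scores from N(2, 1) and
# "healthy" scores from N(1, 1); sweep the decision threshold and watch
# how the (FPR, TPR) pair moves along the ROC curve.
rng = np.random.default_rng(0)
diseased = rng.normal(2.0, 1.0, 5000)
healthy = rng.normal(1.0, 1.0, 5000)

for threshold in np.linspace(-1, 4, 11):
    tpr = np.mean(diseased > threshold)   # true positive rate at this threshold
    fpr = np.mean(healthy > threshold)    # false positive rate at this threshold
    print(f"threshold={threshold:+.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")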
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation is:
# sort the scores in descending order and reorder the labels to match
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
y_score = y_score[desc_score_indices]
y_true = y_true[desc_score_indices]
# accumulate the true positives with decreasing threshold
tps = y_true.cumsum()
# at index i, i + 1 samples are predicted positive; those that are not
# true positives must be false positives
fps = 1 + np.arange(len(y_true)) - tps
return fps, tps, y_score
(I've omitted a bunch of code in there that deals with (common) cases of having weighted samples and when the classifier gives near-identical scores to multiple samples.) Basically the true labels are sorted in descending order by the score assigned to them by the classifier, and then their cumulative sum is computed, giving the true positive rate as a function of the score assigned by the classifier.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
The ROC curve just shows how much sensitivity (TPR) you gain if you allow the FPR to increase by some amount: it is the tradeoff between TPR and FPR. The variability comes from varying some parameter of the classifier (for the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but it is just a probability, not a class. So in the general case you have to specify a threshold on the probability, above which you classify an object as positive. You can train a logistic regression, obtain from it the positive-class probability for each object in your set, and then vary this threshold parameter in small steps from 0 to 1. Thresholding the probabilities (computed in the previous step) at each value gives you class labels for every object, from which you compute TPR and FPR. Thus you get a (TPR, FPR) pair for every threshold; mark them on a plot and, once you have computed the pairs for all thresholds, draw a line through them.
For linear binary classifiers you can also think of this varying process as choosing the distance between the decision line and the positive (or negative, if you prefer) class cluster. If you move the decision line away from the positive class, you will classify more objects as positive (because you enlarged the positive region), and at the same time you will increase the FPR (because the negative region shrank).
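Here is a sketch of that threshold-sweeping procedure; the synthetic dataset and the 0.01 step size are arbitrary choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a logistic regression, sweep the probability threshold from 0 to 1,
# and collect (FPR, TPR) pairs, which together trace the ROC curve.
X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]   # positive-class probabilities

points = []
for t in np.arange(0.0, 1.01, 0.01):
    pred = (proba >= t).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    tpr = tp / np.sum(y == 1)
    fpr = fp / np.sum(y == 0)
    points.append((fpr, tpr))

# `points` now traces the ROC curve; plot FPR on the x-axis and TPR on the y-axis.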

how to interpret the "soft" and "max" in the SoftMax regression?

I know the form of softmax regression, but I am curious why it has such a name. Or is it just for historical reasons?
The maximum of two numbers max(x,y) could have sharp corners / steep edges which sometimes is an unwanted property (e.g. if you want to compute gradients).
To soften the edges of max(x,y), one can use a variant with softer edges: the softmax function. It's still a max function at its core (well, to be precise it's an approximation of it) but smoothed out.
If it's still unclear, here's a good read.
Let's say you have a set of scalars xi and you want to calculate a weighted sum of them, giving a weight wi to each xi such that the weights sum up to 1 (like a discrete probability distribution). One way to do it is to set wi = exp(a*xi) for some positive constant a, and then normalize the weights so they sum to one. If a = 0 you get just a regular sample average. On the other hand, for a very large value of a you get the max operator: the weighted sum will be just the largest xi. Therefore, varying the value of a gives you a "soft", or continuous, way to go from regular averaging to selecting the max. The functional form of this weighted average should look familiar if you already know what softmax regression is.
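A small numeric sketch of that weighted average; the values of x and a below are arbitrary:

import numpy as np

def soft_weighted_sum(x, a):
    # weights proportional to exp(a * x_i), normalized to sum to 1
    w = np.exp(a * (x - x.max()))     # subtract max for numerical stability
    w /= w.sum()
    return np.sum(w * x)

x = np.array([1.0, 2.0, 5.0, 3.0])
print(soft_weighted_sum(x, 0.0))    # a = 0   -> plain average (2.75)
print(soft_weighted_sum(x, 100.0))  # large a -> approaches max(x) = 5.0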
