Relationship between the Bellman optimality equation and Q-learning

The optimal state-action value given by the Bellman optimality equation (Sutton & Barto 2018, p. 63) is

q_*(s,a) = \sum_{s',r} p(s',r|s,a) [ r + \gamma \max_{a'} q_*(s',a') ]

and the Q-learning update is

Q(s,a) \leftarrow Q(s,a) + \alpha [ r + \gamma \max_{a'} Q(s',a') - Q(s,a) ]
I know that Q-learning is model-free, so it doesn't need the transition probabilities for the next state.
However, p(s',r|s,a) in the Bellman equation is the probability of transitioning to the next state s' with reward r when s and a are given, so I think that computing Q(s,a) requires the transition probabilities.
Is the Q in the Bellman equation different from the Q in Q-learning?
If they are the same, how can Q-learning work model-free?
Is there any way to get Q(s,a) in Q-learning without the transition probabilities?
Or am I confusing something?

Q-learning is an instance of the Bellman equation applied to a state-action value function. It is "model-free" in the sense that you don't need a transition function that determines, for a given decision, which state is next.
However, there are several formulations of Q-Learning which differ in the information that is known. In particular, when you know the transition function, you can and should use it in your Bellman equation. This results in the equation you cited.
On the other hand, if you don't know the transition function, Q-learning still works, but you have to sample the effect of the transitions through experience or simulation.
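As a minimal sketch of the model-free case (the environment interface here is an assumption, not something given in the question), note that the tabular update below only ever uses a sampled transition (s, a, r, s'); the probabilities p(s',r|s,a) never appear:

    import numpy as np

    # Assumed interface: env.reset() returns an integer state, and env.step(a)
    # returns (next_state, reward, done) sampled from the unknown p(s', r | s, a).
    def q_learning(env, n_states, n_actions, episodes=5000,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy exploration
                if np.random.rand() < epsilon:
                    a = np.random.randint(n_actions)
                else:
                    a = int(np.argmax(Q[s]))
                # The environment samples the next state and reward; the agent
                # never sees the transition probabilities themselves.
                s_next, r, done = env.step(a)
                target = r if done else r + gamma * np.max(Q[s_next])
                Q[s, a] += alpha * (target - Q[s, a])   # sample-based Bellman backup
                s = s_next
        return Q

Averaged over many sampled transitions, this update converges to the same q* that the Bellman optimality equation defines with the explicit p(s',r|s,a).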

Related

How to improve Levenberg-Marquardt's method for polynomial curve fitting?

Some weeks ago I started coding the Levenberg-Marquardt algorithm from scratch in Matlab. I'm interested in fitting a polynomial to the data, but I haven't been able to achieve the level of accuracy I would like. I'm using a fifth-order polynomial after trying other polynomials; it seemed to be the best option. The algorithm always converges to the same minimum no matter what improvements I try to implement. So far, I have unsuccessfully added the following features:
Geodesic acceleration term as a second order correction
Delayed gratification for updating the damping parameter
Gain factor to get closer to the Gauss-Newton direction or the steepest descent direction depending on the iteration
Central differences and forward differences for the finite difference method
I don't have experience with nonlinear least squares, so I don't know if there is a way to reduce the residual further or if there is no more room for improvement with this method. I attach below an image of the behavior of the polynomial for the last iterations. If I run the code for more iterations, the curve stops changing from iteration to iteration. As can be seen, there is a good fit from time = 0 to time = 12, but I'm not able to fix the behavior of the function from time = 12 to time = 20. Any help will be much appreciated.
Fitting a polynomial does not seem to be the best idea. Your data set looks like an exponential transient, with a horizontal asymptote. Forcing a polynomial onto that will work very poorly.
I'd rather try with a simple model, such as
A (1 - e^(-at)).
By eye, A ≈ 15. You should have a look at the values of log(15 - y).
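As a minimal sketch of that suggestion (the synthetic data below just stands in for the actual measurements, which aren't shown, and the starting guesses follow the A ≈ 15 observation above), the exponential model can be fitted with scipy.optimize.curve_fit:

    import numpy as np
    from scipy.optimize import curve_fit

    # Synthetic stand-in for the measured transient (the real data isn't available here).
    t = np.linspace(0, 20, 200)
    y = 15 * (1 - np.exp(-0.4 * t)) + 0.2 * np.random.randn(t.size)

    # Exponential transient with a horizontal asymptote at A, as suggested above.
    def model(t, A, a):
        return A * (1 - np.exp(-a * t))

    # Initial guesses: A from the apparent asymptote (~15),
    # a from the slope of log(A - y) versus t.
    popt, pcov = curve_fit(model, t, y, p0=[15.0, 0.5])
    print("A = %.3f, a = %.3f" % tuple(popt))

Plotting log(15 - y) against t should look roughly linear if this model is appropriate, which is a quick way to check it before fitting.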

How does scikit-learn implement line search?

In this section of the documentation on gradient boosting, it says
Gradient Boosting attempts to solve this minimization problem numerically via steepest descent: The steepest descent direction is the negative gradient of the loss function evaluated at the current model F_{m-1}, which can be calculated for any differentiable loss function:

F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^{n} \nabla_F L(y_i, F_{m-1}(x_i))

where the step length \gamma_m is chosen using line search:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(x_i) - \gamma \frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)}\right)
I understand the purpose of the line search, but I don't understand the algorithm itself. I read through the source code, but it's still not clicking. An explanation would be much appreciated.
The implementation depends on which loss function you choose when initializing a GradientBoostingClassifier instance (I'll use this for the example; the regression part should be similar). The default loss function is 'deviance', and the corresponding optimization algorithm is implemented here. In the _update_terminal_region function, a simple Newton iteration is implemented with only one step.
Is this the answer you want?
I suspect the thing you find confusing is this: you can see where scikit-learn computes the negative gradient of the loss function and fits a base estimator to that negative gradient. It looks like the _update_terminal_region method is responsible for figuring out the step size, but you can't see anywhere it might be solving the line search minimization problem as written in the documentation.
The reason you can't find a line search happening is that, for the special case of decision tree regressors, which are just piecewise constant functions, the optimal solution is usually known. For example, if you look at the _update_terminal_region method of the LeastAbsoluteError loss function, you see that the leaves of the tree are given the value of the weighted median of the difference between y and the predicted value for the examples for which that leaf is relevant. This median is the known optimal solution.
To summarize what's happening, for each gradient descent iteration the following steps are taken:
Compute the negative gradient of the loss function at the current prediction.
Fit a DecisionTreeRegressor to the negative gradient. This fitting produces a tree with good splits for decreasing the loss.
Replace the values at the leaves of the DecisionTreeRegressor with values that minimize loss. These are usually computed from some simple known formula that takes advantage of the fact that the decision tree is just a piecewise constant function.
This method should be at least as good as what is described in the docs, but I think in some cases might not be identical to it.
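As a rough sketch of those three steps (my own illustration for squared-error loss, where the optimal leaf value is simply the mean of the residuals in each leaf; this is not scikit-learn's actual code, and names such as n_rounds are just illustrative):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost_sketch(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
        # Start from a constant model (the mean is optimal for squared error).
        pred = np.full(y.shape, y.mean(), dtype=float)
        trees = []
        for _ in range(n_rounds):
            # 1. Negative gradient of the squared-error loss at the current
            #    prediction, which is just the residual y - pred.
            residual = y - pred
            # 2. Fit a regression tree to the negative gradient.
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            # 3. Set each leaf to the loss-minimizing value. For squared error
            #    that is the mean residual in the leaf, which is exactly what
            #    the fitted tree already predicts, so no extra update is needed.
            pred += learning_rate * tree.predict(X)
            trees.append(tree)
        return trees

For other losses (e.g. absolute error), step 3 would overwrite the leaf values with the closed-form minimizer, such as the weighted median mentioned above.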
From your comments it seems that it is the algorithm itself that is unclear, not the way scikit-learn implements it.
Notation in the Wikipedia article is slightly sloppy; one does not simply differentiate with respect to a function evaluated at a point. Once you replace F_{m-1}(x_i) with \hat{y_i} and replace the partial derivative with a partial derivative evaluated at \hat{y} = F_{m-1}(x), things become clearer:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, \hat{y_i} - \gamma \left[\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right]_{\hat{y} = \hat{y_i}}\right)

This would also remove x_i (sort of) from the minimization problem and shows the intent of line search: to optimize depending on the current prediction and not depending on the training set. Now, notice that the evaluated gradient g_i = \left[\frac{\partial L(y_i, \hat{y})}{\partial \hat{y}}\right]_{\hat{y} = \hat{y_i}} is just a fixed number once the current prediction \hat{y_i} is known.

Hence you're just minimizing a function of the single scalar \gamma:

\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \hat{y_i} - \gamma g_i)
So line search simply optimizes one degree of freedom you have (once you've found the right gradient direction) - the step size.
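For concreteness, here is a tiny sketch of that one-dimensional minimization (my own illustration with squared-error loss and made-up numbers, not scikit-learn code), using scipy.optimize.minimize_scalar:

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Hypothetical targets, current predictions, and a fixed descent direction h
    # (e.g. the output of the fitted base learner).
    y = np.array([3.0, -1.0, 2.5, 0.0])
    y_hat = np.array([2.0, 0.5, 2.0, 1.0])
    h = y - y_hat        # for squared error, the negative gradient is the residual

    def total_loss(gamma):
        # Loss as a function of the scalar step size only; y, y_hat, h are fixed.
        return np.sum((y - (y_hat + gamma * h)) ** 2)

    gamma_m = minimize_scalar(total_loss).x
    print(gamma_m)       # for squared error this comes out to 1.0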

Why calculate Jacobians in EKF-SLAM

I know it is a very basic question, but I want to know why we calculate the Jacobian matrices in EKF-SLAM. I have tried hard to understand this; it probably isn't that hard, but I want to understand it. I was wondering if anyone could help me with this.
The Kalman filter operates on linear systems. The steps update two parts in parallel: the state x and the error covariance P. In a linear system we predict the next x as Fx. It turns out that you can compute the exact covariance of Fx as FPF^T. In a non-linear system, we can update x as f(x), but how do we update P? There are two popular approaches:
In the EKF, we choose a linear approximation of f() at x, and then use the usual method FPF^T.
In the UKF, we build an approximation of the distribution of x with covariance P. The approximation is a set of points called sigma points. Then we propagate those states through our real f(sigma_point) and we measure the resulting distribution's variance.
You are concerned with the EKF (case 1). What is a good linear approximation of a function? If you zoom way in on a curve, it starts to look like a straight line, with a slope that's the derivative of the function at that point. If that sounds strange, look at Taylor series. The multi-variate equivalent is called the Jacobian. So we evaluate the Jacobian of f() at x to give us an F. Now Fx != f(x), but that's okay as long as the changes we're making to x are small (small enough that our approximated F wouldn't change much from before to after).
The main problem with the EKF approximation is that when we use the approximation to update the distributions after the measurement step, it tends to make the resulting covariance P too low. It acts like corrections "work" in a linear way. The actual update will depart slightly from the linear approximation and not be quite as good. These small amounts of overconfidence build up as the KF iterates and have to be offset by adding some fictitious process noise to Q.
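As a generic illustration of case 1 (a plain EKF prediction step, not tied to any particular SLAM implementation; the unicycle-style motion model and variable names are my own example), the Jacobian F of the motion model f is evaluated at the current estimate and used to propagate the covariance:

    import numpy as np

    # Example nonlinear motion model: state [x, y, theta], control [v, w].
    def f(state, control, dt=0.1):
        x, y, theta = state
        v, w = control
        return np.array([x + v * dt * np.cos(theta),
                         y + v * dt * np.sin(theta),
                         theta + w * dt])

    def jacobian_F(state, control, dt=0.1):
        # Partial derivatives of f with respect to the state, evaluated at the
        # current estimate: this is the linearization the EKF needs.
        _, _, theta = state
        v, _ = control
        return np.array([[1.0, 0.0, -v * dt * np.sin(theta)],
                         [0.0, 1.0,  v * dt * np.cos(theta)],
                         [0.0, 0.0,  1.0]])

    def ekf_predict(x, P, u, Q, dt=0.1):
        F = jacobian_F(x, u, dt)
        x_pred = f(x, u, dt)         # the state goes through the real nonlinear f
        P_pred = F @ P @ F.T + Q     # the covariance uses the linear approximation F
        return x_pred, P_pred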

Stochastic state transitions in MDP: How does Q-learning estimate that?

I am implementing Q-learning in a grid world to find the optimal policy. One thing that is bugging me is that the state transitions are stochastic. For example, if I am in state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm? As I have read so far, Q-learning is "model-free" learning: it does not need to know the state-transition probabilities. I am not convinced as to how the algorithm will automatically find these transition probabilities during the training process. If someone can clear things up, I shall be grateful.
Let's look at what Q-learning guarantees to see why it handles transition probabilities.
Let's call q* the optimal action-value function. This is the function that returns the correct value of taking some action in some state. The value of a state-action pair is the expected cumulative reward of taking the action, then following an optimal policy afterwards. An optimal policy is simply a way of choosing actions that achieves the maximum possible expected cumulative reward. Once we have q*, it's easy to find an optimal policy; from each state s that you find yourself in, greedily choose the action that maximizes q*(s,a). Q-learning learns q*, provided that it visits all states and actions infinitely many times.
For example, if I am in state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm?
Because the algorithm visits all states and actions infinitely many times, averaging q-values, it learns an expectation of the value of attempting to go north. We go north so many times that the value converges to the sum of each possible outcome weighted by its transition probability. Say we knew all of the values on the gridworld except for the value of going north from (3,2), and assume there is no reward for any transition from (3,2). After sampling north from (3,2) infinitely many times, the algorithm converges to the value 0.8 * q(3,1) + 0.1 * q(2,2) + 0.1 * q(4,2). With this value in hand, a greedy action choice from (3,2) will now be properly informed of the true expected value of attempting to go north. The transition probability gets baked right into the value!
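Here is a quick numerical sketch of that last point (the state values and seed are made up for illustration): a running average of sampled outcomes converges to the probability-weighted value, even though the learner never sees the probabilities.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical values of the three states reachable from (3,2) by going north.
    next_values = np.array([10.0, 4.0, -2.0])   # q(3,1), q(2,2), q(4,2)
    probs = np.array([0.8, 0.1, 0.1])           # unknown to the agent

    # Running-average estimate of Q((3,2), north), updated from samples only.
    Q = 0.0
    for n in range(1, 100_001):
        sampled = next_values[rng.choice(3, p=probs)]   # the environment samples s'
        Q += (sampled - Q) / n                          # incremental average

    print(Q)                    # close to 0.8*10 + 0.1*4 + 0.1*(-2) = 8.2
    print(probs @ next_values)  # the expectation it converges to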

What are the variables involved in constructing an ROC curve?

Say I have a classifier and I achieve FAR of 10% and FRR of 15%. What would I need to do with these to construct an ROC curve?
I'm having trouble seeing what they actually represent and the situation in which they are used. I don't seem to have an important variable that shifts the FAR and FRR in one direction or the other. Can I still use an ROC curve?
The short answer is: no, you cannot.
An ROC curve is a parametric curve: you need a scalar value you can shift in order to change your classifier's decisions. It's useful for:
checking robustness to this parameter
fine-tuning final probability estimates for a particular application
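For illustration (assuming the classifier can output a continuous score, which is exactly the missing variable here; the labels and scores below are made up), sweeping a decision threshold over the scores is what produces the ROC points:

    import numpy as np
    from sklearn.metrics import roc_curve

    # Hypothetical ground-truth labels and continuous classifier scores.
    y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
    scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.5])

    # roc_curve sweeps the decision threshold over the scores; each threshold
    # gives one (FPR, TPR) point, i.e. one (FAR, 1 - FRR) point.
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    for t, fp, tp in zip(thresholds, fpr, tpr):
        print("threshold=%.2f  FAR=%.2f  FRR=%.2f" % (t, fp, 1 - tp))

With a single fixed operating point (FAR 10%, FRR 15%) you only get one point in ROC space, not a curve.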
