Q-learning with 2D actions and 2D states

My problem is as follows:
At each state, the agent should adjust the water flow and the fan speed of a power plant boiler, and then receive as feedback a two-dimensional state: the current temperature and the amount of emissions.
If my agent has a tuple of actions and a tuple of states, does that mean I should split my Q-learning problem into two, where one agent would have a Q and R matrix for the water/temperature environment and the other agent one for the fan speed/emissions environment? Or is there a way to represent an R and Q matrix for the agent described originally?

It is normal for states and actions to be multidimensional. What you do is have your agent learn the values of all combinations of water flow and fan speed for all combinations of current temperature and amount of emissions. If this makes the table unwieldy, then you will need to approximate it, and that is a whole field in itself.
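A minimal sketch of such a joint table, assuming both state variables and both action variables have been discretized into a handful of bins (the bin counts and helper names below are made up for illustration):

    import numpy as np

    # Assumed discretization (illustrative): 10 temperature bins, 10 emission bins,
    # 5 water-flow settings, 5 fan-speed settings.
    N_TEMP, N_EMIS, N_WATER, N_FAN = 10, 10, 5, 5

    # One Q-value for every (temperature, emissions, water flow, fan speed) combination.
    Q = np.zeros((N_TEMP, N_EMIS, N_WATER, N_FAN))
    rng = np.random.default_rng()

    def choose_action(temp_bin, emis_bin, epsilon=0.1):
        """Epsilon-greedy over the joint 2D action; returns (water_bin, fan_bin)."""
        if rng.random() < epsilon:
            return rng.integers(N_WATER), rng.integers(N_FAN)
        flat = np.argmax(Q[temp_bin, emis_bin])          # best joint action for this state
        return np.unravel_index(flat, (N_WATER, N_FAN))

    def update(state, action, reward, next_state, alpha=0.1, gamma=0.95):
        """Standard Q-learning update; state and action are 2-tuples of bin indices."""
        best_next = Q[next_state].max()
        Q[state + action] += alpha * (reward + gamma * best_next - Q[state + action])

The point is simply that the action is one joint choice rather than two separate Q-tables; whether you later replace this table with a function approximator is a separate question.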

Related

Training PPO from stable_baselines3 on a grid world that randomizes

I'm new to RL and I was hoping to get some advice from you all:
I created a custom environment that is a 10x10 grid world where the agent and its target destination (as well as some obstacles, namely fires) can be randomly placed. The state of the env that the model is trained on is just a Box numpy array representing the different positions (0 for empty spaces, 1 for the target, etc.).
[Image: what the world could look like]
The PPO model (from stable_baselines3) is unable to learn how to navigate randomly generated worlds even after 5 million time steps of training (each reset of the environment creates a new random world layout). TensorBoard shows only a very slight increase in average reward over all that training.
I am able to train the model effectively only if I keep the world layout the same on every reset (i.e. no random placement of the agent, etc.).
So my question is: should PPO be in theory able to deal with random world generation like that or am I trying to make it do something that is beyond its capabilities?
More details: I'm using all default PPO parameters (with MlpPolicy).
The reward system is as follows:
On every step the reward is -0.5 * distance between the agent (smiley face) and the target ('$')
If the agent is next to a fire ('X'), it gets -100 reward
If the agent is next to the target ('$'), it gets a reward of 1000 and the episode ends
Max of 200 steps per episode.
I would rather try good old off-policy deterministic solutions like DQN for this task, but on-policy stochastic PPO should be able to solve it as well. I recommend changing three things in your design; they may help your training.
First, your reward signal design probably "embarrasses" your network: you have an enormously large positive terminal reward while trying to push your agent towards that terminal state as quickly as possible with small punishments. I would definitely suggest reward normalization for PPO.
Second, if you don't fine-tune your PPO hyperparameters, your entropy coefficient ent_coef stays at its default of 0.0, but the entropy term of the loss function could be very useful in your environment. I would try at least 0.01.
Third, PPO would (in my opinion) really benefit in your case from switching the MLP policy to a recurrent one. Take a look at RecurrentPPO.
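A rough sketch of those three changes with stable_baselines3 and sb3_contrib (here GridWorldEnv stands in for your custom environment class, and the values are starting points rather than tuned settings):

    from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
    from sb3_contrib import RecurrentPPO

    # GridWorldEnv is a placeholder for your custom 10x10 grid environment.
    venv = DummyVecEnv([lambda: GridWorldEnv()])

    # 1) Normalize rewards so the huge terminal reward and the small step
    #    penalties end up on a comparable scale.
    venv = VecNormalize(venv, norm_obs=False, norm_reward=True)

    # 2) Non-zero entropy bonus to keep exploring random layouts.
    # 3) Recurrent policy instead of the plain MLP.
    model = RecurrentPPO("MlpLstmPolicy", venv, ent_coef=0.01, verbose=1)
    model.learn(total_timesteps=5_000_000)

If you stick with plain PPO, you can still pass ent_coef=0.01 and wrap the environment in VecNormalize the same way.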

Why can ALS be easily computed in parallel?

I understand that ALS can be computed partially in parallel, because we can work on rows/columns independently when computing the item/user latent features. But the two halves alternate: once the user_latent features are computed, we need them fixed to compute the item_latent features, which are in turn needed to recompute the user_latent features, and so on.
So, my question is: can it be optimized any further? Or is the fact that we can work on one big matrix in parallel enough for real applications?
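The alternation itself can't be removed: updating the user factors needs the current item factors and vice versa. What makes ALS attractive is that, within each half-step, every row is an independent least-squares problem, so the per-row solves can be spread over as many workers as you like. A toy NumPy sketch (dense ratings matrix, purely illustrative):

    import numpy as np

    def als_step(R, fixed, reg=0.1):
        """Solve for one factor matrix while the other ('fixed') is held constant.
        Each row is an independent ridge-regression solve, so the loop below
        could be distributed across workers with no communication until the end."""
        k = fixed.shape[1]
        A = fixed.T @ fixed + reg * np.eye(k)    # shared k x k matrix (dense case)
        return np.array([np.linalg.solve(A, fixed.T @ r) for r in R])

    rng = np.random.default_rng(0)
    R = rng.random((100, 80))                    # users x items ratings (toy data)
    users = rng.random((100, 8))
    items = rng.random((80, 8))

    for _ in range(10):                          # the alternation itself stays sequential
        users = als_step(R, items)               # rows of `users` solved in parallel
        items = als_step(R.T, users)             # rows of `items` solved in parallel

So the practical answer is: the two half-steps stay sequential, but each one is embarrassingly parallel across rows, which is essentially how distributed implementations such as Spark's ALS organize the work.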

Q learning transition matrix

I'm trying to figure out how to implement Q-learning in a gridworld example. I believe I understand the basics of how Q-learning works, but it doesn't seem to be giving me the correct values.
This example is from Sutton and Barto's book on reinforcement learning.
The gridworld is specified such that the agent can take actions {N, E, W, S} at any given state with equal probability, and the reward for all actions is 0, except when the agent attempts to move off the grid, in which case it is -1. There are two special states, A and B, from which the agent deterministically moves to A' and B' with rewards +10 and +5 respectively.
My question is about how I would go about implementing this through Q learning. I want to be able to estimate the value function through matrix inversion. The agent starts out in some initial state, not knowing anything and then takes actions selected by an epsilon-greedy algorithm and gets rewards that we can simulate since we know how the rewards are distributed.
This leads me to my question. Can I build a transition probability matrix each time the agent transitions from some state S -> S', where the probabilities are computed based on the frequency with which the agent took a particular action and made a particular transition?
For Q-learning you don't need a "model" of the environment (i.e. a transition probability matrix) to estimate the value function, since it is a model-free method. The matrix-inversion evaluation you refer to belongs to dynamic programming (model-based), where you do use a transition matrix. You can think of Q-learning as a trial-and-error algorithm: you select an action and receive feedback from the environment. However, contrary to model-based methods, you don't assume any knowledge about how your environment works (no transition matrix and no reward matrix). Eventually, after enough sampled experience, the Q function will converge to the optimal one.
For the implementation of the algorithm, you can start from an initial state, after initializing your Q function for all states and actions (so you keep track of an $S \times A$ table). Then you select an action according to your policy. Here you should implement a step function, which returns the new state $s'$ and the reward; consider the step function to be the environment's feedback to your action.
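For the gridworld in the question (Sutton and Barto's Example 3.5; the coordinates of A, B and their teleport targets below are the book's usual layout, stated here as an assumption), a step function could look like:

    ACTIONS = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
    A, A_PRIME = (0, 1), (4, 1)     # from A every action jumps to A' with reward +10
    B, B_PRIME = (0, 3), (2, 3)     # from B every action jumps to B' with reward +5

    def step(state, action, size=5):
        """Feedback of the environment: returns (next_state, reward)."""
        if state == A:
            return A_PRIME, 10.0
        if state == B:
            return B_PRIME, 5.0
        dr, dc = ACTIONS[action]
        r, c = state[0] + dr, state[1] + dc
        if not (0 <= r < size and 0 <= c < size):
            return state, -1.0      # tried to move off the grid: stay put, reward -1
        return (r, c), 0.0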
Finally, you just need to update your Q-function according to:
$Q(s,a)=Q(s,a)+\alpha\left[r+\gamma\,\underset{a'}{\max}\,Q(s',a')-Q(s,a)\right]$
Set $s=s'$ and repeat the whole process till convergence.
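Putting the pieces together with the step function sketched above (a minimal tabular sketch; the learning rate, discount and number of steps are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    actions = list(ACTIONS)                                 # ['N', 'S', 'E', 'W']
    Q = {(r, c): {a: 0.0 for a in actions} for r in range(5) for c in range(5)}

    alpha, gamma, epsilon = 0.1, 0.9, 0.1
    s = (2, 2)                                              # arbitrary start state
    for _ in range(200_000):                                # continuing task: keep sampling
        if rng.random() < epsilon:                          # epsilon-greedy behaviour policy
            a = actions[rng.integers(len(actions))]
        else:
            a = max(Q[s], key=Q[s].get)
        s_next, r = step(s, a)
        Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
        s = s_next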
Hope this helps
Not sure if this helps, but here is a write-up explaining Q-learning through the example of a robot. There is also some R code in there if you want to try it out yourself.

Updates in Temporal Difference Learning

I read about Tesauro's TD-Gammon program and would love to implement it for tic tac toe, but almost all of the information is inaccessible to me as a high school student because I don't know the terminology.
The first equation here, http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2
gives the "general supervised learning paradigm." It says that the w sub t on the left side of the equation is the parameter vector at time step t. What exactly does "time step" mean? Within the framework of a tic tac toe neural network designed to output the value of a board state, would the time step refer to the number of played pieces in a given game? For example the board represented by the string "xoxoxoxox" would be at time step 9 and the board "xoxoxoxo " would be at time step 8? Or would the time step refer to the amount of time elapsed since training has begun?
Since w sub t is the weight vector for a given time step, does this mean that every time step has its own evaluation function (neural network)? So to evaluate a board state with only one move, you would have to feed into a different NN than you would feed a board state with two moves? I think I am misinterpreting something here because as far as I know Tesauro used only one NN for evaluating all board states (though it is hard to find reliable information about TD-Gammon).
How come the gradient of the output is taken with respect to w and not w sub t?
Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions for accessible reading material.
TD deals with learning within the framework of a Markov Decision Process. That is, you begin in some state $s_t$, perform an action $a_t$, receive reward $r_t$, and end up in another state $s_{t+1}$. The initial state is called $s_0$. The subscript is called time.
A TD agent begins not knowing what rewards it will receive for which actions or which other states those actions lead to. Its goal is to learn which actions maximize long-term total reward.
The state could represent the state of the tic-tac-toe board. So, $s_0$ in your case would be a clear board: "---------", $s_1$ might be "-o-------", $s_2$ might be "-o--x----", etc. The action might be the cell index to mark. With this representation, you would have $3^9 = 19\,683$ possible states and 9 possible actions. After learning the game, you would have a table telling you which cell to mark on each possible board.
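As a concrete (and purely illustrative) picture of that table, you could key it by the board string and keep one value per cell:

    import numpy as np

    values = {}     # e.g. values["-o--x----"] -> array of 9 action values

    def best_move(board):
        """Pick the empty cell with the highest learned value for this board string."""
        q = values.setdefault(board, np.zeros(9))
        empty = [i for i, cell in enumerate(board) if cell == '-']
        return max(empty, key=lambda i: q[i])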
The same kind of direct representation would not work well for Backgammon, because there are far too many possible states. What TD-Gammon does is approximate states using a neural network. The weights of the network are treated as a state vector, and the reward is always 0, except upon winning.
The tricky part here is that the state the TD-Gammon algorithm learns is the state of the neural network used to evaluate board positions, not the state of the board. So, at $t=0$ you have not played a single game and the network is in its initial state. Every time you have to make a move, you use the network to choose the best possible one, and then update the network's weights based on whether the move led to victory or not. Before making the move, the network has weights $w_t$; afterwards it has weights $w_{t+1}$. After playing hundreds of thousands of games, you learn weights that allow the neural network to evaluate board positions quite accurately.
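Concretely, that weight change has the shape $w_{t+1} = w_t + \alpha\,[\,r + \gamma V(s_{t+1}; w_t) - V(s_t; w_t)\,]\,\nabla_w V(s_t; w_t)$ (this is plain TD(0); TD-Gammon itself uses TD($\lambda$) with eligibility traces). A toy sketch with a linear value function instead of a neural network; the board encoding is made up for illustration:

    import numpy as np

    def features(board):
        """Toy encoding: +1 for 'x', -1 for 'o', 0 for empty, plus a bias term."""
        cells = np.array([{'x': 1.0, 'o': -1.0}.get(c, 0.0) for c in board])
        return np.append(cells, 1.0)

    w = np.zeros(10)                     # ONE weight vector, reused at every time step

    def value(board):
        return float(features(board) @ w)

    def td_update(board, next_board, reward, alpha=0.1, gamma=1.0):
        """One TD(0) step: move w toward the bootstrapped target r + gamma * V(s')."""
        global w
        delta = reward + gamma * value(next_board) - value(board)
        w = w + alpha * delta * features(board)   # gradient of a linear V(s; w) is features(s)

There is only one value function; the subscript $t$ just records how many updates the shared weight vector has received, which is why the gradient is taken with respect to $w$ rather than a separate $w_t$ for each step.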
I hope this clears things up.

Reinforcement learning of a policy for multiple actors in large state spaces

I have a real-time domain where I need to assign an action to N actors involving moving one of O objects to one of L locations. At each time step, I'm given a reward R, indicating the overall success of all actors.
I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 500,000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 25,000,000 potential actions per actor.
Nearly none of the standard reinforcement learning algorithms seem suitable for this domain.
First, nearly all of them involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever for a policy to converge using something as primitive as Q-learning, even with function approximation. And even if it did converge, it would take too long to find the best action out of a million actions at each time step.
Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors.
How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm.
Clarifying the problem
N=10 actors
O=50 objects
L=1K locations
S=50 features
As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in one of the L locations in the least amount of time. The action space consists of decisions about which actor should be moving which object to which location at any point in time. The state space consists of some 50 X-dimensional environmental factors that include features such as proximity of actors and objects to walls and to each other. So, at first glance, you have $X^S(OL)^N$ action values, with most action dimensions discrete.
The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.
Solving a different problem
We look at a single actor. Say it knows its own position in the warehouse, the positions of the other 9 actors, the positions of the 50 objects, and the 1000 locations. It wants to achieve maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.
Suppose we have a P-dimensional representation of position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4P-dimensional state space and a $P^2$-dimensional action space; in other words, a value function over a $(4P+P^2)$-dimensional space. By further experimenting with the representation, using different-precision encodings for different parameters, and using options [2], it might be possible to bring the problem into the practical realm.
For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers [1] and [2].
[1] Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p. 101.
[2] Konidaris, G. & Barto, A., 2009. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining. In Y. Bengio et al., eds., Advances in Neural Information Processing Systems, 18, pp. 1015-1023.

Resources