In my understanding, a reinforcement learning agent gets a reward from its actions.
However, when playing a video game (e.g. Street Fighter), the reward is 0 in most of the steps, and only eventually do we get a reward (e.g. the player wins, reward = 1). With so many actions taken along the way, how does the machine know which one was the key to winning the game?
In Reinforcement Learning the reward can be immediate or delayed [1]:
The immediate reward could be:
a large positive value if the agent wins the game (it is the last action that defeats the opponent);
a large negative value if the agent loses the game;
positive if the action damages your opponent;
negative if the agent loses health points.
A delayed reward arises when a current action makes a future reward possible. For example, moving one step to the left could mean that in the next step the agent avoids being hit and is able to hit the opponent.
Reinforcement learning algorithms, such as Q-learning, choose the action that gives the highest expected reward. This estimate is continuously updated with the current reward ($r_t$ at time $t$) and with possible future rewards (the last term inside the brackets, the max over $Q$, based on actions from time $t+1$ onward):
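The equation referred to here is presumably the standard Q-learning update (the same rule quoted later in this thread), with learning rate $\alpha$ and discount factor $\gamma$:

$Q(s_t,a_t) \leftarrow Q(s_t,a_t)+\alpha\left[r_t+\gamma\max_{a'}Q(s_{t+1},a')-Q(s_t,a_t)\right]$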
More detailed information about (Deep) Reinforcement Learning, with some examples of applications to games, is given in A Beginner's Guide to Deep Reinforcement Learning.
I would like to analyze a problem similar to the following.
Problem:
You will be given N dice.
You will be given a lot of data about each die (e.g. surface information, material information, location of the center of gravity, etc.).
The features of the dice are randomly generated every game, and the dice are thrown with the same speed, angle and initial position.
When a die is rolled, you get 1 point if it shows a 6 and 0 points otherwise.
There are training data from 100,000 games (dice data and match results).
I would like to learn a rule that selects only dice whose probability of rolling a 6 is higher than 1/6.
I apologize for the vague problem statement.
First of all, my assumption of "N dice" above was a mistake.
The dice can be considered one at a time:
A single die with random characteristics is given.
When it is rolled, it is recorded whether a 6 came out or not.
It is easier to think of the problem as having 100,000 of these [characteristics, result] data points.
If the roll is anything other than a 6, you get -1 point.
If the roll is a 6, you get +5 points.
Example:
X: feature vector of a die
f: function I want to learn
f: X -> [0, 1]
(if f(X) > 0.5, I pick this die.)
For example, a die with a 1/5 chance of rolling a 6 will show a non-6 four out of five times, so I wondered whether it would be better to give an immediate reward per roll.
Or is it better to base the reward on the total number of points after 100,000 games?
I have read about some general reinforcement learning methods, but they involve the concept of state transitions, and there are no state transitions in this game (each game ends in one step, and the games are independent).
I am a student just learning neural networks from scratch. It would help if you could give me a hint. Thank you.
By the way,
I think the result of this learning could be summarized as: "it is good to choose a die whose face farthest from the center of gravity is the 6."
Let's first talk about Reinforcement Learning.
Problem setups, in order of increasing generality:
Multi-Armed Bandit - no state, just actions with unknown rewards
Contextual Bandit - rewards also depend on some context (state)
Reinforcement Learning (MDP) - actions can also influence the next state
Common to all three is that you want to maximize the sum of rewards over time, and there is an exploration vs. exploitation trade-off. You are not just given a large dataset: if you want to know what the best action is, you have to try it a few times and observe the reward. This may cost you some reward you could have earned otherwise.
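To make the exploration vs. exploitation idea concrete, here is a minimal sketch of an epsilon-greedy multi-armed bandit (a toy example with made-up arm probabilities, not tied to the dice problem):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.2, 0.5, 0.7]        # unknown to the agent
estimates = np.zeros(3)             # running estimate of each arm's mean reward
counts = np.zeros(3)
epsilon = 0.1                       # fraction of the time spent exploring

for t in range(10_000):
    if rng.random() < epsilon:      # explore: pick a random arm
        arm = int(rng.integers(3))
    else:                           # exploit: pick the best-looking arm so far
        arm = int(np.argmax(estimates))
    reward = float(rng.random() < true_means[arm])               # Bernoulli reward
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]    # incremental mean

print(estimates)                    # should end up close to true_means
```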
Of those three, the Contextual Bandit is the closest match to your setup, although it doesn't quite match your goals. It would go like this: given some properties of the dice (the context), select the best die from a group of possible choices (the action, e.g. the output of your network), such that you get the highest expected reward. At the same time you are also training your network, so you sometimes have to pick dice with bad or unknown properties in order to explore them.
However, there are two reasons why it doesn't match:
You already have data from 100,000 games, and you do not seem to be interested in minimizing the cost of trial and error to acquire more data. You assume this data is representative, so no exploration is required.
You are only interested in prediction. You want to classify the dice into "good for rolling a 6" vs. "bad". This piece of information could be used later to make a decision between different choices if you know the cost of making a wrong decision. If you are just learning f() because you are curious about the properties of a die, then it is a pure statistical prediction problem. You don't have to worry about short- or long-term rewards, and you don't have to worry about the selection or consequences of any actions.
Because of this, you actually have just a supervised learning problem. You could still solve it with reinforcement learning, because RL is more general, but your RL algorithm would waste a lot of time figuring out that it really cannot influence the next state.
Supervised Learning
Each die actually behaves like a biased coin: rolling it is a Bernoulli trial with roughly 1/6 success probability. This is now a standard classification problem: given your features, predict the probability that a die will lead to a good match result.
It seems that your "match results" can easily be converted into the number of rolls and the number of positive outcomes (rolled a 6) with the same die. If you have a large number of rolls for every die, you can simply classify that die and use this class (together with the physical properties) as one data point to train your network.
You can do more fancy things if you have fewer rolls, but I won't go into that. (If you are interested, have a look at the beta distribution and how the cross-entropy loss works with neural networks.)
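As a minimal sketch of this supervised formulation (the feature matrix `X` and the 0/1 label `rolled_six` are randomly generated stand-ins for the real dice data; scikit-learn's logistic regression is just one convenient classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: one row of physical features per recorded roll,
# and a 0/1 label indicating whether that roll came up six.
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 8))                      # placeholder for the real dice features
rolled_six = (rng.random(100_000) < 1 / 6).astype(int) # placeholder for the match results

# f: X -> [0, 1], the estimated probability of rolling a six given the features.
model = LogisticRegression(max_iter=1000).fit(X, rolled_six)

def pick_die(features) -> bool:
    """Pick the die if its predicted P(six) exceeds the fair-die baseline of 1/6."""
    p_six = model.predict_proba(np.atleast_2d(features))[0, 1]
    return p_six > 1 / 6

print(pick_die(X[0]))
```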
I want to learn how a deep reinforcement learning algorithm works and how long it takes to train itself for a given environment.
I came up with a very simple example of environment:
There is a counter which holds an integer between 0 and 100.
Counting up to 100 is its goal.
There is one parameter, direction, whose value can be +1 or -1.
It simply shows the direction to move.
Our neural network takes this direction as input and has 2 possible actions as output:
Change the direction
Do not change the direction
The 1st action will simply flip the direction (+1 => -1 or -1 => +1). The 2nd action will keep the direction as it is.
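For concreteness, here is a minimal sketch of the environment described above (the clipping at 0/100 and the termination condition are my assumptions, since they are not fully specified):

```python
class CounterEnv:
    """Counter moves by `direction` each step; reaching 100 is the goal."""

    def __init__(self):
        self.counter = 0
        self.direction = +1

    def step(self, action):
        # action 0: change the direction, action 1: do not change it
        if action == 0:
            self.direction *= -1
        self.counter = max(0, min(100, self.counter + self.direction))
        done = self.counter == 100
        # Sparse reward: +1 only on reaching 100 (what the reward should be is exactly the question).
        reward = 1.0 if done else 0.0
        return self.direction, reward, done


env = CounterEnv()
state, reward, done = env.step(1)   # keep going up
print(state, reward, done)
```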
I am using python for backend and javascript for frontend.
It seems to take too much time, and the behaviour is still pretty random. I have used a 4-layer perceptron, a learning rate of 0.001, and replay memory with a batch size of 100. The code is from a Udemy tutorial on Artificial Intelligence and is working properly.
My question is: what should the reward be for completion and for each state, and how long should it take to train an example as simple as this?
In Reinforcement Learning the underlying reward function is what defines the game. Different reward functions lead to different games with different optimal strategies.
In your case there are a few different possibilities:
Give +1 for reaching 100 and only then.
Give +1 for reaching 100 and -0.001 for every time step it is not at 100.
Give +1 for going up and -1 for going down.
The third case is way too easy: there is no long-term planning involved. In the first two cases the agent will only start learning once it accidentally reaches 100 and sees that it is good. But in the first case, once it learns to go up, it doesn't matter how long it takes to get there. The second case is the most interesting, where it needs to get there as fast as possible.
There is no right answer for what reward to use, but ultimately the reward you choose defines the game you are playing.
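As a sketch, the three reward choices above could be written as simple Python functions (the names and signatures are mine, just to make the options concrete):

```python
def reward_sparse(done):
    # Choice 1: +1 for reaching 100, and only then.
    return 1.0 if done else 0.0

def reward_with_time_penalty(done):
    # Choice 2: +1 for reaching 100, -0.001 for every time step it is not at 100.
    return 1.0 if done else -0.001

def reward_shaped(prev_counter, counter):
    # Choice 3: +1 for going up, -1 for going down (no long-term planning needed).
    return 1.0 if counter > prev_counter else -1.0
```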
Note: a 4-layer perceptron for this problem is massive overkill; one layer should be enough (this problem is very simple). Have you tried the reinforcement learning environments in OpenAI Gym? I highly recommend them; they include all the "classical" reinforcement learning problems.
I want to implement Q-Learning for the Chrome dinosaur game (the one you can play when you are offline).
I defined my state as: distance to next obstacle, speed and the size of the next obstacle.
For the reward I wanted to use the number of successfully passed obstacles, but it could happen that the same state has different immediate rewards. The same type of obstacle could reappear later in the game, but the reward for passing it would be higher because more obstacles have already been passed.
My question now is: is this a problem, or would Q-learning still work? If not, is there a better way?
The definition of an MDP says that the reward r(s,a,s') is the expected reward for taking action a in state s and reaching state s'. This means that a given (s,a,s') can have a constant reward, or some distribution of rewards, as long as it has a well-defined expectation. As you've defined it, the reward is proportional to the number of obstacles passed. Because the game can continue forever, the reward for some (s,a,s') begins to look like the sum of the natural numbers. This series diverges, so it does not have an expectation. In practice, if you ran Q-learning you would probably see the value function diverge (NaN values), but the policy in the middle of learning might still be okay, because the values that grow the fastest will belong to the best state-action pairs.
To avoid this, you should choose a different reward function. You could reward the agent with whatever its score is when it dies (big reward at the end, zero otherwise). You would also be fine giving a living reward (small reward each time step) as long as the agent has no choice but to move forward. As long as the highest total rewards are assigned to the longest runs (and the expectation of the reward for a (s,a,s') tuple is well defined) it's good.
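As a sketch, the two reward schemes suggested here might look like this (hypothetical helpers, not tied to any particular implementation of the game):

```python
def terminal_reward(score, done):
    # Reward the agent with its score when it dies; zero at every other step.
    return float(score) if done else 0.0

def living_reward(done):
    # Small constant reward for every step survived; longer runs earn more in total.
    return 0.0 if done else 0.1
```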
I'm trying to figure out how to implement Q learning in a gridworld example. I believe I understand the basics of how Q learning works but it doesn't seem to be giving me the correct values.
This example is from Sutton and Barto's book on reinforcement learning.
The gridworld is specified such that the agent can take actions {N,E,W,S} at any given state with equal probability, and the reward for all actions is 0 except when the agent attempts to move off the grid, in which case it is -1. There are two special states, A and B, from which the agent deterministically moves to A' and B' with rewards +10 and +5 respectively.
My question is about how I would go about implementing this through Q learning. I want to be able to estimate the value function through matrix inversion. The agent starts out in some initial state, not knowing anything and then takes actions selected by an epsilon-greedy algorithm and gets rewards that we can simulate since we know how the rewards are distributed.
This leads me to my question. Can I build a transition probability matrix each time the agent transitions from some state S -> S', where the probabilities are computed from the frequency with which the agent took a particular action and made a particular transition?
For Q-learning you don't need a "model" of the environment (i.e. a transition probability matrix) to estimate the value function, since it is a model-free method. Evaluation by matrix inversion belongs to dynamic programming (model-based), where you do use a transition matrix. You can think of the Q-learning algorithm as a kind of trial-and-error algorithm where you select an action and receive feedback from the environment. However, contrary to model-based methods, you don't have any knowledge about how your environment works (no transition matrix and no reward matrix). Eventually, after enough sampled experience, the Q-function will converge to the optimal one.
For the implementation of the algorithm, you can start from an initial state, after initializing your Q-function for all states and actions (so you keep track of a table of size $|S| \times |A|$). Then you select an action according to your policy. Here you should implement a step function. The step function will return the new state $s'$ and the reward. Consider the step function as the feedback of the environment to your action.
Eventually you just need to update your Q-function according to:
$Q(s,a) \leftarrow Q(s,a)+\alpha\left[r+\gamma\max_{a'}Q(s',a')-Q(s,a)\right]$
Set $s=s'$ and repeat the whole process till convergence.
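A minimal tabular Q-learning sketch along these lines (the dynamics in `step` are my simplified stand-in for the Sutton & Barto gridworld, with A at state 1 and B at state 3 on a 5x5 grid):

```python
import numpy as np

n_states, n_actions = 25, 4          # 5x5 grid, actions N/E/S/W
alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    """Environment feedback: returns (next_state, reward)."""
    if s == 1:
        return 21, 10.0              # special state A -> A', reward +10
    if s == 3:
        return 13, 5.0               # special state B -> B', reward +5
    row, col = divmod(s, 5)
    dr, dc = [(-1, 0), (0, 1), (1, 0), (0, -1)][a]   # N, E, S, W
    nr, nc = row + dr, col + dc
    if not (0 <= nr < 5 and 0 <= nc < 5):
        return s, -1.0               # moving off the grid: stay put, reward -1
    return nr * 5 + nc, 0.0

s = int(rng.integers(n_states))
for t in range(200_000):
    # epsilon-greedy action selection
    a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(np.max(Q, axis=1).reshape(5, 5).round(1))   # greedy state values
```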
Hope this helps
Not sure if this helps, but here is a write-up explaining Q-learning through the example of a robot. There is also some R code in there if you want to try it out yourself.
I read about Tesauro's TD-Gammon program and would love to implement it for tic tac toe, but almost all of the information is inaccessible to me as a high school student because I don't know the terminology.
The first equation here, http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2
gives the "general supervised learning paradigm." It says that the w sub t on the left side of the equation is the parameter vector at time step t. What exactly does "time step" mean? Within the framework of a tic tac toe neural network designed to output the value of a board state, would the time step refer to the number of played pieces in a given game? For example the board represented by the string "xoxoxoxox" would be at time step 9 and the board "xoxoxoxo " would be at time step 8? Or would the time step refer to the amount of time elapsed since training has begun?
Since w sub t is the weight vector for a given time step, does this mean that every time step has its own evaluation function (neural network)? So to evaluate a board state with only one move, you would have to feed into a different NN than you would feed a board state with two moves? I think I am misinterpreting something here because as far as I know Tesauro used only one NN for evaluating all board states (though it is hard to find reliable information about TD-Gammon).
How come the gradient of the output is taken with respect to w and not w sub t?
Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions for accessible reading material.
TD deals with learning within the framework of a Markov Decision Process. That is, you begin in some state $s_t$, perform an action $a_t$, receive reward $r_t$, and end up in another state, $s_{t+1}$. The initial state is called $s_0$. The subscript $t$ is called time.
A TD agent begins not knowing what rewards it will receive for which actions, or which other states those actions take it to. Its goal is to learn which actions maximize long-term total reward.
The state could represent the state of the tic-tac-toe board. So, $s_0$ in your case would be an empty board: "---------", $s_1$ might be "-o-------", $s_2$ might be "-o--x----", etc. The action might be the index of the cell to mark. With this representation, you would have $3^9 = 19{,}683$ possible states and 9 possible actions. After learning the game, you would have a table telling you which cell to mark on each possible board.
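A sketch of that direct tabular representation (just the state indexing and the Q-table shape; the learning loop is omitted):

```python
import numpy as np

MARKS = {'-': 0, 'o': 1, 'x': 2}

def state_index(board):
    """Map a 9-character board string to an integer in [0, 3**9)."""
    idx = 0
    for cell in board:
        idx = idx * 3 + MARKS[cell]
    return idx

Q = np.zeros((3**9, 9))      # 19,683 states x 9 possible cell-marking actions
print(state_index('---------'), state_index('-o--x----'))
```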
The same kind of direct representation would not work well for Backgammon, because there are far too many possible states. What TD-Gammon does instead is approximate the value of board states using a neural network. The weights of the network form the parameter vector, and the reward is always 0, except upon winning.
The tricky part here is that what TD-Gammon learns is the set of weights of the neural network used to evaluate board positions, not a table over board states. So, at $t=0$ you have not played a single game and the network has its initial weights. Every time you have to make a move, you use the network to choose the best possible one, and then update the network's weights based on whether the move led to a better position and, eventually, to victory. Before making the move, the network has weights $w_t$; afterwards it has weights $w_{t+1}$. After playing hundreds of thousands of games, you end up with weights that allow the neural network to evaluate board positions quite accurately.
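As an illustration of the single-network idea, here is a minimal sketch of a TD(0) weight update for a tic-tac-toe value function (a plain linear model standing in for the network; this is not TD-Gammon's actual architecture or its TD($\lambda$) update):

```python
import numpy as np

w = np.zeros(9)      # one weight per board cell: the parameter vector w_t
alpha = 0.1          # learning rate

def features(board):
    """Encode a board string like 'x-o------' as +1 for x, -1 for o, 0 for empty."""
    return np.array([1.0 if c == 'x' else -1.0 if c == 'o' else 0.0 for c in board])

def value(board):
    """V(board; w): the network's estimate of how good this position is for x."""
    return float(features(board) @ w)

def td_update(board, next_board, reward):
    """TD(0): move V(board) toward reward + V(next_board); w_t becomes w_{t+1}."""
    global w
    td_error = reward + value(next_board) - value(board)
    w += alpha * td_error * features(board)   # gradient of the linear value w.r.t. w

# One hypothetical transition: x plays the centre, game not over yet (reward 0).
td_update('---------', '----x----', reward=0.0)
print(w)
```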
I hope this clears things up.