I am relatively new in this field, but I couldn't find anything similar to this problem.
The problem: An agent can move from state s1 to state s2 in many ways (in one step).
For example if states represent locations, assume that an agent can move from location represented by s1 to the one of s2 in one step, by taking one of the actions a1 or a2.
This means that multiple actions taken in some state lead to the same state.
Is there anything similar in the literature?
Yes, this situation it pretty standard and can be managed by any Reinforcement Learning algorithm. Markov Decission Processes (which is the mathematical framework commonly used to model the environment in RL) do not assume that there is a unique action that can lead from one state s1 to a another state s2.
So any literature about RL is also covering the case you describe.
For example, this MDP from Wikipedia article for Markov decision process shows an case where you can move from state s1 to state s2 in two ways and one step:
Related
In a perfect information environment, where we are able to know the state after an action, like playing chess, is there any reason to use Q learning not TD (temporal difference) learning?
As far as I understand, TD learning will try to learn V(state) value, but Q learning will learn Q(state action value) value, which means Q learning learns slower (as state action combination is more than state only), is that correct?
Q-Learning is a TD (temporal difference) learning method.
I think you are trying to refer to TD(0) vs Q-learning.
I would say it depends on your actions being deterministic or not. Even if you have the transition function, it can be expensive to decide which action to take in TD(0) as you need to calculate the expected value for each of the actions in each step. In Q-learning that would be summarized in the Q-value.
Given a deterministic environment (or as you say, a "perfect" environment in which you are able to know the state after performing an action), I guess you can simulate the affect of all possible actions in a given state (i.e., compute all possible next states), and choose the action that achieves the next state with the maximum value V(state).
However,it should be taken into account that both value functions V(state) and Q functions Q(state,action) are defined for a given policy. In some way, the value function can be considered as an average of the Q function, in the sense that V(s) "evaluates" the state s for all possible actions. So, to compute a good estimation of V(s) the agent still needs to perform all the possible actions in s.
In conclusion, I think that although V(s) is simpler than Q(s,a), likely they need a similar quantity of experience (or time) to achieve a stable estimation.
You can find more info about value (V and Q) functions in this section of the Sutton & Barto RL book.
Q learning is a TD control algorithm, this means it tries to give you an optimal policy as you said. TD learning is more general in the sense that can include control algorithms and also only prediction methods of V for a fixed policy.
Actually Q-learning is the process of using state-action pairs instead of just states. But that doesnt mean Q learning is different from TD. In TD(0) our agent takes one step(which could be one step in state-action pair or just state) and then updates it's Q-value. And same in n-step TD where our agent takes n steps and then updates the Q-values. Comparing TD and Q-learning isn't the right way. You can compare TD and SARSA algorithms instead. And TD and MonteCarlo
How to convert or visualize decision table to decision tree graph,
is there an algorithm to solve it, or a software to visualize it?
For example, I want to visualize my decision table below:
http://i.stack.imgur.com/Qe2Pw.jpg
Gotta say that is an interesting question.
I don't know the definitive answer, but I'd propose such a method:
use Karnaugh map to turn your decision table to minimized boolean function
turn your function into a tree
Lets simplyify an example, and assume that using Karnaugh got you function (a and b) or c or d. You can turn that into a tree as:
Source: my own
It certainly is easier to generate a decision table from a decision tree, not the other way around.
But the way I see it you could convert your decision table to a data set. Let the 'Disease' be the class attribute and treat the evidence as simple binary instance attributes. From that you can easily generate a decision tree using one of available decision tree induction algorithms, for example C4.5. Just remember to disable pruning and lower the minimum number of objects parameter.
During that process you would lose a bit of information, but the accuracy would remain the same. Take a look at both rows describing disease D04 - the second row is in fact more general than the first. Decision tree generated from this data would recognize the mentioned disease only from E11, 12 and 13 attributes, since it's enough to correctly label the instance.
I've spent few hours looking for a good algorithm. But I'm happy with my results.
My code is too dirty now to paste here (I can share privately on request, on your discretion) but the general idea is as the following.
Assume you have a data set with some decision criteria and outcome.
Define a tree structure (e.g. data.tree in R) and create "Start" root node.
Calculate outcome entropy of your data set. If entropy is zero you are done.
Using each criterion, one by one, as tree node calculate entropy for all branches created with this criterion. Take the minimum one entropy of all branches.
Branches created with the criterion with the smallest (minimum) entropy are your next tree node. Add them as child nodes.
Split your data according to decision point/tree node found in step 4 and remove the criterion used.
Repeat step 2-4 for each branch until your all branches have entropy = 0.
Enjoy your ideal decision tree :)
I'm trying to figure out how to implement Q learning in a gridworld example. I believe I understand the basics of how Q learning works but it doesn't seem to be giving me the correct values.
This example is from Sutton and Barton's book on reinforcement learning.
The gridworld is specified such that the agent can take actions {N,E,W,S} at any given state with equal probability and the rewards for all actions is 0 except if the agent attempts to move off the grid in which case it's -1. There are two special states, A and B where the agent deterministically will move to A' and B' respectively with rewards +10 and +5 respectively.
My question is about how I would go about implementing this through Q learning. I want to be able to estimate the value function through matrix inversion. The agent starts out in some initial state, not knowing anything and then takes actions selected by an epsilon-greedy algorithm and gets rewards that we can simulate since we know how the rewards are distributed.
This leads me to my question. Can I build a transition probability matrix each time the agent transitions from some state S -> S' where the probabilities are computed based on the frequency with which the agent took a particular action and did a particular transition?
For Q-learning you don't need a "model" of the environment (i.e transition probabilities matrix) to estimate the value function as it is a model-free method. For matrix inversion evaluation you refer to dynamic programming (model-based) where you use a transition matrix. You can think of Q-learning algorithm as a kind of trial and error algorithm where you select an action and receive feedback from the environment. However, contrary to model-based methods you don't have any knowledge about how your environment works (no transition matrix and reward matrix). Eventually, after enough sampled experience the Q function will converge to the optimal one.
For the implementation of the algorithm, you can start from an initial state, after initializing you Q function for all stats and actions (so you can keep track of a $SxA$). Then you select an action according to you policy. Here you should implement a step function. The step function will return the new state $s'$ and the reward. Consider step function as the feedback of the environment to your action.
Eventually you just need to update your Q-function according to:
$Q(s,a)=Q(s,a)+\alpha\left[r+\gamma\underset{a'}{\max(Q(s',a)})-Q(s,a)\right]$
Set $s=s'$ and repeat the whole process till convergence.
Hope this helps
Not sure if this helps, but here is a write up explaining Q learning through an example of a robot. There is also some R code in there if you want to try it out yourself.
Problem Statement is somewhat like this:
Given a website, we have to classify it into one of the two predefined classes (say whether its an e-commerce website or not?)
We have already tried Naive Bayes Algorithms for this with multiple pre-processing techniques (stop word removal, stemming etc.) and proper features.
We want to increase the accuracy to 90 or somewhat closer, which we are not getting from this approach.
The issue here is, while evaluating the accuracy manually, we look for a few identifiers on web page (e.g. Checkout button, Shop/Shopping,paypal and many more) which are sometimes missed in our algorithms.
We were thinking, if we are too sure of these identifiers, why don't we create a rule based classifier where we will classify a page as per a set of rules(which will be written on the basis of some priority).
e.g. if it contains shop/shopping and has checkout button then it's an ecommerce page.
And many similar rules in some priority order.
Depending on a few rules we will visit other pages of the website as well (currently, we visit only home page which is also a reason of not getting very high accuracy).
What are the potential issues that we will be facing with rule based approach? Or it would be better for our use case?
Would be a good idea to create those rules with sophisticated algorithms(e.g. FOIL, AQ etc)?
The issue here is, while evaluating the accuracy manually, we look for a few identifiers on web page (e.g. Checkout button, Shop/Shopping,paypal and many more) which are sometimes missed in our algorithms.
So why don't you include this information in your classification scheme? It's not hard to find a payment/checkout button in the html, so the presence of these should definitely be features. A good classifier relies on two things- good data, and good features. Make sure you have both!
If you must do a rule based classifier, then think of it more or less like a decision tree. It's very easy to do if you are using a functional programming language- basically just recurse until you hit an endpoint, at which point you are given a classification.
A Decision Tree algorithm can take your data and return a rule set for prediction of unlabeled instances.
In fact, a decision tree is really just a recursive descent partitioner comprised of a set of rules in which each rule sits at a node in the tree and application of that rule on an unlabeled data instance, sends this instance down either the left fork or right fork.
Many decision tree implementations explicitly generate a rule set, but this isn't necesary, because the rules (both what the rule is and the position of that rule in the decision flow) are easy to see just by looking at the tree that represents the trained decision tree classifier.
In particular, each rule is just a Boolean test for a particular value in a particular feature (data column or field).
For instance, suppose one of the features in each data row describes the type of Application Cache; further suppose that this feature has three possible values, memcache, redis, and custom. Then a rule might be Applilcation Cache | memcache, or does this data instance have an Application Cache based on redis?
The rules extracted from a decision tree are Boolean--either true or false. By convention False is represented by the left edge (or link to the child node below and to the left-hand-side of this parent node); and True is represented by the right-hand-side edge.
Hence, a new (unlabeled) data row begins at the root node, then is sent down either the right or left side depending on whether the rule at the root node is answered True or False. The next rule is applied (at least level in the tree hierarchy) until the data instance reaches the lowest level (a node with no rule, or leaf node).
Once the data point is filtered to a leaf node, then it is in essence classified, becasue each leaf node has a distribution of training data instances associated with it (e.g., 25% Good | 75% Bad, if Good and Bad are class labels). This empirical distribution (which in the ideal case is comprised of a data instances having just one class label) determines the unknown data instances's estimated class label.
The Free & Open-Source library, Orange, has a decision tree module (implementations of specific ML techniques are referred to as "widgets" in Orange) which seems to be a solid implementation of C4.5, which is probably the most widely used and perhaps the best decision tree implementation.
An O'Reilly Site has a tutorial on decision tree construction and use, including source code for a working decision tree module in python.
I have a real-time domain where I need to assign an action to N actors involving moving one of O objects to one of L locations. At each time step, I'm given a reward R, indicating the overall success of all actors.
I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 500000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 25000000 potential actions per actor.
Nearly all reinforcement learning algorithms don't seem to be suitable for this domain.
First, they nearly all involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever to converge a policy using something as primitive as Q-learning, even if I used function approximation. Even if I could, it would take too long to find the best action out of a million actions in each time step.
Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors.
How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm.
Clarifying the problem
N=10 actors
O=50 objects
L=1K locations
S=50 features
As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in any one of the L locations in the least amount of time. The action space consist of decisions on which actor should be moving which object to which location at any point in time. The state space consists of some 50 X-dimensional environmental factors that include features such as proximity of actors and objects to walls and to each other. So, at first glance, you have XS(OL)N action values, with most action dimensions discrete.
The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.
Solving a different problem
We look at a single actor. Say, it knows it's own position in the warehouse, positions of the other 9 actors, positions of the 50 objects, and 1000 locations. It wants to achieve maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.
Suppose, we have a P-dimensional representation of position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4P-dimensional state space and a P2-dimensional action space. In other words, we have a 4PP2-dimensional value function. By futher experimenting with representation, using different-precision encoding for different parameters, and using options 2, it might be possible to bring the problem into the practical realm.
For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers 1 and 2.
1 Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p.101.
2 Konidaris, G. & Barto, A., 2009. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining Y. Bengio et al., eds. Advances in Neural Information Processing Systems, 18, pp.1015-1023.