State dependent action set in reinforcement learning - machine-learning

How do people deal with problems where the legal actions differ from state to state? In my case I have about 10 actions in total, and the legal action sets do not overlap: in certain states the same 3 actions are always legal, and those actions are never legal in other types of states.
I'm also interested in seeing whether the solutions would be different if the legal actions were overlapping.
For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value (i.e. instead of choosing the max over all actions, I choose the max among legal actions).
For Policy-Gradient type of methods I'm less sure of what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?

There are two closely related works from the past two years:
[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).
[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.

Currently this problem does not seem to have a single, universal, straightforward answer. Maybe because it is not that much of an issue?
Your suggestion of choosing the best Q value among the legal actions is actually one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking the illegal actions and rescaling the probabilities of the remaining actions so that they sum to 1.
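As a rough illustration of both ideas, here is a minimal sketch in plain Python. The function names, the list-based representation of Q-values and logits, and the default `gamma` are assumptions made for this example; they are not taken from any particular library.

```python
import math

def masked_max_q(q_values, legal_actions):
    """Largest Q-value among the legal actions only."""
    return max(q_values[a] for a in legal_actions)

def q_target(reward, next_q_values, next_legal_actions, gamma=0.99, done=False):
    """One-step Q-learning target that ignores illegal actions in the next state."""
    if done or not next_legal_actions:
        return reward
    return reward + gamma * masked_max_q(next_q_values, next_legal_actions)

def masked_softmax(logits, legal_actions):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    legal = set(legal_actions)
    top = max(logits[a] for a in legal)  # subtract the max for numerical stability
    exps = [math.exp(x - top) if a in legal else 0.0 for a, x in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]
```

With a neural network, the same masking is usually applied to the logits before the softmax, so that the log-probabilities used in the policy-gradient loss already come from the renormalized distribution.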
Another approach would be giving negative rewards for choosing an illegal action, or ignoring the choice and not making any change in the environment, returning the same reward as before. In one of my own projects (using Q-learning) I chose the latter, and the agent learned what it had to learn, but it was using the illegal actions as a 'no action' action from time to time. It wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
As you see, these solutions don't change or differ when the actions are 'overlapping'.
Answering what you've asked in the comments - I don't believe you can train the agent in described conditions without him learning the legal/illegal actions rules. This would need, for example, something like separate networks for each set of legal actions and doesn't sound like the best idea (especially if there are lots of possible legal action sets).
But is the learning of these rules hard?
You have to answer some questions yourself - is the condition, that makes the action illegal, hard to express/articulate? It is, of course, environment-specific, but I would say that it is not that hard to express most of the time and agents just learn them during training. If it is hard, does your environment provide enough information about the state?

Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you simply reflect it in the reward function (a big negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in the specific states.
In exploration mode, the agent might still choose to take the illegal actions. However, in exploitation mode it should avoid them.
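A minimal sketch of that idea, assuming a hypothetical `env_step(state, action)` function that returns `(next_state, reward, done)`. The penalty value and the choice to end the episode are assumptions for illustration; the right values depend on your environment.

```python
ILLEGAL_PENALTY = -100.0  # assumed value; pick something large relative to normal rewards

def guarded_step(env_step, state, action, legal_actions, end_on_illegal=True):
    """Apply `action` if it is legal; otherwise return a penalty and leave the state unchanged.

    `env_step` is a hypothetical environment function returning
    (next_state, reward, done).
    """
    if action not in legal_actions:
        return state, ILLEGAL_PENALTY, end_on_illegal
    return env_step(state, action)
```

Setting `end_on_illegal=False` gives the "ignore the move" variant, where the agent stays in the same state and simply collects the penalty.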

I recently built a DDQ agent for Connect Four and had to address this. Whenever the agent chose a column that was already full of tokens, I set the reward equal to the reward for losing the game. This was -100 in my case, and it worked well.
In connect four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equivalent to losing and not a smaller negative number.
So if you set the negative reward greater than losing, you'll have to consider in your domain what are the implications of allowing illegal moves to happen in exploration.

Related

Monte Carlo Tree Search - intuition behind child selection function for games of two players with opposite goals

Simple question on hello world example of the MCTS for tic-tac-toe,
Let's assume we are given a board and we want to make an optimal decision. As I understand it, the choice of consecutive nodes during simulation (until a leaf is reached) is determined by an exploration/exploitation trade-off function (as described on Wikipedia). I really wonder what the intuition is behind the first component (exploitation) of the function, especially for games between two players with opposite goals. The meaning of "the most promising" then changes depending on who makes a move. Shouldn't this function change depending on who makes the next move (especially its first component)?
Yes, that exploitation part of the equation should be implemented to take into account the evaluations from the perspective of the agent/player who gets to select an action in that node.
For single-agent settings, the implementation is straightforward; simply always maximize.
For zero-sum, turn-based, two-player settings, you'd want to alternate between maximizing and minimizing that exploitation part of the equation (note: always maximize the exploration term!). This can also be implemented by simply multiplying that term by -1 in nodes where the opponent gets to move.
Other settings are possible too, but require slightly more implementation effort (e.g. keeping different average scores for different players in settings which are not zero-sum or have more than two players)
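For the two-player zero-sum case, the sign flip can be sketched as follows. This assumes the standard UCB1-style selection formula and that each child's average value is stored from the root player's perspective; the function names and the constant `c` are illustrative choices.

```python
import math

def uct_score(child_mean, child_visits, parent_visits, maximizing, c=math.sqrt(2)):
    """UCB1-style score from the point of view of the player to move.

    child_mean is the average result from the root player's perspective.
    """
    if child_visits == 0:
        return math.inf  # always try unvisited children first
    # Flip the sign of the exploitation term when the opponent is to move.
    exploitation = child_mean if maximizing else -child_mean
    # The exploration term is never negated.
    exploration = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploitation + exploration

def select_child(children, parent_visits, maximizing):
    """children: list of (mean_value, visit_count); returns the index of the best child."""
    scores = [uct_score(m, n, parent_visits, maximizing) for m, n in children]
    return scores.index(max(scores))
```

An equivalent design is to store each node's value from the perspective of the player who moved into it, in which case every node maximizes and no explicit flag is needed.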

Dynamic Process parameter adjustment in Semiconductor manufacturing data

I have process parameter data from semiconductor manufacturing, and the requirement is to suggest the best adjustments to the process parameters to get better yield, i.e. the best path to high yield. Which machine learning or statistical models best suit this requirement?
Note: I have thought of using a decision tree, which can give us the best path for high yield.
I would like to know if any other methods could be more efficient.
The data looks like:
lotno x1 x2 x3 x4 x5 yield(%)
A yield below 95% is labeled 0 and a yield above 95% is labeled 1.
I'm not really sure of the question here, but as a former semiconductor process engineer, here is how I look at the yield improvement approach - perspective.
Process Development.
DOE: Typically, I would run structured DOEs to understand my process (#4). I would first identify "potential" factors and run various "screening" experiments to identify statistical significance, with the goal of identifying the most statistically significant (and, for that matter, least significant) factors. These are inherently simple experiments with a low number of "levels"; they don't target understanding of the curvature of the response surface, they just look for the magnitude of change in response vs. factor. Generally, I am most concerned with "process" factors, but it is important to recognize that the influence of variable inputs can come from more than just "machine knobs", for example. Variability can arise from 1) People, 2) Environment (moisture, temp, etc.), 3) Consumables (used in the process), 4) Equipment (is 40 psi on this tool really 40 psi, and the same as 40 psi on a different tool?), 4) Process variable settings.
With the most statistically significant factors, I would run more elaborate DOE using the major factors and analyze this data to develop a model. There are generally more 'levels' used here to allow for curvature insight of the response surface via the analysis. There are many types of well known standard experimental design structures here. And there is software such as JMP that is specifically set up to do this analysis.
From here, the idea would be to generate a model in the form of Response = F (Factors). That allows you to essentially optimize the response based upon these factors where the response is a reflection of your yield criteria.
From here, the engineer would typically execute confirmation runs with optimized factors to confirm optimized response.
Note that the software analysis typically allows the engineer to illuminate any run order dependence. The execution of the DOE is typically performed in a randomized cell fashion. (Each 'cell' is a set of conditions for the experiment.) Similarly, the experiments include some level of repetition to gauge the 'repeatability' of the 'system'. This inclusion can be explicit (run the same cell twice), but there is also some level of repeatability inherent in the design, since you are running multiple cells, albeit at different settings. But generally, the experiment includes explicitly repeated cells.
And finally there is the concept of manufacturability, which includes constraints of time, cost, physical limits, equipment capability, etc. (The ideal process works great, but it takes 10 years, costs 1 million dollars and requires projected settings outside the capability of the tool.)
Since you have manufacturing data, hopefully, you have the data that captures the other types of factors as well (1,2,3), so you should specifically analyze the data to try to identify such effects. This is typically done as A vs B comparisons. Person A vs B, Tool A vs B, Consumable A vs B, Consumable lot A vs B, Summer vs Winter, etc.
Basically, there are all sorts of comparisons you could envision here, checking for statistical differences between two populations.
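One common way to run such an A vs B comparison is Welch's t-test, which does not assume equal variances in the two groups. The answer does not name a specific test, so treat this as one possible choice. The sketch below computes only the t-statistic using the Python standard library; the yield numbers are made up for illustration.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    var_a = variance(sample_a) / len(sample_a)  # squared standard error of mean A
    var_b = variance(sample_b) / len(sample_b)  # squared standard error of mean B
    return (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)

# Hypothetical lot yields (%) from two tools running the same recipe.
tool_a = [96.1, 95.8, 96.4, 96.0]
tool_b = [94.9, 95.2, 94.7, 95.0]
t_stat = welch_t(tool_a, tool_b)
```

A large absolute value of `t_stat` suggests the two populations differ. Turning it into a p-value additionally requires the Welch-Satterthwaite degrees of freedom and a t-distribution, which statistics packages such as JMP handle for you.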
A comment on response: What are the yield criteria? You should know this in order to formulate the model. For semiconductors, we have line yield (process yield), but there is also device yield. I assume that for your work you are primarily concerned with line yield. So minimizing variability in the factors (from 1,2,3,4) to achieve the desired response (target response(s) with minimal variability) is the primary goal.
APC (Advanced Process Control).
In many cases, there is significant trending that results from whatever reason; crappy tool control (the tool heats up), crappy consumable (the target material wears, the polishing pad wears, the chemical bath gets loaded, whatever), and so the idea here is how to adjust the next batch/lot/wafer based upon the history of what came prior. Either improve the manufacturing to avoid/minimize this trending (run order dependence) or adjust process to accommodate it to achieve the desired response.
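One widely used scheme for this kind of run-to-run adjustment is an EWMA controller. The answer does not name a specific method, so this is a swapped-in illustration. It assumes a simple linear process model, response = gain * setting + offset, where the gain is known and the slowly varying offset is estimated from each measured lot. The class name and parameters are hypothetical.

```python
class EwmaController:
    """Run-to-run controller that tracks a drifting process offset with an EWMA."""

    def __init__(self, target, gain, weight=0.3, offset=0.0):
        self.target = target    # desired response
        self.gain = gain        # assumed known slope of response vs. setting
        self.weight = weight    # EWMA weight given to the newest observation
        self.offset = offset    # current estimate of the process offset

    def recipe(self):
        """Setting expected to hit the target given the current offset estimate."""
        return (self.target - self.offset) / self.gain

    def update(self, setting, measured):
        """Blend the offset implied by the latest run into the running estimate."""
        observed_offset = measured - self.gain * setting
        self.offset = self.weight * observed_offset + (1 - self.weight) * self.offset
```

Before each lot you call `recipe()` to get the setting, run the lot, then call `update(setting, measured)` with the measured response. A larger `weight` reacts faster to drift but passes more measurement noise into the recipe.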
Time for lunch, hope this helps...if you post on the specific process module type, and even equipment and consumables, I might be able to provide more insight.

Is this a correct implementation of Q-Learning for Checkers?

I am trying to understand Q-Learning,
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see if it is contained in the lookup table and initialise it if not (With a default utility of 0).
3. Choose an action to take with a probability of:
(ϵ, with 0 < ϵ < 1, is the probability of taking a random action)
1-ϵ = Choosing the state-action pair with the highest utility.
ϵ = Choosing a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
Q(s_t, a_t) += a * [r_{t+1} + d * max_a' Q(s_{t+1}, a') - Q(s_t, a_t)]
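For reference, steps 1 to 4 can be sketched as a small tabular implementation. This is a generic Q-learning sketch and not the asker's code: the function names, the `(state, action)` dictionary keys, and the default `alpha` and `gamma` values are assumptions.

```python
import random
from collections import defaultdict

def choose_action(q_table, state, actions, epsilon, rng=random):
    """Epsilon-greedy: a random move with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # A terminal next state has no actions, so its value is taken to be 0.
    best_next = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    td_error = reward + gamma * best_next - q_table[(state, action)]
    q_table[(state, action)] += alpha * td_error
    return q_table[(state, action)]

q = defaultdict(float)  # unseen state-action pairs start with utility 0
```

Note that the update uses the maximum over the actions available in the next state, which is what makes this Q-learning rather than SARSA.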
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results - The results are very poor, even after a couple hundred games, the Q-Learning agent is losing a lot more than it is winning. Furthermore, the change in win-rate is almost non-existent, especially after reaching a couple hundred games.
Am I missing something? I have implemented a couple agents:
(Rote-Learning, TD(0), TD(Lambda), Q-Learning)
But they all seem to be yielding similar, disappointing, results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want a simplified state representation, like a neural network, to solve this kind of problem using reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.

NeuroEvolution of Augmenting Topologies (NEAT) and global innovation number

I was not able to find out why we should have a global innovation number for every new connection gene in NEAT.
From my limited knowledge of NEAT, every innovation number corresponds directly to a (node_in, node_out) pair, so why not use this pair of IDs instead of the innovation number? What new information does the innovation number carry? Chronology?
Update
Is it an algorithm optimization?
Note: this more of an extended comment than an answer.
You encountered a problem I also just encountered whilst developing a NEAT version for javascript. The original paper published in ~2002 is very unclear.
The original paper contains the following:
Whenever a new gene appears (through structural mutation), a global innovation number is incremented and assigned to that gene. The innovation numbers thus represent a chronology of the appearance of every gene in the system. [..]; innovation numbers are never changed. Thus, the historical origin of every gene in the system is known throughout evolution.
But the paper is very unclear about the following case. Say we have two 'identical' (same structure) networks:
The networks above were initial networks; the networks have the same innovation ID, namely [0, 1]. So now the networks randomly mutate an extra connection.
Boom! By chance, they mutated to the same new structure. However, the connection ID's are completely different, namely [0, 2, 3] for parent1 and [0, 4, 5] for parent2 as the ID is globally counted.
But the NEAT algorithm fails to determine that these structures are the same. When one of the parents scores higher than the other, it's not a problem. But when the parents have the same fitness, we have a problem.
Because the paper states:
In composing the offspring, genes are randomly chosen from either parent at matching genes, whereas all excess or disjoint genes are always included from the more fit parent, or if they are equally fit, from both parents.
So if the parents are equally fit, the offspring will have connections [0, 2, 3, 4, 5], which means that some nodes have double connections. If you remove the global innovation counter and just assign IDs by looking at node_in and node_out, you avoid this problem.
So when you have equally fit parents, yes you have optimized the algorithm. But this is almost never the case.
Quite interesting: in the newer version of the paper, they actually removed that bolded line! Older version here.
By the way, you can solve this problem by assigning IDs based on node_in and node_out using pairing functions instead of assigning innovation IDs. This creates quite interesting neural networks when fitness is equal.
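The answer does not say which pairing function it used. One standard choice is the Cantor pairing function, which maps every ordered pair of non-negative integers to a distinct integer. A minimal sketch, with a hypothetical function name:

```python
def cantor_pair(node_in, node_out):
    """Map an ordered pair of non-negative integers to a unique integer."""
    s = node_in + node_out
    return s * (s + 1) // 2 + node_out

# The same connection always gets the same ID, no matter which genome mutated it
# or when; the direction matters, so (2, 3) and (3, 2) get different IDs.
connection_id = cantor_pair(2, 3)
```

With IDs derived this way, two genomes that independently add the same connection produce matching genes rather than disjoint ones.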
I can't provide a detailed answer, but the innovation number enables certain functionality within the NEAT model to be optimal (like calculating the species of a gene), as well as allowing crossover between the variable length genomes. Crossover is not necessary in NEAT, but it can be done, due to the innovation number.
I got all my answers from here:
http://nn.cs.utexas.edu/downloads/papers/stanley.ec02.pdf
It's a good read
During crossover, we have to consider two genomes that share a connection between the two same nodes in their personal neural networks. How do we detect this collision without iterating both genome's connection genes over and over again for each step of crossover? Easy: if both connections being examined during crossover share an innovation number, they are connecting the same two nodes because they received that connection from the same common ancestor.
Easy Example:
If I am a genome with a specific connection gene with innovation number 'i', my children that take gene 'i' from me may eventually cross over with each other in 100 generations. We have to detect when these two evolved versions (alleles) of my gene 'i' are in collision to prevent taking both. Taking two of the same gene would cause the phenotype to probably loop and crash, killing the genotype.
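The alignment step can be sketched as follows. Genomes are represented here as dictionaries from innovation number to gene (just a connection weight, to keep the example small). That representation, the function name, and the 50/50 choice at matching genes are assumptions for illustration, and the equal-fitness case is not handled.

```python
import random

def crossover(fitter, other, rng=random):
    """Build a child genome from two parents keyed by innovation number.

    Matching genes (same innovation number in both parents) are picked at
    random from either parent; disjoint and excess genes come from the
    fitter parent only.
    """
    child = {}
    for innovation, gene in fitter.items():
        if innovation in other and rng.random() < 0.5:
            child[innovation] = other[innovation]
        else:
            child[innovation] = gene
    return child

parent_a = {0: 0.1, 2: 0.5, 3: -0.3}   # fitter parent
parent_b = {0: 0.9, 4: 0.2}
child = crossover(parent_a, parent_b)
```

Because each innovation number is a dictionary key, a matching gene is found with a single lookup and the child can never end up with two copies of the same gene.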
When I created my first implementation of NEAT I thought the same: why would you keep an innovation number tracker, and why would you use it only for one generation? Wouldn't it be better not to keep it at all and use a key-value pair with the connected nodes?
Now that I am implementing my third revision I can see what Kenneth Stanley tried to do with them and why he wanted to keep them only for one generation.
When a connection is created, it starts its optimization at that moment. That marks its origin. If the same connection pops up in another generation, it will start its optimization then. Generation numbers try to separate the ones that come from a common ancestor, so that connections that have been optimized for many generations are not put side by side with one that was just generated. If the same connection is found in two genomes, that means the gene comes from the same origin and thus can be aligned.
Now imagine that you have your generation champion. Some of its genes will have a 50 percent chance of being lost, because aligned genes are treated equally.
What is better...? I haven't seen any experiments comparing the two approaches.
Kenneth Stanley also addressed this issue in the NEAT users page: https://www.cs.ucf.edu/~kstanley/neat.html
Should a record of innovations be kept around forever, or only for the current
generation?
In my implementation of NEAT, the record is only kept for a generation, but there
is nothing wrong with keeping them around forever. In fact, it may work better.
Here is the long explanation:
The reason I didn't keep the record around for the entire run in my
implementation of NEAT was because I felt that calling something the same
mutation that happened under completely different circumstances was not
intuitive. That is, it is likely that several generations down the line, the
"meaning" or contribution of the same connection relative to all the other
connections in a network is different than it would have been if it had appeared
generations ago. I used a single generation as a yardstick for this kind of
situation, although that is admittedly ad hoc.
That said, functionally speaking, I don't think there is anything wrong with
keeping innovations around forever. The main effect is to generate fewer species.
Conversely, not keeping them around leads to more species..some of them
representing the same thing but separated nonetheless. It is not currently clear
which method produces better results under what circumstances.
Note that as species diverge, calling a connection that appeared in one species a
different name than one that appeared earlier in another just increases the
incompatibility of the species. This doesn't change things much since they were
incompatible to begin with. On the other hand, if the same species adds a
connection that it added in an earlier generation, that must mean some members of
the species had not adopted that connection yet...so now it is likely that the
first "version" of that connection that starts being helpful will win out, and
the other will die away. The third case is where a connection has already been
generally adopted by a species. In that case, there can be no mutation creating
the same connection in that species since it is already taken. The main point is,
you don't really expect too many truly similar structures with different markings
to emerge, even with only keeping the record around for 1 generation.
Which way works best is a good question. If you have any interesting experimental
results on this question, please let me know.
My third revision will allow both options. I will add more information to this answer when I have results about it.

SARSA Implementation

I am learning about SARSA algorithm implementation and had a question. I understand that the general "learning" step takes the form of:
Robot (r) is in state s. There are four actions available:
North (n), East (e), West (w) and South (s)
such that the list of Actions,
a = {n,w,e,s}
The robot randomly picks an action, and updates as follows:
Q(a,s) = Q(a,s) + L[r + DQ(a',s1) - Q(a,s)]
Where L is the learning rate, r is the reward associated with (a,s), Q(a',s1) is the expected reward from an action a' in the new state s1, and D is the discount factor.
Firstly, I don't understand the role of the term - Q(a,s). Why are we re-subtracting the current Q-value?
Secondly, when picking actions a and a', why do these have to be random? I know that in some implementations of SARSA all possible Q(a',s1) are taken into account and the highest value is picked. (I believe this is Epsilon-Greedy?) Why not do this also to pick which Q(a,s) value to update? Or why not update all Q(a,s) for the current s?
Finally, why is SARSA limited to one-step lookahead? Why not, say, also look into a hypothetical Q(a'',s2)?
I guess overall my questions boil down to: what makes SARSA better than a breadth-first or depth-first search algorithm?
Why do we subtract Q(a,s)? r + DQ(a',s1) is the reward that we got on this run through from getting to state s by taking action a. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same action after getting to state s from action a, and the rewards associated with going to future states will change in the future. So we can't just set Q(a,s) equal to r + DQ(a',s1). Instead, we just want to push it in the right direction so that it will eventually converge on the right value. So we look at the error in prediction, which requires subtracting Q(a,s) from r + DQ(a',s1). This is the amount that we would need to change Q(a,s) by in order to make it perfectly match the reward that we just observed. Since we don't want to do that all at once (we don't know if this is always going to be the best option), we multiply this error term by the learning rate, L, and add this value to Q(a,s) for a more gradual convergence on the correct value.
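A small numerical sketch makes the role of the error term concrete. It uses the question's symbols (L, D, and Q indexed as (action, state)); the function name, the dictionary representation, and the numbers are assumptions for illustration.

```python
def sarsa_update(q, s, a, r, s1, a1, L=0.1, D=0.9):
    """Q(a,s) += L * (r + D*Q(a',s1) - Q(a,s)), with unseen entries treated as 0."""
    error = r + D * q.get((a1, s1), 0.0) - q.get((a, s), 0.0)
    q[(a, s)] = q.get((a, s), 0.0) + L * error
    return q[(a, s)]

# Hypothetical table: taking 'n' in state 'B' is currently valued at 10.
q = {("n", "B"): 10.0}
first = sarsa_update(q, "A", "e", 1.0, "B", "n", L=0.5)   # error is 1 + 9 - 0
second = sarsa_update(q, "A", "e", 1.0, "B", "n", L=0.5)  # error is 1 + 9 - first
```

Each call moves Q(e,A) only part of the way towards r + D*Q(n,B). Because the current estimate is subtracted, the error shrinks on every repeat of the same experience and the value settles instead of growing without bound.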
Why do we pick actions randomly? The reason to not always pick the next state or action in a deterministic way is basically that our guess about which state is best might be wrong. When we first start running SARSA, we have a table full of 0s. We put non-zero values into the table by exploring those areas of state space and finding that there are rewards associated with them. As a result, something not terrible that we have explored will look like a better option than something that we haven't explored. Maybe it is. But maybe the thing that we haven't explored yet is actually way better than we've already seen. This is called the exploration vs exploitation problem - if we just keep doing things that we know work, we may never find the best solution. Choosing next steps randomly ensures that we see more of our options.
Why can't we just take all possible actions from a given state? This will force us to basically look at the entire learning table on every iteration. If we're using something like SARSA to solve the problem, the table is probably too big to do this for in a reasonable amount of time.
Why can SARSA only do one-step look-ahead? Good question. The idea behind SARSA is that it's propagating expected rewards backwards through the table. The discount factor, D, ensures that in the final solution you'll have a trail of gradually increasing expected rewards leading to the best reward. If you filled in the table at random, this wouldn't always be true. This doesn't necessarily break the algorithm, but I suspect it leads to inefficiencies.
Why is SARSA better than search? Again, this comes down to an efficiency thing. The fundamental reason that anyone uses learning algorithms rather than search algorithms is that search algorithms are too slow once you have too many options for states and actions. In order to know the best action to take from any other state action pair (which is what SARSA calculates), you would need to do a search of the entire graph from every node. This would take O(s*(s+a)) time. If you're trying to solve real-world problems, that's generally too long.
