Testing interaction term in longitudinal model

I was wondering if someone could provide me with a good resource/paper on how I should test interactions when I also have a time variable. For example, a colleague told me that I should test my interactions like this, but I would like a paper that supports it:
outcome (continuous) = exposure (dichotomous) + time + interaction variable (dichotomous) + exposure × interaction variable + exposure × time + covariates
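To make that concrete, here is roughly what I mean in Python with statsmodels (a mixed model with a random intercept per subject; "modifier" stands in for the dichotomous interaction variable, and the data frame and the column names exposure, time, modifier, age and subject_id are just placeholders):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per subject per visit.
df = pd.read_csv("longitudinal_data.csv")

# exposure:time tests whether the exposure groups differ in their trajectory
# over time; exposure:modifier tests effect modification of the exposure effect.
model = smf.mixedlm(
    "outcome ~ exposure * time + exposure * modifier + age",
    data=df,
    groups=df["subject_id"],
)
result = model.fit()
print(result.summary())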
Any thoughts on this? Thank you.

Related

State-dependent action set in reinforcement learning

How do people deal with problems where the legal actions differ between states? In my case I have about 10 actions in total and the legal action sets do not overlap, meaning that in certain types of states the same 3 actions are always legal, and those actions are never legal in the other types of states.
I'm also interested in seeing whether the solutions would differ if the legal action sets were overlapping.
For Q-learning (where my network gives me the values for state/action pairs), I was thinking maybe I could just be careful about which Q value to choose when I'm constructing the target value (i.e., instead of choosing the max over all actions, I choose the max among legal actions...).
For policy-gradient methods I'm less sure what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?
There are two closely related works from the last couple of years:
[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).
[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.
Currently this problem does not seem to have one universal, straightforward answer. Maybe that is because it is not that much of an issue?
Your suggestion of choosing the best Q value among the legal actions is actually one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking out the illegal actions and renormalizing the probabilities of the remaining actions.
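A minimal sketch of both ideas in plain NumPy (the action count and the mask below are made up for illustration):

import numpy as np

def masked_max_q(q_values, legal_mask):
    # Max Q over legal actions only, e.g. for building the TD target.
    masked = np.where(legal_mask, q_values, -np.inf)
    return masked.max()

def masked_policy(logits, legal_mask):
    # Softmax over legal actions only: illegal actions get probability 0
    # and the remaining probabilities are renormalized to sum to 1.
    masked_logits = np.where(legal_mask, logits, -np.inf)
    shifted = masked_logits - masked_logits.max()  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# 5 actions, only actions 0, 2 and 4 are legal in this state
q = np.array([1.0, 9.0, 2.0, 7.0, 0.5])
mask = np.array([True, False, True, False, True])
print(masked_max_q(q, mask))   # 2.0, ignoring the illegal 9.0 and 7.0
print(masked_policy(q, mask))  # nonzero probability only on legal actions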
Another approach would be to give a negative reward for choosing an illegal action, or to ignore the choice, make no change in the environment, and return the same reward as before. In one of my own experiments (with Q-learning) I chose the latter and the agent learned what it had to learn, but it used illegal actions as a 'no action' action from time to time. That wasn't really a problem for me, but a negative reward would probably eliminate this behaviour.
As you can see, these solutions do not change when the legal action sets overlap.
To answer what you asked in the comments: I don't believe you can train the agent under the described conditions without it learning the legal/illegal action rules. That would require, for example, something like a separate network for each set of legal actions, which does not sound like the best idea (especially if there are many possible legal action sets).
But is learning these rules hard?
You have to answer some questions yourself. Is the condition that makes an action illegal hard to express or articulate? That is, of course, environment-specific, but I would say it is usually not that hard to express, and agents simply learn it during training. If it is hard, does your environment provide enough information about the state?
Not sure if I understand your question correctly, but if you mean that in certain states some actions are impossible, then you can simply reflect that in the reward function (a big negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are not desirable in those specific states.
In exploration mode the agent might still choose to take illegal actions. However, in exploitation mode it should avoid them.
I recently built a DDQN agent for Connect Four and had to address this. Whenever a column was chosen that was already full of tokens, I set the reward equal to the reward for losing the game. This was -100 in my case, and it worked well.
In Connect Four, allowing an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equal to that for losing and not to a smaller negative number.
So if you choose a negative reward that is milder than the one for losing, you will have to consider what the implications are in your domain of allowing illegal moves to happen during exploration.
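For concreteness, a minimal sketch of that reward shaping (the step interface and the -100 value here are just illustrative, not my actual code):

LOSS_REWARD = -100  # same penalty as losing the game

def step_with_illegal_penalty(env, state, action, legal_actions):
    # Treat an illegal move as an immediate loss and end the episode.
    if action not in legal_actions:
        return state, LOSS_REWARD, True   # (next_state, reward, done): no state change
    return env.step(action)               # assumed to return the same (state, reward, done) tuple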

Why do we need to explicitly update the moving_mean and moving_variance in TensorFlow's Batch normalization in tf.contrib.layers.batch_norm?

TL;DR: How can I use batch normalization with tf.contrib.layers.batch_norm without having to explicitly tell the session whether or not to update the moving statistics (moving_mean and moving_variance)?
A few months ago I provided an answer to How could I use Batch Normalization in TensorFlow? and noticed a few weird details that I wanted to address. First, the implementation that I provided seems repetitive with respect to the is_training variable. Recall my suggested code:
import tensorflow as tf
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
                          updates_collections=None,
                          is_training=True,
                          reuse=None,  # is this right?
                          trainable=True,
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
                              updates_collections=None,
                              is_training=False,
                              reuse=True,  # is this right?
                              trainable=True,
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
In it I have a train_phase variable that just holds a TF boolean, tf.placeholder(tf.bool, name='phase_train'). As you can see, it is used to decide whether the batch norm layer should be in inference mode or not. However, the variable seems a little redundant, since I have two variables specifying the same thing twice, i.e. once in train_phase and again in is_training. Is that really necessary?
I thought about it a bit, and it seems I might be able to remove the hard-coded is_training=True/False with the following (pseudo)code:
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x, train_phase, scope_bn):
    bn = batch_norm(x, decay=0.999, center=True, scale=True,
                    updates_collections=None,
                    is_training=get_bool(train_phase),
                    reuse=None,  # is this right?
                    trainable=True,
                    scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn, lambda: bn)
    return z
which seems to make the train_phase variable completely redundant/silly. This actually highlights my most important point: are the train_phase variable and tf.cond(train_phase, lambda: bn_train, lambda: bn_inference) even necessary? It also brings up my biggest complaint about the code (though I think this code might not even run, because when the graph is being defined the placeholder train_phase might not have a value yet, but you get the idea).
Honestly, I find having to explicitly define train_phase rather dangerous, because it seems unnecessary for users to have to handle the inference/training mode of batch norm this explicitly. After all, "normal" users of batch norm should always update moving_mean and moving_variance with the training data, and no standard user of batch norm should ever update moving_mean and moving_variance with test statistics. Since the user is required to do:
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train: True})
it can cause really bad bugs for users, bugs that shouldn't even exist in the first place (at least in my opinion). Furthermore, it seems weird to have to explicitly say what phase_train is, because whenever one trains, one uses an optimizer, so it should be perfectly clear at the point where that code is called that it should be true. Maybe this is a terrible idea, but it feels like the optimizer or the session should be setting it to true automatically rather than relying on the user to get it right.
I understand that sometimes users are allowed more flexibility so that they can be more creative, but I can't really see how this (even for a researcher) is a good feature. Maybe I am just using the library incorrectly or being paranoid, but should the user really be forced to be so explicit when using batch norm? Is there some way around this?
As a side point, having phase_train be part of the model also makes the code a bit uglier and more confusing than feels necessary, because it seems unavoidable to have a line of code where the session is used to check whether the batch norm flag is on or not. The code I am trying to avoid writing is this logic:
if batch_norm:
    # during training
    sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys, phase_train: True})
else:
    # with no batch norm
    sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
It just feels totally unnecessary. It feels like, during training, the model should know whether it should be updating the variables or not.
As a quick (really ugly) solution to the last problem with the if condition around the session call, one can always define phase_train as part of the model (or at least as part of the graph) and set it to true and/or false when appropriate; when one doesn't actually use the batch norm layer, the phase_train placeholder is simply not used in the model, even if we give it a value in session.run. That is, the session sets it to true or false, but when BN is not being used it doesn't matter what one sets it to, since it is not actually used. Obviously this makes the code really confusing (since one is defining a variable one doesn't even need), but I can't seem to find a way to hide the phase_train variable. For the moment this is what I am going with, because it seems really ugly to have to split (or duplicate) my code between lines that have:
sess.run(fetches=..., feed_dict={..., phase_train: False})
and the ones that don't have it at all:
sess.run(fetches=..., feed_dict={...})
Ideally I would like the second form, and to have batch norm work regardless of whether I use the silly phase_train variable.
I don't really have a complete answer to your question, but I have a few observations:
The standard practice seems to be to build slightly different graphs for training and for inference, one with is_training enabled and one without.
The batch_norm layer is designed so that you can use an arg_scope to set is_training=True for all layers in your model. For example, take a look at how the Inception v3 model is defined here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/nets/inception_v3.py#L571. This at least makes it much more convenient to set is_training once in the Python code that builds a model and to have it apply everywhere.
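As a rough illustration of that pattern (TF 1.x / tf.contrib.slim; the layers and arguments below are just an example, not the Inception code):

import tensorflow as tf
slim = tf.contrib.slim

def build_model(inputs, is_training):
    # is_training is a plain Python bool chosen when the graph is built;
    # arg_scope applies it to every batch_norm layer created in this block.
    with slim.arg_scope([slim.batch_norm], is_training=is_training, decay=0.999):
        net = slim.conv2d(inputs, 64, [3, 3], normalizer_fn=slim.batch_norm)
        net = slim.conv2d(net, 64, [3, 3], normalizer_fn=slim.batch_norm)
        net = slim.fully_connected(slim.flatten(net), 10, activation_fn=None)
    return net

# e.g. build the training graph with is_training=True and the
# inference graph with is_training=False.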
Tensorflow's underlying infrastructure doesn't distinguish between training and inference time—it's just running graphs of operators. tf.Session doesn't really know anything about Neural Networks, training, or inference, so it isn't the right place for this kind of logic.
One could imagine that an Optimizer should rewrite the graph to enable is_training for those operators that support it. I don't have a strong opinion about this; you might try filing a Tensorflow Github issue making that feature request to see what others think about it. It might seem a bit too "magical".
Hope that helps!

Reinforcement Learning - How does an Agent know which action to pick?

I'm trying to understand Q-learning.
The basic update formula:
Q(st, at) += a [ rt+1 + d · max(Q(st+1, a)) − Q(st, at) ]
I understand the formula, and what it does, but my question is:
How does the agent know to choose Q(st, at)?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
My agents are playing checkers, so I am focusing on model-free algorithms.
All the agent knows is the current state it is in.
I understand that when it performs an action, you update the utility, but how does it know to take that action in the first place?
At the moment I have:
Check each move you could make from that state.
Pick whichever move has the highest utility.
Update the utility of the move made.
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once you have already learned the values. The most common policy for use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1−ε and a random action with probability ε.
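A minimal sketch of ε-greedy action selection (plain Python, purely illustrative):

import random

def epsilon_greedy(q_values, epsilon):
    # q_values: list of estimated action values for the current state.
    # With probability epsilon pick a random action, otherwise the greedy one.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])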

Is this a correct implementation of Q-Learning for Checkers?

I am trying to understand Q-Learning.
My current algorithm operates as follows:
1. A lookup table is maintained that maps a state to information about its immediate reward and utility for each action available.
2. At each state, check to see whether it is contained in the lookup table and initialise it if not (with a default utility of 0).
3. Choose an action to take with probability:
(ϵ, where 0 < ϵ < 1, is the probability of taking a random action)
1 − ϵ: choose the state-action pair with the highest utility.
ϵ: choose a random move.
ϵ decreases over time.
4. Update the current state's utility based on:
Q(st, at) += a [ rt+1 + d · max(Q(st+1, a)) − Q(st, at) ]
I am currently playing my agent against a simple heuristic player, who always takes the move that will give it the best immediate reward.
The results are very poor; even after a couple of hundred games, the Q-learning agent is losing far more than it is winning. Furthermore, the change in win rate is almost non-existent, especially after the first couple of hundred games.
Am I missing something? I have implemented a couple agents:
(Rote-Learning, TD(0), TD(Lambda), Q-Learning)
But they all seem to be yielding similarly disappointing results.
There are on the order of 10²⁰ different states in checkers, and you need to play a whole game for every update, so it will be a very, very long time until you get meaningful action values this way. Generally, you'd want a function approximator over a compact state representation, such as a neural network, to solve this kind of problem with reinforcement learning.
Also, a couple of caveats:
Ideally, you should update 1 value per game, because the moves in a single game are highly correlated.
You should initialize action values to small random values to avoid large policy changes from small Q updates.
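For the second caveat, with a tabular agent that could look something like this sketch (the value range is just illustrative):

import random
from collections import defaultdict

# Q-table keyed by (state, action); each unseen entry starts at a small
# random value instead of 0.
Q = defaultdict(lambda: random.uniform(-0.01, 0.01))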

SARSA Implementation

I am learning about the SARSA algorithm and its implementation, and I have a question. I understand that the general "learning" step takes the form of:
Robot (r) is in state s. There are four actions available:
North (n), East (e), West (w) and South (s)
so that the set of actions is
a = {n, w, e, s}
The robot randomly picks an action, and updates as follows:
Q(a,s) = Q(a,s) + L[r + D·Q(a',s') − Q(a,s)]
where L is the learning rate, r is the reward associated with (a,s), Q(a',s') is the expected reward from taking action a' in the new state s', and D is the discount factor.
Firstly, I don't understand the role of the term −Q(a,s); why are we subtracting the current Q-value?
Secondly, when picking actions a and a', why do these have to be random? I know that in some implementations of SARSA all possible Q(a',s') are taken into account and the highest value is picked (I believe this is epsilon-greedy?). Why not do this also to pick which Q(a,s) value to update? Or why not update all Q(a,s) for the current s?
Finally, why is SARSA limited to a one-step lookahead? Why not, say, also look at a hypothetical Q(a'',s'')?
I guess overall my questions boil down to what makes SARSA better than another breadth-first or depth-first search algorithm.
Why do we subtract Q(a,s)? The quantity r + D·Q(a',s') is the estimate of the return we just observed for taking action a in state s on this pass. In theory, this is the value that Q(a,s) should be set to. However, we won't always take the same next action a' after being in state s and taking action a, and the rewards associated with future states will change over time, so we can't just set Q(a,s) equal to r + D·Q(a',s'). Instead, we just want to push it in the right direction so that it eventually converges on the right value. So we look at the error in the prediction, which requires subtracting Q(a,s) from r + D·Q(a',s'). This is the amount we would need to change Q(a,s) by in order to make it perfectly match the return we just observed. Since we don't want to do that all at once (we don't know if this will always be the best option), we multiply this error term by the learning rate, L, and add the result to Q(a,s), giving a more gradual convergence on the correct value.
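In code, the update just described looks roughly like this sketch (a made-up dictionary representation of Q; L and D as above):

def sarsa_update(Q, s, a, r, s_next, a_next, L=0.1, D=0.9):
    # Q is a dict mapping (a, s) pairs to current value estimates.
    target = r + D * Q.get((a_next, s_next), 0.0)   # r + D*Q(a', s')
    error = target - Q.get((a, s), 0.0)             # prediction error
    Q[(a, s)] = Q.get((a, s), 0.0) + L * error      # gradual move toward the target
    return Q[(a, s)]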
Why do we pick actions randomly? The reason not to always pick the next state or action deterministically is basically that our guess about which state is best might be wrong. When we first start running SARSA, we have a table full of 0s. We put non-zero values into the table by exploring those areas of the state space and finding that there are rewards associated with them. As a result, something not terrible that we have explored will look like a better option than something we haven't explored. Maybe it is. But maybe the thing we haven't explored yet is actually way better than anything we've already seen. This is called the exploration vs. exploitation problem: if we just keep doing things that we know work, we may never find the best solution. Choosing next steps randomly ensures that we see more of our options.
Why can't we just take all possible actions from a given state? This would force us to look at essentially the entire learning table on every iteration. If we're using something like SARSA to solve the problem, the table is probably too big to do this in a reasonable amount of time.
Why can SARSA only do one-step look-ahead? Good question. The idea behind SARSA is that it's propagating expected rewards backwards through the table. The discount factor, D, ensures that in the final solution you'll have a trail of gradually increasing expected rewards leading to the best reward. If you filled in the table at random, this wouldn't always be true. This doesn't necessarily break the algorithm, but I suspect it leads to inefficiencies.
Why is SARSA better than search? Again, this comes down to an efficiency thing. The fundamental reason that anyone uses learning algorithms rather than search algorithms is that search algorithms are too slow once you have too many options for states and actions. In order to know the best action to take from any other state action pair (which is what SARSA calculates), you would need to do a search of the entire graph from every node. This would take O(s*(s+a)) time. If you're trying to solve real-world problems, that's generally too long.
