Neural Network-based Reinforcement Learning with variable action set - machine-learning

How does one handle variable (state-dependent) action set with Reinforcement Learning, specifically Actor-Critic method? I have found a similar question (Reinforcement Learning With Variable Actions), but it offers no complete answer I can use.
The problem is that neural network that models Policy function has a fixed number of outputs (corresponding to a maximum possible set of actions). It can't help but compute probabilities for all actions, including ones that are impossible in the current state. It becomes an especially huge problem, when there are states where only one or two actions out of, say, initial 50 are possible.
I see two possibilities:
1) Ignore impossible actions, and choose an actions among possible ones, re-normalizing probability of each one to the sum of their probabilities.
2) Let action selector choose impossible actions, but penaltize the network for doing so, until it learns to never choose impossible actions.
I have tried both, and they all have some problems:
1) Ignoring impossible actions may lead to action with output probability of 0.01 to have re-normalized probability of 0.99, depending on other actions. During the back-propagation step, this will lead to a big gradient, because of log(probability) factor (the network will use the original, non-normalized probability in calculation). I'm not entirely sure if this is desirable, and I seem to get not particularly good results with this approach.
2) Penaltizing choosing bad actions is even more problematic. If I only penaltize them slightly, action choose can get stuck for a long time while it adjusts over-inflated probabilities of impossible actions until previously small probabilities of possible actions become viable. But if I penaltize bad actions hugely, it leads to bad outcomes, like having a network where one action has 1.0 probability, and the rest are 0.0. What's worse, the network can even become stuck in an endless loop because of precision problems.
I haven't seen a proper discussion of this question anywhere, but maybe I just can't think of a proper search terms. I would be very thankful if anyone could direct me to a paper or blog post that deals with this question, or give me a extended answer on proper handling of this case.

Related

Cut points from which to choose the best split in Decision Tree regressor with continious feature?

I understand, that in the Decision Tree algorithm, when the splitting is decided, we choose the best split based on some criterion. And when you're looking for the best split, you have to iterate through some list of values. But it seems very computationally expensive to consider every value of the feature as the possible threshold (or, so called, cut point). Thus, there is a necessity for some heuristic for choosing these thresholds. For example, if we have continuous feature and categorical target (i.e, we are dealing with classification problem), we can do the following: sort dataset by given feature and consider for splitting only values, where target variable is changing it's value.
But what do you do if you have regression task, i.e. both feature and target are continuous variables? I realize, that I have to calculate, for example, the mean variance or mean median deviation in both branches for each split. But how do you decide from which values you're choosing you best split? People surely have came up with some optimal solution in order to avoid iterating over every value of the feature in the training set.
I've done some research, but most sources only focuses on different criteria and questions of how you determine whether your split is suitable. Which is not really answering my question.
I've found this question, but Predictor only suggests, that it can be done using the percentiles. And I think, that there is no guarantee, that this is how it really done in real life.
I've also found this question, but for me geledek's answer is not very clear (obviously, dude just copy-pasted his answer from presentation, that he is referring to). I'm pretty much fine with the Method 1, but I would really appreciate if someone could explain Method 2 in more details. Or, perhaps, provide some different source or explanation of your own.
UPD: I've also looked up to the scikit-learn repo at GitHub, and found this line. I can't quite understand the overall code, but it seems that this particular line implies that thresholds are chosen as the averages of the neighboring feature values (which corresponds with the aforementioned Method 1 from the question above). Is that correct? I also don't understand this comment: # sum of halves is used to avoid infinite value. How exactly does dividing by two prevent from getting infinite values? Don't you get infinity only when you are dividing by zero? Is dividing by two necessary, because this way we are getting average value (and not because we don't want to get infinitely)?

Last time step's state vs all time steps' state of RNN/LSTM/GRU

Based on my understanding so far, after training a RNN/LSTM model for sequence classification task I can do prediction in following two ways,
Take the last state and make prediction using a softmax layer
Take all time step's states, make prediction at each time step and take the maximum after summing predictions
In general, is there any reason to choose one over another? Or this is application dependent? Also if I decide to use second strategy should I use different softmax layers for each time step or one softmax layer for all time steps?
I have never seen any network that implements the second approach. The most obvious reason is that all states except for the last one haven't seen the whole sequence.
Take, for example, review sentiment classification. It can start with few positive aspects, after which goes a "but" with a list of drawbacks. All RNN cells before the "but" are going to be biased and their state won't reflect the true label. Does it matter how many of them output positive class and how confident they are? The last cell output would be a better predictor anyway, so I don't see a reason to take the previous ones into account.
If the sequential of the data aspect is not important in a particular problem, then RNN doesn't seem like a good approach in general. Otherwise, you should better use the last state.
There is, however, one exception in sequence-to-sequence models with attention mechanism (see for instance this question). But it is different, because the decoder is predicting a new token on each step, so it can benefit from looking at earlier states. Besides it takes the final hidden state information as well.

Incorporating Transition Probabilities in SARSA

I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations (the sheer amount of time and space DP models require) of DP models, which hopefully will reduce the computation time (takes quite a few hours atm for similar research) and less space will allow adding more complexion to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities - or perhaps I've been reading the wrong books.
PS- Is there a way of knowing if the algorithm is properly implemented? First time working with SARSA.
The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the first assumes that environment's dynamics is known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be notice that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original
form (i.e., tabular form), DP and RL algorithms cannot be implemented
for general problems. They can only be implemented when the state and
action spaces consist of a finite number of discrete elements, because
(among other reasons) they require the exact representation of value
functions or policies, which is generally impossible for state spaces
with an infinite number of elements (or too costly when the number of
states is very high).
Even when the states and actions take finitely many values, the cost
of representing value functions and policies grows exponentially with
the number of state variables (and action variables, for Q-functions).
This problem is called the curse of dimensionality, and makes the
classical DP and RL algorithms impractical when there are many state
and action variables. To cope with these problems, versions of the
classical algorithms that approximately represent value functions
and/or policies must be used. Since most problems of practical
interest have large or continuous state and action spaces,
approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on DP or RL. In the case of RL, transitions are simply used to compute the next state given an action.
Whether is better to use DP or RL? Actually I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems more safe, but maybe a big part of your state space is irrelevant to find an optimal pocliy. In such a case, sampling a set of trajectories (RL) maybe is more effective computationally. In any case, if both methods are rightly applied, should achive a similar solution.
NOTE: when employing function approximation, the convergence properties are more fragile and it is not rare to diverge during the iteration process, especially when the approximator is non linear (such as an artificial neural network) combined with RL.
If you have access to the transition probabilities, I would suggest not to use methods based on a Q-value. This will require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.

Which machine learning model is applicable to the following case

I want to build a model that recognizes the species based on multiple indicators. The problem is, neural networks (usually) receive vectors, and my indicators are not always easily expressed in numbers. For example, one of the indicators is not only whether species performs some actions (that would be, say, '0' or '1', or anything in between, if the essence of action permits that), but sometimes, in which order are those actions performed. I want the system to be able to decide and classify species based on these indicators. There are not may classes but rather many indicators.
The amount of training data is not an issue, I can get as much as I want.
What machine learning techniques should I consider? Maybe some special kind of neural network would do? Or maybe something completely different.
If you treat a sequence of actions as a string, then using features like "an action A was performed" is akin to unigram model. If you want to account for order of actions, you should add bigrams, trigrams, etc.
That will blow up your feature space, though. For example, if you have M possible actions, then there are M (M-1) / 2 bigrams. In general, there are O(Mk) k-grams. This leads to the following issues:
The more features you have — the harder it is to apply some methods. For example, many models suffer from curse of dimensionality
The more features you have — the more data you need to capture meaningful relations.
This is just one possible approach to your problem. There may be others. For example, if you know that there's some set of parameters ϴ, that governs action-generating process in a known (at least approximately) way, you can build a separate model to infer these first, and then use ϴ as features.
The process of coming up with sensible numerical representation of your data is called feature engineering. Once you've done that, you can use any Machine Learning algorithm at your disposal.

Reinforcement Learning With Variable Actions

All the reinforcement learning algorithms I've read about are usually applied to a single agent that has a fixed number of actions. Are there any reinforcement learning algorithms for making a decision while taking into account a variable number of actions? For example, how would you apply a RL algorithm in a computer game where a player controls N soldiers, and each soldier has a random number of actions based its condition? You can't formulate fixed number of actions for a global decision maker (i.e. "the general") because the available actions are continually changing as soldiers are created and killed. And you can't formulate a fixed number of actions at the soldier level, since the soldier's actions are conditional based on its immediate environment. If a soldier sees no opponents, then it might only be able to walk, whereas if it sees 10 opponents, then it has 10 new possible actions, attacking 1 of the 10 opponents.
What you describe is nothing unusual. Reinforcement learning is a way of finding the value function of a Markov Decision Process. In an MDP, every state has its own set of actions. To proceed with reinforcement learning application, you have to clearly define what the states, actions, and rewards are in your problem.
If you have a number of actions for each soldier that are available or not depending on some conditions, then you can still model this as selection from a fixed set of actions. For example:
Create a "utility value" for each of the full set of actions for each soldier
Choose the highest valued action, ignoring those actions that are not available at a given time
If you have multiple possible targets, then the same principle applies, except this time you model your utility function to take the target designation as an additional parameter, and run the evaluation function multiple times (one for each target). You pick the target that has the highest "attack utility".
In continuous domain action spaces, the policy NN often outputs the mean and/or the variance, from which you, then, sample the action, assuming it follows a certain distribution.

Resources