All the reinforcement learning algorithms I've read about are usually applied to a single agent that has a fixed number of actions. Are there any reinforcement learning algorithms for making decisions while taking into account a variable number of actions? For example, how would you apply an RL algorithm in a computer game where a player controls N soldiers, and each soldier has a varying number of actions based on its condition? You can't formulate a fixed number of actions for a global decision maker (i.e. "the general") because the available actions are continually changing as soldiers are created and killed. And you can't formulate a fixed number of actions at the soldier level either, since each soldier's actions are conditional on its immediate environment. If a soldier sees no opponents, then it might only be able to walk, whereas if it sees 10 opponents, then it has 10 new possible actions: attacking one of the 10 opponents.
What you describe is nothing unusual. Reinforcement learning is a way of finding the value function of a Markov Decision Process. In an MDP, every state has its own set of actions. To apply reinforcement learning to your problem, you have to clearly define what the states, actions, and rewards are.
If each soldier has a set of actions that are available or not depending on some conditions, then you can still model this as selection from a fixed set of actions. For example:
Create a "utility value" for every action in each soldier's full action set
Choose the highest valued action, ignoring those actions that are not available at a given time
If you have multiple possible targets, then the same principle applies, except this time you model your utility function to take the target designation as an additional parameter, and run the evaluation function multiple times (once for each target). You pick the target that has the highest "attack utility".
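For illustration, here is a minimal sketch of that idea (the soldier fields and utility formulas are made up; only the selection mechanism matters):

    def walk_utility(soldier):
        # Made-up scoring: healthier soldiers are more willing to keep moving.
        return soldier["health"] / 100.0

    def attack_utility(soldier, enemy):
        # Made-up scoring: prefer nearby targets when healthy.
        return soldier["health"] / 100.0 + 1.0 / (1.0 + enemy["distance"])

    def choose_action(soldier, visible_enemies):
        # Utility for every action that is available right now: "walk" is
        # always possible, and there is one "attack" action per visible enemy.
        utilities = {"walk": walk_utility(soldier)}
        for i, enemy in enumerate(visible_enemies):
            utilities[("attack", i)] = attack_utility(soldier, enemy)
        # Pick the highest-valued action among the currently available ones.
        return max(utilities, key=utilities.get)

    soldier = {"health": 80}
    enemies = [{"distance": 5.0}, {"distance": 2.0}]
    print(choose_action(soldier, enemies))   # e.g. ("attack", 1)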
In continuous action spaces, the policy NN often outputs the mean and/or the variance, from which you then sample the action, assuming it follows a certain distribution.
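For example, a minimal Gaussian policy head might look like this (a sketch assuming PyTorch; the layer sizes are arbitrary):

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, state_dim, action_dim):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
            self.mean_head = nn.Linear(64, action_dim)
            # Learned log standard deviation (state-independent here).
            self.log_std = nn.Parameter(torch.zeros(action_dim))

        def forward(self, state):
            h = self.body(state)
            return self.mean_head(h), self.log_std.exp()

    policy = GaussianPolicy(state_dim=8, action_dim=2)
    state = torch.randn(1, 8)                    # dummy state
    mean, std = policy(state)
    dist = torch.distributions.Normal(mean, std)
    action = dist.sample()                       # continuous action
    log_prob = dist.log_prob(action).sum(-1)     # used in the policy-gradient loss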
Related
I am trying to implement the Q-learning algorithm, but I do not have enough time to select the action by e-greedy. For simplicity, I am choosing a random action, without any proper justification. Will this work?
Yes, random action selection will allow Q-learning to learn an optimal policy. The goal of e-greedy exploration is to ensure that all the state-action pairs are (asymptotically) visited infinitely often, which is a convergence requirement [Sutton & Barto, Section 6.5]. Obviously, a random action selection process also complies with this requirement.
The main drawback is that your agent will act poorly throughout the learning stage. Convergence may also be slower, but this last point is very application dependent.
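As a sketch, tabular Q-learning with a purely random behaviour policy would look roughly like this (the env interface is assumed to be gym-like: reset() returns a state, step(a) returns (next_state, reward, done)):

    import random
    from collections import defaultdict

    def q_learning_random_exploration(env, actions, episodes=1000,
                                      alpha=0.1, gamma=0.99):
        Q = defaultdict(float)                        # Q[(state, action)]
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                action = random.choice(actions)       # random behaviour policy
                next_state, reward, done = env.step(action)
                # Off-policy target: max over actions in the next state.
                best_next = max(Q[(next_state, a)] for a in actions)
                target = reward + gamma * best_next * (not done)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q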
How does one handle a variable (state-dependent) action set in Reinforcement Learning, specifically with the Actor-Critic method? I have found a similar question (Reinforcement Learning With Variable Actions), but it offers no complete answer I can use.
The problem is that the neural network that models the policy function has a fixed number of outputs (corresponding to the maximum possible set of actions). It can't help but compute probabilities for all actions, including ones that are impossible in the current state. This becomes an especially big problem when there are states where only one or two actions out of, say, an initial 50 are possible.
I see two possibilities:
1) Ignore impossible actions, and choose an action among the possible ones, re-normalizing the probability of each by the sum of their probabilities.
2) Let the action selector choose impossible actions, but penalize the network for doing so, until it learns to never choose impossible actions.
I have tried both, and each has its problems:
1) Ignoring impossible actions may lead to an action with an output probability of 0.01 getting a re-normalized probability of 0.99, depending on the other actions. During the back-propagation step, this leads to a big gradient because of the log(probability) factor (the network uses the original, non-normalized probability in the calculation; see the sketch below). I'm not entirely sure whether this is desirable, and the results I get with this approach are not particularly good.
2) Penalizing the choice of bad actions is even more problematic. If I only penalize them slightly, action selection can get stuck for a long time while it adjusts the over-inflated probabilities of impossible actions, until the previously small probabilities of possible actions become viable. But if I penalize bad actions heavily, it leads to bad outcomes, like a network where one action has probability 1.0 and the rest 0.0. What's worse, the network can even become stuck in an endless loop because of precision problems.
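To make the first approach concrete, here is roughly what I do (just a sketch, assuming a PyTorch policy network; the tensor names are illustrative):

    import torch

    logits = torch.randn(1, 50)                  # raw policy-network outputs (dummy)
    mask = torch.zeros(1, 50)
    mask[0, :2] = 1.0                            # say only 2 of the 50 actions are possible

    probs = torch.softmax(logits, dim=-1)        # probabilities over all actions
    masked = probs * mask                        # zero out impossible actions
    renormalized = masked / masked.sum(dim=-1, keepdim=True)
    action = torch.multinomial(renormalized, 1)  # sample among possible actions only
    # The policy-gradient loss still uses log(probs[0, action]), i.e. the
    # original, non-renormalized probability, which is where the large
    # gradients come from.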
I haven't seen a proper discussion of this question anywhere, but maybe I just can't think of the proper search terms. I would be very thankful if anyone could direct me to a paper or blog post that deals with this question, or give me an extended answer on the proper handling of this case.
I want to build a model that recognizes a species based on multiple indicators. The problem is that neural networks (usually) receive vectors, and my indicators are not always easily expressed in numbers. For example, one of the indicators is not only whether the species performs some actions (that would be, say, '0' or '1', or anything in between, if the nature of the action permits it), but sometimes also the order in which those actions are performed. I want the system to be able to decide and classify species based on these indicators. There are not many classes, but rather many indicators.
The amount of training data is not an issue, I can get as much as I want.
What machine learning techniques should I consider? Maybe some special kind of neural network would do? Or maybe something completely different.
If you treat a sequence of actions as a string, then using features like "action A was performed" is akin to a unigram model. If you want to account for the order of actions, you should add bigrams, trigrams, etc.
That will blow up your feature space, though. For example, if you have M possible actions, then there are up to M^2 distinct bigrams. In general, there are O(M^k) possible k-grams. This leads to the following issues:
The more features you have, the harder it is to apply some methods. For example, many models suffer from the curse of dimensionality.
The more features you have, the more data you need to capture meaningful relations.
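To make the n-gram idea above concrete, here is a small sketch (the action names are made up):

    from collections import Counter

    def ngram_features(actions, max_n=2):
        # Count every contiguous n-gram (n = 1..max_n) in the action sequence.
        feats = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(actions) - n + 1):
                feats[tuple(actions[i:i + n])] += 1
        return feats

    seq = ["forage", "groom", "forage", "sing"]
    print(ngram_features(seq))
    # -> counts like {('forage',): 2, ('groom', 'forage'): 1, ('forage', 'sing'): 1, ...}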
This is just one possible approach to your problem; there may be others. For example, if you know that there is some set of parameters ϴ that governs the action-generating process in a known (at least approximately) way, you can build a separate model to infer those parameters first, and then use ϴ as features.
The process of coming up with a sensible numerical representation of your data is called feature engineering. Once you've done that, you can use any machine learning algorithm at your disposal.
I have a group of instances with n features (numerical) each.
I am resampling my features every X time steps, so each instance has a set of features at t1:tn.
The continuous response variable (e.g. with range 50:100) is only measured every X*z time steps (e.g. features sampled every minute, response only every 30). The features might change over time. So might the response.
Now, at any time point T, I want to map a new instance to the response range.
In case I did not lose you yet :-)
Would you rather see this as a regression problem or a multi-class classification problem (with a discretized response range)?
In either case, is there a rule of thumb for how many instances I will need? If the instances do not follow the same distribution (e.g. different responses for the same set of feature values), can I use clustering to filter / analyze this?
I am working on a prototype framework.
Basically, I need to generate a model or profile of each individual's lifestyle based on some sensor data about him/her, such as GPS, motion, heart rate, surrounding environment readings, temperature, etc.
The proposed model or profile is a knowledge representation of an individual's lifestyle pattern. Maybe a graph with probabilities.
I am thinking of using a Hidden Markov Model to implement this. The states in the HMM could be Working, Sleeping, Leisure, Sport, etc., and the observations could be the various sensor readings.
My understanding of an HMM is that the next state S(t) depends only on the previous state S(t-1). However, in reality a person's activity might depend on the previous n states. Is it still a good idea to use an HMM, or should I use some other, more appropriate model? I have seen some work on second-order and higher-order Markov chains; does that also apply to HMMs?
I would really appreciate it if you could give me a detailed explanation.
Thanks!!
What you are describing is a first-order HMM, in which your model only has knowledge of the single previous state. In an order-n Markov model, the next state depends on the previous n states, and maybe that is what you are looking for?
You are right that, as far as simple HMMs are concerned, the next state depends only on the current state. However, it is also possible to achieve an mth-order HMM by defining the transition probabilities accordingly, as shown in this link. As the order increases, so does the overall size of your matrices and hence the complexity of your model, so it's really up to you whether you're up for the challenge and willing to put in the requisite effort.
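For intuition, the usual trick is to fold the recent history into the state itself, so ordinary first-order HMM machinery still applies. A rough sketch for a second-order chain over your example states (the probabilities are placeholders):

    from itertools import product

    states = ["Working", "Sleeping", "Leisure", "Sport"]
    pair_states = list(product(states, repeat=2))      # expanded state: (S(t-1), S(t))

    def second_order_prob(prev2, prev1, nxt):
        # Placeholder for P(S(t+1)=nxt | S(t)=prev1, S(t-1)=prev2).
        return 1.0 / len(states)

    # First-order transitions over the expanded states: (a, b) can only move
    # to (b, c), with probability P(c | b, a); everything else gets 0.
    transition = {}
    for (a, b) in pair_states:
        for (b2, c) in pair_states:
            transition[((a, b), (b2, c))] = (
                second_order_prob(a, b, c) if b2 == b else 0.0
            )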