Implementations of Hierarchical Reinforcement Learning

Can anyone recommend a reinforcement learning library or framework that can handle large state spaces by abstracting them?
I'm attempting to implement the intelligence for a small agent in a game world. The agent is represented by a small two-wheeled robot that can move forward and backward, and turn left and right. It has a couple of sensors for detecting a boundary on the ground, a couple of ultrasonic sensors for detecting objects far away, and a couple of bump sensors for detecting contact with an object or opponent. It can also do some simple dead reckoning to estimate its position in the world using its starting position as a reference. So all the state features available to it are:
edge_detected=0|1
edge_left=0|1
edge_right=0|1
edge_both=0|1
sonar_detected=0|1
sonar_left=0|1
sonar_left_dist=near|far|very_far
sonar_right=0|1
sonar_right_dist=near|far|very_far
sonar_both=0|1
contact_detected=0|1
contact_left=0|1
contact_right=0|1
contact_both=0|1
estimated_distance_from_edge_in_front=near|far|very_far
estimated_distance_from_edge_in_back=near|far|very_far
estimated_distance_from_edge_to_left=near|far|very_far
estimated_distance_from_edge_to_right=near|far|very_far
The goal is to identify the state where the reward signal is received, and learn a policy to acquire that reward as quickly as possible. In a traditional Markov model, this state space represented discretely would have 2,985,984 possible values, which is far too many to explore exhaustively using something like Q-learning or SARSA.
Can anyone recommend a reinforcement learning library appropriate for this domain (preferably with Python bindings), or an unimplemented algorithm that I could potentially implement myself?

Your actual state is the robot's position and orientation in the world. Using these sensor readings is an approximation, since it is likely to render many states indistinguishable.
Now, if you go down this road, you could use linear function approximation. Then this is just 24 binary features (the 12 0|1 flags plus the 6 near|far|very_far features encoded with 2 bits each). This is such a small number that you could even use all pairs of features for learning. Farther down this road is online discovery of feature dependencies (see Alborz Geramifard's paper, for example). This is directly related to your interest in hierarchical learning.
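To make that concrete, here is a minimal Python sketch of Q-learning with a linear value function over the binary features and all their pairwise products. The feature layout, action names, and learning constants are illustrative assumptions, not taken from any particular library:

    import itertools
    import random

    import numpy as np

    # Hypothetical setup: 24 binary base features and 4 movement actions.
    N_BASE = 24
    ACTIONS = ["forward", "backward", "turn_left", "turn_right"]

    def expand(phi_base):
        # Append all pairwise products of the base features.
        pairs = [phi_base[i] * phi_base[j]
                 for i, j in itertools.combinations(range(N_BASE), 2)]
        return np.concatenate([phi_base, pairs])

    N_FEATURES = N_BASE + N_BASE * (N_BASE - 1) // 2   # 24 + 276 = 300 features
    weights = np.zeros((len(ACTIONS), N_FEATURES))     # one weight vector per action

    def q_value(phi, a):
        return weights[a] @ phi

    def choose_action(phi, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(len(ACTIONS))
        return int(np.argmax([q_value(phi, a) for a in range(len(ACTIONS))]))

    def q_update(phi, a, reward, phi_next, alpha=0.05, gamma=0.95):
        # One Q-learning step with linear function approximation.
        target = reward + gamma * max(q_value(phi_next, b) for b in range(len(ACTIONS)))
        weights[a] += alpha * (target - q_value(phi, a)) * phi

The weight table here is 4 x 300 instead of one entry per discrete state, which is why the exponential blow-up disappears.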
An alternative is to use a conventional algorithm to track the robot's position and use the position as input to RL.
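For that alternative, a simple differential-drive dead-reckoning update is often enough to track (x, y, heading). A rough sketch, where the wheel_base value and the speed inputs are placeholders for your robot's actual geometry and odometry:

    import math

    class DeadReckoning:
        # Track (x, y, heading) for a two-wheeled robot from wheel speeds.
        def __init__(self, wheel_base=0.1):
            self.x, self.y, self.theta = 0.0, 0.0, 0.0
            self.wheel_base = wheel_base

        def step(self, v_left, v_right, dt):
            v = (v_left + v_right) / 2.0                   # forward speed
            omega = (v_right - v_left) / self.wheel_base   # turn rate
            self.x += v * math.cos(self.theta) * dt
            self.y += v * math.sin(self.theta) * dt
            self.theta += omega * dt
            return self.x, self.y, self.theta

The estimated pose (possibly discretized) can then replace or augment the raw sensor features as RL input.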

Related

How do I correct sensor drift from the external environment?

I have an electromagnetic sensor which reports the electromagnetic field strength it reads in space.
I also have a device that emits an electromagnetic field. It covers a 1 meter area.
I want to predict the position of the sensor from its reading.
But the sensor is affected by metal, which makes the position prediction drift.
For example, if the reading is 1 and you put it near metal, you get 2.
It's not just noise; it's a permanent offset. Unless you remove the metal, it will always read 2.
What techniques or topics do I need to learn, in general, to recover the true reading of 1 from the measured 2?
Assume that the metal is fixed somewhere in space and that I can calibrate the sensor by placing it near the metal first.
Any general suggestions for removing the drift are welcome. Also note that I could place another emitter somewhere, which should make it easier to recover the true reading.
Let me suggest that you view your sensor output as a combination of two factors:
sensor_output = emitter_effect + environment_effect
And you want to obtain emitter_effect without the addition of environment_effect. So, of course you need to subtract:
emitter_effect = sensor_output - environment_effect
Subtracting the environment's effect on your sensor is usually called compensation. In order to compensate, you need to be able to model or predict the effect your environment (extra metal floating around) is having on the sensor. The form of the model for your environment effect can be very simple or very complex.
Simple methods generally use a separate sensor to estimate environment_effect. I'm not sure exactly what your scenario is, but you may be able to select a sensor which would independently measure the quantity of interference (metal) in your setup.
More complex methods can perform compensation without referring to an independent sensor for measuring interference. For example, if you expect the distance to be 10.0 on average with only occasional deviations, you could use that fact to estimate how much interference is present. In my experience, this type of method is less reliable; systems with independent sensors for measuring interference are more predictable and reliable.
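As a rough illustration of the simple approach: if you assume the environment effect is roughly proportional to an independent interference sensor's reading, you can fit that single gain from calibration data. All the values below are made up for the sketch:

    import numpy as np

    # Hypothetical calibration data: raw readings taken with the metal present,
    # the known true values from the metal-free setup, and the output of an
    # independent sensor that responds only to the interference.
    raw_with_metal   = np.array([2.00, 2.10, 1.90, 2.05])
    true_values      = np.array([1.00, 1.10, 0.90, 1.05])
    interference_ref = np.array([1.00, 1.00, 1.00, 1.00])

    # Fit a single gain k so that: raw - k * interference ~ true
    k, *_ = np.linalg.lstsq(interference_ref.reshape(-1, 1),
                            raw_with_metal - true_values, rcond=None)

    def compensate(raw_reading, interference_reading):
        # emitter_effect = sensor_output - environment_effect
        return raw_reading - k[0] * interference_reading

    print(compensate(2.0, 1.0))   # roughly 1.0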
You can start reading about Kalman filtering if you're interested in model-based estimation:
https://en.wikipedia.org/wiki/Kalman_filter
It's a complex topic, so you should expect a steep learning curve. Kalman filtering (along with related Bayesian estimation methods) is the formal way to convert from "bad sensor reading" to "corrected sensor reading".
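To give a feel for it, here is a minimal one-dimensional Kalman filter sketch. It assumes the true value changes slowly; the process and measurement variances are tuning guesses, not known quantities:

    class Kalman1D:
        # Minimal scalar Kalman filter for a slowly varying quantity.
        def __init__(self, x0=0.0, p0=1.0, process_var=1e-4, measurement_var=0.05):
            self.x = x0                  # state estimate
            self.p = p0                  # estimate variance
            self.q = process_var         # how fast the true value can change
            self.r = measurement_var     # sensor noise variance

        def update(self, z):
            # Predict: the state is assumed constant, so only the variance grows.
            self.p += self.q
            # Correct: blend prediction and measurement using the Kalman gain.
            k = self.p / (self.p + self.r)
            self.x += k * (z - self.x)
            self.p *= (1.0 - k)
            return self.x

    kf = Kalman1D(x0=1.0)
    for z in [1.02, 0.97, 1.05, 1.01]:
        print(kf.update(z))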

Neural Network - Should I Remove All Derived / Calculated Variables?

I'm using a neural network to control the movement of a character in a game. I've currently got a huge number of dimensions, and in the interest of trimming them to improve storage and code manageability, I'm considering removing all derived variables, i.e. any variable which can be calculated from data already sent into the network.
An example of this would be the relationship between a) position, b) velocity, and c) acceleration along a path. Currently, I send the last 50 data points of all three to the NN to help it decide its next movement. However, I wonder if system control / error could be minimized just as easily by sending only position. Theoretically the neural network should be able to derive the velocity and acceleration at a point in time entirely on its own given the position history.
Generally, is dimension reduction in this capacity recommended? Why or why not?
I know the usual recommendation in this scenario is just to test it and see what happens, but there are so many variables here that it would take days to test, so I was hoping to hear anyone's experience with this type of situation and what they surmise the general rule to be.
Bonus question: would this assessment / decision be different for a neural network (intent on mapping functions to data) as opposed to a random forest (which seems to use more of a nearest-neighbor approach)?
Thanks!!
Implement PCA to reduce the number of features. The reduced features will have unusual units, like a mix of position, velocity, and acceleration. However, if you do PCA correctly you can retain a feature set that has 99% of the variance of the original set.
Then use the new feature set in your NN.
Reducing dimensions is recommended to speed up algorithms because, as you observed, there is a lot of similarity between your features.
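A minimal sketch of this with scikit-learn, assuming the 50-step history of position/velocity/acceleration has been flattened into one row per sample (the array shapes here are placeholders):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical design matrix: 10,000 samples of 150 raw inputs
    # (50 time steps x position/velocity/acceleration).
    X = np.random.rand(10000, 150)

    # Keep as many components as needed to explain 99% of the variance.
    pca = PCA(n_components=0.99)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape, pca.explained_variance_ratio_.sum())

    # At run time, apply the same projection before feeding the network:
    # x_new_reduced = pca.transform(x_new.reshape(1, -1))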

How to use machine learning to calculate a graph of states from a sequence of data?

Generic formulation
I have a dataset consisting of a sequence of points with 12 features each.
I am interested in detecting an event in this data.
In the training data I know the moments the event occurred.
When the event occurs I can see an observable pattern in the sequence of points before the event. The pattern is formed from about 300 consecutive points.
I am interested in detecting when the event occurred in an infinite sequence of points.
The analysis happens post factum. I am not interested in predicting if the event will occur.
Concrete example
You may skip this section
I am building a wearable that needs to detect when an old person has fallen. It will be worn on the wrist, like a watch, and it has an IMU (3-axis accelerometer, 3-axis gyroscope, and magnetometer).
The most relevant data is the absolute acceleration on the Z axis, but there is a lot of other information to be factored in, because hand movements are complex and can generate many false positives.
The human observable pattern based on the absolute Z acceleration consists of 3 states:
an acceleration towards the ground
an impact acceleration up the Z axis
silence - very small movements
Those states could be expanded further by supervised learning and should factor in the other accelerations, in both the wearable coordinate system and the absolute coordinate system, as well as rotation expressed as Euler angles and quaternions.
Approach
My idea was to find an algorithm that uses the training data to calculate a graph of states, where each state has thresholds based on the 12 features. When a threshold is crossed, the state changes according to the graph.
What do you guys think?
Markov Models?
A Markov model would have worked if I were interested in predicting the event before it occurred, for example based on the first 200 points of the 300-point pattern.
But I am interested in detecting the event shortly after the pattern has ended; there is no need to predict.
I'm not sure if Markov models apply, and if they do, I don't know how to calculate the model.
Hidden Markov models are not unusual for event detection in sequences, especially separating the frames of a certain event from "background signal" in a sequence.
The most basic realization of this is to concatenate 2 HMMs, one specialized at modeling the "signal", the other at modeling the background, suitably combined to allow transitions like background-->signal-->background.
A specific example is keyword spotting in speech recognition (separate a "keyword" from background signal). A brief explanation is found in Section 11.2 of Grangier et al., "Discriminative keyword spotting". Figure 11.1 in that reference may be the model topology you're looking for.
Another alternative is binary classification with a sliding window. This is discussed, for instance, in T. Dietterich, "Machine Learning for Sequential Data: A Review" (among other alternatives).
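A rough Python sketch of that sliding-window approach, assuming you have frame-level labels for when falls occurred. The window length, hop size, summary statistics, and classifier choice are all placeholders to tune:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    WINDOW = 300   # length of the observable pattern
    STEP = 30      # hop between consecutive windows

    def window_features(segment):
        # Summarize one (WINDOW x 12) slice with simple per-feature statistics.
        return np.concatenate([segment.mean(axis=0), segment.std(axis=0),
                               segment.min(axis=0), segment.max(axis=0)])

    def make_windows(sequence, frame_labels):
        # sequence: (T, 12) array; frame_labels: (T,) 0/1 array marking event frames.
        X, y = [], []
        for start in range(0, len(sequence) - WINDOW, STEP):
            sl = slice(start, start + WINDOW)
            X.append(window_features(sequence[sl]))
            y.append(int(frame_labels[sl].any()))   # positive if the window covers the event
        return np.array(X), np.array(y)

    # Placeholder training data: one recording with a single labeled fall.
    sequence = np.random.randn(20000, 12)
    frame_labels = np.zeros(20000, dtype=int)
    frame_labels[9000:9300] = 1

    X, y = make_windows(sequence, frame_labels)
    clf = RandomForestClassifier(n_estimators=200).fit(X, y)

    # At run time, score the most recent window and report a detection when the
    # predicted probability of the positive class is high enough.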

Machine Learning: How to learn MTG draft card game

I have become familiar with many different approaches to machine learning, but I am having trouble identifying which approach might be most appropriate for my given (fun) problem. (I.e., is this a supervised learning problem, and if so, what is my input x and output y?)
A Magic: The Gathering draft consists of several players sitting around a table, each holding a pack of 15 cards. The players pick a card and pass the remainder of the pack to the player next to them. They open a new pack and repeat this for 3 total rounds (45 decisions). People end up with a deck which they use to compete.
I am having trouble understanding how to structure the data I have into the trials needed for learning. I want a solution that 1) builds knowledge about which cards are picked relative to the cards picked previously, and 2) can then be used to make a decision about which card to pick from a given new pack.
I've got a dataset of human choices I'd like to learn from. It also includes info on cards they picked but ultimately discarded. What might be an elegant way to structure this data for learning, i.e., what are my features?
These kinds of problems are usually tackled with reinforcement learning, planning, and Markov decision processes. Thus this is not a typical supervised/unsupervised learning scheme. This is rather about learning to play something, i.e. to interact with the environment (rules of the game, chances, etc.). Take a look at methods like:
Q-learning
SARSA
UCT
In particular, the great book by Sutton and Barto, "Reinforcement Learning: An Introduction", can be a good starting point.
Yes, you can train a model to handle this -- eventually -- with either supervised or unsupervised learning. The problem is the quantity of factors and local stable points. Unfortunately, the input at this point is the state of the game: cards picked by all players (especially the current state of the AI's deck) and those available from the pack in hand.
Your output result should be, as you say, the card chosen ... out of those available. I feel that you have a combinatorial explosion that will require either massive amounts of data, or simplification of the card features to allow the algorithm to extract a meaning deeper than "Choose card X out of this set of 8."
In practice, you may want the model to score the available choices, rather than simply picking a particular card. Return rankings, or fitness metrics, or probabilities of picking each particular card.
You can supply some supervision in choice of input organization. For instance, you can provide each card as a set of characteristics, rather than simply a card ID -- give the algorithm a chance to "understand" building a consistent deck type.
You might also wish to put in some work to abstract (i.e. simplify) the current game state, such as writing evaluation routines to summarize the other decks being built. For instance, if there are 6 players in the group, and your RHO and his opposite are both building burn decks, you don't want to do the same -- RHO will take the best burn cards in 5 of 6 decks passed around, leaving you with 2nd (or 3rd) choice.
As for the algorithm ...
A neural network will explode with this many input variables. You'll want something simpler that matches your input data. If you go with abstracted properties, you might consider a simpler classifier (Naive Bayes, a decision-tree method such as Random Forest, etc.). You might also go for a collaborative filtering model, given the similarities in situations.
I hope this helps launch you toward designing your features. Do note that you're attacking a complex problem: one of the features that can make a game popular for humans is that it's hard to automate the decision-making.
Every single "pick" is a decision, with the input information as A:(what you already have), and B:(what the available choices are).
Thus, a machine that decides "whether you should pick this card" can be a simple binary classifier given the input of A + B + (the card in question).
For example, the pack 1 pick 2 of someone basically provides 1 "yes" (the card picked) and 13 "no" (the cards not picked), total 14 rows of training data.
We may want to weight these training data according to which pick it is. (When there are fewer cards left, the choice might be less important than when there are more choices.)
We may also want to weight these training data according to the rarity of cards.
Finally, the main challenge is that the input data (the features), A + B + card, is inappropriate unless we do a smart transformation. (Simply treating the card as a categorical variable and one-hot encoding it leads to something that is too big and has very low information density. That will definitely fail.)
This challenge can be resolved by making it a 2-stage process. First we vectorize the cards, then build features based on the vectors.
http://www.cs.toronto.edu/~mvolkovs/recsys2018_challenge.pdf
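A rough Python sketch of how those training rows could be structured, with a hypothetical card_vector encoding standing in for the two-stage vectorization. All names, dimensions, and weights below are illustrative assumptions:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    D = 8                                                            # size of the card feature vector
    CARD_FEATURES = {cid: np.random.rand(D) for cid in range(500)}   # placeholder encoding

    def card_vector(card_id):
        # Hypothetical encoding: cost, power, toughness, color flags, or a learned embedding.
        return CARD_FEATURES[card_id]

    def pick_row(deck_so_far, pack, candidate):
        deck_vec = np.mean([card_vector(c) for c in deck_so_far], axis=0) \
            if deck_so_far else np.zeros(D)
        pack_vec = np.mean([card_vector(c) for c in pack], axis=0)
        return np.concatenate([deck_vec, pack_vec, card_vector(candidate)])

    def rows_from_pick(deck_so_far, pack, chosen):
        # One pick yields 1 "yes" row and len(pack) - 1 "no" rows.
        X = [pick_row(deck_so_far, pack, c) for c in pack]
        y = [int(c == chosen) for c in pack]
        w = [len(pack) / 15.0] * len(pack)   # weight earlier picks (bigger packs) more
        return X, y, w

    # After stacking X, y, w across every recorded pick in the dataset:
    # clf = RandomForestClassifier(n_estimators=300).fit(X, y, sample_weight=w)
    # To draft, score each card in a new pack with clf.predict_proba and take the best one.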

Q-learning - Defining states and rewards

I need some help with solving a problem that uses the Q-learning algorithm.
Problem description:
I have a rocket simulator where the rocket takes random paths and also crashes sometimes. The rocket has 3 different engines that can each be either on or off. Depending on which engines are activated, the rocket flies in different directions.
Functions for turning the engines on/off are available.
The task:
Construct a Q-learning controller that will turn the rocket to face up at all times.
A sensor that reads the angle of the rocket is available as input.
My solution:
I have the following states: the angle of the rocket, discretized into 16 values.
I also have the following actions:
all engines off
left engine on
right engine on
middle engine on
left and right on
left and middle on
right and middle on
And the following rewards:
Angle = 0, Reward = 100
All other angles, reward = 0
Question:
Now to the question: is this a good choice of rewards and states? Can I improve my solution? Is it better to have rewards for other angles as well?
Thanks in advance
16 states x 7 actions is a very small problem.
Rewards for other angles will help you learn faster, but can create odd behaviors later depending on your dynamics.
If you don't have momentum you may decrease the number of states, which will speed up learning and reduce memory usage (which is already tiny). To find the optimal number of states, try decreasing the number of states while analyzing a metric such as reward per timestep over multiple games, or mean error (normalized by starting angle) over multiple games. Some state representations may perform much better than others. If not, choose the one which converges fastest. This should be relatively cheap with your small Q table.
If you want to learn quickly, you may also try Q(λ) or some other modified reinforcement learning algorithm that uses eligibility traces to speed up temporal difference learning.
Edit: Depending on your dynamics this problem may not actually be suitable as a Markov Decision Process. For example, you may need to include the current rotation rate.
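For reference, a minimal tabular Q-learning loop for a 16-state x 7-action problem might look like the Python sketch below. The angle discretization, hyperparameters, and the simulator call are placeholders; the random transition stands in for your real dynamics:

    import random

    import numpy as np

    N_STATES, N_ACTIONS = 16, 7          # discretized angle bins x engine combinations
    Q = np.zeros((N_STATES, N_ACTIONS))

    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

    def angle_to_state(angle_deg):
        # Discretize 0-360 degrees into 16 bins.
        return int(angle_deg % 360 // 22.5)

    def reward(state):
        return 100 if state == 0 else 0   # facing up; consider shaping nearby states

    for episode in range(1000):
        state = random.randrange(N_STATES)
        for t in range(200):
            if random.random() < EPSILON:
                action = random.randrange(N_ACTIONS)
            else:
                action = int(np.argmax(Q[state]))

            # next_angle = simulator.step(action)   # hypothetical simulator call
            next_angle = random.uniform(0, 360)     # placeholder for the real dynamics
            next_state = angle_to_state(next_angle)

            r = reward(next_state)
            Q[state, action] += ALPHA * (r + GAMMA * Q[next_state].max() - Q[state, action])
            state = next_state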
Try putting smaller rewards on the states next to the desired state. This will help your agent learn to face up more quickly.
