Training PPO from stable_baselines3 on a grid world that randomizes

Training PPO from stable_baselines3 on a grid world that randomizes - machine-learning

I'm new to RL and I was hoping to get some advice from yol:
I created a custom environment that is a 10x10 grid world where the agent and its target destination (as well as some obstacles, namely: Fires) can be randomly placed. The state of the env that the model is trained on is just the Box numpy array representing the different position (0s for empty spaces, 1 for the target, etc).
what the world could look like
The PPO model (from stable_baselines3) is unable to learn now to navigate randomly generated worlds even after 5 million time steps of training (each reset of an environment creates new random world layout). Tensor-board is showing only a very slight average reward increase after all that training.
I am able to train the model effectively only I keep the world layout the same on every reset (so no random placement of the agent, etc).
So my question is: should PPO be in theory able to deal with random world generation like that or am I trying to make it do something that is beyond its capabilities?
More details: I'm using all default PPO parameters (with MlpPolicy).
The reward system is as follows:
On every step the reward is -0.5 * distance between the agent (smily face) and the target ('$')
If the agent is next to a fire ('X'), it gets -100 reward
If the agent is is next to the target ('$'), it gets reward of 1000 and the episode ends
Max of 200 steps per episode.

I would rather try good old off-policy deterministic solutions like DQN for that task, but on-policy stochastic PPO should solve it as well. I recommend to change three things in your design, maybe changing them could help your training.
First, your reward signal design probably "embarrasses" your network: you have an enormously big positive terminal reward while trying to push your agent to this terminal state asap with small punishments. I would definetely suggest reward normalization for PPO.
Second, if you don't fine-tune your PPO hyperparameters, your entropy coefficient ent_coef remains 0.0, however entropy part of your loss function could be very useful in your env. I would try 0.01 at least.
Third, PPO would be really enhanced in your case (in my opinion) if you change mlp policy to a recurrent one. Take a look at RecurrentPPO

Related

Is reinforcement learning suitable for predicting bias in dice?

I would like to analyze a problem similar to the following.
Problem:
You will be given N dices.
You will be given a lot of data about each dice (eg surface information, material information, location of the center of gravity … etc).
The features of the dice are randomly generated every game and are fired at the same speed, angle and initial position.
As a result of rolling the dice, you get 1 point if you get 6 and 0 points otherwise.
There are training data of 100000 games. (Dice data and match results)
I would like to learn the rule of selecting only dice whose probability of getting 6 is higher than 1/6.
I apologize for the vague problem statement.
First of all, it is my mistake to assume that "N dice".
The dice may be one by one.
One dice with random characteristics are distributed
When it rolls, it is recorded whether 6 has come out or not.
It was easy to understand if it was made into the problem that "this [characteristics, result] data is 100,000".
If you get something other than 6, you will get -1 points.
If you get 6, you will get +5 points.
Example:
X: vector of a dice data
f: function I want to know
f: X-> [0, 1]
(if result> 0.5, I pick this dice.)
For example, a dice with a 1/5 chance of getting a 6 gets 4 out of 5 times a non-6, so I wondered if it would be better to give an immediate reward.
Is it good to decide the reward by the number of points after 100000 games?
I have read some general reinforcement learning methods, but there is a concept of state transition. However, there is no state transition in this game. (Each game ends in 1 step, and each game is independent.)
I am a student just learning neural networks from scratch. It helps if you give me a hint. Thank you.
by the way,
I think that the result of this learning can be concluded "It is good to choose the dice whose pips farthest to the center of gravity is 6."

Let's first talk about Reinforcement-Learning.
Problem setups, in order of increasing generality:
Multi-Armed Bandit - no state, just actions with unknown rewards
Contextual Bandit - rewards also depend on some context (state)
Reinforcement Learning (MDP) - actions can also influence the next state
Common to all of all three is that you want to maximize the sum of rewards over time, and there is an exploration vs exploitation trade-off. You are not just given a large dataset. If you want to know what the best action is, you have to try it a few times and observe the reward. This may cost you some reward you could have earned otherwise.
Of those three, the Contextual Bandit is the closest match to your setup, although it doesn't quite match to your goals. It would go like this: Given some properties of the dice (the context), select the best dice from a group of possible choices (the action, e.g. the output from your network), such that you get the highest expected reward. At the same time you are also training your network, so you have to pick bad or unknown properties sometimes to explore them.
However, there are two reasons why it doesn't match:
You already have data from several 100000 of games, and seem to be not interested in minimizing the cost of trial and error to acquire more data. You assume this data is representative, so no exploration is required.
You are only interested in prediction. You want classify the dice into "good for rolling 6" vs "bad". This piece of information could be used later to make a decision between different choices if you know the cost for making a wrong decision. If you are just learning f() because you are curious about the property of a dice, then is a pure statistical prediction problem. You don't have to worry about short- or long-term rewards. You don't have to worry about selection or consequences of any actions.
Because of this, you actually only have a supervised learning problem. You could still solve it with reinforcement learning because RL is more general. But your RL algorithm would be wasting a lot of time figuring out that it really cannot influence the next state.
Supervised Learning
Your dice actually behaves like a biased coin, it's a Bernoulli trial with ~1/6 success probability. This is now a standard classification problem: given your features, predict the probability that a dice will lead to a good match result.
It seems that your "match results" can be easily converted in the number of rolls and the number of positive outcomes (rolled a six) with the same dice. If you have a large number of rolls for every dice, you can simply classify this die and use this class (together with the physical properties) as one data point to train your network.
You can do more fancy things if you have fewer rolls but I won't go into that. (If you are interested, have a look at the beta distribution and how the cross-entropy loss works with neural networks.)

How to find the values for several variables so that a function returns highest value

I have 6 variables with values of something between 0 and 2 and a function, where these variables are given into. It predicts the result of a football match by looking at the past matches of both teams.
Depending on the variables, obviously, the output of the function changes. Each variable determines how much a certain part of the algorithm is weighed (e.g. how much less a game should be weighed that was 6 months ago compared to a game that was a week ago).
My goal is now to find out what the perfect ratios between the different variables and thus between the different parts of the algorithm are, so that my algorithm predicts most matches correctly. Is there any way of achieving that?
I thought of doing something like this with machine learning, something similar to linear/polynomal regression.
To determine how close a tip is I thought of giving:
2 points for when the tendency was right (predicted that Team A would win and Team A did win)
4 points for when the goal difference is right (Prediction: Team A
wins 2:1, actual result: 1:0)
5 points for when the result is
predicted correctly (predicted result: 2:1 and actual result: 2:1)
Which would make a loss function of
maximal points for game (which is 5) - points for predicted result
If I am able to minimize that, hopefully, after looking at some training sets (past seasons), it will theoretically score the most amount of points, when you give it a new season together with the variables computed in beforehand as input.
Now I'm trying to find out, by how much and in which direction I have to change each of my variables so that the loss is made smaller each time you look at a new training set.
You probably have to look at how big the loss is but I dont know how to find out which variable to change and in which direction. Is that possible and if so, how do I do that?
Currently I'm using javascript.

I am assuming that you are trying to use gradient descent to train your regression model.
Loss functions that you can use with gradient descent have to be differentiable, so simply giving/subtracting points to certain properties of the prediction is not possible.
A loss function that may be suitable for this task is the mean squared error, which is simply computed by averaging the squared differences between predicted and expected values. Your expected values would then just be the scores of both teams in the game.
You would then have to compute the gradient of the loss of your prediction with respect to the weights that the prediction function uses to compute its outputs. This can be done using backpropagation (details of which are way too broad for this answer, there are many tutorials available on the web).
The intuition behind the gradient of a function is that it points in the direction of steepest ascend of your function. If you update your parameters in that direction, the output of your function will grow. If this value is the loss of your prediction function, you want it to be smaller, so you go a small step in the opposite direction of your gradient.

argmax from probability distribution better policy than random sampling from softmax?

I am trying to train Echo State Network for text generation with stochastic optimization along the lines of Reinforcement learning, where the optimization depends on the reward signal.
I have observed that during evaluation, when I sample from the probability distribution, the bleu score is bigger than when I argmax from the distribution. The difference is almost more than 0.10 points (BLEU Score is generally between the range 0 and 1 ).
I am not sure why does that happen.
Help needed.

You don't use the argmax function as it is a deterministic approach. And the main issue with that is that it can easily get you trap into a loop. Meaning that in case of an error in the text generation you are likely to keep going on this path without any possibility to get out. The randomness allows a "jump out" of the loop.
A good example to illustrate this need of a jump out is for example the Page Rank algorithm. It uses a random walk parameter that allows the imaginary surfer to get out of dead ends.
The TensorFlow team says about this in their tutos this (without any justification)
:
Note: It is important to sample from this distribution as taking the argmax of the distribution can easily get the model stuck in a loop.

Neural Network - Should I Remove All Derived / Calculated Variables?

I'm using a neural network to control the movement of a character in a game. I've currently got a huge amount of dimensions and in the interest of trimming them to improve storage and code manageability, I'm considering removing all derived variables i.e. any variable which can be calculated from data already sent into to the network.
An example of this would be the relationship between a) position, b) velocity, and c) acceleration along a path. Currently, I send the last 50 data points of all three to the NN to help it decide its next movement. However, I wonder if system control / error could be minimized just as easily by sending only position. Theoretically the neural network should be able to derive the velocity and acceleration at a point in time entirely on it's own given the position history.
Generally, is dimension reduction in this capacity recommended? Why or why not?
I know the oft recommendation in this scenario is just to test it and see what happens, but in this case there are so many variables here that it would take days to test, so I was hoping to hear anyone's experience given this type of situation and what they surmise the general rule to be.
Bonus question--would this assessment / decision be different for a neural network (intent on mapping functions to data) as opposed to a random forest (seems to use more of a nearest neighbor approach).
Thanks!!

Implement PCA to reduce the number of features. They reduced features will have unusual units like [positionvelocityacceleration]. However, if you do PCA correctly you can retain a feature set that has 99% variance of the original set.
Then use the new feature set in your NN.
Reducing dimensions is recommended to speed-up algorithms because, as you observed, there is a lot of similarity between your features.

HMM application in speech recognition

this is my first time posting a question here so if the approach is not so standard i apologize, i understand there are lots of questions out there on this and i have read tons of thesis, questions, aritcles and tutorials yet i seem to have a problem and it's always best to ask. i am creating a speech recognition application, using phoneme level processing(not isolated word) continuous HMMs based on gaussian mixture models, involving baum welch, forward-backward, and viterbi algorithms,
i have implemented a very good feature extraction and pre-processing method (MFCC), feature vectors consist of the mfcc, delta and acceleration coefficients as well and it's working pretty well on it's part however when it comes to HMMs , i seem to either have a 'Major Misunderstading' about how HMMs are supposed to help recognize speech or i am missing a little point here...i have try harded a lot that at this point i can't really tell what's wrong and right.
first off, i recorded around 50 words, each 6 utterances, and run them through a correct compatibility and conversion program that i wrote myself and the extracted the features so that they can be used for baum-welch.
i want you to please tell me where am i making a mistake in this procedure, also i will mention a few doubts i have on it so that you can help me understand this whole subject better.
here are the steps in my application concerning anything related to the training :
steps for initial parameters of HMM model :
1 - assign all observations from each training sample of each model to their corresponding discrete state(or in other words, which feature vector belongs to which alphabetic state).
2 - use k-means to find the initial continuous emission parameters, clustering is done over all observations of each state, here the cluster size is 6 (equal to number of mixtures for probability density function), parameters would be sample means, sample covariances and some mixture weights for each cluster.
3 - create initial state-initial and transition probability matrices for each model and training sample individually(left right structure is used in this case), 0 for previous states and 1 for up to 1 next state for transitions and 1 for initial and 0 for others in state initials.
4 - calculate gaussian mixture model based probability density function for each state -> it's corresponding cluster -> assigned to all the vectors in all the training samples for each model
5 - calculate initial emission parameters using the pdf and mixture weights for clusters.
6 - now calculate the gamma variables using initial paramters(transitions, emissions, initials) in forward-backward and initial PDFs, using the continuous formula for gamma..(gamma = probability of being in a certain state at a certain time for any of the mixtures)
7 - estimate new state initials
8 - estimate new state tranisitons
9 - estimate new sample means
10 - estimate new sample covariances
11 - estimate new pdfs
12 - estimate new emissions using new pdfs
repeat the steps from 6 to 12 using new estimated values on each iteration, use viterbi to get an overlook on how the estimating is going and when the probability is not changing anymore, stop and save.
now my issues :
first i don't know if the entire procedure i have followed is correct or not, or is there a better method to approach this...for all i know is that the convergence is pretty fast, for up to 4-5 iterations and it's already not changing anymore, however considering that if i am right then :
it's not possible for me to sit down and pre assign each feature vector to it's state in the beginning at step 1...and i don't think it's a standard procedure either...again i don't even know if i have to do it necessarily, from all my studies it was the best method i could find to get a rapid convergence.
second, say this whole baum welch has done a great job in re estimating and finding local maximums, what's raising my doubt about my baum welch implementation is that how are they later going to help me recognize speech? i assume the estimated parameters are used in viterbi for finding the optimal state for every spoken utterance...if so then emission parameters are not known cause if you look closely you will see that final emission parameters in my algorithm will be assigning each alphabetic state of each model to all the observed signals in each model, other than that...no emission parameters can be found if the signal is not exactly match to the ones used in re-estimation, and it won't obviously work, any attempt to try and match out the signals and find emissions will make the whole HMM lose it's purpose...
again i might have a wrong idea about almost everything here, i would appreciate if you help me understand what i am doing wrong here...if ANYTHING is wrong, notify me please...thank you.

You're attempting to determine the most likely set of phonemes that would have generated the sounds that you're observing - you're not attempting to work out emission parameters, you're working out the most likely set of inputs that would have produced them.
Also, your input corpus is quite small - it's unsurprising that it would converge so quickly. If you're doing this while involved with a university, see if they have access to one of the larger speech corpuses commonly used to train this kind of algorithm on.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart