Deep Reinforcement Learning Training Accuracy - machine-learning

I am using a deep reinforcement learning approach to predict time series behavior. I am quite a newbie at this, so my question is more conceptual than a programming one. My colleague has given me the following chart, showing training, validation and testing accuracy for time series classification using deep reinforcement learning.
From this graph, you can see that both the validation and testing accuracies are random, so, of course, the agent is overfitting.
But what surprises me more (maybe because of my lack of knowledge, which is why I am asking here) is how my colleague trains his agent. On the X-axis of the chart you can find the "epoch" number (or iteration). In other words, the agent is fitted (or trained) several times,
as in the code below:
#initiating the agent
self.agent = DQNAgent(model=self.model, policy=self.policy,
                      nb_actions=self.nbActions, memory=self.memory, nb_steps_warmup=200,
                      target_model_update=1e-1,
                      enable_double_dqn=True, enable_dueling_network=True)

#Compile the agent with the Adam optimizer and with the mean absolute error metric
self.agent.compile(Adam(lr=1e-3), metrics=['mae'])

#there will be 100 iterations, I will fit and test the agent 100 times
for i in range(0, 100):
    #delete previous environments and create new ones
    del(trainEnv)
    trainEnv = SpEnv(parameters)
    del(validEnv)
    validEnv = SpEnv(parameters)
    del(testEnv)
    testEnv = SpEnv(parameters)

    #Reset the callbacks used to show the metrics while training, validating and testing
    self.trainer.reset()
    self.validator.reset()
    self.tester.reset()

    ####TRAINING STEP####
    #Reset the training environment
    trainEnv.resetEnv()
    #Train the agent
    self.agent.fit(trainEnv, nb_steps=floor(self.trainSize.days-self.trainSize.days*0.2), visualize=False, verbose=0)
    #Get metrics from the train callback
    (metrics) = self.trainer.getInfo()
    #################################

    ####VALIDATION STEP####
    #Reset the validation environment
    validEnv.resetEnv()
    #Test the agent on validation data
    self.agent.test(validEnv, other_parameters)
    #Get the info from the validation callback
    (metrics) = self.validator.getInfo()
    ####################################

    ####TEST STEP####
    #Reset the testing environment
    testEnv.resetEnv()
    #Test the agent on testing data
    self.agent.test(testEnv, nb_episodes=floor(self.validationSize.days-self.validationSize.days*0.2), visualize=False, verbose=0)
    #Get the info from the testing callback
    (metrics) = self.tester.getInfo()
What is strange to me, according to the chart and the code, is that the agent is fitted several times, apparently independently of each other, yet the training accuracy increases over time. It seems that previous experience is helping the agent to increase the training accuracy. But how can that be possible if the environments are reset and the agent is simply fitted again? Is there any backpropagation of error from previous fittings that helps the agent increase its accuracy in the next fittings?

What is reset is the environment, not the agent. So the agent actually accumulates experience from every iteration.

The environment is getting reset, but not the agent.
The learnable parameters belong to the agent, not the environment. So the agent's parameters change across all the episodes, i.e. the agent keeps learning every time you call fit.
If the data is the same every time you fit, this only makes the agent overfit to that data distribution.
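A minimal sketch of that point (not the asker's code, just a plain Keras model standing in for the agent): the data, which plays the role of the environment, is recreated on every iteration, but the model object, and therefore its weights, is not, so each fit call continues from where the previous one left off.

import numpy as np
from tensorflow import keras

#the model plays the role of the agent: created once, outside the loop
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

for i in range(5):
    #fresh data every iteration, analogous to del(trainEnv); trainEnv = SpEnv(parameters)
    x = np.random.rand(256, 4)
    y = x.sum(axis=1, keepdims=True)
    model.fit(x, y, epochs=1, verbose=0)
    #the loss keeps decreasing across iterations because the weights persist
    print(i, float(model.evaluate(x, y, verbose=0)))

To make each iteration truly independent, you would also have to rebuild (or re-initialize) the agent and its model inside the loop.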

Related

Training PPO from stable_baselines3 on a grid world that randomizes

I'm new to RL and I was hoping to get some advice from you all:
I created a custom environment that is a 10x10 grid world where the agent and its target destination (as well as some obstacles, namely: fires) can be randomly placed. The state of the env that the model is trained on is just the Box numpy array representing the different positions (0s for empty spaces, 1 for the target, etc.).
[image: what the world could look like]
The PPO model (from stable_baselines3) is unable to learn how to navigate randomly generated worlds even after 5 million time steps of training (each reset of the environment creates a new random world layout). TensorBoard shows only a very slight average reward increase after all that training.
I am able to train the model effectively only if I keep the world layout the same on every reset (so no random placement of the agent, etc.).
So my question is: should PPO be in theory able to deal with random world generation like that or am I trying to make it do something that is beyond its capabilities?
More details: I'm using all default PPO parameters (with MlpPolicy).
The reward system is as follows (a rough code sketch of it appears after this list):
On every step the reward is -0.5 * distance between the agent (smily face) and the target ('$')
If the agent is next to a fire ('X'), it gets -100 reward
If the agent is next to the target ('$'), it gets a reward of 1000 and the episode ends
Max of 200 steps per episode.
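For concreteness, the reward logic described above might look roughly like this. This is purely a hedged sketch: agent_pos, target_pos, fire_positions, the is_adjacent helper and the precedence between the cases are all my assumptions.

import numpy as np

def is_adjacent(a, b):
    #hypothetical helper: adjacency in the 8-neighbourhood of the grid
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) == 1

def step_reward(agent_pos, target_pos, fire_positions, step_count, max_steps=200):
    #returns (reward, done) for one step of the 10x10 grid world
    if is_adjacent(agent_pos, target_pos):                       #next to '$': big terminal bonus
        return 1000.0, True
    if any(is_adjacent(agent_pos, f) for f in fire_positions):   #next to 'X': large punishment
        return -100.0, step_count >= max_steps
    dist = np.linalg.norm(np.array(agent_pos) - np.array(target_pos))
    return -0.5 * dist, step_count >= max_steps                  #distance penalty on every ordinary step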
I would rather try good old off-policy deterministic solutions like DQN for that task, but on-policy stochastic PPO should be able to solve it as well. I recommend changing three things in your design; maybe they will help your training.
First, your reward signal design probably "embarrasses" your network: you have an enormously big positive terminal reward while trying to push your agent to this terminal state asap with small punishments. I would definitely suggest reward normalization for PPO.
Second, if you don't fine-tune your PPO hyperparameters, your entropy coefficient ent_coef remains 0.0; however, the entropy part of your loss function could be very useful in your env. I would try at least 0.01.
Third, PPO would really be enhanced in your case (in my opinion) if you change the MLP policy to a recurrent one. Take a look at RecurrentPPO.
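Put together, those three suggestions could look roughly like this. It's only a sketch, assuming a gymnasium-style custom environment: GridWorldEnv is a placeholder for your env class, and RecurrentPPO lives in the sb3_contrib package.

from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from sb3_contrib import RecurrentPPO

#wrap the custom env (placeholder class) and normalize observations and rewards
env = DummyVecEnv([lambda: GridWorldEnv()])
env = VecNormalize(env, norm_obs=True, norm_reward=True)

model = RecurrentPPO(
    "MlpLstmPolicy",    #recurrent (LSTM) policy instead of the plain MlpPolicy
    env,
    ent_coef=0.01,      #non-zero entropy bonus to keep exploring
    verbose=1,
)
model.learn(total_timesteps=1_000_000)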

In Machine Learning, is it okay to add the development set to the training set after development?

Usually we train our models on the training set, evaluate them on the development set, make some changes, train and evaluate again, etc. (the development phase), and in the end evaluate once on the test set.
Assume we have little training data. Then, it could make sense to use training AND development set after the development phase. One could estimate hyperparameters as usual and in the end (the final training) add the dev set to the training set, train the model with the previously estimated hyperparameters and evaluate it once on the test set.
Would this be "cheating" in any way? Do people do this, or do they usually leave out the dev set from any training?
I don't think it's cheating in any way. If it improves your model against real-world data and your unseen test data, it should be okay. There are reasons why a training/dev/test split is recommended, but if you have such a small training data set, I believe it's a valid strategy. In any case, it's hard to give a definitive answer without knowing more details, such as the nature of the data and the task you would like to accomplish. Another approach you might like to have a look at is data augmentation.
I'd recommend the following course which covers training/dev/test set distribution, among other things:
https://www.coursera.org/learn/machine-learning-projects
Once you have decided on the hyperparameters using the dev set, you can use train + dev to perform the training again. This method is used quite often.
For example, with the GridSearchCV method in sklearn, if you use refit=True, a final training run is performed after the hyperparameter search is done, i.e. with cv=4 and refit=True, each hyperparameter candidate is trained and evaluated 4 times (once per fold), and then one final model is trained on the complete training set with the best hyperparameters.
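A minimal sketch of that refit behaviour with scikit-learn (the estimator, dataset and parameter grid are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10]},
    cv=4,         #4-fold cross-validation for the hyperparameter search
    refit=True,   #after the search, retrain once on all of X, y with the best C
)
search.fit(X, y)

print(search.best_params_)      #chosen hyperparameters
print(search.best_estimator_)   #model refitted on the complete training set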

Using training data and testing data in a shared task

I am working on this shared task http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
which is just a Twitter sentiment analysis task. Since I am pretty new to machine learning, I am not quite sure how to use both the training data and the testing data.
So the shared task provides two similar sets of tweets, one without the result (train) and one with the result.
My current understanding of how to use these kinds of data in machine learning is as follows:
training set: we are supposed to split this into training and testing portions (90% training and 10% testing, maybe?)
But the existence of a separate test set kind of confuses me.
Are we supposed to take the results we got on the 10% portion of the 'training set' and compare them to the actual results in the 'testing set'?
Can someone correct my understanding?
When training a machine learning model, you feed your algorithm the dataset called the training set. At this stage you tell the algorithm what the ground truth of each sample is, so the algorithm learns from each sample you feed it. The training set is usually about 80% of the whole dataset; the other 20% is the testing set, where you also know the ground truth of each sample, but you let your algorithm predict what it thinks the truth is for each sample. All those predictions over the testing set are based on what the algorithm has learned from the training set you fed it before.
After you make all the predictions over your testing set, you can check how accurate your model is by comparing the ground truth to the predictions the model has made.
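As an illustration (not tied to the SemEval files; tweets, labels and test_tweets are placeholder names for the labelled training data and the official unlabelled test data), you could hold out part of the labelled data to check the model before predicting on the official test set:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#hold out 10% of the labelled tweets (placeholder variables) for a quick sanity check
X_train, X_val, y_train, y_val = train_test_split(tweets, labels, test_size=0.1)

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_train), y_train)

val_pred = clf.predict(vec.transform(X_val))
print("held-out accuracy:", accuracy_score(y_val, val_pred))

#the official test tweets are only predicted on, never trained on
test_pred = clf.predict(vec.transform(test_tweets))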

How to correctly evaluate a neural network model?

I am currently doing hyper-parameter optimization for my neural network.
I have train, dev and test files that were given to me. For my hyper-parameter optimization, I run a complete training using the train and dev sets. In the end, I evaluate on the test set for a given combination of parameters.
I am choosing the parameters that maximize the score on the test set. My issue is that I feel that it is wrong since I am sort of leaking the test set.
Is this procedure bad? Should I use optunity to maximize the accuracy on the dev set and in the end report a score on the test set?
Typically the validation (dev) set is used to compare models with various hyper-parameters. Once your preferred model is chosen and trained, you run it on the test set to measure its performance.
Your intuition is correct; using the test set to choose model parameters is, in a sense, using that data to aid in the training procedure, which is not advisable.
The division and usage of train/validation/test sets are discussed in more detail in this post and in this video by Andrew Ng.
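A small sketch of that workflow: pick hyperparameters on the dev set, then touch the test set exactly once. X_train, y_train, X_dev, y_dev, X_test and y_test are placeholders for the three given splits, and the logistic-regression grid is just illustrative.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

#search over a hyperparameter using only the dev set for model selection
best_C, best_dev_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    dev_acc = accuracy_score(y_dev, model.predict(X_dev))
    if dev_acc > best_dev_acc:
        best_C, best_dev_acc = C, dev_acc

#single, final evaluation on the test set with the chosen hyperparameter
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final_model.predict(X_test)))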

Pybrain recurrent network for regression - how to properly kickstart trained network for predictions

I am trying to solve a regression task using a recurrent neural network (I use pybrain to build it). After my network is fit, I want to use it to make predictions. But the prediction of a recurrent network is affected by its previous prediction (which in turn is affected by the prediction before it, etc.).
The question is: once the network is trained and I want to make predictions with it on a dataset, how do I properly kickstart the prediction process? If I just call .activate() on the first example from a dataset for predictions, the recurrent connection will pass 0 to the network and it will affect the subsequent predictions in an undesirable way. Is there a way to force a fully trained recurrent network to think that the previous activation result was some special value? If yes, which value is best here (maybe the mean of the possible activation output values or something like that)?
UPDATE. Ok, since no one had any ideas within a day on how to do this with a recurrent network in pybrain, let me change the formulation a bit and forget about pybrain. Consider that I build a pybrain network for regression (for example, predicting the price of a stock). The network will be used with a dataset which has 10 features. I add one additional feature to the dataset and fill it with the previous price from the dataset. Thus I replicate a recurrent network (the additional input neuron replicates the recurrent connection). The questions are:
1) In the dataset for training I fill this additional feature with the previous price. But what do I do with the FIRST record in the training dataset (I don't know the previous price)? Should I leave it 0? That seems like a bad idea; the previous price WAS NOT zero. Should I use the mean of the prices in the training dataset? Any other suggestions?
2) Again, the same question as #1 but for running the fully trained network against the test dataset. While running my network against the test dataset I should always pick up its prediction and put the result into this new 11th input neuron before making the next prediction. But again, what do I do when I need to run the first prediction in the dataset (since I don't know the previous price)?
This isn't my understanding of recurrent networks at all.
When you initially create a recurrent network, the recurrent connections (say middle layer to middle layer) will be randomized, as with any other connection. This is their starting value. Each time you activate a recurrent network you'll update the state carried over those recurrent connections, and thus your output will be altered.
Carrying this logic forward, if you wrote some code to train a recurrent network and saved it to a file, you'd have in that file a recurrent network ready to go with your real data, albeit the first invocation will contain the recurrent feedback from your last activation during training.
The thing you want to do is make sure that you re-save your recurrent network any time you wish to persist its state. For a simple FFN this wouldn't be an issue because you only change the state during training, but for a recurrent network you'll want to persist the state after any activation because the recurrent state will have been updated.
I don't think it's the case that a recurrent network will be poisoned because of the initial value of the recurrent feedback; certainly I wouldn't trust the first invocation, but given they're designed for sequences, that shouldn't be an issue in either case.
Regarding your updated question, I'm not at all convinced that arbitrarily adding a single input node will simulate this. In point of fact, I suspect you'd absolutely break the network's predictive capabilities. In your example, starting with 10 input nodes, and let's pretend you have 20 middle nodes, just by adding an extra input node you'll generate an additional 20 connections in the network that will be initially randomized. Every additional input node will compound this change, and after 10 additional input nodes you'll have as many randomized connections as trained ones.
I don't see this working, and I certainly don't believe it would simulate recurrent learning in the way you think.
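A small sketch of that state-carrying behaviour, assuming pybrain's buildNetwork with recurrent=True (the layer sizes are arbitrary): successive activations are not independent, and reset() clears the carried-over state if you want to start a new sequence from the "zero" condition the question worries about.

from pybrain.tools.shortcuts import buildNetwork

#10 inputs, 20 hidden units with a recurrent connection, 1 output
net = buildNetwork(10, 20, 1, recurrent=True)

x = [0.5] * 10
print(net.activate(x))   #first activation: the recurrent feedback starts at zero
print(net.activate(x))   #same input, different output: the state was carried over

net.reset()              #clear the carried-over state
print(net.activate(x))   #matches the first activation again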
