Ways to utilize policy learned in reinforcement learning

I would like to cross-check my understanding of reinforcement learning. How easy, difficult, or common is it to train a policy and then reuse the learned policy later on? My understanding so far is that if we stop training and then start again, we would need to start from scratch, i.e. we would not be able to benefit from the learned policy. Thank you.

It depends on the specific method you are using, but generally, once a learning method converges, there is no need to keep training: the learned policy can be used as is. In the case of Q-learning, for example, which is a model-free off-policy approach to learning, the agent must still take random actions before the algorithm converges, to ensure every relevant point in the Q(s,a) space has been explored. But each individual step takes advantage of the experience gained from prior episodes, so it would be incorrect to say that you start from scratch each episode.
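To make the reuse point concrete, here is a minimal sketch of training a tabular Q-function, saving it, and loading it later to act greedily without further training. The toy chain environment, the function names, and the JSON file format are all assumptions made for illustration; they are not from the original posts.

```python
import json
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 0 moves left, action 1 moves right.
    Reaching the last state gives reward 1 and ends the episode."""
    rng = random.Random(seed)
    q = {str(s): [0.0, 0.0] for s in range(n_states)}
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            row = q[str(s)]
            # Explore with probability epsilon, or when both actions look equal.
            if rng.random() < epsilon or row[0] == row[1]:
                a = rng.randrange(2)
            else:
                a = 0 if row[0] > row[1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Standard Q-learning update toward r + gamma * max_a' Q(s', a').
            row[a] += alpha * (r + gamma * max(q[str(s2)]) - row[a])
            s = s2
    return q

def save_policy(q, path):
    with open(path, "w") as f:
        json.dump(q, f)

def load_policy(path):
    with open(path) as f:
        return json.load(f)

def act(q, state):
    """Greedy action from a stored Q-table: no exploration, no learning."""
    row = q[str(state)]
    return 0 if row[0] > row[1] else 1
```

A later session could call `load_policy` and either use `act` directly or keep updating the loaded table, so stopping training does not throw away what was learned.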

Related

Is reinforcement learning applicable to a RANDOM environment?

I have a fundamental question on the applicability of reinforcement learning (RL) on a problem we are trying to solve.
We are trying to use RL for inventory management, where the demand is entirely random (it probably has a pattern in real life, but for now let us assume that we have been forced to treat it as purely random).
As I understand, RL can help learn how to play a game (say chess) or help a robot learn to walk. But all games have rules and so does the ‘cart-pole’ (of OpenAI Gym) – there are rules of ‘physics’ that govern when the cart-pole will tip and fall over.
For our problem there are no rules – the environment changes randomly (demand made for the product).
Is RL really applicable to such situations?
If it is, then what will improve the performance?
Further details:
- The only two stimuli available from the ‘environment’ are the currently available level of product 'X' and the current demand 'Y'
- And the ‘action’ is binary - do I order a quantity 'Q' to refill or do I not (discrete action space).
- We are using DQN and an Adam optimizer.
Our results are poor - I admit I have trained only for about 5,000 or 10,000 - should I let it train on for days because it is a random environment?
thank you
Rajesh

You are saying random in the sense of non-stationary, so, no, RL is not the best here.
Reinforcement learning assumes your environment is stationary. The underlying probability distribution of your environment (both transition and reward function) must be held constant throughout the learning process.
Sure, RL and DRL can deal with some slightly non-stationary problems, but they struggle with them. Markov Decision Processes (MDPs) and Partially-Observable MDPs assume stationarity. So value-based algorithms, which are specialized in exploiting MDP-like environments, such as SARSA, Q-learning, DQN, DDQN, Dueling DQN, etc., will have a hard time learning anything in non-stationary environments. The more you go towards policy-based algorithms, such as PPO or TRPO, or, even better, gradient-free methods such as GA, CEM, etc., the better chance you have, as these algorithms don't try to exploit this assumption. Also, playing with the learning rate would be essential to make sure the agent never stops learning.
Your best bet is to go towards black-box optimization methods such as Genetic Algorithms, etc.
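As a rough illustration of the black-box route, the sketch below uses the cross-entropy method (CEM) to tune a single policy parameter: a reorder threshold below which the fixed quantity is ordered. The inventory simulator, its cost values, and the uniform demand distribution are assumptions invented for this example, not details from the question.

```python
import random
import statistics

def episode_return(threshold, rng, steps=200, order_qty=20,
                   hold_cost=0.1, stockout_cost=2.0, order_cost=1.0):
    """Simulate one episode of a hypothetical inventory model with random demand."""
    stock, total = 20.0, 0.0
    for _ in range(steps):
        if stock < threshold:            # binary action: order a fixed quantity or not
            stock += order_qty
            total -= order_cost
        demand = rng.randint(0, 10)      # assumed demand distribution
        unmet = max(0.0, demand - stock)
        stock = max(0.0, stock - demand)
        total -= hold_cost * stock + stockout_cost * unmet
    return total

def cem_threshold(iterations=20, population=30, elite=6, seed=0):
    """Cross-entropy method over one policy parameter (the reorder threshold)."""
    rng = random.Random(seed)
    mean, std = 0.0, 10.0
    for _ in range(iterations):
        candidates = [rng.gauss(mean, std) for _ in range(population)]
        # Score each candidate with one noisy rollout and keep the best few.
        scored = sorted(candidates, key=lambda th: episode_return(th, rng), reverse=True)
        best = scored[:elite]
        mean = statistics.mean(best)
        std = statistics.stdev(best) + 1e-3   # keep a little exploration
    return mean
```

Calling `cem_threshold()` returns the mean of the final sampling distribution. Because the method only looks at episode returns, it makes no use of the Bellman structure, which is the property the answer above is relying on.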

Randomness can be handled by replacing the single average reward output with a distribution over possible values. By introducing a new learning rule, reflecting the transition from Bellman's (average) equation to its distributional counterpart, the value distribution approach has been able to surpass the performance of all other comparable approaches.
https://www.deepmind.com/blog/going-beyond-average-for-reinforcement-learning

Reinforcement Learning: The dilemma of choosing discretization steps and performance metrics for continuous action and continuous state space

I am trying to write an adaptive controller for a control system, namely a power management system, using Q-learning. I recently implemented a toy RL problem for the cart-pole system and worked out the formulation of the helicopter control problem from Andrew Ng's notes. I appreciate how value function approximation is imperative in such situations. However, both these popular examples have a very small number of possible discrete actions. I have three questions:
1) What is the correct way to handle such problems if you don't have a small number of discrete actions? The dimensionality of my actions and states seems to have blown up and the learning looks very poor, which brings me to my next question.
2) How do I measure the performance of my agent? Since the reward changes in conjunction with the dynamic environment, at every time-step I can't decide the performance metrics for my continuous RL agent. Also unlike gridworld problems, I can't check the Q-value table due to huge state-action pairs, how do I know my actions are optimal?
3) I have a model for the evolution of states through time: States = [Y, U], and Y[t+1] = aY[t] + bA, where A is an action.
Choosing the discretization step for actions A will also affect how finely I have to discretize my state variable Y. How do I choose my discretization steps?
Thanks a lot!

You may use a continuous-action reinforcement learning algorithm and completely avoid the discretization issue. I'd suggest you take a look at CACLA.
As for the performance, you need to measure your agent's accumulated reward during an episode with learning turned off. Since your environment is stochastic, take many measurements and average them.
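A minimal sketch of that evaluation procedure follows: run a fixed policy with no exploration and no updates, repeat over many episodes, and report the mean return with its standard error. The one-dimensional environment here is a hypothetical stand-in (its coefficients and noise level are assumptions), loosely shaped like the Y[t+1] = aY[t] + bA model in the question.

```python
import random
import statistics

def evaluate(policy, run_episode, n_episodes=100, seed=0):
    """Average return of a fixed policy over many episodes, with its standard error."""
    rng = random.Random(seed)
    returns = [run_episode(policy, rng) for _ in range(n_episodes)]
    mean = statistics.mean(returns)
    stderr = statistics.stdev(returns) / len(returns) ** 0.5
    return mean, stderr

def noisy_tracking_episode(policy, rng, steps=50):
    """Hypothetical stochastic environment: keep a noisy scalar state near zero."""
    y, total = 1.0, 0.0
    for _ in range(steps):
        a = policy(y)                    # action from the frozen policy, no exploration
        y = 0.9 * y + 0.5 * a + rng.gauss(0.0, 0.1)
        total -= y * y                   # reward is the negative squared error
    return total
```

For example, `evaluate(lambda y: -y, noisy_tracking_episode)` and `evaluate(lambda y: 0.0, noisy_tracking_episode)` can be compared directly; the standard error gives a sense of whether a difference between two policies is larger than the noise.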

Have a look at policy search algorithms. Basically, they directly learn a parametric policy without an explicit value function, thus avoiding the problem of approximating the Q-function for continuous actions (eg, no discretization of the action space is needed).
One of the easiest and earliest policy search algorithms is policy gradient. Have a look here for a quick survey about the topic. And here for a survey about policy search (currently, there are more recent techniques, but that's a very good starting point).
In the case of control problems, there is a very simple toy task you can look at, the Linear Quadratic Gaussian Regulator (LQG). Here you can find a lecture including this example and also an introduction to policy search and policy gradient.
Regarding your second point, if your environment is dynamic (that is, the reward function or the transition function (or both) change through time), then you need to look at non-stationary policies. That's typically a much more challenging problem in RL.

In Q-learning with function approximation, is it possible to avoid hand-crafting features?

I have little background knowledge of Machine Learning, so please forgive me if my question seems silly.
Based on what I've read, the best model-free reinforcement learning algorithm to date is Q-Learning, where each (state, action) pair in the agent's world is given a q-value, and at each state the action with the highest q-value is chosen. The q-value is then updated as follows:
Q(s,a) = (1-α)Q(s,a) + α(R(s,a,s') + γ max_a' Q(s',a')) where α is the learning rate and γ is the discount factor.
Apparently, for problems with high dimensionality, the number of states becomes astronomically large, making q-value table storage infeasible.
So the practical implementation of Q-Learning requires using Q-value approximation via generalization of states aka features. For example if the agent was Pacman then the features would be:
Distance to closest dot
Distance to closest ghost
Is Pacman in a tunnel?
And then, instead of q-values for every single state, you would only need to have a weight for every single feature.
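The feature-based idea described above is usually written as a linear approximation, Q(s,a) = w · f(s,a), where the weights are learned and the features are computed for the state that results from taking action a. The sketch below shows one update step under that assumption; the function names, default step size, and discount factor are illustrative choices, not part of the question.

```python
def q_value(weights, features):
    """Linear approximation: Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def q_update(weights, features, reward, next_features_per_action,
             alpha=0.1, gamma=0.9):
    """One approximate Q-learning step for a linear approximator.

    `features` is f(s, a) for the action taken; `next_features_per_action`
    holds f(s', a') for every action a' available in s' (empty if s' is terminal).
    """
    best_next = max((q_value(weights, f) for f in next_features_per_action),
                    default=0.0)
    td_error = reward + gamma * best_next - q_value(weights, features)
    # Each weight moves in proportion to how active its feature was.
    return [w + alpha * td_error * f for w, f in zip(weights, features)]
```

With two features such as (distance to closest dot, distance to closest ghost), the agent stores two weights instead of a table entry per state.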
So my question is:
Is it possible for a reinforcement learning agent to create or generate additional features?
Some research I've done:
This post mentions A Geramifard's iFDD method
http://www.icml-2011.org/papers/473_icmlpaper.pdf
http://people.csail.mit.edu/agf/Files/13RLDM-GQ-iFDD+.pdf
which is a way of "discovering feature dependencies", but I'm not sure if that is feature generation, as the paper assumes that you start off with a set of binary features.
Another paper that I found was apropos is Playing Atari with Deep Reinforcement Learning, which "extracts high level features using a range of neural network architectures".
I've read over the paper but still need to flesh out/fully understand their algorithm. Is this what I'm looking for?
Thanks

It seems like you already answered your own question :)
Feature generation is not part of the Q-learning (and SARSA) algorithm. In a process which is called preprocessing you can however use a wide array of algorithms (of which you showed some) to generate/extract features from your data. Combining different machine learning algorithms results in hybrid architectures, which is a term you might look into when researching what works best for your problem.
Here is an example of using features with SARSA (which is very similar to Q-learning).
Whether the papers you cited are helpful for your scenario, you'll have to decide for yourself. As always with machine learning, your approach is highly problem-dependent. If you're in robotics and it's hard to define discrete states manually, a neural network might be helpful. If you can think of heuristics by yourself (like in the pacman example) then you probably won't need it.

Online machine learning for obstacle crossing or bypassing

I want to program a robot which will sense obstacles and learn whether to cross over them or bypass around them.
Since my project must be completed within a week and a half, I must use an online learning algorithm (a GA or similar would take a lot of time to test, because the robot needs to try to cross over the obstacle in order to determine whether it is possible to cross).
I'm really new to online learning so I don't really know which online learning algorithm to use.
It would be a great help if someone could recommend me a few algorithms that would be the best for my problem and some link with examples wouldn't hurt.
Thanks!

I think you could start with A* (A-Star)
It's simple and robust, and widely used.
There are some nice tutorials on the web like this http://www.raywenderlich.com/4946/introduction-to-a-pathfinding

An online algorithm is simply one that can collect new data and update a model incrementally, without retraining on the full dataset (i.e. it can be used in an online service that runs all the time). What you are probably looking for is reinforcement learning.
RL itself is not a method, but rather a general approach to the problem. Many concrete methods may be used with it. Neural networks have proven to do well in this field (useful course). See, for example, this paper.
However, to create a real robot that is able to bypass obstacles you will need much more than just knowing about neural networks. You will need to set up sensors carefully, preprocess data from them, work out your model and collect a dataset. I'm not sure it's even possible to learn it all in a week and a half.
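To illustrate what "update a model incrementally" means in practice, here is a small sketch of an online learner: a linear regressor that adjusts its parameters after every single example using stochastic gradient descent, without ever storing the dataset. The class name and learning rate are assumptions chosen for this example.

```python
class OnlineLinearRegressor:
    """Linear model trained one example at a time with stochastic gradient descent."""

    def __init__(self, n_features, lr=0.01):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.b + sum(wi * xi for wi, xi in zip(self.w, x))

    def update(self, x, y):
        """Take one gradient step on the squared error for a single (x, y) pair."""
        err = self.predict(x) - y
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err
        return err
```

The usage pattern is predict first, observe the true value, then call `update` with it, so the model keeps improving while it is in use.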

Best approach to what I think is a machine learning problem [closed]

I would like some expert guidance on the best approach to solving a problem. I have investigated some machine learning, neural networks, and things like that. I've looked at Weka, some sort of Bayesian solution, R, and several other things. I'm not sure how to proceed, though. Here's my problem.
I have, or will have, a large collection of events.. eventually around 100,000 or so. Each event consists of several (30-50) independent variables, and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. And, these events are time relevant. Things that occur today are more important than events that occurred 10 years ago.
I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.
Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.
Thanks!
Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I'm wanting to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic, or a Hello Kitty comic. How old is it? What's the condition? etc etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on Support vector machines and Naive Bayes. Thanks for all of your help so far.

Sounds like you're a candidate for Support Vector Machines.
Go get libsvm. Read "A practical guide to SVM classification", which they distribute, and is short.
Basically, you're going to take your events, and format them like:
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
run it through their svm-scale utility, and then use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
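The input format sketched above is one line per event: the dependent variable first, then `index:value` pairs with indices starting at 1. A small helper for producing such lines might look like the following; the function names are made up for this example, and dropping zero-valued features relies on the format being sparse.

```python
def to_libsvm_line(target, features):
    """Format one event as '<target> 1:<v1> 2:<v2> ...' (feature indices start at 1)."""
    # Zero-valued features can be left out because the format is sparse.
    parts = [f"{i}:{v}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([str(target)] + parts)

def write_libsvm_file(rows, path):
    """Write (target, features) pairs to a text file, one event per line."""
    with open(path, "w") as f:
        for target, features in rows:
            f.write(to_libsvm_line(target, features) + "\n")
```

Adding time as a feature, as suggested above, just means appending one more value (for example the event's age in days) to each feature list before formatting.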
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.

If you have some classified data (a bunch of sample problems paired with their correct answers), start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and seeing if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know if you can solve it simply or at all.
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.

It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.

This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed it) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate and easy to implement. However, with 100,000 instances and only 30-50 features you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest (Yes, it's a silly name, but it helped random forest catch on.) If you want the advantage of easy interpretability of your classifier even at the expense of some accuracy, maybe a straight up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forest.

The book Programming Collective Intelligence has a worked example with source code of a price predictor for laptops which would probably be a good starting point for you.

SVM's are often the best classifier available. It all depends on your problem and your data. For some problems other machine learning algorithms might be better. I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it is highly situationally dependent but I agree with dsimcha and Jay that SVM's are the right place to start.

I believe your problem is a regression problem, not a classification problem. The main difference: In classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear Regression is what most people try first. There are lots of other regression techniques, if linear regression doesn't do the trick.

You mentioned that you have 30-50 independent variables, and some are more important than the rest. So, assuming that you have historical data (what we call a training set), you can use PCA (Principal Component Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on the situation, you may get better results by keeping every variable but adding a weight to each one based on how relevant it is. Here, PCA can help you compute how "relevant" each variable is.
You also mentioned that events that occurred more recently should be more important. If that's the case, you can weight recent events higher and older events lower. Note that the importance of an event doesn't have to grow linearly with time. It may make more sense for it to grow exponentially, so you can play with the numbers here. Or, if you are not short of training data, perhaps you can consider dropping data that are too old.
Like Yuval F said, this does look more like a regression problem than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is the regression version of SVM (Support Vector Machine).
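One simple way to express exponential recency weighting is a half-life: an event's weight halves every fixed number of days. The sketch below assumes ages are measured in days and uses a one-year half-life purely as a placeholder; both are choices for illustration, not values from the question.

```python
import math

def recency_weights(ages_in_days, half_life_days=365.0):
    """Exponential decay: an event half_life_days old gets half the weight of a new one."""
    decay = math.log(2) / half_life_days
    return [math.exp(-decay * age) for age in ages_in_days]

def weighted_mean(values, weights):
    """Weighted average, e.g. a recency-weighted baseline prediction."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

Many learners accept per-sample weights, so the output of `recency_weights` could be passed along with the training set; dropping very old data corresponds to setting those weights to zero.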
Some other things you can try are:
Play around with how you scale the value range of your independent variables. Say, usually [-1...1] or [0...1]. But you can try other ranges to see if they help. Sometimes they do. Most of the time they don't.
If you suspect that there is a "hidden" feature vector with a lower dimension, say N << 30, and that it is non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or, more recently, manifold sculpting.
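For the scaling suggestion above, a minimal sketch of min-max scaling to a chosen range is shown here. The helper name is hypothetical; the important detail is that the minimum and maximum are taken from the training data and then reused unchanged for new events.

```python
def fit_scaler(column, lo=-1.0, hi=1.0):
    """Return a function that maps values of one variable into [lo, hi].

    The range is fitted on the training column and reused for later events.
    """
    cmin, cmax = min(column), max(column)
    span = (cmax - cmin) or 1.0          # avoid division by zero for constant columns
    return lambda v: lo + (hi - lo) * (v - cmin) / span
```

For example, `scale = fit_scaler([0, 5, 10])` maps 0 to -1.0, 5 to 0.0 and 10 to 1.0; passing `lo=0.0` gives the [0...1] variant.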

What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around? If I were you, I would run through a list of supervised learning algorithms (I don't completely understand why people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross validation, which is the default in Weka if I remember correctly, and see what results you get! I would try:
-Neural Nets
-SVMs
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy and you really can get some useful information. I just took a machine learning class and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me the boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algs to usually enhance their results.)
A nice thing about using Decision Trees (if you use ID3 or a similar variety) is that the algorithm chooses the attributes to split on in order of how well they differentiate the data; in other words, which attributes determine the classification the quickest. So you can check out the tree after running the algorithm and see which attribute of a comic book most strongly determines the price: it should be the root of the tree.
Edit: I think Yuval is right, I wasn't paying attention to the problem of discretizing your price value for the classification. However, I don't know if regression is available in Weka, and you can still pretty easily apply classification techniques to this problem. You need to make classes of price values, as in, a number of ranges of prices for the comics, so that you can have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification on it.
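The price-class idea in the edit above could be sketched as follows. This version uses equal-frequency bins, so each class covers roughly the same number of training prices; that choice and the helper name are assumptions, and fixed-width price ranges would work the same way.

```python
import bisect

def make_price_binner(prices, n_classes=10):
    """Build a function that maps a price to a class label 1..n_classes.

    Bin edges are taken from the sorted training prices so that each class
    holds roughly the same number of examples.
    """
    ordered = sorted(prices)
    edges = [ordered[len(ordered) * k // n_classes] for k in range(1, n_classes)]
    return lambda p: bisect.bisect_right(edges, p) + 1
```

The class labels then replace the raw price as the dependent variable for any classifier; the cost is that predictions are only as precise as the width of a bin.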
