What does "soft" in reinforcement learning literature mean? - machine-learning

I have noticed that some papers refer to soft agents, and I thought it just meant an agent whose policy objective includes an entropy term. But now I'm not sure anymore. Can anyone confirm this or offer another explanation?

So, it seems like this is the case. I asked about where the entropy enters SAC on ai.stackexchange and got a good answer, for those interested.

An epsilon-soft policy is a policy that takes every action with probability at least ε/|A(s)| in every state (source: http://incompleteideas.net/sutton/book/RLbook2018.pdf, exercise 4.6, page 82 (104 of the PDF)). A soft policy is a policy that takes every action with positive probability in every state (page 100/122).
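For a concrete illustration, the familiar epsilon-greedy policy is the most common epsilon-soft policy: with probability ε it picks a uniformly random action, so every action keeps probability at least ε/|A(s)|. A minimal sketch, assuming a tabular Q indexed as Q[state][action] (the names are placeholders):

```python
import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon=0.1, rng=None):
    """Sample an action from an epsilon-greedy (hence epsilon-soft) policy.

    Every action is taken with probability at least epsilon / n_actions,
    so no action ever has zero probability.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform random action
    return int(np.argmax(Q[state]))           # exploit: greedy action
```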

Related

Ways to utilize policy learned in reinforcement learning

I would like to cross-check my understanding of reinforcement learning. How easy, difficult, or common is it to train a policy and then reuse the learned policy later on? What I understood so far is that when we stop the training and later start again, it has to start from scratch, i.e. it is not able to benefit from the learned policy. Thank you.
It depends on what specific method you are using, but generally, once a learning method converges, there is no need to keep "training". In the case of Q-learning, for example, which is a model-free, off-policy approach to learning, before the algorithm converges the agent must still take random actions to ensure every relevant point in the Q(s,a) space has been explored. But each individual step takes advantage of the experience gained in prior episodes, so to say that you start from scratch each episode would be incorrect.
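In practice, reusing a learned policy usually just means persisting the learned parameters (a Q-table or network weights) and loading them later. A minimal sketch for a tabular agent, assuming the Q-table is a NumPy array of shape (n_states, n_actions); the file name is arbitrary:

```python
import numpy as np

def save_policy(Q, path="q_table.npy"):
    """Persist the learned Q-table so a later run can reuse it."""
    np.save(path, Q)

def load_policy(path="q_table.npy"):
    """Reload the Q-table and return a greedy action-selection function."""
    Q = np.load(path)
    return lambda state: int(np.argmax(Q[state]))   # no further training needed
```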

Reinforcement Learning: The dilemma of choosing discretization steps and performance metrics for continuous action and continuous state space

I am trying to write an adaptive controller for a control system, namely a power management system, using Q-learning. I recently implemented a toy RL problem for the cart-pole system and worked through the formulation of the helicopter control problem from Andrew Ng's notes. I appreciate how value function approximation is imperative in such situations. However, both of these popular examples have a very small number of possible discrete actions. I have three questions:
1) What is the correct way to handle such problems if you don't have a small number of discrete actions? The dimensionality of my actions and states seems to have blown up and the learning looks very poor, which brings me to my next question.
2) How do I measure the performance of my agent? Since the reward changes in conjunction with the dynamic environment, I cannot fix performance metrics for my continuous RL agent at every time-step. Also, unlike gridworld problems, I can't inspect the Q-value table because of the huge number of state-action pairs, so how do I know my actions are optimal?
3) I have a model for the evolution of the states through time: the state is [Y, U] and Y[t+1] = a*Y[t] + b*A, where A is the action.
Choosing a discretization step for the action A also affects how finely I have to discretize my state variable Y. How do I choose my discretization steps?
Thanks a lot!
You may use a continuous-action reinforcement learning algorithm and completely avoid the discretization issue. I'd suggest you take a look at CACLA.
As for the performance, you need to measure your agent's accumulated reward during an episode with learning turned off. Since your environment is stochastic, take many measurements and average them.
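A minimal sketch of such an evaluation loop, assuming an OpenAI-Gym-style environment (older 4-tuple step API) and a fixed greedy policy function, both of which are placeholders:

```python
import numpy as np

def evaluate(env, policy, n_episodes=100):
    """Average undiscounted return over many episodes, with learning turned off."""
    returns = []
    for _ in range(n_episodes):
        state, done, total = env.reset(), False, 0.0
        while not done:
            state, reward, done, _ = env.step(policy(state))   # no learning update here
            total += reward
        returns.append(total)
    return np.mean(returns), np.std(returns)
```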
Have a look at policy search algorithms. Basically, they directly learn a parametric policy without an explicit value function, thus avoiding the problem of approximating the Q-function for continuous actions (e.g., no discretization of the action space is needed).
One of the easiest and earliest policy search algorithms is policy gradient. Have a look here for a quick survey of the topic, and here for a survey about policy search (there are more recent techniques, but that's a very good starting point).
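To make the idea concrete, here is a minimal sketch of vanilla policy gradient (REINFORCE) with a Gaussian policy over a one-dimensional continuous action; the linear feature map, step sizes, and Gym-style environment are assumptions made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                  # weights of the policy mean: mu(s) = theta @ phi(s)
sigma, alpha, gamma = 0.5, 1e-3, 0.99

def phi(s):
    return np.array([1.0, s, s ** 2])           # hand-crafted state features (assumed)

def run_episode(env):
    """Roll out one episode with the current stochastic policy."""
    traj, s, done = [], env.reset(), False
    while not done:
        a = rng.normal(theta @ phi(s), sigma)    # sample a continuous action
        s_next, r, done, _ = env.step(a)
        traj.append((s, a, r))
        s = s_next
    return traj

def reinforce_update(traj):
    """theta += alpha * G_t * grad log pi(a_t | s_t), for each step of the episode."""
    global theta
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        grad_log_pi = (a - theta @ phi(s)) / sigma ** 2 * phi(s)
        theta = theta + alpha * G * grad_log_pi
```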
In the case of control problems, there is a very simple toy task you can look at: the Linear Quadratic Gaussian regulator (LQG). Here you can find a lecture including this example and also an introduction to policy search and policy gradient.
Regarding your second point, if your environment is dynamic (that is, the reward function or the transition function, or both, change through time), then you need to look at non-stationary policies. That's typically a much more challenging problem in RL.

Incorporating user feedback in an ML model

I have developed an ML model for a binary (0/1) NLP classification task and deployed it in a production environment. The model's prediction is displayed to users, and the users have the option to give feedback (whether the prediction was right or wrong).
How can I continuously incorporate this feedback into my model? From a UX standpoint you don't want a user to correct/teach the system more than two or three times for a specific input; the system should learn fast, i.e. the feedback should be incorporated quickly. (Google Priority Inbox does this in a seamless way.)
How does one build this "feedback loop" through which my system can improve? I have searched a lot on the net but could not find relevant material; any pointers would be of great help.
Please don't say "retrain the model from scratch by including the new data points". That's surely not how Google and Facebook build their smart systems.
To further explain my question, think of Google's spam detector, their Priority Inbox, or their recent "smart replies" feature. It's well known that these systems can learn from and incorporate user feedback quickly.
They incorporate the user feedback fast (i.e. the user has to teach the system the correct output at most 2-3 times per data point before the system starts giving the correct output for that data point), AND they maintain their old learning: they do not start giving wrong outputs on older data points (where the output was right earlier) while incorporating the learning from the new data point.
I have not found any blog/literature/discussion on how to build such systems, i.e. one that explains in detail how to set up this kind of "feedback loop" in ML systems.
I hope my question is a little clearer now.
Update: Some related questions I found are:
Does the SVM in sklearn support incremental (online) learning?
https://datascience.stackexchange.com/questions/1073/libraries-for-online-machine-learning
http://mlwave.com/predicting-click-through-rates-with-online-machine-learning/
https://en.wikipedia.org/wiki/Concept_drift
Update: I still don't have a concrete answer, but such a recipe does exist. Read the section "Learning from the feedback" in the blog post Machine Learning != Learning Machine, in which Jean talks about "adding a feedback ingestion loop to the machine". The same idea appears here, here, and here.
There are a couple of ways to do this:
1) You can incorporate the feedback you get from the user to train only the last layer of your model, keeping the weights of all other layers intact (see the sketch after this list). Intuitively, in the case of a CNN for example, this means you are extracting the features using your model but slightly adjusting the classifier to account for the peculiarities of your specific user.
2) Another way could be to have a global model (trained on your large training set) and a simple, user-specific logistic regression. For the final prediction, you can combine the results of the two models. See this paper by Google on how they do it for their Priority Inbox.
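As a rough sketch of option 1, here is how one might freeze everything but the final layer of a deployed PyTorch model and nudge that layer with user-corrected examples; the model structure and names are illustrative assumptions, not a description of any particular production system:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                    # stand-in for your deployed classifier
    nn.Linear(300, 64), nn.ReLU(),
    nn.Linear(64, 2),                     # final classification layer
)

# Freeze every layer except the last one.
for layer in list(model.children())[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.SGD(model[-1].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def update_on_feedback(x, y):
    """One small gradient step on a batch of user-corrected examples (x, y)."""
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```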
Build one or more simple, lightweight models that can be updated per feedback. Online machine learning offers a number of candidates for this.
Most good online classifiers are linear, in which case we can have a couple of them and achieve non-linearity by combining them via a small, shallow neural net.
https://stats.stackexchange.com/questions/126546/nonlinear-dynamic-online-classification-looking-for-an-algorithm
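A minimal sketch of such an online update using scikit-learn's SGDClassifier, which supports incremental learning via partial_fit; the feature vectorisation step is assumed to happen elsewhere:

```python
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="log_loss")   # online logistic regression ("log" on older sklearn)
CLASSES = [0, 1]                       # all classes must be declared for partial_fit

def incorporate_feedback(feature_vector, correct_label):
    """Update the deployed model with a single user-corrected example."""
    clf.partial_fit([feature_vector], [correct_label], classes=CLASSES)
```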

In Q-learning with function approximation, is it possible to avoid hand-crafting features?

I have little background knowledge of Machine Learning, so please forgive me if my question seems silly.
Based on what I've read, one of the best model-free reinforcement learning algorithms to date is Q-learning, where each (state, action) pair in the agent's world is given a Q-value, and at each state the action with the highest Q-value is chosen. The Q-value is then updated as follows:
Q(s,a) ← (1-α)·Q(s,a) + α·(R(s,a,s') + γ·max_a' Q(s',a')), where α is the learning rate and γ is the discount factor.
Apparently, for problems with high dimensionality, the number of states becomes astronomically large, making Q-value table storage infeasible.
So the practical implementation of Q-learning requires using Q-value approximation via a generalization of states, a.k.a. features. For example, if the agent were Pacman, the features might be:
Distance to closest dot
Distance to closest ghost
Is Pacman in a tunnel?
And then, instead of Q-values for every single state, you would only need Q-values (weights) for every feature.
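For reference, here is a minimal sketch of linear Q-value approximation over such hand-crafted features, using the standard semi-gradient Q-learning update; the feature function is a placeholder standing in for the Pacman features above:

```python
import numpy as np

alpha, gamma, n_features = 0.1, 0.99, 3
w = np.zeros(n_features)     # one weight per feature instead of one Q-value per state

def features(state, action):
    """Placeholder: e.g. distance to closest dot, distance to closest ghost, in-tunnel flag."""
    return np.asarray(state.feature_vector(action), dtype=float)

def q_value(state, action):
    return w @ features(state, action)

def q_update(s, a, r, s_next, legal_actions):
    """Semi-gradient Q-learning: w += alpha * (TD target - Q(s,a)) * features(s,a)."""
    global w
    td_target = r + gamma * max(q_value(s_next, a2) for a2 in legal_actions)
    w += alpha * (td_target - q_value(s, a)) * features(s, a)
```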
So my question is:
Is it possible for a reinforcement learning agent to create or generate additional features?
Some research I've done:
This post mentions A. Geramifard's iFDD method:
http://www.icml-2011.org/papers/473_icmlpaper.pdf
http://people.csail.mit.edu/agf/Files/13RLDM-GQ-iFDD+.pdf
which is a way of "discovering feature dependencies", but I'm not sure if that is feature generation, as the paper assumes that you start off with a set of binary features.
Another paper that I found apropos is Playing Atari with Deep Reinforcement Learning, which "extracts high level features using a range of neural network architectures".
I've read over the paper but still need to flesh out/fully understand their algorithm. Is this what I'm looking for?
Thanks
It seems like you already answered your own question :)
Feature generation is not part of the Q-learning (or SARSA) algorithm itself. However, in a preprocessing step you can use a wide array of algorithms (of which you showed some) to generate/extract features from your data. Combining different machine learning algorithms results in hybrid architectures, a term you might look into when researching what works best for your problem.
Here is an example of using features with SARSA (which is very similar to Q-learning).
Whether the papers you cited are helpful for your scenario, you'll have to decide for yourself. As always with machine learning, your approach is highly problem-dependent. If you're in robotics and it's hard to define discrete states manually, a neural network might be helpful. If you can think of heuristics by yourself (like in the pacman example) then you probably won't need it.

Machine Learning: sign visibility

I work at an airport where we need to determine the visibility conditions for pilots.
To do this, we have signs placed every 200 meters along the runway that allow us to determine how far the visibility is. We have multiple runways, and the visibility needs to be checked every hour.
Right now the visibility check is done manually by a human who looks at the photos from the cameras placed at the end of each runway, which can be tedious.
I'm a programmer who has very little experience with machine learning, but this sounds like an easy problem to automate. How should I approach this problem? Which algorithms should I study? Would OpenCV help me?
Thanks!
I think this can be automated using computer vision techniques, and OpenCV could make the implementation easier. If all the signs are similar, we can train our program to recognize a sign under specific (lighting) conditions. Then we can use the trained classifier to check for the visibility of the signs every hour using a simple script.
Haar-like feature extraction is already available in OpenCV. You can use it to train a cascade classifier, which will output an .xml file, and then use that .xml file to detect the signs regularly.
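A minimal sketch of the detection step with OpenCV's cascade classifier, assuming you have already trained a cascade and saved it; the file names are placeholders:

```python
import cv2

# Load the cascade trained on your sign images (file name is a placeholder).
sign_cascade = cv2.CascadeClassifier("sign_cascade.xml")

def detect_signs(image_path):
    """Return bounding boxes of the signs detected in a runway photo."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return sign_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

boxes = detect_signs("runway_photo.jpg")
print(f"{len(boxes)} signs visible")
```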
I have done a similar project, RTVTR (Real-Time Vehicle Tracking and Recognition), using OpenCV, and it worked great. http://www.youtube.com/watch?v=xJwBT76VEZ4
Answering your questions:
How should I approach this problem?
It depends on the result you want/need to obtain. Is this a "hobby" project (even if job-related), or do you need to build a machine vision system that solves the problem and has to comply with some regulation or standard?
Which algorithms should I study?
I am very interested in your question, but I am not an expert in the field of meteorology, so searching the relevant literature is a time-consuming task for me; I will update this part of the answer in the future. I think several different algorithms will be involved in the solution of the problem: some very general, for example image segmentation algorithms, and some very specific, for example how to measure the visibility.
Update: one of the keyword for searching in the literature is Meteorological Visibility, for example
HAUTIERE, Nicolas, et al. Automatic fog detection and estimation of visibility distance through use of an onboard camera. Machine Vision and Applications, 2006, 17.1: 8-20.
LENOR, Stephan, et al. An Improved Model for Estimating the Meteorological Visibility from a Road Surface Luminance Curve. In: Pattern Recognition. Springer Berlin Heidelberg, 2013. p. 184-193.
Would OpenCV help me?
Yes, I think OpenCV can help by giving you a starting point.
An idea for a naïve algorithm:
Segment the image in order to get the pixel regions belonging to the signs and to the background.
Compute the visibility measure according to some procedure; the measure is computed by a function that takes as input the regions of all the signs and the background region.
The segmentation can be simplified a lot if the signs are always in the same fixed and known position inside the image.
The visibility measure is obviously the core of the algorithm, and it can be computed in a lot of ways...
You can follow a simple approach where you compute the visibility with a mathematical formula based on the average gray level of the signs and background regions.
You can follow a more sophisticated, machine-learning-oriented approach where you implement an algorithm that mimics your current human-based procedure. In this case your problem can be framed as a supervised learning task: you have a set of training examples, each of which is a pair composed of a) the photo of the runway (the input) and b) the visibility related to that photo as determined by a human (the desired output). The system is trained on the training set, and when you give it a new photo as input it will give you back the visibility measure. I think you have a log of past visibility measures (METAR?) and, if you saved the related images too, you already have a relevant amount of data for building a training set and a test set.
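A minimal sketch of the simple gray-level approach mentioned above: given fixed, known sign regions, estimate visibility from the contrast between each sign and the background. The regions, threshold, and contrast formula are illustrative assumptions:

```python
import cv2
import numpy as np

# Fixed pixel regions (x, y, w, h) for each sign, ordered by distance along the runway.
SIGN_REGIONS = [(120, 340, 40, 60), (260, 330, 35, 55)]   # placeholder coordinates
BACKGROUND_REGION = (0, 0, 200, 100)                      # placeholder coordinates

def mean_gray(img, region):
    x, y, w, h = region
    return float(np.mean(img[y:y+h, x:x+w]))

def visible_distance(image_path, contrast_threshold=0.05, spacing_m=200):
    """Count how many signs stand out from the background and convert to metres."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    bg = mean_gray(img, BACKGROUND_REGION)
    visible = 0
    for region in SIGN_REGIONS:
        contrast = abs(mean_gray(img, region) - bg) / (bg + 1e-6)   # Weber-like contrast
        if contrast < contrast_threshold:
            break                       # farther signs are assumed not visible either
        visible += 1
    return visible * spacing_m
```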
Update in the age of Convolutional Neural Networks:
YOU, Yang, et al. Relative CNN-RNN: Learning Relative Atmospheric Visibility from Images. IEEE Transactions on Image Processing, 2018.
Both Tensor's and uvts_cvs's replies are very helpful. While OpenCV mainly aims to recognize the sign pattern or even segment it from the background, when you extract the core feature in your problem, visibility, you may still need to include the background signal in your training set. I assume the manual check of visibility is based on image contrast; if so, the signal-to-noise ratio (SNR) or contrast-to-noise ratio (CNR) is a good feature for learning. A threshold can be defined to classify 'visible' (1) vs. 'invisible' (0). The SNR/CNR can be obtained automatically, especially if your sign positions and sizes are fixed in your camera images.
Gather a whole bunch of photos and videos and propose it as a challenge on Kaggle. I am sure many people would like to try to solve it, even if the reward would not be very high.
You can use the template matching functionality of OpenCV:
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html
Here the template is the sign. If you manage to find a correct match, then the sign is visible. I think you can also get a sense of the scale of the sign in the image from that code.
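A minimal sketch of that idea with cv2.matchTemplate, assuming you have a cropped template image of a sign; the file names and match threshold are placeholders:

```python
import cv2

template = cv2.imread("sign_template.png", cv2.IMREAD_GRAYSCALE)

def sign_visible(image_path, threshold=0.7):
    """Return True if the sign template is found in the runway photo."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, _ = cv2.minMaxLoc(result)
    return max_val >= threshold

print(sign_visible("runway_photo.jpg"))
```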
As this is a very controlled and static environment, you have perfect conditions to estimate the visibility with vision-based approaches. Nonetheless, it is not so easy to decide which approach to take. In my thesis, I am reviewing this topic in depth for the less well-controlled environment of road traffic. See: LENOR, Stephan. Model-Based Estimation of Meteorological Visibility in the Context of Automotive Camera Systems. 2016. Doktorarbeit. (https://archiv.ub.uni-heidelberg.de/volltextserver/20855/1/20160509_lenor_thesis_final_print.pdf).
I see two major directions you could follow:
Model-based approaches. Advantages: not so dependent on your very specific setup, and you do not need heavy data collection.
Data-based approaches/ML. Advantages: they can hide the whole complexity of different light and weather conditions; you seem to have a good source of data if there are people doing the job right now; and they are very promising without much engineering effort (just use a lightweight CNN with a few layers or so; see the sketch below).
You could also combine both approaches. If you are still interested in a solution, you can contact me again and I am happy to consult in more depth.
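As a rough sketch of the lightweight CNN mentioned above, here is a small model framed as a regression from the runway photo to a visibility distance; the architecture and image size are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VisibilityCNN(nn.Module):
    """Small CNN mapping a grayscale runway photo to a visibility distance in metres."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 32 * 32, 64), nn.ReLU(),
            nn.Linear(64, 1),            # predicted visibility in metres
        )

    def forward(self, x):                # x: (batch, 1, 128, 128) grayscale images
        return self.head(self.features(x))

model = VisibilityCNN()
print(model(torch.zeros(1, 1, 128, 128)).shape)   # torch.Size([1, 1])
```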

Resources