I'm trying to find an optimal policy in an environment with continuous states (dim. = 20) and discrete actions (3 possible actions). There is a specific aspect: under the optimal policy, one action (call it "action 0") should be chosen much more frequently than the other two (~100 times more often; those two actions are riskier).
I've tried Q-learning with NN value-function approximation. The results were rather bad: the NN learns to always choose "action 0". I think that policy gradient methods (on the NN weights) may help, but I don't understand how to use them with discrete actions.
Could you give some advice on what to try (algorithms, papers to read)?
What are the state-of-the-art RL algorithms when state space is continuous and action space is discrete?
Thanks.
Applying Q-learning in continuous (state and/or action) spaces is not a trivial task. This is especially true when trying to combine Q-learning with a global function approximator such as a NN (I understand that you refer to the common multilayer perceptron and the backpropagation algorithm). You can read more on Rich Sutton's page. A better (or at least easier) solution is to use local approximators, such as Radial Basis Function networks (there is a good explanation of why in Section 4.1 of this paper).
On the other hand, the dimensionality of your state space may be too high to use local approximators. Thus, my recommendation is to use other algorithms instead of Q-learning. A very competitive algorithm for continuous states and discrete actions is Fitted Q Iteration, which is usually combined with tree methods to approximate the Q-function.
Finally, a common practice when the number of actions is low, as in your case, is to use an independent approximator for each action, i.e., instead of a single approximator that takes the state-action pair as input and returns a Q value, use three approximators, one per action, that take only the state as input. You can find an example of this in Example 3.1 of the book Reinforcement Learning and Dynamic Programming Using Function Approximators.
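To make the last two suggestions concrete, here is a minimal sketch of fitted Q iteration with one tree-based approximator per action. It assumes you have already collected a batch of transitions with some exploratory policy; the variable names, discount factor, and iteration count are illustrative assumptions, not something prescribed by the algorithm.

    # Minimal sketch: fitted Q iteration with one tree regressor per discrete action.
    # Assumes `transitions` is a list of (state, action, reward, next_state, done)
    # tuples, and that every action occurs at least once in the batch.
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    n_actions, gamma, n_iterations = 3, 0.99, 50      # illustrative values

    def fitted_q_iteration(transitions):
        states      = np.array([t[0] for t in transitions])
        actions     = np.array([t[1] for t in transitions])
        rewards     = np.array([t[2] for t in transitions])
        next_states = np.array([t[3] for t in transitions])
        dones       = np.array([t[4] for t in transitions], dtype=float)

        # One approximator per action, each taking only the state as input.
        models = [ExtraTreesRegressor(n_estimators=50) for _ in range(n_actions)]
        targets = rewards.copy()                      # first iteration: Q ~ immediate reward

        for _ in range(n_iterations):
            for a in range(n_actions):
                mask = actions == a
                models[a].fit(states[mask], targets[mask])
            # Bellman backup: r + gamma * max_a' Q(s', a')
            next_q = np.column_stack([m.predict(next_states) for m in models])
            targets = rewards + gamma * (1.0 - dones) * next_q.max(axis=1)
        return models

The greedy policy is then simply the argmax over the three per-action predictions for a given state.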
I want to use a feature selection method where "combinations" of features or "between features" interactions are considered for a simple linear regression.
SelectKBest only looks at each feature's relationship to the target, one at a time, and ranks them by Pearson's R values. While this is quick, I'm afraid it ignores some important interactions between features.
Recursive Feature Elimination first uses ALL my features, fits a Linear Regression model, and then kicks out the feature with the smallest absolute coefficient. I'm not sure whether this accounts for "between features" interaction... I don't think so, since it simply kicks out the smallest-coefficient feature one at a time until it reaches your designated number of features.
What I'm looking for, from the seasoned feature selection practitioners out there, is a method to find the best subset or combination of features. I read through all the Feature Selection documentation and can't find a method that describes what I have in mind.
Any tips will be greatly appreciated!
For this case, you might consider using Lasso (or, actually, its elastic net refinement). Lasso attempts to minimize linear least squares, but with absolute-value penalties on the coefficients. Some results from convex-optimization theory (mainly on duality) show that this penalty takes "between feature" interactions into account and removes the more inferior of correlated features. Since Lasso is known to have some shortcomings (it is constrained in the number of features it can pick, for example), a newer variant is the elastic net, which penalizes both absolute-value and squared terms of the coefficients.
In sklearn, sklearn.linear_model.ElasticNet implements this. Note that this algorithm requires you to tune the penalties, which you'd typically do using cross validation. Fortunately, sklearn also contains sklearn.linear_model.ElasticNetCV, which allows very efficient and convenient searching for the values of these penalty terms.
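A minimal sketch of that workflow (the data here is a random placeholder; only the ElasticNetCV call matters):

    import numpy as np
    from sklearn.linear_model import ElasticNetCV

    X = np.random.randn(200, 10)                        # placeholder features
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * np.random.randn(200)

    # Cross-validated search over the penalty strength and the L1/L2 mix.
    model = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)

    kept = np.flatnonzero(model.coef_)                  # features with non-zero coefficients
    print("kept features:", kept, "chosen alpha:", model.alpha_)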
I believe you have to generate your feature combinations first and only then apply the feature selection step. You may use http://scikit-learn.org/stable/modules/preprocessing.html#generating-polynomial-features to generate the feature combinations.
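A small sketch of that two-step idea (placeholder data; PolynomialFeatures generates the pairwise interaction terms, then any selector can be applied to the expanded matrix):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.feature_selection import SelectKBest, f_regression

    X = np.random.randn(200, 5)                             # placeholder features
    y = X[:, 0] * X[:, 1] + 0.1 * np.random.randn(200)      # target driven by an interaction

    poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
    X_inter = poly.fit_transform(X)                         # originals + pairwise products

    selector = SelectKBest(score_func=f_regression, k=5).fit(X_inter, y)
    print("selected columns:", np.flatnonzero(selector.get_support()))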
I am trying to understand some machine learning terminology: parameters, hyperparameters, and structure -- all used in a Bayes-net context. 1) In particular, how is structure different from parameters or hyperparameters? 2) What does parameterize mean? Thanks.
In general (though the exact definitions may vary across authors/papers/models):
structure - describes how the elements of your graph/model are connected/organized; it is usually a generic description of how information flows, often expressed as a directed graph. At the level of structure you typically omit details such as the specific models used. Example: a logistic regression model consists of an input node and an output node, where the output node produces P(y|x).
parametrization - since the common language of the Bayesian (and the whole ML) approach is the language of probability, many models are expressed in terms of probabilities or other quantities which are nice mathematical objects but cannot be implemented/optimized/used as such; they are just abstract concepts. Parametrization is the process of taking such an abstract object and narrowing down the space of possible values to a set of functions that are parametrized (usually by real-valued vectors/matrices/tensors). For example, our P(y|x) of logistic regression can be parametrized via a linear function of x through P(y|x) = 1/(1 + exp(-<x, w>)), where w is a real-valued vector of parameters.
parameters - as seen above, these are elements of your model, introduced during parametrization, which are usually learnable; meaning that you can provide reasonable mathematical ways of finding their best values. For example, in the above example w is a parameter, learnable during probability maximization using, for example, stochastic gradient descent (SGD).
hyperparameters - these are values, very similar to parameters, for which you cannot really provide nice learning schemes, usually due to their non-continuous nature, often altering the structure itself. For example, in a neural net a hyperparameter is the number of hidden units. You cannot differentiate through this element, so SGD cannot really learn this value; you have to set it a priori, or use some meta-learning technique (which is often extremely inefficient). In general, the distinction between parameter and hyperparameter is very fuzzy and context-dependent - they can change assignment. For example, if you apply a genetic algorithm to learn the hyperparameters of a neural net, those hyperparameters of the neural net become parameters of the model being learned by the GA. (A small sketch illustrating the parameter/hyperparameter distinction follows this list.)
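A small sketch of the distinction, reusing the logistic-regression parametrization from above (all names and values here are illustrative):

    import numpy as np

    # lr and n_steps are hyperparameters: set a priori, not learned by gradient descent.
    # w holds the parameters: learned from data, with P(y|x) = 1/(1 + exp(-<x, w>)).
    def fit_logistic(X, y, lr=0.1, n_steps=1000):
        w = np.zeros(X.shape[1])
        for _ in range(n_steps):
            p = 1.0 / (1.0 + np.exp(-X @ w))
            w += lr * X.T @ (y - p) / len(y)   # gradient ascent on the log-likelihood
        return w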
STRUCTURE
The structure, or topology, of the network should capture qualitative relationships between variables. In particular, two nodes should be connected directly if one affects or causes the other, with the arc indicating the direction of the effect.
Let's consider an example: we might ask what factors affect a patient's chance of having cancer. If the answer is "Pollution and smoking," then we should add arcs from Pollution and Smoker to Cancer. Similarly, having cancer will affect the patient's breathing and the chances of having a positive X-ray result, so we add arcs from Cancer to Dyspnoea and XRay. The resulting structure therefore has arcs Pollution -> Cancer, Smoker -> Cancer, Cancer -> Dyspnoea, and Cancer -> XRay.
Structure terminology and layout
In talking about network structure it is useful to employ a family metaphor: a node is a parent of a child if there is an arc from the former to the latter. Extending the metaphor, if there is a directed chain of nodes, one node is an ancestor of another if it appears earlier in the chain, whereas a node is a descendant of another node if it comes later in the chain. In our example, the Cancer node has two parents, Pollution and Smoker, while Smoker is an ancestor of both XRay and Dyspnoea. Similarly, XRay is a child of Cancer and a descendant of Smoker and Pollution. The set of parent nodes of a node X is given by Parents(X).
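As a small illustration (plain Python, node names taken from the example above), the structure can be stored as a parent dictionary, which directly gives Parents(X) and, by recursion, the ancestors:

    parents = {
        "Pollution": [],
        "Smoker": [],
        "Cancer": ["Pollution", "Smoker"],
        "XRay": ["Cancer"],
        "Dyspnoea": ["Cancer"],
    }

    def ancestors(node):
        # All nodes reachable by following parent arcs backwards.
        result = set()
        for p in parents[node]:
            result.add(p)
            result |= ancestors(p)
        return result

    print(parents["Cancer"])   # ['Pollution', 'Smoker']
    print(ancestors("XRay"))   # {'Cancer', 'Pollution', 'Smoker'}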
By convention, for easier visual examination of BN structure, networks are usually laid out so that the arcs generally point from top to bottom. This means that the BN “tree” is usually depicted upside down, with roots at the top and leaves at the bottom!
To add to lejlot's answer, I would like to say a few words about the term "parameter".
For many algorithms, a synonym for parameter is weight. This is true for most linear models, where a weight is a coefficient of the line describing the model. In this case "parameters" refers only to the parameters of the learning algorithm, and this may be a bit confusing when moving to other kinds of ML algorithms. Also, contrary to what lejlot said, these parameters may not be that abstract: often they have a clear meaning in terms of their effect on the learning process. For example, with SVMs, a parameter may weight the importance of misclassifications.
What is the difference between deep reinforcement learning and reinforcement learning? I basically know what reinforcement learning is about, but what does the term "deep" concretely stand for in this context?
Reinforcement Learning
In reinforcement learning, an agent tries to come up with the best action given a state.
For example, in the video game Pac-Man, the state space would be the 2D game world you are in and the surrounding items (pac-dots, enemies, walls, etc.), and actions would be moving through that 2D space (going up/down/left/right).
So, given the state of the game world, the agent needs to pick the best action to maximise rewards. Through reinforcement learning's trial and error, it accumulates "knowledge" through these (state, action) pairs, as in, it can tell if there would be positive or negative reward given a (state, action) pair. Let's call this value Q(state, action).
A rudimentary way to store this knowledge would be a table like below
state | action | Q(state, action)
---------------------------------
... | ... | ...
The (state, action) space can be very big
However, when the game gets complicated, the knowledge space can become huge and it is no longer feasible to store all (state, action) pairs. If you think about it in raw terms, even a slightly different state is still a distinct state (e.g. a different position of the enemy coming through the same corridor). You could use something that generalizes the knowledge instead of storing and looking up every little distinct state.
So, what you can do is create a neural network that, e.g., predicts the reward for an input (state, action) (or picks the best action given a state, however you like to look at it).
Approximating the Q value with a Neural Network
So, what you effectively have is a NN that predicts the Q value, based on the input (state, action). This is way more tractable than storing every possible value like we did in the table above.
Q = neural_network.predict(state, action)
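As a minimal sketch (PyTorch, with made-up dimensions), a common variant is a network that takes the state and outputs one Q value per action, which amounts to the same idea:

    import torch
    import torch.nn as nn

    state_dim, n_actions = 8, 4                 # illustrative sizes

    q_net = nn.Sequential(
        nn.Linear(state_dim, 64),
        nn.ReLU(),
        nn.Linear(64, n_actions),               # Q(state, .) for every action at once
    )

    state = torch.randn(1, state_dim)           # a dummy state
    q_values = q_net(state)                     # shape (1, n_actions)
    best_action = q_values.argmax(dim=1).item()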
Deep Reinforcement Learning
Deep Neural Networks
To be able to do that for complicated games, the NN may need to be "deep", meaning a few hidden layers may not suffice to capture all the intricate details of that knowledge, hence the use of deep NNs (lots of hidden layers).
The extra hidden layers allow the network to internally come up with features that help it learn and generalize complex problems that may have been impossible with a shallow network.
Closing words
In short, the deep neural network allows reinforcement learning to be applied to larger problems. You can use any function approximator instead of an NN to approximate Q, and if you do choose NNs, it doesn't absolutely have to be a deep one. It's just that researchers have had great success using them recently.
Summary: Deep RL uses a Deep Neural Network to approximate Q(s,a).
Non-Deep RL defines Q(s,a) using a tabular function.
Popular reinforcement learning algorithms use the functions Q(s,a) or V(s) to estimate the return (sum of discounted rewards). The function can be defined by a tabular mapping of discrete inputs to outputs. However, this is limiting for continuous states or an infinite/large number of states, where a more general approach is necessary.
Function approximation is used for a large state space. A popular function approximation method is Neural Networks. You can make a Deep Neural Network by adding many hidden layers.
Thus, Deep Reinforcement Learning uses function approximation, as opposed to tabular functions. Specifically, DRL uses Deep Neural Networks to approximate Q or V (or even A).
I would like to know the various techniques and metrics used to evaluate how accurate/good an algorithm is, and how to use a given metric to derive a conclusion about an ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
A while ago, I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheat sheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
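For instance, most of these classification metrics are one call away in scikit-learn once you have true and predicted labels (the labels below are placeholders):

    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, matthews_corrcoef, confusion_matrix)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

    print(confusion_matrix(y_true, y_pred))
    print("accuracy: ", accuracy_score(y_true, y_pred))
    print("precision:", precision_score(y_true, y_pred))
    print("recall:   ", recall_score(y_true, y_pred))
    print("F1:       ", f1_score(y_true, y_pred))
    print("MCC:      ", matthews_corrcoef(y_true, y_pred))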
The technique is mostly to run an algorithm on some data to get a model, then apply that model to new, previously unseen data, evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics; a cross-validation sketch follows the list):
Jackknife
Cross-validation
K-fold cross-validation
Bootstrap
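A small sketch of the k-fold variant with scikit-learn (the estimator and dataset are placeholders):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
    print(scores.mean(), scores.std())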
Talking about ML in general covers quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is known.
Let's suppose your problem is to recognize words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times each and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - for the moment - what your algorithm looks like.
Now, on the one hand - depending on the algorithm - if you feed your algorithm one of the remaining repetitions, it may give you a certainty estimate which can be used to characterize the recognition of just that repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm: for each repetition you pass it to the algorithm and compare the expected output with the actual output. After that you'll have an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good starting point for further reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no rule that there are only 20 or 30 of them. You can create your own metrics depending on your problem; there are various cases, when solving real-world problems, where you would need to create your own custom metrics.
Coming to the existing ones, they are already listed in the first answer; I will just highlight each metric's merits and demerits to give a better understanding.
Accuracy is the simplest of the metrics and it is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset, e.g. in a 2-class problem where some points belong to class 1 and some belong to class 2. It is not preferred when the dataset is imbalanced, because it is biased towards the majority class and is not that interpretable in that setting.
Log loss is a metric that works with probability scores, which give you a better understanding of why a specific point is assigned to class 1. The best part of this metric is that it is built into logistic regression, which is a well-known ML technique.
The confusion matrix is best used for a 2-class classification problem; it gives four numbers, and the diagonal numbers help you get an idea of how good your model is. From this matrix other metrics follow, such as precision, recall and F1-score, which are interpretable.
Is there an objective way to validate the output of a clustering algorithm?
I'm using scikit-learn's affinity propagation clustering against a dataset composed of objects with many attributes. The difference matrix supplied to the clustering algorithm is composed of the weighted difference of these attributes. I'm looking for a way to objectively validate tweaks in the distance weightings as reflected in the resulting clusters. The dataset is large and has enough attributes that manual examination of small examples is not a reasonable way to verify the produced clusters.
Yes:
Give the clusters to a domain expert, and have him analyze if the structure the algorithm found is sensible. Not so much if it is new, but if it is sensible.
... and No:
There is no automatic evaluation available that is fair - in the sense that it takes the objective of unsupervised clustering into account: knowledge discovery, a.k.a. learning something new about your data.
There are two common ways of evaluating clusterings automatically:
Internal cohesion, i.e. there is some particular property, such as within-cluster variance compared to between-cluster variance, to minimize. The problem is that it is usually fairly trivial to cheat, i.e. to construct a trivial solution that scores really well. So this method must not be used to compare methods based on different assumptions. You can't even fairly compare different types of linkage for hierarchical clustering.
External evaluation. You use a labeled data set and score algorithms by how well they rediscover existing knowledge. Sometimes this works quite well, so it is an accepted state of the art for evaluation. Yet any supervised or semi-supervised method will of course score much better on this. As such, it is a) biased towards supervised methods, and b) actually goes completely against the knowledge-discovery idea of finding something you did not yet know. (See the sketch after this list.)
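A small sketch of both kinds of scores with scikit-learn (k-means and the iris labels are placeholders; silhouette is an internal score, ARI against known labels an external one):

    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    X, y_true = load_iris(return_X_y=True)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print("internal (silhouette):", silhouette_score(X, labels))
    print("external (ARI vs. known labels):", adjusted_rand_score(y_true, labels))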
If you really mean to use clustering - i.e. learn something about your data - you will at some point have to inspect the clusters, preferably by a completely independent method such as a domain expert. If he can tell you that e.g. the user group identified by the clustering is a non-trivial group not yet investigated closely, then you are a winner.
However, most people want to have a "one click" (and one-score) evaluation, unfortunately.
Oh, and "clustering" is not really a machine learning task. There actually is no learning involved. To the machine learning community, it is the ugly duckling that nobody cares about.
There is another way to evaluate clustering quality: computing a stability metric on subfolds, a bit like cross-validation for supervised models:
Split the dataset into 3 folds A, B and C. Compute two clusterings with your algorithm on A+B and A+C. Compute the Adjusted Rand Index or Adjusted Mutual Information of the 2 labelings on their intersection A and consider this value as an estimate of the stability score of the algorithm.
Rinse and repeat by shuffling the data, splitting it into 3 other folds A', B' and C', and recomputing the stability score.
Average the stability scores over 5 or 10 runs to have a rough estimate of the standard error of the stability score.
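A rough sketch of this procedure (k-means stands in for any clustering algorithm here, X is assumed to be a NumPy array, and the fold sizes and run count are illustrative):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def stability_score(X, n_clusters, n_runs=10, seed=0):
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_runs):
            idx = rng.permutation(len(X))
            a, b, c = np.array_split(idx, 3)                    # folds A, B, C
            km_ab = KMeans(n_clusters, n_init=10).fit(X[np.concatenate([a, b])])
            km_ac = KMeans(n_clusters, n_init=10).fit(X[np.concatenate([a, c])])
            # Compare the two clusterings on the shared fold A.
            scores.append(adjusted_rand_score(km_ab.predict(X[a]), km_ac.predict(X[a])))
        return np.mean(scores), np.std(scores)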
As you can guess, this is a very computationally intensive evaluation method.
It is still an open research area to know whether or not this stability-based evaluation of clustering algorithms is really useful in practice, and to identify when it can fail to produce a valid criterion for model selection. Please refer to Clustering Stability: An Overview by Ulrike von Luxburg and the references therein for an overview of the state of the art on these matters.
Note: it is important to use adjusted-for-chance metrics such as ARI or AMI if you want to use this strategy to select the best value of k in k-means, for instance. Non-adjusted metrics such as NMI and V-measure will tend to arbitrarily favor models with higher k.