Incorporating Transition Probabilities in SARSA - machine-learning

I am implementing a SARSA(lambda) model in C++ to overcome some of the limitations (the sheer amount of time and space DP models require) of DP models, which hopefully will reduce the computation time (takes quite a few hours atm for similar research) and less space will allow adding more complexion to the model.
We do have explicit transition probabilities, and they do make a difference. So how should we incorporate them in a SARSA model?
Simply select the next state according to the probabilities themselves? Apparently SARSA models don't exactly expect you to use probabilities - or perhaps I've been reading the wrong books.
PS- Is there a way of knowing if the algorithm is properly implemented? First time working with SARSA.

The fundamental difference between Dynamic Programming (DP) and Reinforcement Learning (RL) is that the first assumes that environment's dynamics is known (i.e., a model), while the latter can learn directly from data obtained from the process, in the form of a set of samples, a set of process trajectories, or a single trajectory. Because of this feature, RL methods are useful when a model is difficult or costly to construct. However, it should be notice that both approaches share the same working principles (called Generalized Policy Iteration in Sutton's book).
Given they are similar, both approaches also share some limitations, namely, the curse of dimensionality. From Busoniu's book (chapter 3 is free and probably useful for your purposes):
A central challenge in the DP and RL fields is that, in their original
form (i.e., tabular form), DP and RL algorithms cannot be implemented
for general problems. They can only be implemented when the state and
action spaces consist of a finite number of discrete elements, because
(among other reasons) they require the exact representation of value
functions or policies, which is generally impossible for state spaces
with an infinite number of elements (or too costly when the number of
states is very high).
Even when the states and actions take finitely many values, the cost
of representing value functions and policies grows exponentially with
the number of state variables (and action variables, for Q-functions).
This problem is called the curse of dimensionality, and makes the
classical DP and RL algorithms impractical when there are many state
and action variables. To cope with these problems, versions of the
classical algorithms that approximately represent value functions
and/or policies must be used. Since most problems of practical
interest have large or continuous state and action spaces,
approximation is essential in DP and RL.
In your case, it seems quite clear that you should employ some kind of function approximation. However, given that you know the transition probability matrix, you can choose a method based on DP or RL. In the case of RL, transitions are simply used to compute the next state given an action.
Whether is better to use DP or RL? Actually I don't know the answer, and the optimal method likely depends on your specific problem. Intuitively, sampling a set of states in a planned way (DP) seems more safe, but maybe a big part of your state space is irrelevant to find an optimal pocliy. In such a case, sampling a set of trajectories (RL) maybe is more effective computationally. In any case, if both methods are rightly applied, should achive a similar solution.
NOTE: when employing function approximation, the convergence properties are more fragile and it is not rare to diverge during the iteration process, especially when the approximator is non linear (such as an artificial neural network) combined with RL.

If you have access to the transition probabilities, I would suggest not to use methods based on a Q-value. This will require additional sampling in order to extract information that you already have.
It may not always be the case, but without additional information I would say that modified policy iteration is a more appropriate method for your problem.

Related

Supervised learning linear regression

I am confused about how linear regression works in supervised learning. Now I want to generate a evaluation function for a board game using linear regression, so I need both the input data and output data. Input data is my board condition, and I need the corresponding value for this condition, right? But how can I get this expected value? Do I need to write an evaluation function first by myself? But I thought I need to generate an evluation function by using linear regression, so I'm a little confused about this.
It's supervised-learning after all, meaning: you will need input and output.
Now the question is: how to obtain these? And this is not trivial!
Candidates are:
historical-data (e.g. online-play history)
some form or self-play / reinforcement-learning (more complex)
But then a new problem arises: which output is available and what kind of input will you use.
If there would be some a-priori implemented AI, you could just take the scores of this one. But with historical-data for example you only got -1,0,1 (A wins, draw, B wins) which makes learning harder (and this touches the Credit Assignment problem: there might be one play which made someone lose; it's hard to understand which of 30 moves lead to the result of 1). This is also related to the input. Take chess for example and take a random position from some online game: there is the possibility that this position is unique over 10 million games (or at least not happening often) which conflicts with the expected performance of your approach. I assumed here, that the input is the full board-position. This changes for other inputs, e.g. chess-material, where the input is just a histogram of pieces (3 of these, 2 of these). Now there are much less unique inputs and learning will be easier.
Long story short: it's a complex task with a lot of different approaches and most of this is somewhat bound by your exact task! A linear evaluation-function is not super-uncommon in reinforcement-learning approaches. You might want to read some literature on these (this function is a core-component: e.g. table-lookup vs. linear-regression vs. neural-network to approximate the value- or policy-function).
I might add, that your task indicates the self-learning approach to AI, which is very hard and it's a topic which somewhat gained additional (there was success before: see Backgammon AI) popularity in the last years. But all of these approaches are highly complex and a good understanding of RL and the mathematical-basics like Markov-Decision-Processes are important then.
For more classic hand-made evaluation-function based AIs, a lot of people used an additional regressor for tuning / weighting already implemented components. Some overview at chessprogramming wiki. (the chess-material example from above might be a good one: assumption is: more pieces better than less; but it's hard to give them values)

would we ever compute the cost J(θ) on the *test* set?

I'm pretty sure that the answer is no, but wanted to confirm...
When training a neural network or other learning algorithm, we will compute the cost function J(θ) as an expression of how well our algorithm fits the training data (higher values mean it fits the data less well). When training our algorithm, we generally expect to see J(theta) go down with each iteration of gradient descent.
But I'm just curious, would there ever be a value to computing J(θ) against our test data?
I think the answer is no, because since we only evaluate our test data once, we would only get one value of J(θ), and I think that it is meaningless except when compared with other values.
Your question touches on a very common ambiguity regarding the terminology: one between the validation and the test sets (the Wikipedia entry and this Cross Vaidated post may be helpful in resolving this).
So, assuming that you indeed refer to the test set proper and not the validation one, then:
You are right in that this set is only used once, just at the end of the whole modeling process
You are, in general, not right in assuming that we don't compute the cost J(θ) in this set.
Elaborating on (2): in fact, the only usefulness of the test set is exactly for evaluating our final model, in a set that has not been used at all in the various stages of the fitting process (notice that the validation set has been used indirectly, i.e. for model selection); and in order to evaluate it, we obviously have to compute the cost.
I think that a possible source of confusion is that you may have in mind only classification settings (although you don't specify this in your question); true, in this case, we are usually interested in the model performance regarding a business metric (e.g. accuracy), and not regarding the optimization cost J(θ) itself. But in regression settings it may very well be the case that the optimization cost and the business metric are one and the same thing (e.g. RMSE, MSE, MAE etc). And, as I hope is clear, in such settings computing the cost in the test set is by no means meaningless, despite the fact that we don't compare it with other values (it provides an "absolute" performance metric for our final model).
You may find this and this answers of mine useful regarding the distinction between loss & accuracy; quoting from these answers:
Loss and accuracy are different things; roughly speaking, the accuracy is what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...

Which machine learning model is applicable to the following case

I want to build a model that recognizes the species based on multiple indicators. The problem is, neural networks (usually) receive vectors, and my indicators are not always easily expressed in numbers. For example, one of the indicators is not only whether species performs some actions (that would be, say, '0' or '1', or anything in between, if the essence of action permits that), but sometimes, in which order are those actions performed. I want the system to be able to decide and classify species based on these indicators. There are not may classes but rather many indicators.
The amount of training data is not an issue, I can get as much as I want.
What machine learning techniques should I consider? Maybe some special kind of neural network would do? Or maybe something completely different.
If you treat a sequence of actions as a string, then using features like "an action A was performed" is akin to unigram model. If you want to account for order of actions, you should add bigrams, trigrams, etc.
That will blow up your feature space, though. For example, if you have M possible actions, then there are M (M-1) / 2 bigrams. In general, there are O(Mk) k-grams. This leads to the following issues:
The more features you have — the harder it is to apply some methods. For example, many models suffer from curse of dimensionality
The more features you have — the more data you need to capture meaningful relations.
This is just one possible approach to your problem. There may be others. For example, if you know that there's some set of parameters ϴ, that governs action-generating process in a known (at least approximately) way, you can build a separate model to infer these first, and then use ϴ as features.
The process of coming up with sensible numerical representation of your data is called feature engineering. Once you've done that, you can use any Machine Learning algorithm at your disposal.

The options for the first step of document clustering

I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require to represent the documents to be clustered as a document-word matrix, where the rows stand for document and the columns stand for words appearing in the document. And the matrix is often very sparse.
I am wondering, is there any other options to represent documents besides using the document-word matrix? Because I believe the way we express a problem has a significant influence on how well we can solve it.
As #ffriend pointed out, you cannot really avoid using the term-document-matrix (TDM) paradigm. Clustering methods operates on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:
feature selection and re-weighting attempt to remove or weight down features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or if their counts are decremented). You might want to read more about Mutual Information (and its many variants) and TF-IDF.
dimensionality reduction is about encoding the information as accurately as possible in the TDM using less columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse.
feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but often yield improved results.
the distributional hypothesis makes if possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning if individual words) is hard to deal with.
For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning : Vector Space Models of Semantics".
You question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.
Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means as it's extremely easy and there are many, many implementations of it.
OK, this is quite a very general question, and many answers are possible, none is definitive
because it's an ongoing research area. So far, the answers I have read mainly concern so-called "Vector-Space models", and your question is termed in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on
the same distributional hypothesis, but exploits a different theoretical framework: you don't manipulate any more raw distances, but rather probability distributions and, which is the most important, you can do inference based on them.
You mentioned LDA, I guess you mean Latent Dirichlet Allocation, which is the most well-known such Bayesian model to do document clustering. It is an alternative paradigm to vector space models, and a winning one: it has been proven to give very good results, which justifies its current success. Of course, one can argue that you still use kinds of term-document matrices through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers do rarely (if ever) use this term.
Because of its success, there are many software that implements LDA on the net. Here is one, but there are many others:
http://jgibblda.sourceforge.net/

Does prior distribution matter in classification?

Currently I get a classification problem with two classes. what I want to do is that given a bunch of candidates, find out who will more likely to be the class 1. The problem is that class 1 is very rare (around 1%), which I guess makes my prediction quite inaccurate.
For training the dataset, can I sample half class 1 and half class 0? This will change the prior distribution, but I don't know whether the prior distribution affects the classification results?
Indeed, a very imbalanced dataset can cause problems in classification. Because by defaulting to the majority class 0, you can get your error rate already very low.
There are some workarounds that may or may not work for your particular problem, such as giving equal weight to the two classes (thus weighting instances from the rare class stronger), oversampling the rare class (i.e. learning each instance multiple times), producing slight variations of the rare objects to restore balance etc. SMOTE and so on.
You really should to grab some classification or machine learning book, and check the index for "imbalanced classification" or "unbalanced classification". If the book is any good, it will discuss this problem. (I just assume you did not know the term that they use.)
If you're forced to pick exactly one from a group, then the prior distribution over classes won't matter because it will be constant for all members of that group. If you must look at each in turn and make an independent decision as to whether they're class one or class two, the prior will potentially change the decision, depending on which method you choose to do the classification. I would suggest you get hold of as many examples of the rare class as possible, but beware that feeding a 50-50 split to a classifier as training blindly may make it implicitly fit a model that assumes this is the distribution at test time.
Sampling your two classes evenly doesn't change assumed priors unless your classification algorithm computes (and uses) priors based on the training data. You stated that your problem is "given a bunch of candidates, find out who will more likely to be the class 1". I read this to mean that you want to determine which observation is most likely to belong to class 1. To do this, you want to pick the observation $x_i$ that maximizes $p(c_1|x_i)$. Using Bayes' theorem, this becomes:
$$
p(c_1|x_i)=\frac{p(x_i|c_1)p(c_1)}{p(x_i)}
$$
You can ignore $p(c_1)$ in the equation above since it is a constant. However, computing the denominator will still involve using prior probabilities. Since your problem is really more of a target detection problem than a classification problem, an alternate approach for detecting low probability targets is to take the likelihood ratio of the two classes:
$$
\Lambda=\frac{p(x_i|c_1)}{p(x_i|c_0)}
$$
To pick which of your candidates is most likely to belong to class 1, pick the one with the highest value of $\Lambda$. If your two classes are described by multivariate Gaussian distributions, you can replace $\Lambda$ with its natural logarithm, resulting in a simpler quadratic detector. If you further assume that the target and background have the same covariance matrices, this results in a linear discriminant (http://en.wikipedia.org/wiki/Linear_discriminant_analysis).
You may want to consider Bayesian utility theory to re-weight the costs of different kinds of error to get away from the problem of the priors dominating the decision.
Let A be the 99% prior probability class, B be the 1% class.
If we just say that all errors incur the same cost (negative utility), then
it's possible that the optimal decision approach is to always declare "A". Many
classification algorithms (implicitly) assume this.
If instead, we declare that the cost of declaring "B" when, in fact, the instance
was "A" is much bigger than the cost of the opposite error, then the decision logic
becomes, in a sense, more sensitive to slighter differences in the features.
This kind of situation frequently comes up in fault detection -- faults in the monitored
system will be rare, but you want to be sure that if we see any data that points to
an error condition, action needs to be taken (even if it is just reviewing the data).

Resources