Does an LSTM model use trend in features? - machine-learning

Does an LSTM take into account a trend in a feature? Or does it only see trends from the previous output (Y predicted)?
To illustrate, imagine we have a trend in feature A. In our problem, we know that in the real world Y tends to decrease as A increases (inversely proportional). However, Y is NOT related to the actual value of A.
For example, if A increases from 10 to 20, Y decreases by 1; if A increases from 40 to 50, Y decreases by 1 as well. Similarly, if A decreases from 30 to 20, Y should increase by 1, and if A decreases from 60 to 50, Y should increase by 1 as well.
In the above example, the LSTM will do well if it can recognize the trend (direction of change) of feature A. However, if it is using only the actual value of feature A, it will not be useful at all. The value of "10" or "50" for A above is in itself meaningless, because there is no direct correlation between the value of A and Y; only the trend in A influences Y. A real-world example of this is the relationship between SPY (the stock market) and VIX (the volatility index). The VIX has an inverse correlation with the movement of SPY, but the actual level of SPY doesn't matter.
I have spent a lot of time learning about LSTMs in general, but it's not clear whether the "memory" remembers the trend in a feature's value. From what I can see, it does not, and only remembers the previous "weights" of the features and the trend in the output (Y).

There are no "previous weights"; the weights are fixed at evaluation time. The network remembers whatever function of the previous inputs and the previous state it learns to remember, based on the recurrent weights it learned during training. The difference between an input at the current time-step and the previous time-step, or an approximation of a longer-range derivative, is certainly something that it could learn if that was useful.
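In principle the recurrent weights can learn such a difference, but if you already know the trend is what matters, one common shortcut (my suggestion, not something required by LSTMs) is to feed the first difference of the feature as an extra input, so the trend is explicit at every time step. A minimal NumPy sketch, assuming inputs shaped (timesteps, features):

import numpy as np

def add_first_difference(X):
    # X: array of shape (timesteps, features).
    # Returns (timesteps, 2 * features): the original values plus, for each
    # feature, the change from the previous time step (zero for the first step).
    dX = np.diff(X, axis=0, prepend=X[:1])
    return np.concatenate([X, dX], axis=1)

# Feature A rising 10 -> 20 and 40 -> 50 yields the same delta (+10),
# which is exactly the trend signal described in the question.
A = np.array([[10.0], [20.0], [40.0], [50.0]])
print(add_first_difference(A))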

Related

What should be the threshold for modified z score?

I am trying to find outliers in my data set. I was previously using the z-score for that, with a 99% confidence interval, which corresponds to roughly +/- 2.576 on the z-score table. However, I realized that computing the z-score using the median absolute deviation (MAD) would be better. I have the modified z-score based on
0.6745 * (x - median) / MAD
My problem is that I am not sure what a good cutoff is for the modified z-score, or whether it depends on the kind of data I have.
This depends on the type of data you have. In general, median-based operations lose a little outlier information. However, the results for sufficiently large data sets should be similar, with the centroid shifted from the mean to the median; in a skewed data set, this will likely give you better results.
As for the cut-off point, here's a starting hint.
Think about the math: traditional Z-scores are based on a root-sum-square computation. Think about the root(N) factor in this. How would that affect your 99% point for the median computation, which is a simple linear computation?
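If you want to experiment with cutoffs on your own data, here is a minimal NumPy sketch of the modified z-score computation; the 3.5 below is only an illustrative placeholder (a commonly quoted rule of thumb), not the answer to the question above.

import numpy as np

def modified_z_scores(x):
    # modified z-score: 0.6745 * (x - median) / MAD
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])
scores = modified_z_scores(data)
cutoff = 3.5  # placeholder; tune this to your data as discussed above
print(data[np.abs(scores) > cutoff])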

Stochastic state transitions in MDP: How does Q-learning estimate that?

I am implementing Q-learning on a grid-world to find the optimal policy. One thing that is bugging me is that the state transitions are stochastic. For example, if I am in the state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm? As I have read so far, Q-learning is "model-free" learning: it does not need to know the state-transition probabilities. I am not convinced as to how the algorithm will automatically account for these transition probabilities during training. If someone can clear things up, I shall be grateful.
Let's look at what Q-learning guarantees to see why it handles transition probabilities.
Let's call q* the optimal action-value function. This is the function that returns the correct value of taking some action in some state. The value of a state-action pair is the expected cumulative reward of taking the action and then following an optimal policy afterwards. An optimal policy is simply a way of choosing actions that achieves the maximum possible expected cumulative reward. Once we have q*, it's easy to find an optimal policy: from each state s that you find yourself in, greedily choose the action a that maximizes q*(s,a). Q-learning learns q*, provided that it visits all states and actions infinitely many times.
For example, if I am in the state (3,2) and take the action 'north', I would land at (3,1) with probability 0.8, at (2,2) with probability 0.1, and at (4,2) with probability 0.1. How do I fit this information into the algorithm?
Because the algorithm visits all states and actions infinitely many times, averaging the sampled returns into its q-values, it learns an expectation of the value of attempting to go north. We go north so many times that the value converges to the sum of each possible outcome weighted by its transition probability. Say we knew all of the state values on the gridworld except for the value of going north from (3,2), and assume there is no reward for any transition out of (3,2). After sampling 'north' from (3,2) infinitely many times, the estimate converges to 0.8 * v(3,1) + 0.1 * v(2,2) + 0.1 * v(4,2), where v(s) denotes the value of state s (ignoring discounting for simplicity). With this value in hand, a greedy action choice from (3,2) will now be properly informed of the true expected value of attempting to go north. The transition probabilities get baked right into the value!
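To make the "baked in by sampling" point concrete, here is a minimal tabular Q-learning sketch; the step_north function is a hypothetical stand-in for the environment's dynamics and is not something the agent ever gets to inspect.

import random
from collections import defaultdict

ACTIONS = ["north", "south", "east", "west"]
alpha, gamma = 0.1, 0.99
Q = defaultdict(float)  # maps (state, action) -> estimated value

def step_north(s):
    # Hypothetical environment dynamics for 'north' from state s = (x, y):
    # intended cell with probability 0.8, slip left/right with probability 0.1 each.
    x, y = s
    return random.choices([(x, y - 1), (x - 1, y), (x + 1, y)],
                          weights=[0.8, 0.1, 0.1])[0]

def q_update(s, a, r, s_next):
    # The agent never sees the 0.8 / 0.1 / 0.1 probabilities; it only sees the
    # sampled next state. Repeating this update averages over those samples,
    # so the transition probabilities end up reflected in Q[(s, a)].
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])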

receiver operating characteristic (ROC) on a test set

The following image definitely makes sense to me.
Say you have a few trained binary classifiers A, B (B not much better than random guessing etc. ...) and a test set composed of n test samples to go with all those classifiers. Since Precision and Recall are computed for all n samples, those dots corresponding to classifiers make sense.
Now sometimes people talk about ROC curves and I understand that precision is expressed as a function of recall or simply plotted Precision(Recall).
I don't understand where this variability comes from, since you have a fixed number of test samples. Do you just pick some subsets of the test set and compute precision and recall on each, and so get many discrete points (or an interpolated line)?
The ROC curve is well-defined for a binary classifier that expresses its output as a "score." The score can be, for example, the probability of being in the positive class, or it could also be the probability difference (or even the log-odds ratio) between probability distributions over each of the two possible outcomes.
The curve is obtained by setting the decision threshold for this score at different levels and measuring the true-positive and false-positive rates, given that threshold.
There's a good example of this process in Wikipedia's "Receiver Operating Characteristic" page:
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have.
If code speaks more clearly to you, here's the code in scikit-learn that computes an ROC curve given a set of predictions for each item in a dataset. The fundamental operation seems to be (direct link):
desc_score_indices = np.argsort(y_score, kind="mergesort")[::-1]
y_score = y_score[desc_score_indices]
y_true = y_true[desc_score_indices]
# accumulate the true positives with decreasing threshold
tps = y_true.cumsum()
fps = 1 + np.arange(len(y_true)) - tps  # cumulative false positives at each threshold
return fps, tps, y_score
(I've omitted a bunch of code in there that deals with the (common) cases of weighted samples and of the classifier giving near-identical scores to multiple samples.) Basically, the true labels are sorted in descending order by the score the classifier assigned to them, and then their cumulative sum is computed, giving the cumulative true-positive count (which becomes the true-positive rate once normalized by the number of positives) as a function of the score threshold.
And here's an example showing how this gets used: http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
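If you just want the curve rather than the internals, scikit-learn exposes this directly through sklearn.metrics.roc_curve; a minimal sketch with made-up labels and scores:

import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # classifier scores or probabilities
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(fpr, tpr, thresholds)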
The ROC curve just shows how much sensitivity (TPR) you gain if you allow the FPR to increase by some amount; it is the tradeoff between TPR and FPR. The variability comes from varying some parameter of the classifier (in the logistic regression case below, it is the threshold value).
For example, logistic regression gives you the probability that an object belongs to the positive class (a value in [0, 1]), but a probability is not a class label. In general you have to specify a threshold on the probability above which you classify an object as positive. So you train the logistic regression, obtain from it the probability of the positive class for each object in your set, and then vary this threshold in small steps from 0 to 1. Thresholding the probabilities at each step gives you class labels for every object, from which you compute TPR and FPR. This gives you a (TPR, FPR) pair for every threshold; you can mark them on a plot and, once you have computed the pairs for all thresholds, draw a line through them.
Also, for linear binary classifiers, you can think of this varying process as choosing the distance between the decision boundary and the positive (or negative, if you prefer) class cluster. If you move the decision boundary away from the positive class, you will classify more objects as positive (because you enlarged the region assigned to the positive class), and at the same time you will increase the FPR by some amount (because the region assigned to the negative class shrinks).
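Here is a short sketch of that threshold-sweeping procedure, using made-up probabilities (in practice they would come from your trained logistic regression):

import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1])
p_pos = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # predicted P(positive class)

points = []
for threshold in np.linspace(0.0, 1.0, 101):
    y_pred = (p_pos >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    points.append((fpr, tpr))
# Sorting the (FPR, TPR) pairs by FPR and connecting them draws the ROC curve.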

how to interpret the "soft" and "max" in the SoftMax regression?

I know the form of the softmax regression, but I am curious about why it has such a name? Or just for some historical reasons?
The maximum of two numbers, max(x,y), has a sharp corner (it is not differentiable where x = y), which is sometimes an unwanted property (e.g. if you want to compute gradients).
To soften the edges of max(x,y), one can use a variant with softer edges: the softmax function. It's still a max function at its core (well, to be precise it's an approximation of it) but smoothed out.
If it's still unclear, here's a good read.
Let's say you have a set of scalars xi and you want to calculate a weighted sum of them, giving a weight wi to each xi such that the weights sum to 1 (like a discrete probability distribution). One way to do it is to set wi = exp(a*xi) for some non-negative constant a, and then normalize the weights so they sum to one. If a = 0 you get just a regular sample average. On the other hand, for a very large value of a you get the max operator: the weighted sum is just the largest xi. Therefore, varying the value of a gives you a "soft", or continuous, way to go from regular averaging to selecting the max. The functional form of this weighted average should look familiar to you if you already know what a SoftMax regression is.
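A short numerical sketch of that weighted average, with the constant a acting as the knob between plain averaging and the hard max:

import numpy as np

def soft_weighted_average(x, a):
    # w_i proportional to exp(a * x_i), normalized to sum to 1
    w = np.exp(a * (x - x.max()))  # subtract the max for numerical stability
    w /= w.sum()
    return np.sum(w * x)

x = np.array([1.0, 2.0, 5.0])
print(soft_weighted_average(x, 0.0))    # ~2.67, the plain average
print(soft_weighted_average(x, 100.0))  # ~5.0, effectively max(x)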

Word2Vec: Number of Dimensions

I am using Word2Vec with a dataset of roughly 11,000,000 tokens, looking to do word similarity (as part of synonym extraction for a downstream task), but I don't have a good sense of how many dimensions I should use with Word2Vec. Does anyone have a good heuristic for the range of dimensions to consider based on the number of tokens/sentences?
The typical range is 100-300 dimensions. I would say you need at least 50 dimensions to reach a minimally acceptable accuracy; if you pick fewer, you start to lose the useful properties of high-dimensional spaces. If training time is not a big deal for your application, I would stick with 200 dimensions, as it gives nice features. The highest accuracy is obtained around 300 dimensions; beyond 300, the word features won't improve dramatically and training becomes extremely slow.
I do not know a theoretical explanation or strict bounds for dimension selection in high-dimensional spaces (and there might not be an application-independent explanation for that), but I would refer you to Pennington et al., Figure 2(a), where the x-axis shows the vector dimension and the y-axis shows the accuracy obtained. That should provide empirical justification for the above argument.
I think that the right number of dimensions for word2vec depends on your application. Empirically, a value of about 100 tends to perform well.
The number of dimensions affects over- and under-fitting. The common wisdom is 100-300 dimensions. Start with one number and check the accuracy on your test set versus your training set. The bigger the dimension size, the easier it is to overfit the training set and perform badly on the test set. Tuning this parameter is required: if you have high accuracy on the training set and low accuracy on the test set, the dimension size is probably too big, and reducing it might solve the overfitting problem of your model.
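If you want to run that comparison yourself, here is a minimal sketch with gensim (assuming gensim 4.x, where the dimensionality is set via vector_size; older versions call it size). The toy corpus is just a placeholder for your tokenized sentences.

from gensim.models import Word2Vec

sentences = [["this", "is", "a", "toy", "corpus"],
             ["word2vec", "expects", "tokenized", "sentences"]]

for dim in (50, 100, 200, 300):
    model = Word2Vec(sentences, vector_size=dim, window=5, min_count=1, workers=4)
    # Evaluate the model on your similarity / synonym-extraction task here and
    # compare accuracy on a held-out set across dimension sizes.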
