I'm trying to implement Deep Mind's DNC - Nature paper- with PyTorch 0.4.0.
When implementing the variant of LSTM they used I encountered some troubles with dimensions.
To simplify suppose BATCH=1.
The equations they list in the paper are these:
where [x;h] means a concatenation of x and h into one single vector, and i, f and o are column vectors.
My question is about how the state s_t is computed.
The second addendum is obtained by multiplying i with a column vector and so the result is either a scalar (transpose i first, then do scalar product) or wrong (two column vectors multiplied).
So the state results in a single scalar...
With the same reasoning the hidden state h_t is a scalar too, but it has to be a column vector.
Obviously I'm wrong somewhere, but I can't figure out where.
By looking at Wikipedia LSTM Article I think I figured it out.
This is the formal implementation of standard LSTM found in the article:
The circle represents element-by-element product.
By using this product in the corresponding parts of DNC equations (s_t and o_t) the dimensions work.
I was learning Machine Learning from this course on Coursera taught by Andrew Ng. The instructor defines the hypothesis as a linear function of the "input" (x, in my case) like the following:
hθ(x) = θ0 + θ1(x)
In supervised learning, we have some training data and based on that we try to "deduce" a function which closely maps the inputs to the corresponding outputs. To deduce the function, we introduce the hypothesis as a linear function of input (x). My question is, why the function involving two θs is chosen? Why it can't be as simple as y(i) = a * x(i) where a is a co-efficient? Later we can go about finding a "good" value of a for a given example (i) using an algorithm? This question might look very stupid. I apologize but I'm not very good at machine learning I am just a beginner. Please help me understand this.
Thanks!
The a corresponds to θ1. Your proposed linear model is leaving out the intercept, which is θ0.
Consider an output function y equal to the constant 5, or perhaps equal to a constant plus some tiny fraction of x which never exceeds .01. Driving the error function to zero is going to be difficult if your model doesn't have a θ0 that can soak up the D.C. component.
Let's say I've calculated the TF-IDF scores for a corpus of documents, resulting in a matrix of TF-IDF features. If a subset of those documents are of a certain class, can I somehow "combine" the scores of that subset to get a single value for each feature in that class in a meaningful way?
For example, if I have a corpus of 1000 documents, and corpus[0:200] are of class A, then can I take rows 0-200 of the tf-idf scores and somehow combine them so that I can say, "In class A, features[3] has a score of 0.5."
My hope is to extract the most meaningful terms from each class within the corpus. Is there a reasonable way to do this? Does my question demonstrate a misunderstanding of the concept to begin with? Any feedback at all is wildly appreciated.
What you described is like "feature selection". After you extracted the tf-idf score, your document may look like this:
doc1: london:0.2 british:0.5 travel:0.1
doc2: coffee:0.2 brazil:0.1 travel:0.5
doc3: meat:0.8 avian 0.001 birds:0.2
doc4: agriculture:0.5 meat:0.5 chicken:0.01
Suppose doc1 and doc2 are class 0:"cities" and doc3 and doc4 are class 1: "poultry". You can then calcuate the expected mutual information of a term t and a class c.
See details for how to perform mutual-information based feature selection at here.
I've been reading some papers on CRFs and am slightly confused about the feature functions. Unary (node) and binary (edge) features f are normally of the form
f(yc, xc) = 1{yc=y ̃c}fg(xc).
where {.} is the indicator function evaluating to 1 if the condition enclosed is true, and 0 otherwise. fg is a function of the data xc which extracts useful attributes (features) from the data.
Now it seems to me that to create CRF features the true labels (yc) must be known. This is true for training but for the testing phase the true class labels are unknown (since we are trying to determine their most likely value).
Am I missing something? How can this be correctly implemented?
The idea with the CRF is that it assigns a score to each setting of the labels. So what you do, notionally, is compute the scores for all possible label assignments and then whichever labeling gets the biggest score is what the CRF predicts/outputs. This is only going to make sense if the CRF gives different scores to different label assignments. When you think of it that way it's clear that the labels must be involved in the feature functions for this to work.
So lets say the log probability function for your CRF is F(x,y). So it assigns a number to each combination of a data sample x and a labeling y. So when you get a new data sample the predicted label during test time is just argmax_y F(new_x, y). That is, you find the value of y that makes F(new_x,y) the biggest and that's the predicted labeling.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
By processing a time series graph, I Would like to detect patterns that look similar to this:
Using a sample time series as an example, I would like to be able to detect the patterns as marked here:
What kind of AI algorithm (I am assuming marchine learning techniques) do I need to use to achieve this? Is there any library (in C/C++) out there that I can use?
Here is a sample result from a small project I did to partition ecg data.
My approach was a "switching autoregressive HMM" (google this if you haven't heard of it) where each datapoint is predicted from the previous datapoint using a Bayesian regression model. I created 81 hidden states: a junk state to capture data between each beat, and 80 separate hidden states corresponding to different positions within the heartbeat pattern. The pattern 80 states were constructed directly from a subsampled single beat pattern and had two transitions - a self transition and a transition to the next state in the pattern. The final state in the pattern transitioned to either itself or the junk state.
I trained the model with Viterbi training, updating only the regression parameters.
Results were adequate in most cases. A similarly structure Conditional Random Field would probably perform better, but training a CRF would require manually labeling patterns in the dataset if you don't already have labelled data.
Edit:
Here's some example python code - it is not perfect, but it gives the general approach. It implements EM rather than Viterbi training, which may be slightly more stable.
The ecg dataset is from http://www.cs.ucr.edu/~eamonn/discords/ECG_data.zip
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
import scipy.linalg as lin
import re
data=np.array(map(lambda l: map(float, filter(lambda x:len(x)>0,
re.split('\\s+',l))), open('chfdb_chf01_275.txt'))).T
dK=230
pattern=data[1,:dK]
data=data[1,dK:]
def create_mats(dat):
'''
create
A - an initial transition matrix
pA - pseudocounts for A
w - emission distribution regression weights
K - number of hidden states
'''
step=5 #adjust this to change the granularity of the pattern
eps=.1
dat=dat[::step]
K=len(dat)+1
A=np.zeros( (K,K) )
A[0,1]=1.
pA=np.zeros( (K,K) )
pA[0,1]=1.
for i in xrange(1,K-1):
A[i,i]=(step-1.+eps)/(step+2*eps)
A[i,i+1]=(1.+eps)/(step+2*eps)
pA[i,i]=1.
pA[i,i+1]=1.
A[-1,-1]=(step-1.+eps)/(step+2*eps)
A[-1,1]=(1.+eps)/(step+2*eps)
pA[-1,-1]=1.
pA[-1,1]=1.
w=np.ones( (K,2) , dtype=np.float)
w[0,1]=dat[0]
w[1:-1,1]=(dat[:-1]-dat[1:])/step
w[-1,1]=(dat[0]-dat[-1])/step
return A,pA,w,K
# Initialize stuff
A,pA,w,K=create_mats(pattern)
eta=10. # precision parameter for the autoregressive portion of the model
lam=.1 # precision parameter for the weights prior
N=1 #number of sequences
M=2 #number of dimensions - the second variable is for the bias term
T=len(data) #length of sequences
x=np.ones( (T+1,M) ) # sequence data (just one sequence)
x[0,1]=1
x[1:,0]=data
# Emissions
e=np.zeros( (T,K) )
# Residuals
v=np.zeros( (T,K) )
# Store the forward and backward recurrences
f=np.zeros( (T+1,K) )
fls=np.zeros( (T+1) )
f[0,0]=1
b=np.zeros( (T+1,K) )
bls=np.zeros( (T+1) )
b[-1,1:]=1./(K-1)
# Hidden states
z=np.zeros( (T+1),dtype=np.int )
# Expected hidden states
ex_k=np.zeros( (T,K) )
# Expected pairs of hidden states
ex_kk=np.zeros( (K,K) )
nkk=np.zeros( (K,K) )
def fwd(xn):
global f,e
for t in xrange(T):
f[t+1,:]=np.dot(f[t,:],A)*e[t,:]
sm=np.sum(f[t+1,:])
fls[t+1]=fls[t]+np.log(sm)
f[t+1,:]/=sm
assert f[t+1,0]==0
def bck(xn):
global b,e
for t in xrange(T-1,-1,-1):
b[t,:]=np.dot(A,b[t+1,:]*e[t,:])
sm=np.sum(b[t,:])
bls[t]=bls[t+1]+np.log(sm)
b[t,:]/=sm
def em_step(xn):
global A,w,eta
global f,b,e,v
global ex_k,ex_kk,nkk
x=xn[:-1] #current data vectors
y=xn[1:,:1] #next data vectors predicted from current
# Compute residuals
v=np.dot(x,w.T) # (N,K) <- (N,1) (N,K)
v-=y
e=np.exp(-eta/2*v**2,e)
fwd(xn)
bck(xn)
# Compute expected hidden states
for t in xrange(len(e)):
ex_k[t,:]=f[t+1,:]*b[t+1,:]
ex_k[t,:]/=np.sum(ex_k[t,:])
# Compute expected pairs of hidden states
for t in xrange(len(f)-1):
ex_kk=A*f[t,:][:,np.newaxis]*e[t,:]*b[t+1,:]
ex_kk/=np.sum(ex_kk)
nkk+=ex_kk
# max w/ respect to transition probabilities
A=pA+nkk
A/=np.sum(A,1)[:,np.newaxis]
# Solve the weighted regression problem for emissions weights
# x and y are from above
for k in xrange(K):
ex=ex_k[:,k][:,np.newaxis]
dx=np.dot(x.T,ex*x)
dy=np.dot(x.T,ex*y)
dy.shape=(2)
w[k,:]=lin.solve(dx+lam*np.eye(x.shape[1]), dy)
# Return the probability of the sequence (computed by the forward algorithm)
return fls[-1]
if __name__=='__main__':
# Run the em algorithm
for i in xrange(20):
print em_step(x)
# Get rough boundaries by taking the maximum expected hidden state for each position
r=np.arange(len(ex_k))[np.argmax(ex_k,1)<3]
# Plot
plt.plot(range(T),x[1:,0])
yr=[np.min(x[:,0]),np.max(x[:,0])]
for i in r:
plt.plot([i,i],yr,'-r')
plt.show()
Why not using a simple matched filter? Or its general statistical counterpart called cross correlation. Given a known pattern x(t) and a noisy compound time series containing your pattern shifted in a,b,...,z like y(t) = x(t-a) + x(t-b) +...+ x(t-z) + n(t). The cross correlation function between x and y should give peaks in a,b, ...,z
Weka is a powerful collection of machine-learning software, and supports some time-series analysis tools, but I do not know enough about the field to recommend a best method. However, it is Java-based; and you can call Java code from C/C++ without great fuss.
Packages for time-series manipulation are mostly directed at the stock-market. I suggested Cronos in the comments; I have no idea how to do pattern recognition with it, beyond the obvious: any good model of a length of your series should be able to predict that, after small bumps at a certain distance to the last small bump, big bumps follow. That is, your series exhibits self-similarity, and the models used in Cronos are designed to model it.
If you don't mind C#, you should request a version of TimeSearcher2 from the folks at HCIL - pattern recognition is, for this system, drawing what a pattern looks like, and then checking whether your model is general enough to capture most instances with a low false-positive rate. Probably the most user-friendly approach you will find; all others require quite a background in statistics or pattern recognition strategies.
I'm not sure what package would work best for this. I did something similar at one point in college where I tried to automatically detect certain similar shapes on an x-y axis for a bunch of different graphs. You could do something like the following.
Class labels like:
no class
start of region
middle of region
end of region
Features like:
relative y-axis relative and absolute difference of each of the
surrounding points in a window 11 points wide
Features like difference from average
Relative difference between point before, point after
I am using deep learning if it's an option for you. It's done in Java, Deeplearning4j. I am experimenting with LSTM. I tried 1 hidden layer and 2 hidden layers to process time series.
return new NeuralNetConfiguration.Builder()
.seed(HyperParameter.seed)
.iterations(HyperParameter.nItr)
.miniBatch(false)
.learningRate(HyperParameter.learningRate)
.biasInit(0)
.weightInit(WeightInit.XAVIER)
.momentum(HyperParameter.momentum)
.optimizationAlgo(
OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT // RMSE: ????
)
.regularization(true)
.updater(Updater.RMSPROP) // NESTEROVS
// .l2(0.001)
.list()
.layer(0,
new GravesLSTM.Builder().nIn(HyperParameter.numInputs).nOut(HyperParameter.nHNodes_1).activation("tanh").build())
.layer(1,
new GravesLSTM.Builder().nIn(HyperParameter.nHNodes_1).nOut(HyperParameter.nHNodes_2).dropOut(HyperParameter.dropOut).activation("tanh").build())
.layer(2,
new GravesLSTM.Builder().nIn(HyperParameter.nHNodes_2).nOut(HyperParameter.nHNodes_2).dropOut(HyperParameter.dropOut).activation("tanh").build())
.layer(3, // "identity" make regression output
new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE).nIn(HyperParameter.nHNodes_2).nOut(HyperParameter.numOutputs).activation("identity").build()) // "identity"
.backpropType(BackpropType.TruncatedBPTT)
.tBPTTBackwardLength(100)
.pretrain(false)
.backprop(true)
.build();
Found a few things:
LSTM or RNN is very good at picking out patterns in time-series.
Tried on one time-series, and a group different time-series. Pattern were picked out easily.
It is also trying to pick out patterns not for just one cadence. If there are patterns by week, and by month, both will be learned by the net.