I am trying to predict a time series from another time series. The input and the output have the same length, but a different structure: the input is noisier, while the output has a nice sinusoidal shape. I tried implementing a simple recurrent net that takes a window of length n of the input (from time t-n up to t) to predict the value of the output at time t, then shifts the window to predict the output at time t+1. The model performs poorly: its outputs mimic the shape of the input, whereas they should take the values of the target output.
This is a type of seq2seq model, but I cannot find an adaptation of seq2seq to numerical data; the examples I find mainly cover machine translation and similar tasks. I was wondering whether there is a special ML model that could be applied to this task.
Thank you!
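For reference, a minimal sketch of what a numeric "seq2seq"-style model could look like in PyTorch. The architecture, sizes, and names here are assumptions for illustration, not a tested solution:

```python
import torch
import torch.nn as nn

# An LSTM that maps an input series to an output series of the same
# length: one prediction per time step instead of a sliding window.
class Seq2SeqRegressor(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):          # x: (batch, time, 1)
        out, _ = self.lstm(x)
        return self.head(out)      # (batch, time, 1)

model = Seq2SeqRegressor()
noisy_input = torch.randn(8, 100, 1)   # synthetic noisy series
prediction = model(noisy_input)
print(prediction.shape)                # torch.Size([8, 100, 1])
```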
I'm currently working on a classification problem with TensorFlow, and I'm new to the world of machine learning, but there is something I don't get.
I have successfully trained models that output the y tensor like this:
y = [0,0,1,0]
But I can't understand the principle behind it...
Why not just train the same model to output classes directly, such as y = 3 or y = 4?
This seems much more flexible, because I can imagine a multi-class classification problem with 2 million possible classes, and it would be much more efficient to output a single number between 0 and 2,000,000 than a tensor of 2,000,000 items for every result.
What am I missing?
Ideally, you could train your model to classify input instances and produce a single output. Something like
y=1 means input=dog, y=2 means input=airplane. An approach like that, however, brings a lot of problems:
How do I interpret the output y=1.5?
Why am I regressing a number as if I were working with continuous data, when I'm in reality working with discrete data?
In fact, what you are doing is treating a multi-class classification problem like a regression problem.
This is simply wrong (unless you're doing binary classification, in which case a positive and a negative output are all you need).
To avoid these (and other) issues, we use a final layer with one neuron per class and associate a high activation with the right class.
The one-hot encoding represents the fact that you want to force your network to have a single high-activation output when a certain input is present.
Thus, every input=dog will have 1, 0, 0 as output, and so on.
In this way, you're correctly treating a discrete classification problem and producing a discrete, well-interpretable output. In fact, you'll always extract the output neuron with the highest activation using tf.argmax; even if your network hasn't learned to produce a perfect one-hot encoding, you'll still be able to extract, without doubt, the most likely correct output.
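As a minimal sketch of that last point (the logits below are made-up values, not from a trained model):

```python
import tensorflow as tf

# Raw network outputs (logits) for a batch of two inputs, four classes each.
logits = tf.constant([[0.1, 0.2, 3.0, 0.4],
                      [2.5, 0.1, 0.3, 0.2]])

# The rows are nowhere near perfect one-hot vectors, yet the most likely
# class is still simply the index of the largest activation.
predicted_classes = tf.argmax(logits, axis=1)
print(predicted_classes.numpy())  # -> [2 0]
```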
The answer lies in how that final tensor, or single value, is calculated. In an NN, your y=3 would be built by a weighted sum over the values of the previous layer.
Trying to train towards single values would then imply a linear relationship between the category IDs where none exists: for the true value y=4, the output y=3 would be considered better than y=1, even though the categories are arbitrary and may be 1: dogs, 3: cars, 4: cats.
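A tiny numeric sketch of that problem (the labels and the choice of squared-error loss are assumptions for illustration):

```python
# A squared-error loss treats class IDs as ordered quantities.
true_label = 4               # "cats"
for predicted in (1, 3):     # "dogs", "cars"
    loss = (true_label - predicted) ** 2
    print(predicted, loss)   # -> 1 9, then 3 1

# Predicting "cars" (3) scores much better than "dogs" (1), even though
# both are equally wrong as categories.
```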
Neural networks use gradient descent to optimize a loss function. In turn, this loss function needs to be differentiable.
A discrete output would be (indeed is) a perfectly valid and valuable output for a classification network. The problem is, we don't know how to optimize such a net efficiently.
Instead, we rely on a continuous loss function. This loss function is usually based on something that is more or less related to the probability of each label -- and for this, you need a network output that has one value per label.
Typically, the output that you describe is then deduced from this soft, continuous output by taking the argmax of these pseudo-probabilities.
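For instance, here is a minimal sketch of that pipeline (the scores are made up; softmax plus cross-entropy is one common choice, not the only one):

```python
import numpy as np

def softmax(z):
    """Turn raw per-label scores into pseudo-probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([0.5, 2.0, 0.1, -1.0])  # one value per label
probs = softmax(scores)

true_label = 1
loss = -np.log(probs[true_label])   # cross-entropy: continuous, differentiable
prediction = int(np.argmax(probs))  # the discrete output is deduced at the end
print(probs, round(loss, 3), prediction)
```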
Let S and T be sets of time series labelled with a property. Each time series is highly periodic and in fact contains consecutive repeats of the same process (consider e.g. a gait recording, which is a time series of foot positions repeating the same motion; I'm calling one repetition a segment for simplicity's sake).
What is a good feature extractor if my objective is to build a model that, from a sequence of such segments, returns a similarity score to S or T? Ignore the model itself for now; just consider feature extraction for the time being.
What you described falls into the following class of problems:
Given a sequence of features,
classify or recognize the hidden state.
For example, in machine vision the sequence could be images captured continuously of a moving human, and the goal is to identify certain categories of gestures.
In your problem, the input is d-dimensional time series data and your output is the probability of two classes (S and T).
There are some general methods to handle such problems, namely hidden Markov models (HMMs) and conditional random fields (CRFs).
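As a rough sketch of the HMM route: fit one model per labelled set and score a new sequence against both, so the log-likelihoods act as the similarity scores the question asks about. This assumes the third-party hmmlearn package, and the data below is synthetic:

```python
import numpy as np
from hmmlearn import hmm  # third-party package, assumed installed

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two labelled sets of d-dimensional series (d=2).
series_S = [rng.normal(0.0, 1.0, size=(100, 2)) for _ in range(5)]
series_T = [rng.normal(2.0, 1.0, size=(100, 2)) for _ in range(5)]

def fit_hmm(series_list, n_states=3):
    # hmmlearn expects one stacked array plus the per-series lengths.
    X = np.vstack(series_list)
    lengths = [len(s) for s in series_list]
    model = hmm.GaussianHMM(n_components=n_states, n_iter=50)
    model.fit(X, lengths)
    return model

model_S, model_T = fit_hmm(series_S), fit_hmm(series_T)

# Higher log-likelihood = more similar to that set.
query = rng.normal(0.0, 1.0, size=(100, 2))
print(model_S.score(query), model_T.score(query))
```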
I have the following dataset for a chemical process in a refinery. It is comprised of 5x5 input vectors, where each vector is sampled every minute. The output is the result of the whole process and is sampled every 5 minutes.
I concluded that the output (yellow) depends heavily on past input vectors over time. I recently had a look at LSTMs and am trying to learn a bit about them in Python and Torch.
However, I don't have any idea how I should prepare my dataset so that my LSTM can process it and show me future predictions when tested with new input vectors.
Is there a straightforward way to preprocess my dataset accordingly?
EDIT1: Actually, I found this awesome blog post about training LSTMs for natural language processing: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ . Long story short, an LSTM takes a character as input and tries to generate the next character. Eventually, it can be trained on Shakespeare poems to generate new Shakespeare poems! But GPU acceleration is recommended.
EDIT2: Based on EDIT1, the best way to format my dataset seems to be to transform my Excel sheet into a txt file with TAB-separated columns. I'll post the results of the LSTM prediction on my numbers dataset above as soon as possible.
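For what it's worth, a minimal sketch of windowing such data for an LSTM (all shapes and names below are assumptions based on the description, not the actual dataset):

```python
import numpy as np

minutes = 300
inputs = np.random.randn(minutes, 5)       # one 5-dim input vector per minute
outputs = np.random.randn(minutes // 5)    # one process reading per 5 minutes

window = 5  # feed the 5 input vectors preceding each output reading

X, y = [], []
for i, target in enumerate(outputs):
    end = (i + 1) * window                 # output i is measured after this minute
    X.append(inputs[end - window:end])
    y.append(target)

X = np.stack(X)   # shape (n_samples, window, 5): the 3-D layout LSTMs expect
y = np.array(y)   # shape (n_samples,)
print(X.shape, y.shape)
```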
I have a training set where the input vectors are speed, acceleration, and turn angle change. The output is a crisp class: an activity state from the given set {rest, walk, run}. For example, the input vector [3.1, 1.2, 2] maps to run, [2.1, 1, 1] maps to walk, and so on.
I am using Weka to develop a neural network model. I define the outputs as crisp ones (or rather qualitative ones in words, i.e. categorical values). After training, the model classifies the test data fairly well.
I was wondering how the internal mapping takes place. Are the qualitative output states assigned some nominal value inside the model and converted back to categorical data after processing? A NN model cannot map float input values to categorical data through hidden neurons directly, so what is actually happening, given that the model works fine?
If the model converts the categorical outputs into numerical ones before processing, on what basis does it convert a categorical value into some arbitrary numerical value?
Yes, categorical values are usually converted to numbers, and the network learns to associate input data with these numbers. However, these numbers are often further encoded so as not to use only a single output neuron. The most common way to do this, for unordered labels, is to add a dummy output neuron for each category and use 1-of-C encoding, with 0.1 and 0.9 as target values. The output is then interpreted using the winner-take-all paradigm.
Using only one neuron and encoding unordered categories as different numbers often leads to problems, as the network will treat middle categories as "averages" of the boundary categories. This may, however, sometimes be desired if you have ordered categorical data.
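A minimal sketch of that encoding and decoding, using the activity labels from the question (the network output below is made up):

```python
import numpy as np

classes = ["rest", "walk", "run"]  # unordered activity labels

def one_of_c(label, low=0.1, high=0.9):
    """1-of-C target vector with soft 0.1/0.9 targets, as described above."""
    target = np.full(len(classes), low)
    target[classes.index(label)] = high
    return target

print(one_of_c("walk"))  # -> [0.1 0.9 0.1]

# Winner-take-all interpretation of a (hypothetical) network output:
output = np.array([0.15, 0.30, 0.80])
print(classes[int(np.argmax(output))])  # -> run
```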
You can find a very good explanation of this issue in this part of the online Neural Network FAQ.
The neural net's computations all take place on continuous values. To do multiclass classification with discrete output, its final layer produces a vector of such values, one for each class. To make a discrete class prediction, take the index of the maximum element in that vector.
So if the final layer in a classification network for four classes predicts [0 -1 2 1], then the third element of the vector is the largest and the third class is selected. Often, these values are also constrained to form a probability distribution by means of a softmax activation function.
I am new to Weka and I am trying to build a classifier to classify EEG data. The EEG attribute data is 5 minutes of recorded raw signal, along with other attributes. How can I specify, in the WEKA ARFF file format, that my instance has a vector input of a 5-minute raw signal?
for example:
Num. -- raw -- class
1 -- [1,2,3,4,5,6] -- Relaxed
2 -- [2,3,4,5,6] -- Bored
where raw is an attribute vector.
Think about your problem: what are you trying to classify/predict, and how can it best be represented? Chances are you don't want to predict the next raw EEG reading, so a time-series approach probably isn't critical.
Weka can only handle instances (rows of data) with a fixed set of attributes (features, values, or in other words, a vector of a predefined length). The possible attribute types are nominal (e.g. "red", "green", "blue"), numeric (any integer/floating-point value), string (mostly for text mining), and date. There is no way to represent a vector of raw signal as a single attribute. Here is the documentation: http://weka.wikispaces.com/ARFF+%28stable+version%29
That said, your instances could look like this:
num,class1,reading_1,reading_2,reading_3 ... reading_n,relaxed,bored
where reading_1 is the first raw reading and reading_n is the last one at the end of 5 minutes. This would be asking WEKA to predict your class based on the raw readings, and it probably won't be very effective (because the readings may not line up with each other, and because this treats each reading separately, with no regard for things like frequency or average, which are relative measures).
Alternatively, you can do some pre-processing of the raw data so that it is useful for most machine learning algorithms in WEKA. In this case, you would need to decide on important features and then create them. A crude example could be:
num,class1,average,frequency,max_magnitude,standard_deviation,relaxed,bored
where you have calculated things like the average and frequency of the data before putting it into an ARFF file. The algorithms then have a much more informative picture of the dataset on which to base their predictions.
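As a crude sketch of that pre-processing step (the feature choices here are illustrative assumptions, not a recommendation):

```python
import numpy as np

def extract_features(raw):
    """Compute simple summary features for one 5-minute raw EEG sample."""
    raw = np.asarray(raw, dtype=float)
    spectrum = np.abs(np.fft.rfft(raw - raw.mean()))
    dominant = int(np.argmax(spectrum[1:]) + 1)  # skip the DC bin
    return {
        "average": raw.mean(),
        "frequency": dominant,            # dominant frequency bin (crude)
        "max_magnitude": np.abs(raw).max(),
        "standard_deviation": raw.std(),
    }

print(extract_features([1, 2, 3, 4, 5, 6]))  # toy signal from the question
```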
However, still another concern is what you are representing: is the entire 5-minute sample the same class, or is the user relaxed for part of it and bored for another part? If the latter, you should probably have two samples: one for when the user is bored and one for when she is relaxed.