I have the following dataset for a chemical process in a refinery. It consists of a 5x5 input matrix, where each vector is sampled every minute. The output is the result of the whole process and is sampled every 5 minutes.
I concluded that the output (yellow) depends strongly on past input vectors over time. I recently started looking at LSTMs and am trying to learn a bit about them in Python with Torch.
However, I have no idea how to prepare my dataset so that an LSTM can process it and produce future predictions when tested with new input vectors.
Is there a straightforward way to preprocess my dataset accordingly?
EDIT1: I found this excellent blog post about training LSTMs for natural language processing: http://karpathy.github.io/2015/05/21/rnn-effectiveness/. Long story short, an LSTM takes a character as input and tries to generate the next character. Eventually, it can be trained on Shakespeare poems to generate new Shakespeare-like poems, although GPU acceleration is recommended.
EDIT2: Based on EDIT1, the best way to format my dataset seems to be to convert my Excel sheet to a TXT file with TAB-separated columns. I'll post the results of the LSTM prediction on the numeric dataset above as soon as possible.
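For reference, here is a minimal sketch (in PyTorch, with random placeholder arrays standing in for my real spreadsheet) of how the windowing could look, assuming each minute's 5x5 reading is flattened to 25 features and each 5-minute output is aligned with the five 1-minute readings that precede it:

import numpy as np
import torch

# Placeholder data: one flattened 5x5 reading (25 values) per minute,
# one process output every 5 minutes.
inputs = np.random.rand(600, 25).astype(np.float32)   # 600 minutes of readings
outputs = np.random.rand(120, 1).astype(np.float32)   # 120 output samples

window = 5  # five 1-minute readings per output sample

# Build (sequence, target) pairs: the i-th output is paired with the
# five readings that precede it.
X = np.stack([inputs[i * window:(i + 1) * window] for i in range(len(outputs))])
y = outputs

X_t = torch.from_numpy(X)   # shape (120, 5, 25): batch, time steps, features
y_t = torch.from_numpy(y)   # shape (120, 1)

# A minimal LSTM regressor over the 5-step window.
class Seq2One(torch.nn.Module):
    def __init__(self, n_features=25, hidden=64):
        super().__init__()
        self.lstm = torch.nn.LSTM(n_features, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, time, hidden)
        return self.head(out[:, -1])   # predict from the last time step

model = Seq2One()
pred = model(X_t)                                      # (120, 1) predictions
loss = torch.nn.functional.mse_loss(pred, y_t)         # training would minimize this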
Related
How can I run an SVM on a large text classification dataset (about 400 thousand entries) for detecting fake news? The pipeline uses positional encoding for the embeddings from Keras, with a maximum sentence length of 15 and padding, and I don't want to use TF-IDF or word2vec because they tokenize into words. I have tried running it on the free version of Google Colab, but it takes too long and keeps getting disconnected because of the large sparse matrix. I would like to keep the sentence embeddings, since they carry important information for the analysis. Are there any solutions or suggestions for this issue?
If there are resources or notebooks on this, please do provide a link.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

# onehot_repr / onehot_repr_test come from the earlier one-hot encoding step
sent_length = 15
embedded_docs = pad_sequences(onehot_repr, padding='post', maxlen=sent_length)
embedded_docs_test = pad_sequences(onehot_repr_test, padding='post', maxlen=sent_length)

## Creating model
embedding_vector_features = 300  ## size of each embedding vector
model = Sequential()
model.add(Embedding(voc_size, embedding_vector_features, input_length=sent_length))
model.compile('adam', 'mse')
After getting the matrix, I am providing it as input to the SVM.
I tried using positional embeddings to run the SVM, but due to the sparsity it is not able to run.
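For clarity, a rough sketch of the kind of pipeline I mean (LinearSVC is just an example SVM implementation; y_train would be the fake-news labels from my preprocessing and is not shown here):

from sklearn.svm import LinearSVC

# The Embedding layer outputs a (samples, sent_length, embedding_dim) tensor,
# so it has to be flattened into one row per document before an SVM can use it.
embedding_output = model.predict(embedded_docs)               # (n, 15, 300)
X_svm = embedding_output.reshape(len(embedding_output), -1)   # (n, 15 * 300)

# A linear SVM copes better with wide feature matrices than a kernel SVC.
clf = LinearSVC()
clf.fit(X_svm, y_train)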
At the moment I'm trying to build an Autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But, as is often the case, my data is more complex than in this simple tutorial.
I have two different time series from two sensors, plus some metadata, such as which machine the time series was recorded from.
With a normal MLP network you could have one network for the time series and one for the metadata and merge them in higher layers. But how can you use this kind of data as the input to an autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found yet?
In this tutorial you can see an LSTM-VAE where the input time series is concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (though not in detail). There you can find the following explanation of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features, and additionally there is the original input (in this case a time series). Because he has so many arrays in the input, he needs a corresponding number of input layers. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train works like y_train), and a list of Input layers, each directly followed by an embedding layer for the categorical data. After he has all the data in the corresponding Input layers, he can concatenate them as you said.
By the way, he's using the same list for the decoder to give it additional information. I tried it out, and it turned out to be helpful to add a dropout layer (high dropout, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
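Here is a minimal sketch of that idea in Keras (not the notebook's exact code; the sequence length, the single machine-id metadata field and the layer sizes are just assumptions for illustration):

from tensorflow.keras import layers, Model

timesteps, n_features = 50, 2     # e.g. two sensor channels per time step
n_machines, emb_dim = 10, 4       # categorical metadata: machine id

# One Input for the time series itself, one per categorical metadata field.
ts_in = layers.Input(shape=(timesteps, n_features), name="series")
machine_in = layers.Input(shape=(1,), name="machine_id")

# Embed the categorical feature and repeat it along the time axis so it can
# be concatenated with the 3D sequence input.
machine_emb = layers.Flatten()(layers.Embedding(n_machines, emb_dim)(machine_in))
machine_seq = layers.RepeatVector(timesteps)(machine_emb)     # (batch, timesteps, emb_dim)

# Encoder: raw series concatenated with the embedded metadata.
enc_in = layers.Concatenate()([ts_in, machine_seq])
latent = layers.LSTM(16)(enc_in)                              # 2D latent code

# Decoder: upsample the latent vector back into a sequence and concatenate the
# metadata again; the dropout forces the decoder to rely on the latent code.
dec_seq = layers.RepeatVector(timesteps)(latent)
meta_dropped = layers.Dropout(0.6)(machine_seq)
dec_in = layers.Concatenate()([dec_seq, meta_dropped])
dec_out = layers.LSTM(32, return_sequences=True)(dec_in)
recon = layers.TimeDistributed(layers.Dense(n_features))(dec_out)

autoenc = Model([ts_in, machine_in], recon)
autoenc.compile(optimizer="adam", loss="mse")
# autoenc.fit([x_series, x_machine], x_series, ...)  # x_train doubles as y_train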
hope I could help you =)
Can a recurrent neural network be used to learn sequences with slightly different variations? For example, could an RNN be trained so that it could produce a sequence of consecutive integers, or of alternating integers, if I have enough training data?
For example, if I train using
1,2,3,4
2,3,4,5
3,4,5,6
and so on
and also train the same network using
1,3,5,7
2,4,6,8
3,5,7,9
and so on,
would I be able to predict both sequences successfully for the test set?
What if I have even more variations in the training data like sequences of every three integers or every four integers, et cetera?
Yes, provided there is enough information in the sequence so that it is not ambiguous, a neural network should be able to learn to complete these sequences correctly.
You should note a few details though:
Neural networks, and ML models in general, are bad at extrapolation. A simple network is very unlikely to learn about sequences in general. It will never learn the concept of sequence logic in the way a child quickly would. So if you feed in test data outside of its experience (e.g. steps of 3 between items, when they were not in the training data), it will perform badly.
Neural networks prefer scaled inputs - a common pre-processing step is to normalise each input column to mean 0 and standard deviation 1. Whilst it is possible for a network to accept a larger range of numbers as inputs, doing so will reduce the effectiveness of training. With a generated training set such as artificial numeric sequences, you may be able to force your way through that by training for longer with more examples.
You will need more neurons, and more layers, to support a larger variety of sequences.
An RNN will predict badly if the sequence it has processed so far is ambiguous. E.g. if you train on 1,2,3,4 and 1,2,3,5 with equal numbers of samples, it will predict either 4.5 (for a regression) or a 50% chance each of 4 and 5 (for a classifier) when it is shown the sequence 1,2,3 and asked to predict.
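If you want to try this yourself, here is a small self-contained experiment in PyTorch (the step sizes, starting values and hyperparameters are arbitrary illustrative choices):

import numpy as np
import torch

# Sequences of length 4 with step sizes 1 and 2, e.g. 1,2,3,4 and 1,3,5,7:
# the first three values are the input, the last one is the target.
seqs = []
for step in (1, 2):
    for start in range(1, 30):
        seqs.append([start + i * step for i in range(4)])
seqs = np.array(seqs, dtype=np.float32)

# Scale to roughly zero mean / unit variance, as recommended above.
mean, std = seqs.mean(), seqs.std()
seqs = (seqs - mean) / std

X = torch.from_numpy(seqs[:, :3]).unsqueeze(-1)   # (batch, 3 steps, 1 feature)
y = torch.from_numpy(seqs[:, 3:])                 # (batch, 1)

class StepLSTM(torch.nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = torch.nn.LSTM(1, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)
        return self.head(out[:, -1])

model = StepLSTM()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(500):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()

# Continue 4,5,6 (step 1) and 4,6,8 (step 2); apply the same scaling in and out.
test = (np.array([[4, 5, 6], [4, 6, 8]], dtype=np.float32) - mean) / std
pred = model(torch.from_numpy(test).unsqueeze(-1)) * std + mean
print(pred.detach().numpy())   # should come out close to 7 and 10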
I am just starting to grasp the idea of backpropagation and MLP networks. What I am confused about is how the input vectors are "clamped" in the input layer.
For example, let's take a mock IRIS dataset:
[5.0,4.4,2.7,1.5,0],
[3.0,3.6,1.8,1.7,1],
[2.0,1.2,3.3,4.2,2]
Are these inputs fed in all together into the input layer, or are they fed in one by one?
What I mean is: on the first iteration, is the first input vector fed in like:
[5.0,4.4,2.7,1.5,1]
and then the error is calculated, and then the next input vector is sent, i.e.
[3.0,3.6,1.8,1.7,2]
Or are they all sent in together as:
[[A vector of all petal lengths],[A vector of all sepal lengths],etc]
I know different frameworks handle this differently, but feel free to comment on how any popular deep learning framework would do this. I use DeepLearning4J myself.
Thanks.
Input vectors are usually fed into neural networks in batches. How many vectors those batches contain depends on the batch size. E.g. a batch size of 128 means that you feed 128 input vectors into the network (or fewer if there aren't that many left) and then update the weights/parameters. The iris tutorial of Deeplearning4J seems to use a batch size of 150: int batchSize = 150; and later DataSetIterator iter = new IrisDataSetIterator(batchSize, numSamples);.
Note that there's also a batch mode for updating the weights of neural networks, which - confusingly - updates the weights only after all input vectors have been fed into the network. This full-batch mode, however, is rarely used in practice.
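To make the mini-batch idea concrete outside of DL4J, here is a small sketch in PyTorch with random stand-in data (the network and the batch size are arbitrary; the point is only where the weight update happens):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the iris data: 150 samples, 4 features, 3 classes.
X = torch.randn(150, 4)
y = torch.randint(0, 3, (150,))

model = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

# batch_size controls how many input vectors are fed forward before each
# weight update: 150 would mean one update per pass over the data,
# 1 would mean an update after every single vector.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(5):
    for xb, yb in loader:          # xb: a (batch_size, 4) block of input vectors
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()            # gradients computed from this batch only
        opt.step()                 # weights updated once per batch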
I am implementing software for speech recognition using Mel Frequency Cepstral Coefficients. In particular, the system must recognize a single specified word. From the audio file I get the MFCCs as a matrix with 12 rows (the MFCCs) and as many columns as the number of voice frames. I take the average of each row over all frames, so I get a vector with only 12 entries (the i-th entry is the average of the i-th MFCC over all frames). My question is: how do I train a classifier to detect the word? I have a training set with only positive samples: the MFCCs that I get from several audio files (several recordings of the same word).
I take the average of each row over all frames, so I get a vector with only 12 entries (the i-th entry is the average of the i-th MFCC over all frames).
This is a very bad idea, because you lose all information about the word. You need to analyze the whole MFCC sequence, not just a part of it.
My question is: how do I train a classifier to detect the word?
The simplest form would be a GMM classifier; you can check here:
http://www.mathworks.com/company/newsletters/articles/developing-an-isolated-word-recognition-system-in-matlab.html
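If you prefer Python over that MATLAB example, a rough sketch of the GMM idea with scikit-learn could look like the following (train_mfccs is a hypothetical list of per-recording MFCC matrices; the number of components and the decision threshold are things you would have to tune):

import numpy as np
from sklearn.mixture import GaussianMixture

# Each matrix is 12 x n_frames as described above; transpose so that every
# row is one frame's 12 MFCCs, and keep all frames instead of averaging them.
frames = np.vstack([m.T for m in train_mfccs])

# One GMM models the distribution of frames for the target word.
gmm_word = GaussianMixture(n_components=8, covariance_type='diag').fit(frames)

def score(mfcc_matrix):
    """Average per-frame log-likelihood of a new recording under the word model."""
    return gmm_word.score(mfcc_matrix.T)

# With only positive training samples, accept the word when the score exceeds
# a threshold tuned on held-out recordings:
# accept = score(new_mfccs) > threshold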
In a more complex form you need to learn a more complex model like an HMM. You can learn more about HMMs from a textbook like this one:
http://www.amazon.com/Fundamentals-Speech-Recognition-Lawrence-Rabiner/dp/0130151572