At the moment I'm trying to build an Autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But, as is often the case, my data is more complex than the data in this simple tutorial.
I have two different time series from two sensors, plus some metadata, such as which machine the time series was recorded on.
With a normal MLP network you could have one network for the time series and one for the metadata, and merge them in higher layers. But how can you use this data as input to an Autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found?
In this tutorial you can see an LSTM-VAE where the input time series is somehow concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (but not in detail). There you can find the following explanation of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features. Additionally, there is the original input (in this case a time series). Because he has so many arrays in the input, he needs a corresponding number of input layers. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train also works as y_train), and a list of Input layers, each directly stacked with an Embedding layer, for the categorical data. Once he has all the data in the corresponding Input layers, he can concatenate them as you said.
By the way, he's using the same list for the decoder to give it additional information. I tried it out, and it turned out to be helpful to add a dropout layer (with high dropout, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
Hope I could help you =)
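To make the concatenation step concrete, here is a minimal NumPy sketch of the shapes involved. It is not the notebook's code: the array sizes, the random embedding table, and the names series, machine_id and embedding_table are assumptions made only for illustration. The sketch repeats one embedding vector per sample along the time axis; feeding one category id per time step would give the same 3D layout.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, timesteps, n_sensors = 4, 10, 2   # assumed sizes
n_machines, emb_dim = 3, 5                    # assumed sizes

# Two sensor series per sample: a 3D array (batch, time, features).
series = rng.normal(size=(batch_size, timesteps, n_sensors))

# One categorical id per sample, e.g. the machine the series came from.
machine_id = np.array([0, 2, 1, 2])

# An embedding is a lookup table: row i is the vector for category i.
# In a real model these weights are learned; here they are random.
embedding_table = rng.normal(size=(n_machines, emb_dim))
machine_vec = embedding_table[machine_id]             # (batch, emb_dim)

# Repeat the vector along the time axis so it lines up with the series.
machine_seq = np.repeat(machine_vec[:, None, :], timesteps, axis=1)

# Concatenate on the feature axis: this is the encoder's 3D input.
encoder_input = np.concatenate([series, machine_seq], axis=-1)
print(encoder_input.shape)  # (4, 10, 7)
```

In Keras the same layout would come from one Input layer per array, an Embedding layer per categorical input, and a Concatenate layer on the last axis.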
Related
I've got this question here: For example, if it is necessary to predict a disease using both image data and some numeric data, so that the features would be like:
feature 1: image of the disease.
in shape: (batch_size, width, height)
feature 2: numeric data about the patient (age, height, sex, country, salary...)
in shape: (batch_size, number_of_numeric_features)
and the output of the model should be 0/1, 0 is healthy, 1 is sick.
I know one way is to flatten everything into a single feature vector of shape: (batch_size, width*height+number_of_numeric_features)
but in this case the advantage of a CNN in image classification won't be utilized (it becomes a plain feedforward network).
So my question is: what is the best way to combine image features and numeric features using a CNN?
Would adding the numeric features as image pixels in one channel of the CNN input be helpful? In that case the positions of the numeric features as pixels wouldn't make any sense, since the distance between two such pixels carries no meaning.
You should not feed this kind of numerical data to a CNN; as you mentioned yourself, it won't make any sense. But you can use a CNN for the image and another network (e.g. an MLP) for the numerical data. At the end, you can combine the outputs of the MLP and the CNN and feed them to another MLP, or just average their outputs, and compare the results.
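The two-branch idea can be sketched with plain NumPy. This is only a shape-level illustration under stated assumptions: image_branch and numeric_branch are hypothetical stand-ins with random weights for a trained CNN and MLP, and all sizes are made up. It shows the first option, concatenating the two branch outputs before a final dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, width, height, n_numeric = 8, 32, 32, 5   # assumed sizes

images = rng.random((batch_size, width, height))
numeric = rng.normal(size=(batch_size, n_numeric))

def image_branch(x):
    # Hypothetical stand-in for a CNN: average-pool 8x8 patches, flatten.
    b, w, h = x.shape
    pooled = x.reshape(b, w // 8, 8, h // 8, 8).mean(axis=(2, 4))
    return pooled.reshape(b, -1)                      # (batch, 16)

W_num = rng.normal(size=(n_numeric, 4))

def numeric_branch(x):
    # Hypothetical stand-in for an MLP: one dense layer with ReLU.
    return np.maximum(x @ W_num, 0.0)                 # (batch, 4)

# Fusion: concatenate both feature vectors, then a final dense layer.
fused = np.concatenate([image_branch(images), numeric_branch(numeric)], axis=1)
W_out = rng.normal(size=(fused.shape[1], 1))
prob_sick = 1.0 / (1.0 + np.exp(-(fused @ W_out)))    # sigmoid, (batch, 1)

print(fused.shape, prob_sick.shape)
```

In a real model both branches and the final layer would be trained jointly, so the image features stay spatially aware while the numeric features bypass the convolutions.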
I have a file of raw feedbacks that need to be labeled (categorized) and then used as the training input for an SVM classifier (or any classifier, for that matter).
But the catch is, I'm not assigning a whole feedback to a certain category. One feedback may belong to more than one category, based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). I've extracted the n-grams using TFIDF while saving their features so I could train my model on them. The problem is that TFIDF returns a document-term matrix, which is train_x, but on the other side I've got train_y: the labels that are assigned to each n-gram (not to the whole document). So I've ended up with a document-term matrix that has x rows (the number of documents) against a label vector of y n-grams (the number of unique topics extracted).
Below is a sample of what the data look like. Blue is the n-grams (extracted by TFIDF), while red is the labels/categories (calculated for each n-gram with a function I wrote myself).
Instead of putting code, this is my strategy in implementing my concept:
The problem lies in the part where TFIDF produces x_train = tf.Transform(feedbacks), which is a document-term matrix, and it doesn't make sense for it to be an input for the classifier against y_train, which holds the labels for the terms and not for the documents. I've tried to transpose the matrix, and it gave me an error. I've tried to input a 1-D array that holds only the feature values for the terms directly, which also gave me an error because the classifier expects X to be in a (samples, features) format. I'm using Sklearn's version of SVM and TfidfVectorizer.
Simply put, I want to be able to train an SVM classifier on a list of terms (n-grams) against a list of labels, and then have it predict labels for new data (after cleaning it and extracting its n-grams).
The solution might be something very technical, like using another classifier that expects a different format, or not using TFIDF since it's document-focused, or even something broader, such as a complete change of approach and concept (if it's wrong).
I'd very much appreciate it if someone could help.
What is the current state-of-the-art data augmentation technique for text classification?
I did some research online about how I can extend my training set with data transformations, the same way we do in image classification.
I found some interesting ideas such as:
Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
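The four operations listed above can be sketched in plain Python roughly as follows. The STOP_WORDS set and the SYNONYMS table are tiny hand-made assumptions for illustration; a real setup would take them from a proper stop-word list and a thesaurus.

```python
import random

# Tiny hand-made tables, assumptions for illustration only.
STOP_WORDS = {"the", "is", "a", "this"}
SYNONYMS = {"movie": ["film"], "great": ["excellent", "superb"]}

def synonym_replacement(words, n, rng):
    out = list(words)
    candidates = [i for i, w in enumerate(out)
                  if w not in STOP_WORDS and w in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_insertion(words, n, rng):
    out = list(words)
    for _ in range(n):
        candidates = [w for w in out if w not in STOP_WORDS and w in SYNONYMS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), synonym)
    return out

def random_swap(words, n, rng):
    out = list(words)
    for _ in range(n):
        if len(out) < 2:
            break
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p, rng):
    if not words:
        return []
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]   # never return an empty sentence

rng = random.Random(0)
sentence = "this movie is a great movie".split()
print(synonym_replacement(sentence, 2, rng))
print(random_insertion(sentence, 2, rng))
print(random_swap(sentence, 1, rng))
print(random_deletion(sentence, 0.3, rng))
```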
But I found nothing about using a pre-trained word vector representation model such as word2vec. Is there a reason?
Data augmentation using word2vec might help the model by bringing in external information. For instance, a token in a toxic comment could be randomly replaced by its nearest token in a pre-trained vector space, trained specifically on external online comments.
Is this a good method, or am I missing some important drawbacks of this technique?
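The replacement idea described in the question could look roughly like this. The 2-D vectors and the vocabulary below are invented for illustration, and nearest_neighbour and augment are hypothetical helper names; with real word2vec vectors a library lookup (for example gensim's KeyedVectors.most_similar) would play the role of nearest_neighbour.

```python
import numpy as np

# Toy 2-D "pre-trained" vectors, invented for illustration only.
vocab = ["stupid", "dumb", "silly", "hello", "thanks"]
vectors = np.array([
    [0.90, 0.10],
    [0.88, 0.15],
    [0.80, 0.30],
    [0.10, 0.90],
    [0.05, 0.95],
])
index = {w: i for i, w in enumerate(vocab)}

def nearest_neighbour(word):
    """Return the vocabulary word with the highest cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[index[word]]
    sims[index[word]] = -np.inf          # exclude the word itself
    return vocab[int(np.argmax(sims))]

def augment(tokens, targets):
    # Replace only tokens in `targets` that the vector space knows.
    return [nearest_neighbour(t) if t in targets and t in index else t
            for t in tokens]

print(augment("you are stupid".split(), targets={"stupid"}))
# ['you', 'are', 'dumb']
```

One thing to check when trying this on real vectors: nearest neighbours in word2vec-style spaces are not always synonyms (antonyms often sit close together), so a replacement can change the label.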
Your idea of using word2vec embeddings usually helps. However, that is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier with your own training data.
The two SOTA models are:
GPT-2 https://github.com/openai/gpt-2
BERT https://github.com/google-research/bert
The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training (for example, in BERT there is one task that randomly masks out words in a sentence at pre-training time). If I were you, I would first adopt a pre-trained model and fine-tune your own classifier with your current training data. Taking that as a baseline, you could try each of the data augmentation methods you like and see if they really help.
I'm starting to learn about neural networks and I came across data normalisation. I understand the need for it but I don't quite know what to do with my data once my model is trained and in the field.
Let's say I take my input data, subtract its mean and divide by the standard deviation. Then I take that as input and train my neural network.
Once in the field, what do I do with my input sample on which I want a prediction?
Do I need to keep my training data mean and standard deviation and use that to normalise?
Correct. The mean and standard deviation that you use to normalize the training data are the same ones that you use to normalize the testing data (i.e., don't compute a separate mean and standard deviation for the test data).
Hopefully this link will give you more helpful info: http://cs231n.github.io/neural-networks-2/
An important point to make about the preprocessing is that any preprocessing statistics (e.g. the data mean) must only be computed on the training data, and then applied to the validation / test data. E.g. computing the mean and subtracting it from every image across the entire dataset and then splitting the data into train/val/test splits would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test).
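As a minimal sketch of that rule, assuming synthetic data and NumPy (the array sizes and the 800/200 split are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))  # synthetic data

# Split first, then compute statistics on the training part only.
train, test = data[:800], data[800:]
mean = train.mean(axis=0)
std = train.std(axis=0)

train_norm = (train - mean) / std
test_norm = (test - mean) / std        # reuse the training statistics

# In the field: store `mean` and `std` together with the model and
# apply them to every new sample before calling the network.
new_sample = np.array([55.0, 42.0, 61.0])
new_sample_norm = (new_sample - mean) / std
print(new_sample_norm)
```

So yes, the training mean and standard deviation need to be saved alongside the trained weights; they are effectively part of the model.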
I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (word) with a simple Python function call. Much of the documentation seems to center on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.
So far, I have seen that you can feed in pretrained embeddings with the following syntax:
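from keras.layers import Embedding

embedding_layer = Embedding(input_dim=number_of_words,   # ?? vocabulary size
                            output_dim=128,              # ?? embedding size
                            weights=[pre_trained_matrix_here],
                            input_length=60,             # ?? sequence length
                            trainable=False)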
However, converting the different files and their structures to pre_trained_matrix_here is not quite clear to me.
They have several softmax outputs, so I am uncertain which one would be the right one, and furthermore how to align the words in my input with the dictionary of words that they provide.
Is there a simple manner to use these word/char embeddings in keras and/or to construct the character/word embedding portion of the model in keras such that further layers may be added for other NLP tasks?
The Embedding layer only picks up embeddings (rows of the weight matrix) for integer indices of input words; it does not know anything about the strings. This means you first need to convert your input sequence of words to a sequence of indices, using the same vocabulary as was used in the model you take the embeddings from.
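A minimal sketch of that conversion with NumPy follows. The three-word vocabulary, the 2-D vectors, and the choice to reserve index 0 for padding and index 1 for unknown words are all assumptions for illustration; the real word list and vectors have to come from the files published with the pre-trained model.

```python
import numpy as np

# Hypothetical pre-trained data: a word list and one vector per word.
pretrained_words = ["the", "cat", "sat"]
pretrained_vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# Reserve index 0 for padding and index 1 for unknown words.
PAD, UNK = 0, 1
word_index = {w: i + 2 for i, w in enumerate(pretrained_words)}

emb_dim = pretrained_vectors.shape[1]
embedding_matrix = np.zeros((len(pretrained_words) + 2, emb_dim))
embedding_matrix[2:] = pretrained_vectors

def encode(sentence, max_len):
    # Map words to indices, truncate, then pad to a fixed length.
    ids = [word_index.get(w, UNK) for w in sentence.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD] * (max_len - len(ids))

ids = encode("The cat sat down", max_len=6)
print(ids)                          # [2, 3, 4, 1, 0, 0]
print(embedding_matrix[ids].shape)  # (6, 2)
```

The resulting embedding_matrix is what would be passed as the pre-trained weights, with its number of rows as the vocabulary size and its number of columns as the embedding size; the index sequences produced by encode are the model's actual input.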
For NLP applications that are related to word or text encoding I would use CountVectorizer or TfidfVectorizer. Both are briefly introduced and described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf
CountVectorizer can be used for simple applications such as a spam/ham detector, while TfidfVectorizer gives deeper insight into how relevant each term (word) is, based on its frequency in a document and the number of documents in which it appears. This results in an interesting metric of how discriminative the terms are. These text feature extractors can be combined with stop-word removal and lemmatization to improve the feature representation.
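To show the difference between the two weightings, here is a small hand-rolled sketch in plain Python. It is not scikit-learn's implementation: it uses the textbook formula tf * log(N / df), while library versions typically add smoothing and normalization, and the three example documents are made up.

```python
import math
from collections import Counter

docs = [
    "win cash now",
    "meeting at noon",
    "win a free meeting",
]
tokenized = [d.split() for d in docs]

# Bag of words: raw term counts per document (the count-vectorizer view).
counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tfidf(term, doc_id):
    # Textbook variant: tf * log(N / df). Unknown terms get weight 0.
    if df[term] == 0:
        return 0.0
    return counts[doc_id][term] * math.log(n_docs / df[term])

print(counts[0]["cash"], counts[0]["win"])  # both counts are 1
print(tfidf("cash", 0))   # appears in one document: higher weight
print(tfidf("win", 0))    # appears in two documents: lower weight
```

Both words occur once in the first document, so a pure count treats them the same, while the TF-IDF weight ranks "cash" above "win" because it is rarer across documents.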