I have a question about concatenating two doc2vec models. I followed the official gensim IMDB example on doc2vec and applied it to some example data.
When concatenating two models (PV-DM + PV-DBOW), as outlined in the original paper, I noticed that the concatenated model does not have 200 dimensions like the two input models, but 400:
Shape Train(11948, **400**)
Shape Test(2987, **400**)
The input shapes were each:
np.asarray(X_train).shape
(11948, **200**)
(2987, **200**)
Is this correct? I expected the number of dimensions to be 200 again.
This is correct. PV-DM and PV-DBOW are two different models, each producing its own embedding of dimension dim, where dim=200 in your case. Hence, when you concatenate them, the dimension doubles to 2*dim = 400.
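For completeness, here is a minimal sketch of how the two 200-dim vectors become one 400-dim feature vector, along the lines of the gensim IMDB example (model_dm, model_dbow, and train_tags are assumed names, not from your code):

```python
import numpy as np

# Assumed: two trained models as in the gensim IMDB example, e.g.
#   model_dm   = Doc2Vec(dm=1, vector_size=200, ...)  # PV-DM
#   model_dbow = Doc2Vec(dm=0, vector_size=200, ...)  # PV-DBOW

def concat_vector(tag, model_dm, model_dbow):
    # gensim 4.x stores per-document vectors in model.dv
    # (older versions call it model.docvecs)
    return np.concatenate([model_dm.dv[tag], model_dbow.dv[tag]])

# X_train = np.asarray([concat_vector(t, model_dm, model_dbow) for t in train_tags])
# X_train.shape  ->  (11948, 400), i.e. 200 + 200
```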
At the moment I'm trying to build an Autoencoder for detecting anomalies in time series data.
My approach is based on this tutorial: https://keras.io/examples/timeseries/timeseries_anomaly_detection/
But, as is often the case, my data is more complex than in this simple tutorial.
I have two different time series from two sensors, plus some metadata, such as which machine the time series was recorded from.
With a normal MLP network you could have one network for the time series and one for the metadata and merge them in higher layers. But how can you use this data as the input to an autoencoder?
Do you have any ideas, or links to tutorials or papers I haven't found?
In this tutorial you can see an LSTM-VAE where the input time series is concatenated with categorical data: https://github.com/cerlymarco/MEDIUM_NoteBook/tree/master/VAE_TimeSeries
There is an article explaining the code (though not in detail). There you can find the following explanation of the model:
"The encoder consists of an LSTM cell. It receives as input 3D sequences resulting from the concatenation of the raw traffic data and the embeddings of categorical features. As in every encoder in a VAE architecture, it produces a 2D output that is used to approximate the mean and the variance of the latent distribution. The decoder samples from the 2D latent distribution upsampling to form 3D sequences. The generated sequences are then concatenated back with the original categorical embeddings which are passed through an LSTM cell to reconstruct the original traffic sequences."
But sadly I don't understand exactly how they concatenate the input data. If you understand it, it would be nice if you could explain it =)
I think I understood it. You have to take a look at the input of the .fit() function. It is not one array; there are separate arrays for the separate categorical features, in addition to the original input (in this case the time series). Because there are so many arrays in the input, a corresponding number of Input layers is needed. So there is one Input layer for the time series, another for the same time series (it's an autoencoder, so x_train works like y_train), and a list of Input layers stacked directly with the embedding layers for the categorical data. Once all the data is in the corresponding Input layers, they can be concatenated as you said.
By the way, he's using the same list for the decoder to give it additional information. I tried it out, and it turned out to be helpful to add a dropout layer (with a high rate, e.g. 0.6) between the additional inputs and the decoder. If you do so, the decoder has to learn from the latent z and not only from the additional data!
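To make that concrete, here is a minimal sketch of the pattern in Keras (not the notebook's exact code; all shapes, layer sizes, and names are made up for illustration):

```python
from tensorflow import keras
from tensorflow.keras import layers

seq_len, n_sensors = 100, 2   # two sensor time series (assumed shapes)
n_machines, emb_dim = 10, 4   # categorical metadata: machine id

ts_in = keras.Input(shape=(seq_len, n_sensors), name="series")
machine_in = keras.Input(shape=(1,), name="machine_id")

# Embed the categorical feature and repeat it along the time axis
# so it can be concatenated with the 3D sequence input.
emb = layers.Embedding(n_machines, emb_dim)(machine_in)  # (batch, 1, emb_dim)
emb = layers.Reshape((emb_dim,))(emb)
emb_seq = layers.RepeatVector(seq_len)(emb)              # (batch, seq_len, emb_dim)

x = layers.Concatenate(axis=-1)([ts_in, emb_seq])
z = layers.LSTM(16)(x)                                   # latent code

# Decoder: repeat z and concatenate the metadata again, with a high
# dropout so the decoder cannot rely on the metadata alone.
d = layers.RepeatVector(seq_len)(z)
d = layers.Concatenate(axis=-1)([d, layers.Dropout(0.6)(emb_seq)])
d = layers.LSTM(16, return_sequences=True)(d)
out = layers.TimeDistributed(layers.Dense(n_sensors))(d)

autoenc = keras.Model([ts_in, machine_in], out)
autoenc.compile(optimizer="adam", loss="mse")
# autoenc.fit([X_series, X_machine], X_series, ...)  # x_train doubles as y_train
```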
hope I could help you =)
I have two predictors and want to vectorize each of them using tf-idf (I don't want to concatenate them first, since we need a separate vocabulary for each). Should I apply a tf-idf vectorizer to each and then join the features?
For example, if I apply tf-idf to predictor1 I get 100 features from it, and 200 from predictor2. My features for the training data would then simply be 300 (100 + 200). Am I thinking correctly here?
I will get two matrices from this (one for each predictor); can I concatenate these using numpy functions and use them as features?
Your suggested way of getting this done is correct. The most common way of using two vectors like this is to concatenate them into one longer vector and then feed it to the model.
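On the numpy question: TfidfVectorizer returns sparse matrices, so scipy.sparse.hstack is the right concatenation tool rather than np.hstack. A minimal sketch (the column names are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Separate vectorizers keep a separate vocabulary per predictor.
vec1 = TfidfVectorizer(max_features=100)
vec2 = TfidfVectorizer(max_features=200)

X1 = vec1.fit_transform(train_df["predictor1"])  # (n_samples, 100)
X2 = vec2.fit_transform(train_df["predictor2"])  # (n_samples, 200)
X_train = hstack([X1, X2])                       # (n_samples, 300)

# At test time, only transform (never refit) with the same vectorizers:
# X_test = hstack([vec1.transform(test_df["predictor1"]),
#                  vec2.transform(test_df["predictor2"])])
```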
If, for some reason, this doesn't work out for you, we can explore alternatives based on what your constraints are.
For example, if your constraint is the total dimension size, one way to solve this would be to create a multilayered MLP autoencoder. We can train it with the combined vectors as both input and output until the encoder is trained. Subsequently, we can use any intermediate layer's activations as input to our model; a sketch of this is below.
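A minimal Keras sketch of that idea, assuming the 300-dim combined vector from the question (all layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

inp = keras.Input(shape=(300,))                  # the combined tf-idf vector
h = layers.Dense(128, activation="relu")(inp)
code = layers.Dense(64, activation="relu", name="bottleneck")(h)
h = layers.Dense(128, activation="relu")(code)
out = layers.Dense(300)(h)                       # reconstruct the input

autoenc = keras.Model(inp, out)
autoenc.compile(optimizer="adam", loss="mse")
# autoenc.fit(X_dense, X_dense, epochs=10)       # X_dense = X_train.toarray()

# Use the bottleneck activations as lower-dimensional features:
encoder = keras.Model(inp, autoenc.get_layer("bottleneck").output)
# X_compressed = encoder.predict(X_dense)        # (n_samples, 64)
```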
It would be easier to suggest a solution if you can describe your constraints in the question.
I am trying to use FFM to predict binary labels. My dataset is as follows:
sex|age|price|label
0|0|0|0
1|0|1|1
I know that FFM is a model that considers some attributes as belonging to the same field. If I use one-hot encoding to transform the dataset, then it will look as follows:
sex_0|sex_1|age_0|age_1|price_0|price_1|label
1|0|1|0|1|0|0
0|1|1|0|0|1|1
Thus, sex_0 and sex_1 can be considered as one field, and similarly for the other attributes.
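For illustration, the same transformation in pandas (a sketch; note that get_dummies only creates columns for values that actually occur, so age_1 is absent with these two rows):

```python
import pandas as pd

df = pd.DataFrame({"sex": [0, 1], "age": [0, 0], "price": [0, 1], "label": [0, 1]})
onehot = pd.get_dummies(df, columns=["sex", "age", "price"])
# sex_0 and sex_1 both come from the original attribute "sex",
# so in FFM terms they belong to the same field.
```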
My question is whether I can use an embedding layer to replace the one-hot encoding step. However, this gives me some concerns:
I don't have any other related dataset, so I can't use any pre-trained embedding models. I can only randomly initialize the embedding weights and then train them on my own dataset. Will this approach work?
If I use an embedding layer instead of one-hot encoding, does it mean that each attribute will belong to one field?
What is the difference between these two methods? Which is better?
Yes, you can use embeddings, and that approach does work: randomly initializing the embedding weights and training them on your own dataset is the usual setup, so no pre-trained model is needed.
A single attribute will not correspond to a single element of the embedding; rather, the combination of elements together represents that attribute. The size of the embedding is something you will have to select yourself. A good rule of thumb to follow is embedding_size = min(50, (m + 1) // 2), where m is the number of categories; so if you have m = 10, you get an embedding size of 5.
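In code, that rule of thumb is simply:

```python
def embedding_size(m):
    """Rule-of-thumb embedding width for a feature with m categories."""
    return min(50, (m + 1) // 2)

embedding_size(10)    # -> 5
embedding_size(1000)  # -> 50
```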
A higher embedding size means it will capture more details on the relationship between the categorical variables.
In my experience, embeddings do help, especially when you have hundreds of unique values within a categorical feature (if you have a small number of categories, e.g. the sex of a person, then one-hot encoding is sufficient).
As for which is better: I find that embeddings generally perform better when there are hundreds of unique values in a category. I don't have concrete reasons for why this is so, only some intuitions.
For example, representing categories as 300-dimensional dense vectors (word embeddings) requires classifiers to learn far fewer weights than if the categories were represented as 50,000-dimensional vectors (one-hot encoding), and the smaller parameter space possibly helps with generalization and avoiding overfitting.
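As a concrete illustration of the embedding route for your toy dataset, here is a minimal Keras sketch (cardinalities and layer sizes are assumptions; the embeddings are randomly initialized and learned from your own labels, no pre-training needed):

```python
from tensorflow import keras
from tensorflow.keras import layers

# One randomly initialized embedding per field (sex, age, price).
fields = {"sex": 2, "age": 2, "price": 2}   # field name -> number of categories

inputs, embedded = [], []
for name, m in fields.items():
    inp = keras.Input(shape=(1,), name=name)
    emb = layers.Embedding(m, min(50, (m + 1) // 2))(inp)
    inputs.append(inp)
    embedded.append(layers.Flatten()(emb))

x = layers.Concatenate()(embedded)
out = layers.Dense(1, activation="sigmoid")(x)  # binary label

model = keras.Model(inputs, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit([X["sex"], X["age"], X["price"]], y, ...)
```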
I have a file of raw feedback that needs to be labeled (categorized) and then serve as the training input for an SVM classifier (or any classifier, for that matter).
But the catch is, I'm not assigning a whole feedback to a certain category. One feedback may belong to more than one category based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedbacks (documents). Accordingly, I've extracted the n-grams using TF-IDF, saving their features so I could train my model on them. The problem with that is that TF-IDF returns a document-term matrix, which is my train_x; but on the other side I have train_y, the labels assigned to each n-gram (not to the whole document). So I've ended up with a document-term frequency matrix of x rows (number of documents) against labels for y n-grams (number of unique topics extracted).
Below is a sample of what the data look like. Blue is the n-grams (extracted by TF-IDF), while red is the labels/categories (calculated for each n-gram with a function I've written manually).
Instead of posting code, this is my strategy for implementing my concept:
The problem lies in the part where TF-IDF produces x_train = tf.Transform(feedbacks), which is a document-term matrix, so it doesn't make sense for it to be the input to the classifier against y_train, which holds labels for the terms, not the documents. I've tried to transpose the matrix; it gave me an error. I've tried to input a 1-D array that holds only the feature values for the terms directly, which also gave an error, because the classifier expects X to be in (sample, feature) format. I'm using sklearn's version of SVM and TfidfVectorizer.
Simply put, I want to be able to use an SVM classifier on a list of terms (n-grams) against a list of labels to train the model, and then test new data (after cleaning it and extracting its n-grams) for the SVM to predict its labels.
The solution might be something very technical, like using another classifier that expects a different format, or not using TF-IDF since it's document-focused, or, even more broadly, a whole change of approach and concept (if mine is wrong).
I'd very much appreciate it if someone could help.
I am following the tutorial here for implementing word2vec, and I am not sure if I understand how the skip-gram input vector is constructed.
This is the part I am confused about. I thought we were not doing one-hot encoding in word2vec.
For example, if we were to have two sentences "dogs like cats", "cats like dogs", or some more informative sentences, what would the input vector look like? Thank you.
What skip-gram essentially tries to do is train a model that predicts the context words given the center word.
Take 'dogs like cats' as an example, and assume the window size is three. This means we will use the center word ("like") to predict one word before "like" and one word after it (the correct answers here are "dogs" and "cats").
So the input vector for this sentence will be a one-hot vector with the kth element being one (assuming "like" is the kth word in your vocabulary).
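A tiny sketch of what that looks like for your example sentences (the index order is arbitrary):

```python
import numpy as np

# Toy vocabulary built from "dogs like cats" / "cats like dogs".
vocab = ["dogs", "like", "cats"]
word2idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[word2idx[word]] = 1.0
    return v

# Center word "like" with window size 3 gives two training pairs:
# ("like" -> "dogs") and ("like" -> "cats").
x = one_hot("like")                            # input:  [0., 1., 0.]
targets = [one_hot("dogs"), one_hot("cats")]   # outputs the model should predict
```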