Gensim word embedding training with initial values

I have a dataset with documents separated into different years, and my objective is to train an embedding model for each year's data while making sure the same word appearing in different years has similar vector representations. Like this: for the word 'compute', its vector in year 1 is
[0.22, 0.33, 0.20]
and in year 2 it's something like:
[0.20, 0.35, 0.18]
Is there a way to accomplish this? For example, train the year-2 model with a mix of initial values (if a word was already trained in year 1, start from its existing vector and update it) and random initialization (if the word is new to the corpus).

I think the easiest solution is to save the embeddings after training on the first dataset, then load the trained model and continue training on the second dataset. This way the embeddings shouldn't drift far from the saved state (unless your datasets are very different).
It would also make sense to build a single vocabulary from all documents: vocabulary words that aren't present in a particular year's data will keep some random representation, but it will still be a working word2vec model.
Example from the documentation:
>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
>>> model.save(fname)
>>> model = Word2Vec.load(fname) # continue training with the loaded model
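A minimal sketch of that workflow, assuming gensim 4.x (where size was renamed to vector_size); the year1_sentences and year2_sentences corpora are placeholders:

from gensim.models import Word2Vec

# Placeholder corpora: lists of tokenized sentences for each year.
year1_sentences = [["compute", "the", "result"], ["compute", "again"]]
year2_sentences = [["compute", "new", "things"], ["novel", "word"]]

# Train on year 1 and save.
model = Word2Vec(year1_sentences, vector_size=100, window=5, min_count=1, workers=4)
model.save("year1.model")

# Reload, add year-2 words to the vocabulary (new words get a random init,
# words already seen in year 1 keep their trained vectors), then continue training.
model = Word2Vec.load("year1.model")
model.build_vocab(year2_sentences, update=True)
model.train(year2_sentences, total_examples=model.corpus_count, epochs=model.epochs)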

Related

Is there a way to train a neural network (LSTM model) with multiple datasets in order to do time series forecasting?

I want to do predictions with an LSTM model, but the dataset isn't a single file; it's composed of multiple files (for example, 3 Excel files).
I've already checked that for a time series forecasting problem you have to prepare your data as (number of samples, number of time steps, number of features), and this works well when I implement it for a single Excel file.
The problem is training with the three Excel files at the same time: in that case the input tensor for the LSTM layer has shape (n_files, n_samples, n_timesteps, n_features), with dim = 4. This doesn't work because LSTM layers only accept input tensors with dim = 3.
My files have the same amount of data. It's collected from a device that records 1 value per minute for the duration of the experiment, and all the experiments have the same duration.
I've tried concatenating the files into one and choosing batch_size as the number of samples in one Excel file (because I can't mix the different experiments), but it doesn't produce a good result (at least not as good as the results from predicting with a single experiment).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    model = keras.Sequential([
        layers.Masking(mask_value=0.0, input_shape=[1, 1]),
        layers.LSTM(50, return_sequences=True),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.Adam(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'RootMeanSquaredError'])
    return model

model_pred = build_model()
model_pred.fit(Xchopped_train, ychopped_train, batch_size=252,
               epochs=500, verbose=1)
Where Xchopped_train and ychopped_train are the concatenated data from the 3 Excel files.
Another thing I've tried is to train the model within a loop, switching the Excel file on each iteration:
for i in range(len(Xtrain)):
    model_pred.fit(Xtrain[i], Ytrain[i], epochs=167, verbose=1)
Where Xtrain is (3, 252, 1, 1) and the first index refers to the Excel file.
And by far this is my best approach, but it feels like it isn't a good solution since I don't know what's happening with the NN weights or which loss function is being minimized...
Is there a more efficient way to do this? Thanks!
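For what it's worth, here is a self-contained sketch of the loop-over-files approach described in the question, with dummy data shaped like the arrays above. Successive model.fit calls continue from the weights left by the previous call and keep minimizing the same compiled loss (MSE here), so the loop behaves like extra epochs drawn file by file:

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Dummy data shaped like the question: 3 Excel files, 252 samples each, 1 time step, 1 feature.
Xtrain = np.random.rand(3, 252, 1, 1).astype("float32")
Ytrain = np.random.rand(3, 252, 1, 1).astype("float32")

model = keras.Sequential([
    layers.Masking(mask_value=0.0, input_shape=[1, 1]),
    layers.LSTM(50, return_sequences=True),
    layers.Dense(1),
])
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(0.001))

# Each fit call resumes from the current weights; epochs are reduced here just to keep the sketch short.
for i in range(len(Xtrain)):
    model.fit(Xtrain[i], Ytrain[i], batch_size=252, epochs=2, verbose=0)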

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features, each sample with a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values fall between two arbitrary values A and B (where B-A is much smaller than the total range of y), the resulting model is much better at predicting other values within the A-to-B range that were not used in training.
However, the input X vectors for these samples are in no way more similar to each other than a random selection of X vectors from the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular about the methodology, but it could be something like cosine similarity), and those with dissimilar y-values are pushed apart?
After more thought, I believe this question can be re-framed as a supervised clustering problem. Something that accomplishes this might be as simple as:
import umap

print(df.shape)
>> (23312, 2149)
print(len(target))
>> 23312
embedding = umap.UMAP().fit_transform(df, y=target)
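A slightly fuller sketch along the same supervised-UMAP lines (the data here is random stand-in data, and the regressor choice is arbitrary): fit the mapper on training rows with their y-values, embed held-out rows with transform, and regress in the embedded space.

import numpy as np
import umap
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Stand-in bit-vector data shaped loosely like the question's dataset.
X = np.random.randint(0, 2, size=(1000, 200)).astype("float32")
y = np.random.rand(1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Supervised UMAP: passing y pulls samples with similar targets closer together in the embedding.
mapper = umap.UMAP(n_components=10).fit(X_train, y=y_train)
Z_train = mapper.transform(X_train)
Z_test = mapper.transform(X_test)   # held-out rows are embedded without their y-values

reg = RandomForestRegressor().fit(Z_train, y_train)
print(reg.score(Z_test, y_test))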

LSTM, multi-variate, multi-feature in pytorch

I'm having trouble understanding the format of data for an LSTM in PyTorch. Let's say I have a CSV file with 4 features, laid out in timestamps one after the other (a classic time series):
time1 feature1 feature2 feature3 feature4
time2 feature1 feature2 feature3 feature4
time3 feature1 feature2 feature3 feature4
time4 feature1 feature2 feature3 feature4, label
However, this entire set of 4 time steps has only a single label. The thing we're trying to classify started at time1, but we don't know how to label it until time4.
My question is, can a typical PyTorch LSTM support this? All of the tutorials I've read, watched, or walked through involve looking at a time sequence of a single feature, or a word model, which is still a dataset with a single dimension.
If it can support it, does the data need to be flattened in some way?
Pytorch's LSTM reference states:
input: tensor of shape (L, N, H_in) when batch_first=False or (N, L, H_in) when batch_first=True, containing the features of the input sequence. The input can also be a packed variable-length sequence.
Does this mean that it cannot support any input that contains multiple sequences? Or is there another name for this?
I'm really lost here, and could use any advice, pointers, help, so on. Maybe some disambiguation too.
I've posted a couple times here but gotten no responses at all. If this post is misplaced, could someone kindly direct me towards the correct place to post it?
Edit: Following Daniel's advice, do I understand correctly that the four features should be put together like this:
[(feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4), label] when given to the LSTM?
If that's correct, is the input size (16) in this case?
Finally, I was under the impression that the output of the LSTM Would be the predicted label. Do I have that wrong?
As you show, the LSTM layer's input size is (batch_size, sequence_length, feature_size). This means that the feature is assumed to be a 1D vector.
So to use it in your case, you need to stack your four features into one vector (if they are more than 1D themselves, flatten them first) and use that vector as the layer's input.
Regarding the label: it is definitely supported to have a label only after a few iterations. The LSTM will output a sequence with the same length as the input sequence, but when training the LSTM you can choose to use any part of that sequence in the loss function. In your case you will want to use the last element only.
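A minimal PyTorch sketch of that setup (all sizes here are illustrative, not taken from the post): 4 features per time step, sequences of length 4, and only the last time step's output fed to the classifier.

import torch
import torch.nn as nn

batch_size, seq_len, n_features, n_classes = 8, 4, 4, 2

lstm = nn.LSTM(input_size=n_features, hidden_size=50, batch_first=True)
classifier = nn.Linear(50, n_classes)

x = torch.randn(batch_size, seq_len, n_features)   # (N, L, H_in) with batch_first=True
out, _ = lstm(x)                                    # out: (N, L, hidden_size)
logits = classifier(out[:, -1, :])                  # use only the last time step for the label
print(logits.shape)                                 # torch.Size([8, 2])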

Calculating Probability of a Classification Model Prediction

I have a classification task. The training data has 50 different labels. The customer wants to differentiate low-probability predictions, meaning that I have to classify some test data as Unclassified / Other depending on the probability (certainty?) of the model.
When I test my code, the prediction result is a numpy array (I'm using different models; this one is a pre-trained BertTransformer). The prediction array doesn't contain probabilities like those from Keras' predict_proba() method; these are the numbers produced by the predict method of the pretrained BertTransformer model:
[[-1.7862008 -0.7037363 0.09885322 1.5318055 2.1137428 -0.2216074
0.18905772 -0.32575375 1.0748093 -0.06001111 0.01083148 0.47495762
0.27160102 0.13852511 -0.68440574 0.6773654 -2.2712054 -0.2864312
-0.8428862 -2.1132915 -1.0157436 -1.0340284 -0.35126117 -1.0333195
9.149789 -0.21288703 0.11455813 -0.32903734 0.10503325 -0.3004114
-1.3854568 -0.01692022 -0.4388664 -0.42163098 -0.09182278 -0.28269592
-0.33082992 -1.147654 -0.6703184 0.33038092 -0.50087476 1.1643585
0.96983343 1.3400391 1.0692116 -0.7623776 -0.6083422 -0.91371405
0.10002492]]
I'm using numpy.argmax() to identify the correct label. The prediction works just fine. However, since these are not probabilities, I cannot compare the best result with a threshold value.
My question is, how can I define a threshold (say, 0.6) and compare it against the probability of the argmax() element of the BertTransformer prediction array, so that I can classify the prediction as "Other" when that probability is below the threshold?
Edit 1:
We are using 2 different models: one is Keras, and the other is BertTransformer. We have no problem with the Keras model since it gives probabilities, so I'm skipping it here.
The Bert model is pretrained. Here is how it is generated:
def model(self, data):
    number_of_categories = len(data['encoded_categories'].unique())
    model = BertForSequenceClassification.from_pretrained(
        "dbmdz/bert-base-turkish-128k-uncased",
        num_labels=number_of_categories,
        output_attentions=False,
        output_hidden_states=False,
    )
    # model.cuda()
    return model
The output given above is the result of the model.predict() method. We compared both models; Bert is slightly ahead, so we know the prediction works just fine. However, we are not sure what those numbers signify or represent.
Here is the Bert documentation.
BertForSequenceClassification returns logits, i.e., the classification scores before normalization. You can normalize the scores by calling F.softmax(output, dim=-1) where torch.nn.functional was imported as F.
With thousands of labels, the normalization can be costly and you do not need it when you are only interested in argmax. This is probably why the models return the raw scores only.
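A small sketch of that thresholding step (the logits below are a made-up, truncated row standing in for one model.predict() output; 0.6 is the cutoff from the question):

import torch
import torch.nn.functional as F

# Made-up logits standing in for one row of the prediction array.
logits = torch.tensor([[-1.79, -0.70, 0.10, 1.53, 9.15, -0.21]])

probs = F.softmax(logits, dim=-1)        # normalize the raw scores to probabilities
best_prob, best_idx = probs.max(dim=-1)

threshold = 0.6
label = best_idx.item() if best_prob.item() >= threshold else "Other"
print(round(best_prob.item(), 3), label)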

Keras LSTM: Injecting already-known *future* values into prediction

I've built an LSTM in Keras with the goal of predicting future values of a time series from a high-dimensional, time-indexed input.
However, there's a unique requirement: for certain time points in the future, we know with certainty what some values of the input series will be. For example:
model = SomeLSTM()
trained_model = model.train(train_data)
known_data = [(24, {feature: 2, val: 7.0}), (25, {feature: 2, val: 8.0})]
predictions = trained_model(look_ahead=48, known_data=known_data)
Which would train the model up to time t (the end of training), and predict forward 48 time periods from time t, but substituting known_data values for feature 2 at times 24 and 25.
How exactly can I explicitly inject this into the LSTM at some time?
For reference, here's the model:
model = Sequential()
model.add(LSTM(hidden, input_shape=(look_back, num_features)))
model.add(Dropout(dropout))
model.add(Dense(look_ahead))
model.add(Activation('linear'))
This may be a result of my unintuitive grasp of LSTMs, and I'd appreciate any clarification. I've dug into the Keras source code, and my first guess is to inject the known values right into the LSTM state variables, but I'm unsure how to do that at time t (or even whether that is correct).
I think a clean way of doing this is to introduce 2*look_ahead new features: for each 0 <= i < look_ahead, the 2*i-th feature is an indicator of whether the value at the i-th future time step is known, and the (2*i+1)-th feature is the value itself (0 if not known). You can then generate training data with these features so that your model learns to take the known values into account.
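A rough sketch of that feature construction (the known_data dict mirrors the question's example; everything else is made up):

import numpy as np

look_ahead = 48
known_data = {24: 7.0, 25: 8.0}   # future step index -> known value

# Build 2*look_ahead extra input features: an (is_known, value) pair per future step.
extra = np.zeros(2 * look_ahead, dtype="float32")
for i in range(look_ahead):
    if i in known_data:
        extra[2 * i] = 1.0                # indicator: the value at step i is known
        extra[2 * i + 1] = known_data[i]  # the known value itself

# extra is then concatenated onto the model's other input features for this sample.
print(extra[46:52])   # [0. 0. 1. 7. 1. 8.]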
I am not exactly sure what you are trying to do, but maybe create your own layer to go at the end that sets the outputs to the known values, similar to how dropout sets random values to zero. As a side note, I have had better results with pooling than with dropout, so maybe try switching that out and retraining. Here is a good guide on how to write a customized layer: https://www.tutorialspoint.com/keras/keras_customized_layer.htm
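If you go the custom-layer route, a rough sketch of what "set the outputs to the known values" could look like (the layer name and the mask/value inputs are my own invention, not from the guide):

import tensorflow as tf

class FixKnownValues(tf.keras.layers.Layer):
    """Overwrite predictions with known future values wherever the mask is 1."""
    def call(self, inputs):
        preds, known_mask, known_vals = inputs
        return preds * (1.0 - known_mask) + known_vals * known_mask

# Usage sketch: preds, mask and vals all shaped (batch, look_ahead).
preds = tf.random.normal((2, 48))
mask = tf.zeros((2, 48))
vals = tf.zeros((2, 48))
print(FixKnownValues()([preds, mask, vals]).shape)   # (2, 48)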
