Given that BERT is bidirectional, does it implicitly model for word count in some given text? I am asking in the case of classifying data column descriptions as valid or not. I am looking for a model based on word count, and was wondering if that even needs to be done given BERT is bidirectional.
BERT uses "word-piece" tokenization by default, not "word" tokenization. BERT exposes a max-sequence-length attribute, which limits the number of word-piece tokens in a given input and ensures that every example is padded or truncated to the same number of tokens.
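For example, you can inspect how many word pieces a description produces and pad/truncate every input to the same length (a minimal sketch with the Hugging Face tokenizer; the description string is made up):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

description = "Unique customer account identifier per region"   # made-up column description
pieces = tokenizer.tokenize(description)
print(len(description.split()), "words ->", len(pieces), "word pieces")

# Every example is padded/truncated to the same number of word-piece tokens:
encoded = tokenizer(description, max_length=32, padding="max_length", truncation=True)
print(len(encoded["input_ids"]))   # always 32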
Related
I've used gensim Word2Vec to learn the embedding of monetary amounts and other numeric data in bank transaction memos. The goal is to use this to be able to extract these amounts and currencies from future input strings.
Design
Our input strings are something like
"AMAZON.COM TXNw98e7r3347 USD 49.00 # 1.283"
During preprocessing, I tokenize and replace every token that could be a monetary amount (a string consisting only of digits, commas, and at most one decimal point) with a special VALUE_TOKEN. I also manually replace exchange rates with RATE_TOKEN. The result would be
["AMAZON", ".COM", "TXNw", "98", "e", "7", "r", "3347", "USD", "VALUE_TOKEN", "#", "RATE_TOKEN"]
With all my preprocessed token lists collected in the list data, I generate the model:

from gensim.models import Word2Vec

model = Word2Vec(data, window=3, min_count=3)
The embeddings I'm most interested in are those of VALUE_TOKEN, RATE_TOKEN, and any currencies (USD, EUR, CAD, etc.). Now that I have generated the model, I'm not sure what to do with it.
Problem
Say I have a new string that the model has never seen before,
new_string = "EUR 299.99 RATE 1.3289 WITH FEE 5.00"
I would like to use model to identify which tokens of new_string are most contextually similar to VALUE_TOKEN (which should return ["299.99", "5.00"]) and which is closest to RATE_TOKEN ("1.3289"). It should be able to classify these based on the learned embeddings. I can preprocess new_string the way I do with the training data, but because I don't know the exchange rate beforehand, all three tokens in ["299.99", "5.00", "1.3289"] will be tagged the same (either with VALUE_TOKEN or a new UNIDENTIFIED_TOKEN).
I've looked into methods like most_similar and similarity but don't think they work for tokens that are not necessarily in the vocabulary. What methods should I use to do this? Is this the right approach?
Word2vec's fuzzy, dense embedded token representations don't strike me as the right tool for what you're doing, though they might perhaps be an indirect contributor to a hybrid approach.
In particular:
The word2vec algorithm originated with, and has its most consistent public results on, natural-language texts, with their particular patterns of relative token frequencies and varied co-occurrences. Certainly, many have applied it, with success, to other kinds of text/record data, but such uses may require a lot more preprocessing/parameter-tuning, and to the extent the underlying data has some fixed, highly-repetitive scheme, it might be more amenable to other approaches.
If you replace all known values with 'VALUE_TOKEN', & all known rates with 'RATE_TOKEN', then the model is only going to learn token-vectors for 'VALUE_TOKEN' & 'RATE_TOKEN'. Such a model won't be able to supply any vector for non-replaced tokens it's never seen like '$1.2345' or '299.99'. Even collapsing all those to 'UNIDENTIFIED_TOKEN' just limits the model to whatever it learned earlier was the vector for 'UNIDENTIFIED_TOKEN' (if any, in the training data).
I've not noticed existing word2vec implementations offering an interface for inferring the word-vector of a new unknown word from just one or several examples of its appearance in context. (They could, in the same style of new-document-vector inference used by 'Paragraph Vectors'/Doc2Vec, but they just don't.) The closest I've seen is Gensim's predict_output_word(), which does a CBOW-like forward-propagation on negative-sampling models, to every 'output node' (one per known word), to give a ranked list of the known words most likely to appear given some context words.
If fed the surrounding known tokens, predict_output_word() might contribute to your needs by telling you whether 'VALUE_TOKEN' or 'RATE_TOKEN' is the more likely model prediction. You could adapt its code to evaluate only those two candidates for a speed-up, if you're always sure the right answer is one or the other. A simple comparison of the average of the context word vectors with the two candidate vectors might be as effective as the full forward-propagation.
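For example (a sketch; it assumes a negative-sampling model, which is gensim's default, and the context tokens shown are made up):

context = ["USD", "#"]   # known tokens surrounding the mystery numeric token
for word, prob in model.predict_output_word(context, topn=20):
    if word in ("VALUE_TOKEN", "RATE_TOKEN"):
        print(word, prob)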
Alternatively, you might want to use the word2vec model solely as a source of features (via context words) for some other classifier, which is trained to answer VALUE or RATE (see the sketch after this list). This other classifier's input might include things like:
some average of the vectors of all nearby tokens
the full vectors of closest neighbors
a one-hot encoding ('bag-of-words') of all nearby (or 'preceding' and 'following') known tokens, assuming the vocabulary of non-numerical tokens is fairly short & highly indicative
…and so on.
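As a minimal sketch of the first of those options (averaging nearby token vectors and handing them to a separate classifier; model is the Word2Vec model trained above, and the helper names and classifier choice are mine):

import numpy as np
from sklearn.linear_model import LogisticRegression

def context_features(tokens, i, window=3):
    # Average the vectors of known tokens within `window` positions of position i.
    neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    vecs = [model.wv[t] for t in neighbors if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Build features/labels from token positions you already know are values or rates, e.g.:
# X = [context_features(tokens, i) for tokens, i, _ in labelled_positions]
# y = [label for _, _, label in labelled_positions]     # "VALUE" or "RATE"
# clf = LogisticRegression().fit(X, y)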
If the data streams might include arbitrary new or corrupted tokens whose meaning might be inferrable from substrings, you could consider a FastText model as well.
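For example (a minimal sketch; same training data as the Word2Vec run above):

from gensim.models import FastText

ft = FastText(data, window=3, min_count=3)
# FastText builds vectors from character n-grams, so it can return a vector even for a
# token like "299.99" that never appeared in training (how meaningful that vector is for
# purely numeric strings is another question):
print(ft.wv.similarity("299.99", "VALUE_TOKEN"))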
In the paper describing BERT, there is this paragraph about WordPiece Embeddings.
We use WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary. The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. Sentence pairs are packed together into a single sequence. We differentiate the sentences in two ways. First, we separate them with a special token ([SEP]). Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B. As shown in Figure 1, we denote input embedding as E, the final hidden vector of the special [CLS] token as C ∈ ℝ^H, and the final hidden vector for the i-th input token as T_i ∈ ℝ^H.

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings. A visualization of this construction can be seen in Figure 2.
As I understand it, WordPiece splits words into word pieces (e.g. "swim", "##ing"), but it does not generate embeddings. I did not find anything in the paper or in other sources about how those token embeddings are generated. Are they pretrained before the actual pre-training? How? Or are they randomly initialized?
The word pieces are trained separately, such that the most frequent words remain together and the less frequent words eventually get split down to characters.
The embeddings are trained jointly with the rest of BERT. The back-propagation is done through all the layers up to the embeddings which get updated just like any other parameters in the network.
Note that only the embeddings of tokens which are actually present in the training batch get updated and the rest remain unchanged. This is also a reason why you need a relatively small word-piece vocabulary, so that all embeddings get updated frequently enough during training.
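You can see this directly in, for example, the Hugging Face implementation (a small sketch; bert-base-uncased has a 30,522-piece vocabulary and 768-dimensional hidden states):

from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
emb = bert.embeddings.word_embeddings    # an ordinary trainable embedding matrix
print(emb.weight.shape)                  # torch.Size([30522, 768])
print(emb.weight.requires_grad)          # True - updated by back-propagation like any other weight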
The input token IDs are simply the tokens' indices in the vocabulary; that index is what gets looked up in the embedding matrix.
One answerer here gave an example, but did not clearly state that the numbers are the tokens' indices in the vocabulary:
BERT’s input is essentially subwords.
For example, if I want to feed BERT the sentence
“Welcome to HuggingFace Forums!”, what actually gets fed in is:
['[CLS]', 'welcome', 'to', 'hugging', '##face', 'forums', '!', '[SEP]'].
Each of these tokens is mapped to an integer:
[101, 6160, 2000, 17662, 12172, 21415, 999, 102].
Then I searched and downloaded the vocabulary (vocab.txt bert-base-uncased) and verified the above numbers.
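You can reproduce that mapping directly (a small sketch with the Hugging Face tokenizer):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = ['[CLS]', 'welcome', 'to', 'hugging', '##face', 'forums', '!', '[SEP]']
print(tokenizer.convert_tokens_to_ids(tokens))
# [101, 6160, 2000, 17662, 12172, 21415, 999, 102] - each id is the token's (zero-based) position in vocab.txt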
I have recently come across BERT (Bidirectional Encoder Representations from Transformers). I saw that BERT requires a strict format for the training data. The third column needed is described as follows:
Column 3: A column of all the same letter — this is a throw-away column that you need to include because the BERT model expects it.
What is a throw-away column and why is this column needed in the dataset since it is stated that it contains the same letter?
Thank you.
BERT was pre-trained on two tasks - Masked Language Modelling & Next Sentence Prediction.
The third column, as you refer to it, is used only in Next Sentence Prediction and in downstream tasks that require multiple sentences, such as question answering. In these cases the value of the column won't just be A or 0 for everything: sentence 1 will be all 0 while sentence 2 will be all 1, indicating that the former is sentence A and the latter is sentence B.
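For a plain single-sentence classification task, the column really is just a constant placeholder. A rough sketch of building such a training file (this follows the CoLA-style .tsv layout used in many BERT fine-tuning tutorials; the column names and example rows here are mine, not mandated by BERT itself):

import pandas as pd

texts = ["customer account identifier", "asdfgh qwerty"]   # made-up example sentences
labels = [1, 0]                                             # made-up labels

train_df = pd.DataFrame({
    "guid": range(len(texts)),         # a unique row id
    "label": labels,                   # the class label
    "alpha": ["a"] * len(texts),       # the 'throw-away' column: the same letter everywhere
    "text": texts,                     # the sentence to classify
})
train_df.to_csv("train.tsv", sep="\t", index=False, header=False)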
I'm working on a project that generates mnemonics, and I have a problem with my model.
My question is: how do I make sure my model generates meaningful sentences, using a loss function?
Aim of the project
To generate mnemonics for a list of words. Given a list of words the user wants to remember, the model should output a meaningful, simple, and easy-to-remember sentence whose words encapsulate the first one or two letters of the words to be remembered. My model will receive only the first two letters of each word, since those carry all the information needed for the mnemonic to be generated.
Dataset
I'm using Kaggle's 55,000+ song-lyrics dataset. The sentences in those lyrics contain 5 to 10 words, and the mnemonics I want to generate should contain the same number of words.
Input/Output Preprocessing.
I iterate through all the sentences, remove punctuation and numbers, extract the first two letters of each word, and assign a unique number to each pair of letters using a predefined dictionary whose keys are letter pairs and whose values are unique numbers.
The list of these unique numbers acts as the input, and the GloVe vectors of the corresponding words act as the output. At each time step the LSTM model takes one of these unique numbers and outputs the corresponding word's GloVe vector.
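Roughly, the encoding step looks like this (a sketch; the dictionary construction and names are illustrative):

import re
import string
from itertools import product

# Predefined dictionary: every two-letter pair gets a unique number.
pair_to_id = {a + b: i + 1 for i, (a, b) in enumerate(product(string.ascii_lowercase, repeat=2))}

def encode_sentence(sentence):
    words = re.sub(r"[^a-z ]", "", sentence.lower()).split()
    return [pair_to_id[w[:2]] for w in words if len(w) >= 2]

print(encode_sentence("Hello darkness my old friend"))   # one id per word's first two letters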
Model Architecture
I'm using LSTMs with 10 time steps.
At each time step the unique number associated with the pair of letters will be fed and the output will be the GloVe vector of the corresponding word.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras.optimizers import RMSprop

optimizer = RMSprop(lr=0.0008)

model = Sequential()
model.add(Embedding(input_dim=733, output_dim=40, input_length=12))
model.add(Bidirectional(LSTM(250, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))   # Dropout layers must be added via model.add(...) to have any effect
model.add(Bidirectional(LSTM(350, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))
model.add(TimeDistributed(Dense(332, activation='tanh')))
model.compile(loss='cosine_proximity', optimizer=optimizer, metrics=['acc'])
Results:
My model outputs mnemonics that match the first two letters of each word in the input, but the generated mnemonic carries little to no meaning.
I have realized this problem is caused by the way I'm training. During training, the order of the letter extracts already corresponds to a well-formed sentence, but this is not the case at test time: the order in which I feed the letter extracts of the words may not have a high probability of forming a sentence.
So I built a bigram model for my data and fed the permutation with the highest probability of sentence formation into my mnemonic generator model. Though there were some improvements, the sentence as a whole didn't make any sense.
I’m stuck at this point.
My question is: how do I make sure my model generates a meaningful sentence, using a loss function?
First, I have a couple of unrelated suggestions. I do not think you should output the GloVe vector of each word. Why? Word2Vec approaches are meant to encapsulate word meanings and would probably not contain information about their spelling. However, the meaning is also helpful in order to produce a meaningful sentence. Thus, I would instead have the LSTM produce its own hidden state after reading the first two letters of each word (just as you currently do). I would then have that sequence be unrolled (as you currently do) into sequences of dimension one (indexes into an index-to-word map). I would then take that output, process it through an embedding layer that maps the word indexes to their GloVe embeddings, and run that through another output LSTM to produce more indexes. You can stack this as much as you want, but 2 or 3 levels will probably be good enough.
Even with these changes, it is unlikely you will see any success in generating easy-to-remember sentences. For that main issue, I think there are generally two ways you can go. The first is to augment your loss with some sense of whether the resulting sentence is a 'valid English sentence'. You can do this with some accuracy programmatically by POS-tagging the output sentence and adding loss relative to whether it follows a standard sentence structure (subject, predicate, adverbs, direct objects, etc.). Though this might be easier than the following alternative, it might not yield truly natural results.
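A rough sketch of that kind of check (using NLTK's tagger; the heuristic itself is purely illustrative):

import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")   # first run only

def structure_penalty(sentence):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    # Toy heuristic: penalize sentences containing no verb or no noun at all.
    has_verb = any(t.startswith("VB") for t in tags)
    has_noun = any(t.startswith("NN") for t in tags)
    return 0.0 if (has_verb and has_noun) else 1.0

print(structure_penalty("my dog never eats mushrooms gladly"))   # 0.0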
I would recommend, in addition to training your model in its current fashion, using a GAN to judge whether the output sentences are natural sentences. There are many resources on Keras GANs, so I do not think you need specific code in this answer. However, here is an outline of how your model should train logically (a rough sketch of the loop follows the steps):
Augment your current training with two additional phases.
First, train the discriminator to judge whether or not an output sentence is natural. You can do this by having an LSTM model read sentences and give a sigmoid output (0/1) for whether or not they are 'natural'. You can then train this model on some dataset of real sentences with label 1 and your generated sentences with label 0, at roughly a 50/50 split.
Then, in addition to the current loss function for actually generating the mnemonics, add the loss that is the binary cross-entropy score for your generated sentences with 1 (true) labels. Be sure to freeze the discriminator model while doing this.
Continue iterating over these two steps (training each for 1 epoch at a time) until you start to see more reasonable results. You may need to play with how much each loss term is weighted in the generator (your model) in order to get the correct trade-off between a correct mnemonic and an easy-to-remember sentence.
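To make that outline concrete, here is a minimal sketch of the alternating loop in Keras (mnemonic_model stands for your current generator; real_batches and letter_batches are placeholders for batches of real sentence word-vectors and letter-pair inputs):

import numpy as np
from keras.models import Sequential, Model
from keras.layers import LSTM, Dense, Input

# Discriminator: reads a sequence of word vectors and outputs P(sentence is natural).
discriminator = Sequential([
    LSTM(128, input_shape=(12, 332)),      # matches the generator's output shape
    Dense(1, activation="sigmoid"),
])
discriminator.compile(loss="binary_crossentropy", optimizer="adam")

# Combined model: the generator's output is judged by a frozen discriminator.
discriminator.trainable = False
gen_in = Input(shape=(12,))
combined = Model(gen_in, discriminator(mnemonic_model(gen_in)))
combined.compile(loss="binary_crossentropy", optimizer="adam")

for epoch in range(50):
    for real_batch, letter_batch in zip(real_batches, letter_batches):
        # 1) Train the discriminator on a 50/50 mix of real and generated sentences.
        fake_batch = mnemonic_model.predict(letter_batch)
        X = np.concatenate([real_batch, fake_batch])
        y = np.concatenate([np.ones(len(real_batch)), np.zeros(len(fake_batch))])
        discriminator.train_on_batch(X, y)
        # 2) Train the generator, through the frozen discriminator, to look 'natural',
        #    in addition to its usual cosine-proximity mnemonic loss.
        combined.train_on_batch(letter_batch, np.ones(len(letter_batch)))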
I have a question regarding the particular Naive Bayes algorithm that is used in document classification. Following is what I understand (a rough code sketch of this follows the list):
construct some probability of each word in the training set for each known classification
given a document we strip all the words that it contains
multiply together the probabilities of the words being present in a classification
perform (3) for each classification
compare the result of (4) and choose the classification with the highest posterior
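In code, my understanding of steps (1)-(5) is roughly this (a sketch; the helper names are made up):

import math

def classify(document_words, class_priors, word_probs):
    # word_probs[c][w] is the estimated P(w | c) from the training set (step 1)
    scores = {}
    for c in class_priors:
        log_score = math.log(class_priors[c])
        for w in document_words:                               # step 2: the document's words
            log_score += math.log(word_probs[c].get(w, 1e-9))  # steps 3-4, done in log space
        scores[c] = log_score
    return max(scores, key=scores.get)                         # step 5: highest posterior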
What I am confused about is the part where we calculate the probability of each word given the training set. For example, for the word "banana": it appears in 100 documents in classification A, there are 200 documents in A in total, and in total 1,000 words appear in A. To get the probability of "banana" appearing under classification A, do I use 100/200 = 0.5 or 100/1000 = 0.1?
I believe your model will more accurately classify if you count the number of documents the word appears in, not the number of times the word appears in total. In other words
Classify "Mentions Fruit":
"I like Bananas."
should be weighed no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
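To spell out the two candidate estimates (the occurrence count below is hypothetical, since the question only gives document counts):

docs_in_A = 200            # documents with classification A
docs_with_banana = 100     # documents in A containing "banana" at least once
words_in_A = 1000          # total word occurrences across class A

# Document-frequency (Bernoulli) estimate - the one recommended above:
p_bernoulli = docs_with_banana / docs_in_A             # 100 / 200 = 0.5

# A term-frequency (multinomial) estimate would instead count occurrences:
banana_occurrences = 130   # hypothetical number, not given in the question
p_multinomial = banana_occurrences / words_in_A        # 130 / 1000 = 0.13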
The description of Document Classification on Wikipedia also supports my conclusion
Then the probability that a given document D contains all of the words W, given a class C, is
p(D|C) = ∏_i p(w_i | C)
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
By the way, more advanced classification algorithms will examine sequences of N-words, not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
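For example, scikit-learn can count word pairs as well as single words (a sketch; the toy sentences and labels are mine):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# ngram_range=(1, 2) uses single words and two-word sequences as features;
# binary=True records presence/absence rather than raw counts.
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform([
    "I like Bananas",                      # mentions fruit
    "The bank statement arrived today",    # does not
])
clf = MultinomialNB().fit(X, [1, 0])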
UPDATE
My direct experience is based on short documents. I would like to highlight research that @BenAllison points out in the comments, which suggests my answer is invalid for longer documents. Specifically:
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529