Why is a throw-away column required in Bert format? - machine-learning

I have recently come across BERT (Bidirectional Encoder Representations from Transformers). I saw that BERT requires a strict format for the training data. The third column is described as follows:
Column 3: A column of all the same letter — this is a throw-away column that you need to include because the BERT model expects it.
What is a throw-away column, and why is it needed in the dataset if it contains the same letter in every row?
Thank you.

BERT was pre-trained on two tasks: Masked Language Modelling and Next Sentence Prediction.
The third column, as you call it, is used only in Next Sentence Prediction and in downstream tasks that take multiple sentences, such as question answering. In those cases the value will not be just A or 0 for everything: sentence 1 will be all 0 and sentence 2 will be all 1, indicating that the former is sentence A and the latter is sentence B.
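To make the sentence A / sentence B marking concrete, here is a minimal sketch in plain Python. It assumes the usual [CLS] A [SEP] B [SEP] layout; the function name build_segment_ids and the example tokens are made up for illustration and are not part of any BERT library.

```python
def build_segment_ids(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids) in the [CLS] A [SEP] B [SEP] layout."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    # Sentence A, including [CLS] and its closing [SEP], is marked with 0.
    segment_ids = [0] * len(tokens)
    if tokens_b:
        part_b = list(tokens_b) + ["[SEP]"]
        tokens += part_b
        # Sentence B and its closing [SEP] are marked with 1.
        segment_ids += [1] * len(part_b)
    return tokens, segment_ids


tokens, segment_ids = build_segment_ids(["how", "are", "you"], ["fine", "thanks"])
print(tokens)
print(segment_ids)
```

With a single sentence every position gets 0, which is consistent with the column carrying no information in single-sentence classification.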

Related

NLP Sentiment analysis - basic guidelines

I am doing my first project in the NLP domain: sentiment analysis of a dataset with ~250 tagged English data points/sentences. The dataset consists of reviews of a pharmaceutical product with positive, negative or neutral tags. I have worked with numeric data in supervised learning for 3 years, but NLP is uncharted territory for me. So I want to know the pre-processing techniques and steps that are best suited to my problem. A guideline from an NLP expert would be much appreciated!
Based on your comment on mohammad karami's answer, what you have not yet understood is paragraph or sentence representation (you said "converting to numeric is the real question"). With numerical data, suppose you have a table with 2 feature columns and a label, for example "work experience" and "age" as features and "salary" as the label (predicting salary from age and work experience). In NLP, features are usually defined at the word level (sometimes at the character or subword level). These features are called tokens, and they take the place of the columns. The simplest way to build a paragraph representation is bag of words: after preprocessing, every unique word is mapped to a column. Suppose the training data has 2 rows:
"I help you and you should help me"
"you and I"
The unique words become the columns, so the table header might look like:
I | help | you | and | should | me
The two samples then have the following values:
[1, 2, 2, 1, 1, 1]
[1, 0, 1, 1, 0, 0]
Notice that the first element of each array is 1, because the word "I" occurs once in both samples. The second element is 2 in the first row and 0 in the second, because "help" occurs twice in the first sample and never in the second. A model trained on this can then learn rules along the lines of "if word A, word B, ... are present and word H, word I, ... are absent, then the label is positive".
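The counting described above can be sketched in a few lines of plain Python. This is only an illustration: it splits on whitespace, keeps case as-is, and orders the columns by first appearance so that they match the table above; bag_of_words is a name chosen here, not a library function.

```python
from collections import Counter


def bag_of_words(sentences):
    """Return (vocabulary, count vectors); columns follow first appearance."""
    tokenized = [s.split() for s in sentences]
    vocabulary = []
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocabulary:
                vocabulary.append(tok)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(
    ["I help you and you should help me", "you and I"]
)
print(vocabulary)
print(vectors)
```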
Bag of words works most of the time, but it has problems. One is dimensionality: the vocabulary can be huge (imagine four billion unique words), so there are too many features. It also ignores word order, it treats each word as an independent column with no notion of how words relate in meaning, and there are more issues besides. At the time of writing, the state of the art for NLP is BERT, so learn that if you want to use what works best.
First of all, you have to decide which features you want, and then do the pre-processing. You can, however:
1- Remove HTML tags
2- Remove extra whitespace
3- Convert accented characters to ASCII characters
4- Expand contractions
5- Remove special characters
6- Lowercase all text
7- Convert number words to numeric form
8- Remove numbers
9- Remove stopwords
10- Lemmatize
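A few of the steps in this list can be sketched with the Python standard library alone. This is a rough illustration, not a recommended pipeline: the stopword set is a tiny made-up sample, and contraction expansion, number-word conversion and lemmatization are left out because they need external resources.

```python
import re
import unicodedata

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to"}


def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)                    # remove HTML tags
    text = unicodedata.normalize("NFKD", text)              # separate accents from letters
    text = text.encode("ascii", "ignore").decode("ascii")  # drop the non-ASCII marks
    text = text.lower()                                     # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)                   # remove special characters and digits
    tokens = text.split()                                   # also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS]        # remove stopwords


print(preprocess("<p>The café   is GREAT, 10/10!</p>"))
```

For sentiment analysis it is worth checking each step against the data before applying it; removing words such as "not" as stopwords, for example, can flip the meaning of a review.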
Apply these steps to your own data as appropriate. I suggest looking at the NLTK package for NLP; it includes sentiment analysis functionality that may help with your work.
Then extract features with tf-idf or another feature extraction or feature selection method, scale them, and pass them to a machine learning algorithm.
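As a rough idea of what tf-idf computes, here is a minimal sketch using one common variant (term frequency as count divided by document length, inverse document frequency as log(N / df)). Libraries use slightly different smoothing and normalization, so the exact numbers will differ; the documents below are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents each term appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["good", "product"], ["bad", "product"], ["good", "good", "value"]]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that occurs in every document gets weight 0 under this variant, which is the intended effect: it carries no information for telling documents apart.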

Using BERT for classification given character length or number of words in a sentence

I have a dataset of titles, their descriptions and 0 or 1s that correspond to whether the description is valid or not. I want to be able to classify whether they are valid or not based on BERT alongside the character/word count of the description. How would I do so?
This question is a little broad, but you can start as follows:
You can probably use BERT's CoLA processor, which is suited to binary classification problems.
You can treat the titles as the IDs, since an ID should not influence training and can uniquely identify each description.
Create the TSV files your problem requires; you can look at the GLUE data for the CoLA task to see how the data has to be formatted for BERT.
Generally, the training and dev sets have 4 columns (id, class, segment ID, text data), and the test set has only 2 columns (id and text data).
Once the data is in the required format, you can fine-tune with the run_classifier.py script. The authors have documented how to use this script for fine-tuning.
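A small sketch of writing training rows in the four-column layout described above, using only the standard library. The rows, the helper name to_cola_tsv and the constant letter "a" in the third column are all assumptions for illustration; check the GLUE CoLA files for the exact layout your version of the script expects.

```python
import csv
import io

# Hypothetical rows: (title used as id, label, description).
rows = [
    ("title_1", 1, "A complete description of the item."),
    ("title_2", 0, "asdf"),
]


def to_cola_tsv(rows, throwaway="a"):
    """Write id, label, constant third column, text as tab-separated lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row_id, label, text in rows:
        writer.writerow([row_id, label, throwaway, text])
    return buffer.getvalue()


print(to_cola_tsv(rows))
```

To write a real file, replace the StringIO buffer with a file opened using newline="".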

Does BERT implicitly model for word count?

Given that BERT is bidirectional, does it implicitly model for word count in some given text? I am asking in the case of classifying data column descriptions as valid or not. I am looking for a model based on word count, and was wondering if that even needs to be done given BERT is bidirectional.
By default, BERT uses word-piece tokenization rather than word tokenization. It exposes a maximum sequence length setting that caps the number of word-piece tokens in a given input and also ensures that every input is processed with the same number of tokens.
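The effect of the maximum sequence length can be sketched as follows. This is a simplification: real preprocessing keeps the final [SEP] token when truncating, and the token ids used here are made up. The point is that every input ends up with the same length, and what the model sees is a count of word-piece tokens, not of words.

```python
def pad_or_truncate(token_ids, max_seq_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_seq_length."""
    kept = token_ids[:max_seq_length]          # truncate long inputs
    padding = max_seq_length - len(kept)
    input_ids = kept + [pad_id] * padding      # pad short inputs
    attention_mask = [1] * len(kept) + [0] * padding  # 1 = real token, 0 = padding
    return input_ids, attention_mask


short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_seq_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_seq_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```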

Mnemonic Generation Using LSTM's | How do I make sure my Model Generates Meaningful Sentence Using a Loss Function? [closed]

I'm working on a project that generates mnemonics, and I have a problem with my model.
My question is: how do I make sure my model generates a meaningful sentence using a loss function?
Aim of the project
To generate mnemonics for a list of words. Given a list of words the user wants to remember, the model outputs a meaningful, simple, easy-to-remember sentence in which each word starts with the first one or two letters of a word the user wants to remember. The model receives only the first two letters of each of those words, since they carry all the information needed to generate the mnemonic.
Dataset
I'm using Kaggle's 55000+ song lyrics data. The sentences in those lyrics contain 5 to 10 words, and the mnemonics I want to generate contain the same number of words.
Input/Output Preprocessing.
I iterate through all the sentences, remove punctuation and numbers, and extract the first 2 letters of each word. Each pair of letters is then assigned a unique number from a predefined dictionary whose keys are letter pairs and whose values are unique numbers.
The list of these unique numbers acts as the input, and the GloVe vectors of the corresponding words act as the output. At each time step the LSTM model takes the number assigned to a word and outputs that word's GloVe vector.
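The input side of this preprocessing might look roughly like the sketch below. It is an illustration only: the post uses a predefined dictionary, whereas this version builds the mapping on the fly, and reserving id 0 for padding is an assumption made here.

```python
import re


def first_two_letters(sentence):
    """Strip punctuation and digits, then keep the first two letters of each word."""
    cleaned = re.sub(r"[^A-Za-z\s]", "", sentence).lower()
    return [word[:2] for word in cleaned.split()]


def encode(prefixes, prefix_to_id):
    """Map each letter pair to an integer id, assigning new ids on first sight."""
    ids = []
    for prefix in prefixes:
        if prefix not in prefix_to_id:
            # Ids start at 1; 0 is left free for padding (assumption).
            prefix_to_id[prefix] = len(prefix_to_id) + 1
        ids.append(prefix_to_id[prefix])
    return ids


prefix_to_id = {}
prefixes = first_two_letters("Hello, is it me you're looking for?")
ids = encode(prefixes, prefix_to_id)
print(prefixes)
print(ids)
```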
Model Architecture
I'm using LSTMs with 10 time steps.
At each time step the unique number associated with the pair of letters will be fed and the output will be the GloVe vector of the corresponding word.
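# Keras 2.x-era API, as in the original post; newer releases use learning_rate= and loss='cosine_similarity'.
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, TimeDistributed, Dense
from keras.optimizers import RMSprop

optimizer = RMSprop(lr=0.0008)
model = Sequential()
# Vocabulary of 733 input ids, each embedded in 40 dimensions
model.add(Embedding(input_dim=733, output_dim=40, input_length=12))
model.add(Bidirectional(LSTM(250, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))  # must be added to the model; a bare Dropout(0.4) call has no effect
model.add(Bidirectional(LSTM(350, return_sequences=True, activation='relu'), merge_mode='concat'))
model.add(Dropout(0.4))
# One 332-dimensional output vector per time step
model.add(TimeDistributed(Dense(332, activation='tanh')))
model.compile(loss='cosine_proximity', optimizer=optimizer, metrics=['acc'])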
Results:
My model outputs mnemonics that match the first two letters of each word in the input, but the generated mnemonic carries little to no meaning.
I have realized that this problem is caused by the way I'm training. During training, the letter pairs are extracted from real sentences, so their order already suits sentence formation. That is not the case at test time: the order in which I feed the letter pairs may not have a high probability of forming a sentence.
So I built a bigram model from my data and fed the permutation with the highest probability of sentence formation into my mnemonic generator. Although there were some improvements, the sentence as a whole still didn't make sense.
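The bigram reordering step could be sketched like this. It is a guess at the approach rather than the poster's actual code: bigrams are counted over letter-pair sequences, scored with add-one smoothing, and every permutation is tried, which is only feasible for short lists (10 items already means about 3.6 million permutations). The training sequences are made up.

```python
import math
from collections import Counter
from itertools import permutations


def train_bigrams(sequences):
    """Count single items and adjacent pairs over the training sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    return unigrams, bigrams


def log_score(seq, unigrams, bigrams, vocab_size):
    """Add-one smoothed log-probability of each item given the previous one."""
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(seq, seq[1:])
    )


def best_order(items, unigrams, bigrams):
    """Return the permutation of items with the highest bigram score."""
    vocab_size = len(unigrams)
    return max(
        permutations(items),
        key=lambda p: log_score(p, unigrams, bigrams, vocab_size),
    )


training = [["yo", "ar", "my", "su"], ["yo", "ar", "so", "ki"], ["my", "su", "is", "he"]]
unigrams, bigrams = train_bigrams(training)
print(best_order(["ar", "su", "yo", "my"], unigrams, bigrams))
```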
I’m stuck at this point.
My question is:
How do I make sure my model generates a meaningful sentence, using a loss function?
First, I have a couple of unrelated suggestions. I do not think you should output the GloVe vector of each word. Why? Word2Vec-style embeddings are meant to capture word meanings and probably do not contain information about spelling. However, meaning is still helpful for producing a meaningful sentence. So I would instead have the LSTM produce its own hidden state after reading the first two letters of each word (just as you currently do). I would then have that sequence unrolled (as you currently do) into sequences of dimension one (indexes into an index-to-word map). I would then take that output, pass it through an embedding layer that maps the word indexes to their GloVe embeddings, and run the result through another output LSTM to produce more indexes. You can stack this as many times as you want, but 2 or 3 levels will probably be good enough.
Even with these changes, it is unlikely you will see much success in generating easy-to-remember sentences. For that main issue, I think there are generally two ways you can go. The first is to augment your loss with some measure of whether the resulting sentence is a 'valid English sentence'. You can do this with some accuracy programmatically by POS tagging the output sentence and adding loss according to whether it follows a standard sentence structure (subject, predicate, adverbs, direct objects, etc.). Though this might be easier than the following alternative, it might not yield truly natural results.
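One very rough way to turn POS tags into a structure score is sketched below. It assumes the sentence has already been tagged by some POS tagger, and the set of allowed tag transitions is invented for illustration; it is not a grammar of English.

```python
# Hypothetical set of tag transitions treated as well-formed (an assumption).
ALLOWED = {
    ("DET", "NOUN"), ("ADJ", "NOUN"), ("DET", "ADJ"), ("NOUN", "VERB"),
    ("PRON", "VERB"), ("VERB", "DET"), ("VERB", "NOUN"), ("VERB", "ADV"),
}


def structure_penalty(tags, allowed=ALLOWED):
    """Fraction of adjacent tag pairs not in the allowed set (0.0 is best)."""
    pairs = list(zip(tags, tags[1:]))
    if not pairs:
        return 0.0
    bad = sum(1 for pair in pairs if pair not in allowed)
    return bad / len(pairs)


print(structure_penalty(["DET", "NOUN", "VERB", "DET", "NOUN"]))
print(structure_penalty(["NOUN", "DET", "DET", "VERB"]))
```

A score like this is not differentiable, so it cannot be backpropagated through directly. It would have to be used for reranking candidate outputs or as a reward signal, which is one reason the result may be less natural than a learned judge.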
In addition to training your model in its current fashion, I would recommend using a GAN to judge whether the output sentences are natural. There are many resources on Keras GANs, so I do not think you need specific code in this answer. However, here is an outline of how your model should train logically:
Augment your current training with two additional phases.
First, train the discriminator to judge whether or not the output sentence is natural. You can do this by having an LSTM model read sentences and give a sigmoid output (0/1) for whether or not they are 'natural'. You can then train this model on a dataset of real sentences with 1 labels and your generated sentences with 0 labels, at roughly a 50/50 split.
Then, in addition to the current loss function for generating the mnemonics, add a loss term that is the binary cross-entropy of the discriminator's output on your generated sentences against 1 (true) labels. Be sure to freeze the discriminator model while doing this.
Continue iterating over these two steps (training each for 1 epoch at a time) until you start to see more reasonable results. You may need to play with how much each loss term is weighted in the generator (your model) in order to get the correct trade-off between a correct mnemonic and an easy-to-remember sentence.
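The arithmetic of the combined generator loss can be shown without any deep learning framework. This is only a numeric sketch: generator_loss and adv_weight are names chosen here, and in a real setup the same combination would be expressed with the framework's tensors so that gradients flow into the generator.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross-entropy between a predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def generator_loss(mnemonic_loss, disc_probs, adv_weight=0.5):
    """Existing mnemonic loss plus a weighted adversarial term.

    disc_probs are the frozen discriminator's outputs on generated sentences;
    they are scored against label 1 ("natural").
    """
    adversarial = sum(binary_cross_entropy(p, 1) for p in disc_probs) / len(disc_probs)
    return mnemonic_loss + adv_weight * adversarial


# The loss is lower when the discriminator is fooled (outputs close to 1).
print(generator_loss(0.2, [0.9, 0.9]))
print(generator_loss(0.2, [0.1, 0.1]))
```

The adv_weight value is the knob mentioned above for trading off a correct mnemonic against a natural-sounding sentence.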

Does use dummy value make model's performance better?

I see that many feature engineering pipelines include a get_dummies step on object features. For example, the sex column, which contains 'M' and 'F', is turned into two columns in a one-hot representation.
Why do we not directly encode 'M' and 'F' as 0 and 1 in the sex column?
Does the dummy method have a positive impact on machine learning models, in both classification and regression?
If so, why?
Thanks.
In general, encoding a categorical variable with N different values directly as (0, 1, ..., N-1) and treating it as a numerical variable won't work with many algorithms, because you are giving ad hoc meaning to the different categories. The gender example works since it is binary, but think of a price estimation example with car models. If there are N distinct models and you encode model A as 3 and model B as 6, then for OLS linear regression, for example, this would mean that model B affects the response variable 2 times as much as model A. You can't simply give such arbitrary meanings to categorical values; the resulting model would be meaningless. To prevent this kind of numerical ambiguity, the most common approach is to encode a categorical variable with N distinct values as N-1 binary, one-hot variables.
To one-hot-encode a feature with N possible values you only need N-1 columns with 0 / 1 values. So you are right: binary sex can be encoded with a single binary feature.
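A small plain-Python sketch of the N-1 encoding, for illustration only (dummy_encode is a name chosen here; in practice a library routine such as pandas get_dummies would normally be used). The first category in sorted order is dropped and becomes the all-zeros baseline.

```python
def dummy_encode(values, drop_first=True):
    """One-hot encode a list of categories; returns (column names, rows)."""
    categories = sorted(set(values))
    # With drop_first, the first category is represented by all zeros.
    columns = categories[1:] if drop_first else categories
    rows = [[1 if value == col else 0 for col in columns] for value in values]
    return columns, rows


columns, rows = dummy_encode(["M", "F", "F", "M"])
print(columns, rows)

car_columns, car_rows = dummy_encode(["bmw", "audi", "fiat"])
print(car_columns, car_rows)
```

For the binary sex column this yields a single 0/1 column, which is the same as encoding 'M' and 'F' directly as 1 and 0.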
Using dummy coding with N features instead of N-1 shouldn't really add performance to any Machine Learning model and it complicates some statistical analysis such as ANOVA.
See the patsy docs on contrasts for reference.
