I am trying to finetune gpt2 for a generative question answering task.
Basically I have my data in a format similar to:
Context : Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
I was looking on the huggingface documentation to find out how I can finetune GPT2 on a custom dataset and I did find the instructions on finetuning at this address:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
The issue is that they do not provide any guidance on how your data should be prepared so that the model can learn from it. They give different datasets that they have available, but none is in a format that fits my task well.
I would really appreciate it if someone with more experience could help me.
Have a nice day!
Your task as described is ambiguous; it could be any of:
QnA via Classification (answer is categorical)
QnA via Extraction (answer is in the text)
QnA via Language Modeling (answer can be anything)
Classification
If all your examples have Answer: X, where X is categorical (i.e. always "Good", "Bad", etc.), you can do classification.
In this setup, you would have text-label pairs:
Text
Context: Matt wrecked his car today.
Question: How was Matt's day?
Label
Bad
For classification, you're probably better off just fine-tuning a BERT-style model (something like RoBERTa).
Extraction
If all your examples have Answer: X, where X is a word (or consecutive words) in the text, as in the example below, then it's probably best to do a SQuAD-style fine-tuning with a BERT-style model. In this setup, your input is (basically) text, start_pos, end_pos triplets:
Text
Context: In early 2012, NFL Commissioner Roger Goodell stated that the league planned to make the 50th Super Bowl "spectacular" and that it would be "an important game for us as a league".
Question: Who was the NFL Commissioner in early 2012?
Start Position, End Position
6, 8
Note: The start/end position values are of course token positions, so they will depend on how you tokenize your inputs.
In this setup, you're also better off using a BERT-style model. In fact, there are already models on huggingface hub trained on SQuAD (and similar datasets). They should already be good at these tasks out of the box (but you can always fine-tune on top of this).
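To make the text, start_pos, end_pos triplet concrete, here is a minimal sketch of how the positions could be derived from an answer string. It assumes a plain whitespace split as the tokenizer, so the numbers it produces differ from the ones above; `find_answer_span` is a hypothetical helper, not part of any library.

```python
def find_answer_span(context, answer):
    """Return (start, end) token indices of `answer` in `context`, end inclusive.

    Tokens come from a plain whitespace split; a real subword tokenizer
    would give different positions.
    """
    context_tokens = context.split()
    answer_tokens = answer.split()
    n = len(answer_tokens)
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == answer_tokens:
            return start, start + n - 1
    return None  # the answer is not a contiguous span of the context


context = ("In early 2012, NFL Commissioner Roger Goodell stated that the league "
           "planned to make the 50th Super Bowl \"spectacular\".")
example = {
    "context": context,
    "question": "Who was the NFL Commissioner in early 2012?",
    "answer_span": find_answer_span(context, "Roger Goodell"),
}
print(example["answer_span"])  # (5, 6) with whitespace tokens
```

If the function returns None, the answer is not literally in the text, which is a hint that the extraction setup does not fit that example.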
Language Modeling
If all your examples have Answer: X, where X can basically be anything (it need not be contained in the text, and is not categorical), then you'd need to do language modeling.
In this setup, you have to use a GPT-style model, and your input would just be the whole text as is:
Context: Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
There is no need for labels, since the text itself is the label (we're asking the model to predict the next word, for each word). Larger models like GPT-3 and https://cohere.com (full disclosure, I work at Cohere) should be good at these tasks without any finetuning (if you give it the right prompt + examples), but of course, these are accessed behind APIs. These platforms also allow you to fine-tune models (via language modeling), so you don't need to run any code yourself. Not sure how much mileage you'll get with finetuning a smaller model like GPT-2. If this project is for learning, then yeah, definitely go ahead and fine-tune a GPT-2 model! But if performance is key, I highly recommend using a solution like https://cohere.com, which will just work out of the box.
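For the language-modeling setup, preparing the data mostly means turning each record into one block of text. The sketch below is one way to do that, assuming your records are dicts with context, question and answer keys (an assumption about your data) and using GPT-2's end-of-text token as a separator between examples; `format_example` is a hypothetical helper.

```python
def format_example(context, question, answer=None):
    """Join one record into the plain-text layout used for training.

    With answer=None the text stops after "Answer:", which is the
    prompt you would give the fine-tuned model at inference time.
    """
    text = f"Context: {context}\nQuestion: {question}\nAnswer:"
    if answer is not None:
        text += f" {answer}"
    return text


records = [
    {"context": "Matt wrecked his car today.",
     "question": "How was Matt's day?",
     "answer": "Bad"},
]

EOS = "<|endoftext|>"  # GPT-2's end-of-text token, used here to separate examples

training_text = "".join(
    format_example(r["context"], r["question"], r["answer"]) + EOS + "\n"
    for r in records
)
prompt = format_example("Matt wrecked his car today.", "How was Matt's day?")
print(training_text)
```

Saved to a text file, `training_text` is the kind of input the language-modeling example script accepts as a custom training file. At inference time you would pass `prompt` and let the model generate the answer.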
I'm working on a project that aims to find semantically conflicting sentences (NLP, semantic search).
For example
Our text is: "I ate today. The lunch was very tasty. I was an honest guest."
Query: "I had lunch with my friend"
We want to give the query to the model and find the sentences in the text that relate to it in meaning, in terms of both synonyms and antonyms.
The solution that came to my mind was to first find the synonymous sentences, extract the keywords from them, get the semantic opposites of those keywords, and then find the sentences that are semantically similar to these opposite words.
Do you think this idea is feasible? If you have a solution or experience in this area, please reply.
Thanks
You have not mentioned the exact use case for your problem, so I am not sure if the solution I know will help your cause. But there is an approach in NLP (using deep learning), known as natural language inference (NLI), which helps to find whether two sentences are correlated (entailment), unrelated (neutral) or contradictory.
Below is the information about the pretrained model which is trained specifically for this task ->
https://huggingface.co/facebook/bart-large-mnli
The dataset on which the above model is trained is given here ->
https://huggingface.co/datasets/glue/viewer/mnli/train
You can check the dataset to verify if your use case is related to the classification task performed on the dataset.
Since the model is already pretrained, you do not need to perform any training and can jump straight to evaluation. Once you are somewhat satisfied with the results, you can fine-tune the model a bit for your specific problem.
We can talk in comments if you need more clarification.
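As a rough sketch of how the evaluation could look: split the text into sentences, pair each one with the query, and ask the model for a label per pair. `build_pairs` and `nli_label` are hypothetical helpers, the sentence splitting on periods is deliberately naive, and the model-loading part is left as comments because it downloads the pretrained weights.

```python
def build_pairs(text, query):
    """Pair every sentence of `text` (the premise) with `query` (the hypothesis)."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    return [(sentence, query) for sentence in sentences]


def nli_label(premise, hypothesis, tokenizer, model):
    """Return the model's predicted label for one sentence pair."""
    import torch  # imported lazily so build_pairs works without it

    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # id2label comes from the model's own config, so no label order is hard-coded
    return model.config.id2label[int(logits.argmax(dim=-1))]


pairs = build_pairs(
    "I ate today. The lunch was very tasty. I was an honest guest.",
    "I had lunch with my friend",
)

# Usage (downloads the pretrained weights on first run):
# from transformers import AutoModelForSequenceClassification, AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
# model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
# for premise, hypothesis in pairs:
#     print(premise, "->", nli_label(premise, hypothesis, tokenizer, model))
```

Sentences labelled as contradictions would be the candidates for "conflicting" sentences; whether that matches your notion of conflict is something you would have to check on your own data.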
Take the following sentence:
I'm going to change the light bulb
Here, change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary API or something similar. However, consider the following sentences:
I need to go to the bank to change some currency
You need to change your screen brightness
In the first sentence, change no longer means replace; it means exchange. In the second sentence, change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most people's computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight on this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be upwards of 500+ in length which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print(answer)
Synset('depository_financial_institution.n.01')
>>> print(answer.definition())
a financial institution that accepts deposits and channels the money into lending activities
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
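To show the idea behind the Lesk family of methods without any dependencies, here is a minimal sketch of a simplified Lesk: pick the sense whose definition shares the most words with the sentence. The three glosses for "change" are made up by hand for this illustration (they are not WordNet entries), and `simplified_lesk` is not the pywsd implementation.

```python
import re

# Hypothetical, hand-written glosses for three senses of "change".
# A real system would take senses and glosses from WordNet.
SENSES = {
    "replace": "put something new in place of an old or broken item such as a bulb battery or tire",
    "exchange": "give money and receive its equivalent in another currency or denomination at a bank",
    "adjust": "alter a setting or level such as brightness volume or temperature",
}
STOPWORDS = {"a", "an", "and", "as", "at", "i", "in", "of", "or", "the", "to", "you", "your"}


def content_words(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def simplified_lesk(sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda sense: len(context & content_words(senses[sense])))


for sentence in [
    "I'm going to change the light bulb",
    "I need to go to the bank to change some currency",
    "You need to change your screen brightness",
]:
    print(sentence, "->", simplified_lesk(sentence))
```

With these glosses the three sentences come out as replace, exchange and adjust. The obvious weakness is that everything depends on literal word overlap, which is why the glosses here were written to contain words like "bulb" and "brightness".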
I don't know how useful this is but from my POV, word vector embeddings are naturally separated and the position in the sample space is closely related to different uses of the word. However like you said often a word may be used in several contexts.
To solve this, encoding techniques that utilise the context, like continuous bag-of-words or continuous skip-gram models, are generally used to classify the usage of a word in a particular context, such as change meaning either exchange or adjust. The same idea is applied in RNN- and LSTM-based architectures, where the context is preserved over input sequences.
The interpretation of word-vectors isn't practical from a visualisation point of view, but only from 'relative distance' point of view with other words in the sample space. Another way is to maintain a matrix of the corpus with contextual uses being represented for the words in that matrix.
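The "relative distance" reading can be sketched in a few lines: you never look at the individual numbers, you only compare vectors with each other, typically with cosine similarity. The 4-dimensional vectors below are made up for illustration; real embeddings would come from a trained model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, invented for this example only.
vectors = {
    "change": [0.9, 0.1, 0.3, 0.0],
    "adjust": [0.8, 0.2, 0.4, 0.1],
    "banana": [0.0, 0.9, 0.0, 0.7],
}


def nearest(word, vectors):
    """Return the other words sorted from most to least similar to `word`."""
    others = [w for w in vectors if w != word]
    return sorted(others,
                  key=lambda w: cosine_similarity(vectors[word], vectors[w]),
                  reverse=True)


print(nearest("change", vectors))  # ['adjust', 'banana']
```

The same comparison works no matter how long the vectors are, which is why their length is not a practical obstacle.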
In fact, there's a neural network that uses a bidirectional language model: it first predicts the upcoming word, then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the ELMo paper and this blog.
Naturally the model learns from representative examples. So the better training set you give with the diverse uses of the same word, the better model can learn to utilise context to attach meaning to the word. Often this is what people use to solve their specific cases by using domain centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces but spaCy has an implementation that aligns those to natural language tokens. There is a possibility then for example to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782
I have used an ML approach in my research using Python's scikit-learn. I found that SVM and logistic regression classifiers work best (e.g. 85% accuracy), decision trees work markedly worse (65%), and Naive Bayes works worse still (40%).
I will write up the conclusion to illustrate the obvious that some ML classifiers worked better than the others by a large margin, but what else can I say about my learning task or data structure based on these observations?
Edit:
The data set involved 500,000 rows, and I have 15 features but some of the features are various combination of substrings of certain text, so it naturally expands to tens of thousands of columns as a sparse matrix. I am using people's name to predict some binary class (eg: Gender), though I feature engineer a lot from the name entity like the length of the name, the substrings of the name, etc.
I recommend you to visit this awesome map on choosing the right estimator by the scikit-learn team http://scikit-learn.org/stable/tutorial/machine_learning_map
As describing the specifics of your own case would be an enormous task (I totally understand you didn't do it!) I encourage you to ask yourself several questions. Thus, I think the map on 'choosing the right estimator' is a good start.
Literally, go to the 'start' node in the map and follow the path:
is my number of samples > 50?
And so on. In the end you might end at some point and see if your results match with the recommendations in the map (i.e. did I end up in a SVM, which gives me better results?). If so, go deeper into the documentation and ask yourself why is that one classifier performing better on text data or whatever insight you get.
As I told you, we don't know the specifics of your data, but you should be able to ask such questions: what type of data do I have (text, binary, ...), how many samples, how many classes to predict, ... So ideally your data is going to give you some hints about the context of your problem, therefore why some estimators perform better than others.
But yeah, your question is too broad to cover in a single answer (especially without knowing the type of problem you are dealing with). You could also check whether any of those approaches is more inclined to overfit, for example.
The list of recommendations could be endless, which is why I encourage you to start by defining the type of problem you are dealing with and your data (in addition to the number of samples: is it normalized? Is it dispersed? Are you representing text in a sparse matrix? Are your inputs floats from 0.11 to 0.99?).
Anyway, if you want to share some specifics on your data we might be able to answer more precisely. Hope this helped a little bit, though ;)
Consider the text classification problem of spam or not spam with the Naive Bayes algorithm.
The question is the following:
How do you make predictions about a document W (a set of words) if in that set of words you see a new word wordX that was not seen at all by your model (so you do not even have a Laplace-smoothed probability estimated for it)?
Is the usual thing to do to just ignore wordX, even though it was seen in the current text, because it has no probability associated with it? I.e. I know Laplace smoothing is sometimes used to try to solve this problem, but what if that word is definitely new?
Some of the solutions that I've thought of:
1) Just ignore that word when estimating a classification (the simplest option, but sometimes wrong...? However, if the training set is large enough, this is probably the best thing to do, as I think it's reasonable to assume your features were selected well enough if you have, say, 1M or 20M data points).
2) Add that word to your model and change your model completely, because the vocabulary changed and so probabilities have to change everywhere (this does have a problem, though, since it could mean that you have to update the model frequently, especially if you analyze 1M documents, say).
I've done some research on this, read some of the Dan Jurafsky NLP and NB slides and watched some videos on coursera and looked through some research papers but I was not able to find something I found useful. It feels to me this problem is not new at all and there should be something (a heuristic..?) out there. If there isn't, it would be awesome to know that too!
Hope this is a useful post for the community and Thanks in advance.
PS: to make the issue a little more explicit with one of the solutions I've seen: say that we see an unknown new word wordX in a spam. Then for that word we can use 1 / (count(spam) + |Vocabulary| + 1). The issue I have with doing something like that is: does that mean we change the size of the vocabulary, so that every new document we classify has a new feature and vocabulary word? This video seems to attempt to solve that issue, but I'm not sure whether 1, that's a good thing to do, or 2, maybe I have misunderstood it:
https://class.coursera.org/nlp/lecture/26
From a practical perspective (keeping in mind this is not all you're asking), I'd suggest the following framework:
Train a model using an initial train set, and start using it for classification
Whenever a new word (with respect to your current model) appears, use some smoothing method to account for it. e.g. Laplace smoothing, as suggested in the question, might be a good start.
Periodically retrain your model using new data (usually in addition to the original train set), to account for changes in the problem domain, e.g. new terms. This can be done at preset intervals, e.g. once a month; after some number of unknown words have been encountered; or in an online manner, i.e. after each input document.
This retrain step can be done manually, e.g. collect all documents containing unknown terms, manually label them, and retrain; or using semi-supervised learning methods, e.g. automatically add the highest scored spam/ non spam documents to the respective models.
This will ensure your model stays updated and accounts for new terms - by adding them to the model from time to time, and by accounting for them even before that (simply ignoring them is usually not a good idea).
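The smoothing step of this framework can be sketched as follows. This is a minimal multinomial Naive Bayes written from scratch (not scikit-learn's implementation), using add-one smoothing with one extra slot in the denominator reserved for any unseen word, so the vocabulary does not grow at prediction time. The toy documents and the `NaiveBayes` class are assumptions made for the example.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing and one unknown-word slot."""

    def fit(self, documents, labels):
        self.word_counts = {}              # class -> Counter of word counts
        self.doc_counts = Counter(labels)  # class -> number of documents
        for words, label in zip(documents, labels):
            self.word_counts.setdefault(label, Counter()).update(words)
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def word_log_prob(self, word, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # The extra "+ 1" in the denominator reserves one slot for unseen words,
        # so an unknown word gets 1 / (total + |V| + 1) in every class.
        return math.log((counts[word] + 1) / (total + len(self.vocabulary) + 1))

    def predict(self, words):
        n_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.word_counts:
            score = math.log(self.doc_counts[label] / n_docs)
            score += sum(self.word_log_prob(w, label) for w in words)
            scores[label] = score
        return max(scores, key=scores.get)


documents = [["win", "money", "now"], ["cheap", "money", "offer"],
             ["meeting", "at", "noon"], ["lunch", "at", "noon", "tomorrow"]]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(documents, labels)
print(model.predict(["money", "offer", "wordX"]))  # spam
```

In this toy data the unknown wordX gets probability 1/17 under spam and 1/18 under ham, so it is almost, but not exactly, neutral: the decision is still driven by the known words, and the model keeps working until the next retrain adds the new term properly.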
I have read different documents on how CRFs (conditional random fields) work, but all the papers give only the formula. Is there anyone who can point me to a paper that describes CRFs with examples? For instance, suppose we have the sentence
"Mr. Smith was born in New York. He has been working for the last 20 years in Microsoft company."
If the above sentence is given as training input, how does the model work during training, taking the CRF formula into consideration?
Smith is tagged as "PER", New York as "LOC", and Microsoft Company as "ORG".
Moges.A
Here is a link to a set of slides made by Sasha Rush, a PhD student who is currently working on NLP at Google. One of the reasons I really like the slides is because they contain concrete examples and walk you through executions of important algorithms.
It is not a paper, but there is a whole free online course available on probabilistic graphical models, and CRFs are one of them.
It is very definitive and you'll get an intuitive level of understanding after completing it.
I don't think anybody will write such a tutorial. You can check an HMM tutorial, which is easier to understand and can be explained by example. The problem with CRFs is that training is a global optimization with many dependencies, so it is very hard to show step by step how we optimize parameters and how we predict labels. But the idea is very simple - maximization of dependency(clique) graph using sparsity...
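That said, a toy example can at least show what the formula computes. In a linear-chain CRF, each candidate tag sequence gets a score (a sum of weighted features over tokens and over adjacent tag pairs), and its probability is exp(score) divided by the sum of exp(score) over all possible tag sequences. Training adjusts the weights so that the gold sequence gets a high probability. The sketch below uses the first sentence from the question with hand-set, hypothetical feature weights (nothing is learned here) and brute-force enumeration, which only works because the sentence is tiny; real implementations use the forward algorithm and Viterbi instead.

```python
import itertools
import math

TAGS = ["O", "PER", "LOC"]
tokens = ["Mr.", "Smith", "was", "born", "in", "New", "York"]
gold = ("O", "PER", "O", "O", "O", "LOC", "LOC")


def emission(tokens, i, tag):
    """Sum of the (hand-set) weights of token-level features firing at position i."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else ""
    score = 0.0
    if word == "Mr." and tag == "O":
        score += 3.0  # a title is not part of the name
    if word[0].isupper() and tag != "O":
        score += 1.0  # capitalized words tend to be entities
    if word[0].islower() and tag == "O":
        score += 2.0  # lowercase words tend to be outside entities
    if prev_word == "Mr." and tag == "PER":
        score += 2.0  # the word after a title is likely a person
    if prev_word == "in" and tag == "LOC":
        score += 2.0  # the word after "in" is likely a location
    return score


def transition(prev_tag, tag):
    """Hand-set weight of the feature on a pair of adjacent tags."""
    if prev_tag == tag and tag != "O":
        return 1.0   # entities often span several tokens
    if {prev_tag, tag} == {"PER", "LOC"}:
        return -1.0  # a person rarely runs straight into a location
    return 0.0


def sequence_score(tokens, tags):
    """Unnormalized score of one complete tag sequence."""
    score = sum(emission(tokens, i, t) for i, t in enumerate(tags))
    score += sum(transition(a, b) for a, b in zip(tags, tags[1:]))
    return score


# Brute force: enumerate all 3**7 tag sequences to get the normalizer Z.
all_sequences = list(itertools.product(TAGS, repeat=len(tokens)))
Z = sum(math.exp(sequence_score(tokens, s)) for s in all_sequences)
best = max(all_sequences, key=lambda s: sequence_score(tokens, s))
p_gold = math.exp(sequence_score(tokens, gold)) / Z

print("best sequence:", best)
print("p(gold | sentence):", p_gold)
```

With these weights the highest-scoring sequence is the gold one. During training, the weights (here 3.0, 1.0, 2.0 and so on) are the parameters: they are nudged in the direction that increases log p(gold | sentence) over all training sentences, which is the "global" part that is hard to show by hand.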