How to force GPT2 to generate specific tokens in each sentence? - machine-learning

My input is a string and the outputs are vector representations (corresponding to the generated tokens). I'm trying to force the outputs to have specific tokens (e.g., 4 commas/2 of the word "to", etc). That is, each generated sentence must have those.
Is there a potential loss component that can force GPT2 to generate specific tokens? Another approach, which would be easier and more robust (though I'm not sure it is possible), is similar to the masking of tokens in BERT. That is, instead of forcing GPT2 to generate sentences containing the required tokens, the predefined tokens would already be placed in the sentence beforehand:
However, an issue with this approach is that there isn't a predefined number of tokens that should be generated/masked before or after the [specific_token], nor is there a predefined number of sentences to generate for each given input (otherwise I would have used BERT).
from transformers import logging
from transformers import GPT2Tokenizer, GPT2Model
import torch
checkpoint = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(checkpoint)
model = GPT2Model.from_pretrained(checkpoint)
num_added_tokens = tokenizer.add_special_tokens({'pad_token': '[CLS]'})
embedding_layer = model.resize_token_embeddings(len(tokenizer)) # Update the model embeddings with the new vocabulary size
input_string = 'Architecturally, the school has a Catholic character.'
token_ids = tokenizer(input_string, truncation = True, padding=True)
output = model(torch.tensor(token_ids['input_ids']))


Getting sentence embedding from huggingface Feature Extraction Pipeline

How do I get an embedding for the whole sentence from huggingface's feature extraction pipeline?
I understand how to get the features for each token (below), but how do I get the overall features for the sentence as a whole?
from transformers import pipeline
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
To explain more on the comment that I have put under stackoverflowuser2010's answer, I will use "barebone" models, but the behavior is the same with the pipeline component.
BERT and derived models (including DistilRoBERTa, which is the model you are using in the pipeline) generally mark the start and end of a sentence with special tokens (the first one is mostly denoted [CLS]), and these are usually the easiest way of making predictions or generating embeddings over the entire sequence. There is a discussion within the community about which method is superior (see also a more detailed answer by stackoverflowuser2010 here); however, if you simply want a "quick" solution, then taking the [CLS] token is certainly a valid strategy.
Now, while the documentation of the FeatureExtractionPipeline isn't very clear, in your example we can easily compare the outputs, specifically their lengths, with a direct model call:
from transformers import pipeline, AutoTokenizer
# direct encoding of the sample sentence
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
encoded_seq = tokenizer.encode("i am sentence")
# your approach
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
# Compare lengths of outputs
print(len(encoded_seq)) # 5
# Note that the pipeline output is wrapped in an extra list, so it has to be indexed with 0.
print(len(features[0])) # 5
When inspecting the content of encoded_seq, you will notice that the first token has ID 0, which denotes the beginning-of-sequence token (<s> for RoBERTa-style models, the counterpart of [CLS], and in our case the token whose embedding we use). Since the output lengths are the same, you could then simply access a preliminary sentence embedding by doing something like
sentence_embedding = features[0][0]
If you want to get a meaningful embedding of the whole sentence, please use SentenceTransformers. Pooling is well implemented in it, and it also provides various APIs to fine-tune models to produce features/embeddings at the sentence or text-chunk level.
pip install sentence-transformers
Once you have installed sentence-transformers, the code below can be used to produce sentence embeddings:
from sentence_transformers import SentenceTransformer
model_st = SentenceTransformer('distilroberta-base')
embeddings = model_st.encode("I am a sentence")
Visit the official site for more info on sentence transformers.
If you have the embeddings for each token, you can create an overall sentence embedding by pooling (summarizing) over them. Note that if you have D-dimensional token embeddings, you should get a D-dimensional sentence embedding through one of these approaches:
Compute the mean over all token embeddings.
Compute the max of each of the D-dimensions over all the token embeddings.
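As a rough illustration of the two pooling options above, here is a minimal sketch in plain Python. It assumes the token embeddings arrive as a nested list (one list of D floats per token), which is the shape you get from features[0] in the pipeline example; the function names mean_pool and max_pool and the toy values are made up for this example.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Average each of the D dimensions over all tokens."""
    n_tokens = len(token_embeddings)
    # zip(*rows) iterates over columns, i.e. over one dimension at a time
    return [sum(dim_values) / n_tokens for dim_values in zip(*token_embeddings)]


def max_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Take the maximum of each of the D dimensions over all tokens."""
    return [max(dim_values) for dim_values in zip(*token_embeddings)]


# Toy input: 3 tokens with D = 2 (hypothetical values, not real model output)
toy_tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]

mean_vec = mean_pool(toy_tokens)  # [2.0, 2.0]
max_vec = max_pool(toy_tokens)    # [3.0, 4.0]
print(mean_vec, max_vec)
```

With the pipeline from the question, something like mean_pool(features[0]) should give one D-dimensional vector per sentence. Whether you want to include the special start/end tokens in the pooling is a design choice this sketch does not make for you.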

How can i make spacy not produce the -PRON- lemma?

I am using spaCy to lemmatize a large number of tweets. However, when I lemmatize words like "I", the token -PRON- is produced. How can I avoid that?
-PRON- is the default lemma for pronouns in spaCy (see the docs):
About spaCy's custom pronoun lemma
Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it” — or maybe “he”? spaCy’s solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns.
If you don't want it, you can simply replace it by something else, such as the word form of the token in question (see code snippet below). Just be aware that this may have unexpected consequences for subsequent processing. spaCy uses both a string and an integer representation of token attributes, so you may want to change both of these (if possible), or keep the original integer value for traceability.
if token.lemma_ == '-PRON-':
    token.lemma_ = token.orth_  # change the string representation
    token.lemma = token.orth  # change the integer representation (I didn't test this part)

How to set whitespace tokenizer on NER Model?

I am creating a custom NER model using CoreNLP 3.6.0.
My props are:
# location of the training file
trainFile = /home/damiano/stanford-ner.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
# the last 4 properties deal with word shape features
I build with this command:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop /home/damiano/stanford-ner.prop
The problem appears when I use this model to retrieve the entities inside a text file. The command is:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt
Where file.txt is:
Hello!
my name is John.
The output is:
Hello/O !/O
my/O name/O is/O John/PERSON ./O
As you can see, it splits "Hello!" into two tokens. The same thing happens with "John."
I must use a whitespace tokenizer.
How can I set it?
Why is CoreNLP splitting those words into two tokens?
You set your own tokenizer by passing the class name to the tokenizerFactory flag/property:
tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory
You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For instance, here, if you also specify:
tokenizerOptions = tokenizeNLs=true
then the newlines in your input will be preserved in the output (for output options that don't always convert things into a one-token-per-line format).
Note: Options like tokenize.whitespace=true apply at the level of CoreNLP. They aren't interpreted (you get a warning saying that the option is ignored) if provided to individual components like CRFClassifier.
As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time would only be correct if your training data is also whitespace separated; otherwise it will adversely affect performance. And having tokens like the ones you get from whitespace separation is bad for subsequent NLP processing such as parsing.
Upd. If you want to use the whitespace tokenizer here, simply add tokenize.whitespace=true to your properties file. Look at Christopher Manning's answer.
However, to answer your second question about why CoreNLP splits those words into two tokens: I'd suggest keeping the default tokenizer (which is PTBTokenizer), because it simply gives better results. Usually the reason to switch to whitespace tokenization is a high demand for processing speed or (usually "and") a low demand for tokenization quality.
Since you are going to use it for subsequent NER, I doubt that this is your case.
Even in your example, if you have the token John. after tokenization, it cannot be matched by a gazette or by training examples.
More details and reasons why tokenization isn't that simple can be found here.
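To make the point about "John." concrete, here is a small Python sketch (not CoreNLP code). The regex tokenizer is only a crude stand-in for what PTBTokenizer does, and the GAZETTE set and function names are hypothetical; the only thing it is meant to show is that a token with punctuation glued on no longer matches a name list.

```python
import re
from typing import List

# Hypothetical name list standing in for a gazette / training vocabulary
GAZETTE = {"John", "Mary"}


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()


def punctuation_aware_tokenize(text: str) -> List[str]:
    """Very rough approximation of a PTB-style tokenizer:
    runs of word characters and single punctuation marks become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def gazette_hits(tokens: List[str]) -> List[str]:
    """Return the tokens that match a gazette entry exactly."""
    return [tok for tok in tokens if tok in GAZETTE]


text = "Hello! my name is John."
ws_tokens = whitespace_tokenize(text)
pa_tokens = punctuation_aware_tokenize(text)

print(ws_tokens, gazette_hits(ws_tokens))  # 'John.' is not in the gazette
print(pa_tokens, gazette_hits(pa_tokens))  # 'John' is
```

With whitespace splitting the last token is 'John.', so the exact-match lookup finds nothing, while the punctuation-aware split yields 'John' and the lookup succeeds.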

How to filter features from CountVectorizer?

I am doing a text analysis (topic modeling), and when I run my text through CountVectorizer, I get a bunch of numbers, dates, and locations that are quite irrelevant to my needs. I thought I could pass in a preprocessing function, but the scikit-learn page for preprocessing doesn't seem to have the information I need to build the preprocessor.
You can change the token_pattern parameter in CountVectorizer.
The token pattern is a regular expression denoting what constitutes a “token”; it is only used if analyzer == 'word'. The type of token_pattern is string.
The default is token_pattern=r"(?u)\b\w\w+\b". The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator). You can change it to meet your needs (for example, to ignore dates).
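As a sketch of what such a change could look like, the snippet below compares the default pattern with a letters-only alternative using Python's re module directly. To my understanding, CountVectorizer applies token_pattern with a regex findall on the (by default lowercased) text, so re.findall should be a reasonable way to preview which tokens survive. The name LETTERS_ONLY_PATTERN and the sample text are assumptions made for this example.

```python
import re

# Default pattern quoted above: 2+ word characters (letters, digits, underscore)
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Assumed alternative: 2+ characters that are word characters
# but neither digits nor underscores, i.e. letters only
LETTERS_ONLY_PATTERN = r"(?u)\b[^\W\d_]{2,}\b"

text = "meeting on 2021-03-15 in room 42 about budget"

default_tokens = re.findall(DEFAULT_PATTERN, text)
letters_only_tokens = re.findall(LETTERS_ONLY_PATTERN, text)

print(default_tokens)
# ['meeting', 'on', '2021', '03', '15', 'in', 'room', '42', 'about', 'budget']
print(letters_only_tokens)
# ['meeting', 'on', 'in', 'room', 'about', 'budget']
```

The same string could then be passed as CountVectorizer(token_pattern=LETTERS_ONLY_PATTERN). Note that this only removes numbers and date fragments; locations are ordinary words, so they would need something else, such as a custom stop word list.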

NLP - How would you parse highly noisy sentence (with Earley parser)

I need to parse a sentence. I have an implemented Earley parser and a grammar for it, and everything works just fine when a sentence has no misspellings. The problem is that a lot of the sentences I have to deal with are highly noisy. I wonder if there's an algorithm which combines parsing with error correction? Possible errors are:
typos like 'cheker' instead of 'checker'
typos like 'spellchecker' instead of 'spell checker'
contractions like 'Ear par' instead of 'Earley parser'
If you know an article which can answer my question, I would appreciate a link to it.
I assume you are using a tagger (or lexer) stage that is applied before the Earley parser, i.e. an algorithm that splits the input string into tokens and looks each token up in a dictionary to determine its part-of-speech (POS) tag(s):
John --> PN
loves --> V
a --> DT
woman --> NN
named --> JJ,VPP
Mary --> PN
It should be possible to build some kind of approximate string lookup (aka fuzzy string lookup) into that stage, so when it is presented with a misspelled token, such as 'lobes' instead of 'loves', it will not only identify the tags found by exact string matching ('lobes' as a noun plural of 'lobe'), but also tokens that are similar in shape ('loves' as third-person singular of verb 'love').
This will imply that you generally get a larger number of candidate tags for each token, and therefore a larger number of possible parse results during parsing. Whether or not this will produce the desired result depends on how comprehensive the grammar is, and how good the parser is at identifying the correct analysis when presented with many possible parse trees. A probabilistic parser may be better for this, as it assigns every candidate parse tree a probability (or confidence score), which may be used to select the most likely (or best) analysis.
If this is the solution you'd like to try, there are several possible implementation strategies. Firstly, if the tokenization and tagging are performed as a simple dictionary lookup (i.e. in the style of a lexer), you may simply use a data structure for the dictionary that enables approximate string matching. General methods for approximate string comparison are described in Approximate string matching algorithms, while methods for approximate string lookup in larger dictionaries are discussed in Quickly compare a string against a Collection in Java.
If, however, you use an actual tagger, as opposed to a lexer, i.e. something that performs POS disambiguation in addition to mere dictionary lookup, you will have to build the approximate dictionary lookup into that tagger. There must be a dictionary lookup function, which is used to generate candidate tags before disambiguation is applied, somewhere in the tagger. That dictionary lookup will have to be replaced with one that enables approximate string lookup.
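For the simpler lexer-style case described above, the approximate dictionary lookup could be sketched roughly as follows. This uses difflib.get_close_matches from the Python standard library as the fuzzy matcher, which is just one convenient choice and not necessarily the best-performing one for large dictionaries. The LEXICON contents, the function name candidate_tags and the cutoff value are assumptions made for this example.

```python
import difflib
from typing import Dict, Set

# Toy lexicon mapping lowercased word forms to their POS tags (hypothetical)
LEXICON: Dict[str, Set[str]] = {
    "john": {"PN"},
    "loves": {"V"},
    "lobes": {"NN"},
    "a": {"DT"},
    "woman": {"NN"},
    "named": {"JJ", "VPP"},
    "mary": {"PN"},
    "checker": {"NN"},
}


def candidate_tags(token: str,
                   lexicon: Dict[str, Set[str]] = LEXICON,
                   cutoff: float = 0.75) -> Set[str]:
    """Collect tags of the exact entry (if any) plus tags of similar entries."""
    word = token.lower()
    tags = set(lexicon.get(word, set()))
    # get_close_matches returns up to n lexicon keys whose similarity
    # ratio to `word` is at least `cutoff`
    for similar in difflib.get_close_matches(word, list(lexicon), n=3, cutoff=cutoff):
        tags |= lexicon[similar]
    return tags


print(candidate_tags("lobes"))   # exact match 'lobes' plus the similar 'loves'
print(candidate_tags("cheker"))  # no exact match, but close to 'checker'
```

As the answer points out, this widens the set of candidate tags per token ('lobes' now comes back as both NN and V), so the parser has more analyses to choose between. The cutoff controls that trade-off: lower values tolerate more noise but produce more candidates.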
