How to create affix (prefix + suffix) embeddings in NLP - machine-learning

I am working on a named entity recognition task. The traditional method is to first concatenate word embeddings and character-level embeddings to create a word representation. I also want to use affix embeddings to better capture the relation between the tags and the words.
For example, the words "Afghanistan" and "Tajikistan" are clear examples of a Location. Here the suffix "istan" or "tan" will be useful for identifying future "Location" tags. So I want to extract the suffixes and prefixes of all the words, create embeddings for them, and then concatenate these with the initial word representation. How can I achieve this?

You can do this:
Find a suffix vocabulary (for example, via a Google search).
Write a simple max-backward segmentation script to generate every word's suffix, and add the suffixes as another field in your training and test data, just like the words and characters.
Concatenate the suffix embeddings with the word and character embeddings, as sketched below.
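Here is a minimal PyTorch sketch of that last step, purely illustrative: the layer sizes, vocabulary sizes, and the assumption that a character-level encoder already produces a per-word vector are placeholders, not a fixed recipe.

import torch
import torch.nn as nn

class AffixWordRepresentation(nn.Module):
    """Concatenates word, character-level and suffix embeddings per token."""
    def __init__(self, n_words, n_suffixes, word_dim=100, char_dim=30, suffix_dim=20):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.suffix_emb = nn.Embedding(n_suffixes, suffix_dim, padding_idx=0)
        # the character-level vectors are assumed to come from a char-CNN/BiLSTM elsewhere
        self.output_dim = word_dim + char_dim + suffix_dim

    def forward(self, word_ids, char_repr, suffix_ids):
        # word_ids, suffix_ids: (batch, seq_len); char_repr: (batch, seq_len, char_dim)
        w = self.word_emb(word_ids)
        s = self.suffix_emb(suffix_ids)
        return torch.cat([w, char_repr, s], dim=-1)   # (batch, seq_len, word+char+suffix dims)

# toy usage: suffix ids come from a lookup such as {"istan": 1, "tan": 2, ...}
model = AffixWordRepresentation(n_words=5000, n_suffixes=200)
word_ids = torch.randint(1, 5000, (2, 6))
suffix_ids = torch.randint(1, 200, (2, 6))
char_repr = torch.randn(2, 6, 30)
print(model(word_ids, char_repr, suffix_ids).shape)   # torch.Size([2, 6, 150])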

Related

What are the differences between contextual embedding and word embedding

I am trying to understand the concept of embeddings for deep learning models.
I understand how employing word2vec can address the limitations of using one-hot vectors.
However, I have recently seen a plethora of blog posts about ELMo, BERT, etc. that talk about contextual embeddings.
How are word embeddings different from contextual embeddings?
Both embedding techniques, traditional word embeddings (e.g. word2vec, GloVe) and contextual embeddings (e.g. ELMo, BERT), aim to learn a continuous (vector) representation for each word in the documents. Continuous representations can then be used in downstream machine learning tasks.
Traditional word embedding techniques learn a global word embedding. They first build a global vocabulary from the unique words in the documents, ignoring the meaning of words in different contexts. Then, similar representations are learnt for words that appear frequently close to each other in the documents. The problem is that such word representations ignore the words' contextual meaning (the meaning derived from the words' surroundings). For example, only one representation is learnt for "left" in the sentence "I left my phone on the left side of the table." However, "left" has two different meanings in the sentence and needs two different representations in the embedding space.
On the other hand, contextual embedding methods learn sequence-level semantics by considering the sequence of all words in the documents. Thus, such techniques learn different representations for polysemous words, e.g. "left" in the example above, based on their context.
Word embeddings and contextual embeddings are slightly different.
While both word embeddings and contextual embeddings are obtained from models using unsupervised learning, there are some differences.
Word embeddings provided by word2vec or fastText have a vocabulary (dictionary) of words. The elements of this vocabulary (or dictionary) are words and their corresponding word embeddings. Hence, given a word, its embedding is always the same, in whichever sentence it occurs. Here, the pre-trained word embeddings are static.
Contextual embeddings, on the other hand, are generally obtained from transformer-based models. The embeddings are obtained by passing the entire sentence to the pre-trained model. Note that there is still a vocabulary of words, but the vocabulary does not contain the contextual embeddings. The embedding generated for each word depends on the other words in the given sentence. (The other words in the sentence are referred to as the context. Transformer-based models work on the attention mechanism, and attention is a way to look at the relation between a word and its neighbours.) Thus, a given word does not have a static embedding; its embeddings are generated dynamically by the pre-trained (or fine-tuned) model.
For example, consider the two sentences:
I will show you a valid point of reference and talk to the point.
Where have you placed the point?
Now, with word embeddings from a pre-trained model such as word2vec, the embedding for the word 'point' is the same for both of its occurrences in example 1 and also the same for the word 'point' in example 2 (all three occurrences have the same embedding).
With the embeddings from BERT, ELMo, or any such contextual model, the two occurrences of the word 'point' in example 1 will have different embeddings, and the word 'point' in example 2 will have an embedding different from the ones in example 1.
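If you want to see this yourself, here is a rough sketch using the Hugging Face transformers library; the checkpoint name and the single-token treatment of "point" are assumptions made for illustration.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "I will show you a valid point of reference and talk to the point.",
    "Where have you placed the point?",
]

with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state[0]                    # (num_tokens, 768)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        # collect the hidden state of every occurrence of "point"
        for i, tok in enumerate(tokens):
            if tok == "point":
                print(sent[:35], "-> first 3 dims:", hidden[i][:3])   # vectors differ per occurrence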

How to train a model to distinguish/categorize words by predefined meanings?

Semantic analysis in deep learning and NLP is usually about the meaning of a whole sentence, as in sentiment analysis. In many cases, the meaning of a word can be understood from the sentence structure. For example,
Can you tell this from that?
Can you tell me something about this?
Is there any established method for training a model on a dataset of
word meaning_id sentence
tell 1 Can you tell this from that?
tell 2 Can you tell me something about this?
Note that the purpose is just to categorize words by predefined meanings/examples.
I use Stanford CoreNLP, but I doubt it offers such a possibility. Any deep learning framework is OK.
What I can think of is using contextualized word embeddings. Those are embeddings for a word given its context, meaning that depending on the context, a word gets a different embedding. BERT generates such contextualized embeddings.
One can assume that those embeddings differ significantly for different meanings, but are quite similar for the same meaning of a word.
What you would do is:
run BERT on sentences containing your word "tell"
extract the embeddings for the word "tell" from the last layer of BERT
try to cluster the resulting embeddings into the different meanings; if you already have predefined meanings and some example sentences, you can even train a classifier (see the sketch below)
Here is a blog entry showing the feasibility of such an approach.
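A rough sketch of those steps, assuming the bert-base-uncased checkpoint and scikit-learn's KMeans for the clustering stage; with labelled meaning_ids you would replace the clustering with any classifier.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "Can you tell this from that?",
    "Can you tell me something about this?",
    "I can tell the twins apart easily.",
    "Please tell me your name.",
]

def embed_target(sentence, target="tell"):
    """Return the last-layer BERT vector of the first occurrence of `target`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(target)].numpy()

X = [embed_target(s) for s in sentences]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)   # sentences grouped by (hopefully) the two senses of "tell"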

Replacing empty texts - text embedding

I am trying to embed texts using pre-trained fastText models. Some of the texts are empty. How would one replace them to make embedding possible? I was thinking about replacing them with a dummy word, like this (docs being a pandas DataFrame object):
docs = docs.replace(np.nan, 'unknown', regex=True)
However, that doesn't really make sense, as the choice of this word is arbitrary and it is not equivalent to having an empty string.
Otherwise, I could associate the zero vector (or the average vector) with empty strings, but I am not convinced either would make sense, as the embedding operation is non-linear.
In fastText, the sentence embedding is basically an average of the word vectors, as is shown in one of the fastText papers.
Given this fact, zeroes might be a logical choice. But the answer depends on what you want to do with the embeddings.
If you use them as input to a classifier, it should be fine to select an arbitrary vector as the representation of the empty string, and the classifier will learn what that means. fastText also learns a special embedding for </s>, i.e., the end of a sentence. This is another natural candidate for the embedding of the empty string, especially if you do similarity search.
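As an illustration, here is a hedged sketch using the official fasttext Python bindings (the model path is a placeholder); it falls back to either the zero vector or the </s> vector for empty or missing texts.

import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")    # placeholder path to a pre-trained model

def embed(text, empty_strategy="zeros"):
    """Sentence embedding with an explicit fallback for empty or missing texts."""
    if not isinstance(text, str) or not text.strip():
        if empty_strategy == "eos":
            return model.get_word_vector("</s>")    # fastText's end-of-sentence token
        return np.zeros(model.get_dimension())      # zero-vector fallback
    return model.get_sentence_vector(text)

vectors = [embed(t) for t in ["a short document", "", None]]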

Sequence-to-sequence learning with real numbers

TensorFlow has a tutorial on sequence-to-sequence models that shows how to translate English to French.
Is it possible to use real numbers instead of words in these models?
Suppose the vocabulary contains the following words
[hello:0, a:1, the:2, what:3, is:4, problem:5, there:6, it:7, this:8, listen:9]
Since a word appears only once in the vocabulary, we can use its index as a reference.
so a sentence like
listen there is a problem
will be converted into an array of indices like
[9, 6, 4, 1, 5]
So this is how a single input sentence can be represented using such an index-based encoding. This vocabulary with the corresponding indices is fed into the network along with the list of indexed sentences, i.e. the indexed corpus.
But this is not going to cut it, since your network also needs to learn the common relationships among words; it requires some more semantic information. This can be achieved using word2vec. You can take a trained word2vec model, get vectors for all the words in your vocabulary, associate them with these indices, and use this as your embedding lookup instead of the word2index dictionary.
Now you can map all of the word indices to their corresponding word vectors generated by word2vec and store them in an embedding matrix, where the row index is the word index and each row is a fixed-length word vector for that word. Then feed this information to your network in the form of a feed dict.
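Here is a small sketch of that lookup-table construction using gensim's KeyedVectors; the vector file name is a placeholder, and out-of-vocabulary words simply keep a random row.

import numpy as np
from gensim.models import KeyedVectors

word2index = {"hello": 0, "a": 1, "the": 2, "what": 3, "is": 4,
              "problem": 5, "there": 6, "it": 7, "this": 8, "listen": 9}

# placeholder path to pre-trained 300-dimensional word2vec vectors
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

embedding_matrix = np.random.normal(scale=0.1, size=(len(word2index), kv.vector_size))
for word, idx in word2index.items():
    if word in kv:                                  # out-of-vocabulary words keep the random init
        embedding_matrix[idx] = kv[word]

# a sentence becomes a list of indices, which the network maps through this matrix
sentence = "listen there is a problem".split()
indices = [word2index[w] for w in sentence]         # -> [9, 6, 4, 1, 5]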
Alternatively, you can also consider word senses or features (a combination of word, POS tag, and NER tag), create a vocabulary, and then build a similar word2index mapping and embedding matrix.

Sentence classification using Weka

I want to classify sentences with Weka. My features are the sentence terms (words) and a part-of-speech tag for each term. I don't know how to define the attributes, because if each term is represented as one feature, the number of features differs from instance (sentence) to instance. And if all the words in a sentence are represented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
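To make the "append the POS tag to the word" idea concrete, here is a small sketch outside of Weka using scikit-learn's CountVectorizer (in Weka itself the StringToWordVector filter plays this role); the hand-written POS tags are for illustration only.

from sklearn.feature_extraction.text import CountVectorizer

# POS tags are hand-written here for illustration; in practice they come from a tagger,
# and in Weka the StringToWordVector filter would build the counts.
tagged_sentences = [
    [("The", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("Can", "MD"), ("you", "PRP"), ("tell", "VB"), ("me", "PRP"), ("something", "NN")],
]

# append the POS tag to each word so that e.g. "tell_VB" and "tell_NN" become distinct features
docs = [" ".join(f"{w}_{t}" for w, t in sent) for sent in tagged_sentences]

vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X = vectorizer.fit_transform(docs)                  # one numeric feature per word_POS type
print(vectorizer.get_feature_names_out())
print(X.toarray())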
