Replacing empty texts - text embedding - machine-learning

I am trying to embed texts, using pre-trained fastText models. Some are empty. How would one replace them to make embedding possible? I was thinking about replacing them with dummy words, like that (docs being a pandas DataFrame object):
docs = docs.replace(np.nan, 'unknown', regex=True)
However it doesn't really make sense as the choice of this word is arbitrary and it is not equivalent to having an empty string.
Otherwise, I could associate the 0 vector embedding to empty strings, or the average vector, but I am not convinced either would make sense, as the embedding operation is non-linear.

In FastText, the sentence embedding is basically an average of the word vectors, as is shown in one of the FastText papers:
Given this fact, zeroes might a logical choice. But, the answer depends on what you want to do with the embeddings.
If you use them as an input for a classifier, it should be fine to select an arbitrary vector as a representation of empty string and the classifier will learn what that means. FastText also learns a special embedding for </s>, i.e., end of a sentence. This is another natural candidate for an embedding of the empty string, especially if you do similarity search.

Related

Can I use embedding layer instead of one hot as category input?

I am trying to use FFM to predict binary labels. My dataset is as follows:
sex|age|price|label
0|0|0|0
1|0|1|1
I know that FFM is a model that consider some attributes as a same field. If I use one hot encoding to transform the dataset, then the dataset will looks like follows:
sex_0|sex_1|age_0|age_1|price_0|price_1|label
0|0|0|0|0|0|0
0|1|0|0|0|1|1
Thus, sex_0 and sex_1 can be considered as one field. The other attributes are similar.
My question is whether can I use embedding layer to repalce the process of one hot encoding? However, this gives me some concerns.
I have no any other related dataset, so I can not use any
pre-trained embedding models. I can only randomly initialize the embedding
weights and the train it by my own dataset. Will this way approach
work?
If I use embedding layer instead of one hot encoding, does it
mean that each attribute will belongs one field?
What is the difference between these two methods? Which is better?
Yes you can use embeddings and that approach does work.
The attribute will not be equal to one element in the embedding but that combination of elements will equal to that attribute. The size of the embedding is something that you will have to select yourself. A good formula to follow is embedding_size = min(50, m+1// 2). Where m is the number of categories so if you have m=10 you will have an embedding size of 5.
A higher embedding size means it will capture more details on the relationship between the categorical variables.
In my experience embeddings do help especially when you have 100's of categories(if you have a small number of categories i.e. sex of a person, then one-hot encoding is sufficient) within a certain category.
On which is better I find embeddings do perform better in general when there are 100's of unique values in a category. Why this is so I do not have any concrete reasons but some intuitions for it.
For example, representing categories as 300-dimensional dense vectors(word embeddings) requires classifiers to learn far fewer weights than if the categories were represented as 50,000-dimensional vectors(one-hot encoding), and the smaller parameter space possibly helps with generalization and avoiding overfitting.

What are the differences between contextual embedding and word embedding

I am trying to understand the concept of embedding for the deep learning models.
I understand how employing word2vec can address the limitations of using the one-hot vectors.
However, recently I see a plethora of blog posts stating ELMo, BERT, etc. talking about contextual embedding.
How are word embeddings different from contextual embeddings?
Both embedding techniques, traditional word embedding (e.g. word2vec, Glove) and contextual embedding (e.g. ELMo, BERT), aim to learn a continuous (vector) representation for each word in the documents. Continuous representations can be used in downstream machine learning tasks.
Traditional word embedding techniques learn a global word embedding. They first build a global vocabulary using unique words in the documents by ignoring the meaning of words in different context. Then, similar representations are learnt for the words appeared more frequently close each other in the documents. The problem is that in such word representations the words' contextual meaning (the meaning derived from the words' surroundings), is ignored. For example, only one representation is learnt for "left" in sentence "I left my phone on the left side of the table." However, "left" has two different meanings in the sentence, and needs to have two different representations in the embedding space.
On the other hand, contextual embedding methods are used to learn sequence-level semantics by considering the sequence of all words in the documents. Thus, such techniques learn different representations for polysemous words, e.g. "left" in example above, based on their context.
Word embeddings and contextual embeddings are slightly different.
While both word embeddings and contextual embeddings are obtained from the models using unsupervised learning, there are some differences.
Word embeddings provided by word2vec or fastText has a vocabulary (dictionary) of words. The elements of this vocabulary (or dictionary) are words and its corresponding word embeddings. Hence, given a word, its embeddings is always the same in whichever sentence it occurs. Here, the pre-trained word embeddings are static.
However, contextual embeddings (are generally obtained from the transformer based models). The emeddings are obtained from a model by passing the entire sentence to the pre-trained model. Note that, here there is a vocabulary of words, but the vocabulary will not contain the contextual embeddings. The embeddings generated for each word depends on the other words in a given sentence. (The other words in a given sentence is referred as context. The transformer based models work on attention mechanism, and attention is a way to look at the relation between a word with its neighbors). Thus, given a word, it will not have a static embeddings, but the embeddings are dynamically generated from pre-trained (or fine-tuned) model.
For example, consider the two sentences:
I will show you a valid point of reference and talk to the point.
Where have you placed the point.
Now, the word embeddings from a pre-trained embeddings such as word2vec, the embeddings for the word 'point' is same for both of its occurrences in example 1 and also the same for the word 'point' in example 2. (all three occurrences has same embeddings).
While, the embeddings from BERT or ELMO or any such transformer based models, the the two occurrences of the word 'point' in example 1 will have different embeddings. Also, the word 'point' occurring in example 2 will have different embeddings than the ones in example 1.

Am I using word-embeddings correctly?

Core question : Right way(s) of using word-embeddings to represent text ?
I am building sentiment classification application for tweets. Classify tweets as - negative, neutral and positive.
I am doing this using Keras on top of theano and using word-embeddings (google's word2vec or Stanfords GloVe).
To represent tweet text I have done as follows:
used a pre-trained model (such as word2vec-twitter model) [M] to map words to their embeddings.
Use the words in the text to query M to get corresponding vectors. So if the tweet (T) is "Hello world" and M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add vectors) or V1V2 (concatinate vectors)[These are 2 different strategies] [Concatenation means juxtaposition, so if V1, V2 are d-dimension vectors, in my example T is 2d dimension vector]
Then, the tweet T is represented by vector V.
If I follow the above, then My Dataset is nothing but vectors (which are sum or concatenation of word vectors depending on which strategy I use).
I am training a deepnet such as FFN, LSTM on this dataset. But my results arent coming out to be great.
Is this the right way to use word-embeddings to represent text ? What are the other better ways ?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms to these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you to consider also other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (i.e. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
Summing them doesn't make any sense to be honest, because on summing them you get another vector which i don't think represents the semantics of "Hello World" or may be it does but it won't surely hold true for longer sentences in general
Instead it would be better to feed them as sequence as in that way it at least preserves sequence in meaningful way which seems to fit more to your problem.
e.g A hates apple Vs Apple hates A this difference would be captured when you feed them as sequence into RNN but their summation will be same.
I hope you get my point!

Naive Bayes, not so Naive?

I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, the same experiment with an MLP, instead of having the above behavior it gives percentages of 0.006, 0.5 and 0.8
So the MLP does what I would expect a Naive Bayes to do, and vise versa. Am I missing something, can anyone explain these results?
I am not familiar with the internals of WEKA - so please correct me if you think that I am not righth.
When using a text as a "feature" than this text is transformed to a vector of binary values. Each value correponds to one concrete word. The length of the vector is equal to the size of the dictionary.
if your dictionary contains 4 worlds: LCD, VHS, HELLO, WORLD
then for example a text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds it's dictionary, but I think it might go over all the words present in the examples. Unless the "L" is present in the dictionary (and therefor is present in the examples) than it's probability is logicaly 0. Actually it should not even be considered as a feature.
Actually you can not reason over the probabilities of the features - and you cannot add them together, I think there is no such a relationship between the features.
Beware that in text mining, words (letters in your case) may be given weights different than their actual counts if you are using any sort of term weighting and normalization, e.g. tf.idf. In the case of tf.idf for example, characters counts are converted into a logarithmic scale, also characters that appear in every single instance may be penalized using idf normalization.
I am not sure what options you are using to convert your data into Weka features, but you can see here that Weka has parameters to be set for such weighting and normalization options
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij)
where fij is the frequency of word i in jth document(instance).
-I
Transform each word frequency into:
fij*log(num of Documents/num of documents containing word i)
where fij if frequency of word i in jth document(instance)
I checked the weka documentation and I didn't see support for extracting letters as features. This implies the weka function may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter-words and would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.

How to include words as numerical feature in classification

Whats the best method to use the words itself as the features in any machine learning algorithm ?
The problem I have to extract word related feature from a particular paragraph. Should I use the index in the dictionary as the numerical feature ? If so, how will I normalize these ?
In general, How are words itself used as features in NLP ?
There are several conventional techniques by which words are mapped to features (columns in a 2D data matrix in which the rows are the individual data vectors) for input to machine learning models.classification:
a Boolean field which encodes the presence or absence of that word in a given document;
a frequency histogram of a
predetermined set of words, often the X most commonly occurring words from among all documents comprising the training data (more about this one in the
last paragraph of this Answer);
the juxtaposition of two or more
words (e.g., 'alternative' and
'lifestyle' in consecutive order have
a meaning not related either
component word); this juxtaposition can either be captured in the data model itself, eg, a boolean feature that represents the presence or absence of two particular words directly adjacent to one another in a document, or this relationship can be exploited in the ML technique, as a naive Bayesian classifier would do in this instanceemphasized text;
words as raw data to extract latent features, eg, LSA or Latent Semantic Analysis (also sometimes called LSI for Latent Semantic Indexing). LSA is a matrix decomposition-based technique which derives latent variables from the text not apparent from the words of the text itself.
A common reference data set in machine learning is comprised of frequencies of 50 or so of the most common words, aka "stop words" (e.g., a, an, of, and, the, there, if) for published works of Shakespeare, London, Austen, and Milton. A basic multi-layer perceptron with a single hidden layer can separate this data set with 100% accuracy. This data set and variations on it are widely available in ML Data Repositories and academic papers presenting classification results are likewise common.
Standard approach is the "bag-of-words" representation where you have one feature per word, giving "1" if the word occurs in the document and "0" if it doesn't occur.
This gives lots of features, but if you have a simple learner like Naive Bayes, that's still OK.
"Index in the dictionary" is a useless feature, I wouldn't use it.
tf-idf is a pretty standard way of turning words into numeric features.
You need to remember to use a learning algorithm that supports numeric featuers, like SVM. Naive Bayes doesn't support numeric features.

Resources