Text Feature Representation As Vectors for SVM - machine-learning

I am learning the Semantic Role Labeling (SRL) task. I have read a lot, and now I come to a problem for how to represent the text features as vectors.
For example, for the sentence:
We like StackOverflow very much
given the predicate verb: like, a few features are:
the left 1st word: I
the right 1st word: StackOverflow
the POS tag of the left 1st word: Pronoun
The POS tag of the right 1st word: Adverbial
What are the right ways to represent these features as vectors?
If possible, can you also give me some guidances for how to normalize these features please?
I basically want to train the data with these type of features using SVM models.

It doesn't matter what classifier you use (SVM or not) the feature generation for text is the same.
I suggest you to take a look at this:
Binary Feature Extraction
Also this library would make your life much easier:
http://cogcomp.cs.illinois.edu/page/software_view/LBJ
A tutorial is here: http://cogcomp.cs.illinois.edu/page/tutorial.201310

Related

Natural Language Processing techniques for understanding contextual words

Take the following sentence:
I'm going to change the light bulb
The meaning of change means replace, as in someone is going to replace the light bulb. This could easily be solved by using a dictionary api or something similar. However, the following sentences
I need to go the bank to change some currency
You need to change your screen brightness
The first sentence does not mean replace anymore, it means Exchangeand the second sentence, change means adjust.
If you were trying to understand the meaning of change in this situation, what techniques would someone use to extract the correct definition based off of the context of the sentence? What is what I'm trying to do called?
Keep in mind, the input would only be one sentence. So something like:
Screen brightness is typically too bright on most peoples computers.
People need to change the brightness to have healthier eyes.
Is not what I'm trying to solve, because you can use the previous sentence to set the context. Also this would be for lots of different words, not just the word change.
Appreciate the suggestions.
Edit: I'm aware that various embedding models can help gain insight on this problem. If this is your answer, how do you interpret the word embedding that is returned? These arrays can be upwards of 500+ in length which isn't practical to interpret.
What you're trying to do is called Word Sense Disambiguation. It's been a subject of research for many years, and while probably not the most popular problem it remains a topic of active research. Even now, just picking the most common sense of a word is a strong baseline.
Word embeddings may be useful but their use is orthogonal to what you're trying to do here.
Here's a bit of example code from pywsd, a Python library with implementations of some classical techniques:
>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> answer = simple_lesk(sent, ambiguous, pos='n')
>>> print answer
Synset('depository_financial_institution.n.01')
>>> print answer.definition()
'a financial institution that accepts deposits and channels the money into lending activities'
The methods are mostly kind of old and I can't speak for their quality but it's a good starting point at least.
Word senses are usually going to come from WordNet.
I don't know how useful this is but from my POV, word vector embeddings are naturally separated and the position in the sample space is closely related to different uses of the word. However like you said often a word may be used in several contexts.
To Solve this purpose, generally encoding techniques that utilise the context like continuous bag of words, or continous skip gram models are used for classification of the usage of word in a particular context like change for either exchange or adjust. This very idea is applied in LSTM based architectures as well or RNNs where the context is preserved over input sequences.
The interpretation of word-vectors isn't practical from a visualisation point of view, but only from 'relative distance' point of view with other words in the sample space. Another way is to maintain a matrix of the corpus with contextual uses being represented for the words in that matrix.
In fact there's a neural network that utilises bidirectional language model to first predict the upcoming word then at the end of the sentence goes back and tries to predict the previous word. It's called ELMo. You should go through the paper.ELMo Paper and this blog
Naturally the model learns from representative examples. So the better training set you give with the diverse uses of the same word, the better model can learn to utilise context to attach meaning to the word. Often this is what people use to solve their specific cases by using domain centric training data.
I think these could be helpful:
Efficient Estimation of Word Representations in
Vector Space
Pretrained language models like BERT could be useful for this as mentioned in another answer. Those models generate a representation based on the context.
The recent pretrained language models use wordpieces but spaCy has an implementation that aligns those to natural language tokens. There is a possibility then for example to check the similarity of different tokens based on the context. An example from https://explosion.ai/blog/spacy-transformers
import spacy
import torch
import numpy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")
print(apple1[0].similarity(apple2[0])) # 0.73428553
print(apple1[0].similarity(apple3[0])) # 0.43365782

Does summing up word embedding vectors in ML destroy their meaning?

For example, I have a paragraph which I want to classify in a binary manner. But because the inputs have to have a fixed length, I need to ensure that every paragraph is represented by a uniform quantity.
One thing I've done is taken every word in the paragraph, vectorized it using GloVe word2vec and then summed up all of the vectors to create a "paragraph" vector, which I've then fed in as an input for my model. In doing so, have I destroyed any meaning the words might have possessed? Considering these two sentences would have the same vector:
"My dog bit Dave" & "Dave bit my dog", how do I get around this? Am I approaching this wrong?
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
I want to be able to train a model that can classify text accurately.
Surprisingly, I'm getting a high (>90%) for a relatively simple model like RandomForestClassifier just by using this summing up method. Any insights?
Edit: One suggestion I have received is to instead featurize my data as a 2D array where each word is a column, on which a CNN could work. Another suggestion I received was to use transfer learning through the huggingface transformer to get a vector for the whole paragraph. Which one is more feasible?
I want to be able to train a model that can classify text accurately. Surprisingly, I'm getting a high (>90%) for a relatively simple model like RandomForestClassifier just by using this summing up method. Any insights?
If you look up papers on aggregating word embeddings you'll find out that this in fact occurs sometimes, especially if the texts are shorter.
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
Have you tried keyword extraction? It can alleviate some of the problems with averaging
In doing so, have I destroyed any meaning the words might have
possessed?
As you remarked, you throw out information on word order. But that's not even the worst part: most of the times for longer documents if you embed everything the mean will get dominated by common words ("how", "like", "do" et c). BTW see my answer to this question
Other than that, one trick I've seen is to average word vectors, but subtract first principal component of PCA on word embedding matrix. For details you can see for example this repo which also links to the paper (BTW this paper suggests you can ignore "Smooth Inverse Frequency" stuff since principal component reduction does the useful part).

how to use both word2vec and RNN for NLP together?

I recently studied and understood how word2vec works, it is responsible to convert words into numerical form so when we plot them or put them in the world space they will be spread and reveal the relationship between every word and the other.
my question here, I found also RNNs and suddenly I became confused. Is word2vec an alternative to RNNs or I can use word2vec to transfer the words to numeric form and then use them on RNNs ?
I mean both of them predict the next word, so I want to know if they are the different approaches for the same problem or I can use them both together ?
NOTE: I finished computer vision and started in NLP so please don't judge my question I am just starting, thanks in advance.
You did not understood the meaning of word2vec clearly. word2vec is a representation of words in a multi-dimensional space while RNN is an algorithm like Linear Regression or random forest or logistic regression. word2vec do NOT predicts the next words. Here is a small explanation of word2vec:
Take three words: apple,orange and car. Suppose they are represented in word2vec as:
apple = [0.01, 0.04 ...] orange = [0.02, 0.06 ...] car = [0.03, 0.09 ...]
Now you know apple and orange are similar to each other while car is not. So if you will take the dot product of apple and orange, the result value will be close to 1, say it's 0.85 but if you take the dot product of apple and car, the result will be far from 1 say it's 0.25. This is the concept of word2vec. It gives you a vector representation of words in a numerical form such that similar words are kept near to each other in the graph.
Now for RNN, as I said, it's an algorithm. You will feed some numerical data to it and it'll give you some output. You need to learn RNN in detail from some online tutorial.
To answer your question about how to use them together, RNN takes numerical inputs. It can't take English words directly. So we need to convert all words into some kind of numerical form. This is where word2vec comes into picture. You take each word, get it's numerical representation from word2vec (like I showed above for apple, orange and car) and then feed it to the RNN.
This is just a simple overview and it's not possible to explain everything here. If you really want to learn more then I would strongly suggest you to take this course. Everything from word2vec to RNN is explained beautifully there. It would be even better if you complete the whole specialisation there instead of completing only this course.

Am I using word-embeddings correctly?

Core question : Right way(s) of using word-embeddings to represent text ?
I am building sentiment classification application for tweets. Classify tweets as - negative, neutral and positive.
I am doing this using Keras on top of theano and using word-embeddings (google's word2vec or Stanfords GloVe).
To represent tweet text I have done as follows:
used a pre-trained model (such as word2vec-twitter model) [M] to map words to their embeddings.
Use the words in the text to query M to get corresponding vectors. So if the tweet (T) is "Hello world" and M gives vectors V1 and V2 for the words 'Hello' and 'World'.
The tweet T can then be represented (V) as either V1+V2 (add vectors) or V1V2 (concatinate vectors)[These are 2 different strategies] [Concatenation means juxtaposition, so if V1, V2 are d-dimension vectors, in my example T is 2d dimension vector]
Then, the tweet T is represented by vector V.
If I follow the above, then My Dataset is nothing but vectors (which are sum or concatenation of word vectors depending on which strategy I use).
I am training a deepnet such as FFN, LSTM on this dataset. But my results arent coming out to be great.
Is this the right way to use word-embeddings to represent text ? What are the other better ways ?
Your feedback/critique will be of immense help.
I think that, for your purpose, it is better to think about another way of composing those vectors. The literature on word embeddings contains examples of criticisms to these kinds of composition (I will edit the answer with the correct references as soon as I find them).
I would suggest you to consider also other possible approaches, for instance:
Using the single word vectors as input to your net (I do not know your architecture, but the LSTM is recurrent so it can deal with sequences of words).
Using a full paragraph embedding (i.e. https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
Summing them doesn't make any sense to be honest, because on summing them you get another vector which i don't think represents the semantics of "Hello World" or may be it does but it won't surely hold true for longer sentences in general
Instead it would be better to feed them as sequence as in that way it at least preserves sequence in meaningful way which seems to fit more to your problem.
e.g A hates apple Vs Apple hates A this difference would be captured when you feed them as sequence into RNN but their summation will be same.
I hope you get my point!

Feature extraction from a single word

Usually one wants to get a feature from a text by using the bag of words approach, counting the words and calculate different measures, for example tf-idf values, like this: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number documents?
I will not make the implementation in English, so I can't use databases.
hmm,feature extraction (e.g. tf-idf) on text data are based on statistics. On the other hand, you are looking for sense (semantics). Therefore no such a method like tf-idef will work for you.
In NLP exists 3 basic levels:
morphological analyses
syntactic analyses
semantic analyses
(higher number represents bigger problems :)). Morphology is known for majority languages. Syntactic analyses is a bigger problem (it deals with things like what is verb, noun in some sentence,...). Semantic analyses has the most challenges, since it deals with meaning which is quite difficult to represent in machines, have many exceptions and are language-specific.
As far as I understand you want to know some relationships between words, this can be done via so-called dependency tree banks, (or just treebank): http://en.wikipedia.org/wiki/Treebank . It is a database/graph of sentences where a word can be considered as a node and relationship as arc. There is good treebank for czech language and for english there will be also some, but for many 'less-covered' languages it can be a problem to find one ...
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y. So Y = U*D*transpose(V)), U is IxI, D is diagonal IxJ, and V is JxJ.
You can use (V_1j, V_2j, V_3j, V_4j) as a feature vector in R^4 for the jth word.
I am surprised the previous answers haven't mentioned word embedding. Word embedding algorithm can produce word vectors for each word a given dataset. These algorithms can nfer word vectors from the context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" is somehow related. Because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied from the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and completely unsupervised.
A package could be find in gensim and tensorflow.

Resources