Features in SVM-based Sentiment Analysis

I am unable to convert semantic and lexical information into feature vectors.
I know the following information for each word:
Part-of-speech tag - output of a POS tagger, e.g. adjective, verb
Word sense - output of word sense disambiguation, e.g. bank - financial institution vs. heap
Ontological information - e.g. mammal, location
n-gram - e.g. good-boy
Head word - e.g. act, the root word of acting
My question is how to represent them as real values. Should I just mark the occurrence of each feature (POS, sense, etc.), i.e. use a boolean vector? But then semantic information is lost in the case of n-grams (e.g. "very good boy" and "good boy" have different semantic orientations in sentiment analysis).

There is no single good method for converting nominal values into real-valued vectors. The most common approach is the one you suggested - conversion to boolean (one-hot) vectors. In the case of n-grams I do not see your point. What is your object? You say that you have POS; POS is a feature of a word, while an n-gram has no meaning on the single-word level, but rather represents part of a sentence. Do you mean "the n-gram in which the word appears"? That is exactly the same as a "previous word" feature (or the n-1 previous words), and you do not lose any information then - you simply have k dimensions per "previous" word, where k is the size of the vocabulary. Keep in mind that this representation will be huge.
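For illustration, here is a minimal sketch of that boolean conversion using scikit-learn's DictVectorizer (the feature names and values below are made up, not taken from your pipeline): each distinct (feature, value) pair becomes one boolean dimension.

    # Sketch: turning nominal per-word features (POS tag, word sense,
    # previous word, ...) into a sparse boolean vector. Feature names
    # and values here are purely illustrative.
    from sklearn.feature_extraction import DictVectorizer

    samples = [
        {"pos": "JJ", "sense": "bank/financial", "prev_word": "very", "word": "good"},
        {"pos": "NN", "sense": "bank/heap",      "prev_word": "good", "word": "boy"},
    ]

    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(samples)          # one boolean column per (feature, value) pair
    print(vec.get_feature_names_out())      # e.g. ['pos=JJ', 'pos=NN', 'prev_word=good', ...]
    print(X.toarray())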

Related

Word2Vec Data Setup

In the Word2Vec Skip-gram setup that follows, what is the data setup for the output layer? Is it a matrix that is zero everywhere but with a single "1" in each of the C rows - that represents the words in the C context?
Added to clarify the data setup question:
Meaning, what would the dataset that is presented to the NN look like? Let's consider this to be "what does a single training example look like?". I assume the total input is a matrix where each row is a word in the vocabulary (and there is a column for each word as well, with every cell zero except the one for that specific word - one-hot encoded). Thus, a single training example is 1xV as shown below (all zeros except for the specific word, whose value is 1). This aligns with the picture above in that the input is V-dim. I expected the total input matrix to have duplicated rows, however - the same one-hot encoded vector repeated for each time the word was found in the corpus (as the output or target variable would be different).
The output (target) is more confusing to me. I expected it would exactly mirror the input - a single training example has a "multi"-hot encoded vector that is zero except for a "1" in C of the cells, denoting that a particular word was in the context of the input word (C = 5 if we are looking, for example, 2 words behind and 3 words ahead of the given input word instance). The picture doesn't seem to agree with this, though. I don't understand what appear to be C different output layers that share the same W' weight matrix.
The skip-gram architecture has word embeddings as its output (and its input). Depending on its precise implementation, the network may therefore produce two embeddings per word (one embedding for the word as an input word, and one embedding for the word as an output word; this is the case in the basic skip-gram architecture with the traditional softmax function), or one embedding per word (this is the case in a setup with the hierarchical softmax as an approximation to the full softmax, for example).
You can find more information about these architectures in the original word2vec papers, such as "Distributed Representations of Words and Phrases and their Compositionality" by Mikolov et al.
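To make the data setup concrete, here is a minimal sketch (not taken from the papers; the corpus, window size and one-hot encoding are illustrative assumptions) of how skip-gram training pairs can be laid out: each center word yields C separate (1xV input, 1xV target) pairs rather than one multi-hot target, which is consistent with the diagram showing C output panels that share the same W' matrix.

    # Sketch: building skip-gram (center, context) training pairs with a
    # window of 2 words behind / 2 ahead. Each pair is one 1-of-V input
    # vector and one 1-of-V target vector; a center word with C context
    # words therefore produces C pairs.
    import numpy as np

    corpus = "he is a clever guy he is a smart guy".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    def one_hot(i, size):
        v = np.zeros(size)
        v[i] = 1.0
        return v

    window = 2
    pairs = []
    for pos, center in enumerate(corpus):
        for off in range(-window, window + 1):
            ctx = pos + off
            if off != 0 and 0 <= ctx < len(corpus):
                pairs.append((one_hot(idx[center], V), one_hot(idx[corpus[ctx]], V)))

    print(len(pairs), "training pairs, each a (1xV input, 1xV target) tuple")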

NLP - Word Representations

I am working on a word representation algorithm, similar to Word2Vec and GloVe. I have been asked to make it more dynamic, such that new words can be added to the vocabulary and new documents can be submitted to the program even after the representations (vectors) have been created.
The problem is, how do I know if my representations work? How do I know if they actually capture the meaning of each word? How do I compare my representation with other existing vector space models?
As of now, I am doing the following tests to check the quality of my word vectors:
Distance test:
Does the cosine distance between vectors reflect the semantic distance between words?
Analogy test:
Can the representation be used to solve problems like "King is to queen what man is to ________" (the answer should be woman)?
Picking the odd one out:
Can the vectors be used to pick the odd word out of a given list of words? If the input is {"cat", "dog", "phone"}, the output should be "phone".
What are the other tests that I should do to check the quality of the vectors? What other tasks are word vectors expected to be capable of doing? Is there a benchmark for vector space models?
Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings.
In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates plots, gives correlations with word pair similarity rankings, and compares your embeddings with pre-trained vectors from previous research. You can find a more detailed description in the accompanying paper.
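If it helps, here is a minimal sketch of how the three tests above can be run with gensim's KeyedVectors, assuming you can export your vectors in the word2vec text format ("my_vectors.txt" is a placeholder path):

    # Sketch: distance, analogy and odd-one-out tests with gensim.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("my_vectors.txt", binary=False)

    # Distance test: cosine similarity between related words
    print(wv.similarity("milk", "cream"))

    # Analogy test: king is to queen as man is to ? (expect "woman")
    print(wv.most_similar(positive=["queen", "man"], negative=["king"], topn=1))

    # Odd-one-out test (expect "phone")
    print(wv.doesnt_match(["cat", "dog", "phone"]))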

Feature extraction from a single word

Usually one extracts features from a text using the bag-of-words approach: counting the words and calculating different measures, for example tf-idf values, as in: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number documents?
I will not be implementing this in English, so I can't use existing lexical databases.
Hmm, feature extraction methods (e.g. tf-idf) on text data are based on statistics. You, on the other hand, are looking for sense (semantics). Therefore no method like tf-idf will work for you.
In NLP there are 3 basic levels of analysis:
morphological analysis
syntactic analysis
semantic analysis
(a higher level means bigger problems :)). Morphology is well handled for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like identifying the verb or the noun in a sentence, ...). Semantic analysis has the most challenges, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions, and is language-specific.
As far as I understand, you want to know some relationships between words. This can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and there are some for English as well, but for many less-covered languages it can be a problem to find one.
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y, so that Y = U*D*transpose(V), where U is IxI, D is a diagonal IxJ matrix, and V is JxJ.
You can use (V_j1, V_j2, V_j3, V_j4), i.e. the first four entries of the jth row of V, as a feature vector in R^4 for the jth word.
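A minimal sketch of this recipe in Python/numpy (the three toy documents are placeholders for, e.g., Wikipedia articles):

    # Sketch: word counts -> log(1 + count) -> SVD, then take a few
    # columns of V as word features.
    import numpy as np

    docs = ["potatoes are used to make french fries",
            "milk is used to make cream",
            "the stone is hard"]
    vocab = sorted({w for d in docs for w in d.split()})
    j_of = {w: j for j, w in enumerate(vocab)}

    X = np.zeros((len(docs), len(vocab)))      # x_ij = count of word j in doc i
    for i, d in enumerate(docs):
        for w in d.split():
            X[i, j_of[w]] += 1

    Y = np.log1p(X)                            # y_ij = ln(1 + x_ij)
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)

    k = 2                                      # keep the first k singular directions
    word_features = Vt[:k].T                   # row j = feature vector for the jth word
    print(word_features[j_of["milk"]])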
I am surprised the previous answers haven't mentioned word embeddings. Word embedding algorithms produce a word vector for each word in a given dataset, and they infer these vectors from context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" are somehow related, because the context is almost the same.
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied from the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and completely unsupervised.
Implementations can be found in gensim and TensorFlow.
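As a rough sketch (gensim 4.x API; the two-sentence corpus is only a placeholder for a real tokenized corpus), training and querying looks like this:

    # Sketch: training skip-gram word vectors with gensim's Word2Vec.
    # gensim 4.x uses vector_size=; older versions use size= instead.
    from gensim.models import Word2Vec

    sentences = [["he", "is", "a", "clever", "guy"],
                 ["he", "is", "a", "smart", "guy"]]

    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
    print(model.wv.most_similar("clever", topn=3))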

Naive Bayes, not so Naive?

I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0, and for "LC" I get 0.002. Since the features are naive, shouldn't L, C and D contribute to the overall probability independently, so that "L" has some probability, "LC" some more, and "LCD" even more?
At the same time, the same experiment with an MLP, instead of showing the above behaviour, gives probabilities of 0.006, 0.5 and 0.8.
So the MLP does what I would expect the Naive Bayes to do, and vice versa. Am I missing something? Can anyone explain these results?
I am not familiar with the internals of WEKA, so please correct me if you think I am not right.
When using a text as a "feature", the text is transformed into a vector of binary values. Each value corresponds to one concrete word, and the length of the vector is equal to the size of the dictionary.
If your dictionary contains 4 words: LCD, VHS, HELLO, WORLD
then, for example, the text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds its dictionary, but I think it might go over all the words present in the examples. If "L" is not present in the dictionary (and therefore not present in the examples), then its probability is logically 0. It should not even be considered a feature.
You cannot really reason over the probabilities of the individual features in this way, and you cannot add them together; I think there is no such relationship between the features.
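To illustrate with scikit-learn (as an analogue of what a word-level vectorizer such as Weka's StringToWordVector does, not WEKA's actual code path): the vocabulary contains whole tokens, so "L" or "LC" never becomes a feature unless it occurs as a token itself.

    # Sketch: word-level binary bag-of-words over a fixed 4-word dictionary.
    from sklearn.feature_extraction.text import CountVectorizer

    vec = CountVectorizer(vocabulary=["lcd", "vhs", "hello", "world"], binary=True)
    print(vec.transform(["HELLO LCD"]).toarray())   # -> [[1 0 1 0]]
    print(vec.transform(["L LC"]).toarray())        # -> [[0 0 0 0]]: not in the dictionary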
Beware that in text mining, words (letters in your case) may be given weights different from their actual counts if you are using any sort of term weighting and normalization, e.g. tf-idf. In the case of tf-idf, for example, character counts are converted to a logarithmic scale, and characters that appear in every single instance may be penalized through idf normalization.
I am not sure which options you are using to convert your data into Weka features, but you can see here that Weka has parameters for exactly this kind of weighting and normalization:
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1+fij),
where fij is the frequency of word i in the jth document (instance).
-I
Transform each word frequency into
fij*log(number of documents / number of documents containing word i),
where fij is the frequency of word i in the jth document (instance).
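As a rough illustration of what those two options compute (a toy numpy sketch, not Weka's implementation):

    # Sketch of the -T and -I transforms on a toy term-frequency matrix
    # (rows = documents/instances, columns = words).
    import numpy as np

    f = np.array([[3.0, 0.0],
                  [1.0, 2.0]])

    log_tf = np.log1p(f)                    # -T: log(1 + frequency)
    df = (f > 0).sum(axis=0)                # number of documents containing each word
    idf = f * np.log(f.shape[0] / df)       # -I: frequency * log(N / document frequency)
    print(log_tf)
    print(idf)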
I checked the Weka documentation and I didn't see support for extracting letters as features. This implies the Weka function may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter words, which would explain why they were not found.
If you think this is the case, you could try splitting the text into single characters delimited by \n or space prior to ingestion.
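A minimal sketch of that pre-processing step (a plain Python pre-pass before handing the text to Weka):

    # Sketch: re-delimit a string so each character becomes its own token.
    text = "LCD VHS"
    chars_as_tokens = " ".join(ch for ch in text if not ch.isspace())
    print(chars_as_tokens)   # "L C D V H S"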

Sentence classification using Weka

I want to classify sentences with Weka. My features are the sentence terms (words) and the part-of-speech tag of each term. I don't know how to define the attributes, because if each term is represented as one feature, the number of features differs from one instance (sentence) to another. And if all the words in a sentence are represented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
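As an illustration of the word_POS idea (sketched with scikit-learn rather than Weka; the POS tags below are hand-written for illustration, not from a real tagger):

    # Sketch: bag-of-words features where each token is word_TAG, so the
    # POS information is folded into the word feature itself.
    from sklearn.feature_extraction.text import CountVectorizer

    tagged_sentences = [
        [("the", "DT"), ("acting", "NN"), ("was", "VBD"), ("good", "JJ")],
        [("good", "JJ"), ("boy", "NN")],
    ]

    docs = [" ".join(f"{w}_{t}" for w, t in s) for s in tagged_sentences]
    vec = CountVectorizer(token_pattern=r"\S+")    # keep word_TAG tokens intact
    X = vec.fit_transform(docs)
    print(vec.get_feature_names_out())             # e.g. ['acting_nn', 'boy_nn', 'good_jj', ...]
    print(X.toarray())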

Resources