I have 50 files, each containing the sentences of one ambiguous word (a word that has one spelling but more than one meaning); this is a word sense disambiguation project. Some files have 2 senses, some 3 and some 4. I disambiguated them with the Naive Bayes algorithm, and now I have to calculate the F-measure. Because this is a multi-class classification task, I have to calculate the macro average and the micro average. My question is: is calculating just one of them sufficient and scientifically sound, or should I calculate both? And after calculating the F-measure for each file (each ambiguous word), how should I calculate a total F-measure over all 50 words so that I have just one number at the end? (Is that necessary, or should I just calculate it for each word separately and end up with 50 F-measures?) I need this for my thesis, so I want a scientifically correct answer from an expert. Thanks.
After calculating the macro or micro average for each word, one way to obtain an F-measure for all of the words is a weighted average: multiply each word's macro (or micro) average by that word's number of senses, sum these products over all 50 words, and divide by the sum of the words' sense counts.
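A minimal sketch of both steps with scikit-learn, assuming you have gold and predicted sense labels for each word; the labels, per-word scores and sense counts below are made-up numbers, and the sense-weighted aggregate simply mirrors the suggestion above (an unweighted mean over the 50 words is the other common choice):

    from sklearn.metrics import f1_score

    def evaluate_word(y_true, y_pred):
        """Per-word scores: macro averages F1 over the word's senses equally,
        micro pools all of that word's classification decisions."""
        return (f1_score(y_true, y_pred, average="macro"),
                f1_score(y_true, y_pred, average="micro"))

    def corpus_f1(per_word_scores, sense_counts):
        """Sense-weighted mean over all ambiguous words."""
        return (sum(s * n for s, n in zip(per_word_scores, sense_counts))
                / sum(sense_counts))

    # One ambiguous word with three senses (illustrative labels)
    y_true = ["sense1", "sense2", "sense1", "sense3"]
    y_pred = ["sense1", "sense2", "sense2", "sense3"]
    print(evaluate_word(y_true, y_pred))

    # Illustrative per-word macro scores for three words with 2, 3 and 4 senses
    macro_scores = [0.81, 0.74, 0.69]
    sense_counts = [2, 3, 4]
    print(corpus_f1(macro_scores, sense_counts))  # ~0.733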
I have read many documents about training an n-gram model with MLE, but as far as I can tell every implementation just calculates the conditional probabilities by counting the n-grams. My question is: what is the relationship with MLE?
Intuitively, you would have to count all the n-grams in all texts in the world to compute their probabilities. Since this is highly unrealistic, MLE provides a way to estimate these n-gram probabilities by counting them in the given corpus.
For instance, if you need the bigram probability of a word y following a word x, you count the number of times they occur as a pair, count(x, y). Then you have to normalize this count by dividing it by the sum of the counts of all bigrams starting with x (i.e. x being followed by every possible word w), Σ_w count(x, w), so that the MLE estimate ultimately lies between 0 and 1.
Therefore, this bigram probability can be estimated by the following expression:

P(y | x) = count(x, y) / Σ_w count(x, w)

Note that this expression can be further simplified, because the sum of all bigram counts starting with x must add up to the unigram count of x itself:

P(y | x) = count(x, y) / count(x)
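A minimal sketch of that estimate in Python; the toy corpus is made up, and real code would also need to handle sentence boundaries and unseen bigrams (smoothing):

    from collections import Counter

    corpus = "the cat sat on the mat the cat ate".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_mle(y, x):
        """MLE estimate of P(y | x) = count(x, y) / count(x)."""
        return bigrams[(x, y)] / unigrams[x]

    print(p_mle("cat", "the"))  # count(the, cat) / count(the) = 2 / 3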
I would like to use word2vec to transform 'words' into numerical vectors and possibly make predictions for new words. I've tried extracting features from words manually and training a linear regression model (using Stochastic Gradient Descent), but this only works to an extent.
The input data I have is:
Each word is associated with a numerical value. You can think of this value as being the word's coordinate in 1D space.
For each word I can provide a distance to any other word (because I have the words' coordinates).
Because of this I can provide the context for each word: given a distance, I can list all the other words within this distance of the target one.
Words are composed of Latin letters only (e.g. AABCCCDE, BKEDRRS).
Words almost never repeat, but their structural elements repeat a lot within different words.
Words can be of different length (say 5-50 letters max).
Words have common features: some subsequences occur multiple times across different words (e.g. certain doublets or triplets of letters, their position within a word, etc.).
The question:
Is there an implementation of word2vec which allows provision of your own distances and context for each word?
A big bonus would be if the trained model could spit out the predicted coordinate for any word you feed in after training.
Preferably in Java; Python is also fine, but in general anything will do.
I am also not restricting myself to word2vec, it just seems like a good fit, but my knowledge of machine learning and data mining is very limited, so I might be missing a better way to tackle the problem.
PS: I know about deeplearning4j, but I haven't looked around the code enough to figure out if what I want to do is easy to implement in it.
Example of data: (typical input contains thousands to tens of thousands of words)
ABCD 0.50
ABCDD 0.51
ABAB 0.30
BCDAB 0.60
DABBC 0.59
SPQTYRQ 0.80
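There is no built-in word2vec option for supplying your own distances, but a rough workaround is to turn each word's distance-based neighbourhood into a pseudo-sentence and feed those to gensim's Word2Vec (4.x API assumed). The radius value and the follow-up regression to a coordinate are my own illustrative choices, not anything from an existing implementation:

    from gensim.models import Word2Vec
    from sklearn.linear_model import SGDRegressor

    # Toy version of the data above: word -> 1D coordinate
    data = {"ABCD": 0.50, "ABCDD": 0.51, "ABAB": 0.30,
            "BCDAB": 0.60, "DABBC": 0.59, "SPQTYRQ": 0.80}

    # One pseudo-sentence per word: the word plus all neighbours within `radius`
    radius = 0.05
    contexts = [[w] + [o for o, c in data.items()
                       if o != w and abs(c - data[w]) <= radius]
                for w in data]

    model = Word2Vec(contexts, vector_size=16, window=10, min_count=1, epochs=50)

    # Predict the coordinate from the learned embedding with a separate regressor
    X = [model.wv[w] for w in data]
    y = list(data.values())
    reg = SGDRegressor(max_iter=1000).fit(X, y)
    print(reg.predict([model.wv["ABCD"]]))

Note that this only covers words seen during training: plain word2vec has no vector for an unseen word, so generalising to new words would require character-level features (e.g. letter n-grams) on top of, or instead of, this setup.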
I am extracting features for a document. One of the features is the frequency of a word in the document. The problem is that the number of sentences in the training set and the test set is not necessarily the same, so I need to normalize it in some way. One possibility (that came to my mind) was to divide the frequency of the word by the number of sentences in the document, but my supervisor told me that it's better to normalize it in a logarithmic way. I have no idea what that means. Can anyone help me?
Thanks in advance,
PS: I also saw this topic, but it didn't help me.
The first question to ask is: which algorithm are you using afterwards? For many algorithms it is sufficient to normalize the bag-of-words vector so that it sums to one, or so that some other norm equals one.
Instead of normalizing by the number of sentences you should, however, normalize by the total number of words in the document; your test corpus might have longer sentences, for example.
I assume your supervisor's recommendation means that you report not the raw word counts but the logarithm of the counts. In addition, I would suggest looking into the TF-IDF measure in general; in my opinion it is more common in text mining.
'normalize it in a logarithmic way' probably simply means to replace the frequency feature by log(frequency).
One reason why taking the log might be useful is the Zipfian nature of word occurrences.
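A tiny sketch of the two normalizations discussed above; the counts are made-up numbers, and log1p is used so that a zero count stays zero:

    import math

    raw_count = 17      # occurrences of the word in the document
    total_words = 2500  # length of the document in tokens

    relative_freq = raw_count / total_words  # normalize by document length
    log_count = math.log1p(raw_count)        # "logarithmic" normalization, dampens Zipfian skew
    print(relative_freq, log_count)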
Yes, there is a logarithmic way; it's called TF-IDF.
TF-IDF is the product of the term frequency and the inverse document frequency.
TF-IDF =
(number of times your word appears in the present document ÷ total number of words in the present document) ×
log(total number of documents in your collection ÷ number of documents in your collection that contain your word)
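As a small made-up example: a word that occurs 3 times in a 100-word document and appears in 10 of the collection's 1,000 documents gets TF-IDF = (3 / 100) × log(1000 / 10) = 0.03 × 2 = 0.06 with a base-10 logarithm (other log bases just rescale the scores).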
If you use Python, there is a nice library called Gensim that contains the algorithm, but your data must be built with a Dictionary from gensim.corpora.
You can find an example here: https://radimrehurek.com/gensim/models/tfidfmodel.html
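A minimal sketch along the lines of that tutorial; the toy documents are made up, and TfidfModel expects the bag-of-words corpora produced by a gensim Dictionary, as noted above:

    from gensim import corpora, models

    docs = [["human", "machine", "interface"],
            ["graph", "of", "trees"],
            ["human", "graph", "survey"]]

    dictionary = corpora.Dictionary(docs)               # maps tokens to integer ids
    bow_corpus = [dictionary.doc2bow(d) for d in docs]  # bag-of-words per document

    tfidf = models.TfidfModel(bow_corpus)
    print(tfidf[bow_corpus[0]])  # [(term_id, tf-idf weight), ...]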
TF-IDF helps to normalize -> compare the results with the tf and tf-idf weighting arguments:
    dtm <- DocumentTermMatrix(corpus); dtm
    <<DocumentTermMatrix (documents: …, terms: …)>>
    Non-/sparse entries: 27316/97548
    Sparsity           : 78%
    Maximal term length: 22
    Weighting          : term frequency (tf)

    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf)); dtm
    <<DocumentTermMatrix (documents: …, terms: …)>>
    Non-/sparse entries: 24052/100812
    Sparsity           : 81%
    Maximal term length: 22
    Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
I am using a classification technique for multi-document extractive text summarization. I have calculated F-measure, recall, precision and accuracy. What would be the ideal metric here to evaluate the summaries generated by this method?
ROUGE calculates Recall, Precision and F-measure for a variety of metrics: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S. Here is the paper for ROUGE.
ROUGE-N is the number of matching n-grams divided by the total number of n-grams (counted in the reference summary for recall, and in the candidate summary for precision).
ROUGE-L looks at the longest common subsequence of the two texts; a subsequence may contain gaps, so that 1,3,5 is a subsequence of 1,2,3,4,5.
ROUGE-W also scores longest common subsequences, but gives a higher weight to subsequences with fewer gaps.
ROUGE-S uses skip-bigrams: a skip-bigram is any pair of words that appear in sentence order, i.e. they do not have to be consecutive.
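A bare-bones sketch of the ROUGE-N computation (without the stemming and stopword options of the official toolkit); the candidate and reference sentences are made up:

    from collections import Counter

    def rouge_n(candidate, reference, n=1):
        """Precision, recall and F1 over overlapping n-grams."""
        cand = Counter(zip(*[candidate[i:] for i in range(n)]))
        ref = Counter(zip(*[reference[i:] for i in range(n)]))
        overlap = sum((cand & ref).values())
        precision = overlap / max(sum(cand.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return precision, recall, f1

    candidate = "the cat was found under the bed".split()
    reference = "the cat was under the bed".split()
    print(rouge_n(candidate, reference, n=2))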
I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the context word c and the target word w. The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
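A toy numpy sketch of that idea for a single (target, context) pair: instead of normalising over every context in the vocabulary, the observed context is scored against k randomly drawn ones. The vector dimension, the random vectors and k are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, k = 50, 5

    v_w = rng.normal(size=dim)         # target word, e.g. "cat"
    v_c = rng.normal(size=dim)         # observed context, e.g. "food"
    v_neg = rng.normal(size=(k, dim))  # k randomly sampled "negative" contexts

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    # Negative-sampling objective for this pair (to be maximised by training):
    objective = np.log(sigmoid(v_c @ v_w)) + np.sum(np.log(sigmoid(-v_neg @ v_w)))
    print(objective)  # its gradients update only these k + 1 context vectors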
Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in the vocabulary V (the denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. They can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, do away with the softmax layer completely and instead optimise some other loss function that approximates the softmax (they approximate the normalization in the softmax's denominator with some other loss that is cheap to compute, such as negative sampling).
The loss function in word2vec is something like the negative log of the softmax:

J = -log [ exp(v_c · v_w) / Σ_{i ∈ V} exp(v_ci · v_w) ]

which the logarithm decomposes into:

J = -v_c · v_w + log Σ_{i ∈ V} exp(v_ci · v_w)

With some math and the gradient formulas (see the references below) it is converted to:

J = -log σ(v_c · v_w) - Σ_{i=1}^{k} log σ(-v_ci · v_w)

As you can see, it has been converted into a binary classification task (y=1: positive class, y=0: negative class). Since we need labels to perform this binary classification, we designate all observed context words c as true labels (y=1, positive samples) and k words randomly selected from the corpus as false labels (y=0, negative samples).
As an example, assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are treated as positive labels. We also need some negative labels, so we randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and treat them as negative samples. This technique of picking random examples from the corpus is called negative sampling.
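A small sketch of how those negatives are typically drawn in word2vec: from the unigram distribution raised to the 3/4 power (the exponent used in the original implementation). The toy vocabulary and counts below are made up:

    import numpy as np

    vocab = ["the", "produce", "software", "Collobert", "margin-based", "probabilistic"]
    counts = np.array([5000, 120, 300, 4, 10, 25], dtype=float)

    # Noise distribution: unigram frequencies raised to the 3/4 power, renormalised
    noise_dist = counts ** 0.75
    noise_dist /= noise_dist.sum()

    rng = np.random.default_rng(0)
    k = 5
    negatives = rng.choice(vocab, size=k, replace=True, p=noise_dist)
    print(negatives)  # k negative samples for the current (target, context) pair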
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
Note that T is the number of all words in the vocabulary; it is equivalent to V. In other words, T = V.
The probability distribution p(w_t+j | w_t) in SG is computed for all V words in the vocabulary with the softmax:

p(w_t+j | w_t) = exp(c_{w_t+j} · w_t) / Σ_{i=1}^{V} exp(c_i · w_t)

where the c_i are the output (context) vectors and w_t is the input vector of the centre word. V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires an extra V computations.
On the other hand, the probability in SGNS is computed with sigmoids over one positive and K negative samples:

σ(c_pos · w) · Π_{c_neg ∈ W_neg} σ(-c_neg · w)

c_pos is the word vector of the positive (observed context) word, and W_neg contains the word vectors of all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute a normalization factor in the denominator.
With SGNS, only a fraction of the weights are updated for each training sample, whereas SG updates all of the (possibly millions of) weights for each training sample.
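A toy numpy comparison of the cost difference just described: the full softmax touches every row of the output matrix, while SGNS touches only K + 1 rows. The matrix sizes, scaling and indices are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    V, dim, K = 50_000, 100, 5

    W_out = rng.normal(size=(V, dim)) * 0.01   # output (context) embeddings
    w_in = rng.normal(size=dim) * 0.01         # input vector of the centre word
    pos = 42                                   # id of the observed context word
    neg = rng.integers(0, V, size=K)           # ids of K sampled negative words

    # Vanilla SG: the normalization needs all V dot products and exponentials
    scores = np.exp(W_out @ w_in)
    p_full = scores[pos] / scores.sum()

    # SGNS: only K + 1 dot products, each squashed independently by a sigmoid
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p_pos = sigmoid(W_out[pos] @ w_in)          # p(D=1 | w, c_pos)
    p_neg = sigmoid(-(W_out[neg] @ w_in))       # p(D=0 | w, c_neg) for each negative
    print(p_full, p_pos, p_neg)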
How does SGNS achieve this? -> by transforming multi-classification task into binary classification task.
With SGNS, word vectors are no longer learned by predicting the context words of a centre word. Instead, the model learns to differentiate the actual context words (positive) from randomly drawn words (negative) taken from a noise distribution.
In real life, you don't usually observe a word like regression in the context of random words like Gangnam-Style or pimples. The idea is that if the model can distinguish the likely (positive) pairs from the unlikely (negative) pairs, good word vectors will be learned.
In the above figure, the current positive word-context pair is (drilling, engineer). K = 5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair approaches p(D=1 | w, c_pos) ≈ 1, and the probability for each negative pair approaches p(D=1 | w, c_neg) ≈ 0.