How to overcome Drawback of Inverse Document Frequency (IDF) - search-engine

Please tell me how to overcome the problem of negative weighting in IDF. Can someone give a small example?

IDF is defined as N/n(t) where n(t) is the number of documents that a term 't' occurs in and N is the total number of documents in the collection. Sometimes, a log() is applied around this fraction.
Please observe that this fraction N/n(t) is always >= 1. For a word which appears in all documents, a likely case of which is the English word "the", the value of idf is 1. Even if a log is applied around this fraction, the value is always >= zero. (Recall the graph of the log function, which increases monotonically from -inf to +inf, with log(x) < 0 if x < 1, log(1) = 0, and log(x) > 0 if x > 1.)
So, there's no way in which a standard definition of idf can be negative.
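A minimal sketch of the standard definition (documents represented as sets of terms; the helper name is illustrative) shows that the value never drops below zero:

```python
import math

def idf(term, documents):
    """Standard idf: log(N / n_t), never negative because N >= n_t >= 1."""
    N = len(documents)
    n_t = sum(1 for doc in documents if term in doc)
    return math.log(N / n_t)  # assumes the term occurs in at least one document

docs = [{"the", "cat"}, {"the", "dog"}, {"the", "bird"}]
print(idf("the", docs))  # 0.0            -> term appears in every document
print(idf("cat", docs))  # log(3) ~ 1.099 -> rarer term, higher weight
```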

The answer given by Debasis is entirely correct. Negative idf might still result, however, if a small +1 term is added to the divisor to avoid divide-by-zero errors. One source that suggests this is the Wikipedia article on tf-idf. The problem is that the log operation then returns a negative value whenever the occurrence count n(t) equals the number of documents N (i.e. the term appears in all of them). I ran into this issue when implementing tf-idf on a toy problem: with a document count N = 3 and an occurrence count of 3, the idf would ordinarily be 0, but it came out as -0.287682072451781 because the +1 correction term increased the divisor to 4, which is greater than the number of documents. Perhaps this was the culprit behind the negative weights the O.P. experienced. I figured I'd post this in case someone else runs into this initially baffling problem. The fix is simple: remove the +1 term and find another way to avoid divide-by-zero errors.
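To make the arithmetic concrete, here is a small sketch reproducing the pitfall and one alternative; the "smoothed" variant, which adds 1 to both the numerator and the denominator, is an assumption on my part (one common way to avoid division by zero), not necessarily what the O.P. used:

```python
import math

N = 3    # documents in the collection
n_t = 3  # documents containing the term (it appears everywhere)

naive = math.log(N / (n_t + 1))           # -0.2877: the +1 in the divisor pushes it negative
standard = math.log(N / n_t)              #  0.0   : fine as long as n_t > 0
smoothed = math.log((1 + N) / (1 + n_t))  #  0.0   : +1 on both sides, still never negative

print(naive, standard, smoothed)
```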

Related

Is it possible to solve a fractional knapsack including negative values using greedy algorithm?

I have a problem which I think can be converted to a variant of the fractional knapsack problem.
The objective function is in the form of:
$\sum_{i} x_i v_i$
However, my problem differs in that it allows the $v_i$ and $x_i$ to be negative.
I want to prove that this problem can be solved using the greedy algorithm (explained in the link).
I have tested this on many test cases and the greedy algorithm seems to solve it, but I want a definite proof that the greedy algorithm is still applicable given the extra constraint.
In the fractional knapsack problem, you find the Value/Weight ratio of every item that you may put in the knapsack, and sort these items from the best V/W ratio to the worst. You then start with the best ratio and fill the knapsack until it is either full or you run out of that item. If you run out, you head to the next item in the list and fill the knapsack with it. This pattern continues until the knapsack is full. It is greedy because, once we sort this list, we know that we can confidently add the items fractionally in this order and end up with the greatest possible value in the bag.
By allowing the values and "weights" to be negative, as in this problem, however, the greedy algorithm is no longer guaranteed to be optimal. It breaks down because an item can have a negative "weight" and a negative value, which still gives a positive V/W ratio. For example, take the following list of items:
V=-1, W=-1 -> V/W = 1.0
V=.9, W=1 -> V/W = 0.9
V=.8, W=1 -> V/W = 0.8
Following the greedy algorithm, we would want to add as much of item 1 as exists, because it has the best V/W ratio. However, adding item 1 really hurts us in the long run, because we lose more value per unit of weight than we can make up later on. For example, let's assume |W| = 10 for each item and a maximum knapsack weight of 10. By adding all of item 1, we have a weight of -10 and a value of -10. Then we add all of item 2, which results in a weight of 0 and a value of -1. Then we add all of item 3, which results in a weight of 10 and a value of 7.
If instead we had just added all of item 2 from the start, we would have a weight of 10 and a value of 9. This counterexample shows that, when weights and values can be negative, the greedy algorithm is no longer guaranteed to find the optimal solution.
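Here is a rough sketch of that argument in code. The greedy_fractional helper is hypothetical, and it interprets "add as much of item 1 as exists" by taking any item with non-positive weight in full, since such an item never exceeds the remaining capacity:

```python
def greedy_fractional(items, capacity):
    """Greedy by value/weight ratio; items are (value, weight) pairs.
    An item with non-positive weight is always taken whole, because it
    can never exceed the remaining capacity."""
    total_value, remaining = 0.0, capacity
    for value, weight in sorted(items, key=lambda it: it[0] / it[1], reverse=True):
        if weight <= 0:
            fraction = 1.0
        else:
            fraction = min(1.0, remaining / weight)
        total_value += fraction * value
        remaining -= fraction * weight
    return total_value

# The counterexample above, with |W| = 10 per item and capacity 10:
items = [(-10.0, -10.0), (9.0, 10.0), (8.0, 10.0)]  # ratios 1.0, 0.9, 0.8
print(greedy_fractional(items, capacity=10))        # 7.0 in greedy order
# Taking only item 2 (weight 10 fills the knapsack) yields value 9 > 7,
# so the greedy order is not optimal here.
```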

Are duplicates useful in data sets?

I downloaded the Skin Segmentation Data Set and found that it contains a lot of duplicates.
For example, the row 0 128 0 2 is encountered 199 times.
Please supply a few examples of when duplicates are good and when they are evil.
Yes, of course: if the data is a random sample that represents the underlying distribution, then the duplicates tell you that this particular value has a higher probability. Removing duplicates would just render the dataset pretty useless.
It is important.
For example: if row 'a' appears 5 times in your data and another row, 'b', appears only once, then the model will be pushed to classify row 'a' correctly more than 'b', because when you calculate the cost function, row 'a' appears more times and has a bigger influence on the cost.
And if your training data represents the test data well, then there is a high probability that row 'a' will appear more often than row 'b' there too.
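As a rough illustration of that point, keeping duplicates is equivalent to weighting each unique row by its multiplicity in the cost function. The rows, candidate prediction, and squared-error cost below are made up for the sketch:

```python
import numpy as np

# Illustrative rows: 'a' is duplicated 5 times, 'b' appears once.
a = np.array([0.0, 128.0, 0.0])
b = np.array([45.0, 60.0, 200.0])
X_dup = np.array([a] * 5 + [b])

def cost(X, prediction, weights=None):
    """Mean squared error of a single constant prediction, optionally weighted."""
    errors = np.sum((X - prediction) ** 2, axis=1)
    if weights is None:
        return errors.mean()
    return np.average(errors, weights=weights)

prediction = np.array([10.0, 100.0, 30.0])  # some arbitrary candidate prediction

# Cost on the duplicated data equals the weighted cost on the unique rows:
print(cost(X_dup, prediction))                             # duplicates kept
print(cost(np.array([a, b]), prediction, weights=[5, 1]))  # duplicates as weights
```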

NLP: How to correctly normalise a feature for gender classification?

NOTE: Before I begin, this F-measure is not related to precision and recall; its name and definition are taken from this paper.
I have a feature known as the F-measure, which is used to measure formality in a given text. It is mostly used in gender classification of text, which is what I'm working on as a project.
The F-measure is defined as:
F = 0.5 * (noun freq. + adjective freq. + preposition freq. + article freq. - pronoun freq. - verb freq. - adverb freq. - interjection freq. + 100)
where the frequencies are taken from a given text (for example, a blog post).
I would like to normalise this feature for use in a classification task. My first thought was that, since the value of F is bounded by the number of words in the given text (text_length), I could divide F by text_length. Secondly, since this measure can take on both positive and negative values (as can be inferred from the equation), I thought of squaring (F/text_length) to get only positive values.
Trying this, I found that the normalised values did not seem right: I started getting really small values (below 0.10) for all the cases I tested the feature with. I think the reason is that I am squaring the value, which essentially makes it smaller since it is the square of a fraction, yet this seems required if I want to guarantee positive values only. I am not sure what else to consider to improve the normalisation so that a nice distribution within [0,1] is produced, and I would like to know if there is some kind of strategy for correctly normalising NLP features.
How should I approach the normalisation of my feature, and what might I be doing wrong?
If you carefully read the article, you'll find that the measure is already normalized:
F will then vary between 0 and 100%
The reason for this is that "frequencies" in the formula are calculated as follows:
The frequencies are here expressed as percentages of the number of words belonging to a particular category with respect to the total number of words in the excerpt.
That is, you should normalize each frequency by the total number of words (just as you suggested), but don't forget to multiply each one by 100 afterwards.
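A minimal sketch of that computation, assuming the part-of-speech counts for one text are already available from a tagger (the tag names and counts below are illustrative, and every word is assumed to fall into one of the listed categories):

```python
def f_measure(pos_counts):
    """Formality F-measure from part-of-speech counts for one text.
    Each frequency is a percentage of the total number of words,
    so F is already bounded between 0 and 100."""
    total = sum(pos_counts.values())  # total word count of the excerpt
    pct = {tag: 100.0 * count / total for tag, count in pos_counts.items()}
    return 0.5 * (pct.get("noun", 0) + pct.get("adjective", 0)
                  + pct.get("preposition", 0) + pct.get("article", 0)
                  - pct.get("pronoun", 0) - pct.get("verb", 0)
                  - pct.get("adverb", 0) - pct.get("interjection", 0)
                  + 100)

counts = {"noun": 30, "verb": 20, "adjective": 10, "pronoun": 15,
          "preposition": 12, "article": 8, "adverb": 4, "interjection": 1}
print(f_measure(counts))  # 60.0, a value within [0, 100]
```

If you want the feature in [0,1] instead, dividing this result by 100 is enough; there is no need to square anything.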

Understanding Recall and Precision

I am currently learning Information Retrieval and I am rather stuck on an example of recall and precision.
A searcher uses a search engine to look for information. There are 10 documents on the first screen of results and 10 on the second.
Assume there are known to be 10 relevant documents in the search engine's index.
So there are 20 returned documents altogether, of which 10 are relevant.
Can anyone help me make sense of this?
Thanks
Recall and precision measure the quality of your result. To understand them let's first define the types of results. A document in your returned list can either be
classified correctly
a true positive (TP): a document which is relevant (positive) that was indeed returned (true)
a true negative (TN): a document which is not relevant (negative) that was indeed NOT returned (true)
misclassified
a false positive (FP): a document which is not relevant (negative) but was returned anyway (false)
a false negative (FN): a document which is relevant (positive) but was NOT returned (false)
the precision is then:
|TP| / (|TP| + |FP|)
i.e. the fraction of retrieved documents which are indeed relevant
the recall is then:
|TP| / (|TP| + |FN|)
i.e. the fraction of relevant documents which are in your result set
So, in your example 10 out of 20 results are relevant. This gives you a precision of 0.5. If there are no more than these 10 relevant documents, you have got a recall of 1.
(When measuring the performance of an Information Retrieval system it only makes sense to consider both precision and recall. You can easily get a precision of 100% by returning no result at all (i.e. no spurious returned instance => no FP) or a recall of 100% by returning every instance (i.e. no relevant document was missed => no FN). )
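A tiny sketch with the question's numbers, under the assumption stated above that the 10 relevant documents returned are exactly the 10 relevant documents in the index:

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

# 20 returned documents, 10 of them relevant, 10 relevant documents in the index.
tp = 10        # relevant documents that were returned
fp = 20 - tp   # returned documents that are not relevant
fn = 10 - tp   # relevant documents that were missed
print(precision(tp, fp))  # 0.5
print(recall(tp, fn))     # 1.0
```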
Well, this is an extension of my answer on recall at https://stackoverflow.com/a/63120204/6907424. First read about precision here and then go there to read about recall. Here I am only explaining precision, using the same example:
Example No.   Ground truth   Model's prediction
0             Cat            Cat
1             Cat            Dog
2             Cat            Cat
3             Dog            Cat
4             Dog            Dog
For now I am calculating precision for Cat, so Cat is our positive class and the rest of the classes (here only Dog) are the negative classes. Precision means: what percentage of the positive detections were actually positive? Here the model makes 3 detections of Cat. But are all of them correct? No! Only 2 of them are correct (examples 0 and 2) and one is wrong (example 3). So the percentage of correct detections is 2 out of 3, which is (2 / 3) * 100 % = 66.67%.
Now coming to the formulation, here:
TP (True positive): predicting something as positive when it is actually positive. If cat is our positive class, then predicting that something is a cat when it is actually a cat.
FP (False positive): predicting something as positive when it is not actually positive, i.e., saying something is positive "falsely".
Now the number of correct detections of a certain class is the number of TPs of that class. But apart from these, the model also predicted some other examples as positive which were not actually positive, and those are the false positives (FP). So, correct or wrong, the total number of positive detections made by the model is TP + FP. The percentage of correct detections of the positive class among all detections of that class is therefore TP / (TP + FP), which is the precision for that class.
As with recall, we can generalize this formula to any number of classes: take one class at a time, treat it as the positive class and the rest of the classes as negative classes, and repeat the same process for each class to calculate its precision.
You can also calculate precision and recall in another way (basically the other way of thinking about the same formulae). Say for Cat: first count how many examples have Cat in both the ground truth and the model's prediction (i.e., count the number of TPs). If you are calculating precision, divide this count by the number of "Cat"s in the model's predictions; for recall, divide instead by the number of "Cat"s in the ground truth. This works out the same as the formulae for precision and recall. If you can't see why, think it over for a while and review what TP, FP, TN and FN actually mean.
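A short sketch of that counting approach on the table above (the lists are hard-coded and the helper name is illustrative):

```python
from collections import Counter

ground_truth = ["Cat", "Cat", "Cat", "Dog", "Dog"]
predictions  = ["Cat", "Dog", "Cat", "Cat", "Dog"]

def precision_recall(label, truth, preds):
    tp = sum(1 for t, p in zip(truth, preds) if t == label and p == label)
    predicted = Counter(preds)[label]  # TP + FP for this label
    actual = Counter(truth)[label]     # TP + FN for this label
    return tp / predicted, tp / actual

print(precision_recall("Cat", ground_truth, predictions))  # ~ (0.667, 0.667)
print(precision_recall("Dog", ground_truth, predictions))  # ~ (0.5, 0.5)
```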
If you have difficulty understanding precision and recall, consider reading this
https://medium.com/seek-product-management/8-out-of-10-brown-cats-6e39a22b65dc

Maximum Likelihood Estimation-MLE

I have this question related to "Maximum Likelihood Estimation"...I've tried to solve it but I couldn't ... would you please help!
Suppose that for an event X there are three possible values: A, B and C. Now we repeat X for N times. The number of times that we observe A or B is N1, and the number of times that we observe A or C is N2. Let pA be the unknown frequency of value A. Please give the maximum likelihood estimation of pA.
This question doesn't really belong on Stack Overflow, but I will answer it anyway.
You can look at the number of observations of A (I am calling this N_A) as a hidden variable and maximize with respect to the marginal distribution (the sum over all possible values of N_A).
There is no closed-form solution for the MLE parameters in general; the solutions are the zeros of a polynomial constrained to the simplex. Below, I have derived the Expectation Maximization updates.

Resources