widely used formula for DCG seems to be wrong? - search-engine

There are two widely used formulas (found in most Information retrieval lecture slides on the internet, e.g. at Stanford) for computing the Discounted Cumulative Gain. One of them is, for the DCG at rank p:
This is, in fact:
Because log_2(2) = 1. This means that the so-called "discounted" CG is actually not discounted before the third rank!
The following rankings are therefore not distinguishable by the DCG using this formula: (10,5,1,2,...) and (5,10,1,2,...).
I am guessing that the formula is incorrect and should be:
Note btw that the other very common formula (see wikipedia) has this denominator.
I would not be asking if I hadn't seen this formula in practically all the lectures I found on the internet and even in my own lectures at UCL. Is it not wrong? It would be incredible that an error has propagated from Wikipedia and not been picked up by the professors... Am I wrong then?

I found this paper from Microsoft (see equation 6) which backs up my claim that it is basically a typo start discounting at rank 3 only. When you think of it, it makes no sense at all to not discount the rank 2! The metric would be unable to distinguish the rankings (10, 5, 2) and (5, 10, 2), when the first ranking is better. Note that all the other DCG formulas do discount rank 2 and thus would pick up a difference.
So a "+1" is indeed missing in the log, and it is a typo which has been creeping in a lot of papers and lectures...

Related

Could you explain this question? i am new to ML, and i faced this problem, but its solution is not clear to me

The problem is in the picture
Question's image:
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemists obtains the dataset below. In the column on the right, kj/mole is the unit measuring the amount of energy released. examples.
You would like to use linear regression (h a(x)=a0+a1 x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9 B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9 D) a0=−569.6, a1=530.9
Since all a0s are negative but two a1s are positive lets figure out the latter first.
As you can see by increasing the number of carbon atoms the energy is become more and more negative, so the relation cannot be positively correlated which rules out options c and d.
Then for the intercept the value that produces the least error is the correct one. For the 1 and 10 (easier to calculate) the outputs are about -2300 and -7000 for a, -1100 and -5900 for b, so one would prefer b over a.
PS: You might be thinking there should be obvious values for a0 and a1 from the data, it's not. The intention of the question is to give you a general understanding of the best fit. Also this way of solving is kinda machine learning as well

Explanation for Values in Scharr-Filter used in OpenCV (and other places)

The Scharr-Filter is explained in Scharrs dissertation. However the values given on page 155 (167 in the pdf) are [47 162 47] / 256. Multiplying this with the derivation-filter would yield:
Yet all other references I found use
Which is roughly the same as the ones given by Scharr, scaled by a factor of 32.
Now my guess is that the range can be represented better, but I'm curious if there is an official explanation somewhere.
To get the ball rolling on this question in case no "expert" can be found...
I believe the values [3, 10, 3] ... instead of [47 162 47] / 256 ... are used simply for speed. Recall that this method is competing against the Sobel Operator whose coefficient values are are 0, and positive/negative 1's and 2's.
Even though the divisor in the division, 256 or 512, is a power of 2 and can can be performed by a shift, doing that and multiplying by 47 or 162 is going to take more time. A multiplication by 3 however can in fact be done on some RISC architectures like the IBM POWER series in a single shift-and-add operation. That is 3x = (x << 1) + x. (On these architectures, the shifter and adder are separate units and can be done independently).
I don't find it surprising that Phd paper used the more complicated and probably more precise formula; it needed to prove or demonstrate something, and the author probably wasn't totally certain or concerned that it be used and implemented alongside other methods. The purpose in the thesis was probably to have "perfect rotational symmetry". Afterwards when one decides to implement it, that person I suspect used the approximation formula and gave up a little on perfect rotational symmetry, to gain speed. That person's goal as I said was to have something that was competitive at the expense of little bit of speed for this rotational stuff.
Since I'm guessing you are willing to do work this as it is your thesis, my suggestion is to implement the original algorithm and benchmark it against both the OpenCV Scharr and Sobel code.
The other thing to try to get an "official" answer is: "Use the 'source', Luke!". The code is on github so check it out and see who added the Scharr filter there and contact that person. I won't put the person's name here, but I will say that the code was added 2010-05-11.

what does B1 and B2 in second order pointwise mutual mutual information mean?

I need to know how can i get and?
gama here refers to what ?
I think the metric you're using is from this paper (though the form they give is not quite the same as yours):
Islam, A. and Inkpen, D. 2006. "Second Order Co-occurrence PMI for
Determining the Semantic Similarity of Words". In Proceedings of the
International Conference on Language Resources and Evaluation (LREC
2006), Genoa, Italy, pp. 1033–1038.
which is available online here.
They give the following rule for setting beta:
where delta is a constant whose value depends on the size of the corpus. Islam & Inkpen use 6.5, but you should probably look at the original paper to get a sense of the trade-offs involved.

What is wrong with 20newsgroup-18828 dataset?

i am currently using 20NewsGroup-18828 dataset in weka. I have selected a subset of document with 100 per category (total 2000 documents) which i divided in a split of 70%(training) and 30%(testing) when i tried classification with naive bayes, SVM and K-nn its accuracy is very low.Here are list of operations i am performing on the dataset
StringtoWordVector (indexing and term weighting with Tf-Idf, Smart stopword list, Snowball stemmer)
Dimensionality reduction with feature selection (InformationGain)
Dimensionality reduction with feature transformation (Random Projection)
When i use original dataset with 20,000 docs it performs well but it has duplications like some documents are classified in multiple categories.
Did any one used this dataset or can someone tell me what i am doing wrong ?
Regarding differences between datasets
The main difference between 20newsgroup ( o riginal dataset) and 20newsgroup-18828 (m odified) is:
o contains duplicates, m does not
o contains trivial problem, as it includes newsgroup identification header, m includes only from and subject headers (so it is still easy version of the problem, but harder than o), for example:
FILE 51126 regarding atheism
in original form:
Path:
cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!noc.near.net!news.centerline.com!uunet!olivea!sgigate!sgiblab!adagio.panasonic.com!nntp-server.caltech.edu!keith
From: keith#cco.caltech.edu (Keith Allan Schneider) Newsgroups:
alt.atheism Subject: Re: >>>>>>Pompous ass Message-ID:
<1pi9btINNqa5#gap.caltech.edu> Date: 2 Apr 93 20:57:33 GMT References:
<1ou4koINNe67#gap.caltech.edu> <1p72bkINNjt7#gap.caltech.edu>
<93089.050046MVS104#psuvm.psu.edu> <1pa6ntINNs5d#gap.caltech.edu>
<1993Mar30.210423.1302#bmerh85.bnr.ca> <1pcnqjINNpon#gap.caltech.edu>
Organization: California Institute
of Technology, Pasadena Lines: 9 NNTP-Posting-Host:
punisher.caltech.edu
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
In modified form (-18828 version)
From: keith#cco.caltech.edu (Keith Allan Schneider)
Subject: Re: >>>>>>Pompous ass
kmr4#po.CWRU.edu (Keith M. Ryan) writes:
>>Then why do people keep asking the same questions over and over?
>Because you rarely ever answer them.
Nope, I've answered each question posed, and most were answered
multiple times.
keith
As you can see, original data is so simple, that you actually can find the name of the label inside of the file... this is why you will always get good scores on such data, even if your whole processing concept is very, very wrong.
So the question is not "what is wrong with 20newsgroup-18828" but rather "what is wrong with the original dataset".
General ideas
First, why would you assume that anything is wrong? You are performing very arbitrary methods of data representation processing (two different dimensionality reduction steps) on the very small (70 training vectors per class) dataset. There is nothing wrong with this data, this is a simple NLP data, which, as most of the NLP tasks require large amounts of data, and "naive" (not NLP-based) dimensionality reduction techniques have no guarantees to actually help.
Secod, even if you do something wrong, in 90% os cases (arbitrary high number) the error is between what user think he does, and what he actually does. So describing what you do won't lead to any help, you have to show what you exactly do (by giving a reproducible example).

Need help maximizing 3 factors in multiple, similar objects and ordering appropriately

I need to write an algorithm in any language that would order an array based on 3 factors. I use resorts as an example (like Hipmunk). Let's say I want to go on vacation. I want the cheapest spot, with the best reviews, and the most attractions. However, there is obviously no way I can find one that is #1 in all 3.
Example (assuming there are 20 important attractions):
Resort A: $150/night...98/100 in favorable reviews...18 of 20 attractions
Resort B: $99/night...85/100 in favorable reviews...12 of 20 attractions
Resort C: $120/night...91/100 in favorable reviews...16 of 20 attractions
Resort B looks the most appealing in price, but is 3rd in the other 2 categories. Wherein, I can choose resort C for only $21 more a night and get more attractions and better reviews. Price is still important to me, but Resort A has outstanding reviews and a ton of attractions: Is $51 more worth the splurge?
I want to be able to populate a list that will order a lit from "best to worst" (I quote bc it is subjective to the consumer). How would I go about maximizing the value for each resort?
Should I put a weight for each factor (ie: 55% price, 30% reviews, 15% amenities) and come to the result of a set number and order them that way?
Do I need a mode, median and range for all the hotels and determine the average price, and have the hotels around the average price hold the most weight?
If it is a little confusing then check out www.hipmunk.com. They have an airplane sort they call Agony (and a hotel sort which is similar to my question) that they use as their own. I used resorts as an example to make my question hopefully make a little more sense. How does one put math to a problem like this?
I was about to ask the same question about multiple-factor weighted sorting, because my research only came up with answers (e.g. formulas with explanations) for two-factor sorting.
Even though we're both asking about 3 factors, I'll list the possibilities I've found in case they're helpful.
Possibilities:
Note: S is the "sorting score", which is what you'd sort by (asc or desc).
"Linearly weighted" - use a function like: S = (w1 * F1) + (w2 * F2) + (w3 * F3), where wx are arbitrarily assigned weights, and Fx are the values of the factors. You'd also want to normalize F (i.e. Fx_n = Fx / Fmax).
"Base-N weighted" - more like grouping than weighting, it's just a linear weighting where weights are increasing multiples of base-10 (a similar principle to CSS selector specificity), so that more important factors are significantly higher: S = 1000 * F1 + 100 * F2 ....
Estimated True Value (ETV) - this is apparently what Google Analytics introduced in their reporting, where the value of one factor influences (weights) another factor - the consequence being to sort on more "statistically significant" values. The link explains it pretty well, so here's just the equation: S = (F2 / F2_max * F1) + ((1 - (F2 / F2_max)) * F1_avg), where F1 is the "more important" factor ("bounce rate" in the article), and F2 is the "significance modifying" factor ("visits" in the article).
Bayesian Estimate - looks really similar to ETV, this is how IMDb calculates their rating. See this StackOverflow post for explanation; equation: S = (F2 / (F2+F2_lim)) * F1 + (F2_lim / (F2+F2_lim)) × F1_avg, where Fx are the same as #3, and F2_lim is the minimum threshold limit for the "significance" factor (i.e. any value less than X shouldn't be considered).
Options #3 and #4 look really promising, since you don't really have to choose an arbitrary weighting scheme like you do in #1 and #2, but then the problem is how do you do this for more than two factors?
In your case, assigning the weights in #1 would probably be fine. You'll need to fine-tune the algorithm depending on what your users consider more important - you could expose the weights wx as a filter (like 1-10 dropdown) so your users can adjust their search on the fly. Or if you wanted to get clever you could poll your users before they're searching ("Which is more important to you?") and then assign a weighting set based on the response, and after tracking enough polls you could autosuggest the weighting scheme based on most responses.
Hope that gets you on the right track.
What about having variable weights, and letting the user adjust it through some input like levers, so that the sort order will be dynamically updated?

Resources