I see people brag about or discuss their PageRank.
How would one objectively assess such a claim? For instance, if someone says their blog is "PageRank 3", what exactly does that tell me?
Do I have enough information to really understand the utility of that metric? For example:
1. Which search engine?
2. Which search query would be needed to see the blog in the search results?
3. Something else?
My understanding is that "PageRank 3" means the blog would be the third result for a particular query, but maybe that is a very rudimentary understanding of PageRank, for instance if it turns out the numbers are tiers.
I will give a more complete answer.
In Google's early days, results were ordered like this:
#1 result: PageRank score * relevancy score (for instance: 100)
#2 result: PageRank score * relevancy score (for instance: 90)
#3 result: PageRank score * relevancy score (for instance: 80)
To give website owners an idea of their PageRank, Google exposed a public metric along these lines (the thresholds are only an example):
real PageRank score > 10000 = PageRank metric (given by google) 5
real PageRank score > 1000 = PageRank metric 4
real PageRank score > 100 = PageRank metric 3
real PageRank score > 10 = PageRank metric 2
real PageRank score > 5 = PageRank metric 1
real PageRank score > 0 = PageRank metric 0
By now, though, PageRank has become just one indicator among hundreds that explain the rankings.
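As an illustration only, here is a minimal sketch of how such a bucketing could work; the thresholds mirror the example tiers above and are entirely made up, since Google never published the real mapping.

# Hypothetical thresholds, mirroring the example tiers above.
THRESHOLDS = [(10000, 5), (1000, 4), (100, 3), (10, 2), (5, 1), (0, 0)]

def toolbar_pagerank(raw_score):
    # Walk the tiers from highest to lowest and return the first bucket the score clears.
    for cutoff, bucket in THRESHOLDS:
        if raw_score > cutoff:
            return bucket
    return 0

print(toolbar_pagerank(250))   # -> 3, i.e. what would be reported as "PageRank 3"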
It seems that the score can be well above or below 1, not in the range 0-1. Neo4j uses Lucene full-text search, and the scores it returns are not in the range 0-1. Is this expected in Lucene?
I believe Lucene's default out-of-the-box scoring does stay between 0 and 1; however, once boosting or other custom scoring is involved, the score can be any positive value that fits in a float.
You can easily normalize the scores into the 0-to-1 range by dividing each hit's score by the maximum score of any hit in the query, as stated in this StackOverflow answer.
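For example, a minimal sketch of that normalization step (plain Python, assuming you already have the query results as (document, score) pairs):

def normalize_scores(hits):
    # hits: list of (document, raw_lucene_score) pairs for a single query
    if not hits:
        return []
    max_score = max(score for _, score in hits)
    # Divide every score by the best score in this result set, so the top hit becomes 1.0.
    return [(doc, score / max_score) for doc, score in hits]

print(normalize_scores([("a", 7.2), ("b", 3.6)]))   # [('a', 1.0), ('b', 0.5)]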
Let's say I implemented a random forest algorithm with 20 trees, each trained on a random subset of the training data,
and there are 4 different class labels that can be predicted.
So what exactly should be called a majority verdict?
With 20 trees in total, does a majority verdict require that the highest-voted class label gets at least 10 votes, or does it simply need more votes than the other labels?
Example:
Total trees = 20, class labels are {A, B, C, D}
Scenario 1:
A = 10 votes
B = 4 votes
C = 3 votes
D = 3 votes
Clearly, A is the winner here.
Scenario 2:
A = 6 votes
B = 5 votes
C = 5 votes
D = 4 votes
Can A be called the winner here?
If you are making a hard decision, meaning you are asked to return a single best guess, then yes, A is the winner in both scenarios.
To capture the difference between the two cases, you can use a soft-decision system instead, where you return the winner together with a confidence value. One possible confidence here is the ratio of votes received by A; the first scenario then yields a much more confident estimate than the second.
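Here is a minimal sketch of both rules in plain Python (a hypothetical helper, not tied to any particular random forest library): the hard decision takes the plurality label, and the soft decision also reports the winner's vote ratio as a confidence.

from collections import Counter

def majority_verdict(votes):
    # votes: list of predicted labels, one per tree, e.g. ["A", "B", "A", ...]
    counts = Counter(votes)
    winner, winner_votes = counts.most_common(1)[0]
    confidence = winner_votes / len(votes)   # fraction of trees that voted for the winner
    return winner, confidence

print(majority_verdict(["A"] * 10 + ["B"] * 4 + ["C"] * 3 + ["D"] * 3))   # ('A', 0.5)
print(majority_verdict(["A"] * 6 + ["B"] * 5 + ["C"] * 5 + ["D"] * 4))    # ('A', 0.3)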
I was told the algorithm chooses the arm with the highest empirical mean with probability 1 − epsilon, so why is epsilon/k added to that (and why is epsilon/k also the probability of a random selection) in the probability equation on page 6 of the paper "Algorithms for multi-armed bandits"? What does the epsilon/k in that equation mean?
This answer was taken from here:
Suppose you are standing in front of k = 3 slot machines. Each machine pays out according to a different probability distribution, and these distributions are unknown to you. And suppose you can play a total of 100 times.
You have two goals. The first goal is to experiment with a few coins to try and determine which machine pays out the best. The second, related, goal is to get as much money as possible. The terms “explore” and “exploit” are used to indicate that you have to use some coins to explore to find the best machine, and you want to use as many coins as possible on the best machine to exploit your knowledge.
Epsilon-greedy is almost too simple. As you play the machines, you keep track of the average payout of each machine. Then, you select the machine with the highest current average payout with probability = (1 – epsilon) + (epsilon / k) where epsilon is a small value like 0.10. And you select machines that don’t have the highest current payout average with probability = epsilon / k.
It is much easier to understand with a concrete example. Suppose that over your first 12 pulls you played machine #1 four times, winning $1 two times and $0 two times. The average payout for machine #1 is $2/4 = $0.50.
And suppose you’ve played machine #2 five times and won $1 three times and $0 two times. The average payout for machine #2 is $3/5 = $0.60.
And suppose you’ve played machine #3 three times and won $1 one time and $0 two times. The average payout for machine #3 is $1/3 = $0.33.
Now you have to select a machine to play on try number 13. You generate a random number p, between 0.0 and 1.0. Suppose you have set epsilon = 0.10. If p > 0.10 (which it will be 90% of the time), you select machine #2 because it has the current highest average payout. But if p < 0.10 (which it will be only 10% of the time), you select a random machine, so each machine has a 1/3 chance of being selected.
Notice that machine #2 might get picked anyway because you select randomly from all machines.
Over time, the best machine will be played more and more often because it will pay out more often. In short, epsilon-greedy means pick the current best option ("greedy") most of the time, but pick a random option with a small (epsilon) probability sometimes.
There are many other algorithms for the multi-armed bandit problem. But epsilon-greedy is incredibly simple, and often works as well as, or even better than, more sophisticated algorithms such as UCB ("upper confidence bound") variations.
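To make this concrete, here is a minimal sketch of the whole epsilon-greedy loop in plain Python; the payout probabilities are made up purely for illustration.

import random

true_payout_probs = [0.3, 0.5, 0.4]   # unknown to the player; machine #2 is really the best
epsilon, plays = 0.10, 10_000
totals = [0.0, 0.0, 0.0]              # money won on each machine so far
counts = [0, 0, 0]                    # times each machine has been played

for _ in range(plays):
    if random.random() > epsilon:
        # Exploit: play the machine with the highest average payout so far (ties go to the lowest index).
        arm = max(range(3), key=lambda i: totals[i] / counts[i] if counts[i] else 0.0)
    else:
        # Explore: play any machine, uniformly at random (the current best can still be picked here).
        arm = random.randrange(3)
    reward = 1.0 if random.random() < true_payout_probs[arm] else 0.0
    totals[arm] += reward
    counts[arm] += 1

print(counts)   # counts[1] (machine #2, the true best) should be by far the largest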
Let me try giving my point of view here.
Let's consider a similar example with 3 machines, A, B and C, and assume B currently has the highest payout.
If epsilon is 0.1, then what is the probability of choosing B?
Recall what the epsilon-greedy algorithm says:
# assumes "import random" and that epsilon, best_machine and machines are already defined
r = random.random()                    # uniform random number between 0 and 1
if r > epsilon:
    chosen = best_machine              # exploit: best payout at the current time (currently B)
else:
    chosen = random.choice(machines)   # explore: pick any of the three machines at random
So, out of 100 plays, how many times do we expect to choose B?
It is the sum of the two cases below:
1) 90 plays out of 100 (the if branch)
2) one third of the remaining 10 plays (the else branch), since there are 3 machines and each is equally likely to be chosen
So the expected total is 90 + 10 * (1/3) ≈ 93.33.
But wait, what if epsilon were 0.05?
Then the expected total would be 95 + 5 * (1/3) ≈ 96.67.
That's why we say the probability of selecting the machine with the highest current average payout is (1 − epsilon) + (epsilon / k).
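A quick numeric check of that formula (plain Python, with B as the current best of k = 3 machines):

k, epsilon = 3, 0.1
p_best = (1 - epsilon) + epsilon / k   # probability the current best machine (B) is chosen
p_other = epsilon / k                  # probability for each of the other machines
print(p_best * 100, p_other * 100)     # ≈ 93.33 and ≈ 3.33 expected plays out of 100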
I hope this helps.
Is there an argument to the built-in PageRank algorithm, or is a separate algorithm available, for applying PageRank to a weighted Neo4j graph? I found the algorithm here but don't know how to run it interactively on Neo4j Desktop.
There is a Neo4j Graph Algorithms library that contains a PageRank algorithm procedure. The procedure signature is as follows:
CALL algo.pageRank(label:String, relationship:String,
  {iterations:5, dampingFactor:0.85, write: true, writeProperty:'pagerank', concurrency:4})
YIELD nodes, iterations, loadMillis, computeMillis, writeMillis, dampingFactor, write, writeProperty
- calculates page rank and potentially writes back
You can run the algorithm with a query like this:
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
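If you would rather run that query from a script than paste it into the Neo4j Browser, here is a minimal sketch using the official Neo4j Python driver; the bolt URI and credentials are placeholders for whatever your Neo4j Desktop database reports, and the Graph Algorithms plugin still has to be installed on that database.

from neo4j import GraphDatabase   # official Neo4j Python driver

# Placeholder connection details for a local Neo4j Desktop database.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
CALL algo.pageRank.stream('Page', 'LINKS', {iterations:20, dampingFactor:0.85})
YIELD node, score
RETURN node, score ORDER BY score DESC LIMIT 20
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["node"], record["score"])

driver.close()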
I'm using the Wilson scoring algorithm (code below) and realized it doesn't factor in negative votes.
Example:
Upvotes  Downvotes  Score
1        0          0.2070
0        0          0
0        1          0      <--- this is wrong
That isn't correct, as an item with net-negative votes should have a lower score.
require 'cmath'  # Math.sqrt would also work here, since the argument is never negative

def calculate_wilson_score(up_votes, down_votes)
  total_votes = up_votes + down_votes
  return 0 if total_votes == 0

  z = 1.96  # z-score for a ~95% confidence level
  positive_ratio = (1.0 * up_votes) / total_votes

  # Lower bound of the Wilson score confidence interval for the proportion of upvotes
  score = (positive_ratio + z*z/(2*total_votes) -
           z * CMath.sqrt((positive_ratio*(1 - positive_ratio) + z*z/(4*total_votes))/total_votes)) /
          (1 + z*z/total_votes)
  score.round(3)
end
Update:
Here is a description of the Wilson scoring confidence interval on Wikipedia.
The Wilson score lower confidence bound you posted does take negative votes into account; the lower bound just never drops below zero, which is perfectly fine. This approximation is generally used for identifying the highest ranked items on a best-rated list, so it may have undesirable properties when looking at the lowest ranked items, which are the kind you are describing.
This method of ranking items was popularized by Evan Miller in a post on how not to sort by average rating, although he later stated
The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we'd like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.
If you are genuinely interested in analyzing the lowest ranked items on a list, I would suggest either using the upper confidence bound or a Bayesian rating system, as described in: https://stackoverflow.com/a/30111531/3884938
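As a sketch of that suggestion (a plain Python port of the same interval, not part of the original answer), the lower bound is what the question's Ruby function computes and the upper bound is what you would sort by, in ascending order, when hunting for the worst-rated items:

import math

def wilson_bounds(up_votes, down_votes, z=1.96):
    # Wilson score interval for the proportion of upvotes; z = 1.96 gives ~95% confidence.
    n = up_votes + down_votes
    if n == 0:
        return 0.0, 0.0
    p = up_votes / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (centre - margin) / denom, (centre + margin) / denom

print(wilson_bounds(1, 0))   # lower ≈ 0.207, upper = 1.0
print(wilson_bounds(0, 1))   # lower = 0.0,   upper ≈ 0.793

Sorting ascending by the upper bound then ranks the (0 up, 1 down) item below the (1 up, 0 down) item, which is the behaviour the question was after.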