Each day I want to find the "most popular" post on the website and feature it on the home page.
For each post, I'm keeping track of how many times it has been "liked", "disliked", "favorited" and "viewed".
I would like to run a daily cron job where I do something like:
post = Post.order("popularity_score DESC").first
post.feature!
My question is, how should I compute the value of popularity_score?
Is there a formula that takes "statistical significance" into consideration? That is, a post which has 1 "like" vote and nothing else has a 100% approval rating, but that shouldn't mean much because only one person voted on it.
In general I have these loose ideas off the top of my head:
a post with 10 likes and no other votes is more popular than a post with 1 like vote.
a post with more "dislikes" than "likes" should have a lower score than a post with more "likes" than "dislikes".
a post with 20 views and no other votes is more popular than a post with 3 views.
I've punched in some arbitrary formulas to try to satisfy this goal, but they are exactly that, arbitrary, and I don't really know if there is a better way to go about this.
Suggestions?
Maybe you could just take the SO approach? It seems rather decent.
+ gives 10 points
- subtracts 2 points
a view adds a low number, like 0.01 points
a comment adds 2 points
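As a rough sketch of the weighted-points idea above, the function below just applies the four weights from this answer to a post's counters. The function name and keyword arguments are made up for illustration, and Python is used only to keep the example self-contained; in the Rails app this would live wherever popularity_score gets computed.

```python
def popularity_score(likes=0, dislikes=0, views=0, comments=0):
    """Weighted sum using the point values suggested in this answer."""
    return 10 * likes - 2 * dislikes + 0.01 * views + 2 * comments


# The three loose requirements from the question all hold with these weights.
print(popularity_score(likes=10) > popularity_score(likes=1))
print(popularity_score(likes=1, dislikes=3) < popularity_score(likes=3, dislikes=1))
print(popularity_score(views=20) > popularity_score(views=3))
```

The weights themselves are still a judgment call; the sketch only shows that a plain weighted sum already orders posts the way the question asks.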
One suggestion is to not reset your counter each day (that leaves the "most popular" open to a single vote).
Instead, weight the votes by their age -- newer votes count more than older votes. This will give you gradual and meaningful rerankings over time.
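One way to picture age-weighted votes, as a sketch only: each vote counts for less the older it is. The exponential shape and the 7-day half-life are assumptions picked for illustration, as are the function name and the (day, value) tuples.

```python
def age_weighted_score(votes, today, half_life_days=7.0):
    """Sum votes so that a vote loses half its weight every half_life_days.

    votes is an iterable of (day_cast, value) pairs, where value is +1 for
    a like and -1 for a dislike, and days are plain day numbers.
    """
    return sum(
        value * 0.5 ** ((today - day_cast) / half_life_days)
        for day_cast, value in votes
    )


fresh_post = [(21, +1), (21, +1), (21, +1)]   # 3 likes today
old_post = [(0, +1)] * 5                      # 5 likes three weeks ago
print(age_weighted_score(fresh_post, today=21))   # 3.0
print(age_weighted_score(old_post, today=21))     # 0.625
```

With these numbers the fresh post outranks the older one even though it has fewer likes in total, which is the gradual reranking effect described above.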
I've been asked to create a summary for some google form responses, and though I have a working solution, I can't help but feel there must be a more elegant one.
The form collects data related to case checking - every month each team (there's 100+ teams) has to check a certain number of cases based on how many staff are in their team, and enter the results for each case they've checked in the google form. The team that have set this up want me to summarise the data by team, month, and section of the form (preliminary questions, case recording, outcomes, etc). There are 8 sections on the live form, ranging from 1-13 questions, all with Yes/No/NA/blank answers.
(honestly, it's not how I'd have approached setting all this up, but that is out of my hands!)
So they're essentially looking for a live monthly summary with team names down the side, section names along the top, and a %age completed that will keep up with entries as they come in (where we can also use importrange and query to pull the relevant bits into other google sheet summaries, as and when needed).
What I've currently got is this:
=iferror(sum(countifs('Form Responses'!$B:$B,$A3,'Form
Responses'!$F:$F,"Yes",'Form Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$G:$G,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$H:$H,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$I:$I,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$J:$J,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)),countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$K:$K,"Yes",'Form
Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1)))/(countifs('Form
Responses'!$B:$B,$A3,'Form Responses'!$E:$E,">="&$B$1,'Form
Responses'!$E:$E,"<"&edate($B$1,1))*6),0)
It works, but it feels like a bit of a brute-force-and-ignorance solution. I've tried countifs with an array, I've looked at a pivot table but I can't get the section groups, and I've had a play with query but I can't figure out how to ask it to count all Yeses in multiple columns at once.
Is there a more elegant solution, or do I have to resign myself to setting up the next financial year's summaries like this?
Edit:
You can use plain array boolean multiplication to achieve the count, as TRUE values are converted to 1 and FALSE values to 0:
=TO_PERCENT(ARRAYFORMULA(
SUM((f!F1:K="Yes")*(f!E1:E>=B1)*(f!E1:E<EDATE(B1,1))*(f!B:B=A3))/
SUM(6*(f!E1:E>=B1)*(f!E1:E<EDATE(B1,1))*(f!B:B=A3))
)
)
Here the Form Responses sheet has been renamed to f.
Numerator: the SUM of the product of
the question filter (f!F:K = "Yes"),
the month filter (f!E:E falls within the month starting at B1) and
the team filter (f!B:B = A3).
Denominator: 6 times the SUM of the product of
the month filter (f!E:E falls within the month starting at B1) and
the team filter (f!B:B = A3).
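To see why multiplying the filters works, here is the same calculation sketched outside of Sheets. Python booleans behave like 1 and 0 under multiplication, just as in the ARRAYFORMULA. The row layout (team, date, answers) and the sample data are invented for the illustration.

```python
from datetime import date

# Hypothetical form rows: team, submission date, six Yes/No/NA/blank answers.
rows = [
    {"team": "A", "date": date(2023, 4, 3), "answers": ["Yes", "Yes", "No", "NA", "", "Yes"]},
    {"team": "A", "date": date(2023, 4, 20), "answers": ["Yes"] * 6},
    {"team": "A", "date": date(2023, 5, 2), "answers": ["Yes"] * 6},   # wrong month
    {"team": "B", "date": date(2023, 4, 9), "answers": ["No"] * 6},    # wrong team
]


def yes_rate(rows, team, month_start, next_month_start, answers_per_row=6):
    """Share of "Yes" answers for one team in one month."""
    numerator = 0
    denominator = 0
    for row in rows:
        # Team filter times month filter: 1 if both hold, otherwise 0.
        in_scope = (row["team"] == team) * (month_start <= row["date"] < next_month_start)
        numerator += sum((answer == "Yes") * in_scope for answer in row["answers"])
        denominator += answers_per_row * in_scope
    return numerator / denominator if denominator else 0.0


print(yes_rate(rows, "A", date(2023, 4, 1), date(2023, 5, 1)))  # 0.75
```

Only the two April rows for team A survive the filters, giving 9 Yes answers out of 12.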
On the sample sheet that you provided you'll notice two new tabs: MK.Retab and MK.Summary.
On MK.Retab is a single formula in A2 that "re-tabulates" all of your survey data into a format that is much easier to analyze going forward. That tab can be "hidden" on your real project. It will continue to build the 6 column dataset forever. It would be a sort of "back end" sheet, only used to supply data to any further downstream analysis.
On MK.Summary is a single formula in cell A1 that queries that dataset from MK.Retab and shows the percentage of Yes answers by month, by section, by team, in a format similar to what you proposed. I coded it to display the most recent month at the left, immediately to the right of the team names, and to push historical data off to the right. Even though people are often used to seeing time go from left to right, I find the opposite method nice because it keeps you from having to scroll sideways to see the most recent data. It is very simple to change should you want to: just remove the "desc" in the "order by" clause of the query string.
I find this kind of two-step solution to problems like yours useful because, while the summary might not be exactly what you want, it's always easier to build formulas and analyses off of the data as laid out in the MK.Retab sheet.
As for the formula in MK.Retab, it is based on a method that I came up with a while back that constructs a large vlookup where the [search key] is actually a sequence of decimal numbers that is built by counting the number of rows in your real data set and multiplying by the number of columns of data that need to be repeated for each row. I built a demo some time ago that I'm happy to share with folks if you want to understand better how it works.
You said that your goal was to understand the formulas so that you could modify them going forward as needed. I'm not sure how easy that will be to do, but I can try my best to answer any questions you might have about the method or the solution generally.
What I can tell you is that some of the formulas are more complicated than they need to be because you just used Q1, Q2, Q3, etc. instead of the actual questions. If you had a list of the questions asked somewhere (on some other tab, say), and what you wanted to name their corresponding "sections", it would make the formula significantly less complicated. As it stands, I had to use the appearance of the word "Comments" in row 1 to distinguish where one section ended and another began. The upside to that decision, though, is that the formula I wrote is infinitely expandable to the right. That is, if you were to add another 100 columns' worth of questions and answers to the sample set here, the formula would be able to handle that and break it out, so long as there was the word "Comments" between each section.
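The "Comments" trick can be pictured like this. It is only a sketch of the grouping logic, not the actual formula on MK.Retab, and the header names are made up.

```python
def split_sections(headers, marker="Comments"):
    """Group question column indexes into sections, using any header that
    contains the marker word as the boundary between sections."""
    sections, current = [], []
    for index, header in enumerate(headers):
        if marker in header:
            if current:
                sections.append(current)
            current = []
        else:
            current.append(index)
    if current:
        sections.append(current)
    return sections


headers = ["Q1", "Q2", "Comments", "Q3", "Comments", "Q4", "Q5", "Q6"]
print(split_sections(headers))  # [[0, 1], [3], [5, 6, 7]]
```

Adding more question columns on the right only extends the last group or starts a new one, which is why the approach keeps working as the form grows.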
Hope all this helps.
Here is my example query. It specifies that Tweets must be:
Written in English
Tweeted between 23Jan2010 and 24Jan2010
Have at least 100 "favorites" (likes)
My idea is to use something like the binary search algorithm to narrow down the minimum number of likes the Tweet has. Once only one Tweet is returned by a query, I'll know it is the Tweet with the most likes. The problem is, min_faves--the value that specifies the minimum number of likes--doesn't seem to work. Look at this query. It specifies min_faves as 100. As you can see, this Justin Bieber Tweet appears. It has 1.6k likes. Now, when I attempt to increase the min_faves value to 300 (to narrow down the most liked Tweet), the Justin Bieber Tweet is excluded! I don't know if I am not understanding the query system correctly, or if it is not working, but this seems incorrect. The Justin Bieber Tweet should show up, as it has more than 300 likes. This is just one example of how it doesn't seem to work.
Perhaps this is occurring because, within the specified time range, the Justin Bieber Tweet did not have enough likes to meet the requirements. This would be very good for me, as I am trying to find the most liked Tweet on that particular day, and not the Tweet with the most likes right now that happens to have been posted on that particular day.
But, I do not believe this is the case. For instance, this query includes 3 Tweets from "Rev Run" when min_faves is set to 249, but returns 0 Tweets when min_faves is set to 250. I doubt that these Tweets all had exactly 249 likes on that day (as implied by these symptoms).
Does anyone either:
Understand why these results occur and how I can use this method to find the most liked Tweet of a particular day
Know of a better, alternative way I can find the most liked Tweet of a particular day
Thank you all
@sinanspd requested an example from 2018:
Here is a search with min_faves at 300k. It includes a post with 769k likes and a post with 479k likes. When the query's min_faves is bumped up to 400k neither are returned.
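For reference, the binary-search idea from the question would look roughly like this if the search honoured min_faves consistently, which the observations above suggest it does not. count_at_least is a hypothetical stand-in for "run the search with this min_faves value and count the results"; it is not a real Twitter API call.

```python
def max_likes(count_at_least, upper_bound):
    """Return the largest threshold for which at least one Tweet matches.

    Assumes count_at_least(0) > 0, that upper_bound is above the true
    maximum, and that raising the threshold never adds results.
    """
    low, high = 0, upper_bound
    while low < high:
        mid = (low + high + 1) // 2
        if count_at_least(mid) > 0:
            low = mid          # something still matches, look higher
        else:
            high = mid - 1     # nothing matches, look lower
    return low


# Simulated day with three Tweets, standing in for the real search.
likes = [12, 1600, 250]
print(max_likes(lambda m: sum(n >= m for n in likes), 10**7))  # 1600
```

The method depends entirely on the last assumption: if a Tweet with 1.6k likes disappears when min_faves goes from 100 to 300, the search can converge on the wrong value.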
My goal here is to generate a system similar to that of the front page of reddit.
I have things and, for the sake of simplicity, these things have votes. The best system I've generated uses time decay. With a half-life of 7 days, if a vote is worth 20 points today, then in seven days it is worth 10 points, and in 14 days it will only be worth 5 points.
The problem is, that while this produces results I am very happy with, it doesn't scale. Every vote requires me to effectively recompute the value of every other vote.
So, I thought I might be able to reverse the idea. A vote today is worth 1 point. A vote seven days from now is worth 2 points, and 14 days from now is worth 4 points and so on. This works well because for each vote, I only have to update one row. The problem is that by the end of the year, I need a datatype that can hold fantastically huge numbers.
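A quick sketch of why the reversed idea ranks things the same way as decay: the inflated score is just the decayed score multiplied by a factor that is identical for every post. Days are counted from site launch, and the function names and sample posts are made up.

```python
def decayed_score(vote_days, now, half_life=7.0):
    """Classic decay: each vote is worth 1 when cast and halves every half_life days."""
    return sum(0.5 ** ((now - day) / half_life) for day in vote_days)


def inflated_score(vote_days, half_life=7.0):
    """Reversed idea: a vote cast on a later day is worth more, and never changes."""
    return sum(2.0 ** (day / half_life) for day in vote_days)


posts = {"old": [0, 0, 0, 0, 0, 7], "new": [14, 14], "mid": [7, 7, 10]}
now = 14

by_decay = sorted(posts, key=lambda p: decayed_score(posts[p], now), reverse=True)
by_inflation = sorted(posts, key=lambda p: inflated_score(posts[p]), reverse=True)
print(by_decay == by_inflation)   # True
print(2.0 ** (365 / 7.0))         # weight of a single vote after one year
```

The second print shows the catch described above: after a year one vote already weighs about 2^52, so integer columns run out of room quickly.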
So, I tried using a linear growth which produced terrible rankings. I tried polynomial growth (squaring and cubing the number of days since site launch and submission) and it produced slightly better results. However, as I get slightly better results, I'm quickly re-approaching unmaintainable numbers.
So, I come to you, Stack Overflow. Who's got a genius idea, or a link to an idea, on how to model this system so it scales well for a web application?
I've been trying to do this as well. I found what looks like a solution, but unfortunately, I forgot how to do math, so I'm having trouble understanding it.
The idea is to store the log of your score and sort by that, so the numbers won't overflow.
This doc describes the math.
https://docs.google.com/View?id=dg7jwgdn_8cd9bprdr
And the comment where I found it is here:
http://blog.notdot.net/2009/12/Most-popular-metrics-in-App-Engine#comment-25910828
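My reading of "store the log of your score", as a sketch: keep log2 of the inflated score and fold each new vote in with a log-sum, so the huge number itself is never materialized. I have not checked this line by line against the linked document; the function name and the 7-day half-life are assumptions.

```python
import math

LN2 = math.log(2.0)


def add_vote(log_score, day, half_life=7.0):
    """Return log2(2**log_score + 2**(day / half_life)) without overflow.

    log_score is None for a post that has no votes yet.
    """
    vote_log = day / half_life
    if log_score is None:
        return vote_log
    hi, lo = max(log_score, vote_log), min(log_score, vote_log)
    # log2(2**hi + 2**lo) = hi + log2(1 + 2**(lo - hi)), and lo - hi <= 0.
    return hi + math.log1p(math.exp((lo - hi) * LN2)) / LN2


score = None
for day in (0, 7, 14):            # votes worth 1, 2 and 4 points
    score = add_vote(score, day)
print(score, math.log2(7))        # both are about 2.807
```

Sorting by the stored value gives the same order as sorting by the real score, because the logarithm is monotonic, and each vote still touches only one row.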
Okay, thought of one solution to do that on every vote. The catch is that it requires a linked list with atomic pop/push on both sides to store votes (e.g. Redis list, but you probably don't want it in RAM).
It also requires that decay interval is constant (e.g. 1 hour)
It goes like this:
1. On every vote, update the score and push the next time of decay of this vote to the tail of the list.
2. Then pop the first vote from the head of the list.
3. If it's not old enough to decay, push it back to the head.
4. Otherwise, subtract the required amount from the total score and push the updated information to the tail.
5. Repeat from step 2 until you hit a fresh enough vote (step 3).
You'll still have to check the heads in background to clear the posts that no one votes on anymore, of course.
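A minimal in-memory sketch of these steps, using Python's collections.deque in place of a Redis list. The class and method names are made up, and halving a vote's value at each decay is an assumption (the answer only says "the required amount"). The sketch also decays stale votes before pushing the new one, a small reordering that keeps the list sorted by next decay time.

```python
from collections import deque


class DecayingScore:
    """Score for one post; each vote halves in value once per interval."""

    def __init__(self, interval=3600.0, factor=0.5):
        self.interval = interval
        self.factor = factor
        self.score = 0.0
        self.pending = deque()   # (next_decay_time, current_value), oldest first

    def refresh(self, now):
        """Decay every vote whose decay time has passed (steps 2 to 5)."""
        while self.pending:
            due, value = self.pending.popleft()
            if due > now:
                self.pending.appendleft((due, value))   # fresh enough, stop
                break
            decayed = value * self.factor
            self.score -= value - decayed
            self.pending.append((due + self.interval, decayed))
        return self.score

    def vote(self, now, value=1.0):
        """Record a vote at time now (step 1) and return the new score."""
        self.refresh(now)
        self.score += value
        self.pending.append((now + self.interval, value))
        return self.score


post = DecayingScore(interval=3600.0)
print(post.vote(0))       # 1.0
print(post.vote(3600))    # 1.5  (first vote decayed once)
print(post.vote(7200))    # 1.75 (0.25 + 0.5 + 1.0)
```

refresh is also what the background check mentioned above would call for posts that no longer receive votes. A real version would probably drop entries once their value becomes negligible, otherwise the list never shrinks.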
It's late here so I'm hoping someone can check my math. I think this is equivalent to exponential decay.
MySQL's unsigned BIGINT has a max of 2^64 - 1.
For simplicity, let's use 1 day as our time interval. Let n be the number of days since the site launched.
Create an integer variable. Let's call it X and start it at 0.
If an add operation would bring a score over the BIGINT max, first update every score by dividing it by 2^(n-X) (which is 2^n the first time, while X is still 0), then set X equal to n.
On every vote, add 2^(n-X) to the score.
So, mentally, this makes better sense to me using base 10. As we add things up, our number gets longer and longer. We stop caring about the numbers in the lower digit places because the values we're incrementing scores by have a lot of digits. Which means that the lower digits kind of stop counting for very much. So if they don't count, why not just slide the decimal place over to a point that we care about and truncate the digits below the decimal place at some point. To do this, we need to slide the decimal place over on the amount we're adding each time as well.
I can't help but feel like there's something wrong with this.
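To make the bookkeeping concrete, here is a sketch of the rebasing scheme with plain integers. When a score would overflow, every score is divided by the current vote weight 2^(n-X) (equal to 2^n while X is still 0) and X moves to n. The class name is made up, days are assumed never to go backwards, and the dictionary stands in for the scores table.

```python
BIGINT_MAX = 2**64 - 1   # ceiling of an unsigned 64-bit column


class RebasedScores:
    def __init__(self):
        self.base_day = 0   # X in the text
        self.scores = {}    # post id -> integer score

    def vote(self, post_id, day):
        """Add one vote cast on the given day (n in the text)."""
        weight = 2 ** (day - self.base_day)
        if self.scores.get(post_id, 0) + weight > BIGINT_MAX:
            shift = day - self.base_day
            for key in self.scores:
                self.scores[key] >>= shift   # integer division by 2^(n-X)
            self.base_day = day
            weight = 1                       # 2^(n-X) with X = n
        self.scores[post_id] = self.scores.get(post_id, 0) + weight
        return self.scores[post_id]


s = RebasedScores()
s.vote("a", 0)           # 1
s.vote("b", 63)          # 2**63
print(s.vote("b", 63))   # 2, after rebasing to day 63
print(s.scores["a"])     # 0, a single vote from day 0 is now below one unit
print(s.vote("a", 64))   # 2, one vote today equals two votes from yesterday
```

The truncation in the second print is the "stop caring about the lower digits" effect described above. Whether losing those fractions is acceptable is a separate question from the overflow fix.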
Here are two possible pseudo-queries that you could use. I know that they don't really address scalability, but I think they do give you a way to compute the scores on demand:
SELECT article.title AS title, COALESCE(vp.points, 0) AS points
FROM article
-- Sum the age-weighted votes per article; the +1 avoids dividing by zero for votes cast today
LEFT JOIN (SELECT article_id, SUM(1 / (DATEDIFF(NOW(), created_at) + 1)) AS points
FROM vote GROUP BY article_id) AS vp
ON vp.article_id = article.id
or (not in a join, which will be a bit faster I think, but harder to hydrate),
SELECT article_id, SUM(1 / (DATEDIFF(NOW(), created_at) + 1)) AS points
FROM vote
WHERE article_id IN (...) GROUP BY article_id
The benefit of these queries is that they can be run at any time with the same data and they will always return the same answers. They don't destroy any data.
If you need to, you can also run the queries in a background job and they will still give the same result.
My app implements an activity stream for different types of activities. One of the activity types is related to the different virtual currency a user can accumulate. For example, a user can accumulate "Points" for posting a comment, voting on a topic, etc. If I were to do no filtering or aggregating, you would get a lot of self-generating spam over the course of a mere hour, for example:
Earned 5 points for commenting (total points = 505)
Earned 10 points for voting (total points = 515)
Earned 5 points for commenting (total points = 520)
Earned 5 points for commenting (total points = 525)
Earned 5 points for commenting (total points = 530)
Earned 10 points for voting (total points = 540)
Earned 10 points for voting (total points = 550)
Earned 10 points for voting (total points = 560)
...
...
...
How would you go about preventing this potential for self-generating spam but also present the stream of activities in such a way that invites your friends to see what you've been doing?
I can think of a couple options. The first being an aggregation of the data. I don't know how many activity types you have, but you could distill what you have posted down to 2 items:
<Name> made <x> comments and scored <x * 5> points!
<Name> voted on <x> things.
You could make each of these list items clickable to expand and show the details. So, after a click on the summary of comments, the user would see this:
<Name> made <x> comments and scored <x * 5> points!
Earned 5 points for commenting (total points = 505)
Earned 5 points for commenting (total points = 520)
Earned 5 points for commenting (total points = 525)
Earned 5 points for commenting (total points = 530)
<Name> voted on <x> things.
You could use something like jQuery UI accordion to implement this.
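The aggregation step itself is small. A sketch, assuming each activity is reduced to a type string and that comments are worth 5 points as in the question; the function name and the POINTS table are illustrative.

```python
from collections import Counter

POINTS = {"comment": 5, "vote": 10}   # point values from the question


def summarize(name, activities):
    """Collapse a run of activity types into at most one line per type."""
    counts = Counter(activities)
    lines = []
    if counts["comment"]:
        n = counts["comment"]
        lines.append(f"{name} made {n} comments and scored {n * POINTS['comment']} points!")
    if counts["vote"]:
        lines.append(f"{name} voted on {counts['vote']} things.")
    return lines


stream = ["comment", "vote", "comment", "comment", "comment", "vote", "vote", "vote"]
for line in summarize("Ann", stream):
    print(line)
```

For the eight events in the question's example this prints two lines instead of eight; the per-event details can still be kept around for the expanded view.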
The approach Facebook takes is that it uses a sample post and then lets users know that more items are available, like this:
Earned 5 points for commenting (total points = 505)
Made <x> more comments
Then when the user clicks on the "Made <x> more comments" the user can see every comment (within a certain span of time).
Presuming you want to see in one glance if the user was recently active and how recent, I would propose something like the following:
I am not sure where you would want to show this, but maybe in the profile-page, or in the list of "friends". I would show an aggregation, that would show the most recent time-frame the user was active, and what she did:
E.g.
<name> has just commented on <x>
<name> has made <x> comments and <y> votes in the last hour
<name> has made <x> comments and <y> votes today
<name> has made <x> comments and <y> votes this week
And you would only show the most recent of those. So if a user has just commented (within the last five minutes), show the first line. If she was active in the last hour, show the second line. And so on ...
This clearly shows the user was active and how long ago. I think that is the most important.
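Picking "the most recent of those" could look something like the sketch below. The event tuples and the function name are made up, and "today" is treated as the last 24 hours to keep the example simple.

```python
from datetime import datetime, timedelta

WINDOWS = [
    (timedelta(hours=1), "in the last hour"),
    (timedelta(days=1), "today"),        # simplification: last 24 hours
    (timedelta(days=7), "this week"),
]


def activity_line(name, events, now):
    """events is a list of (timestamp, kind), kind being "comment" or "vote"."""
    if not events:
        return None
    latest_time, latest_kind = max(events)
    age = now - latest_time
    if age <= timedelta(minutes=5):
        verb = "commented" if latest_kind == "comment" else "voted"
        return f"{name} has just {verb}"
    for window, label in WINDOWS:
        if age <= window:
            recent = [kind for when, kind in events if now - when <= window]
            return (f"{name} has made {recent.count('comment')} comments "
                    f"and {recent.count('vote')} votes {label}")
    return None   # nothing in the last week, show no line at all


now = datetime(2024, 1, 1, 12, 0)
events = [(now - timedelta(minutes=30), "vote"), (now - timedelta(minutes=40), "comment")]
print(activity_line("Ann", events, now))
```

Only the narrowest window that contains the latest event is used, so a user who was active two minutes ago never also shows up as "active this week".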
You could combine this with showing the total score, showing how active the user was overall.
Maybe something like:
<name>[<total_score>] has just commented on <x>
or
<name>[<total_score>] has made <x> comments and <y> votes in the last hour.
Hmm, I'd want the message to be shorter:
<name>[<total_score>] has earned <x> points in the last hour.
Is that clearer? Not sure.
This message would then be clickable, and that would link you to a pop-up chart/graph showing the activity (votes/comments/points) over the last week/month. A chart because it is very compact and very understandable.
What do you think?
I'd personally go with an alert like the ones Stack Overflow uses for instant notification of immediate activities. They quickly alert and then get out of the way. If you make them clickable, the user can drill down for detail if they like.
Then, somewhere like in an account section, I'd list all activities using jQuery DataTables, so they could be sorted, paged, filtered, and delivered via pipelined Ajax. Simple, efficient, and user friendly!
UI is about commonality, making a user feel comfortable in an environment they haven't already been by presenting familiar interactions. You'll see this same pattern used on sites like StackOverflow, Swagbucks, MyPoints, etc.
I have a problem in my new Rails project. I want to implement a function that shows the completeness of a user's info as a bar, like LinkedIn does.
I think I can use a variable to record the completeness, but I don't have any idea how to calculate it.
P.S. I have two models: one is the User model, the other is the Info model.
This is, in fact, completely arbitrary. It's based entirely on which activities on the site you want to encourage.
A couple of mechanisms you can consider:
Model "accomplishments" with a completed/not completed status. Count up the ones you care about. Store the accomplishments based on activity either as they happen or at the end of the day in some batch job. For each user, calculate the percentage with the usual math (accomplishments completed/sum of available accomplishments) * 100 = percentage.
A variation of the same, but weighted based on what you consider more valuable contributions. In this case, the math is basically sum of (weight n * accomplishment n)/total weight.
The previous Careers.stackoverflow.com model made a geeky joke about Spinal Tap by making it possible to have counts greater than 100%. You can do that simply by undercounting the maximum accomplishments.
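The weighted variant boils down to a few lines. A sketch with made-up field names and weights, written in Python only to stay self-contained; the same arithmetic would fit in a method on the User model.

```python
def completeness(profile, weights, max_weight=None):
    """Percentage of weighted profile fields that are filled in.

    Passing a max_weight lower than the real total undercounts the maximum,
    which lets the result exceed 100%.
    """
    total = max_weight if max_weight is not None else sum(weights.values())
    done = sum(weight for field, weight in weights.items() if profile.get(field))
    return 100.0 * done / total


weights = {"name": 1, "photo": 2, "bio": 1}          # hypothetical weights
profile = {"name": "Ann", "photo": None, "bio": "Hi"}
print(completeness(profile, weights))                # 50.0
```

Setting every weight to 1 gives the plain count-based percentage from the first mechanism.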