Scalable time decay for web application - scalability

My goal here is to generate a system similar to that of the front page of reddit.
I have things and for the sake of simplicity these things have votes. The best system I've generated is using time decay. With a halflife of 7 days, if a vote is worth 20 points today, then in seven days, it it worth 10 points, and in 14 days it will only be worth 5 points.
The problem is, that while this produces results I am very happy with, it doesn't scale. Every vote requires me to effectively recompute the value of every other vote.
So, I thought I might be able to reverse the idea. A vote today is worth 1 point. A vote seven days from now is worth 2 points, and 14 days from now is worth 4 points and so on. This works well because for each vote, I only have to update one row. The problem is that by the end of the year, I need a datatype that can hold fantastically huge numbers.
So, I tried using a linear growth which produced terrible rankings. I tried polynomial growth (squaring and cubing the number of days since site launch and submission) and it produced slightly better results. However, as I get slightly better results, I'm quickly re-approaching unmaintainable numbers.
So, I come to you stackoverflow. Who's got a genius idea or link to an idea on how to model this system so it scales well for a web application.

I've been trying to do this as well. I found what looks like a solution, but unfortunately, I forgot how to do math, so I'm having trouble understanding it.
The idea is to store the log of your score and sort by that, so the numbers won't overflow.
This doc describes the math.
https://docs.google.com/View?id=dg7jwgdn_8cd9bprdr
And the comment where I found it is here:
http://blog.notdot.net/2009/12/Most-popular-metrics-in-App-Engine#comment-25910828

Okay, thought of one solution to do that on every vote. The catch is that it requires a linked list with atomic pop/push on both sides to store votes (e.g. Redis list, but you probably don't want it in RAM).
It also requires that decay interval is constant (e.g. 1 hour)
It goes like this:
On every vote, update the score push the next time of decay of this vote to the tail of the list
Then pop the first vote from the head of the list
If it's not old enough to decay, push it back to the head
Otherwise, subtract the required amount from the total score and push the updated information to the tail
Repeat from step 2 until you hit a fresh enough vote (step 3)
You'll still have to check the heads in background to clear the posts that no one votes on anymore, of course.

It's late here so I'm hoping someone can check my math. I think this is equivalent to exponential decay.
MySQL has a BIGINT max of 2^64
For simplicity, lets use 1 day as our time interval. Let n be the number of days since the site launched.
Create an integer variable. Lets call it X and start it at 0
If an add operation would bring a score over 2^64, first, update every score by dividing it by 2^n, then set X equal to n.
On every vote, add 2^(n-X) to the score.
So, mentally, this makes better sense to me using base 10. As we add things up, our number gets longer and longer. We stop caring about the numbers in the lower digit places because the values we're incrementing scores by have a lot of digits. Which means that the lower digits kind of stop counting for very much. So if they don't count, why not just slide the decimal place over to a point that we care about and truncate the digits below the decimal place at some point. To do this, we need to slide the decimal place over on the amount we're adding each time as well.
I can't help but feel like there's something wrong with this.

Here are two possible pseudo queries that you could use. I know that they don't really address scalability, but I think that they do provide methods so that you can
SELECT article.title AS title, SUM(vp.point) AS points
FROM article
LEFT JOIN (SELECT 1 / DATEDIFF(NOW(), vote.created_at) as point, article_id
FROM vote GROUP BY article_id) AS vp
ON vp.article_id = article.id
or (not in a join, which will be a bit faster I think, but harder to hydrate),
SELECT SUM(1 / DATEDIFF(NOW(), created_at)) AS points, article_id
FROM vote
WHERE article_id IN (...) GROUP BY article_id
The benefit of these queries is that they can be run at any time with the same data and they will always return the same answers. They don't destroy any data.
If you need to, you can also run the queries in a background job and they will still give the same result.

Related

VLOOKUP with wildcard and find Nth occurance?

I'm setting up a Google Sheet that will calculate the most effective purchase size of specific agricultural inputs (fertilizer, chemical, etc). I set up the price data in its own tab with a separate row for each input name + size.
To keep it easy for the user I'd like to require only the input name, # of gallons per acre, and acres and then have a formula spit out the total cost and most effective purchase (bulk if > X gallons, X # of 250 gallon containers + X 55 drums, etc). How can I use the input name plus a wildcard to find the appropriate purchase size?
https://docs.google.com/spreadsheets/d/1bMOPuk2qhmVuJT7vE_ni3KFxfcgKvwTwkM4p50xQF_0/edit?usp=sharing
I tried:
=ArrayFormula(iferror(INDEX('Data (Current)'!H2:H,SMALL(IF($A2&"*"='Data (Current)'!A2:A,ROW('Data (Current)'!A2:A)-1),1))))
...but it returns blank so I'm guessing the reference $A2&"*" to the input name isn't working properly. When I replace it with a string found in the 'Data (Current)' tab then it works fine.
=ArrayFormula(iferror(INDEX('Data (Current)'!H2:H,SMALL(IF($A2&"*"='Data (Current)'!A2:A,ROW('Data (Current)'!A2:A)-1),1))))
I expected the output to be the smallest value (in this case I think it's 5). Then when I change the last number to 2 or 3 it will find the next smallest value, in this case, 55 or 250. Then I can use simple formulas to interact with that and finish the spreadsheet.
Unfortunately, the actual output is nothing, or "".
Sorry if this isn't what you're looking for, as I had some trouble understanding your question.
Presuming what you want is essentially this:
I want to buy Y quantity of item.
I can buy item at cheaper prices if I buy in higher quantities, although sometimes they have a minimum order quantity.
What is the most optimal combination of the options I have to minimize the price I pay?
I'm unsure if there's a simple solution for this within Google Sheets alone. This might be treading more into Apps Script territory.
However, that's not to say that it's not impossible. I've "brute-forced" the above solution above with an iterative-like approach, for the "Chelated Calcium" product: https://docs.google.com/spreadsheets/d/1YSBiSx0IMr4T0R11Dqb-tqOhH4AOTTAWeH2yQfT4X5w
First, list the data in a standardized manner. This includes giving each same product something easy to look it up by. For example, on the Data (Current) tab, I've added 3 columns:
Product Common Name - This is used so that all items of different quantities can be found easily, without needing wildcards.
Gallons - Much easier to parse the data if it it's explicitly laid out.
Minimum Order Gallons - This is your threshold for Bulk. I've set it at an arbitrary 20,000 gallons for Chelated Calcium.
The data here is ordered least-effective first. How you do this will be up to you. In this case, I sorted by the Retail Cost Per Ounce parameter from your sheet, highest first. This eliminates any guesswork about which of the options are most effective, since you can just traverse your options in order. Note: The way I've laid out the formulas will only work IFF the same products are directly next to each other. It won't work if there are other products between them.
On the Field Level Tool tab, standardize your inputs to the Gallons unit. I do this in Total Gallons Needed column (I multiply anything with a "GAL" with 1, and "QUART" with 0.25).
For each item, determine the row numbers where the product begins and ends. This is marked by columns L (Least Efficient Index) and M (Most Efficient Index). I got these results by using the MATCH function.
Set up the iterations, from 0 to N-1. On this sheet, I've set up N=5 iterations, which means that it can traverse 5 different options of the same product only. Since Chelated Calcium only has 4 different options (5 Gal, 30 Gal, 250 Gal, Bulk), 5 is more than enough for this product. If you have products with more options, you may want to have more iterations.
The iterations are on the right side of the Field Level Tool tab.
In your case, you might want to put it on a different tab since the place I put it makes the file look very messy.
In each iteration, I perform the following steps:
To Fulfill - How many gallons still need to be purchased by this iteration?
ThisIndex - What is the row number of this iteration? This is determined by Most Efficient Index - Iteration Number. Remember that since we sorted in order of ascending efficiency, this means that the iteration starts with the most efficient option it can find first. There is a check to make sure that it only outputs a value if it is between the range [Least Efficient Index, Most Efficient Index]. Otherwise, it will be blank to avoid miscalculations by intruding into another product in the Data (Current) tab.
Retail Price, Minimum Gals, Gallons per Order - Simple data extraction for easy usage in the iteration, using INDEX (and indirectly, MATCH by virtue of ThisIndex).
Order - This formula does a couple of things, outlined below:
It checks whether there still remains a valid choice of product at this iteration. It does this by checking whether ThisIndex still exists. If the product doesn't exist, then it will be nulled. This is accomplished by using the IF function.
It will determine if there is a minimum threshold that must be met to purchase this choice. You can see in the 0th iteration, for example, that there is a minimum quantity of 20,000 gallons. If To Fulfill quantity is greater than or equal to the threshold OR there is no threshold, then a purchase is quantified by this column. The mathematics are simply to divide the To Fulfill amount by the Gallons per Order amount to determine the number of orders of this particular product choice. If there is a threshold but the To Fulfill amount doesn't meet it, then this iteration is skipped with a 0 order value.
If the item is already on its least efficient choice (ThisIndex == Least Efficient Index), it will do a CEILING function to ensure that the order is fulfilled. If not, it will do a FLOOR function instead. This is because you cannot order 3.5 units of an item, so they have to be rounded either up or down.
Expenditure - This is simply Order multiplied by the Retail Price, or how much money you spend in this iteration.
Remaining - How much of the product is left unfulfilled at the end of this iteration, to be used as To Fulfill for the next iteration.
Note: If you see formulas that are of the form =IF(ThisIndex, [calculations_here],), that is simply a check to nullify that calculation if ThisIndex is invalid.
Copy the iterations as many times as you want to the right. Something nice to do is to force the iterations to do a CEILING on the very last one to ensure that you never under-buy.
Generate a user-readable string for the purchase suggestion. You can see this on the Suggested Purchase column.
Calculate the Gallons Bought with a simple SUMPRODUCT over all the iterations.
Calculate the total expenditure with a simple SUM over all the iterations.
I hope this is what you were looking for. Regardless, it's at least a fun exercise on how much you can abuse Sheets. ;)

How to calculate the number of consecutive days from Core Data data?

I have some payments(income and expense) which are added to Core Data and every day I calculate the total and also I show how many consecutive days the payments total was positive.
How I am doing right now is always get the total for previous day and increase a counter if its positive. This counter is saved using UserDefaults.
My issue is when lets say the app is deleted and reinstall, the counter is lost, so I am trying to find a way calculate it dynamically every time, but I don't think reading all payments for all days is a good idea in terms of memory.
Another solution is maybe save it using Keychain ?
Is there any other more elegant method? I don't really like the idea of saving this counter.
As an answer
Assuming you reset the counter if the balance goes negative, you just need to load the balance in reverse date order, and count until you go negative?
Depending on how many records you have it may not be performant to read it all in one go. If that's the case read in batches of a manageable size (50 days, perhaps) and only get more data if you are still recording positive balances.
At some point, of course, you may just return "more than 100 days" as a valid response :-)

Prepping Data For Usage Clustering

Dataset: I'm given the number of minutes individual customers use a product each day and am trying to cluster this data in order to find common usage patterns.
My question: How can I format the data so that, for example, a power user with high levels of use for a year looks the same as a different power user who has only been able to use the device for a month before I ended data collection?
So far I've turned each customer into an array where each cell is the number of minutes used that day. This array starts when the user first uses the product and ends after the user's first year of use. All entries in the cells must be double values (e.x. 200.0 minutes used) for the clustering model. I've considered either setting all cells/days after the last day of data collection to either -1.0 or NULL. Are either of these a valid approach? If not what would you suggest?
For the problem where you want both users (one that used the product a lot every day for a year, and the other used it a lot for one month), create a new entry where it's values are:
avg_usage per time_bin
time_bin can be a month, a day or another time bin which best fits your needs.
This way, a user which use a product, let's say 200 minutes per day for one year, will get:
200 * 30 * 12 / 12 = 6000 minutes per month
and the other user, which joined just last month, will also get, with the exact same usage will get:
200 * 30 * 1 / 1 = 6000 minutes per month.
This way, it doesn't matter when you have started to use the product, the only thing that matter, is the usage rate.
An important thing you might take into consideration, that products, may be forgotten for some time. for example, a computer, and I'm away for a vacation. Those days I didn't use my computer, doesn't have (maybe) an effect of my general usage of this product. So, based on your data, product and intuition you might consider removing gaps like the one I mentioned, and not take it into account inside the calculation.
The amount of time a user has used your product could be a signal of something, but if indeed he only started some time ago, and still using it until today, it may be something you need to take into consideration, and for that use, this average binning technique may help.

how to find how the most popular post

List item
each day I want to find the "most popular" post on the website and feature it on the home page.
For each post, I'm keeping track of how many times it has been "liked", "disliked", "favorited" and "viewed".
I would like to run a daily cron job where I do something like:
post = Post.order("popularity_score DESC").first
post.feature!
My question is, how should I compute the value of popularity_score?
Is there a formula that takes into consideration "statistical significance"? Meaning, a post which has 1 "like" vote and nothing else, although having a 100% approval rating, it shouldn't mean much because only one person voted on it.
In general I have these loose ideas off the top of my head:
a post with 10 likes and no other votes is more popular than a
post with 1 like vote.
a post post with more "dislikes" than
"likes" should have a lower score than a post with more "likes" than
"dislikes"
a post with 20 views and no other votes is more
popular than a post with 3 views.
I've punched in some arbitrary formulas to try to satisfy this goal, but there are exactly that, arbitrary and I don't really know if there is a better way to go about this?
Suggestions?
Maybe you could just take the SO approach? it seems rather decent.
+ gives 10 points
- substracts 2 points
view add a low number, like 0.01 point
comment add 2 points
One suggestion is to not reset your counter each day (that leaves the "most popular" open to a single vote).
Instead, weight the votes by their age -- newer votes count more than older votes. This will give you gradual and meaningful rerankings over time.

How to calculate user's info completeness in Rails

I have problem in my new rails project.I want to implement a function which can show the user's info completeness by a bar like Linkedin.
I think I can use a variable to record the completeness,but I don't have any idea about how to calculate it.
P.S I have two Model,one is the User Model,another is the Info Model.
This is, in fact, completely arbitrary. It's based entirely on which activities on the site you want to encourage.
A couple of mechanisms you can consider:
Model "accomplishments" with a completed/not completed status. Count up the ones you care about. Store the accomplishments based on activity either as they happen or at the end of the day in some batch job. For each user, calculate the percentage with the usual math (accomplishments completed/sum of available accomplishments) * 100 = percentage.
A variation of the same, but weighted based on what you consider more valuable contributions. In this case, the math is basically sum of (weight n * accomplishment n)/total weight.
The previous Careers.stackoverflow.com model made a geeky joke about Spinal Tap by making it possible to have counts greater than 100%. You can do that simply by undercounting the maximum accomplishments.

Resources