Frequency or count for PCA - machine-learning

I have a number of observations that are counts of certain events occurring for a given user. For example:
        login_count  logout_count
user1   5            2
user2   20           10
user3   34           5
I would like to feed these variables, along with a number of other ones, into PCA. I'm just wondering if I should work with the counts directly (and scale the columns) or work with percentages (and scale the columns after), e.g.:
        login_count  logout_count
user1   0.71         0.28
user2   0.66         0.33
user3   0.87         0.13
which one would be a better way of representing the data?
thanks

It depends on the information you want to extract from the data.
If the relationship is a correlation of the form login = p * logout, then I would go with the first one (the raw counts).
The other one is a little bit weird, since you would expect a login 100% of the time (how else would you know it's user1?) and a logout only some of the time (here about 28%). You also have the dependency 1 - login_percent_i = logout_percent_i, which gives you a perfect (negative) correlation between the two columns both before and after the preprocessing.
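If it helps, here is a minimal sketch of the two representations side by side, assuming pandas and scikit-learn (the column names mirror the example above; this is only an illustration, not a recommendation):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Counts from the example above.
counts = pd.DataFrame(
    {"login_count": [5, 20, 34], "logout_count": [2, 10, 5]},
    index=["user1", "user2", "user3"],
)

# Option 1: scale the raw counts column-wise, then run PCA.
pca_counts = PCA(n_components=2).fit(StandardScaler().fit_transform(counts))

# Option 2: convert each row to shares of its row total, then scale and run PCA.
shares = counts.div(counts.sum(axis=1), axis=0)
pca_shares = PCA(n_components=2).fit(StandardScaler().fit_transform(shares))

print(pca_counts.explained_variance_ratio_)
print(pca_shares.explained_variance_ratio_)  # second component is ~0: the shares sum to 1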

Related

Want to rank players for each session based on number of wins, games, and then points

I want to automatically decide who's 1st, 2nd, 3rd, and 4th for each session. I want it to be decided based on the number of match wins first, then game wins if matches are equal, and then points if games are equal. If everything is equal, then have 2 players as the same position and the next one skipped (1st, 2nd, 2nd, and 4th). The highlighted cells are where I want to calculate this. Can someone please help me with this?
Sample file:
https://docs.google.com/spreadsheets/d/1Ry3BAqXF4Di5lHGlY_roDyzQz7FmTaP1XvhO6Psm4sU/edit#gid=0
I have looked online for something, but have been unsuccessful in finding a formula for 3 columns.
I don't think there's an exact formula for this. Obviously SORT or SORTN would sort them correctly, but not return a ranking with equals. What I thought of is to MAP the ranking of each column's values and sum them with different "weights": *100 for the first, *10 for the second, and *1 for the third:
=MAP(B2:B5, C2:C5, D2:D5, LAMBDA(a,b,c,RANK(a,B2:B5)*100+RANK(b,C2:C5)*10+RANK(c,D2:D5)*1))
That returns something like this:
Then I wrapped it in LAMBDA so it can be used as the new source of the ranking, and calculated it with the help of BYROW. But now the ranking needs to be done in reverse order: the greater composite values correspond to the players at the bottom of the standings and vice versa. That's why the 1 in RANK(e,r,1):
=LAMBDA(r,BYROW(r,LAMBDA(e,RANK(e,r,1))))(MAP(B2:B5, C2:C5, D2:D5, LAMBDA(a,b,c,RANK(a,B2:B5)*100+RANK(b,C2:C5)*10+RANK(c,D2:D5)*1)))
Let me know!
NOTE: This works great for up to 9 players. If you or anyone else needs it for more, replace 100, 10, 1 with (the number of players + 1) raised to the power of the category's "priority", the first one the highest. For example, for 15 players I would multiply the rankings by 16^3, 16^2 and 16^1. For 20 players and 4 categories: 21^4, 21^3, 21^2, 21^1. Hope that makes it clear how to generalize!
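If it's useful, here is a quick way to convince yourself the weighting works, sketched in Python outside of Sheets (the player count and the random ranks are made up):

# Sanity check of the weighting idea: with a base larger than the number of
# players, the weighted sum of the per-category ranks orders players exactly
# like comparing the ranks category by category (lexicographically).
import random

players = 15
base = players + 1  # the "amount of players + 1" from the note above

# Random per-category ranks (1 = best) for the three categories.
ranks = [(random.randint(1, players), random.randint(1, players), random.randint(1, players))
         for _ in range(players)]

def composite(r):
    # Mirrors the Sheets formula: first category weighted heaviest.
    return r[0] * base**3 + r[1] * base**2 + r[2] * base**1

# Both orderings agree, which is why one composite number per player is enough.
assert sorted(ranks) == sorted(ranks, key=composite)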

Create a Time-Quality Google Sheets Diagram

So in my current project, I am analyzing different ML models based on their quality. Right now, I'd like to put the quality in the context of the time a model needs to train. I track quality using an F1 score and I also log the time needed. I've been researching the best way to define some kind of time-quality ratio, but I am unsure how to get there.
I've been thinking of creating a chart that has the F1 scores on the y-axis and the time needed on the x-axis (or the other way around, I don't mind either, but this seems to make the most sense), but I struggle to set that up in Google Sheets. My table currently looks something like this (all values are imagined and could vary):
First Dataset   Time (in min)   Quality (F1 score)
Iteration 1     5               0
Iteration 2     8               0.1
Iteration 3     11              0.2
Iteration 4     21              0.5
Iteration 5     20              0.8
Iteration 6     21              1
And I'd like a chart (this one was manually created in GeoGebra) similar to this:
I'm aware I can manually pick my x-axis but was wondering what the best way would be to achieve this - if at all.
you can try a Line chart like this:
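For reference, the same chart sketched outside of Sheets with matplotlib (purely illustrative, using the sample values from the question):

# The same data plotted outside of Sheets: time on the x-axis, F1 on the y-axis.
import matplotlib.pyplot as plt

time_min = [5, 8, 11, 21, 20, 21]
f1_score = [0, 0.1, 0.2, 0.5, 0.8, 1]

plt.plot(time_min, f1_score, marker="o")
plt.xlabel("Time (in min)")
plt.ylabel("Quality (F1 score)")
plt.title("Time-quality trade-off, first dataset")
plt.show()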

Predicting next 4 quater customer count based on last 3 years quarterly customer count

I am currently working on a project where I need to predict the next 4 quarters' customer count for a retail client, based on the previous customer counts from the last three years, i.e. quarterly data, 12 data points in total. Please suggest the best approach to predict the customer count for the next 4 quarters.
Note: I can't share the data, but the customer count has a declining trend YoY.
Please let me know if more information is required or the question is not clear.
With only 12 data points you would be hard-pushed to justify anything more than a simple regression analysis.
If the declining trend was so strong that you were at risk of passing below 0, you could look at taking a log to linearise the data.
If there is a strong seasonal cycle you will need to factor that in, but doing so also reduces the effective sample size from 12 to 9 quarters of data (three degrees of freedom being used up by the seasonalisation).
That's about it, really.
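To make that concrete, a rough numpy sketch with 12 made-up quarterly counts (linear trend plus quarter dummies, fit on the log scale; purely illustrative):

import numpy as np

counts = np.array([980, 1015, 940, 900, 870, 905, 835, 800, 760, 790, 730, 700])
t = np.arange(12)          # time index, one step per quarter
quarter = t % 4            # seasonal label 0..3

# Design matrix: intercept, trend, and 3 quarter dummies (quarter 0 is the baseline).
X = np.column_stack([np.ones(12), t] + [(quarter == q).astype(float) for q in (1, 2, 3)])
beta, *_ = np.linalg.lstsq(X, np.log(counts), rcond=None)

# Forecast the next 4 quarters by extending the same design matrix.
t_new = np.arange(12, 16)
q_new = t_new % 4
X_new = np.column_stack([np.ones(4), t_new] + [(q_new == q).astype(float) for q in (1, 2, 3)])
print(np.exp(X_new @ beta))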
You don't specify explicitly how far into the future you want to make your predictions; rather, you do that implicitly when you make sure your model is robust and does not over-fit.
What does that mean?
Make sure that the distribution of your labels together with your available independent variables is similar to what you expect in the future. You can't expect your model to learn patterns that were not there in the first place. So variables that carry information about the distinct customer count values 4 quarters into the future are what you want to include.

Click revenue prediction model

I'm trying to build a model for eCommerce that would predict revenue for a single click that comes via online-marketing channels (e.g. Google Shopping). Clicks are aimed at product detail pages, so my training data consists of product details like price, delivery time, category and manufacturer. Every historical click also has revenue attached to it. The problem is that revenue equals zero for more than 95% of clicks.
Historical data would look like this:
click_id | manufacturer | category | delivery_time | price | revenue
1        | man1         | cat1     | 24            | 100   | 0
2        | man1         | cat1     | 24            | 100   | 0
3        | man1         | cat1     | 24            | 100   | 0
4        | man1         | cat1     | 24            | 100   | 120
5        | man2         | cat2     | 48            | 200   | 0
As you can see, it's possible (and common) that two data points have exactly the same features and very different values of the target variable (revenue), e.g. the first 4 data points have the same features and only the 4th has revenue. Ideally, on a test example with those same features, my model would predict the average revenue over those 4 clicks (which is 30).
My question is about data representation before I try to apply a model. I believe I have two choices:
Apply regression directly to the click data (as in the case above) and hope that the regression does the right thing. In this case the regression error would be pretty big in the end, so it would be hard to tell how good the model actually is.
Try to group multiple data points (clicks) into one single point to avoid some of the zeros: group all data points that have the same features and calculate the target (revenue) variable as SUM(revenue)/COUNT(clicks). With this approach I still have a lot of zeros in revenue (products that got only a few clicks), and sometimes there will be thousands of clicks that give you only one data point, which doesn't seem right.
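For option 2, the grouping I have in mind is roughly this (a pandas sketch with the columns from the sample above):

import pandas as pd

clicks = pd.DataFrame({
    "click_id": [1, 2, 3, 4, 5],
    "manufacturer": ["man1", "man1", "man1", "man1", "man2"],
    "category": ["cat1", "cat1", "cat1", "cat1", "cat2"],
    "delivery_time": [24, 24, 24, 24, 48],
    "price": [100, 100, 100, 100, 200],
    "revenue": [0, 0, 0, 120, 0],
})

features = ["manufacturer", "category", "delivery_time", "price"]
grouped = (clicks.groupby(features)
                 .agg(clicks=("click_id", "count"), revenue=("revenue", "sum"))
                 .reset_index())
grouped["revenue_per_click"] = grouped["revenue"] / grouped["clicks"]
print(grouped)  # the man1/cat1 row gets 120 / 4 = 30, the man2/cat2 row gets 0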
Any advice on how to proceed with this problem is very welcome.
With 95% of your data having zero revenue, you may need to do something about the records, such as sampling. As currently constructed, your model could predict "no" 100% of the time and still be 95% accurate. You need to make a design choice about what type of error you'd like to have in your model. Would you like it to be "as accurate as possible" overall, in that it misses the fewest possible records, to miss as few revenue records as possible, or to avoid incorrectly classifying records as revenue when they actually aren't? (Read more on Type 1 and Type 2 errors if you're curious.)
There are a couple high level choices you could make:
1) You could over-sample your data. If you have a lot of records and want to make certain that you capture the revenue-generating features, you can either duplicate those records or do some record engineering to create "fake" records that are very similar to those that generate revenue. This will increase the likelihood that your model catches on to what is driving revenue, but it will also make it overly likely to value those features when you apply it to real data (a rough sketch of this follows after the list).
2) You could use a model to predict probabilities, and then scale your probabilities. For example, you may look at your model and say that anything with a greater than 25% likelihood of being revenue-generating is actually a "positive" case.
3) You can try and cluster the data first, as you mentioned above, and try and run a classification algorithm on the "summed" values, rather than the individual records.
4) Are there some segments that hit with >5% likelihood? Maybe build a model on those subsets.
These are all model design choices and there is no right/wrong answer - it just depends on what you are trying to accomplish.
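For instance, option 1 could look roughly like this (a sketch assuming pandas and scikit-learn, with the positive class defined as revenue > 0):

# Sketch of option 1: duplicate the rare revenue-generating clicks so the
# training set is balanced before fitting a classifier.
import pandas as pd
from sklearn.utils import resample

def balance(train: pd.DataFrame) -> pd.DataFrame:
    positives = train[train["revenue"] > 0]
    negatives = train[train["revenue"] == 0]
    # Over-sample the positives (with replacement) up to the size of the negatives.
    boosted = resample(positives, replace=True, n_samples=len(negatives), random_state=0)
    # Shuffle so the duplicated rows are not all at the end.
    return pd.concat([negatives, boosted]).sample(frac=1, random_state=0)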
Edited per your response
Regression can be significantly impacted by outliers, so I would be a bit careful just trying to use a regression to predict the dollar amounts. It's very likely that the majority of your variables will have small coefficients, and the intercept will reflect the average spend. The other thing you should keep in mind is interaction terms. For example, you may be more likely to buy if you're male, and more likely if you're age 25-30, but being BOTH male and 25-30 has an outsized effect.
The reason I brought up classification was that you could first run a classification to see who is likely to buy, and then afterwards apply dollar amounts. That approach would prevent you from predicting essentially the same very small amount for every transaction.
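One way that two-stage idea could be sketched (the model choices and the dummy encoding here are just placeholders, not a recipe):

# Two-stage sketch: P(revenue > 0) from a classifier, multiplied by the
# expected revenue from a regressor fit only on the revenue-generating clicks.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

def fit_two_stage(train: pd.DataFrame, feature_cols):
    X = pd.get_dummies(train[feature_cols])
    bought = train["revenue"] > 0

    clf = GradientBoostingClassifier().fit(X, bought)
    reg = GradientBoostingRegressor().fit(X[bought], train.loc[bought, "revenue"])
    columns = X.columns

    def predict(new: pd.DataFrame) -> pd.Series:
        X_new = pd.get_dummies(new[feature_cols]).reindex(columns=columns, fill_value=0)
        expected = clf.predict_proba(X_new)[:, 1] * reg.predict(X_new)
        return pd.Series(expected, index=new.index)

    return predict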

Scalable time decay for web application

My goal here is to generate a system similar to that of the front page of reddit.
I have things, and for the sake of simplicity these things have votes. The best system I've generated uses time decay. With a half-life of 7 days, if a vote is worth 20 points today, then in seven days it is worth 10 points, and in 14 days it will only be worth 5 points.
The problem is, that while this produces results I am very happy with, it doesn't scale. Every vote requires me to effectively recompute the value of every other vote.
So, I thought I might be able to reverse the idea. A vote today is worth 1 point. A vote seven days from now is worth 2 points, and 14 days from now is worth 4 points and so on. This works well because for each vote, I only have to update one row. The problem is that by the end of the year, I need a datatype that can hold fantastically huge numbers.
So, I tried using a linear growth which produced terrible rankings. I tried polynomial growth (squaring and cubing the number of days since site launch and submission) and it produced slightly better results. However, as I get slightly better results, I'm quickly re-approaching unmaintainable numbers.
So, I come to you stackoverflow. Who's got a genius idea or link to an idea on how to model this system so it scales well for a web application.
I've been trying to do this as well. I found what looks like a solution, but unfortunately, I forgot how to do math, so I'm having trouble understanding it.
The idea is to store the log of your score and sort by that, so the numbers won't overflow.
This doc describes the math.
https://docs.google.com/View?id=dg7jwgdn_8cd9bprdr
And the comment where I found it is here:
http://blog.notdot.net/2009/12/Most-popular-metrics-in-App-Engine#comment-25910828
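In case it helps, here's the gist of the log-score idea as I understand it (a sketch only; the 7-day half-life matches the question, everything else is made up):

# Each vote at time t is worth 2 ** (t / HALF_LIFE) points; we only ever store
# log2 of the running total, so the stored number stays small. Sorting by this
# log-score gives the same order as sorting by the (astronomically large) raw
# score, i.e. the same order as a 7-day-half-life exponential decay.
import math
import time

HALF_LIFE = 7 * 24 * 3600  # seconds

def add_vote(log2_score, vote_time=None):
    t = time.time() if vote_time is None else vote_time
    vote_log2 = t / HALF_LIFE              # log2 of this vote's weight
    if log2_score is None:                 # first vote on this item
        return vote_log2
    # log2(2**a + 2**b), computed without ever forming the huge values.
    hi, lo = max(log2_score, vote_log2), min(log2_score, vote_log2)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))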
Okay, I thought of one solution to do this on every vote. The catch is that it requires a linked list with atomic pop/push on both sides to store the votes (e.g. a Redis list, though you probably don't want it in RAM).
It also requires that the decay interval is constant (e.g. 1 hour).
It goes like this:
1) On every vote, update the score and push the next decay time of this vote to the tail of the list.
2) Then pop the first vote from the head of the list.
3) If it's not old enough to decay yet, push it back to the head.
4) Otherwise, subtract the required amount from the total score and push the updated entry to the tail.
5) Repeat from step 2 until you hit a fresh enough vote (step 3).
You'll still have to check the heads in the background to clear the posts that no one votes on anymore, of course.
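A toy in-memory version of that loop, just to show the bookkeeping (a plain deque stands in for the Redis list; the interval and per-vote points are made up):

# Each entry is (next_decay_time, points still carried by that vote).
from collections import deque
import time

DECAY_INTERVAL = 3600   # seconds between halvings of a single vote
votes = deque()         # head = the vote whose decay is due soonest
score = 0.0

def add_vote(points=20.0, now=None):
    global score
    now = time.time() if now is None else now
    score += points
    votes.append((now + DECAY_INTERVAL, points))   # step 1: push to the tail
    decay(now)

def decay(now):
    global score
    while votes:
        due, points = votes.popleft()              # step 2: pop the head
        if due > now:                              # step 3: too fresh, put it back
            votes.appendleft((due, points))
            break
        halved = points / 2.0                      # step 4: decay and re-queue
        score -= points - halved
        votes.append((due + DECAY_INTERVAL, halved))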
It's late here so I'm hoping someone can check my math. I think this is equivalent to exponential decay.
MySQL's unsigned BIGINT maxes out at 2^64 - 1.
For simplicity, let's use 1 day as our time interval. Let n be the number of days since the site launched.
Create an integer variable. Let's call it X and start it at 0.
If an add operation would bring a score over the maximum, first update every score by dividing it by 2^(n - X) (so by 2^n the first time, while X is still 0), then set X equal to n.
On every vote, add 2^(n-X) to the score.
So, mentally, this makes better sense to me using base 10. As we add things up, our number gets longer and longer. We stop caring about the numbers in the lower digit places because the values we're incrementing scores by have a lot of digits. Which means that the lower digits kind of stop counting for very much. So if they don't count, why not just slide the decimal place over to a point that we care about and truncate the digits below the decimal place at some point. To do this, we need to slide the decimal place over on the amount we're adding each time as well.
I can't help but feel like there's something wrong with this.
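Here's a quick numeric sanity check of the equivalence (numbers kept small so the floating-point version stays exact; purely illustrative):

# The "growing weight" score (2**day per vote) ranks items exactly like an
# explicit 1-day-half-life decay evaluated today.
import random

today = 40                                     # days since launch
items = [[random.randint(0, today) for _ in range(random.randint(1, 30))]
         for _ in range(5)]                    # each item = the days its votes arrived

def growing_score(days):
    return sum(2 ** d for d in days)           # the huge number the tricks above avoid storing

def decayed_score(days):
    return sum(0.5 ** (today - d) for d in days)

by_growing = sorted(range(len(items)), key=lambda i: growing_score(items[i]), reverse=True)
by_decayed = sorted(range(len(items)), key=lambda i: decayed_score(items[i]), reverse=True)
print(by_growing == by_decayed)                # True: the two rankings agree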
Here are two possible pseudo-queries that you could use. I know that they don't really address scalability, but I think that they do provide a way to compute the decayed scores on demand.
SELECT article.title AS title, SUM(vp.point) AS points
FROM article
LEFT JOIN (SELECT article_id,
                  1 / (DATEDIFF(NOW(), created_at) + 1) AS point  -- +1 so votes cast today don't divide by zero
           FROM vote) AS vp
  ON vp.article_id = article.id
GROUP BY article.id, article.title
or (not in a join, which will be a bit faster I think, but harder to hydrate),
SELECT SUM(1 / (DATEDIFF(NOW(), created_at) + 1)) AS points, article_id
FROM vote
WHERE article_id IN (...) GROUP BY article_id
The benefit of these queries is that they can be run at any time with the same data and they will always return the same answers. They don't destroy any data.
If you need to, you can also run the queries in a background job and they will still give the same result.
