Does pyspark.ml.recommendation.ALS create a pivot table under the hood? - machine-learning

An ALS recommendation model performs a matrix factorization: it factorizes a users-by-items matrix into latent factors.
A matrix of 3 users and 3 items would look like this:
users   item_1  item_2  item_3
user_1  NA      4       1
user_2  4       3       0
user_3  NA      1       NA
My dataframe starts out like this:
users   items   rating
user_1  item_2  4
user_1  item_3  1
user_2  item_1  4
user_2  item_2  3
user_2  item_3  0
user_3  item_2  1
My question is: before passing my dataframe to the ALS module, do I need to transform it so that it ends up with a structure like this?
users   items   rating
user_1  item_1  NA
user_1  item_2  4
user_1  item_3  1
user_2  item_1  4
user_2  item_2  3
user_2  item_3  0
user_3  item_1  NA
user_3  item_2  1
user_3  item_3  NA
Or will the ml.recommendation.ALS function create, under the hood, the observations for the places without interactions? For example:
users   items   rating
user_1  item_1  NA
If it does not, one way to produce the expected table would be to pivot it and then unpivot it, but that would produce a huge users-by-items matrix. However, from the examples in the documentation, it seems that this process (pivot, then unpivot) is not necessary.

Correct: it is not necessary. ALS is fit directly on the long-format dataframe of observed (user, item, rating) rows.
After you train the ALS model, the fitted model should be used to predict the "missing interactions".
Thus, the term "fill" (in your sentence "ml.recommendation.ALS module fill those missing interactions") is not appropriate; you should use the term "predict".
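To illustrate why no pivot is needed, here is a minimal pure-Python sketch of alternating least squares with a single latent factor. It is not Spark's implementation: the function name `fit_rank1_als`, the regularization value, and the iteration count are assumptions made for this example. The point it shows is that each update only loops over the observed triples, and that missing pairs are scored afterwards from the learned factors.

```python
# Observed (user, item, rating) triples only: the long format from the question.
ratings = [
    ("user_1", "item_2", 4.0),
    ("user_1", "item_3", 1.0),
    ("user_2", "item_1", 4.0),
    ("user_2", "item_2", 3.0),
    ("user_2", "item_3", 0.0),
    ("user_3", "item_2", 1.0),
]


def fit_rank1_als(triples, reg=0.1, iterations=20):
    """Hypothetical rank-1 ALS: one scalar factor per user and per item."""
    users = sorted({u for u, _, _ in triples})
    items = sorted({i for _, i, _ in triples})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Fix the item factors and solve each user's regularized least squares.
        for u in users:
            num = sum(r * item_f[i] for uu, i, r in triples if uu == u)
            den = reg + sum(item_f[i] ** 2 for uu, i, _ in triples if uu == u)
            user_f[u] = num / den
        # Fix the user factors and solve for each item in the same way.
        for i in items:
            num = sum(r * user_f[u] for u, ii, r in triples if ii == i)
            den = reg + sum(user_f[u] ** 2 for u, ii, _ in triples if ii == i)
            item_f[i] = num / den
    return user_f, item_f


user_f, item_f = fit_rank1_als(ratings)

# The pairs with no interaction are predicted after training, never fed in as NA rows.
missing_pairs = [("user_1", "item_1"), ("user_3", "item_1"), ("user_3", "item_3")]
predictions = {(u, i): user_f[u] * item_f[i] for u, i in missing_pairs}
print(predictions)
```

With pyspark the same idea applies: the model is fit on the observed rows, and the fitted model is then asked for scores on the user/item pairs of interest.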

Related

Sort table so that one columns equals another

I have a table with two columns which, by design, contain the same values (here, colA and colB) but not in the same order. One of them (colA) is already in the order I want. I want to reorder the rest of the table, leaving colA untouched, so that colB ends up in the same order as colA.
Example:
colA colB colC
5 3 is
7 5 hello
3 7 this
4 4 dog
Desired result:
colA colB colC
5 5 hello
7 7 this
3 3 is
4 4 dog
(Notice that the values in colC (and other columns) follow those in colB, and that each value in colA and colB is unique.) I am doing this in Google Sheets.
try:
=INDEX(IFNA(VLOOKUP(A1:A4, B1:C4, {1, 2}, 0)))
or see: https://webapps.stackexchange.com/a/126631/186471
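The lookup idea behind that formula can be sketched outside Sheets as well. The Python below is only an illustration using the sample data from the question; the names `rows`, `by_col_b`, and `reordered` are made up for the example. It relies on the stated fact that every value in colB is unique.

```python
# Sample rows from the question as (colA, colB, colC).
rows = [(5, 3, "is"), (7, 5, "hello"), (3, 7, "this"), (4, 4, "dog")]

# Index the (colB, colC) part of each row by its colB value.
by_col_b = {b: (b, c) for _, b, c in rows}

# For every colA value, in its existing order, fetch the matching colB/colC pair.
reordered = [(a, *by_col_b[a]) for a, _, _ in rows]

for row in reordered:
    print(row)
```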
Insert another column before A and fill it with an index starting at 1 (1, 2, 3, 4, ...). This will be used for a final sort. Note that all the columns are now named differently than in your example, so take that into account when reading the steps below.
colA colB colC colD
1 5 3 is
2 7 5 hello
3 3 7 this
4 4 4 dog
Next, sort colC and colD (and any other columns) by colC. Sort colA and colB by colB. Now the numbers in colB and colC should match. Finally, sort everything by colA and delete colA.

How to count occurrence in previous rows based on two columns value

I'm trying to count the number of occurrences in previous rows based on two conditional values using Google Sheets.
Let's say this is my table:
Row A  Row B   Row C    Row D
1      John    Smith
2      Marty   Butler
3      John    Herbert
4      John    Smith
5      Philip  Rand
6      John    Smith
7      Marty   Butler
Is there a formula that can count those occurrences? The idea is that when I log a new name, if the Row B and Row C values already exist, the value in Row D increases by 1, so I know that it is the nth entry under that name. In my example, Row D would look like this:
Row A  Row B   Row C    Row D
1      John    Smith    1
2      Marty   Butler   1
3      John    Herbert  1
4      John    Smith    2
5      Philip  Rand     1
6      John    Smith    3
7      Marty   Butler   2
Delete everything in Column D (including the header) and place the following in D1:
=ArrayFormula({"Header";IF(B2:B="",,COUNTIFS(B2:B&C2:C,B2:B&C2:C,ROW(A2:A),"<="&ROW(A2:A)))})
The "Header" text can be edited as you like.
The COUNTIFS reads, in plain English, "Count how many times this first and last name combination has occurred within only the rows up to the current row."
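For comparison, the same running count can be sketched in Python. This is only an illustration of the logic the formula describes, using the names from the question; the variable names are assumptions for the example.

```python
from collections import Counter

# (first name, last name) pairs in the order they were logged.
names = [
    ("John", "Smith"),
    ("Marty", "Butler"),
    ("John", "Herbert"),
    ("John", "Smith"),
    ("Philip", "Rand"),
    ("John", "Smith"),
    ("Marty", "Butler"),
]

seen = Counter()
occurrence = []
for name in names:
    seen[name] += 1                # how many times this pair has appeared so far
    occurrence.append(seen[name])  # the value that would go in Row D

print(occurrence)  # [1, 1, 1, 2, 1, 3, 2]
```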

How to iterate through a function in google sheets?

On one sheet I have a table of statistics similar to this:
A B C D
1 Teams MP GF GA
2 Team A 3 3 2
3 Team B 2 1 3
4 Team C 3 5 2
5 Team D 2 2 1
I then have some formulas that calculate an expected score between two teams set up like this:
A B C D E
7 Teams GF/G GA/G Avg Exp Score
8 Team 1 =VLOOKUP(A8,$A$1:$D$5,3)/VLOOKUP(A8,$A$1:$D$5,2) =VLOOKUP(...) =AVERAGE(...) =B8-C9+D8
9 Team 2 =VLOOKUP(...) =VLOOKUP(...) =AVERAGE(...) =B9-C8+D9
I then have a separate sheet that has the matchups between teams like this:
A B C
1 Date Matchup Exp Score
2 11/15 Team D =FORMULA(
3 11/15 Team B =FORMULA(
4 11/16 Team C =FORMULA(
5 11/16 Team A =FORMULA(
6 11/17 Team B =FORMULA(
7 11/17 Team C =FORMULA(
8 11/17 Team D =FORMULA(
9 11/17 Team A =FORMULA(
My question is if there is some kind of formula that can take the teams in the matchup, copy and paste them behind the scenes into cells A8 and A9, and spit out the Exp Score that would generate in E8 and E9. Is this something that is possible to do in Google Sheets or does it have to be manually copied and pasted into the cells and then copy and paste the results to where I want them?
I've put your formulas together, and come up with the following result, but I think possibly your logic for the average is a little bit off.
Should it not be:
=SUM(C2:C5) / (SUM(B2:B5) / 2)
So the sum of all the goals scored divided by the total number of matches (the number of times any team played, divided by 2)? This gives the average goals per game, and then your other formulas add a positive delta to the team with the stronger GF/G, and a negative delta to the team with the weaker GF/G.
Also, your data may not be valid. Shouldn't the total number of goals scored BY all teams, also equal the total number of goals scored AGAINST all teams? So the sum of column C must equal the sum of column D? I therefore changed the numbers in column D slightly.
The result then for your data looks like this:
where the formula in E1 is:
=ArrayFormula({"GF/G";C2:C5/$B2:$B5})
and in G1 is:
=ArrayFormula({"Avg Goals/G";SUM($C$2:$C$5) / (SUM(B2:B5)/2) })
Adding in your matchups and projected scores, I get this:
where the projected scores for all the teams in column I are given by this formula, in K2:
=ArrayFormula(vlookup(I2:I7,$A$2:$G$5,5)
- vlookup(J2:J7,$A$2:$G$5,5)
+ $G$2/2)
Note that I've duplicated columns K:M in columns N:P, but shown with a decimal place to show the average goals per game still equals 2.2, but with rounding adjustments (no fractions of a goal) it doesn't always work out right.
Here is my sample sheet.
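The arithmetic behind these formulas can be sketched in Python. This is a rough illustration only: it uses the MP and GF columns from the question (the adjusted GA values are not shown, so they are left out), the function name `projected_score` is made up, and pairing Team D with Team B is an assumption based on the first two rows of the matchup sheet.

```python
# Team -> (matches played, goals for), taken from the question's table.
stats = {
    "Team A": (3, 3),
    "Team B": (2, 1),
    "Team C": (3, 5),
    "Team D": (2, 2),
}

total_goals = sum(gf for _, gf in stats.values())
# Each match involves two teams, so halve the summed matches played.
total_matches = sum(mp for mp, _ in stats.values()) / 2
avg_goals_per_game = total_goals / total_matches

gf_per_game = {team: gf / mp for team, (mp, gf) in stats.items()}


def projected_score(team, opponent):
    """Hypothetical helper mirroring the lookup formula given for K2."""
    return gf_per_game[team] - gf_per_game[opponent] + avg_goals_per_game / 2


# Assumed matchup: Team D against Team B.
print(projected_score("Team D", "Team B"), projected_score("Team B", "Team D"))
```

The two projected scores of a matchup always sum to the average goals per game, because the GF/G difference cancels out.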

Having trouble dealing with NA values in YEAR column

I was trying to clean a housing dataset to build a model. I was stuck on a step where I had NA values in GarageYrBlt column. The house doesn't have a garage and thus the GarageYrBlt column has NA in it. How should I handle them?
Here's my dataset:
Id GarageType GarageYrBlt
1 1 Attchd 2003
2 2 Attchd 1976
3 3 Attchd 2001
4 4 Detchd 1998
5 5 Attchd 2000
6 6 No Garage NA
These are just sample rows. I have a big data set with lots of NA values.
Year can be a valuable feature for both regression and classification problems. In this case you can try label encoding the year column so that all NA values are given one code. Since this column is connected to the garage type, as you mentioned, it is better not to eliminate those rows.
Hope this is useful. Thank you.
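A minimal sketch of that suggestion in plain Python, using the sample rows from the question. Reserving code 0 for NA and numbering the known years from 1 in ascending order is an assumption of this example, not something the answer prescribes.

```python
# GarageYrBlt values from the sample rows; None stands for NA (no garage).
garage_years = [2003, 1976, 2001, 1998, 2000, None]

# Give every distinct known year its own code, in ascending order.
known_years = sorted({y for y in garage_years if y is not None})
code_of = {year: k + 1 for k, year in enumerate(known_years)}

# All NA values share one reserved code.
code_of[None] = 0

encoded = [code_of[y] for y in garage_years]
print(encoded)  # [5, 1, 4, 2, 3, 0]
```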

Mahout Boolean pref data model with multiple "purchases"

I want to obtain recommendations for the item most often purchased in orders that contain a specific item. For example, if I have this table:
user order items purchased
1 1 1,3
1 2 2,3
2 1 3
3 1 2,4
3 2 1,2,4
If I visit the page of item 2, I want item 4 as the suggested product because it is present on the rows 2,4 and 5 while item 3 is present only on row 2 (I am considering just the orders with item 2 in them). Note that item 3 is the most purchased overall, but I don't want it as the suggestion since I am looking at item 2. What kind of problem is this? Is it an item recommender? Is it doable in Mahout or should I implement it by hand? Since it is not possible to model multiple preferences for the same user and item, I have thought about converting the string user_order to a userId.
Thanks very much
Yes, this is a very simple recommender problem. I think you want to ignore 'order'. So your data is more like:
1,1
1,3
1,2
2,3
3,2
3,4
3,1
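To check the expected outcome by hand, a co-occurrence count can be sketched in a few lines of Python. This is not Mahout code: the function name `co_purchased` is made up, and the sketch treats each (user, order) pair as its own basket, as the question proposes, rather than using the per-user pairs listed above.

```python
from collections import Counter

# (user, order) -> set of purchased items, from the question's table.
orders = {
    (1, 1): {1, 3},
    (1, 2): {2, 3},
    (2, 1): {3},
    (3, 1): {2, 4},
    (3, 2): {1, 2, 4},
}


def co_purchased(target, baskets):
    """Count how often each other item appears in a basket containing target."""
    counts = Counter()
    for items in baskets.values():
        if target in items:
            counts.update(items - {target})
    return counts


counts = co_purchased(2, orders)
suggestion = counts.most_common(1)[0][0]
print(counts, suggestion)
```

On this data item 4 co-occurs with item 2 in two orders, while items 1 and 3 co-occur once each, so item 4 is the suggestion.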
