Mahout recommendation with 3 columns without preferences

I have to recommend videos to users. I have a CSV file containing userId, videoId, and productId. Under one productId there are many similar videos.
Like:
userId videoId productId
1 2 1
1 3 1
1 5 2
2 7 2
2 8 1
2 2 1
For more clarity, I am splitting it up again.
User and video relationship:
userId videoId
1 2
1 3
1 5
2 7
2 8
2 2
Consider users and videos:
As we can see, user 1 is similar to user 2 on the basis of videoId 2, so I will recommend videos 7 and 8 to user 1. Simple :)
But the twist is that the actual product and video data look like this:
videoId productId
2 1
3 1
5 2
7 2
8 1
2 1
4 1
6 1
Videos 4 and 6 also come under productId 1. So if user 1 comes and watches videoId 2, I will have to recommend 7 and 8 (on the basis of similar users) and also 4 and 6 (on the basis of similar videos under the same product, which are not present in the actual CSV).
My questions are:
1. Do I need to factorize the CSV?
2. What is the best algorithm to do this?
3. After getting the resulting videos, how do I rank them?

What do you want to recommend, product or video? Choose one and throw the other away; I don't see what use it is. The recommendations will come back ordered and with estimated preference weights.
Which version of the Mahout recommenders to use depends on how much data you have, how many users and items, and also on how often you get new preference data. All of the Mahout 0.9 recommenders can only recommend to users that have expressed preferences, and they only use the preferences that went into calculating the model.
Mahout 1.0 has a completely different mechanism that can recommend to anonymous or new users as long as you have some preference data for them; this data need not be in the model built by Mahout. This method requires a search engine like Solr or Elasticsearch.
Mahout docs: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
A presentation I put together: http://www.slideshare.net/pferrel/unified-recommender-39986309
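If you go the Mahout 0.9 route, here is a minimal sketch of a user-based recommender over boolean (no-rating) userId,videoId pairs using the Taste API. The file name, the log-likelihood similarity, the neighborhood size of 10, and the 5 recommendations requested are all assumptions, not a prescription:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class VideoRecommender {
  public static void main(String[] args) throws Exception {
    // userId,videoId lines with no rating column are read as boolean preferences.
    DataModel raw = new FileDataModel(new File("user_video.csv"));
    DataModel model = new GenericBooleanPrefDataModel(
        GenericBooleanPrefDataModel.toDataMap(raw));

    // Log-likelihood similarity works well when there are no preference values.
    UserSimilarity similarity = new LogLikelihoodSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    GenericBooleanPrefUserBasedRecommender recommender =
        new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

    // Results come back already ranked by estimated strength.
    List<RecommendedItem> recs = recommender.recommend(1L, 5);
    for (RecommendedItem item : recs) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}

This covers the ranking question for the similar-user part; the same-product videos (4 and 6) are not in the preference data, so you would merge them in yourself afterwards from the videoId-to-productId table.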

Related

Comparing or combining values in a column

Scenario and question:
Basically I have results for a matched-pair survey of couples in SPSS. It's set up so that person A's answers to questions 1-10 are the first 10 variables and person B's answers to questions 1-10 are the next 10 variables. But I need to run tests and produce crosstabs for individuals, so if I have 20 couples the crosstab outputs should be out of 40. I was able to simply select all the data for the "person B"s in couples and copy and paste it over, but I lost couple-specific data, and I still need to be able to create new variables based on the matched-pair information. My way around this was creating a new variable called CoupleNum while still in matched-pair form, so that even in individual form I could say: if their couple numbers equal each other, calculate this or that. But I don't actually know how to do this. In the same dataset, how do I compare rows for the same variable?
Example for what I'm talking about:
Here's fake data
A_CoupleNum A1_HappyScale B_CoupleNum B1_HappyScale
1 6 1 4
2 2 2 3
3 9 3 7
I'd move it to individual form like this:
CoupleNum HappyScale
1 6
2 2
3 9
1 4
2 3
3 7
And then I'd want to be able to make a new variable called CoupleHappiness that was the HappyScale for each person in the couple added together.
CoupleNum HappyScale CoupleHappiness
1 6 10
2 2 5
3 9 16
1 4 10
2 3 5
3 7 16
So essentially I'd want to code something like
if CoupleNum = CoupleNum CoupleHappiness = HappyScale + HappyScale
I know this is definitely not correct, but hopefully it gets across my point and what I'd like to do.
Potential solutions I've found that don't work / that I don't know how to make work for my needs:
Since I'm new to SPSS, I've found several things that might work, but I don't know SPSS syntax well enough to adapt them to my needs. I've noticed people mention things like LAG functions, or CREATE + LEAD if the rows were adjacent, but mine could be all over the place. Someone also mentioned using case numbers, but I don't exactly understand that.
Sorry this was a really long question but I would appreciate any help!!
What you are looking for is the AGGREGATE function. In this case you can use it this way:
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=CoupleNum /CoupleHappiness=SUM(HappyScale).
The function groups the rows by the values of CoupleNum; for each group it calculates the sum of HappyScale and puts it in a new variable called CoupleHappiness.
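If you still need to get from matched-pair form to individual form without copying and pasting, SPSS can restructure the file for you. A minimal sketch with VARSTOCASES, assuming the variable names from the fake data above:

* Stack the A and B columns into one row per person.
VARSTOCASES
  /MAKE CoupleNum FROM A_CoupleNum B_CoupleNum
  /MAKE HappyScale FROM A1_HappyScale B1_HappyScale.

After this restructure, the AGGREGATE command above adds CoupleHappiness to each person's row.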

QUERY and UNIQUE

Given a table like this one:
A B
1 TIE
1 TIE
1 TIE
2 WIN
3 TIE
3 TIE
4 LOSS
4 LOSS
I need a query that returns this in a different sheet:
A
TIE
WIN
TIE
LOSS
The actual sheet is here: Link
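A sketch of one way to do it, assuming the data sits in A2:B9 of a sheet named Sheet1: take the unique (A, B) pairs and keep only the second column.

=QUERY(UNIQUE(Sheet1!A2:B9), "select Col2", 0)

UNIQUE collapses the repeated rows to (1, TIE), (2, WIN), (3, TIE), (4, LOSS), and the QUERY then returns just TIE, WIN, TIE, LOSS.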

Star Schema Design for User Utilization Reports

Scenario: There are 3 kinds of utilization metrics that I have to derive for the users. In my application, user activity is tracked via login history, the number of customer calls made by the user, and the number of status changes performed by the user.
All this information is maintained in 3 different tables in my application DB: UserLoginHistory, CallHistory, and OrderStatusHistory. All the actions made by each user are stored in these 3 tables along with DateTime info.
Now I am trying to create a reporting DB that will help me generate the overall utilization of users. Basically the report should show me, for each user over a period:
UserName
Role
Number of Logins Made
Number of Calls Made
Number of Status updates Made
Now I am in the process of designing my fact table. How should I go about creating a fact table for this scenario? Should I create a single fact table with rows capturing all these details at the granular date level (my DimDate table level), or 3 different fact tables that I then relate?
Neither of the 2 options I described above is convincing, and I am looking for a better design. Thanks.
As a rule of thumb, when you have a report which uses different facts/metrics (number of logins made, number of calls made, number of status updates made) with the same granularity (UserName, Role, Day/Hour/Minute), you put them in the same fact table to avoid expensive joins.
For many reasons this is not always possible, but your case seems to me a bit different.
You have three tables with the user activity, where you probably store more detailed information about logins, calls and status updates. What you need for your report is a table with your metrics, with the values aggregated at the time granularity that you need.
Let's say you need the report at the day level; then you need a table like this:
Day UserID RoleID #Logins #Calls #StatusUpdate
20150101 1 1 1 5 3
20150101 2 1 4 15 8
If tomorrow the business requires the report by hour, then you will need:
DayHour UserID RoleID #Logins #Calls #StatusUpdate
20150101 10:00AM 1 1 1 2 1
20150101 11:00AM 1 1 0 3 2
20150101 09:00AM 2 1 2 10 4
20150101 10:00AM 2 1 2 5 4
The Day-level table will then be like an aggregated (by day) version of the second one; the DayHour attribute is a child of the Day one.
If you need minute-level detail, you go further down in granularity.
You can also start directly with a summary table at the minute level, but I would double-check the requirement with the business; usually a one-hour (or 15-minute) range is enough.
Then, if they need more detailed information, you can always drill down by querying your original tables. The good thing is that when you drill to that level you should have just a small set of rows to query (like just a few hours for a specific UserName), and your database should be able to handle it.
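As an illustration, here is a sketch of how the day-level fact table could be loaded from the three source tables. The table and column names (FactUserActivity, LoginTime, CallTime, ChangeTime) are assumptions to adapt to your schema:

-- One row per user per day, with the three metrics side by side.
INSERT INTO FactUserActivity (DateKey, UserID, Logins, Calls, StatusUpdates)
SELECT DateKey, UserID, SUM(Logins), SUM(Calls), SUM(StatusUpdates)
FROM (
    SELECT CAST(LoginTime AS DATE) AS DateKey, UserID,
           1 AS Logins, 0 AS Calls, 0 AS StatusUpdates
    FROM UserLoginHistory
    UNION ALL
    SELECT CAST(CallTime AS DATE), UserID, 0, 1, 0
    FROM CallHistory
    UNION ALL
    SELECT CAST(ChangeTime AS DATE), UserID, 0, 0, 1
    FROM OrderStatusHistory
) AS activity
GROUP BY DateKey, UserID;

The Role can be looked up from a user dimension at load time, or left to a join at query time.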

How can I count unique values corresponding to a break variable with the function aggregate in SPSS?

I guess it is really easy, but I just cannot find the answer myself. The variable that I would like to calculate is Number_of_brands_bought (see below). I've tried to use the aggregate function in SPSS with Respondent as the break variable and Brand as the summary variable (choosing the Count function), but it just does not give me the right answer.
Respondent Brand Number_of_brands_bought
1 1 3
1 2 3
1 3 3
1 3 3
2 1 2
2 2 2
3 1 3
3 4 3
3 5 3
Does anybody know what to do? Thanks in advance!
It's not clear from the description you have provided how the data is stored. It could be stored in one of two ways (possibly others), either:
1) Wide format
2) Long format
Hopefully this link to my Google Drive works, where I have mocked up an example of both file structure formats:
Example Data
If the data is in wide format, where you have the brands bought as individual dichotomous variables and one row per respondent, then you can simply sum the 1's indicating whether each brand was bought (assuming 0=no/1=yes coding, as opposed to the 1=yes/2=no coding which is sometimes the case):
compute Num_Brands=sum(Bought_Brand01 to Bought_Brand05).
Alternatively, given that you suggest the need for the aggregate function, perhaps you have the data in long format, i.e. one row per respondent x brand. If that is the case, then you can derive the number of brands bought using AGGREGATE. The code in SPSS would be:
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=ID /Num_Brands=sum(Bought).
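One caveat: if the same brand can appear more than once for a respondent (respondent 1 lists brand 3 twice in the example), summing an indicator will over-count. A sketch of one way to count distinct brands instead, assuming the long-format variables are named Respondent and Brand:

* Collapse to one row per respondent-brand pair, then count rows per respondent.
AGGREGATE OUTFILE=* /BREAK=Respondent Brand /n_rows=N.
AGGREGATE OUTFILE=* MODE=ADDVARIABLES /BREAK=Respondent /Number_of_brands_bought=N.

Note that the first AGGREGATE replaces the working file, so run it on a copy if you still need the raw rows.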

Frequency or count for PCA

I have a number of observations that are counts of a certain event occurring for a given user. For example:
login_count logout_count
user1 5 2
user2 20 10
user3 34 5
I would like to feed these variables, along with a number of other ones, into PCA. I'm just wondering if I should work with the counts directly (and scale the columns) or work with percentages (and scale the columns after), e.g.
login_count logout_count
user1 0.71 0.28
user2 0.66 0.33
user3 0.87 0.13
Which one would be a better way of representing the data?
Thanks!
It depends on the information you want to extract from the data.
If the relationship is login = p * logout, then I would go with the first one.
The other one is a little bit weird, since you should be doing a login 100% of the time (how would you otherwise know it's user1?) and a logout perhaps 28% of the time. You also have the dependency 1 - login_percent_i = logout_percent_i, which gives you a perfect correlation both before and after the preprocessing.
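To see the effect of the two options concretely, here is a small sketch with scikit-learn. The toy array is the data from the question; this just illustrates the preprocessing, not a verdict on either option:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# login_count, logout_count for user1..user3.
counts = np.array([[5.0, 2.0], [20.0, 10.0], [34.0, 5.0]])

# Option 1: raw counts, z-scored column by column.
X_counts = StandardScaler().fit_transform(counts)
print(PCA().fit(X_counts).explained_variance_ratio_)

# Option 2: row percentages, then z-scored. Each row sums to 1,
# so the two columns are perfectly negatively correlated and PCA
# puts all the variance in a single component.
percents = counts / counts.sum(axis=1, keepdims=True)
X_pct = StandardScaler().fit_transform(percents)
print(PCA().fit(X_pct).explained_variance_ratio_)

The second print shows an explained-variance ratio of essentially [1.0, 0.0], which is the perfect-correlation issue described above.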
