Recommendations without ratings (Azure ML) - machine-learning

I'm trying to build an experiment to create recommendations (using the Movie Ratings sample database), but without using the ratings. I simply assume that if a user has rated certain movies, then they would be interested in other movies that have been rated by users who have also rated those movies.
I can consider, for instance, that ratings are 1 (exists in the database) or 0 (does not exist), but in that case, how do I transform the initial data to reflect this?
I couldn't find any kind of examples or tutorials about this kind of scenario, and I don't really know how to proceed. Should I transform the data before injecting it into an algorithm? And/or is there any kind of specific algorithm that I should use?

If you're hoping to use the Matchbox Recommender in AML, you're correct that you need to identify some user-movie pairs that are not present in the raw dataset, and add these in with a rating of zero. (I'll assume that you have already set all of the real user-movie pairs to have a rating of one, as you described above.)
I would recommend generating some random candidate pairs and confirming their absence from the training data in an Execute R (or Python) Script module. I don't know the names of your dataset's features, but here is some pseudocode in R to do that:
library(dplyr)
df <- maml.mapInputPort(1) # input dataset of observed user-movie pairs
all_movies <- unique(df[['movie']])
all_users <- unique(df[['user']])
n <- 30 # number of random candidate pairs to start with
negative_observations <- data.frame(movie = sample(all_movies, n, replace=TRUE),
                                    user = sample(all_users, n, replace=TRUE),
                                    rating = rep(0, n))
# keep only the candidate pairs that do not appear in the observed data
acceptable_negative_observations <- anti_join(unique(negative_observations), df, by=c('movie', 'user'))
df <- rbind(df, acceptable_negative_observations)
maml.mapOutputPort("df")
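If you'd rather do this in an Execute Python Script module, here is a sketch of the same anti-join idea using pandas (the column names 'user', 'movie', and 'rating' are assumptions carried over from the R version, and the tiny inline DataFrame just stands in for your real input):

```python
import random
import pandas as pd

random.seed(0)

# Stand-in for the observed user-movie pairs, already set to rating 1.
df = pd.DataFrame({
    "user":   ["u1", "u1", "u2", "u3"],
    "movie":  ["m1", "m2", "m2", "m3"],
    "rating": [1, 1, 1, 1],
})

# Generate random candidate pairs, dropping duplicate candidates.
n = 30
candidates = pd.DataFrame({
    "user":  random.choices(df["user"].unique().tolist(), k=n),
    "movie": random.choices(df["movie"].unique().tolist(), k=n),
}).drop_duplicates()
candidates["rating"] = 0

# Anti-join: keep only candidate pairs absent from the observed data.
merged = candidates.merge(df[["user", "movie"]], on=["user", "movie"],
                          how="left", indicator=True)
negatives = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

out = pd.concat([df, negatives], ignore_index=True)
```

The `indicator=True` left merge is pandas' usual stand-in for dplyr's `anti_join`: rows tagged `left_only` exist only among the candidates.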
Alternatively, you could try a method like association rule learning which would not require you to add in the fake zero ratings. Martin Machac has posted a nice example of how to do this in R/AML in the Cortana Intelligence Gallery.

Related

Machine learning - Features contain list of values

I have a dataset that contains many features. One feature contains a list of values in a single data point. It can look like this:
A B C
1 2 [3,4,5]
So how can we handle feature C for a recommendation system? I know about one-hot encoding, but feature C doesn't have a finite set of values: C contains the ID numbers of other items, so it can grow larger and larger over time. Is there any solution for dealing with this type of feature?
From what you described, and since you mentioned a recommendation system, I would read your dataset as an example of the following: each row is a user, features A and B are, for instance, the user's personal information, and feature C is the items they bought. Naturally, feature C doesn't contain the same number of items in each row, and it can expand.
I would build two different recommendation models and combine them afterward: one for features A and B, and another for feature C.
Since feature C evolves over time, you can rebuild the model on a regular schedule (taking a snapshot of feature C) or whenever some 'event' triggers the building process. Feature C, in my example, corresponds to the user-item matrix.
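To make the "feature C becomes a user-item matrix" step concrete, here is a small pandas sketch (the data and column names are made up to mirror the example above). New item IDs appearing later simply become new columns, so the matrix can grow with the data:

```python
import pandas as pd

# Hypothetical data: feature C holds a list of item IDs per user (row).
df = pd.DataFrame({
    "user": ["u1", "u2"],
    "A": [1, 3],
    "B": [2, 4],
    "C": [[3, 4, 5], [4, 6]],
})

# Explode the list column into (user, item) pairs, then cross-tabulate
# into a 0/1 user-item matrix.
pairs = df[["user", "C"]].explode("C")
user_item = pd.crosstab(pairs["user"], pairs["C"]).clip(upper=1)
print(user_item)
```

The resulting matrix (users as rows, item IDs as columns, 1 where the user bought the item) is the usual input for collaborative-filtering methods, and taking a fresh snapshot just means re-running this step.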

Neo4j - Recursive query with relationship weights

I am building a tool that enables users to recommend sequences of online courses to other users.
Off the back of the data generated from that, I would like to generate insights on what the most recommended sequences of courses are. Here's a slice of the model:
In this graph, the numbers in green are weights that show how many people have recommended one course after the other (ex: 655 people recommend taking Stanford's ML after Intro to Prob)
The recs field in nodes is the absolute number of recommendations that course has (ex: Stanford ML has been featured by 1000 users in sequences)
What I would like to do is, starting from an end goal, find the most recommended prerequisites.
The algorithm might work something like this:
Function fancy_algo (node, graph)
    If (no prereqs OR prereq weight is very low)
        Return graph
    Get all incoming nodes
    For each incoming node subject
        MR = most recommended pre-req
        Append MR to graph
        fancy_algo(MR, graph)
The end state for someone looking to do Stanford's Machine Learning might look something like this:
Note how after "Intro to CS" we haven't included "Algebra", because its prereq weight was very low (20 out of 4567).
Is this something that can be managed with Cypher? How would I get started?
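Before reaching for Cypher, it can help to see the stopping rule in plain code. Here is a simplified Python sketch of the pseudocode above (it follows only the single most-recommended prerequisite chain rather than the whole incoming subgraph, and all course names, weights, and the 0.05 threshold are made-up assumptions; the `recs` totals play the role of the green node field):

```python
def recommended_path(graph, recs, course, threshold=0.05):
    """Walk from `course` to its most-recommended prerequisite chain.

    graph: course -> {prerequisite: edge weight (recommendation count)}
    recs:  course -> total number of recommendations featuring that course
    Stops when a course has no prerequisites, or when the best prereq's
    weight is very low relative to the course's total recommendations.
    """
    path = [course]
    while True:
        prereqs = graph.get(path[-1], {})
        if not prereqs:
            return path
        best, weight = max(prereqs.items(), key=lambda kv: kv[1])
        if weight / recs[path[-1]] < threshold:  # prereq weight is very low
            return path
        path.append(best)

graph = {
    "Stanford ML": {"Intro to Prob": 655},
    "Intro to Prob": {"Intro to CS": 300},
    "Intro to CS": {"Algebra": 20},
}
recs = {"Stanford ML": 1000, "Intro to Prob": 800,
        "Intro to CS": 4567, "Algebra": 100}

print(recommended_path(graph, recs, "Stanford ML"))
```

With these numbers, "Algebra" is excluded exactly as in the question (weight 20 against 4567 total recommendations falls under the threshold). In Cypher you could fetch candidate prerequisite edges with a variable-length pattern and their weights, but a per-hop stopping rule like this is often easier to apply client-side on the query results.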

Structure of Data with Mixed ANOVA in SPSS

I was told that I need a 3x2x2 mixed ANOVA. I am relatively new to SPSS. I was wondering if someone could explain how the data needs to be structured in SPSS; that is, how would I structure the rows and columns?
I have 3 trials of data (each trial containing hundreds of measurements), with 2 treatment conditions (0 Volts and 15 Volts) and 2 different substrates on which I grew cells (TCPS and PCSA); these are the between-subjects factors.
Please see this Infographic
I originally ran multiple t-tests but was told to do an ANOVA; here is the original plot with t-tests (it gives an idea of how I originally intended to look at the data).
As a side note, if anyone knows how to do this with Python, I am also open to doing it that way. Just based on what I have gathered, SPSS seems to be the only route.
For anyone in the future: I was not able to do this in SPSS. However, doing this in R is relatively simple. Put all the data in one column of a CSV file (all 3 trials). In my case I had 3 additional columns of identifiers (Trial, Voltage, and Substrate). Read the file into R, fit a linear model, then run the ANOVA on the model with:
model1 <- lm(NeuriteLength ~ Trial + Substrate*Voltage, data = D)
anova(model1)

Finding similar users based on String properties

I'm a software engineering student, new to data mining. I want to implement a solution to find similar users based on their interests and skills (sets of strings).
I think I cannot use k-nearest neighbors with an edit distance (Levenshtein or ..)
If someone could help with that, please do.
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skills and interest strings into a dictionary D. Now for each user u, construct a vector v(u) with n components, which has i-th component set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially we represented each user with a characteristic vector of her interests/skills.
Now you can compare users with Jaccard index (it's just an example, you'll have to figure out what works best for you). With the notion of a distance in hand, you can start trying out various approaches. Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
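As a concrete starting point, here is a minimal Python sketch of the canonicalization plus Jaccard comparison described above (the user names and property strings are made up). With sets, the characteristic vectors never need to be materialized, since the Jaccard index only uses intersections and unions:

```python
def canonical(props):
    """Canonical form: trim whitespace and lower-case each string."""
    return {p.strip().lower() for p in props}

def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

alice = canonical(["Python", "Data Mining", "ML"])
bob   = canonical(["python", "ml", "Databases"])

print(jaccard(alice, bob))  # 2 shared out of 4 distinct -> 0.5
```

Note that the Jaccard index is a similarity (1 means identical); `1 - jaccard(a, b)` is a proper distance you can feed into clustering.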

Looping through multiple maps in clojure

Ello friends,
I feel awfully silly for asking this... but, after struggling with this issue for some time now I've decided that another pair of eyes may help to illuminate my issue.
I'm trying to loop through two records and one map (I could possibly rewrite the map to be a record as well, but I have no need) simultaneously, compare some entries, and change values if the entries match. What I have is similar to this:
EDIT: Here's an attempt to specifically describe what I'm doing. However, now that I think about it perhaps this isn't the best way to go about it.
I'm attempting to create a restaurant-selection inference engine for an AI course using Clojure. I have very little experience with Clojure, but initially wanted to create a struct called "restaurant" so that I could create multiple instances of it. I read that structs in Clojure are deprecated and that records should be used instead. Both the restaurants that are read in from the text file and the user input are stored as 'restaurant' records.
I read in, from a previously sorted text file database, attributes of the restaurants in question (name, type of cuisine, rating, location, price, etc..) and then put them into a vector.
Each attribute has a weight associated with it so that when the user enters search criteria the restaurants can be sorted in order of most to least relevant based on what is most likely to be the most heavily weighted items (for example, the restaurant name is the most important item, followed by type of cuisine, then the rating, etc..). The record therefore also has a 'relevance' attribute.
(defrecord Restaurant [resturant-name cuisine
                       rating location
                       price relevance])
;stuff
;stuff
;stuff
(defn search
  [restaurants user-input]
  (def ordered-restaurants [])
  (doseq [restaurant restaurants]
    (let [restaurant-relevance-value 0]
      (doseq [input user-input
              attributes restaurant
              attribute-weight weights]
        (cond
          (= input (val attributes))
          (def restaurant-relevance-value (+ restaurant-relevance-value
                                             (val attribute-weight)))))
      (assoc restaurant :relevance restaurant-relevance-value)
      (def ordered-restaurants (conj ordered-restaurants restaurant))))
  (def ordered-restaurants (sort-by > (:relevance ordered-restaurants)))
  ordered-restaurants)
;more stuff
;more stuff
;more stuff
(defn -main
  [& args]
  (def restaurant-data (process-file "resources/ResturantDatabase.txt"))
  (let [input-values (Restaurant. "Italian" "1.0" "1.0" "$" "true"
                                  "Casual" "None" "1")]
    (println :resturant-name (nth (search restaurant-data input-values) 0))))
So the idea is that each restaurant is iterated through and the attribute values are compared to the user's input values. If there is a match, then the associated weight value gets added to the local relevance-value variable. After that, the restaurant is put into a vector, sorted, and returned. The results can then be displayed to the user. Ideally, the example in -main would print the name of the most relevant restaurant.
As the comments explain, you have to get to grips with the immutable nature of the native Clojure data structures. You should not be using doseq - which works by side effects - for your fundamental algorithms.
The relevance attribute is not a property of Restaurant. What you want is to construct a map from restaurant to relevance, for some mode of calculating relevance. You then want to sort this map by value. The standard sorted-map will not do this - it sorts by key. You could sort the map entries yourself, but there is a ready-made priority map that will give you the result you require automatically.
You also have to decide how to identify restaurants, as with a database. If resturant-names are unique, they will do. If not, you may be better off with an artificial identifier, such as a number. You can use these as keys both in the map of restaurants and in the map of relevances you construct for ordering.
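The shape of the fix - a pure scoring function plus a sort, with no mutation - is language-neutral, so here is an outline in Python for brevity (the attributes, weights, and restaurant data are all made up; in Clojure the same structure is a pure function passed to `sort-by` over immutable maps):

```python
# Hypothetical per-attribute weights and user preferences.
weights = {"cuisine": 4, "price": 2, "location": 1}

def relevance(restaurant, user_input):
    """Pure function: sum the weights of attributes matching the input."""
    return sum(w for attr, w in weights.items()
               if restaurant.get(attr) == user_input.get(attr))

restaurants = [
    {"name": "Luigi's",  "cuisine": "Italian", "price": "$", "location": "downtown"},
    {"name": "Taco Row", "cuisine": "Mexican", "price": "$", "location": "uptown"},
]
user_input = {"cuisine": "Italian", "price": "$", "location": "uptown"}

# Sort by computed relevance instead of mutating each record in place.
ordered = sorted(restaurants,
                 key=lambda r: relevance(r, user_input),
                 reverse=True)
print([r["name"] for r in ordered])
```

Nothing here rebinds a var the way the original `def`-inside-`doseq` does; the relevance lives only in the sort key, which is exactly the "map from restaurant to relevance" the answer describes.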