Neo4j - Recursive query with relationship weights

I am building a tool that enables users to recommend sequences of online courses to other users.
Off the back of the data generated from that, I would like to generate insights on what the most recommended sequences of courses are. Here's a slice of the model:
In this graph, the numbers in green are weights that show how many people have recommended one course after the other (ex: 655 people recommend taking Stanford's ML after Intro to Prob)
The recs field in nodes is the absolute number of recommendations that course has (ex: Stanford ML has been featured by 1000 users in sequences)
What I would like to do is, starting from an end goal, find the most recommended prerequisites.
The algorithm might work something like this:
THRESHOLD = 0.05  # cut-off for a "very low" prereq weight, to be tuned

def fancy_algo(node, graph):
    # Prerequisites are the incoming nodes; ignore those whose weight is
    # very low relative to the node's total recommendations (recs).
    # incoming_prereqs(), weight and recs are placeholder names here.
    prereqs = [p for p in node.incoming_prereqs()
               if p.weight / node.recs >= THRESHOLD]
    if not prereqs:
        return graph
    for prereq in prereqs:
        graph.append(prereq)        # keep the recommended prerequisite
        fancy_algo(prereq, graph)   # recurse further up the chain
    return graph
The end state for someone looking to do Stanford's Machine Learning might look something like this:
Note how after "Intro to CS" we haven't included "Algebra", because its prereq weight was very low (20 out of 4567).
Is this something that can be managed with Cypher? How would I get started?
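One possible starting point (a sketch only, not a full solution: the :Course label, the :RECOMMENDED_BEFORE relationship type, the weight and recs properties and the threshold value are all assumed names, and this returns every qualifying path rather than implementing the per-step pruning above) is a variable-length match that filters each hop by its relative weight, run here through the official neo4j Python driver:

from neo4j import GraphDatabase

# Assumed schema: (:Course {name, recs})-[:RECOMMENDED_BEFORE {weight}]->(:Course)
QUERY = """
MATCH path = (pre:Course)-[:RECOMMENDED_BEFORE*1..5]->(goal:Course {name: $goal})
WHERE ALL(r IN relationships(path)
          WHERE toFloat(r.weight) / endNode(r).recs >= $threshold)
RETURN path
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(QUERY, goal="Stanford ML", threshold=0.1):
        print(record["path"])
driver.close()

Keeping only the single most recommended prerequisite at each step would still need some post-processing of the returned paths (or a move to the traversal framework), but this shows the general shape of a Cypher-first approach.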

Related

Data model and traversal approach for bus routing in neo4j

I'm trying to create an application to find the best paths to use when traveling by bus within my local city. I have found some useful answers on here so far, but I'm currently struggling with my approach in general and I'd like to get some feedback.
Current data model
There are stations and stages modeled as nodes, and two relationship types between stations and stages (:FROM and :TO in the query below). Each stage node has a start and an end time as a string in "HH:mm" format, and belongs to a higher-level structure I call a route, which is connected to these stage nodes to describe a trajectory along stations with time details. Each :FROM relationship has a duration property to model the travel time for reduce statements.
So the following query would return something like this. The stage nodes show the start property in the picture.
MATCH (from:Station {name: "Glosberg"})
MATCH (to:Station {name: "Knellendorf"})
MATCH paths = (from)-[:FROM|:TO*..10]->(to)
RETURN paths;
Problems so far:
ShortestPath/AllShortestPaths is not a valid option, as the smallest number of hops does not mean the best path. What I want is to minimize travel duration, which I can achieve with a reduce statement, and I have already done that. Since I have to check all paths, I'm using the general pattern matcher with a limit (as seen above). The limit I use in my queries is actually the length of the shortest path between from and to plus 10% or so, to also include paths that consist of more hops but take less time. This is not necessarily accurate, but it seems like a fair trade-off.
Using Dijkstra gives me all paths from A to B. Since the stage nodes carry time data, most of those paths do not make sense: they are either combined in reverse order (2pm -> 1pm) or produce unnecessarily long waiting times (2pm -> 4pm). Therefore I have to filter out bad paths, either in Cypher or at some API level. However, with the current data model there are simply too many paths to check for validity. With some sample data, which would also run in production, I have a route that visits 24 stations 2 times a day, resulting in 2^23 paths to check. I'm pretty sure that my data model is the problem, but I can't see any way to solve this; any ideas?
Questions:
More of a side problem: how would you order paths with stages that go past midnight? "23:59" is bigger than "00:01" as a string, but not chronologically (see the sketch at the end of this post).
What would you change about the data model?
Would you suggest any trade-offs in how the path finding works to reduce the complexity (e.g. simply using shortestPath)?
Would you suggest separating the actual route data (the timetable: who stops where and when) from the infrastructure data (stations and which stations are close to which)? That way I'd use neo4j to find a path/set of stations to travel along, and then try to find a suitable set of elements from a timetable, similar to Wanderu's approach.
I've read that the Traversal API is a better way to describe how the graph should be accessed, instead of Cypher, which only describes what to look for, but I'd like to get feedback on my thoughts so far before I dive into that.
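For the midnight question specifically, one common workaround (a sketch only; it assumes no single leg is longer than a day) is to compare minutes rather than the raw "HH:mm" strings, treating an end time that sorts before the start time as belonging to the next day:

def minutes(hhmm: str) -> int:
    # Convert an "HH:mm" string to minutes since midnight.
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)

def leg_duration(start: str, end: str) -> int:
    # Duration in minutes; a "smaller" end time wraps to the next day.
    d = minutes(end) - minutes(start)
    return d if d >= 0 else d + 24 * 60

print(leg_duration("23:59", "00:01"))  # 2, not -1438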

Logic for selecting best nearby venues for display on a map

I have an app that displays information about certain venues. Each venue is awarded a rating on a scale from 0-100. The app includes a map, and on the map I'd like to show the best nearby venues. (The point is to recommend to the user alternative venues that they might like.)
What is the best way to approach this problem?
If I fetch the nearest x venues, many bad venues (i.e. those with a low rating) show.
If I fetch the highest rated venues, many of them will be too far away to be useful as recommendations.
This seems like a pretty common challenge for any geolocation app, so I'm interested to know what approach other people have taken.
I have considered "scoring" each possible venue by taking into account its rating and its distance in miles.
I've also considered fetching the highest rated venues within a y mile radius, but this gets problematic because in some cities there are a lot of venues in a small area (e.g. New York) and in others it's reasonable to recommend venues that are farther away.
(This is a Rails app, and I'm using Solr with the Sunspot gem to retrieve the data. But I'm not necessarily looking for answers in code here, more just advice about the logic.)
Personally, I would implement a few formulas and use some form of A/B testing to get an idea as to which ones yield the best results on some outcome metric. What exactly that metric is is up to you. It could be clicks, or it could be something more complicated.
Start out with the simplest formula you can think of (ideally one that is computationally cheap as well) to establish a baseline. From there, you can iterate, but the absolute key concept is that you'll have hard data to tell you if you're getting better or worse, not just a hunch (perhaps that a more complicated formula is better). Even if you got your hands on Yelp's formula, it might not work for you.
For instance, as you mentioned, a single score calculated based on some linear combination of inverse distance and establishment quality would be a good starting point and you can roll it out in a few minutes. Make sure to normalize each component score in some way. Here's a possible very simple algorithm you could start with:
Filter venues as much as possible on fast-to-query attributes (by type, country, etc.)
Filter remaining venues within a fairly wide radius (you'll need to do some research into exactly how to do this in a performant way; there are plenty of posts on Stack Overflow and elsewhere on this. You'll want to index your database table on latitude and longitude, and follow a number of other best practices).
Score the remaining venues using some weights that seem intuitive to you (I arbitrarily picked 0.25 and 0.75, but they should add up to 1); see the Python sketch after this list:
score = 0.25 * (1 - (distance - closest distance in remaining set) / (furthest distance in remaining set - closest distance in remaining set)) + 0.75 * (quality score / highest quality score in remaining set)
Sort them by score and take the top n
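A sketch of steps 3 and 4 in Python (min-max normalization for distance, max normalization for quality; the 0.25/0.75 weights and the sample data are just the arbitrary values from above):

def score_venues(venues, w_dist=0.25, w_quality=0.75):
    # venues: dicts with 'distance' (miles) and 'quality' (0-100).
    # Returns the list sorted best-first by the combined score.
    d_min = min(v["distance"] for v in venues)
    d_max = max(v["distance"] for v in venues)
    q_max = max(v["quality"] for v in venues)
    for v in venues:
        # Closer is better, so invert the normalized distance.
        d_norm = 0.0 if d_max == d_min else (v["distance"] - d_min) / (d_max - d_min)
        v["score"] = w_dist * (1 - d_norm) + w_quality * (v["quality"] / q_max)
    return sorted(venues, key=lambda v: v["score"], reverse=True)

venues = [
    {"name": "A", "distance": 0.4, "quality": 92},
    {"name": "B", "distance": 2.1, "quality": 97},
    {"name": "C", "distance": 0.9, "quality": 55},
]
top_n = score_venues(venues)[:2]   # step 4: take the top n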
I would put money on Yelp using some fancy-pants version of this simple idea. They may be using machine learning to actually select the weights for each component score, but the conceptual basis is similar.
While there are plenty of possibilities for calculating formulas of varying complexity, the only way to truly know which one works best is to gather data.
I would fix the number of venues returned at say 7.
Discard all venues with scores in the lowest quartile of reviewer scores, to avoid bad customer experiences, then return the top 7 within a postcode. If this results in fewer than 7 entries, look to the neighboring postcodes to find the best scores to complete the list.
This would result in a list of top to mediocre scores locally, perhaps with some really good scores only a short distance away.
From a UX perspective this would easily allow users to either select a postcode/area they are interested in or allow the app to determine its location.
From a data perspective, you already have addresses. The only "tricky" bit is determining what the neighboring postcodes/areas are, but I'm sure someone has figured that out already.
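A rough sketch of that selection logic (the neighbors_of lookup is assumed to exist somewhere; the lowest-quartile cut and the fixed count of 7 are as described above):

import statistics

def best_in_area(venues, postcode, neighbors_of, n=7):
    # venues: dicts with 'rating' and 'postcode'. neighbors_of(postcode)
    # returns adjacent postcodes. Returns up to n venues, best-rated first.
    # Drop the lowest quartile of ratings to avoid bad experiences.
    cutoff = statistics.quantiles([v["rating"] for v in venues], n=4)[0]
    good = [v for v in venues if v["rating"] >= cutoff]

    picks = sorted((v for v in good if v["postcode"] == postcode),
                   key=lambda v: v["rating"], reverse=True)[:n]
    if len(picks) < n:
        nearby = [v for v in good if v["postcode"] in neighbors_of(postcode)]
        picks += sorted(nearby, key=lambda v: v["rating"],
                        reverse=True)[:n - len(picks)]
    return picks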
As an aside, I'm a great believer in things changing. Like restaurants changing hands or the owners waking up and getting better. I would consider offering a "dangerous" list of sub-standard eateries "at your own risk" as another form of evening entertainment. Personally I have found some of my worst dining experiences have formed some of my best dining out stories :-) And if the place has been harshly judged in the past you can sometimes find it is now a gem in the making.
First, I suggest that you use a Bayesian average to maintain an overall rating for all the venues; more info here: https://github.com/tyrauber/acts_rateable
Then you can retrieve the nearest venues, ordered by distance and then by rating: two ORDER BY clauses in your query.
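For reference, the usual Bayesian (weighted) average looks something like this, where m is the mean rating across all venues and C is a confidence weight such as the average number of ratings per venue (a sketch of the general formula, independent of the linked gem):

def bayesian_average(ratings, m, C):
    # Pull venues with few ratings toward the global mean m;
    # C controls how many ratings it takes to escape the prior.
    return (C * m + sum(ratings)) / (C + len(ratings))

# A venue with two perfect scores is not automatically ranked above
# an established venue with hundreds of good ones:
print(bayesian_average([100, 100], m=70, C=10))   # 75.0
print(bayesian_average([85] * 300, m=70, C=10))   # ~84.5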

how to build an efficient ItemBasedRecommender in Mahout?

I am building an Item Based Recommender System for 10 million users who rate categories, out of 20 possible categories (news categories like politics, sport, etc.).
I would like each one of them to be recommended at least one other category which they don't know (no rating).
I ran a GenericUserBasedRecommender and asked for recommendations for each user, but it takes extremely long: maybe 1,000 users processed per minute.
My questions are:
1 - Can I run this same GenericUserBasedRecommender on Hadoop, and would it really be faster? I saw and ran an ItemBasedRecommender from the command line on a cluster, but I would rather run a User Based one.
1.5 - I saw many users not getting a single recommendation. What is the algorithm's criterion for deciding whether a user gets a recommendation? I thought it could be that the users who don't get recommendations are the ones who only gave a single rating, but I don't understand why.
2 - Is there another, smarter way to deal with my problem? Maybe some clustering solution instead of recommendation? I don't exactly see how.
3 - Finally, am I right when I say that the algorithms that have no command line are not to be used with Hadoop?
Thank you for your answers.
Sometimes you won't get recommendations for certain items or users because there are few items over which they overlap. It could also be a case where the user's data is 'enough', but their behaviour/usage patterns are very unusual and/or in disagreement with popular trends in the data.
You could perhaps try LogLikelihood or Tanimoto based ItemSimilarity.
Another thing you could look into is a Matrix Factorization based model. You could use the ALSWR Factorizer to generate recommendations. This method decomposes the original User-Item matrix into a User-Feature matrix, an Item-Feature matrix and a diagonal matrix, reduces the dimensionality, and then reconstructs the matrix of that reduced rank which is closest to the original. You might lose some data with this method, but the missing values in the user-item matrix are imputed, so you get estimated preference/recommendation values.
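To make that concrete, here is the reconstruction idea shown with plain numpy's truncated SVD rather than Mahout's ALSWRFactorizer (a conceptual sketch only; real rating matrices are sparse and you would use ALS or a sparse solver in practice):

import numpy as np

# Tiny user-item matrix; 0 means "no rating yet".
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                    # keep only k latent features
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed matrix now holds estimated preferences where R had zeros.
print(np.round(R_hat, 1))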
If you have the features and not just implicit ratings, you could probably experiment with clustering techniques, perhaps start with Hierarchical Clustering.
I did not quite get your last question.

how to cluster users based on tags

I'd like to cluster users based on the categories or tags of shows they watch. What's the easiest/best algorithm to do this?
Assuming I have around 20,000 tags and several million watch events I can use as signals, is there an algorithm I can implement using say pig/hadoop/mortar or perhaps on neo4j?
In terms of data I have users, programs they've watched, and the tags that a program has (usually around 10 tags per program).
I would like to expect at the end k number of clusters (maybe a dozen?) or broad buckets which I can use to classify and group my users into buckets and also gain some insight about how they would be divided - with a set of tags representing each cluster.
I've seen some posts out there suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be the distance between two users, or between a user and a set of tags, etc.?
You basically want to cluster the users according to their tags.
To keep it simple, assume that you only have 10 tags (instead of 20,000 ones). Assume that a user, say user_34, has the 2nd and 7th tag. For this clustering task, user_34 can be represented as a point in the 10-dimensional space, and his corresponding coordinates are: [0,1,0,0,0,0,1,0,0,0].
In your own case, each user can be similarly represented as a point in a 20,000-dimensional space.
You can use Apache Mahout which contains many effective clustering algorithms, such as K-means.
Since everything is well defined in a mathematical coordinate system, computing the distance between any two users is easy! It can be computed using any distance function, but the Euclidean distance is the de-facto standard.
Note: Mahout and many other data-mining programs support many formats suitable for SPARSE features, i.e. You do not need to insert ...,0,0,0,0,... in the file, but only need to specify which tags are selected. (See RandomAccessSparseVector in Mahout.)
Note: I assumed you only want to cluster your users. Extracting representative info from clusters is somewhat tricky. For example, for each cluster you may select the tags that are more common between the users of the cluster. Alternatively, you may use concepts from information theory, such as information gain to find out which tags contain more information about the cluster.
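If you end up prototyping outside Mahout, the same idea (sparse one-hot tag vectors plus K-means) is only a few lines with scipy and scikit-learn; shown here purely to illustrate the representation, with made-up data in place of your watch events:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans

n_users, n_tags, k = 1000, 20000, 12

# One row per user, one column per tag; a nonzero entry wherever the
# user watched a program carrying that tag (user_34 with tags 2 and 7, etc.).
rng = np.random.default_rng(0)
rows = rng.integers(0, n_users, size=50_000)
cols = rng.integers(0, n_tags, size=50_000)
X = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n_users, n_tags))

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster id per user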
You should consider using neo4j. You can model your data using the following node labels and relationship types.
If you are not familiar with neo4j's Cypher language notation, (:Foo) represents a node with the label Foo, and [:BAR] represents a relationship with the type BAR. The arrows around a relationship indicate its directionality. neo4j efficiently traverses relationships in both directions.
(:Cluster) -[:INCLUDES_TAG]-> (:Tag) <-[:HAS_TAG]- (:Program) <-[:WATCHED]- (:User)
You'd have k Cluster nodes, 20K Tag nodes, and several million WATCHED relationships.
With this model, starting with any given Cluster node, you can efficiently find all its related tags, programs, and users.
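For example, pulling every user attached to one cluster's tags could look roughly like this (the labels and relationship types are as in the pattern above; the id properties and connection details are placeholders), run through the official Python driver:

from neo4j import GraphDatabase

QUERY = """
MATCH (c:Cluster {id: $cluster_id})-[:INCLUDES_TAG]->(:Tag)
      <-[:HAS_TAG]-(:Program)<-[:WATCHED]-(u:User)
RETURN DISTINCT u.id AS user_id
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    user_ids = [r["user_id"] for r in session.run(QUERY, cluster_id=3)]
driver.close()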

Collaborative filtering for news articles or blog posts

It's known how collaborative filtering (CF) is used for movie, music, and book recommendations. In the paper 'Collaborative Topic Modeling for Recommending Scientific Articles', among other things, the authors show an example of collaborative filtering applied to ~5,500 users and ~17,000 scientific articles. With ~200,000 user-item pairs, the user-article matrix is obviously highly sparse.
What if you do collaborative filtering with matrix factorization for, say, all news articles shared on Twitter? The matrix will be even sparser (than that in the scientific articles case) which makes CF not very applicable. Of course, we can do some content-aware analysis (taking into account, the text of an article), but that's not my focus. Or we can potentially limit our time window (focus, say, on all news articles shared in the last day or week) to make the user-article matrix denser. Any other ideas how to fight the fact that the matrix is very sparse? What are the results in research in the area of CF for news article recommendations? Thanks a lot in advance!
You might try using an object-to-object collaborative filter instead of a user-to-object filter. Age out related pairs (and low-incidence pairs) over time since they're largely irrelevant in your use case anyway.
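A bare-bones version of that object-to-object approach (co-occurrence counts between articles, decayed over time so stale and low-incidence pairs fall away; all names and the decay factor are made up for illustration):

from collections import defaultdict
from itertools import combinations

pair_counts = defaultdict(float)   # (article_a, article_b) -> decayed count

def record_shares(articles_shared_by_one_user):
    # Bump the count for every pair of articles a single user shared.
    for a, b in combinations(sorted(set(articles_shared_by_one_user)), 2):
        pair_counts[(a, b)] += 1.0

def age_out(decay=0.95, floor=0.5):
    # Periodically decay all pairs and drop those that fell below a floor.
    for pair in list(pair_counts):
        pair_counts[pair] *= decay
        if pair_counts[pair] < floor:
            del pair_counts[pair]

def related_to(article, top_k=5):
    # Articles most often shared alongside the given one.
    scored = [(b if a == article else a, c)
              for (a, b), c in pair_counts.items() if article in (a, b)]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]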
I did some work on the Netflix Prize back in the day, and quickly found that I could significantly outperform the base model with regard to predicting which items were users' favorites. Unfortunately, since it's basically a rank model rather than a scalar predictor, I didn't have RMSE values to compare.
I know this method works because I wrote a production version of this same system. My early tests showed that, given a task wherein 50% of users' top-rated movies were deleted, the object-to-object model correctly predicted (i.e., "replaced") about 16x more of users' actual favorites than a basic slope-one model. Plus the table size is manageable. From there it's easy to include a profitability weight against the sort order, etc. depending on your application.
Hope this helps! I have a working version in production but am still looking for beta clients to bang on the system... if anyone has time to give it a run I'd love to hear from you.
Jeb Stone, PhD
www.selloscope.com
