Calculating User Counts with Common Entities - neo4j

I want to determine groups of users who have common interests.
Data Model and Characteristics
User and Interest are node labels and represent unique nodes
LIKES is the relationship among them, (User)-[:LIKES]->(Interest)
All properties of nodes are indexed
Relation nature can be characterized as many to many between the nodes
There are 300+ interests and 120,000+ users
I used the following query to count the users who share one given interest with each of the other interests:
MATCH (u:User)-[:LIKES]-(i:Interest)
WHERE i.name = "Baking"
WITH u
MATCH (u)-[:LIKES]-(i:Interest)
WHERE i.name <> "Baking"
RETURN i.name, COUNT(u) AS userCount
ORDER BY userCount DESC
I tried writing a query that handles 3 common interests, but that made it slower. I think this is not a good, scalable design. Can anyone help?
Though it may not be feasible, the end goal is to calculate n x n combinations of interests.

Maybe you should limit the interests and only take the top five or so?
Also, I don't know your data model, but is each interest a unique node? That would speed up the query: every [has interest]->(Baking) relationship then points to the same node, and you can start from Baking to get all the users.
Maybe flip your query and start from the interest (Cypher is strange), or force the query to use indexes.
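As a rough sketch of the "start from the interest" suggestion, the query below anchors on the Baking node and follows LIKES in the direction given in the question. It assumes each user has at most one LIKES relationship per interest, so the second hop cannot land back on Baking. The second query is one possible way to count every pair of interests in a single pass; it assumes interest names are unique so that each pair is reported once.

```cypher
// Users who like Baking, grouped by the other interests they like.
MATCH (:Interest {name: 'Baking'})<-[:LIKES]-(u:User)-[:LIKES]->(other:Interest)
RETURN other.name AS interest, count(u) AS userCount
ORDER BY userCount DESC;

// All pairs of interests with the number of users who like both.
// The name comparison keeps only one ordering of each pair.
MATCH (i1:Interest)<-[:LIKES]-(u:User)-[:LIKES]->(i2:Interest)
WHERE i1.name < i2.name
RETURN i1.name AS first, i2.name AS second, count(u) AS userCount
ORDER BY userCount DESC;
```

With 300+ interests the pairwise query returns at most about 45,000 rows, but it still touches every pair of LIKES relationships per user, so it is worth profiling on real data before relying on it.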

Related

Is there a way to perform calculations in a cypher query?

Is there a way to order by a calculated value based on relationships, rather than a label? For reference, I have a database containing users and skills. If applicable, each user node has a relationship with a skill node. Each skill has a specific value tied to it that represents how important that skill is. If a user wants to find similar users, what I am currently doing is matching all distinct users with similar skills. What I want to do is sum up the values contained in each skill node that I'm looking for for a particular user, and sort from greatest to least. For example, if I'm looking for people who like to swim, run, and bike, and Billy likes to swim and run, I would take the values stored in each matching skill and sum them to use as the value to sort by. Is this possible purely in Cypher, or would I have to return the list of results and then calculate and sort outside of Cypher? If anyone has any other advice on how to better structure the database, that would also be helpful.
This is pretty easy in Cypher and is documented in a lot of places.
Here are a couple of examples:
Finding users like Bob, based on similar skills, order by sum of skill importance on the skill node :
MATCH (n:User {name: 'Bob'})-[:HAS_SKILL]->(skill)<-[:HAS_SKILL]-(otherUser)
RETURN otherUser.name AS name, sum(skill.score) AS score
ORDER BY score DESC
In some graph models, each user has their own score for each skill, in which case the score is stored on the relationship between the user and the skill. You can then sum those up as well:
MATCH (n:User {name: 'Bob'})-[r1:HAS_SKILL]->(skill)<-[r2:HAS_SKILL]-(otherUser)
RETURN otherUser.name AS name, sum(r1.score + r2.score) AS score
ORDER BY score DESC
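The question's own example (looking for people who swim, run, and bike) starts from a list of wanted skills rather than from a reference user. A minimal sketch of that variant follows; the Skill label and the name property on skills and users are assumptions, while HAS_SKILL and score are taken from the examples above.

```cypher
// Sum the importance of every wanted skill each user has,
// then rank users by that total.
MATCH (skill:Skill)<-[:HAS_SKILL]-(u:User)
WHERE skill.name IN ['swim', 'run', 'bike']
RETURN u.name AS name, sum(skill.score) AS score
ORDER BY score DESC
```

A user like Billy, who swims and runs but does not bike, would get the sum of the swim and run scores.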

How to calculate custom degree based on the node label or other conditions?

I have a scenario where I need to calculate a custom degree starting from the first node (:employee). The degree should only be incremented when moving to another node whose label is :natural or :relative, but not when it is :legal.
Example:
The thing is I'm having trouble generating this custom degree property as I needed it.
So far I've tried playing with FOREACH and CASE but had no luck. The closest I got to getting some sort of calculated custom degree is this:
match p = (:employee)-[*5..5]-()
WITH distinct nodes(p) AS nodes
FOREACH(i IN RANGE(0, size(nodes)) |
FOREACH(node IN [nodes[i]] |
SET node.degree = i
))
return *
limit 1
But even this isn't right, as despite having 5 distinct nodes, I get SIZE(nodes) = 6, as the :legal node is accounted for twice for some reason.
Does anyone know how to achieve my goal within a single cypher query?
Also, if you know why the :legal node is accounted for twice, please let me know. I suspect it is because it has 2 :natural nodes related to it, but I don't know the inner workings that make it appear twice.
More context:
:employee nodes are, well, employees of an organization
:relative nodes are relatives to an employee
:natural nodes are natural persons that may or may not be related to a :legal
:legal nodes are companies (legal persons) that may, or may not, be related to an :employee, :relative, :natural or another :legal on an IS_PARTNER relationship when, in real life, they are part of the board of directors or are shareholders of that company (:legal).
custom degree is what I aim to create and will define how close one node is to another given some conditions to this project (specified below).
All nodes have a total_contracts property that are the total amount of money received through contracts.
The objective is to find any employees with relationships to another node that has total_contracts > 0 and are up to custom degree <= 3, as employees may be receiving money from external sources, when they shouldn't.
As for why I need this custom degree to ignore the distance when it is a :legal node: we treat companies as being at the same distance as the natural person who is a partner.
In the illustrated example above, the employee has a son, DIEGO, who is a shareholder of a company (ALLURE) and has 2 other business partners (JOSE and ROSIEL). When I ask what the degree of the son to the employee is, I should get 1, as they are directly related; when I ask what the degree of JOSE to the employee is, I should get 2, as JOSE is related to DIEGO through ALLURE and we shouldn't increment the custom degree when it is a company, only when it's a person.
The trick with this type of graph is making sure we avoid paths that loop back to the same nodes (which is definitely going to happen quite a lot because you're using multiple relationships between nodes instead of just one...you may want to make sure this is necessary in your model).
The easiest way to do that is via APOC Procedures, as you can adjust the uniqueness of traversals so that nodes are unique in each path.
So for example, for a specific start node (let's say the :employee has empId:1, just for the sake of mocking up a lookup of the node), we'll calculate a degree for all nodes within 5 hops of the starting node. The idea here is that we'll take the length of the path (the number of hops) minus the number of :legal nodes in the path (by filtering the nodes in the path for just :legal nodes, then getting the size of that filtered list).
MATCH (e:employee {empId:1})
CALL apoc.path.expandConfig(e, {minLevel:1, maxLevel:5, uniqueness:'NODE_PATH'}) YIELD path
WITH e, last(nodes(path)) as endNode,
length(path) - size([x in nodes(path) WHERE x:legal]) as customDegree
RETURN e, endNode, customDegree
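To get from the per-path degree to the stated objective (nodes with total_contracts > 0 at a custom degree of at most 3), one possible extension is sketched below. It keeps the hypothetical empId lookup from the answer and takes the smallest custom degree per end node, since the same node can be reached over several paths.

```cypher
MATCH (e:employee {empId: 1})
CALL apoc.path.expandConfig(e, {minLevel: 1, maxLevel: 5, uniqueness: 'NODE_PATH'}) YIELD path
WITH e, last(nodes(path)) AS endNode,
     // hops minus the :legal nodes on the path
     length(path) - size([x IN nodes(path) WHERE x:legal]) AS customDegree
WHERE endNode.total_contracts > 0
// keep the closest route to each end node
WITH e, endNode, min(customDegree) AS customDegree
WHERE customDegree <= 3
RETURN e, endNode, customDegree
ORDER BY customDegree
```

Note that maxLevel limits raw hops, not the custom degree, so a path that passes through several :legal nodes may need a larger maxLevel to reach every node of custom degree 3.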

Collaborative filtering cypher with attributes in neo4j

I am using neo4j to setup a recommender system. I have the following setup:
Nodes:
Users
Movies
Movie attributes (e.g. genre)
Relationships
(m:Movie)-[w:WEIGHT {weight: 10}]->(a:Attribute)
(u:User)-[r:RATED {rating: 5}]->(m:Movie)
Here is a diagram of how it looks:
I am now trying to figure out how to apply a collaborative filtering scheme that works as follows:
Checks which attributes the user has liked (implicitly by liking the movies)
Find similar other users that have liked these similar attributes
Recommend the top movies to the user, which the user has NOT seen, but similar other users have seen.
The condition is obviously that each attribute has a certain weight for each movie. E.g. the genre adventure can have a weight of 10 for the Lord of Rings but a weight of 5 for the Titanic.
In addition, the system needs to take into account the ratings for each movie. E.g. if another user has rated Lord of the Rings 5, then his/her attributes of Lord of the Rings are scaled by 5 and not 10. The user that has rated the implicit attributes also close to 5 should then get this movie recommended as opposed to another user that has rated similar attributes higher.
I made a start by simply recommending only other movies that other users have rated, but I am not sure how to take into account the relationships RATED and WEIGHT. It also did not work:
MATCH (user:User)-[:RATED]->(movie1)<-[:RATED]-(ouser:User),
(ouser)-[:RATED]->(movie2)<-[:RATED]-(oouser:User)
WHERE user.uid = "user4"
AND NOT (user)-[:RATED]->(movie2)
RETURN oouser
What you are looking for, mathematically speaking, is a simplified Jaccard index between two users. That is, how similar are they based on how many things they have in common. I say simplified because we are not taking into account the movies they disagree about. Essentially, and following your order, it would be:
1) Get the total weight of every Attribute for every user. For instance:
MATCH (user:User{name:'user1'})
OPTIONAL MATCH (user)-[r:RATED]->(m:Movie)-[w:WEIGHT]->(a:Attribute)
WITH user, r.rating * w.weight AS totalWeight, a
WITH user, a, sum(totalWeight) AS totalWeight
We need the last line because we had a row for each Movie-Attribute combination
2) Then, we get users with similar tastes. This is a performance danger zone, and some filtering might be necessary. But brute-forcing it, we get users that like each attribute within a 10% error (for instance):
WITH user, a, totalWeight*0.9 AS minimum, totalWeight*1.10 AS maximum
MATCH (a)<-[w:WEIGHT]-(m:Movie)<-[r:RATED]-(otherUser:User)
WHERE w.weight * r.rating > minimum AND w.weight * r.rating < maximum
WITH DISTINCT user, otherUser
So now we have a row (unique because of the last line) for each otherUser that is a match. To be honest, I would need to try it to be sure whether otherUsers with only 1 genre match would be included. If they are, an additional filter would be needed, but I think that should come after we get this going.
3) Now it's easy:
MATCH (otherUser)-[r:RATED]->(m:Movie)
WHERE NOT (user)-[:RATED]->(m)
RETURN m, sum(r.rating) AS totalRating ORDER BY totalRating DESC
As mentioned before, the tricky part is 2), but once we know how to get the math going, it should be easier. Oh, and about the math: for it to work properly, the attribute weights for each movie should sum to 1 (normalization). Otherwise, the difference between total weights across movies would cause an unfair comparison.
I wrote this without proper studying (paper, pencil, equations, statistics) and trying the code in a sample dataset. I hope it can help you anyway!
In case you want this recommendation without taking into account user ratings or attribute weights, it should be enough to replace the math in 1) and 2) with just r.rating or w.weight, respectively. The RATED and WEIGHT relationships would still be used, so for instance an avid consumer of Adventure movies would be recommended movies by other consumers of Adventure movies, but the result would not be modified by ratings or by attribute weight, whichever we chose to drop.
EDIT: Code edited to fix syntax errors discussed in comments.
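Putting the three steps together, one possible single query is sketched here. It uses the RATED and WEIGHT relationship names and the uid property from the question. As a deliberate variation on step 2, it compares per-attribute totals for both users instead of individual movie products, and it excludes the target user from the candidates. It has not been run against a sample dataset.

```cypher
// Step 1: weighted score per attribute for the target user.
MATCH (user:User {uid: 'user4'})-[r:RATED]->(:Movie)-[w:WEIGHT]->(a:Attribute)
WITH user, a, sum(r.rating * w.weight) AS totalWeight
// Step 2: other users whose score for the same attribute is within 10%.
MATCH (a)<-[w2:WEIGHT]-(:Movie)<-[r2:RATED]-(otherUser:User)
WHERE otherUser <> user
WITH user, a, totalWeight, otherUser, sum(r2.rating * w2.weight) AS otherWeight
WHERE otherWeight >= totalWeight * 0.9 AND otherWeight <= totalWeight * 1.1
WITH DISTINCT user, otherUser
// Step 3: movies those users rated that the target user has not.
MATCH (otherUser)-[r3:RATED]->(m:Movie)
WHERE NOT (user)-[:RATED]->(m)
RETURN m, sum(r3.rating) AS totalRating
ORDER BY totalRating DESC
```

A user who matches on a single attribute still counts as similar here, so a minimum number of matching attributes may be worth adding before step 3.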
Answer to your 1st query:
Checks which attributes the user has liked (implicitly by liking the movies)
MATCH (user:User)
WHERE user.uid = "user4"
OPTIONAL MATCH (user)-[:RATED]->(m:Movie)
OPTIONAL MATCH (m)-[:WEIGHT]->(a:Attribute)
RETURN user, collect({ a: a.title })
The query first finds the movies rated by the user, then finds the attributes of those movies, and finally returns the list of liked attributes.
You can change the RETURN clause to collect(a) AS attributes if you need the entire node.

cypher assign group of nodes to a relation

I'm working with Users assigned to a Grid location
(User)-[:PICK_UP]->(Grid)
With the query
MATCH (u:User)-[:PICK_UP]->(g:Grid)-[:TO]-(g2:Grid)<-[:PICK_UP]-(u2:User)
RETURN g,g2,u,u2
I have the result
In the image I have two groups of nodes that represent the grids and their neighbors, with users (red nodes). I would like to group the nearby users by creating relationships from them to a Spot node.
E.g. with the first group: grids 34, 40, 41, with the users 1, 4, 5, 9. I would like to group the users in my query so I can get the result [user1, u4, u5, u9] and then assign those users to a Spot, like this
Any suggestions??
Thank you !!
The thing to keep in mind is that your (u:User)-[:PICK_UP]->(g:Grid)-[:TO]-(g2:Grid)<-[:PICK_UP]-(u2:User) is matching a specific path, and while you see two groups in the graphical display, there are actually overlapping paths there. Viewing your result in table mode might be helpful.
So onto answering your question! Firstly, this was a tricky one, but a really cool one. I think I've got a good solution:
MATCH path=(grid:Grid)-[:TO]-(other_grid:Grid)
WITH CASE WHEN ID(grid) < ID(other_grid) THEN ID(other_grid) ELSE ID(grid) END AS id_to_reject
WITH collect(DISTINCT id_to_reject) AS ids_to_reject
MATCH (grid:Grid)
WHERE NOT(ID(grid) IN ids_to_reject)
CREATE (spot:Spot)
WITH grid, spot
MATCH (grid)-[:TO|PICK_UP*1..6]-(user:User)
MERGE (user)-[:AT_SPOT]->(spot)
The first thing the query does is compare all Grid nodes which are related to each other. For each of these pairs it passes on the ID() of the Grid node which is greater. The IDs which aren't in the list are therefore the smallest in the group and can act as a representative of the group. For each one of these representative Grid nodes we create a Spot node.
Using that node, it finds all User nodes within six hops via both TO and PICK_UP relationships. That should give all users in the group (both the users of our representative grid as well as the users of the other grids).
Then it's a simple matter to MERGE a relationship from each user to the Spot.
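To check the outcome, a small follow-up query can list the users attached to each Spot. This is only an inspection sketch; it returns internal IDs so that it does not assume any property names on the nodes.

```cypher
// One row per Spot with the users assigned to it.
MATCH (spot:Spot)<-[:AT_SPOT]-(user:User)
RETURN id(spot) AS spotId, collect(id(user)) AS userIds
ORDER BY spotId
```

If the same user shows up under more than one Spot, a group probably ended up with more than one representative grid. That can happen when the grids in a group are not all direct neighbors of the grid with the smallest ID.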

An Example Showing the Necessity of Relationship Type Index and Related Execution Plan Optimization

Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capitalOf relationships is relatively small (say, one hundred) compared to that of all nodes and other types of relationships.
I want to fetch any capital which is also the largest city in their country by the following query:
MATCH city-[:capitalOf]->country, city-[:largestCityOf]->country RETURN city
Apparently it would be wise to take the capitalOf type as a clue, scan all 100 relationships of this type, and refine by [:largestCityOf]. However, the current execution plan engine of neo4j would do an AllNodesScan and Expand. Why not consider adding a "RelationshipByTypeScan" operator to the current query optimization engine, like what NodeByLabelScan does?
I know that I can transform relationship types to relationship properties, index it using the legacy index and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell neo4j how to make it efficient. But for a more complicated pattern query with many relationship types but no node id/label/property to start from, it is clearly a duty of the optimization engine to decide which relationship type to start with.
I saw many questions asking the same problem but getting answers like "negative... a query TYPICALLY starts from nodes... ". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. You can rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches to the cities. Since every country does have one capital and one largest city this is efficient and scales independently of the rest of your graph.
If you put all relationships into one index and try to grab the ~100 capitalOf relationships, that operation scales logarithmically with the total number of relationships in your graph.
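To confirm which operators the planner actually picks for the labelled version, the query can be prefixed with PROFILE (or EXPLAIN to see the plan without running it). This is a sketch that assumes the Country label from option 2 has been added to the country nodes.

```cypher
// Shows the executed plan with row counts and db hits per operator;
// look for NodeByLabelScan instead of AllNodesScan at the leaf.
PROFILE
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country)
RETURN city
```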

Resources