I have a graph with two types of nodes, people and products, connected by purchase relationships. I want to link two people nodes together if they share three product nodes, i.e., if they have purchased three of the same things. What is the best way to go about doing this?
The graph has a few hundred million nodes, so which approach will be fastest? In Cypher I was thinking of something like the following, but it is taking ages and seems to be doing nothing.
MATCH path1 = (p1:PEOPLE)--(product:PRODUCT)--(p2:PEOPLE)
WHERE p1.person_id <> p2.person_id
WITH path1,p1,p2
MATCH path2=(p1)--(:PRODUCT)--(p2) WHERE nodes(path2)<>nodes(path1)
WITH path1,path2,p1,p2
MATCH path3 = (p1)--(:PRODUCT)--(p2)
WHERE nodes(path3)<>nodes(path2) and nodes(path3)<>nodes(path1)
create (p1)-[:LINK]->(p2)
create (p2)-[:LINK]->(p1)
Any suggestions much appreciated!
Instead of using multiple MATCH clauses, use the count aggregating function to check the number of common products between two nodes.
MATCH (p1:PEOPLE)--(product:PRODUCT)--(p2:PEOPLE)
WITH p1, p2, count(product) AS commonProductCount
WHERE commonProductCount >= 3
CREATE (p1)-[:LINK]->(p2)
CREATE (p2)-[:LINK]->(p1)
Note: as a rule of thumb, there is no point in creating two LINK relationships in opposite directions. Pick a single direction when creating the relationship, and use undirected patterns such as ()-[:LINK]-() in your queries.
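Putting the note into practice, a minimal sketch of a single-direction version could look like the following. It assumes person_id (taken from the question) is unique and comparable, so that each pair of people is visited only once, and it counts distinct products in case the same product was purchased more than once.

```cypher
// Sketch: one LINK per pair of people sharing at least three distinct products.
// Assumption: person_id values are unique and comparable, so "<" keeps
// only one ordering of each (p1, p2) pair.
MATCH (p1:PEOPLE)--(product:PRODUCT)--(p2:PEOPLE)
WHERE p1.person_id < p2.person_id
WITH p1, p2, count(DISTINCT product) AS commonProductCount
WHERE commonProductCount >= 3
// MERGE avoids creating a second LINK if the query is run again.
MERGE (p1)-[:LINK]->(p2)
```

On a graph with hundreds of millions of nodes this will probably still need to be run in smaller batches rather than as one transaction.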
I have the following graph:
I would like to get all contractors, subcontractors and clients, starting from David.
So I thought of a query like this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Which returns the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b) such as (b:subcontractor) I won't hit millions of rows but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l
There are some things to consider with your query.
The relationship type is not specified. Is it the case that the only relationships from contractor nodes are works_for and hired? If not, you should constrain the relationship types being matched in your query. For example:
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
The fact that (b) is unlabelled does not mean that every node in the graph will be matched. (b) is only reached by traversal from a :contractor node: via the works_for or hired relationships if you specify them, or via any outgoing relationship if you do not. It must also have an outgoing works_for relationship to a :client.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
Depends really on your use cases.
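If you go the multiple-label route, a one-off update along these lines could add the generic label to the existing nodes. It is only a sketch: ExternalStaff is the example label suggested above, and the lowercase contractor and subcontractor labels follow the question's data.

```cypher
// Sketch: add the generic ExternalStaff label to every node that already
// carries one of the specific labels. Existing labels are kept.
MATCH (n)
WHERE n:contractor OR n:subcontractor
SET n:ExternalStaff
```

Contractors need the generic label too, because with *0..1 the (b) node can be the contractor itself.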
I would like to input two specific nodes and return the quantity of relationships that are along the path that connect the specific nodes.
(There is only 1 path possible in every case)
In some cases, two specific nodes are related through two relationships like this:
(Tim)-[]-()-[]-(Bill)
Should return 2 (relationships).
In other cases there are more nodes between my specific start and end nodes. Like this:
(Tim)-[]-()-[]-()-[]-()-[]-(Bill)
Should return 4 (relationships).
I have two types of relationships that could exist between nodes, so I need to avoid being specific about the type of relationship if possible.
New to this and performed an extensive search before asking this question as no one seemed to discuss relationships between specific nodes...
Many thanks for your help!
This query should work:
MATCH p = (:Person {name:'Tim'})-[*]->(:Person {name:'Bill'})
RETURN length(p)
That is: return the length() of path p.
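If the relationships do not all point from Tim towards Bill, or the graph is large, an undirected and bounded variant may be safer. The sketch below assumes the same Person label and name property as the query above; the upper bound of 15 hops is an arbitrary assumption.

```cypher
// Sketch: undirected, bounded search. With only one possible path between
// the two nodes, the shortest path is that path.
// Assumption: 15 is an arbitrary upper bound; pick one that fits your graph.
MATCH p = shortestPath((:Person {name:'Tim'})-[*..15]-(:Person {name:'Bill'}))
RETURN length(p) AS relationshipCount
```

No relationship type is given in the pattern, so both of your relationship types are followed.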
In Neo4j, I have two labels, s:school and o:office, with 100 nodes each.
I need to create a relationship in Cypher between each school and the one office that shares the same ID.
How can I do this?
A bit more detail would be useful, but if both sides have a matching property you can MATCH and CREATE like this:
MATCH (school:School), (office:Office)
WHERE school.property = office.property
CREATE (school)-[:YOUR_RELATIONSHIP_TYPE]->(office)
If you're using Neo4j 2.3.x you'll probably get a warning about cartesian products. It shouldn't be a big deal with only 100 nodes each, but for larger datasets this should be more efficient:
MATCH (school:School)
WITH school, school.property AS property
MATCH (office:Office {property: property})
CREATE (school)-[:YOUR_RELATIONSHIP_TYPE]->(office)
Of course, you should make sure there is an index or a constraint on Office(property).
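For reference, creating that index could look like this. "property" is the placeholder name used in the queries above, not a real property name.

```cypher
// Legacy index syntax, matching the Neo4j 2.x version mentioned above.
// Newer releases (4.x and later) use:
//   CREATE INDEX FOR (o:Office) ON (o.property)
CREATE INDEX ON :Office(property)
```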
I’m new to Neo4j and graph theory and I’m trying to figure out if I can use Neo4j to solve a problem I have. Please correct me if I’m using the wrong words to describe stuff. Since I’m new to the subject I haven’t really wrapped my head around what to call everything.
I think the easiest way to describe my problem is with a lot of pictures.
Let’s say you have two disjoint subgraphs that look like this.
From the subgraphs above I want to get a list of subgraphs that fulfills one of two criteria.
Criteria 1.
If a node has a unique relationship to another node, the nodes and relationship should be returned as a subgraph.
Criteria 2.
If the relations are not unique, I'd like the node with the most relationships to be returned, as a subgraph with its relationships and related nodes.
If several nodes tie under Criteria 2, I want all of their subgraphs to be returned.
Or put in the context of this graph,
Give me the people who have unique games, and if there are other people having the same games, give me back the person with the most games. If there is a tie, return everyone who ties.
Or actually, return the whole subgraph, not only the person.
To clarify what I am after here is a picture that describes the result I want to get. The ordering of the result is not important.
Disjoint subgraph A, because of Criteria 1, Andrew is the only person who has Bubble Bobble.
Disjoint subgraph B, because of Criteria 1, Johan is the only person who has Puzzle Bobble 1.
Disjoint subgraph C, because of Criteria 2, Julia since she has the most games.
Disjoint subgraph D, because of Criteria 2, Anna since she comes in tie with Julia having the most games.
Note that Johan's relationship to Puzzle Bobble 2 is not returned, because it is not unique and he does not have the most games.
Is this a problem you could solve with only Neo4j and is it a good idea?
If you could solve it how would you do it in Cypher?
Create script:
CREATE (p1:Person {name:"Johan"}),
(p2:Person {name:"Julia"}),
(p3:Person {name:"Anna"}),
(p4:Person {name:"Andrew"}),
(v1:Videogame {name:"Puzzle Bobble 1"}),
(v2:Videogame {name:"Puzzle Bobble 2"}),
(v3:Videogame {name:"Puzzle Bobble 3"}),
(v4:Videogame {name:"Puzzle Bobble 4"}),
(v5:Videogame {name:"Bubble Bobble"}),
(p1)-[:HAS]->(v1),
(p1)-[:HAS]->(v2),
(p2)-[:HAS]->(v2),
(p2)-[:HAS]->(v3),
(p2)-[:HAS]->(v4),
(p3)-[:HAS]->(v2),
(p3)-[:HAS]->(v3),
(p3)-[:HAS]->(v4),
(p4)-[:HAS]->(v5)
I feel like this solution might not be quite what you're looking for, but it could be a good start:
MATCH (game:Videogame)<-[:HAS]-(owner:Person)
OPTIONAL MATCH (owner)-[:HAS]->(other_game:Videogame)
WITH game, owner, count(other_game) AS other_game_count
ORDER BY other_game_count DESC
RETURN game, collect(owner)[0]
Here is what the query does:
Finds all of the games and their owners (games without owners will not be matched)
Does an OPTIONAL MATCH against any other games those owners might own (by doing an optional match we're saying that it's OK if they own zero)
Passes through each game/owner pair along with a count of the number of other games owned by that owner, sorting so that those with the most games come first
Returns the first owner for each game (the ORDER is preserved when doing the collect)
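Since the question also asks for ties, one possible extension is to keep every owner whose game count equals the highest count for that game. This is only a sketch; owned, game_count, owners, top and top_owners are names introduced here for illustration.

```cypher
MATCH (game:Videogame)<-[:HAS]-(owner:Person)
// A separate MATCH, so the count also includes the game matched above.
MATCH (owner)-[:HAS]->(owned:Videogame)
WITH game, owner, count(owned) AS game_count
// Per game: collect each owner with their count, and find the highest count.
WITH game, collect({owner: owner, n: game_count}) AS owners, max(game_count) AS top
// Keep only the owners whose count equals the highest count for that game.
RETURN game, [o IN owners WHERE o.n = top | o.owner] AS top_owners
```

On the sample data, Puzzle Bobble 2 should come back with both Julia and Anna, since each has three games against Johan's two.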
The question:
What is the most performant way to create the following MATCH statement and why?
The detailed problem:
Let's say we have a Place node with a variable number of properties and need to look up nodes from potentially billions of nodes by its category. I'm trying to wrap my head around the performance of each query and it's proving to be quite difficult.
The possible queries:
Match Place node using a property lookup:
MATCH (entity:Place { category: "Food" })
Match Place node with isCategory relationship to Food node:
MATCH (entity:Place)-[:isCategory]->(category:Food)
Match Place node with Food relationship to Category node:
MATCH (entity)-[category:Food]->(:Category)
Match Food node with isCategoryFor relationship to Place node:
MATCH (category:Food)-[:isCategoryFor]->(entity:Place)
And obviously all the variations in between. With relationship directions going the other way as well.
More complexity:
Let's throw in a little more complexity and say we now need to find all Place nodes using multiple categories. For example: Find all Place nodes with category Food or Bar
Would we just tack on another MATCH statement? If not, what is the most performant route to take here?
Extra:
Is there a tool to help me describe the traversal process and tell me the best method to choose?
If I understand your domain correctly, I would recommend making your categories into nodes themselves.
MERGE (:Category {name:"Food"})
MERGE (:Category {name:"Bar"})
MERGE (:Category {name:"Park"})
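And connecting each Place node to the categories it belongs to. Merge the nodes and the relationships separately: a MERGE on a whole pattern creates every part of it when the full pattern is not found, which would duplicate the Category nodes.
MATCH (food:Category {name:"Food"}), (bar:Category {name:"Bar"}), (park:Category {name:"Park"})
MERGE (centralPark:Place {name:"Central Park"})
MERGE (joes:Place {name:"Joe's Diner"})
MERGE (centralPark)-[:IS_A]->(park)
MERGE (joes)-[:IS_A]->(food)
MERGE (joes)-[:IS_A]->(bar)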
Then, if you want to find Places that belong to a Category, it can be pretty quick. Start by matching the category, then branch out to the places related to the category.
MATCH (c:Category {name:"Bar"}), (c)<-[:IS_A]-(p:Place)
RETURN p
You'll have a relatively limited number of categories, so matching the category will be quick. Then, because of the way Neo4j actually stores data, it will be fast to find all the places related to that category.
More Complexity
Finding places within multiple categories will be easy as well.
MATCH (c:Category)<-[:IS_A]-(p:Place)
WHERE c.name = "Bar" OR c.name = "Food"
RETURN p
Again, you just match the categories first (fast because there aren't many of them), then branch out to the connected places.
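A place that belongs to both categories will show up once per matching category. If that matters, a variant like the following sketch returns each place once; it assumes the same Category, Place and IS_A names used above.

```cypher
// Sketch: match several categories with an IN list and de-duplicate places
// that belong to more than one of them.
MATCH (c:Category)<-[:IS_A]-(p:Place)
WHERE c.name IN ["Bar", "Food"]
RETURN DISTINCT p
```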
Use an Index
If you want fast, you need to use indexes where it makes sense. In this example, I would use an index on the category's name property.
CREATE INDEX ON :Category(name)
Or better yet, use a uniqueness constraint on the category names, which will index them and prevent duplicates.
CREATE CONSTRAINT ON (c:Category) ASSERT c.name IS UNIQUE
Indexes (and uniqueness) make a big difference on the speed of your queries.
Why this is fastest
Neo4j stores nodes and relationships in a very compact, quick-to-access format. Once you have a node or relationship, getting the adjacent relationships or nodes is very fast. However, it stores each node's (and relationship's) properties separately, meaning that looking through properties is relatively slow.
The goal is to get to a starting node as quickly as possible. Once there, traversing related entities is quick. If you only have 1,000 categories, but you have a billion places, it will be faster to pick out an individual Category than an individual Place. Once you have that starting node, getting to related nodes will be very efficient.
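As for a tool that describes the traversal: you can prefix a query with EXPLAIN to see the planned execution without running it, or with PROFILE to run it and see the rows and database hits for each step. A small sketch, using the category query from above:

```cypher
// PROFILE runs the query and reports rows and db hits per operator.
// Swap PROFILE for EXPLAIN to see the plan without executing the query.
PROFILE
MATCH (c:Category {name:"Bar"})<-[:IS_A]-(p:Place)
RETURN p
```

Comparing the plans of your candidate queries on realistic data is probably the most reliable way to choose between them.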
The Other Options
Just to reinforce, this is what makes your other options slower or otherwise worse.
In your first example, you are looking through properties on each node to look for the match. Property lookup is slow and you are doing it a billion times. An index can help with this, but it's still a lot of work. Additionally, you are effectively duplicating the category data over each of your billion places, and not taking advantage of Neo4j's strengths.
In all your other examples, your data models seem odd. "Food", "Bar", "Park", etc. are all instances of categories, not separate types. They should each be their own node, but they should all have the Category label, because that's what they are. In addition, categories are things, and thus they should be nodes. A relationship describes the connection between things. It does not make sense to use categories in this way.
I hope this helps!