Traversing through all nodes and comparing each one with every other one - neo4j

I am working on a little project and I have a dataset of about 60k nodes and 500k relationships between those nodes. The nodes are of two types: the first type is recipes and the second type is ingredients. Recipes are composed of ingredients like:
(ingredient)-[:IS_PART_OF]->(recipe)
My objective is to find how many common ingredients two recipes share. I have managed to obtain this information with the following query that compares one recipe to all others (the first one with all others):
MATCH (recipe:RECIPE{ ID: 1000000 }),(other)
WHERE (other.ID >= 1000001 AND other.ID <= 1057690)
OPTIONAL MATCH (recipe:RECIPE)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(other)
WITH ingredient, other
RETURN other.ID, count(distinct ingredient.name)
ORDER BY other.ID DESC
My first question: How can I obtain the number of all ingredients of two recipes in a way that the mutual ones are counted only once (union of R1 and R2 --> R1 U R2)?
My second question: Is it possible to write a loop that would iterate through all the recipes and check for common ingredients? The objective is to compare each recipe with all others. I think this should return (n-1)*(n/2) rows.
I have tried the above and the problem remains. Even with LIMIT and SKIP I cannot run the code on the whole set. I have changed my query so that it allows me to partition my set accordingly:
MATCH (recipe1)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(recipe2)
WHERE (recipe2.ID >= 1000000 AND recipe2.ID <= 1000009) AND (recipe1.ID >= 1000000 AND recipe1.ID <= 1000009) AND (recipe1.ID < recipe2.ID)
RETURN recipe1.ID, count(distinct ingredient.name) AS MutualIngredients, recipe2.ID
ORDER BY recipe1.ID
Until I get my hands on a better machine this will suffice.
I still haven't solved my first question: How can I obtain the number of all ingredients of two recipes in a way that the mutual ones are counted only once (union of R1 and R2 --> R1 U R2)?

You'll need to play with this, but it's going to be something similar to this:
MATCH (recipe1:RECIPE)<-[:IS_PART_OF]-(ingred:INGREDIENT)-[:IS_PART_OF]->(recipe2:RECIPE)
WHERE ID(recipe1) < ID(recipe2)
RETURN recipe1, collect(ingred.name), recipe2
ORDER BY recipe1.ID
The match pattern gets you all of the common ingredients between two recipes. The WHERE clause ensures that you're not comparing a recipe to itself (because it would share all ingredients with itself). The return clause just gives you the two recipes you're comparing, and what they have in common.
This will be O(n^2) though, and will be very slow.
UPDATE: I took Nicole's suggestion, which is a good one. It should guarantee that each pair is only considered once.
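To make the pair count concrete, here is a minimal Python sketch (not part of the original answer) showing that keeping only pairs with first ID < second ID yields each unordered pair exactly once, which gives n*(n-1)/2 rows. The recipe IDs and the helper name count_pairs are hypothetical. If the ID range in the question (1000000 to 1057690) has no gaps, that would be 57,691 recipes and roughly 1.66 billion pairs, which is consistent with the query being very slow.

```python
from itertools import combinations


def count_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical recipe IDs standing in for the real data set.
recipe_ids = [1000000, 1000001, 1000002, 1000003, 1000004]

# Keeping only pairs with a < b mirrors WHERE ID(recipe1) < ID(recipe2):
# no recipe is paired with itself and no pair appears twice.
pairs = [(a, b) for a in recipe_ids for b in recipe_ids if a < b]

# The filtered pairs are exactly the 2-element combinations.
assert pairs == list(combinations(recipe_ids, 2))
assert len(pairs) == count_pairs(len(recipe_ids))
print(len(pairs))
```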

SOLVED: Sharing the query in case someone else needs it:
MATCH (recipe1)<-[:IS_PART_OF]-(ingredient:INGREDIENT)-[:IS_PART_OF]->(recipe2)
MATCH (recipe1)<-[:IS_PART_OF]-(ingredient1:INGREDIENT)
MATCH (recipe2)<-[:IS_PART_OF]-(ingredient2:INGREDIENT)
WHERE (recipe2.ID >= 1000000 AND recipe2.ID <= 1000009) AND (recipe1.ID >= 1000000 AND recipe1.ID <= 1000009) AND (recipe1.ID < recipe2.ID)
RETURN recipe1.ID, count(distinct ingredient1.name) + count(distinct ingredient2.name) - count(distinct ingredient.name) AS RecipesUnion, recipe2.ID
ORDER BY recipe1.ID
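The RETURN clause above applies the inclusion-exclusion rule |R1 U R2| = |R1| + |R2| - |R1 ∩ R2|. A minimal Python sketch of the same arithmetic, using made-up ingredient sets (the names are illustrative only), might look like this. Note that, as far as I can tell, the first MATCH requires at least one shared ingredient, so pairs with nothing in common would not appear in the query result at all.

```python
def union_size(r1: set, r2: set) -> int:
    """Size of the union via inclusion-exclusion, as in the RETURN clause."""
    return len(r1) + len(r2) - len(r1 & r2)


# Hypothetical ingredient names for two recipes.
recipe1 = {"flour", "egg", "milk"}
recipe2 = {"egg", "milk", "sugar", "butter"}

# 3 + 4 - 2 shared = 5 distinct ingredients overall.
assert union_size(recipe1, recipe2) == len(recipe1 | recipe2)
print(union_size(recipe1, recipe2))
```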

Related

Filtering path with variable length multiple relationships (People You May Know query)

So let's say we have User nodes, Company nodes, Project nodes, School nodes and Event nodes, with the following relationships between them:
(User)-[:WORKED_AT {start: timestamp, end:timestamp}]->(Company)
(User)-[:COLLABORATED_ON]->(Project)
(Company)-[:COLLABORATED_ON]->(Project)
(User)-[:IS_ATTENDING]->(Event)
(User)-[:STUDIED_AT]->(School)
I am trying to recommend users to any given user. My starting query looks like this:
MATCH p=(u:User {id: {leftId}})-[r:COLLABORATED_ON|:AUTHORED|:WORKED_AT|:IS_ATTENDING|:STUDIED_AT*1..3]-(pymk:User)
RETURN p
LIMIT 24
This returns all the pymk users within 1 to 3 relationships of the given user, which is fine. But I want to filter the paths according to the relationship properties. For example, I want to remove the following path if the work periods (start date to end date) of the user and the pymk do not overlap:
(User)-[:WORKED_AT]->(Company)<-[:WORKED_AT]-(User)
I can do this with a single query:
MATCH (u:User)-[r1:WORKED_AT]->(Company)<-[r2:WORKED_AT]-(pymk:User)
WHERE
(r1.startedAt < r2.endedAt) AND (r2.startedAt < r1.endedAt)
RETURN pymk
But I couldn't get my head around doing it within a collection of paths. I don't even know if this is possible.
Any help is appreciated.
This should do the trick:
MATCH p=(:User {id: {leftId}})-[:COLLABORATED_ON|:AUTHORED|:WORKED_AT|:IS_ATTENDING|:STUDIED_AT*1..3]-(:User)
WITH p, [rel in relationships(p) WHERE type(rel) = 'WORKED_AT'] as worked
WHERE size(worked) <> 2 OR
apoc.coll.max([work in worked | work.startedAt]) < apoc.coll.min([work in worked | work.endedAt])
RETURN p
LIMIT 24
We're using APOC here to get the max and min of a collection (in plain Cypher, max() and min() are aggregation functions that work across rows and can't be used on lists).
This relies on distilling the overlap logic down to max([start times]) < min([end times]), which is explained in a highly popular answer on the subject.
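As a rough illustration of that overlap test, here is a small Python sketch (not from the original answer). The function names overlaps and keep_path are hypothetical, and each WORKED_AT relationship is assumed to be a (startedAt, endedAt) tuple of comparable timestamps. keep_path mirrors the WHERE clause: paths that do not contain exactly two WORKED_AT relationships are kept, and the others are kept only if the two periods overlap.

```python
def overlaps(periods):
    """True when all (start, end) periods share a common stretch of time."""
    latest_start = max(start for start, _ in periods)
    earliest_end = min(end for _, end in periods)
    return latest_start < earliest_end


def keep_path(worked):
    """Mirror of: size(worked) <> 2 OR max(starts) < min(ends)."""
    return len(worked) != 2 or overlaps(worked)


# Hypothetical employment periods as (startedAt, endedAt) timestamps.
print(keep_path([(1, 5), (3, 8)]))   # overlapping periods
print(keep_path([(1, 2), (5, 8)]))   # disjoint periods
print(keep_path([(1, 2)]))           # only one WORKED_AT in the path
```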

Window in cypher

So basically it comes down to this. I have a (:PERSON) that used his (:CAR) at a given (:TIME). This triplet is fully connected. It might be that a (:CAR) is used by other (:PERSON) and a (:PERSON) can use multiple (:CAR) all of that at different (:TIME).
For each combination (p:PERSON)-[:AT]-(t:TIME), I want the number of cars that person used in the six hours before t, i.e. matches of (p)-[:USED]-(c:CAR)-[:AT]-(o:TIME) where o falls between t-6H and t.
Here is what I have achieved so far, but this only takes each :PERSON once.
MATCH (n:PERSON)-[:AT]-(t:TIME)
WITH n,t
MATCH (n)-[:USED]-(c:CAR)-[:AT]-(o:TIME)
WITH n,t,c,toFloat(t.id) as current, toFloat(o.id) as previous
WITH n,t,c,current-previous as diff
WHERE (diff) >= 0 AND (diff) <= 3600*6
WITH n, count(distinct c) as cnt
RETURN n, cnt
Here :TIME(id) is a String containing the time in seconds.
Hope this is clear. Thanks for the help.
You should group the count by both the person and t:
MATCH (n:PERSON)-[:AT]-(t:TIME)
WITH n,t
MATCH (n)-[:USED]-(c:CAR)-[:AT]-(o:TIME)
WITH n,t,c,toFloat(t.id) - toFloat(o.id) as diff
WHERE (diff) >= 0 AND (diff) <= 3600*6
WITH n,t, count(distinct c) as cnt
RETURN n,t, cnt
Also, you should make your TIME(id) a numeric value so you can remove the toFloat calls from your query, which will improve performance.
Maybe you should put the time on your USED relationship instead. Either you'll want only one USED relationship per Person + Car with a collection of times on it (not nice for querying), or you'll have multiple USED relationships, one per usage time.
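To illustrate why grouping by both the person and t matters, here is a minimal Python sketch of the windowed count (not from the original answer). It assumes a simplified model in which each usage is a (person, car, time_in_seconds) triple and only the person's own usages count, which is the intent described in the question; the Cypher pattern itself may also match TIME nodes attached to the car through other people's usages. The names cars_in_window and usages are hypothetical.

```python
from collections import defaultdict

SIX_HOURS = 6 * 3600  # window length in seconds


def cars_in_window(usages, window=SIX_HOURS):
    """For each (person, t), count distinct cars the person used in [t - window, t]."""
    by_person = defaultdict(list)
    for person, car, t in usages:
        by_person[person].append((t, car))

    result = {}
    for person, events in by_person.items():
        for t, _ in events:
            # Same filter as the query: 0 <= t - o <= 3600 * 6.
            cars = {car for o, car in events if 0 <= t - o <= window}
            result[(person, t)] = len(cars)
    return result


# Hypothetical usage records: (person, car, time in seconds).
usages = [
    ("alice", "car1", 0),
    ("alice", "car2", 3600),
    ("alice", "car1", 30000),
    ("bob", "car1", 3600),
]
counts = cars_in_window(usages)
for key in sorted(counts):
    print(key, counts[key])
```

Grouping only by person would collapse the three rows for "alice" into one, which is the behaviour the question describes.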

neo4j - find nodes with strong relationship

I have in my graph places and persons as labels, and a relationship "knows_the_place". Like:
(person)-[knows_the_place]->(place)
A person usually knows multiple places.
Now I want to find the persons with a "strong" relationship via the places (i.e. persons who have a lot of places in common). For example, I want to query all persons that share at least 3 different places, with something like this (not working!) query:
MATCH
(a:person)-[:knows_the_place]->(x:place)<-[:knows_the_place]-(b:person),
(a:person)-[:knows_the_place]->(y:place)<-[:knows_the_place]-(b:person),
(a:person)-[:knows_the_place]->(z:place)<-[:knows_the_place]-(b:person)
WHERE NOT x=y and y=z
RETURN a, b
How can I do this with neo4j Query?
Bonus-Question:
Instead of showing me the persons who have x places in common with another person, it would be even better if I could get an ordered list like:
a shares 7 places with b
c shares 5 places with b
d shares 2 places with e
f shares 1 place with a
...
Thanks for your help!
Here you go:
MATCH (a:person)-[:knows_the_place]->(x:place)<-[:knows_the_place]-(b:person)
WITH a, b, count(x) AS count
WHERE count >= 3
RETURN a, b, count
To order:
MATCH (a:person)-[:knows_the_place]->(x:place)<-[:knows_the_place]-(b:person)
RETURN a, b, count(x) AS count
ORDER BY count(x) DESC
You can also do both by adding an ORDER BY to the end of the first query.
Keep in mind that this query is effectively a cartesian product of a and b, so it will examine every combination of person nodes, which may not perform well if you have a lot of person nodes. It will also return each pair twice, once as (a, b) and once as (b, a); adding WHERE id(a) < id(b) right after the MATCH avoids that. Neo4j 2.3 should warn you about these sorts of queries.
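For readers who want to check the expected result on a small sample, here is a minimal Python sketch of the same counting logic (not from the original answer). The dictionary knows, the function shared_place_counts and the sample data are all hypothetical; each unordered pair is counted once.

```python
from collections import Counter
from itertools import combinations


def shared_place_counts(knows):
    """Count shared places for every unordered pair of persons."""
    counts = Counter()
    for a, b in combinations(sorted(knows), 2):
        shared = len(knows[a] & knows[b])
        if shared:  # pairs with nothing in common never match the pattern
            counts[(a, b)] = shared
    return counts


# Hypothetical data: person -> set of known places.
knows = {
    "a": {"p1", "p2", "p3", "p4"},
    "b": {"p1", "p2", "p3"},
    "c": {"p3"},
}

counts = shared_place_counts(knows)

# Equivalent of WHERE count >= 3, ordered by count descending.
strong = [(pair, n) for pair, n in counts.most_common() if n >= 3]
print(strong)
```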

Neo4j - Get all related nodes of type and create new relationship

I have a dataset that looks like this: (Artefact)-[HAS]-(Keyword). Keywords can be shared by multiple artefacts. What I am trying to achieve is:
Returning the most interconnected keyword nodes, the count of artefacts related to each keyword, and the count of the overlap between keyword nodes across the hop (keyword)-(artefact)-(keyword), i.e. the "shared" artefact count between two keywords.
In other words a count of the artefact records within an intersect between two keyword nodes. For example given these three artefact nodes
1) spoon (keywords; metal, food)
2) sword (keywords; metal, fighting)
3) fork (keywords; metal, food)
The query would therefore return the keyword node, count of artefacts related to keyword (3, spoon, sword and fork), count of the keywords related by artefact between keyword nodes (metal has 2 indirect connections to food and 1 to fighting).
Once I've worked that out, I'd like to create a related_to relationship between keywords that stores the number of artefacts they have in common, for the sake of speed (I realise this is a big query). For now I only select 1 record to create this relationship, to test that it works :) (hence LIMIT 1).
MATCH (n:Keyword)-[r*2]-(x:Keyword)
WITH n, COUNT(r) AS c, x
LIMIT 1
MERGE (n)-[s:RELATED_KEY]-(x) SET s.weight = c
I'm using neo4j community edition (2.1.6),
Many thanks, Andy
This query will return the first part of your answer:
MATCH (k:Keyword)
WITH k
LIMIT 1
MATCH (k)<-[:HAS]-(a)
WITH k, collect(a) as artefacts
WITH k, artefacts, size(artefacts) as c
UNWIND artefacts as artefact
MATCH (k)<-[:HAS]-(artefact)-[:HAS]->(k2)
RETURN c, artefacts, collect(distinct(k2.name)) as keywords, count(distinct(k2.name)) as keyWordsCount
However, I guess you could create the relationships between the related nodes directly:
MATCH (k:Keyword)
WITH k
LIMIT 1
MATCH (k)<-[:HAS]-(a)-[:HAS]->(other)
MERGE (k)-[r:RELATED_TO]->(other)
ON CREATE SET r.weight = 1
ON MATCH SET r.weight = r.weight + 1
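To see what weights the MERGE with ON CREATE / ON MATCH would produce, here is a small Python sketch (not from the original answer) that replays the same logic on the spoon, sword and fork example from the question. It assumes the query is run for every keyword rather than with LIMIT 1, and the name related_weights is hypothetical. Incrementing a Counter plays the role of "set the weight to 1 on creation, add 1 on every later match".

```python
from collections import Counter
from itertools import permutations

# Artefacts and their keywords, taken from the example in the question.
artefacts = {
    "spoon": {"metal", "food"},
    "sword": {"metal", "fighting"},
    "fork": {"metal", "food"},
}


def related_weights(artefacts):
    """Weight of each directed (keyword, other) pair = number of shared artefacts."""
    weights = Counter()
    for keywords in artefacts.values():
        # Each artefact links every ordered pair of its keywords once,
        # like one match of (k)<-[:HAS]-(a)-[:HAS]->(other).
        for k, other in permutations(sorted(keywords), 2):
            weights[(k, other)] += 1
    return weights


weights = related_weights(artefacts)
print(weights[("metal", "food")], weights[("metal", "fighting")])
```

One caveat that follows from the increment logic: running the query a second time would add to the existing weights again, so it is probably best run once on a graph without RELATED_TO relationships.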

neo4j cartesian product performance improvement

I have a graph database with over 2 million nodes. I have an application which takes a social graph and does some inference on it. As one step of the algorithm, I have to get all possible combinations of the [:friend] relationships of two connected nodes. Currently, I have a query which looks like:
match (a)-[:friend]-(c), (b)-[:friend]-(d) where id(a)={ida} and id(b)={idb} return distinct c as first, d as second
So, I already know the nodes a and b and I want to get all the possible pairs that can be made from friends of a and b.
This is obviously a very slow operation. I was wondering if there is a more efficient way of getting the same result in neo4j. Perhaps adding indexes might help? Any ideas / clues are welcome!
Example
Node a has friends : x, y
Node b has friends : g, h, i
Then the result should be:
x,g
x,h
x,i
y,g
y,h
y,i
If you are not doing so already, you should use labels to speed up your query, which might look like:
MATCH (p1:Person)-[:FRIEND]->(p3:Person),(p2:Person)-[:FRIEND]->(p4:Person)
WHERE ID(p1) = 6 AND ID(p2) = 7
RETURN p3 as first, p4 as second
Obviously that will rely on you having created your nodes with a :Person label.
How many friends does the average node have?
I wouldn't use two patterns but just one and the IN operator.
MATCH (p:Person)-[:FRIEND]->(friend:Person)
WHERE id(p) IN [1,2,3]
RETURN p, collect(friend) as friends
Then you have no cross product in the query, and you also get the friends back neatly as one collection per person.
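If the pairs themselves are still needed, they could be built on the client side from the collected lists. A minimal Python sketch, assuming the query result has already been loaded into a dictionary of person to friends (the names friends and friend_pairs are hypothetical), using the example from the question:

```python
from itertools import product


def friend_pairs(friends, first, second):
    """All (c, d) pairs with c a friend of `first` and d a friend of `second`."""
    return list(product(friends[first], friends[second]))


# Hypothetical result of the collect(friend) query, keyed by person.
friends = {
    "a": ["x", "y"],
    "b": ["g", "h", "i"],
}

pairs = friend_pairs(friends, "a", "b")
for c, d in pairs:
    print(f"{c},{d}")
```

The database then only returns one row per person, and the number of pairs (friends of a times friends of b) is generated locally.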

Resources