I have a graph database with over 2 million nodes. I have an application that takes a social graph and does some inference on it. As one step of the algorithm, I have to get all possible pairs formed from the [:friend] neighbours of two connected nodes. Currently, I have a query which looks like:
match (a)-[:friend]-(c), (b)-[:friend]-(d) where id(a)={ida} and id(b)={idb} return distinct c as first, d as second
So, I already know the nodes a and b and I want to get all the possible pairs that can be made from friends of a and b.
This is obviously a very slow operation. I was wondering if there is a more efficient way of getting the same result in Neo4j. Perhaps adding indexes might help? Any ideas or clues are welcome!
Example
Node a has friends: x, y
Node b has friends: g, h, i
Then the result should be:
x,g
x,h
x,i
y,g
y,h
y,i
If you are not already doing so, you should use labels to speed up your query, which might look like:
MATCH (p1:Person)-[:FRIEND]->(p3:Person),(p2:Person)-[:FRIEND]->(p4:Person)
WHERE ID(p1) = 6 AND ID(p2) = 7
RETURN p3 as first, p4 as second
Obviously that will rely on you having created your nodes with a :Person label.
How many friends does the average node have?
I wouldn't use two patterns, just one pattern together with the IN operator.
MATCH (p:Person)-[:FRIEND]->(friend:Person)
WHERE id(p) IN [1,2,3]
RETURN p, collect(friend) as friends
Then you have no cross product, and you can also return the friends neatly as one collection per person.
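If the pairs themselves are still needed, one possible way to combine the two ideas is to collect each friend list once and only then expand the pairs. This is a sketch, not a measured optimisation: it assumes the [:friend] relationship type and the {ida} / {idb} parameters from the question, and the names firsts and seconds are made up for illustration.

```cypher
// Collect the distinct friends of a once.
MATCH (a)-[:friend]-(c)
WHERE id(a) = {ida}
WITH collect(DISTINCT c) AS firsts
// Collect the distinct friends of b once, carrying the first list along.
MATCH (b)-[:friend]-(d)
WHERE id(b) = {idb}
WITH firsts, collect(DISTINCT d) AS seconds
// Expand both lists: every element of firsts is paired with every element of seconds.
UNWIND firsts AS first
UNWIND seconds AS second
RETURN first, second
```

The result is still |friends(a)| x |friends(b)| rows, because that is what was asked for, but each side is deduplicated before the pairs are built, so the final DISTINCT over the pairs is no longer needed. Newer Neo4j versions write parameters as $ida and $idb instead of {ida} and {idb}.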
Related
When doing a Cypher query to retrieve a specific subgraph with automorphisms, let's say
MATCH (a)-[:X]-(b)-[:X]-(c)
RETURN a, b, c
It seems that the default behaviour is to return every retrieved subgraph and all their automorphisms.
In that example, if (u)-[:X]-(v)-[:X]-(w) is a graph matching the pattern, the output will be u,v,w but also w,v,u, which is the same graph.
Is there a way to retrieve each subgraph only once?
EDIT: It would be great if Cypher had a feature to do that during the search, using some kind of symmetry-breaking condition, as it would reduce the computing time. If that is not the case, how would you post-process the results to get the desired output?
In the query you are making, (a)-[r:X]-(b) and (a)-[t:X]-(c) refer to a similar pattern, since (b) and (c) can be interchanged. Why match it twice? MATCH (a)-[r:X]-(b) RETURN a, r, b returns all the subgraphs you are looking for.
EDIT
You can do something like the following to find the nodes that have two relationships of type X.
MATCH (a)-[r:X]-(b) WHERE size((a)-[:X]-()) = 2 RETURN a, r, b
For this kind of mirrored pattern, we can add a restriction on the internal graph ids so that only one of the two paths is kept:
MATCH (a)-[:X]-(b)-[:X]-(c)
WHERE id(a) < id(c)
RETURN a, b, c
This will also prevent the case where a = c.
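The same idea can be stretched to patterns with more symmetry by imposing a strict order on every interchangeable node. As a sketch only, here is a hypothetical triangle pattern (not part of the original question) over the same :X relationship type:

```cypher
// Without the WHERE clause, each triangle is matched once per ordering of its nodes.
// Requiring id(a) < id(b) < id(c) keeps exactly one of those orderings.
MATCH (a)-[:X]-(b)-[:X]-(c)-[:X]-(a)
WHERE id(a) < id(b) AND id(b) < id(c)
RETURN a, b, c
```

This assumes at most one :X relationship between any two nodes. With parallel relationships, the same three nodes can still come back more than once, one row per combination of relationships.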
I am a newbie who has just started learning graph databases, and I have a problem querying the relationships between nodes.
My graph is like this:
There are multiple relationships from one node to another, and the IDs of these relationships are different.
How can I find pairs of nodes where the number of relationships between the two nodes is greater than 2? Or is there a problem with the model of this graph?
Just like on the graph, I want to query node d and node a.
I tried to use the following statement, but the result is incorrect:
match (from)-[r:INVITE]->(to)
with from, count(r) as ref
where ref >2
return from
It seems to count all the relationships issued by each from node, not the relationships between a particular from-->to pair.
To return nodes that have more than 2 relationships between them, you need to check the size of the collected relationships, something like:
MATCH (x:Person)-[r:INVITE]-(:Party)
WITH x, size(collect(r)) as inviteCount
WHERE inviteCount > 2
RETURN x
Aggregating functions like COLLECT and COUNT use non-aggregating terms in the same WITH (or RETURN) clause as "grouping keys".
So, here is one way to get pairs of nodes that have more than 2 INVITE relationships (in a specific direction) between them:
MATCH (from)-[r:INVITE]->(to)
WITH from, to, COUNT(r) AS ref
WHERE ref > 2
RETURN from, to
NOTE: Ideally (for clarity and efficiency), your nodes would have specific labels and the MATCH pattern would specify those labels.
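A labelled variant of that query might look like the sketch below. The :Person label is an assumption made for illustration; the question does not say which labels the nodes carry.

```cypher
// from and to are both non-aggregated terms, so together they form the grouping key:
// count(r) is computed once per (from, to) pair rather than once per from node.
MATCH (from:Person)-[r:INVITE]->(to:Person)
WITH from, to, count(r) AS ref
WHERE ref > 2
RETURN from, to, ref
ORDER BY ref DESC
```

Returning ref alongside the pair makes it easy to check that the count really is per pair.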
I need to build a graph of multiple nodes (more than 2) together with their relationships at the first, second, and third degree.
For that, I am currently using this query:
WITH ["1258311979208519680","3294971891","1176078684270333952","117607868427845"] as ids
MATCH (n1:Target),(n2:Target) WHERE n1.id in ids and n2.id in ids and n1.id<>n2.id and n1.uid=103 and n2.uid=103
MATCH p = ((n1)-[*..3]-(n2)) RETURN p limit 30
Here the IDs of 4 nodes are listed in WITH [...], and [*..3] is used to traverse up to the third degree between the selected nodes.
WHAT THE ABOVE QUERY IS DOING
In the second-degree case [*..2], the query returns the mutual nodes whenever any 2 of the selected nodes have a mutual relation.
WHAT I WANT
1) First of all, I want to optimize the query: it takes a long time, and it causes a Cartesian product that slows down query processing.
2) The query above returns data if any 2 nodes have a mutual relationship. What I want is for the query to return only the mutual nodes attached to all selected nodes. That is, any node returned must have a relation to every selected target node.
Any suggestions on how to modify and optimize the query?
If you are looking to avoid the Cartesian product issue with the given query
WITH ["1258311979208519680","3294971891","1176078684270333952","117607868427845"] as ids
MATCH (n1:Target),(n2:Target) WHERE n1.id in ids and n2.id in ids and n1.id<>n2.id and n1.uid=103 and n2.uid=103
MATCH p = ((n1)-[*..3]-(n2)) RETURN p limit 30
I suggest using this one instead:
MATCH (node1:Target) WHERE node1.id IN ["1258311979208519680","3294971891","1176078684270333952"]
MATCH (node2:Target) WHERE node2.id IN ["1258311979208519680","3294971891","1176078684270333952"]
and node1.id <> node2.id
MATCH p=(node1)-[*..2]-(node2)
RETURN p
It will remove the cartesian product issue.
Try this..
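For the second requirement (returning only nodes related to every selected target), one possible approach is to group by the candidate node and count how many distinct targets it touches. The sketch below assumes that "related" means directly connected (one hop) and that each id in the list identifies exactly one :Target with uid 103; the names m and hits are made up for illustration.

```cypher
WITH ["1258311979208519680","3294971891","1176078684270333952","117607868427845"] AS ids
// Find every node m adjacent to at least one selected target.
MATCH (t:Target)--(m)
WHERE t.id IN ids AND t.uid = 103
// Group by m and count how many different selected targets it is connected to.
WITH m, ids, count(DISTINCT t) AS hits
// Keep only the nodes connected to all of the selected targets.
WHERE hits = size(ids)
RETURN m
```

For mutual nodes further away, the single hop could be swapped for a variable-length pattern such as -[*..2]-, at a correspondingly higher cost.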
In a query like this
MATCH (a)
WHERE id(a) = {x}
MATCH (a)-->(b:x)
WITH a, collect(DISTINCT id(b)) AS Bs
MATCH (a)-->(c:y)
RETURN collect(c) + Bs
what I'm trying to do is to gather two sets of nodes that came from different queries, but with this kind of procedure all the b rows get to be returned multiplied by the number of a rows.
How should I deal with this kind of problem that arises from sequential queries?
[Note that the reported query is only a conceptual representation of what I mean. Please don't try to solve the code (that would be trivial) but only the presented problem.]
Your query shouldn't return any cross product since you aggregate in the WITH clause, so there is only one result item/row (the disconnected path a, collect(b)) when the second match begins. It's therefore not clear what problem you want solved; cross products can be solved differently in different cases.
The way your query would work, conceptually speaking, is: match anything related from a, then filter that anything on having label :x. The second leg of the query does the same but filters on label :y. You can therefore combine your queries as
MATCH (a)-->(b)
WHERE id(a) = {x} AND (b:x OR b:y)
RETURN b
Other cases of 'path explosion' can't be solved as easily (sometimes UNION is good, sometimes you can reorder your pattern, sometimes you can do some aggregate-and-reduce to make it happen), but you'll have to ask about that separately.
How about using UNION for this? See http://docs.neo4j.org/chunked/milestone/query-union.html#union-combine-two-queries-and-remove-duplicates
-brian
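As a rough sketch of the UNION suggestion applied to the conceptual query from the question (the column alias n is made up; UNION requires both branches to return the same column names):

```cypher
// First branch: neighbours of a that carry label :x.
MATCH (a)-->(b:x)
WHERE id(a) = {x}
RETURN b AS n
UNION
// Second branch: neighbours of a that carry label :y.
// Plain UNION removes duplicate rows; UNION ALL would keep them.
MATCH (a)-->(c:y)
WHERE id(a) = {x}
RETURN c AS n
```

Each branch is evaluated on its own, so rows from one MATCH are never multiplied by rows from the other. Newer Neo4j versions write the parameter as $x instead of {x}.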
I have a scenario where I have more than 2 random nodes.
I need to get all possible paths connecting all three nodes. I do not know the direction of relation and the relationship type.
Example: I have a graph database with three nodes, person->Purchase->Product.
I need to get the path connecting these three nodes. But I do not know the order in which I need to query, for example if I give the query as person-Product-Purchase, it will return no rows as the order is incorrect.
So in this case how should I frame the query?
In a nutshell, I need to find the path between more than two nodes, where the nodes in the match clause may be given in whatever order the user knows.
You could list all of the nodes in multiple bound identifiers in the start, and then your match would find the ones that match, in any order. And you could do this for N items, if needed. For example, here is a query for 3 items:
start a=node:node_auto_index('name:(person product purchase)'),
b=node:node_auto_index('name:(person product purchase)'),
c=node:node_auto_index('name:(person product purchase)')
match p=a-->b-->c
return p;
http://console.neo4j.org/r/tbwu2d
I actually just made a blog post about how start works, which might help:
http://wes.skeweredrook.com/cypher-it-all-starts-with-the-start/
Wouldn't it be acceptable to make several queries? In your case you'd automatically generate 6 queries covering all the possible orderings (the factorial of the number of variables).
A possible solution would be to first get three sets of nodes (s,m,e). These sets may be the same as in the question (or contain partially or completely different nodes). The sets are important, because starting, middle and end node are not fixed.
Here is the code for the Matrix example with added nodes.
match (s) where s.name in ["Oracle", "Neo", "Cypher"]
match (m) where m.name in ["Oracle", "Neo", "Cypher"] and s <> m
match (e) where e.name in ["Oracle", "Neo", "Cypher"] and s <> e and m <> e
match rel=(s)-[r1*1..]-(m)-[r2*1..]-(e)
return s, r1, m, r2, e, rel;
The additional where clause makes sure the same node is not used twice in one result row.
The relations are matched with one or more edges (*1..) or hops between the nodes s and m or m and e respectively and disregarding the directions.
Note that Cypher 3 syntax is used here.