Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] // set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns a count for each [ id, type ] pair, like:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
...
The result is collected using Java and merged into the first structure (e.g. Map<Long, Map<String, Long>>).
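The merge step described above can be sketched as follows. This is a minimal illustration in Python rather than the Java the question uses; the function name merge_rows and the sample rows are hypothetical, and the rows are assumed to arrive as (id, type, count) tuples matching the table above.

```python
from collections import defaultdict


def merge_rows(rows):
    """Fold (node_id, rel_type, count) rows into {node_id: {rel_type: count}}."""
    histogram = defaultdict(dict)
    for node_id, rel_type, count in rows:
        per_node = histogram[node_id]
        # Sum rather than overwrite, in case the same pair appears twice.
        per_node[rel_type] = per_node.get(rel_type, 0) + count
    return {node_id: dict(types) for node_id, types in histogram.items()}


# Hypothetical rows shaped like the query result shown above.
rows = [
    (0, "ACTED_IN", 14),
    (0, "DIRECTED", 1),
    (1, "DIRECTED", 12),
    (1, "WROTE", 5),
    (1, "ACTED_IN", 2),
]
histogram = merge_rows(rows)
```

The merge itself is linear in the number of returned rows, so it is unlikely to be the slow part.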
Problem
Getting the relationship histogram is fast on smaller graphs but can be very slow on bigger datasets. For example, when the set contains about 100 ids/nodes and each of those nodes has around 1000 relationships, the Cypher query took about 5 minutes to execute.
Is there a more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With Java code or using UNION?)
Is something wrong with how I set up my neo4j database? Should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
A bit of background:
Using count executes an expandAll, whose cost grows with the number of relationships the node has:
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size with a relationship type internally uses getDegree, a statistic that the node stores locally, and is therefore very efficient:
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Moral of the story: to use size, you need to know the relationship types a labeled node can have. So you need to know the schema of the database (in general you will want that; it makes things easily predictable, and building efficient dynamic queries becomes a joy).
But let's assume you don't know the schema. You can use APOC Cypher procedures, which allow you to build dynamic queries.
The flow is:
Get all the relationship types from the database ( fast )
Get the nodes from id list ( fast )
Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
Related
I have a graph that is an acyclic tree of undefined depth. I need to count the number of descendants for each node, including the node itself. So the final result should be something like this:
9
|\
4 4
|\ \
2 1 3
|   |\
1   1 1
So for each node, this number would be the sum of the numbers of its children + 1.
How can it be done in one query?
I could come up with something like that:
MATCH (n)
SET n.count = SIZE((n)<-[:PARENT*0..]-());
But it means a subquery for each node. Having over 1 300 000 nodes it takes ages.
A better way would be to set "1" for each leaf and ascend to the root, calculating each node on the way. Is it possible to do this in one query?
I'd go for
MATCH (start)<-[:PARENT*0..]-(n)
RETURN id(start), count(n) as numberOfChildren
which counts how many nodes are found on the paths below each start node (the start node itself included). But I don't know how it performs on really large graphs (my test graph has only a few hundred nodes).
You could already optimize your query by limiting the number of paths you are processing, e.g. like this:
MATCH (n)
WHERE EXISTS((n)<-[:PARENT]-())
MATCH path=(n)<-[:PARENT*0..]-(m)
WHERE NOT EXISTS((m)<-[:PARENT]-())
UNWIND nodes(path) AS node
WITH n, COUNT(DISTINCT node) AS count
SET n.count = count
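The bottom-up idea from the question (start every leaf at 1 and push totals toward the root) can be sketched outside the database. This is an illustration in Python, not a Cypher solution; it assumes the tree has been exported as a hypothetical parent_of mapping from each node to its parent, with None for the root, and the function name subtree_sizes is made up for this example.

```python
from collections import deque


def subtree_sizes(parent_of):
    """Return {node: number of descendants including the node itself}.

    parent_of maps every node to its parent (None for the root).
    Each node is processed once, after all of its children.
    """
    size = {node: 1 for node in parent_of}      # every node counts itself
    pending = {node: 0 for node in parent_of}   # children not yet processed
    for parent in parent_of.values():
        if parent is not None:
            pending[parent] += 1

    queue = deque(node for node, left in pending.items() if left == 0)  # leaves
    while queue:
        node = queue.popleft()
        parent = parent_of[node]
        if parent is None:
            continue
        size[parent] += size[node]
        pending[parent] -= 1
        if pending[parent] == 0:  # all children done, parent is final
            queue.append(parent)
    return size


# Hypothetical node names arranged like the tree drawn in the question.
parent_of = {
    "r": None,
    "a": "r", "b": "r",
    "c": "a", "d": "a", "f": "b",
    "e": "c", "g": "f", "h": "f",
}
sizes = subtree_sizes(parent_of)
```

Because every node is visited exactly once, the work is linear in the number of nodes, in contrast to expanding a variable-length path from each node separately.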
I have 2 nodes. First of them "b1" has 16m relationships and second one "b" - 17k. Label B is indexed on the id property.
My query to retrieve if they have a direct relation is:
profile
MATCH (b:B {id :'D006019' }) WITH b
MATCH (b1:B {id :'D006801' }) WITH b, b1
MATCH (b)-[r]-(b1) RETURN r
Several observations:
The query is extremely slow; it runs for about 5 minutes. First it does a node index scan, which is very fast, but then it somehow grabs the node b1 and continues execution by expanding that node. But "b1" has 16m relationships, and this, together with the following filter, ruins the performance.
I can make this query fast enough if I change it a little.
Here is the much faster query:
profile
MATCH (b:B {id :'D006019' }) WITH b
MATCH (b1:B) WHERE b1.id IN ['D006801' ] WITH b, b1
MATCH (b)-[r]-(b1) RETURN r
So now "b1" is in an "IN" clause, neo4j starts expanding from "b", which has only 17k relationships, and the query executes in around 100 ms.
My question is: can the query be written in a way that makes neo4j automatically expand from the less connected node?
Sometimes you have to give Cypher some hints:
MATCH (b:B {id :'D006019'})
USING INDEX b:B(id)
MATCH (b1:B {id :'D006801'})
USING INDEX b1:B(id)
MATCH (b)-[r]-(b1)
RETURN r;
The above query tells Cypher that it should use the :B(id) index for each of the first 2 matches. Without the hints, there is currently a tendency for the planner to only use the index once.
I have a graph with about 800k nodes and I want to create random relationships among them, using Cypher.
Examples like the following didn't work because the cartesian product is too big:
match (u),(p)
with u,p
create (u)-[:LINKS]->(p);
For example I want 1 relationship for each node (800k), or 10 relationships for each node (8M).
In short, I need a Cypher query that creates relationships between nodes UNIFORMLY.
Does someone know the query to create relationships in this way?
So you want every node to have exactly x relationships? Try this in batches until no more relationships are updated:
MATCH (u),(p) WHERE size((u)-[:LINKS]->(p)) < {x}
WITH u,p LIMIT 10000 WHERE rand() < 0.2 // LIMIT to 10000 then sample
CREATE (u)-[:LINKS]->(p)
This should work (assuming your neo4j server has enough memory):
MATCH (n)
WITH COLLECT(n) AS ns, COUNT(n) AS len
FOREACH (i IN RANGE(1, {numLinks}) |
FOREACH (x IN ns |
FOREACH(y IN [ns[TOINT(RAND()*len)]] |
CREATE (x)-[:LINK]->(y) )));
This query collects all nodes, and uses nested loops to do the following {numLinks} times: create a LINK relationship between every node and a randomly chosen node.
The innermost FOREACH is used as a workaround for the current Cypher limitation that you cannot put an operation that returns a node inside a node pattern. To be specific, this is illegal: CREATE (x)-[:LINK]->(ns[TOINT(RAND()*len)]).
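To make the behaviour of the nested FOREACH loops concrete, here is a small Python model of the same loop structure on plain integers standing in for nodes. It is only an illustration of the sampling logic, not something that talks to Neo4j; the names random_links, nodes and num_links are hypothetical.

```python
import random
from collections import Counter


def random_links(nodes, num_links, rng):
    """Mimic the nested FOREACH: num_links rounds, one random target per node."""
    edges = []
    for _ in range(num_links):
        for x in nodes:
            # Same index expression as ns[TOINT(RAND()*len)] in the query.
            y = nodes[int(rng.random() * len(nodes))]
            edges.append((x, y))
    return edges


nodes = list(range(1000))
edges = random_links(nodes, 3, random.Random(42))
out_degree = Counter(x for x, _ in edges)
in_degree = Counter(y for _, y in edges)
```

In this model every node ends up with exactly num_links outgoing links, while the incoming links per node vary because targets are drawn independently. A node can also draw itself as a target, or draw the same target twice, so self-links and duplicate links are possible unless they are filtered out.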
If I have a graph like the following (where the nesting could go on for an arbitrary number of nodes):
(a)-[:KNOWS]->(b)-[:KNOWS]->(c)-[:KNOWS]->(d)-[:KNOWS]->(e)
               |             |
               |            (i)-[:KNOWS]->(j)
               |
              (f)-[:KNOWS]->(g)-[:KNOWS]->(h)-[:KNOWS]->(n)
                             |
                            (k)-[:KNOWS]->(l)-[:KNOWS]->(m)
How can I retrieve all of the full-length paths (in this case, (a)-->(m), (a)-->(n), (a)-->(j) and (a)-->(e))? The query should also be able to return the nodes with no relationships of the given type.
So far I am just doing the following (I only want the id property):
MATCH path=(a)-[:KNOWS*]->(b)
RETURN collect(extract(n in nodes(path) | n.id)) as paths
I need the paths so that in the programming language (in this case Clojure) I can create a nested map like this:
{"a" {"b" {"f" {"g" {"k" {"l" {"m" nil}}
                     "h" {"n" nil}}}
           "c" {"d" {"e" nil}
                "i" {"j" nil}}}}}
Is it possible to generate the map directly with the query?
Just had to do something similar, this worked on your example, finds all nodes which do not have outgoing [:KNOWS]:
match p=(a:Node {name:'a'})-[:KNOWS*]->(b:Node)
optional match (b)-[v:KNOWS]->()
with p,v
where v IS NULL
return collect(extract(n in nodes(p) | n.id)) as paths
Here is one query that will get you started. This query will return just the longest chain of nodes when there is a single chain without forks. It matches all of the paths like yours does but only returns the longest one by using limit to reduce the result.
MATCH p=(a:Node {name:'a'})-[:KNOWS*]->(:Node)
WITH length(p) AS size, p
ORDER BY size DESC
LIMIT 1
RETURN p AS Longest_Path
I think this gets the second part of your question where there are multiple paths. It looks for paths where the last node does not have an outbound :KNOWS relationship and where the starting node does not have an inbound :KNOWS relationship.
MATCH p=(a:Node {name:'a'})-[:KNOWS*]->(x:Node)
WHERE NOT (x)-[:KNOWS]->()
AND NOT ()-[:KNOWS]->(a)
WITH length(p) AS size, p
ORDER BY size DESC
RETURN reduce(node_ids = [], n IN nodes(p) | node_ids + [id(n)])
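Once the full-length paths come back as lists of ids, folding them into the nested map from the question is a small client-side step. The sketch below uses Python instead of Clojure, with None standing in for nil; the function names paths_to_tree and mark_leaves are hypothetical, and the paths are assumed to be lists of the id property, root first.

```python
def paths_to_tree(paths):
    """Fold root-to-leaf id lists into nested dicts; shared prefixes merge."""
    tree = {}
    for path in paths:
        level = tree
        for node_id in path:
            level = level.setdefault(node_id, {})
    return tree


def mark_leaves(tree):
    """Replace empty child maps with None, mirroring nil in the question."""
    return {key: (mark_leaves(child) if child else None)
            for key, child in tree.items()}


# The four full-length paths from the example graph.
paths = [
    ["a", "b", "f", "g", "k", "l", "m"],
    ["a", "b", "f", "g", "h", "n"],
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "i", "j"],
]
nested = mark_leaves(paths_to_tree(paths))
```

The same fold works if the query returns every partial path rather than only the full-length ones, since shorter paths are prefixes of longer ones and simply merge into them.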
I am extending maxdemarzi's excellent graph visualisation example (http://maxdemarzi.com/2013/07/03/the-last-mile/) using VivaGraph backed by neo4j.
I want to display relationships of the type
a-->b<--c,b<--d
I tried the query
MATCH p = (a)--(b:X)--(c),(b:X)--(d)
RETURN EXTRACT(n in nodes(p) | {id:ID(n), name:COALESCE(n.name, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in relationships(p)| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
It looks like the named query picks up only the a-->b<--c pattern and omits the b<--d pattern.
Am I missing something? Can I not add multiple patterns in a named query?
The most immediate problem is that the comma in the MATCH clause separates the first pattern from the second. The variable 'p' only stores the first pattern. This is why you aren't getting the results you desire. Independent of that, you are at risk of having a 'loose binding' by putting a label on both of your nodes named 'b' in the two patterns. The second 'b' node should not have a label.
So here is a version of your query that should work.
MATCH p1=(a)-->(b:X)<--(c), p2=(b)<--(d)
WITH nodes(p1) + d AS ns, relationships(p1) + relationships(p2) AS rs
RETURN EXTRACT(n IN ns | {id:ID(n), name:COALESCE(n.name, n.title, ID(n)), type:LABELS(n)}) AS nodes,
EXTRACT(r in rs| {source:ID(startNode(r)) , target:ID(endNode(r))}) AS rels
Capture both paths, then build collections from the nodes and relationships of both paths. The collection of nodes actually only extracts the nodes from p1 and adds the 'd' node. You could write that part as
nodes(p1) + nodes(p2) as ns
but then the 'b' node will appear in the list twice.
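The duplicate can be seen with plain lists. The following Python sketch only models the list concatenation, using hypothetical string names in place of nodes, and shows one order-preserving way to drop the repeat on the client side if both node lists are concatenated anyway.

```python
# Node names standing in for nodes(p1) and nodes(p2) from the query above.
nodes_p1 = ["a", "b", "c"]
nodes_p2 = ["b", "d"]

combined = nodes_p1 + nodes_p2              # 'b' shows up twice
deduplicated = list(dict.fromkeys(combined))  # keeps first occurrence, preserves order
```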