I have 2 different nodes with label Class and Parents. These nodes are connected with hasParents Relationship. There are 4 million Class nodes, 700K Parents nodes. I wanted to create a Sibling Relationship between the Class nodes. I did the following query:
MATCH (A:Class)-[:hasParents]->(B:Parents)<-[:hasParents]-(C:Class) MERGE (A)-[:Sibling]-(C)
This query is taking ages to complete. I have created indexes on the class_id property of Class nodes and the parent_id property of Parents nodes. I am using Neo4j version 2.1.6. Any suggestions to speed this up?
First of all, the indices won't help the query since the properties are not referenced anywhere in the query.
With 700K Parent nodes and 4M Class nodes, you have on average 5.7 classes per parent. With 6 classes under one parent, there are 15 Sibling relationships (6 × 5 / 2), so there would be more than 10M relationships to create for the whole graph.
That's a lot for one transaction, and you're almost guaranteed to hit an OutOfMemory error.
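The estimate can be checked with a few lines of arithmetic. This is a minimal sketch in Python, used here only because it runs without a database; the helper name sibling_pairs is made up for illustration, and it assumes every parent has roughly the average number of classes, rounded up to 6.

```python
def sibling_pairs(children: int) -> int:
    """Number of undirected Sibling relationships among `children` nodes."""
    return children * (children - 1) // 2


parents = 700_000
classes = 4_000_000

avg_children = classes / parents        # about 5.7 classes per parent
pairs_per_parent = sibling_pairs(6)     # assumption: round 5.7 up to 6
estimated_total = parents * pairs_per_parent

print(round(avg_children, 1), pairs_per_parent, estimated_total)
```

Real data will be skewed (a few parents with many classes contribute far more pairs, since the count grows quadratically), so this is only an order-of-magnitude figure.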
To avoid that, you should batch changes into several smaller transactions.
I'd use a marker label to manage the progression. First, mark all the parents:
MATCH (p:Parents) SET p:ToProcess
Then, repeatedly select a subset of the nodes that remain to be processed, and connect the siblings:
MATCH (p:ToProcess)
WITH p
LIMIT 1000
REMOVE p:ToProcess
WITH p
OPTIONAL MATCH (p)<-[:hasParents]-(c:Class)
WITH p, collect(c) AS children
FOREACH (c1 IN children |
  FOREACH (c2 IN filter(c IN children WHERE c <> c1) |
    MERGE (c1)-[:Sibling]-(c2)))
RETURN count(p)
As the query returns the number of parents that were processed, you just repeat it until it returns 0. At that point, no parent has the ToProcess label anymore.
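The "repeat until it returns 0" step can be sketched as a small driver loop. This is only an illustration in Python: run_batch is a hypothetical callable that would execute the batch query through whatever client you use and return its count(p); here it is replaced by a stand-in (make_fake_batch, also made up) so the loop can run without a database.

```python
from typing import Callable


def run_until_done(run_batch: Callable[[], int], max_rounds: int = 100_000) -> int:
    """Call run_batch() until it reports 0 processed parents; return the total."""
    total = 0
    for _ in range(max_rounds):
        processed = run_batch()
        if processed == 0:
            return total
        total += processed
    raise RuntimeError("batch loop did not finish within max_rounds")


def make_fake_batch(total_parents: int, limit: int = 1000) -> Callable[[], int]:
    """Stand-in for the database: pretends `total_parents` nodes carry the marker."""
    remaining = [total_parents]

    def batch() -> int:
        done = min(limit, remaining[0])
        remaining[0] -= done
        return done

    return batch


print(run_until_done(make_fake_batch(2500)))
```

Each call to run_batch corresponds to one transaction, which is what keeps memory use bounded.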
Related
I'm learning Cypher and I created a 'Crime investigation' project on Neo4j.
I'm trying to return as an output the parent that only has two sons/daughters in total and each member of the family must have committed a crime.
So, in order to get this in the graph, I executed this query:
match(:Crime)<-[:PARTY_TO]-(p:Person)-[:FAMILY_REL]->(s:Person)-[:PARTY_TO]->(:Crime)
where size((p)-[:FAMILY_REL]->())=2
return p, s
The FAMILY_REL relationship points to the sons of the Person (p), and the PARTY_TO relationship points to the Crime nodes a Person has committed.
The previous query is not working as it should. It shows parents with more than two sons and also sons that have just one son.
What is wrong with the logic of the query?
SIZE((p)-[:FAMILY_REL]->()) counts all children of p, including ones who had committed no crimes.
This query should work better, as it only counts children who are criminals:
MATCH (:Crime)<-[:PARTY_TO]-(p:Person)-[:FAMILY_REL]->(s:Person)-[:PARTY_TO]->(:Crime)
WITH p, COLLECT(DISTINCT s) AS badKids
WHERE SIZE(badKids) = 2
RETURN p, badKids
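One thing to watch with this kind of pattern: it produces one row per matched path, so a child who is party to several crimes shows up in several rows, and children should be counted as distinct nodes. A minimal sketch in Python with made-up rows (the names are purely illustrative):

```python
from collections import defaultdict

# One row per matched path: (parent, child). "bob" appears twice because,
# in this made-up data, he is party to two crimes.
rows = [("ann", "bob"), ("ann", "bob"), ("ann", "cat"), ("dan", "eve")]

kids = defaultdict(set)
for parent, child in rows:
    kids[parent].add(child)  # a set keeps each child once

two_kid_parents = {p: sorted(c) for p, c in kids.items() if len(c) == 2}
print(two_kid_parents)
```

Counting raw rows instead would give "ann" three entries and exclude her from the result.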
I am a beginner with Neo4j and I think that I did not properly understand how WITH and WHERE work.
I have a graph and I would like to count the number of nodes that I obtain if I exclude all the nodes with a certain label and I exclude all the nodes that have a degree > 20.
I first tried to do this in a simple way, writing multiple queries to remove the nodes, like:
MATCH(n:label1) DETACH DELETE n
MATCH(n:label2) DETACH DELETE n
and then
MATCH (n)
WITH n, size((n)-[]-()) as degree
WHERE degree>20
DETACH DELETE n
Then I counted the number of the nodes that I have in the graph with
MATCH (n)
RETURN count(n)
and I obtained 892
I generated the original graph again from scratch and tried to combine all the previous queries into a single one:
MATCH (n)
WHERE NOT n:label1
AND NOT n:label2
WITH n, size((n)-[]-()) as degree
WHERE degree>20
DETACH DELETE n
When I count the number of nodes now, I obtain 713.
Why is the result different?
Thanks in advance for the reply.
The following explanation is speculation, since you have not provided sample data. But it does conform to what you have presented.
In your first trial, you first deleted all label1 and label2 nodes (and all their relationships), which apparently reduced the degree of some of the remaining nodes to 20 or fewer. Therefore, when you then deleted the nodes with degree > 20, there were fewer such nodes (compared to your second trial), and you ended up with 892 remaining nodes.
In your second trial, all the nodes without those 2 labels still had their connections to nodes with those 2 labels, and so you had more >20 degree nodes to delete. That is why you ended up with 713 remaining nodes.
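The effect is easy to reproduce on a toy graph. The following is a sketch in Python, not your data: one labelled node "x" touches three hubs, each hub has two leaves of its own, and a threshold of 2 stands in for the 20 in the question. The helper names are made up.

```python
def detach_delete(adj, doomed):
    """Return a copy of the graph without the doomed nodes and their relationships."""
    doomed = set(doomed)
    return {n: nbrs - doomed for n, nbrs in adj.items() if n not in doomed}


def high_degree(adj, threshold, candidates):
    """Nodes among `candidates` whose degree in `adj` exceeds `threshold`."""
    return {n for n in candidates if len(adj[n]) > threshold}


edges = [("x", "h1"), ("x", "h2"), ("x", "h3")]
for h in ("h1", "h2", "h3"):
    edges += [(h, h + "_a"), (h, h + "_b")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

labelled = {"x"}
THRESHOLD = 2  # stands in for the 20 used in the question

# Trial 1: delete labelled nodes first, then measure degrees on what is left.
step1 = detach_delete(adj, labelled)
trial1 = detach_delete(step1, high_degree(step1, THRESHOLD, set(step1)))

# Trial 2: one pass over unlabelled nodes, degrees measured on the full graph.
trial2 = detach_delete(adj, high_degree(adj, THRESHOLD, set(adj) - labelled))

print(len(trial1), len(trial2))
```

In trial 1 the hubs drop to degree 2 once "x" is gone and survive; in trial 2 they still have degree 3 when they are tested, so all three are deleted.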
Your combined query isn't doing the same thing as your previous queries. Specifically, you aren't deleting nodes with the labels label1 and label2, you're excluding them from your query, which means they won't be deleted (even if they have degree > 20).
The two delete operations are working on entirely different sets of nodes, so it won't make sense to bring across n in your WITH. Instead, use a WITH to reset your result cardinality to 1 (through usage of DISTINCT or an aggregation), then match on the other nodes you want to delete and take care of them.
MATCH (n)
WHERE n:label1
OR n:label2
DETACH DELETE n
WITH count(n) as deleted
MATCH (n)
WHERE size((n)-[]-()) > 20
DETACH DELETE n
I am trying to do a model for state changes of a batch. I capture the various changes and I have an Epoch time column to track these. I managed to get this done with the below code :
MATCH(n:Batch), (n2:Batch)
WHERE n.BatchId = n2.Batch
WITH n, n2 ORDER BY n2.Name
WITH n, COLLECT(n2) as others
WITH n, others, COALESCE(
HEAD(FILTER(x IN others where x.EpochTime > n.EpochTime)),
HEAD(others)
) as next
CREATE (n)-[:NEXT]->(next)
RETURN n, next;
This makes my graph circular because of the HEAD(others) fallback: the chain doesn't stop at the node with the maximum Epoch time. If I remove HEAD(others), I can't figure out how to stop a relationship from being created for the last node. I'm not sure how to put a condition around the CREATE so that no relationship is created when the next node is null.
This might do what you want:
MATCH(n:Batch)
WITH n ORDER BY n.EpochTime
WITH n.BatchId AS id, COLLECT(n) AS ns
CALL apoc.nodes.link(ns, 'NEXT')
RETURN id, ns;
It orders all the Batch nodes by EpochTime, and then collects all the Batch nodes with the same BatchId value. For each collection, it calls the apoc procedure apoc.nodes.link to link all its nodes together (in chronological order) with NEXT relationships. Finally, it returns each distinct BatchId and its ordered collection of Batch nodes.
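To make the linking step concrete, here is a rough Python model of the behaviour described above: sort, group by batch id, then connect consecutive nodes only. It is an illustration of the idea, not APOC's implementation, and the row layout and state names are invented.

```python
from itertools import groupby
from operator import itemgetter


def link_by_batch(rows):
    """rows: iterable of (batch_id, epoch_time, name). Returns NEXT pairs per batch."""
    links = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # group key first, then time
    for _, group in groupby(ordered, key=itemgetter(0)):
        names = [name for _, _, name in group]
        links.extend(zip(names, names[1:]))  # consecutive pairs, no wrap-around
    return links


rows = [
    ("B1", 30, "shipped"), ("B1", 10, "created"), ("B1", 20, "packed"),
    ("B2", 5, "created"), ("B2", 15, "cancelled"),
]
print(link_by_batch(rows))
```

Because only consecutive pairs are linked, the node with the largest EpochTime in each batch gets no outgoing NEXT, so the chain is not circular.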
I have a graph with about 800k nodes and I want to create random relationships among them, using Cypher.
Examples like the following didn't work because the cartesian product is too big:
match (u),(p)
with u,p
create (u)-[:LINKS]->(p);
For example I want 1 relationship for each node (800k), or 10 relationships for each node (8M).
In short, I need a Cypher query that creates relationships between nodes UNIFORMLY at random.
Does someone know the query to create relationships in this way?
So you want every node to have exactly x relationships? Try this in batches until no more relationships are updated:
MATCH (u),(p) WHERE size((u)-[:LINKS]->(p)) < {x}
WITH u,p LIMIT 10000 WHERE rand() < 0.2 // LIMIT to 10000 then sample
CREATE (u)-[:LINKS]->(p)
This should work (assuming your neo4j server has enough memory):
MATCH (n)
WITH COLLECT(n) AS ns, COUNT(n) AS len
FOREACH (i IN RANGE(1, {numLinks}) |
FOREACH (x IN ns |
FOREACH(y IN [ns[TOINT(RAND()*len)]] |
CREATE (x)-[:LINK]->(y) )));
This query collects all nodes, and uses nested loops to do the following {numLinks} times: create a LINK relationship between every node and a randomly chosen node.
The innermost FOREACH is used as a workaround for the current Cypher limitation that you cannot put an operation that returns a node inside a node pattern. To be specific, this is illegal: CREATE (x)-[:LINK]->(ns[TOINT(RAND()*len)]).
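The nested-loop idea can be tried out on a small scale outside the database. This Python sketch mirrors the structure of the query (the function name random_links and the seeded generator are assumptions added for illustration):

```python
import random
from collections import Counter


def random_links(nodes, num_links, rng):
    """For each of num_links rounds, link every node to one randomly chosen node."""
    links = []
    n = len(nodes)
    for _ in range(num_links):
        for x in nodes:
            y = nodes[int(rng.random() * n)]  # same idea as ns[TOINT(RAND()*len)]
            links.append((x, y))
    return links


links = random_links(list(range(1000)), 10, random.Random(42))
out_degree = Counter(x for x, _ in links)
print(len(links), set(out_degree.values()))
```

Every node ends up with exactly num_links outgoing links, while incoming links are only uniform on average. As with the query, nothing prevents a node from being linked to itself or to the same target twice.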
I've this kind of data model in the db:
(a)<-[:has_parent]-(b)<-[:has_parent]-(c)<-[:has_parent]-(...)
every parent can have multiple children & this can go on to unknown number of levels.
I want to find these values for every node:
the number of descendants it has
the depth [distance from the node] of every descendant
the creation time of every descendant
& I want to rank the returned nodes based on these values. Right now, with no optimization, the query runs very slow (especially when the number of descendants increases).
The Questions:
what can I do in the model to make the query performant (indexing, data structure, ...)
what can I do in the query
what can I do anywhere else?
edit:
the query starts from a specific node using START or MATCH
to clarify:
a. the query may start from any point in the hierarchy, not just the root node
b. every node under the starting node is returned ranked by the total number of descendants it has, the distance (from the returned node) of every descendant & timestamp of every descendant it has.
c. by descendant I mean everything under it, not just its direct children
for example,
here's a sample graph:
http://console.neo4j.org/r/awk6m2
First you need to know how to find the root node. The following statement finds the nodes having no outbound parent relationship. Be aware that this statement is potentially expensive in a large graph.
MATCH (n)
WHERE NOT ((n)-[:has_parent]->())
RETURN n
Instead you should use an index to find that node:
MATCH (n:Node {name:'abc'})
Starting with our root node, we traverse inbound parent relationships with variable depth. For each node traversed we calculate the number of children; since this might be zero, an OPTIONAL MATCH is used:
MATCH (root:Node) // line 1-3 to find root node, replace by index lookup
WHERE NOT ((root)-[:has_parent]->())
WITH root
MATCH p =(root)<-[:has_parent*]-() // variable path length match
WITH last(nodes(p)) AS currentNode, length(p) AS currentDepth
OPTIONAL MATCH (currentNode)<-[:has_parent]-(c) // traverse children
RETURN currentNode, currentNode.created, currentDepth, count(c) AS countChildren
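For reference, this is what the query computes, modelled in plain Python on a made-up tree: every node below the start node, its depth, and its number of direct children. It is a sketch that assumes the hierarchy is a tree (no cycles); the children mapping and node names are invented.

```python
from collections import deque


def descendants_with_depth(children, start):
    """Return (node, depth) for every node below `start`, breadth first."""
    result = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in children.get(node, []):
            result.append((child, depth + 1))
            queue.append((child, depth + 1))
    return result


# parent -> direct children, i.e. the has_parent relationships reversed
children = {"a": ["b", "e"], "b": ["c", "d"], "c": [], "d": [], "e": []}

for node, depth in descendants_with_depth(children, "a"):
    print(node, depth, len(children.get(node, [])))
```

The start node can be any node in the hierarchy, not only the root.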