Efficiently assigning UUIDs to connected components in Neo4j

I have partitioned my graph into ~400,000 connected components using the algo.unionFind function from the Neo4j Graph Algorithms library.
Each node n within the same connected component has the same n.partition value. However, I now want to assign each connected component a UUID, so that each node n in a connected component has n.uuid populated with the component's UUID. What is the most efficient way of doing this?
Currently I am getting a list of all n.partition values and then going through each partition and running a Cypher query to update all nodes of that partition to have a generated UUID. I'm using the Python wrapper py2neo and this process is quite slow.
Edit:
The Cypher queries I am currently using are:
MATCH (n)
RETURN DISTINCT n.partition AS partition
to get a list of partition IDs, and then iteratively calling:
MATCH (n)
WHERE n.partition = <PARTITION_ID>
SET n.uuid = <GENERATED_UUID>
on each of the partition ids.
Edit 2:
I am able to get through ~180k/400k of the connected components using the following query:
CALL apoc.periodic.iterate(
"MATCH (n)
WITH n.partition as partition, COLLECT(n) as nodes
RETURN partition, nodes, apoc.create.uuid() as uuid",
"FOREACH (n in nodes | SET n.uuid = uuid)",
{batchSize:1000, parallel:true}
)
before getting a heap error: "neo4j.exceptions.ClientError: Failed to invoke procedure `apoc.periodic.iterate`: Caused by: java.lang.OutOfMemoryError: Java heap space"

The best way would be to install the APOC plug-in for Neo4j so that you can use the UUID function apoc.create.uuid() in Cypher, which lets you generate and assign the UUID in the same transaction.
To create one UUID per partition, use WITH to store the UUID in a temporary variable. The function is evaluated once per row, so call it only after the rows have been reduced to one per partition:
// Note: USING PERIODIC COMMIT 5000 is only accepted in front of LOAD CSV; batch this query with apoc.periodic.iterate() instead
MATCH (n)
WITH DISTINCT n.partition as p // will exclude null
WITH p, apoc.create.uuid() as uuid // create reusable uuid
// now just match and assign
MATCH (n)
WHERE n.partition = p
SET n.uuid = uuid
or as InverseFalcon suggested
MATCH (n)
WHERE exists(n.partition) // to filter out nulls
WITH n.partition as p, collect(n) as nodes // collect nodes so each row is 1 partition and its nodes
WITH p, nodes, apoc.create.uuid() as uuid // create reusable uuid
FOREACH (n in nodes | SET n.uuid = uuid) // assign uuid to each node in collection
The first query is more periodic-commit friendly, since it doesn't need to load everything into memory before it starts doing assignments. Without the periodic commit statement, though, it will eventually load everything into memory, as it has to hold on to it for the transaction log. Once it hits a commit point, it can clear the transaction log to keep memory use down.
If your data set isn't too large though, the second query should be faster because by holding everything in memory after the first node scan, it doesn't need to run another node scan to find all the nodes. Periodic commit won't help here because if you blow the heap, it will almost certainly be during the initial scan/collect phase.
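Both queries aim for the same invariant: one UUID per distinct partition value, shared by every node in that partition. As a rough illustration only, here is that grouping logic in plain Python with no database involved. The function name and the dict-based input are made up for this sketch.

```python
import uuid
from collections import defaultdict


def assign_component_uuids(partition_by_node):
    """Map each node id to a UUID shared by all nodes in the same partition.

    partition_by_node: dict of node id -> partition value (None = no partition).
    This is a hypothetical in-memory stand-in for the graph, not a Neo4j API.
    """
    nodes_by_partition = defaultdict(list)
    for node_id, partition in partition_by_node.items():
        if partition is not None:  # skip nodes without a partition
            nodes_by_partition[partition].append(node_id)

    uuid_by_node = {}
    for partition, node_ids in nodes_by_partition.items():
        component_uuid = str(uuid.uuid4())  # generated once per partition
        for node_id in node_ids:
            uuid_by_node[node_id] = component_uuid
    return uuid_by_node
```

The first loop corresponds to the collect-by-partition step, and the second to generating the UUID per row and applying it with FOREACH.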

To do this, collect the nodes by their partition value so that you have a single row per distinct partition. Then create the UUID (the function executes once per row) and use FOREACH to apply it to each node in the partition:
MATCH (n)
// WHERE exists(n.partition) // only if there are nodes in the graph without partitions
WITH n.partition as partition, collect(n) as nodes
WITH partition, nodes, randomUUID() as uuid
FOREACH (n in nodes | SET n.uuid = uuid)
Depending on the number of nodes in your graph, you may need to combine this with some batch processing, such as apoc.periodic.iterate(), to avoid heap issues.
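One possible way to combine this with apoc.periodic.iterate() is sketched below as a small Python helper that only builds the Cypher text. It is a sketch under assumptions, not a tested recipe: it assumes APOC is installed and randomUUID() is available, and the helper name is made up. The outer statement returns one small row per partition instead of collecting nodes, which is what keeps the heap use down. The trade-off is that, without an index covering partition, every inner MATCH scans all nodes again.

```python
def build_batched_uuid_query(batch_size=1000):
    """Return a Cypher string that assigns one UUID per partition in batches.

    Hypothetical helper: assumes the APOC plug-in and randomUUID() exist
    on the target server.
    """
    # Outer statement: one small row per partition, no node lists held in memory.
    iterate = (
        "MATCH (n) WHERE n.partition IS NOT NULL "
        "WITH DISTINCT n.partition AS partition "
        "RETURN partition, randomUUID() AS uuid"
    )
    # Inner statement: 'partition' and 'uuid' come from the outer rows.
    action = "MATCH (n) WHERE n.partition = partition SET n.uuid = uuid"
    return (
        "CALL apoc.periodic.iterate(\n"
        f'  "{iterate}",\n'
        f'  "{action}",\n'
        f"  {{batchSize: {int(batch_size)}, parallel: false}}\n"
        ")"
    )
```

The resulting string can be printed and pasted into the Neo4j browser, or passed to whatever driver call you already use to run Cypher.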

Related

Finding all Leaf Nodes in Neo4j efficiently

I am trying to write a query in Cypher that returns all leaf nodes given a specific root node.
Right now I have been using:
MATCH (root:Node {name: 'Name'})<-[:REL *]-(leaf:Node)
WHERE NOT (leaf)<-[:REL]-()
RETURN leaf
The problem with this query is that as the database becomes larger, it becomes exponentially slower because every single possible leaf node that connects to my root is checked in the not clause. To omit the not clause, I can return the entire path like this:
MATCH p=(root:Node {name: 'Name'})<-[:REL *]-(leaf:Node)
RETURN p
The second query is a lot faster as the number of nodes/relationships in the graph increases, but I would prefer to return just the leaf nodes instead of the path.
Is there a way to run this query more efficiently on a larger data set?
Have you tried this?
MATCH (root:Node {name: 'Name'})<-[:REL *]-(leaf:Node)
WITH leaf
WHERE NOT (leaf)<-[:REL]-()
RETURN leaf
On a similar case for me it gives the lowest number of db hits.
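For intuition about why the variable-length match gets expensive: it enumerates paths, so a node that is reachable along several paths is examined several times. The plain-Python sketch below is not how Neo4j evaluates the query; it models the same question on a hypothetical adjacency dict, visiting each node once.

```python
from collections import deque


def find_leaves(root, children_of):
    """Return nodes reachable from root that have nothing pointing at them.

    children_of: dict mapping a node to the nodes pointing at it, i.e. the
    (node)<-[:REL]-(child) direction used in the queries above.
    """
    leaves, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        children = children_of.get(node, [])
        if not children and node != root:
            leaves.append(node)  # no incoming REL: this is a leaf
        for child in children:
            if child not in seen:  # visit each node once
                seen.add(child)
                queue.append(child)
    return leaves
```

In the test graph used to check this sketch, node "d" is reachable through two parents but is reported only once.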

Visualize connected components in Neo4j

I can find the largest connected component in the graph using the code below:
CALL algo.unionFind.stream('', ':pnHours', {})
YIELD nodeId,setId
// groupBy setId, storing all node ids of the same set id into a list
MATCH (node) where id(node) = nodeId
WITH setId, collect(node) as nodes
// order by the size of nodes list descending
ORDER BY size(nodes) DESC
LIMIT 1 // keep only the largest component
RETURN nodes;
But it does not help me visualize the largest connected component (sub-graph), because the output consists of disjoint nodes. Is it possible to visualize the component, and if so, how?
I tried this query, but I am getting a different result.
I haven't used these algorithms and don't know much about them, but I think you added an extra character (a colon) in the query.
Can you check with pnHours instead of :pnHours?
After removing the colon (:) from the query I get the proper result (I also get the relationships, because the Neo4j browser fetches them even though they are not specified in the query).
If you still don't get the relationships, try the following query:
CALL algo.unionFind.stream('', 'pnHours', {})
YIELD nodeId,setId
// groupBy setId, storing all node ids of the same set id into a list
MATCH (node) where id(node) = nodeId
WITH setId, collect(node) as nodes
// order by the size of nodes list descending
ORDER BY size(nodes) DESC
LIMIT 1 // keep only the largest component
WITH nodes
UNWIND nodes AS node
MATCH (node)-[r:pnHours]-()
RETURN node,r;
If you want to visualize them in the Neo4j browser, use:
CALL algo.unionFind.stream('', ':pnHours', {})
YIELD nodeId,setId
// groupBy setId, storing all node ids of the same set id into a list
MATCH p=(node)-->() where id(node) = nodeId
WITH setId, collect(p) as paths
// order by the size of the paths list descending
ORDER BY size(paths) DESC
LIMIT 1 // keep only the component with the most paths
// Maybe you need to unwind paths to be able to visualize in Neo4j browser
RETURN paths;
It is not the most optimized query but should do just fine on small datasets.
The following query should return all the single-step paths in the largest pnHours-connected component (i.e., the one having the most nodes). It only gets the paths for the largest component.
CALL algo.unionFind.stream(null, 'pnHours', {}) YIELD nodeId, setId
WITH setId, COLLECT(nodeId) as nodeIds
ORDER BY SIZE(nodeIds) DESC
LIMIT 1
UNWIND nodeIds AS nodeId
MATCH path = (n)-[:pnHours]->()
WHERE ID(n) = nodeId
RETURN path
The neo4j browser's Graph visualization of the results will show all the nodes in the component and their relationships.
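To make the two steps explicit (group nodes into components, then return the relationships of the largest one), here is a small union-find sketch in plain Python. It is only an illustration on a list of edge pairs and is not the implementation used by algo.unionFind; the function name is made up.

```python
from collections import Counter


def largest_component_edges(edges):
    """Return the edges of the connected component with the most nodes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # union the two sets

    # Count nodes per component root, like grouping nodeIds by setId.
    sizes = Counter(find(node) for node in list(parent))
    if not sizes:
        return []
    biggest = sizes.most_common(1)[0][0]
    return [(a, b) for a, b in edges if find(a) == biggest]
```

Returning edges rather than bare nodes mirrors the point of the queries above: the browser needs the relationships (or paths) in the result to draw a connected picture.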

Delete a connected graph with Cypher

I want to delete a connected graph related to a particular node in a Neo4j database using Cypher. The use case is to delete a "start" node and all the nodes where a path to the start node exists. To limit the transaction the query has to be iterative and must not disconnect the connected graph.
Until now I am using this query:
OPTIONAL MATCH (start {indexed_prop: $PARAM})--(toDelete)
OPTIONAL MATCH (toDelete)--(toBind)
WHERE NOT(id(start ) = id(toBind)) AND NOT((start)--(toBind))
WITH start, collect(toBind) AS TO_BIND, toDelete limit 10000
DETACH DELETE toDelete
WITH start, TO_BIND
UNWIND TO_BIND AS b
CREATE (start)-[:HasToDelete]->(b)
I call it repeatedly until the number of deleted nodes is 0.
Is there a better query for this?
You could try a mark-and-delete approach, which is similar to how you would detach and delete the entire connected graph with a variable-length match, but instead of DETACH DELETE you apply a :TO_DELETE label.
Something like this (making up a label to use for the start node, as otherwise it has to comb the entire db looking for a node with the indexed param):
MATCH (start:StartNodeLabel {indexed_prop: $PARAM})-[*]-(toDelete)
SET toDelete:TO_DELETE
If that blows up your heap, you can run it multiple times, with the added predicate WHERE NOT toDelete:TO_DELETE before the SET, and using a combination of LIMIT and/or a limit on the depth of the variable-length relationship.
When you're sure you've labeled every connected node, then it's just a matter of deleting every node in the TO_DELETE label, and you can run that iteratively, or use APOC procedure apoc.periodic.commit() to handle that in batches.
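The batched delete step might look roughly like the following, written as a Python helper that returns the Cypher call and its parameters. This is a sketch assuming APOC is installed; the helper name and the default batch size are made up. apoc.periodic.commit() re-runs the inner statement in separate transactions until it returns 0, which is why the statement needs a LIMIT and a count.

```python
def build_batched_delete(batch_size=10000):
    """Return (cypher, params) that deletes :TO_DELETE nodes in batches.

    Hypothetical helper; assumes the APOC plug-in is available.
    """
    inner = (
        "MATCH (n:TO_DELETE) "
        "WITH n LIMIT $limit "
        "DETACH DELETE n "
        "RETURN count(*)"  # 0 once nothing is left, which stops the loop
    )
    cypher = "CALL apoc.periodic.commit($statement, {limit: $limit})"
    return cypher, {"statement": inner, "limit": int(batch_size)}
```

Because every marked node is eventually deleted together with its relationships, the order in which batches are removed does not matter at this stage.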

Inserting a Relation into Neo4j using MERGE or MATCH runs forever

I am experimenting with Neo4j using a simple dataset of Locations. A location can have a relation to another location.
a:Location - [rel] - b:Location
I already have the locations in the database (roughly 700,000+ Location entries).
Now I wanted to add the relation data (170M Edges), but I wanted to experiment with the import logic with a smaller set first, so I basically picked 2 nodes that are in the set and tried to create a relationship as follows.
MERGE p =(a:Location {locationid: 3616})-[w:WikiLink]->(b:Location {locationid: 467501})
RETURN p;
and also tried the approach directly from the documentation:
MATCH (a:Person),(b:Person)
WHERE a.name = 'Node A' AND b.name = 'Node B'
CREATE (a)-[r:RELTYPE { name : a.name + '<->' + b.name }]->(b)
RETURN r
I tried a directed merge, an undirected merge, and so on. I basically tried multiple variants of the above queries, and the result is the same: they run forever, seemingly not completing even after 15 minutes, which is very odd.
Indexes
ON :Location(locationid) ONLINE (for uniqueness constraint)
Constraints
ON (location:Location) ASSERT location.locationid IS UNIQUE
This is what I am currently using:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///edgelist.csv' AS line WITH line
MATCH (a:Location {locationid: toInt(line.locationidone)}), (b:Location {locationid: toInt(line.locationidtwo)})
MERGE (a)-[w:WikiLink {weight: toFloat(line.edgeweight)}]-(b)
RETURN COUNT(w);
If you look at the terminal output below you can see Neo4j reports 258ms query execution time, the realtime is however somewhat above that. This query already takes a few seconds too much in my opinion (The machine this runs on has 48GB RAM, 16 Cores and is relatively new).
I am currently running this query with LIMIT 1000 (before it was LIMIT 1) but the script is already running for a few minutes. I wonder if I have to switch from MERGE to CREATE. The problem is, I cannot understand the callgraph that EXPLAIN gives me in order to determine the bottleneck.
time /usr/local/neo4j/bin/neo4j-shell -file import-relations.cql
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| p |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| [Node[758609]{title:"Tehran",locationid:3616,locationlabel:"NIL"},:WikiLink[9422418]{weight:1.2282325516616477E-7},Node[917147]{title:"Khorugh",locationid:467501,locationlabel:"city"}] |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row
Relationships created: 1
Properties set: 1
258 ms
real 0m1.417s
user 0m1.497s
sys 0m0.158s
If you haven't:
CREATE CONSTRAINT ON (loc:Location) ASSERT loc.locationid IS UNIQUE;
Then find both nodes, and create the relationship.
MATCH (a:Location {locationid: 3616}),(b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
or if the locations don't exist yet:
MERGE (a:Location {locationid: 3616})
MERGE (b:Location {locationid: 467501})
MERGE p = (a)-[w:WikiLink]->(b)
RETURN p;
You should also use parameters if you do that from a program.
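As a rough sketch of what "use parameters" can look like for a bulk import from a program: send the edges in batches as a single list parameter and let Cypher UNWIND it. The row keys source, target and weight, the constant name and the helper name are all made up for this example; the statement itself follows the MATCH-then-MERGE shape shown above.

```python
from itertools import islice

# Parameterized statement: one round trip handles a whole batch of edges.
# The row keys (source, target, weight) are assumptions for this sketch.
LINK_QUERY = (
    "UNWIND $rows AS row "
    "MATCH (a:Location {locationid: row.source}) "
    "MATCH (b:Location {locationid: row.target}) "
    "MERGE (a)-[w:WikiLink]->(b) "
    "SET w.weight = row.weight"
)


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch would then be passed as the rows parameter alongside LINK_QUERY through whichever driver you use. Because the query text never changes, the server can reuse the cached plan instead of re-planning a new literal query per edge.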
Have you indexed the Location nodes on locationid?
CREATE INDEX ON :Location(locationid)
I had a similar problem adding edges to a graph and indexing the nodes led to the linking running over 150x faster.
If the nodes aren't indexed neo4j will do a serial search for the two nodes to link together.
USING PERIODIC COMMIT <value>:
Specifies the number of records (rows) to be committed per transaction. Since you have plenty of RAM, you could use a value greater than 100000. This reduces the number of transactions committed and might further reduce the overall time.

Finding the farthest node using Neo4j (node without any incoming relation)

I have created a graph db in Neo4j and want to use it for generalization purposes.
There are about 500,000 nodes (20 distinct labels) and 2.5 million relations (50 distinct types) between them.
In an example path : a -> b -> c-> d -> e
I want to find out the node without any incoming relations (which is 'a').
And I should do this for all the nodes (finding the nodes at the beginning of all possible paths that have no incoming relations).
I have tried several Cypher codes without any success:
match (a:type_A)-[r:is_a]->(b:type_A)
with a,count (r) as count
where count = 0
set a.isFirst = 'true'
or
match (a:type_A), (b:type_A)
where not (a)<-[:is_a*..]-(b)
set a.isFirst = 'true'
Where is the problem?!
Also, I have to create this code in neo4jClient, too.
Your first query will only match paths where there is a relationship [r:is_a], so counting r can never be 0. Your second query will return any arbitrary pair of nodes labeled :type_A that aren't transitively related by [:is_a]. What you want is to filter on a path predicate. For the general case try
MATCH (a)
WHERE NOT ()-->(a)
RETURN a
This translates roughly to "any node that does not have incoming relationships". You can specify the pattern with types, properties or labels as needed, for instance
MATCH (a:type_A)
WHERE NOT ()-[:is_a]->(a)
RETURN a
If you want to find all nodes that have no incoming relationships, you can find them using OPTIONAL MATCH:
MATCH (n)
OPTIONAL MATCH (n)<-[r]-()
WITH n, r
WHERE r IS NULL
RETURN n
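The predicate both answers rely on is simply "this node is never the target of a relationship". A tiny plain-Python illustration on the a -> b -> c -> d -> e example from the question (the function name is made up, and this is not neo4jClient code):

```python
def nodes_without_incoming(nodes, edges):
    """Return nodes that never appear as the target of an edge.

    edges: iterable of (source, target) pairs, i.e. (source)-->(target).
    """
    targets = {target for _, target in edges}
    return [node for node in nodes if node not in targets]
```

Note that the check is per node, not per pair of nodes, which is why matching two nodes (a) and (b) as in the second attempt is unnecessary.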

Resources