Trying to batch merge nodes that are the same - neo4j

I have about 4.7 million "entity nodes." Many of these are duplicate entities. I want to merge the entities that are the same and keep the relationship(s) between those new combined entities and the things they are connected to in place. I wrote the below query to try and do this, but it does not seem to be working. Any assistance with this is greatly appreciated.
CALL apoc.periodic.iterate(
  'MATCH (e:Entity)
   WITH e.name AS name, e.entity_type AS type, collect(e) AS nodes
   CALL apoc.refactor.mergeNodes(nodes, {
     properties: {
       author_id: "combine",
       author_name: "combine",
       entity_hash: "combine",
       entity_type: "combine",
       forum_id: "combine",
       name: "discard",
       post_id: "combine",
       thread_id: "combine"
     }
   }) YIELD node
   RETURN count(node) AS new_node_count',
  '',
  {batchSize: 100000}
)
The pinwheel keeps spinning, but there's no reduction in node count or anything else, which tells me it's hung.

You are not using the apoc.periodic.iterate procedure correctly. This procedure takes 2 queries:
the first: builds the population of elements you will iterate over
the second: says what to do with each element from the first query
So in your case, the query should be:
CALL apoc.periodic.iterate(
  'MATCH (e:Entity)
   WITH e.name AS name, e.entity_type AS type, collect(e) AS nodes
   RETURN nodes',
  'CALL apoc.refactor.mergeNodes(nodes, {
     properties: {
       author_id: "combine",
       author_name: "combine",
       entity_hash: "combine",
       entity_type: "combine",
       forum_id: "combine",
       name: "discard",
       post_id: "combine",
       thread_id: "combine"
     }
   })',
  {batchSize: 500}
)
I have also decreased the batch size to 500: if you have a lot of identical nodes, 500 (or 1000) works well, but 100000 will likely cause an OutOfMemory error.
To gauge the performance of this query, you can first run the driving query on its own and check that it is fast.
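For intuition, here is a pure-Python sketch (not APOC itself; the rules shown are a subset of the query above) of the semantics the merge is meant to apply: group nodes by (name, entity_type), let "combine" collect distinct values into a list, and let "discard" keep the first node's value.

```python
from collections import defaultdict

# Subset of the property rules used in the query above
MERGE_RULES = {"author_id": "combine", "post_id": "combine",
               "name": "discard", "entity_type": "combine"}

def merge_nodes(nodes, rules):
    """Fold a group of duplicate nodes into one property map."""
    merged = dict(nodes[0])  # first node wins for "discard" and unlisted keys
    for node in nodes[1:]:
        for key, value in node.items():
            if rules.get(key) != "combine":
                continue
            existing = merged.get(key)
            if existing is None:
                merged[key] = value
            elif value != existing and (not isinstance(existing, list)
                                        or value not in existing):
                # promote to a list and append the new distinct value
                merged[key] = (existing if isinstance(existing, list)
                               else [existing]) + [value]
    return merged

def merge_duplicates(entities, rules):
    """Group by the duplicate key, then merge each group."""
    groups = defaultdict(list)
    for e in entities:
        groups[(e["name"], e["entity_type"])].append(e)
    return [merge_nodes(g, rules) for g in groups.values()]
```

Running this over a few duplicate entities shows one merged node per (name, entity_type) pair, with distinct author_ids collected into a list.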

Related

creating 200K relationships to a node is taking a lot of time in Neo4J 3.5?

I have one vertex like this
Vertex1
{
name:'hello',
id: '2',
key: '12345',
col1: 'value1',
col2: 'value2',
.......
}
Vertex2, Vertex3, ..... Vertex200K
{
name:'hello',
id: '1',
key: '12345',
col1: 'value1',
col2: 'value2',
.......
}
Cypher Query
MATCH (a:Dense1) where a.id <> "1"
WITH a
MATCH (b:Dense1) where b.id = "1"
WITH a,b
WHERE a.key = b.key
MERGE (a)-[:PARENT_OF]->(b)
The end result should be that Vertex1 has a degree of 200K, i.e. there should be 200K relationships. However, the above query takes a very long time, pretty much killing the throughput to 500/second. Any ideas on how to create relationships/edges quicker?
When I run PROFILE on the cypher query above it keeps running forever and never returns, so I reduced the size from 200K to 20K; here is what the profile is showing me.
Given your memory constraints, and the high db hits associated with your MERGE of the relationships, the issue is likely that you're trying to MERGE 200k relationships in a single transaction. You should probably batch this by using apoc.periodic.iterate() from APOC Procedures:
CALL apoc.periodic.iterate("
MATCH (a:Dense1)
WHERE a.id <> '1'
MATCH (b:Dense1)
WHERE b.id = '1' AND a.key = b.key
RETURN a, b",
"MERGE (a)-[:PARENT_OF]->(b)",
{}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
This should batch those merges 10k at a time.
Also, if you happen to know for a fact that those relationships don't yet exist, use CREATE instead of MERGE, it will be faster.
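Conceptually, apoc.periodic.iterate streams the rows of the driving query and commits the inner statement in fixed-size chunks. A minimal Python sketch of that chunking (the procedure does this inside Neo4j; this is only an illustration of the batching idea):

```python
def batched(rows, batch_size=10000):
    """Yield successive fixed-size batches of rows, the way
    apoc.periodic.iterate groups rows before committing each
    batch in its own transaction (default batchSize is 10000)."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch
```

So 200k (a, b) rows with the default batch size become twenty transactions of 10k merges each, instead of one giant transaction.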
Create an index on the properties you are using for matching, here the id and key properties (the label in the question is Dense1). You can create the indexes with the following queries:
CREATE INDEX ON :Dense1(id);
CREATE INDEX ON :Dense1(key);
This is the first step to improve performance.
You can further improve with a few other tricks.
Can you try running
MATCH (b:Dense1) where b.id <> "1"
WITH b, b.key AS bKey
MATCH (a:Dense1) where a.id = "1" AND a.key = bKey
MERGE (a)-[:PARENT_OF]->(b)
after ensuring you have indexes on id and key?
Also, do I understand correctly that id is NOT unique, and you have 1 node with id = 2 and 200k nodes with id = 1? If I got this wrong, flip the conditions so that the first MATCH returns the single node (the one you want all relationships coming into) and the second MATCH returns the remaining 200k nodes. Also, in the MERGE, put the low-density node first (so here, b would get 200k incoming relationships); if that's not right, reverse it to (b)<-[:XXX]-(a).
It's been a while since I was dealing with large imports/merges, but I recall that extracting the variable explicitly (e.g. bKey), so it can be matched against the index, and starting from a single node (or just a few b's) before moving on to the many a's worked better than queries with WHERE clauses like a.key = b.key.
That said, 200k relationships in one transaction, all connected to a single node, is a lot: matching on the index finds the nodes quickly, but you still need to check all existing outgoing relationships to see whether they already link to the other node. So by the time you create your last relationship, you need to iterate over and check nearly 200k relationships.
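A toy cost model of that last point (this is an illustration, not actual Neo4j internals): if confirming that relationship k does not already exist means scanning the k-1 relationships the dense node already has, the total work grows quadratically.

```python
def merge_cost_without_index(n):
    """Count pairwise existence checks when MERGE-ing n relationships
    onto one dense node, assuming each check scans all prior ones."""
    checks = 0
    existing = []
    for k in range(n):
        checks += len(existing)  # scan b's current relationships
        existing.append(k)       # then create relationship k
    return checks
```

For n = 200000 this is on the order of 2e10 checks in total, which is why the later batches keep getting slower even when the index lookups themselves are fast.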
One trick is running batches in a loop until nothing gets created, e.g.
MATCH (b:Dense1) where b.id = "1"
WITH b, b.key AS bKey
MATCH (a:Dense1) where a.id <> "1" AND a.key = bKey
AND NOT (a) -[:PARENT_OF]-> (b) WITH a,b LIMIT 10000
MERGE (a)-[:PARENT_OF]->(b)
This will probably show that each successive batch takes longer, which makes sense logically, as more and more relationships out of b have to be checked the further you go.
Or, as shown in other responses, batch via APOC.
Last thing - is this supposed to be ongoing process or one-time setup / initialisation of the DB? There are more, dramatically faster options if it's for initial load only.

How do I set relationship data as properties on a node?

I've taken the leap from SQL to Neo4j. I have a few complicated relationships that I need to set as properties on nodes as the first step towards building a recommendation engine.
This Cypher query returns a list of categories and weights.
MATCH (m:Movie {name: "The Matrix"})<-[:TAKEN_FROM]-(i:Image)-[r:CLASSIFIED_AS]->(c:Category) RETURN c.name, avg(r.weight)
This returns
{ "fighting": 0.334, "looking moody": 0.250, "lying down": 0.237 }
How do I set these results as key value pairs on the parent node?
The desired outcome is this:
(m:Movie { "name": "The Matrix", "fighting": 0.334, "looking moody": 0.250, "lying down": 0.237 })
Also, I assume I should process my (m:Movie) nodes in batches so what is the best way of accomplishing this?
Not quite sure how you're getting that output; that RETURN shouldn't produce the categories as key/value pairs of a single map. Instead I would expect something like {"c.name":"fighting", "avg(r.weight)":0.334}, with a separate record for each pair.
You may need APOC procedures for this, as you need a means to set the property key to the value of the category name. That's a bit tricky, but you can do this by creating a map from the collected pairs, then use SET with += to update the relevant properties:
MATCH (m:Movie {name: "The Matrix"})<-[:TAKEN_FROM]-(:Image)-[r:CLASSIFIED_AS]->(c:Category)
WITH m, c.name as name, avg(r.weight) as weight
WITH m, collect([name, weight]) as category
WITH m, apoc.map.fromPairs(category) as categories
SET m += categories
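In plain Python terms (a dict built from pairs, then merged in), the apoc.map.fromPairs plus SET += combination does the equivalent of the following; the pair values here are just the ones from the question:

```python
def from_pairs(pairs):
    """Rough analogue of apoc.map.fromPairs: [[k, v], ...] -> {k: v}."""
    return {key: value for key, value in pairs}

# Collected (category name, avg weight) pairs for one movie node
pairs = [["fighting", 0.334], ["looking moody", 0.250], ["lying down", 0.237]]
movie = {"name": "The Matrix"}
movie.update(from_pairs(pairs))  # the Cypher SET m += categories step
```

After the update, the movie map carries each category name as its own property key, which is the desired outcome from the question.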
As far as batching goes, take a look at apoc.periodic.iterate(), it will allow you to iterate on the streamed results of the outer query and execute the inner query on batches of the stream:
CALL apoc.periodic.iterate(
"MATCH (m:Movie)
RETURN m",
"MATCH (m)<-[:TAKEN_FROM]-(:Image)-[r:CLASSIFIED_AS]->(c:Category)
WITH m, c.name as name, avg(r.weight) as weight
WITH m, collect([name, weight]) as category
WITH m, apoc.map.fromPairs(category) as categories
SET m += categories",
{iterateList:true, parallel:false}) YIELD total, batches, errorMessages
RETURN total, batches, errorMessages

Cypher no loops, no double paths

I am currently modelling a database with over 50,000 nodes, where every node has 2 directed relationships. For a given input node (the root node), I am trying to get all nodes connected to it by one relationship, then all so-called children of those nodes, and so on, until every node connected directly or indirectly to the root node has been reached.
String query =
"MATCH (m {title:{title},namespaceID:{namespaceID}})-[:categorieLinkTo*..]->(n) " +
"RETURN DISTINCT n.title AS Title, n.namespaceID " +
"ORDER BY n.title";
Result result = db.execute(query, params);
String infos = result.resultAsString();
I have read that the runtime of such a query can be exponential, and I cannot find any clause that excludes, for example, loops or multiple paths to one node, so the query simply takes over 2 hours, which is not acceptable for my use case.
For simple relationship expressions, Cypher excludes multiple relationships automatically by enforcing uniqueness:
While pattern matching, Neo4j makes sure to not include matches where the same graph relationship is found multiple times in a single pattern.
The documentation is not entirely clear on whether this works for variable length paths - so let's design a small experiment to confirm it:
CREATE
(n1:Node {name: "n1"}),
(n2:Node {name: "n2"}),
(n3:Node {name: "n3"}),
(n4:Node {name: "n4"}),
(n1)-[:REL]->(n2),
(n2)-[:REL]->(n3),
(n3)-[:REL]->(n2),
(n2)-[:REL]->(n4)
This results in the following graph:
Query with:
MATCH (n:Node {name:"n1"})-[:REL*..]->(m)
RETURN m
The result is:
╒══════════╕
│m │
╞══════════╡
│{name: n2}│
├──────────┤
│{name: n3}│
├──────────┤
│{name: n2}│
├──────────┤
│{name: n4}│
├──────────┤
│{name: n4}│
└──────────┘
As you can see, n2 and n4 each appear multiple times (n4, for instance, can be reached both by avoiding the loop and by going through it).
Check the execution with PROFILE.
So we should use DISTINCT to get rid of the duplicates:
MATCH (n:Node {name:"n1"})-[:REL*..]->(m)
RETURN DISTINCT m
The result is:
╒══════════╕
│m │
╞══════════╡
│{name: n2}│
├──────────┤
│{name: n3}│
├──────────┤
│{name: n4}│
└──────────┘
Again, check the execution with PROFILE.
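The experiment can also be reproduced outside Neo4j. The sketch below (plain Python, same four-node graph) enumerates every path from n1 that never reuses a relationship, which is exactly Cypher's relationship-uniqueness rule, and collects each path's end node:

```python
# The four relationships created in the experiment above
EDGES = [("n1", "n2"), ("n2", "n3"), ("n3", "n2"), ("n2", "n4")]

def reachable_ends(start, edges):
    """End node of every path from start that reuses no relationship."""
    ends = []
    def walk(node, used):
        for i, (src, dst) in enumerate(edges):
            if src == node and i not in used:
                ends.append(dst)        # this path's current end node
                walk(dst, used | {i})   # extend without reusing edge i
    walk(start, frozenset())
    return ends
```

reachable_ends("n1", EDGES) yields n2, n3, n2, n4, n4: the same five rows as the Cypher result, and deduplicating them gives the three-row DISTINCT result.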
There are certainly some things we can do to improve this query.
For one, you aren't using labels at all. And because you aren't using labels, your match on the input node can't take advantage of any schema indexes you may have in place, and it must do a scan of all 50k nodes accessing and comparing properties until it finds every one with the given title and namespace (it will not stop when it's found one, as it does not know if there are other nodes that satisfy the conditions). You can check the timing by matching on the start node alone.
To improve this, your nodes should be labeled, and your match on your start node should include the label, and your title and namespaceID properties should be indexed.
That alone should provide a noticeable improvement in query speed.
The next question is if the remaining bottleneck is more due to the sorting, or returning a massive result set?
You can check the cost for sorting alone by LIMITing the results you return.
You can use this for the end of the query, after the match.
WITH DISTINCT n
ORDER BY n.title
LIMIT 10
RETURN n.title AS Title, n.namespaceID
Also, when doing any performance tuning, you should PROFILE your queries (at least the ones that finish in a reasonable amount of time) and EXPLAIN the ones that are taking an absurd amount of time to examine the query plan.
public static HashSet<Node> breadthFirst(String name, int namespace, GraphDatabaseService db) {
    // Map holding the parameters for the Cypher query
    Map<String, Object> params = new HashMap<>();
    params.put("title", name);
    params.put("namespaceID", namespace);

    /* A simple BFS: a queue (touched), a set of nodes already
     * visited (finished), the result set to return, and two
     * Result variables for the queries. */
    Node startNode = null;
    String query = "MATCH (n {title:{title},namespaceID:{namespaceID}})-[:categorieLinkTo]->(m) RETURN m";
    Queue<Node> touched = new LinkedList<Node>();
    HashSet<Node> finished = new HashSet<Node>();
    HashSet<Node> returnResult = new HashSet<Node>();
    Result iniResult = null;
    Result tempResult = null;

    // Fetch the directly connected nodes and put them into the queue
    try (Transaction tx = db.beginTx()) {
        iniResult = db.execute(query, params);
        while (iniResult.hasNext()) {
            Map<String, Object> iniNode = iniResult.next();
            startNode = (Node) iniNode.get("m");
            touched.add(startNode);
            finished.add(startNode);
        }
        tx.success();
    } catch (QueryExecutionException e) {
        logger.error("Error while executing the query", e);
    }

    /* Standard BFS from here on: poll a node, record it, query its
     * children, and enqueue every child not seen before. The tricky
     * part was the casting between Node and Result. */
    while (!touched.isEmpty()) {
        try (Transaction tx = db.beginTx()) {
            Node currNode = touched.poll();
            returnResult.add(currNode);
            tempResult = null;
            Map<String, Object> paramsTemp = new HashMap<>();
            paramsTemp.put("title", currNode.getProperty("title").toString());
            paramsTemp.put("namespaceID", 14); // note: namespaceID is hard-coded here
            String tempQuery = "MATCH (n {title:{title},namespaceID:{namespaceID}})-[:categorieLinkTo]->(m) RETURN m";
            tempResult = db.execute(tempQuery, paramsTemp);
            while (tempResult.hasNext()) {
                Map<String, Object> currResult = tempResult.next();
                Node tempCurrNode = (Node) currResult.get("m");
                if (!finished.contains(tempCurrNode)) {
                    touched.add(tempCurrNode);
                    finished.add(tempCurrNode);
                }
            }
            tx.success();
        } catch (QueryExecutionException f) {
            logger.error("Error while executing the query", f);
        }
    }
    return returnResult;
}
Since I was not able to find a fitting Cypher expression, I just wrote a plain BFS myself; it seems to work.

Create multiple nodes and relationships in several Cypher statements

I want to create multiple neo4j nodes and relationships in one Cypher transaction. I'm using py2neo, which allows issuing multiple Cypher statements in one transaction.
I thought I'd add a statement for each node and relationship I create:
tx.append('CREATE (n:Label { prop: val })')
tx.append('CREATE (m:Label { prop: val2 })')
Now I want to create a relationship between the two created nodes:
tx.append('CREATE (n)-[:REL]->(m)')
This doesn't work as expected: no relationship is created between the first two nodes, since there is no n or m in the context of the last statement (instead, a new relationship is created between two new nodes, so four nodes are created in total).
Is there a way around this? Or should I combine all the calls to CREATE (around 100,000 per logical transaction) into one statement?
It hurts my brain just thinking about such a statement, because I would need to build everything in one big StringIO, and I would lose the ability to use Cypher query parameters; I'd have to serialize dictionaries to text myself.
UPDATE:
The actual graph layout is more complicated than that. I have multiple relationship types, and each node is connected to at least two other nodes, while some nodes are connected to hundreds of nodes.
You don't need multiple queries. You can use a single CREATE to create each relationship and its related nodes:
tx.append('CREATE (:Label { prop: val })-[:REL]->(:Label { prop: val2 })')
Do something like this:
rels = [(1, 2), (3, 4), (5, 6)]
query = """
CREATE (n:Label {prop: {val1} }),
       (m:Label {prop: {val2} }),
       (n)-[:REL]->(m)
"""
tx = graph.cypher.begin()
for val1, val2 in rels:
    tx.append(query, val1=val1, val2=val2)
tx.commit()
And if your data is large enough consider doing this in batches of 5000 or so.
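The batching suggestion can be sketched generically; append and commit here are stand-ins for tx.append and tx.commit / graph.cypher.begin() in the py2neo code above, so this runs without a database:

```python
def run_in_batches(rels, append, commit, batch_size=5000):
    """Append statements one by one, committing every batch_size of
    them so no single transaction holds ~100,000 creates."""
    pending = 0
    for val1, val2 in rels:
        append(val1, val2)
        pending += 1
        if pending == batch_size:
            commit()     # close this transaction; a caller would begin a new one
            pending = 0
    if pending:
        commit()         # flush the final partial batch
```

With 12,000 pairs and the default batch size this issues 12,000 appends across three commits (5000, 5000, 2000).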

WHERE NOT() in cypher neo4j query

I am having trouble with a simple cypher query. The query is:
MATCH (u:user { google_id : 'example_user' })--(rm:room)--(a:area),
(c:category { name : 'culture:Yoruba' })--(o:object)
WHERE NOT (a-[:CONTAINS]->o)
RETURN DISTINCT o.id
The "WHERE NOT.." is being ignored and I am getting back the nodes with incoming :CONTAINS relationships from the area nodes. If I take out the "NOT" function, then I correctly only get back the nodes that have this a-->o relationship.
I think I have a weak understanding of NOT()
Trad,
The query is returning just what you asked it to. In your example at the link, there are three areas. None of the objects are contained by the first two areas, so all three object nodes are returned. If you change the RETURN line to
RETURN a.area_number, o.id
you will see this.
I don't know about your larger problem context, but if you want to know about objects that aren't in any area, then the query
MATCH (o:object)
WHERE NOT (o)<-[:CONTAINS]-()
RETURN o.id
will accomplish the task.
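As a set-based sketch of what that query computes (the CONTAINS pairs below are hypothetical, made up for illustration): an object is returned exactly when it appears in no (area, object) CONTAINS pair.

```python
# Hypothetical (area, object) CONTAINS pairs and object ids
contains = {("area1", "obj1"), ("area2", "obj2")}
objects = {"obj1", "obj2", "obj3"}

contained = {obj for _area, obj in contains}
uncontained = objects - contained  # objects with no incoming CONTAINS
```

Only obj3 survives the set difference, matching what WHERE NOT (o)<-[:CONTAINS]-() filters for.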
Grace and peace,
Jim
