cypher delete is taking forever - neo4j

I am trying to delete data from neo4j using the following query:
MATCH (c:Customer {customerID: '16af89a6-832b-4bef-b026-eafea3873d69'})
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)-[*]-(n2) WITH r, dept, n2 LIMIT 10
DETACH DELETE r, dept, n2;
This statement is taking forever and not deleting anything when I inspect the dept node for example. Is there anything I am missing here?

You have a variable length path without specifying an upper bound in this line:
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)-[*]-(n2) WITH r, dept, n2 LIMIT 10
This will result in a lot of traversals. Does your data model allow for specifying an upper bound on the number of hops to match n2. Also, you should specify a label or labels for n2.
Also, you don't need to include r in the DETACH DELETE statement. Any existing relationships of a node being deleted will also be deleted when using DETACH DELETE.
Edit
The pattern (dept:Dept)-[*]-(n2) indicates a bidirectional path of any length (with no upper bound). To specify an upper bound on the variable length path simply replace the (dept:Dept)-[*]-(n2) piece of the pattern with (dept:Dept)-[*1..3]-(n2). This will limit the length of the paths traversed to a maximum of three relationships between (dept:Dept) and (n2) (although this might not be appropriate for your data model). It would also be good to add labels and a relationship direction to the pattern (appropriate for your data model), something like:
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)<-[:BELONGS_TO*1..2]-(n2:Product) WITH r, dept, n2 LIMIT 10

There are many different issues in your query. Here are the one I've identified.
The number of paths discoverable by a variable length path query (let's assume the lower bound is 0 or 1) is roughly an exponential function of the maximum path length. That is, if every relevant node has M relationships, and the maximum depth being searched (or, if there is no upper bound, the maximum possible depth) is N, then in the worst case the number of possible paths is (M ^ N). For example, if we plug in 5 and 10 for M and N, we get 9,765,625 possible paths (and the same number of nodes and relationships to be deleted). This is probably the main reason why your query takes a long time.
A second major concern would be total failure of the query due to an out-of-memory situation in the neo4j engine, due to the potentially huge amount of data that needs to be in memory. You have apparently not encountered this yet, but you might. You could try to minimize the number of found paths by only matching complete paths (that is, paths in which the last node has no other node to connect to). I don’t know your data model, so I can’t show you a Cypher clause to do that for your data. But if you do this, your query would have to be modified to use all the nodes in the found paths rather than just the path end nodes.
The second MATCH clause will only match dept nodes that have at least one relationship other than r, because the default lower bound for a variable-length path is a length of 1. Therefore, this query will not delete dept nodes that have no other relationships. You could solve this by specifying a lower bound of 0, as in: [*0..].
You have a LIMIT 10 on your WITH clause, so your query is only going to attempt to delete a few dept and n2 nodes. Also, since you are not necessarily deleting complete paths, you may end up with “disconnected subgraphs” that are no longer connected to anything else. So, you should remove the LIMIT clause, even though that would make your query take even longer.
It is theoretically possible (but I don't know your data model) for an n2 to be the same as c. If your data allows this to be possible, but you never want your query to delete c, you need to add a WHERE clause right after the relevant MATCH clause to prevent that (see below).
Since a MATCH clause filters out any matches where the same relationship is used twice, your second MATCH clause is actually doing extra work to ensure that none of the relationships in each variable length path matches r. Since your use case does not need this checking (after you fix item 5), you could avoid that unneeded check by splitting the second MATCH clause so that r is matched in its own clause.
Here is a sample fix for items 3, 4, 5, 6:
MATCH (c:Customer {customerID: '16af89a6-832b-4bef-b026-eafea3873d69'})
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)
MATCH (dept)-[*0..]-(n2)
WHERE n2 <> c
DETACH DELETE dept, n2;
But, since the above does not solve items 1 or 2, your query could still take a very long time and/or fail. If you provide a more complete idea of your data model, we might be able to solve item 2. However, item 1 is the main issue, and may require rethinking your data model or possibly splitting the deletion into multiple queries.

Related

How to efficiently remove nodes that do not connect two different nodes of a targeted type?

I have a Neo4j database with relationships like : (:Person)-[:KNOWS]-(:Target)
I would like to remove all Persons that are not connected to at least two different Targets thanks to a Cypher query.
I tried to use a query that, for each node, get all connected nodes (with an arbitrary path length) and then count the number of Target in it. If there is less than two, I remove the node.
But the request seems to be extremely long and unsuccessful:
MATCH (n:Person)
OPTIONAL MATCH (n)-[*]-(t:Target)
WITH n, COUNT(t) AS nb_targets
WHERE NOT n:Target AND nb_targets < 2
RETURN n
The request does not even succeed due to its inefficiency...
NB : I have only a few Targets and a lot of Persons
Cypher is interested in returning all possible paths that match the pattern, so it won't do well with an unbounded variable-length query like this.
You can instead use the path expander procs from APOC Procedures, which are designed to be more efficient for these use cases. We can even limit the results per node to 2, since that's the minimum we would need to determine if a node needs to be kept or discarded.
If you needed a query just to return those that didn't have at least 2 targets, then this query should work:
MATCH (n:Person)
WHERE NOT n:TARGET
CALL apoc.path.subgraphNodes(n, {labelFilter:'>TARGET', limit:2, optional:true}) YIELD node
WITH n, count(node) AS nb_targets
WHERE nb_targets < 2
RETURN n
[UPDATED]
If this is a real-life scenario (say, identifying suspects who might be in cahoots with known bad guys), then you probably don't care about people who have a large "degree of separation" from the bad guys. Otherwise, you may end up with most of the population falling under suspicion. So, you will probably want to impose a reasonable upper bound on the depth of your search.
It just so happens that the time (and memory) required to search a variable length path goes up exponentially with the depth of the search, so to speed up the search you would want to impose a reasonable upper bound on the depth of the search as well.
Therefore, try using a reasonable upper bound (say, 6). Here is a query to identify people who are not themselves targets and who are connected (by up to a depth of 6) to fewer than 2 targets. To help speed up the search, I also limit the search to just using KNOWS relationships (assuming that is the only relationship type you care about). And note that I count DISTINCT targets, since multiple paths can contain the same target.
MATCH (n:Person) WHERE NOT n:Target
OPTIONAL MATCH (n)-[:KNOWS*..6]-(t:Target)
WITH n, COUNT(DISTINCT t) AS nb_targets
WHERE nb_targets < 2
RETURN n
To eliminate all the people who are not potential suspects (including the ones who are more than 6 hops from any target), then you can first identify all the suspects and delete the people who are neither targets nor suspects:
MATCH (n:Person) WHERE NOT n:Target
OPTIONAL MATCH (n)-[:KNOWS*..6]-(t:Target)
WITH n, COUNT(DISTINCT t) AS nb_targets
WHERE nb_targets >= 2
WITH COLLECT(n) AS suspects
MATCH (m:Person)
WHERE NOT m:Target AND NOT m IN suspects
DETACH DELETE m

Is neo4j suitable for searching for paths of specific length

I am a total newcommer in the world of graph databases. But let's put that on a side.
I have a task to find a cicular path of certain length (or of any other measure) from start point and back.
So for example, I need to find a path from one node and back which is 10 "nodes" long and at the same time has around 15 weights of some kind. This is just an example.
Is this somehow possible with neo4j, or is it even the right thing to use?
Hope I clarified it enough, and thank you for your answers.
Regards
Neo4j is a good choice for cycle detection.
If you need to find one path from n to n of length 10, you could try some query like this one:
MATCH p=(n:TestLabel {uuid: 1})-[rels:TEST_REL_TYPE*10]-(n)
RETURN p LIMIT 1
The match clause here is asking Cypher to find all paths from n to itself, of exactly 10 hops, using a specific relationship type. This is called variable length relationships in Neo4j. I'm using limit 1 to return only one path.
Resulting path can be visualized as a graph:
You can also specify a range of length, such as [*8..10] (from 8 to 10 hops away).
I'm not sure I understand what you mean with:
has around 15 weights of some kind
You can check relationships properties, such as weight, in variable length paths if you need to. Specific example in the doc here.
Maybe you will also be interested in shortestPath() and allShortestPaths() functions, for which you need to know the end node as well as the start one, and you can find paths between them, even specifying the length.
Since you did not provide a data model, I will just assume that your starting/ending nodes all have the Foo label, that the relevant relationships all have the BAR type, and that your circular path relationships all point in the same direction (which should be faster to process, in general). Also, I gather that you only want circular paths of a specific length (10). Finally, I am guessing that you prefer circular paths with lower total weight, and that you want to ignore paths whose total weight exceed a bounding value (15). This query accomplishes the above, returning the matching paths and their path weights, in ascending order:
MATCH p=(f:Foo)-[rels:BAR*10]->(f)
WITH p, REDUCE(s = 0, r IN rels | s + r.weight) AS pathWeight
WHERE pathWeight <= 15
RETURN p, pathWeight
ORDER BY pathWeight;

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property and named it "outeseq" (original) and "ineseq" (second) to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000) but past that, its just an endless wait. My JVM has 16g max (if it can even use it on a windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but did't see any reason for it to "blow-up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one nodes'stamp to the next nearest node with the minimum delta, but didn't run right at this for fear that it cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest to run them page by page over all the nodes. It seems feasible because outeseq and ineseq are unique and numeric so you can sort nodes by that properties and run query against one slice at time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
It will take about 13 times to run the query changing {offset} to cover all the data. It would be nice to write a script on any language which has a neo4j client.
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships have the delta property inside, the graph becomes a weighted graph. So we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (summ of deltas) into the relation between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) : weight+r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.

Cypher directionless query not returning all expected paths

I have a cypher query that starts from a machine node, and tries to find nodes related to it using any of the relationship types I've specified:
match p1=(n:machine)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*]-(n2)
where n.machine="112943691278177215"
optional match p2=(n2)-[*]->()
return p1,p2
limit 300
The optional match clause is my attempt to traverse outwards in my model from each of the nodes found in p1. The below screenshot shows the part of the results I'm having issues with:
You can see from the starting machine node, it finds a personal_phone node via two app nodes related to the machine. For clarification, this part of the model is designed like so:
So it appeared to be working until I realized that certain paths were somehow being left out of the results. If I run a second query showing me all apps related to that particular personal_phone node, I get the following:
match p1=(n:personal_phone)<-[*]-(n2)
where n.personal_phone="(xxx) xxx-xxxx"
return p1
limit 100
The two apps I have segmented out, are the two apps shown in the earlier image.
So why doesn't my original query show the other 7 apps related to the personal_phone?
EDIT : Despite the overly broad optional match combined with the limit 300 statement, the returned results show only 52 nodes and 154 rels. This is because the paths following relationships with an outward direction are going to stop very quickly. I could have put a max 2 on it but was being lazy.
EDIT 2: The query I finally came up with to give me what I want is this:
match p1=(m:machine)<-[:MACHINE]-(a:app)
where m.machine="112943691278177215"
optional match p2=(a:app)-[:REL1|:REL2|:REL3|:PERSONAL_PHONE|:MACHINE|:ADDRESS*0..3]-(n)
where a<>n and a<>m and m<>n
optional match p3=(n)-[r*]->(n2)
where n2<>n
return distinct n, r, n2
This returns 74 nodes and 220 rels which seems to be the correct result (387 rows). So it seems like my incredibly inefficient query was the reason the graph was being truncated. Not only were the nodes being traversed many times, but the paths being returned contained duplicate information which consumed the limited rows available for return. I guess my new questions are:
When following multiple hops, should I always explicitly make sure the same nodes aren't traversed via where clauses?
If I was to return p3 instead, it returns 1941 rows to display 74 nodes and 220 rels. There seems to be a lot of duplication present. Is it typically better to use return distinct (like I have above) or is there a way to easily dedupe the nodes and relationships within a path?
So part of your issue here (updated questions) is that you're returning paths, and not individual nodes/relationships.
For example, if you do MATCH p=(n)-[*]-() and your data is A->B->C->D then the results you'll get will be A->B, A->B->C, A->B->C->D and so on. If on the other hand you did MATCH (n)-[r:*]-(m) and then worked with r and m, you could get the same data, but deal with the distinct things on the path rather than have to sort that out later.
It seems you want the nodes and relationships, but you're asking for the paths - so you're getting them. ALL of them. :)
When following multiple hops, should I always explicitly make sure the
same nodes aren't traversed via where clauses?
Well, the way you did it, yes -- but honestly I haven't ever run into that problem before. Part of the issue again is the overly-broad query you're running. Lacking any constraint, it ends up roping in the items you've already matched, which buys you this problem. Perhaps better would be to match some set of possible labels, to narrow your query down. By narrowing it down, you wouldn't have the same issue, for example something like:
MATCH (n)-[r:*]-(m)
WHERE 'foo' in labels(m) or 'bar' in labels(m)
RETURN n, r, m;
Note we're not doing path matching, and we're specifying some range of labels that could be m, without leaving it completely wild-west. I tend to formulate queries this way, so your question #2 never really arises. Presumably you have a reasonable data model that would act as your grounding for that.

How to find distinct nodes in a Neo4j/Cypher query

I'm trying to do some pattern matching in neo4j/cypher and I came across this issue:
There are two types of graphs I want to search for:
Star graphs: A graph with one center node and multiple outgoing relationships.
n-length line graphs: A line graph with length n where none of the nodes are repeats (I have some bidirectional edges and cycles in my graph)
So the main problem is that when I do something such as:
MATCH a-->b, a-->c, a-->d
MATCH a-->b-->c-->d
Cypher doesn't guarantee (when I tried it) that a, b, c, and d are all different nodes. For small graphs, this can easily be fixed with
WHERE not(a=b) AND not(a=c) AND ...
But I'm trying to have graphs of size 10+, so checking equality between all nodes isn't a viable option. Afaik, RETURN DISTINCT does not work as well since it doesn't check equality among variables, only across different rows. Is there any simple way I can specify the query to make the differently named nodes distinct?
Old question, but look to APOC Path Expander procedures for how to address these kinds of use cases, as you can change the traversal uniqueness behavior for expansion (the same way you can when using the traversal API...which these procedures use).
Cypher implicitly uses RELATIONSHIP_PATH uniqueness, meaning that per path returned, a relationship must be unique, it cannot be used multiple times in a single path.
While this is good for queries where you need all possible paths, it's not a good fit for queries where you want distinct nodes or a subgraph or to prevent repeating nodes in a path.
For an n-length path, let's say depth 6 with only outgoing relationships of any type, we can change the uniqueness to NODE_PATH, where a node must be unique per path, no repeats in a path:
MATCH (n)
WHERE id(n) = 12345
CALL apoc.path.expandConfig(n, {maxLevel:6, uniqueness:'NODE_PATH'}) YIELD path
RETURN path
If you want all reachable nodes up to a certain depth (or at any depth by omitting maxLevel), you can use NODE_GLOBAL uniqueness, or instead just use apoc.path.subgraphNodes():
MATCH (n)
WHERE id(n) = 12345
CALL apoc.path.subgraphNodes(n, {maxLevel:6}) YIELD node
RETURN node
NODE_GLOBAL uniqueness means that across all paths that a node must be unique, it will only be visited once, and there will only be one path to a node from a given start node. This keeps the number of paths that need to be evaluated down significantly, but because of this behavior not all relationships will be traversed, if they expand to a node already visited.
You will not get relationships back with this procedure (you can use apoc.path.spanningTree() for that, although as previously mentioned not all relationships will be included, as we will only capture a single path to each node, not all possible paths to nodes). If you want all nodes up to a max level and all possible relationships between those nodes, then use apoc.path.subgraphAll():
MATCH (n)
WHERE id(n) = 12345
CALL apoc.path.subgraphAll(n, {maxLevel:6}) YIELD nodes, relationships
RETURN nodes, relationships
Richer options exist for label and relationship filtering, or filtering (whitelist, blacklist, endnode, terminator node) based on lists of pre-matched nodes.
We also support repeating sequences of relationships or node labels.
If you need filtering by node or relationship properties during expansion, then this won't be a good option as that feature is yet supported.

Resources