I am a total newcomer to the world of graph databases, but let's put that aside.
I have a task to find a circular path of a certain length (or of some other measure) from a start point and back.
So, for example, I need to find a path from one node back to itself that is 10 "nodes" long and at the same time has around 15 weights of some kind. This is just an example.
Is this somehow possible with Neo4j, or is it even the right tool to use?
I hope I have clarified it enough, and thank you for your answers.
Regards
Neo4j is a good choice for cycle detection.
If you need to find one path from n to n of length 10, you could try some query like this one:
MATCH p=(n:TestLabel {uuid: 1})-[rels:TEST_REL_TYPE*10]-(n)
RETURN p LIMIT 1
The MATCH clause here asks Cypher to find all paths from n to itself of exactly 10 hops, using a specific relationship type. These are called variable-length relationships in Neo4j. I'm using LIMIT 1 to return only one path.
The resulting path can be visualized as a graph in the Neo4j Browser.
You can also specify a range of length, such as [*8..10] (from 8 to 10 hops away).
I'm not sure I understand what you mean with:
has around 15 weights of some kind
You can check relationship properties, such as weight, in variable-length paths if you need to. There is a specific example of this in the documentation.
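For instance, reusing the earlier example, an ALL() predicate over the relationship list can filter on a relationship property. This is just a sketch; the weight property and the threshold are assumptions:

```cypher
MATCH p=(n:TestLabel {uuid: 1})-[rels:TEST_REL_TYPE*8..10]-(n)
WHERE ALL(r IN rels WHERE r.weight <= 2)
RETURN p LIMIT 1
```

Here, every relationship along the 8-to-10-hop circular path must individually satisfy the weight check; to constrain the total weight of the path instead, you would use REDUCE, as in the other answer below.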
You may also be interested in the shortestPath() and allShortestPaths() functions, for which you need to know the end node as well as the start node; you can find paths between them, even constraining the length.
Since you did not provide a data model, I will just assume that your starting/ending nodes all have the Foo label, that the relevant relationships all have the BAR type, and that your circular path relationships all point in the same direction (which should generally be faster to process). Also, I gather that you only want circular paths of a specific length (10). Finally, I am guessing that you prefer circular paths with lower total weight, and that you want to ignore paths whose total weight exceeds a bounding value (15). This query accomplishes the above, returning the matching paths and their path weights, in ascending order:
MATCH p=(f:Foo)-[rels:BAR*10]->(f)
WITH p, REDUCE(s = 0, r IN rels | s + r.weight) AS pathWeight
WHERE pathWeight <= 15
RETURN p, pathWeight
ORDER BY pathWeight;
I am currently starting to work with Neo4j and its query language, Cypher.
I have multiple queries that follow the same pattern.
I am doing a comparison between an SQL database and Neo4j.
In my Neo4j database I have one type of label (person) and one type of relationship (FRIENDSHIP). A person has the properties personID, name, email, phone.
Now I want to get the friends of n-th degree. I also want to filter out those persons that are also friends of a lower degree.
For example, if I search for friends of third degree, I want to filter out those that are also friends of first and/or second degree.
Here is my query pattern:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP*3]-(friends:person)
WHERE NOT (me:person)-[:FRIENDSHIP]-(friends:person)
AND NOT (me:person)-[:FRIENDSHIP*2]-(friends:person)
RETURN COUNT(DISTINCT friends);
I found something similar somewhere.
This query works.
My problem is that this query pattern is much too slow if I search for a higher degree of friendship and/or if the number of persons grows.
So I would really appreciate it if someone could help me optimize this.
If you just wanted to handle depths of 3, this should return the distinct nodes that are 3 degrees away but not also less than 3 degrees away:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP]-(f1:person)-[:FRIENDSHIP]-(f2:person)-[:FRIENDSHIP]-(f3:person)
RETURN apoc.coll.subtract(COLLECT(f3), COLLECT(f1) + COLLECT(f2) + me) AS result;
The above query uses the APOC function apoc.coll.subtract to remove the unwanted nodes from the result. The function also makes sure the collection contains distinct elements.
The following query is more general, and should work for any given depth (by just replacing the number after *). For example, this query will work with a depth of 4:
MATCH p=(me:person {personID:'1'})-[:FRIENDSHIP*4]-(:person)
WITH NODES(p)[0..-1] AS priors, LAST(NODES(p)) AS candidate
UNWIND priors AS prior
RETURN apoc.coll.subtract(COLLECT(DISTINCT candidate), COLLECT(DISTINCT prior)) AS result;
The problem with Cypher's variable-length relationship matching is that it's looking for all possible paths to that depth. This can cause unnecessary performance issues when all you're interested in are the nodes at certain depths and not the paths to them.
APOC's path expander using 'NODE_GLOBAL' uniqueness is a more efficient means of matching to nodes at inclusive depths.
When using 'NODE_GLOBAL' uniqueness, nodes are only ever visited once during traversal. Because of this, when we set the path expander's minLevel and maxLevel to be the same, the result is the set of nodes at that level that are not present at any lower level, which is exactly the result you're trying to get.
Try this query after installing APOC:
MATCH (me:person {personID:'1'})
CALL apoc.path.expandConfig(me, {uniqueness:'NODE_GLOBAL', minLevel:4, maxLevel:4}) YIELD path
// a single path for each node at depth 4 but not at any lower depth
RETURN COUNT(path)
Of course you'll want to parameterize your inputs (personID, level) when you get the chance.
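As a sketch, a parameterized version could look like the following. The :param lines are Neo4j Browser syntax; with a driver you would pass the parameters programmatically, and whether your Neo4j version uses $param or the older {param} syntax depends on the release:

```cypher
:param personID => '1';
:param level => 4;

MATCH (me:person {personID: $personID})
CALL apoc.path.expandConfig(me, {uniqueness: 'NODE_GLOBAL', minLevel: $level, maxLevel: $level})
YIELD path
RETURN COUNT(path)
```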
To keep things simple, as part of the ETL on my time-series data, I added a sequence-number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties; not that big). I later added a second copy of it, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16 GB max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can get bits of the relationships built with parameters, but I haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a way general enough to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property representing the difference between the timestamps of the two nodes, i.e. delta-t. Is there a way to take the difference between the values on two sequential nodes and assign it to the relationship, for all of the relationships at the same time?
The last question, if you have the time: I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't attempt this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} on each run to cover all the data. It would be convenient to write a script in any language that has a Neo4j client.
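A minimal Python sketch of such a script might look like this. The query is the paged version from above; the session object and $-parameter syntax are assumptions based on the official Neo4j Python driver, and driver setup is omitted:

```python
# Sketch: page the MERGE over all BOOK nodes in slices of 30k.
PAGE_SIZE = 30_000
TOTAL_NODES = 370_366  # node count from the question

PAGED_MERGE = """
MATCH (a:BOOK), (b:BOOK)
WHERE a.outeseq = b.outeseq - 1
WITH a, b ORDER BY a.outeseq SKIP $offset LIMIT $limit
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN count(s)
"""

def page_offsets(total, page):
    """Offsets needed to cover `total` rows in pages of `page` rows."""
    return list(range(0, total, page))

def build_all_relationships(session):
    # `session` is a Neo4j session-like object with a run() method.
    for offset in page_offsets(TOTAL_NODES, PAGE_SIZE):
        session.run(PAGED_MERGE, offset=offset, limit=PAGE_SIZE)

print(len(page_offsets(TOTAL_NODES, PAGE_SIZE)))  # → 13 pages
```

This confirms the "about 13 times" estimate: 370,366 rows in pages of 30,000 takes 13 passes.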
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
Once relationships carry the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (the sum of deltas) into a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.
I am trying to delete data from neo4j using the following query:
MATCH (c:Customer {customerID: '16af89a6-832b-4bef-b026-eafea3873d69'})
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)-[*]-(n2) WITH r, dept, n2 LIMIT 10
DETACH DELETE r, dept, n2;
This statement is taking forever and not deleting anything when I inspect the dept node for example. Is there anything I am missing here?
You have a variable length path without specifying an upper bound in this line:
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)-[*]-(n2) WITH r, dept, n2 LIMIT 10
This will result in a lot of traversals. Does your data model allow for specifying an upper bound on the number of hops to match n2? Also, you should specify a label or labels for n2.
Also, you don't need to include r in the DETACH DELETE statement. Any existing relationships of a node being deleted will also be deleted when using DETACH DELETE.
Edit
The pattern (dept:Dept)-[*]-(n2) indicates a bidirectional path of any length (with no upper bound). To specify an upper bound on the variable length path simply replace the (dept:Dept)-[*]-(n2) piece of the pattern with (dept:Dept)-[*1..3]-(n2). This will limit the length of the paths traversed to a maximum of three relationships between (dept:Dept) and (n2) (although this might not be appropriate for your data model). It would also be good to add labels and a relationship direction to the pattern (appropriate for your data model), something like:
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)<-[:BELONGS_TO*1..2]-(n2:Product) WITH r, dept, n2 LIMIT 10
There are many different issues in your query. Here are the ones I've identified.
1. The number of paths discoverable by a variable-length path query (let's assume the lower bound is 0 or 1) is roughly an exponential function of the maximum path length. That is, if every relevant node has M relationships, and the maximum depth being searched (or, if there is no upper bound, the maximum possible depth) is N, then in the worst case the number of possible paths is (M ^ N). For example, if we plug in 5 and 10 for M and N, we get 9,765,625 possible paths (and the same number of nodes and relationships to be deleted). This is probably the main reason why your query takes a long time.
2. A second major concern would be total failure of the query due to an out-of-memory situation in the neo4j engine, due to the potentially huge amount of data that needs to be in memory. You have apparently not encountered this yet, but you might. You could try to minimize the number of found paths by only matching complete paths (that is, paths in which the last node has no other node to connect to). I don't know your data model, so I can't show you a Cypher clause to do that for your data. But if you do this, your query would have to be modified to use all the nodes in the found paths rather than just the path end nodes.
3. The second MATCH clause will only match dept nodes that have at least one relationship other than r, because the default lower bound for a variable-length path is a length of 1. Therefore, this query will not delete dept nodes that have no other relationships. You could solve this by specifying a lower bound of 0, as in: [*0..].
4. You have a LIMIT 10 on your WITH clause, so your query is only going to attempt to delete a few dept and n2 nodes. Also, since you are not necessarily deleting complete paths, you may end up with "disconnected subgraphs" that are no longer connected to anything else. So, you should remove the LIMIT clause, even though that would make your query take even longer.
5. It is theoretically possible (but I don't know your data model) for an n2 to be the same as c. If your data allows this to be possible, but you never want your query to delete c, you need to add a WHERE clause right after the relevant MATCH clause to prevent that (see below).
6. Since a MATCH clause filters out any matches where the same relationship is used twice, your second MATCH clause is actually doing extra work to ensure that none of the relationships in each variable length path matches r. Since your use case does not need this checking (after you fix item 5), you could avoid that unneeded check by splitting the second MATCH clause so that r is matched in its own clause.
Here is a sample fix for items 3, 4, 5, 6:
MATCH (c:Customer {customerID: '16af89a6-832b-4bef-b026-eafea3873d69'})
MATCH (c)<-[r:DEPT_OF]-(dept:Dept)
MATCH (dept)-[*0..]-(n2)
WHERE n2 <> c
DETACH DELETE dept, n2;
But, since the above does not solve items 1 or 2, your query could still take a very long time and/or fail. If you provide a more complete idea of your data model, we might be able to solve item 2. However, item 1 is the main issue, and may require rethinking your data model or possibly splitting the deletion into multiple queries.
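The exponential worst-case estimate above is easy to sanity-check with a few lines of Python (the M and N values are just the example figures from this answer):

```python
def worst_case_paths(m: int, n: int) -> int:
    """Worst-case number of paths when every relevant node has m
    relationships and the traversal goes n hops deep: m ** n."""
    return m ** n

# Example from this answer: 5 relationships per node, depth 10.
print(worst_case_paths(5, 10))  # → 9765625
```

Even modest branching factors blow up quickly, which is why an unbounded [*] pattern over a connected graph can effectively visit everything.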
I am trying to find the number of relationships that stem originally from a parent node, and I am not sure of the syntax to use in order to get access to this returned integer. I can be sure in my code that each child node can only have one relationship of a particular type, so this gives me a "true" depth reading.
My attempt is this but I am hoping there is a cleaner way:
MATCH p=(n {id:'123'})-[r:Foo*]->(c)
RETURN length(p)
I am not sure this is the correct syntax, because it returns an array of integers, with the last index being the true tally length. I am hoping for something that just returns an int instead of the mentioned array.
I am very grateful for help that you may be able to offer.
As Nicole says, in general, finding the longest path between two nodes in a graph is not feasible in any reasonable time. If your graph is very small, it is possible that you will be able to find all paths, and select the one with the most edges but this won't scale to larger graphs.
However there is a trick that you can do in certain circumstances. If your graph contains no directed cycles, you can assign each edge a weight of -1, and then look for the shortest weighted path between the source and target nodes. Since the edge weights are negative a shortest weighted path must correspond to a path with a maximum number of edges between the desired nodes.
Unfortunately, Cypher doesn't yet support shortest weighted path algorithms, but the Neo4j database engine does. The docs give an example of how to do this. You will also need to implement your own algorithm, such as Bellman-Ford, using the traversal API, because Dijkstra won't work with negative edge weights.
However, please be aware that this trick won't work if your graph contains cycles - it must be a DAG.
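For intuition, here is a minimal Python sketch of the trick, outside Neo4j (the graph, node numbering, and function name are made up for illustration): every edge gets weight -1, Bellman-Ford relaxation finds the shortest weighted path, and negating the result gives the maximum hop count. On a graph with cycles the negative weights would make distances decrease forever, which is why the DAG restriction matters.

```python
def longest_path_length(edges, source, target, num_nodes):
    """Longest path (in edges) from source to target in a DAG,
    via Bellman-Ford with every edge weighted -1."""
    INF = float("inf")
    dist = {v: INF for v in range(num_nodes)}
    dist[source] = 0
    # Standard Bellman-Ford: relax all edges num_nodes - 1 times.
    for _ in range(num_nodes - 1):
        for u, v in edges:
            if dist[u] - 1 < dist[v]:  # edge weight is -1
                dist[v] = dist[u] - 1
    return -dist[target]  # negate back to get the hop count

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
# Longest path 0 -> 3 is 0-1-2-3, i.e. 3 edges.
print(longest_path_length(edges, 0, 3, 4))  # → 3
```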
Your query:
MATCH p=(n {id:'123'})-[r:Foo*]->(c)
RETURN length(p)
is returning the length of ALL possible paths from n to c. You are probably only interested in the shortest path? You can use the shortestPath function to consider only the shortest path from n to c:
MATCH p = shortestPath((n {id:'123'})-[r:Foo*]->(c))
RETURN length(p)
I have a Neo4j database (version 2.0.0) containing words and their etymological relationships with other words. I am currently able to create "word networks" by traversing these word origins, using a variable depth Cypher query.
For client-side performance reasons (these networks are visualized in JavaScript), and because the number of relationships varies significantly from one word to the next, I would like to be able to make the depth traversal conditional on the number of nodes. My query currently looks something like this:
start a=node(id)
match p=(a)-[r:ORIGIN_OF*1..5]-(b)
where not b-->()
return nodes(p)
Going to a depth of 5 usually yields very interesting results, but at times delivers far too many nodes for my client-side visualization to handle. I'd like to check against, for example, sum(length(nodes(p))) and decrement the depth if that result exceeds a particular maximum value. Or, of course, any other way of achieving this goal.
I have experimented with adding a WHERE clause to the path traversal, but this is specific to individual paths and does not allow me to sum() the total number of nodes.
Thanks in advance!
What you're looking to do isn't entirely straightforward in a single query. Assuming you are using labels and an index on the word property, the following query should do what you want.
MATCH p=(a:Word { word: "Feet" })-[r:ORIGIN_OF*1..5]-(b)
WHERE NOT (b)-->()
WITH reduce(pathArr =[], word IN nodes(p)| pathArr + word.word) AS wordArr
MATCH (words:Word)
WHERE words.word IN wordArr
WITH DISTINCT words
MATCH (origin:Word { word: "Feet" })
MATCH p=shortestPath((words)-[*]-(origin))
WITH words, length(nodes(p)) AS distance
RETURN words
ORDER BY distance
LIMIT 100
I should mention that this most likely won't scale to huge datasets; it may take a few seconds to complete if there are 1000+ paths extending from your origin word.
The query basically does a radial-distance operation by collecting all distinct nodes from your paths into a word array. Then it measures the shortest-path distance from each distinct word to the origin word, orders by the closest distance, and imposes a maximum limit on the results, for example 100.
Give it a try and see how it performs in your dataset. Make sure to index on the word property and to apply the Word label to your applicable word nodes.
What comes to my mind is a simple optimization of the graph:
what you need to do is add a property to each node recording how many connections it has at each depth from 1 to 5, i.e.:
start a=node(id)
match (a)-[r:ORIGIN_OF*1..1]-(b)
with a, count(*) as cnt
set a.reach1 = cnt
...
start a=node(id)
match (a)-[r:ORIGIN_OF*5..5]-(b)
where not b-->()
with a, count(*) as cnt
set a.reach5 = cnt
Then, before each run of your query above, check whether reachX is below your desired number of results, and run the query with [r:ORIGIN_OF*X..X] accordingly.
This has some consequences: either you have to rerun this optimization each time new items or updates land in your DB, or, whenever a node is added or updated, you must recompute the reachX properties as part of that update.