Is it possible to reduce/optimize this query for node degrees? - neo4j

Given the following Cypher query that returns afferent (inbound) and efferent (outbound) connections, and the sum as the node degree:
START n = node(*)
RETURN n.name, length((n)-->()) AS efferent,
length((n)<--()) AS afferent,
length((n)-->()) + length((n)<--()) AS degree
Is it possible to reduce the query so that the two length() functions are not repeated in the summation in the degree column?

You can compute and bind the two length() values once, before returning, by using WITH, and then sum those bound values in the RETURN.
START n = node(*)
WITH n, length((n)-->()) AS efferent, length((n)<--()) AS afferent
RETURN n.name, efferent, afferent, efferent + afferent AS degree
You may want to use MATCH (n) instead of START n = node(*) if your Neo4j version is 2.0 or later, but that's not what you're asking about, so I'll assume you know what you are doing.
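In later Neo4j versions, length() on a pattern expression was deprecated in favour of size(); on a 3.x or 4.x server the same answer could be written as follows (a sketch, semantics unchanged):
MATCH (n)
WITH n, size((n)-->()) AS efferent, size((n)<--()) AS afferent
RETURN n.name, efferent, afferent, efferent + afferent AS degree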
EDIT
In Neo4j 1.x START is how you began a query. From 2.x and on, while START is still around, MATCH is the preferred way. If you have Neo4j 2.x and don't know a particular reason why you should use START, then you should use MATCH. Here's a short explanation of why.
Your query is written to touch the entire graph. When that is the intention there is not a very big difference between START n = node(*) and MATCH (n). The execution plans do differ, but I'm not aware that the difference is very important.
If, however, you want to perform your computations only on part of the graph, and you restrict your 'starting point pattern' to that effect, then there will be significant differences. If, for example, you want to perform your computation only on nodes with the :User label
START n = node(*)
WHERE n:User
will still pull up all nodes, and then apply a filter to discard those that don't have the label, whereas
MATCH (n)
WHERE n:User
will only pull up the nodes that have that label to begin with.
The general difference is this: WHERE is a dependent clause accompanying START, MATCH, OPTIONAL MATCH or WITH. When it accompanies START or WITH it does not work by modifying the operation but by filtering the results; when it accompanies MATCH and OPTIONAL MATCH it modifies (as often as it can) the operation and therefore doesn't have to filter the results. The difference is that between shouting "Everyone, if you are my child, don't go into the road" and "Kids, don't go into the road".
There are cases when WHERE is not pulled into the MATCH clause. One example is
MATCH (n)
WHERE n:Male OR n:Female
In this case all nodes are pulled up and then filtered, just as if we had used START instead of MATCH.
Sometimes it is easy to know which patterns in the WHERE clause are able to be pulled in to modify the MATCH. This is the case for patterns that you can move into the MATCH clause yourself, by simply rearranging the query. The first MATCH example above could also be expressed
MATCH (n:User)
There is no way, however, to do this for the WHERE clause in the second MATCH example, WHERE n:Male OR n:Female.
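If getting the label test into the match matters for performance, one workaround (not part of the original answer, just a sketch) is to split the disjunction into two label scans and combine them with UNION:
MATCH (n:Male) RETURN n
UNION
MATCH (n:Female) RETURN n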
That a WHERE pattern cannot be moved into the MATCH clause by reformulating the query is not a reliable indicator that the query planner is unable to make use of it in the match operation. Being a declarative language, you ultimately have to trust the query planner to wisely implement the instructions; trust, but verify.
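One way to verify is to prefix the query with PROFILE and inspect the plan. In the sketch below (operator names as reported by recent Neo4j versions), seeing NodeByLabelScan rather than AllNodesScan followed by a Filter tells you the label predicate was pushed into the match:
PROFILE
MATCH (n)
WHERE n:User
RETURN count(n)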
One other difference between START and MATCH pertains to indexing. If you use 'legacy indexing' then you need to use START to access those indices. The 'new' label indices (introduced about two years ago, I believe) have been continuously improved in features and efficiency, and we are running out of reasons to use the old indices. I think the only reason left may be full-text indexing, for which a configured legacy Lucene index is still necessary. In time this feature will also be added to the label indices. Possibly, at that point, the START clause will be removed from Cypher altogether, but that is just the author's speculation.
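For concreteness, here is a sketch of the two styles; the legacy index name people, the :User label, and the name property are made up for illustration. A legacy index is read through START:
START n = node:people(name = 'Alice')
RETURN n
whereas with a 2.x label index in place (created with CREATE INDEX ON :User(name)) the equivalent lookup is simply:
MATCH (n:User {name: 'Alice'})
RETURN n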

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, the neo4j browser complains that there is not enough memory (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits are exploding into the 10s of billions, as if I'm asking it for a massive cartesian product: but all I have done is added a restriction on the child, so this should be reducing the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the profile shown by EXPLAIN looks very efficient, but I still have the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an optional match and limit it to the collected items, this should restrict the returned results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptually (and crudely) similar to an outer join in SQL.
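A sketch of how that suggestion might look against the asker's query (labels, relationship types, and property names are taken from the question; the final RETURN is left generic):
WITH '12345' AS snid, 'ABCDE' AS pid
MATCH (m:Product {full_sn: snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
OPTIONAL MATCH path = (anc:Part)-[:has*]->(child:Part {pn: pid})
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
RETURN path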

Optimizing Cypher Query

I am currently starting to work with Neo4J and it's query language cypher.
I have multiple queries that follow the same pattern.
I am doing some comparison between a SQL-Database and Neo4J.
In my Neo4j database I have one type of label (person) and one type of relationship (FRIENDSHIP). The person nodes have the properties personID, name, email, and phone.
Now I want to get the friends of n-th degree. I also want to filter out those persons that are also friends of a lower degree.
For example, if I search for friends of 3rd degree, I want to filter out those that are also friends of 1st and/or 2nd degree.
Here is my query type:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP*3]-(friends:person)
WHERE NOT (me:person)-[:FRIENDSHIP]-(friends:person)
AND NOT (me:person)-[:FRIENDSHIP*2]-(friends:person)
RETURN COUNT(DISTINCT friends);
I found something similar somewhere.
This query works.
My problem is that this pattern of query is much too slow if I search for a higher degree of friendship and/or if the number of persons grows.
So I would really appreciate it if someone could help me optimize this.
If you just wanted to handle depths of 3, this should return the distinct nodes that are 3 degrees away but not also less than 3 degrees away:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP]-(f1:person)-[:FRIENDSHIP]-(f2:person)-[:FRIENDSHIP]-(f3:person)
RETURN apoc.coll.subtract(COLLECT(f3), COLLECT(f1) + COLLECT(f2) + me) AS result;
The above query uses the APOC function apoc.coll.subtract to remove the unwanted nodes from the result. The function also makes sure the collection contains distinct elements.
The following query is more general, and should work for any given depth (by just replacing the number after *). For example, this query will work with a depth of 4:
MATCH p=(me:person {personID:'1'})-[:FRIENDSHIP*4]-(:person)
WITH NODES(p)[0..-1] AS priors, LAST(NODES(p)) AS candidate
UNWIND priors AS prior
RETURN apoc.coll.subtract(COLLECT(DISTINCT candidate), COLLECT(DISTINCT prior)) AS result;
The problem with Cypher's variable-length relationship matching is that it's looking for all possible paths to that depth. This can cause unnecessary performance issues when all you're interested in are the nodes at certain depths and not the paths to them.
APOC's path expander using 'NODE_GLOBAL' uniqueness is a more efficient means of matching to nodes at inclusive depths.
When using 'NODE_GLOBAL' uniqueness, nodes are only ever visited once during traversal. Because of this, when we set the path expander's minLevel and maxLevel to be the same, the results are nodes at that level that are not present at any lower level, which is exactly the result you're trying to get.
Try this query after installing APOC:
MATCH (me:person {personID:'1'})
CALL apoc.path.expandConfig(me, {uniqueness:'NODE_GLOBAL', minLevel:4, maxLevel:4}) YIELD path
// a single path for each node at depth 4 but not at any lower depth
RETURN COUNT(path)
Of course you'll want to parameterize your inputs (personID, level) when you get the chance.
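For example, with Neo4j 3.x-style $ parameters (a sketch; the parameter names are made up, and :param is the Neo4j Browser way of setting them, drivers have their own mechanism):
:param personID => '1';
:param level => 4;
MATCH (me:person {personID: $personID})
CALL apoc.path.expandConfig(me, {uniqueness: 'NODE_GLOBAL', minLevel: $level, maxLevel: $level}) YIELD path
RETURN COUNT(path)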

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one node's stamp to the next nearest node with the minimum delta, but I didn't attempt this for fear that it would cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time to cover all the data. It would be convenient to script this in any language that has a Neo4j client.
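If APOC is installed, an alternative to manual paging (not part of the original answer, just a sketch) is to let apoc.periodic.iterate batch the MERGE in a single call:
CALL apoc.periodic.iterate(
  'MATCH (a:BOOK), (b:BOOK) WHERE a.outeseq = b.outeseq - 1 RETURN a, b',
  'MERGE (a)-[:FORWARD_SEQ]->(b)',
  {batchSize: 10000, parallel: false}
)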
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships have the delta property, the graph becomes a weighted graph. So we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (the sum of deltas) into the relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight = 0, r IN relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious whether they are expressed in an optimal way. I have a feeling that when I execute such a query, Cypher first matches everything that fits the expression (which takes a lot of memory) and only then starts to count things. I would rather go through every node, check the pattern, and update the counter if necessary. That way such queries would not require a lot of memory. So how is such a query in fact executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many relationships between nodes other than LIKES relationships, this can be quite an extensive search.
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.

No START clause VS. n = node(*)

I've read in the Neo4j 2.0 docs that the START clause is optional and
Cypher will try and infer start points from your query
I have experimentally found that
START user = node(*)
MATCH (user:User)-[r:KNOWS]-(user2:User)
RETURN user.username AS username, collect(user2.username) AS username2
gives the same results as
MATCH (user:User)-[r:KNOWS]-(user2:User)
RETURN user.username AS username, collect(user2.username) AS username2
for small data sets.
My question is: is it semantically the same? Will they always return the same result set (I'm not talking about the order)? Even for large data sets? Does skipping START guarantee traversing all nodes? If they are semantically equal, why would one ever use node(*)?
Your queries are not semantically the same, but they will always return the same result. The reason they will return the same result is that in your first query, having stated the 'universal node pattern' node(*), you immediately limit it with a further pattern in your MATCH clause. In your second query you state this narrower pattern from the start, but since the two MATCH clauses are equivalent and are the most narrow patterns declared in each query (and since the RETURN clauses are the same), the two queries return the same results.
The START clause used to be the way to state the initial pattern for a query, and it was tied up with indexing. Using node(*) or relationship(*) was rarely recommended or useful, but the clause was used for index retrievals, as in START user=node:userIndex(name="Maciej Ziarko"). With 2.0, labels and label indexing were introduced, and these are now the preferred way to bind nodes in a query.
Skipping START will not guarantee traversing all nodes (or perhaps more accurately: binding all nodes), but neither do you need a START clause to do so. Using MATCH (user) (without limiting what is bound to user with labels or relationships) you can still bind every node in your database. It is still rarely recommended or useful.
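For illustration, both of the following bind every node in the database and return the same count (a sketch):
START user = node(*)
RETURN count(user)
and
MATCH (user)
RETURN count(user)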
