Neo4j Cypher Aggregate function changes in WITH clause

I'm new to Neo4j, and having a problem with the average function.
I've got a test database of bank accounts (nodes) and payments between them (relationships).
I want to compute the average of the payments between each pair of accounts (i.e. between A & B, between A & C, between B & C, etc.), and then find any payments that are more than $50 above the average.
My code looks like this:
MATCH (a)-[r:Payment]-(b)
WITH a, b, AVG(ToFloat(r.Amount)) AS Average, ToFloat(r.Amount) as Amount
WHERE Amount-Average>50
RETURN a, b, Amount-Average AS Difference
If I just leave a and Average in the WITH clause, it seems to compute the average correctly, but if I add in anything else (either r or r.Amount), then the output of the AVG function changes and just returns the same value as Amount (so it would compute Difference as 0 for every relationship).
Could it be that the way I'm MATCHing the nodes and relationships doesn't correctly find the relationships between each pair of accounts and then average on them, which would then cause the error?
Thanks in advance!

This is a consequence of Cypher's implicit grouping when performing aggregations. The grouping key (the context over which the grouping happens) is implicit: it is formed by the non-aggregation variables present in the WITH or RETURN clause.
This is why the output changes when you include r or r.Amount: the average is then calculated per relationship, or per distinct amount, and the average of a single value is just that value.
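To make the implicit grouping concrete, here is a minimal sketch using the question's own pattern (lowercase avg()/toFloat() are just the canonical spellings of the same functions):
// Grouping key is (a, b): one average over all payment amounts per pair of accounts
MATCH (a)-[r:Payment]-(b)
WITH a, b, avg(toFloat(r.Amount)) AS Average
RETURN a, b, Average
// Grouping key is (a, b, Amount): each distinct amount forms its own group,
// so the "average" is just that single amount again
MATCH (a)-[r:Payment]-(b)
WITH a, b, r.Amount AS Amount, avg(toFloat(r.Amount)) AS Average
RETURN a, b, Amount, Average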
Since you want to evaluate and filter all amounts between the nodes based upon the average, you should collect the amounts when you take the average, and then filter/transform the contents for your return.
Also, you'll want to include a bit of filtering for a and b to ensure you don't return mirrored results (same results for the same nodes except the nodes for a and b are swapped), so we'll use a restriction on the node ids to ensure order in a single direction only:
MATCH (a)-[r:Payment]-(b)
WHERE id(a) < id(b) // ensure we don't get mirrored results
WITH a, b, AVG(ToFloat(r.Amount)) AS Average, collect(ToFloat(r.Amount)) as Amounts
WITH a, b, [amt in Amounts WHERE amt-Average > 50 | amt - Average] as Differences
RETURN a, b, Differences
If you want individual results for each row, then you can UNWIND the Differences list before you return.
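For example, a sketch of that variant, building directly on the query above:
MATCH (a)-[r:Payment]-(b)
WHERE id(a) < id(b) // avoid mirrored pairs, as above
WITH a, b, avg(toFloat(r.Amount)) AS Average, collect(toFloat(r.Amount)) AS Amounts
WITH a, b, [amt IN Amounts WHERE amt - Average > 50 | amt - Average] AS Differences
UNWIND Differences AS Difference
RETURN a, b, Difference
Note that UNWIND on an empty list produces no rows, so pairs with no payments more than 50 above their average simply drop out of the result.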

Related

Pivot Table type of query in Cypher (in one pass)

I am trying to perform the following query in one pass but I conclude that it is impossible and would furthermore lead to some form of "nested" structure which is never good news in terms of performance.
I may however be missing something here, so I thought I might ask.
The underlying data structure is a many-to-many relationship between two entities A<---0:*--->B
The end goal is to obtain how many times are objects of entity B assigned to objects of entity A within a specific time interval as a percentage of total assignments.
It is exactly this latter part of the question that causes the headache.
Entity A contains an item_date field
Entity B contains an item_category field.
The presentation of the results can be expanded to a table whose columns are the distinct item_date and rows are the different item_category normalised counts. I am just mentioning this for clarity, the query does not have to return the results in that exact form.
My Attempt:
with 12*30*24*3600 as window_length, "1980-1-1" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
with window_length, date_step, count(r) as total_count
unwind ["code_A", "code_B", "code_C"] as the_code
[MATCH THE PATTERN AGAIN TO COUNT SPECIFIC `item_code` this time]
I am finding it difficult to express this in one pass because it requires the equivalent of two independent GROUP BY-like clauses right after the definition of the graph pattern. You can't express these two in parallel, so you have to unwind them. My worry is that this leads to two evaluations: One for the total count and one for the partial count. The bit I am trying to optimise is some way of re-writing the query so that it does not have to count nodes it has "captured" before but this is very difficult with the implied way the aggregate functions are being applied to a set.
Basically, any attribute that is not inside an aggregate function becomes a stratification variable. I have to say here that a plain, simple double stratification ("grab everything, produce one level of counts by item_date, produce another level of counts by item_code") does not work for me, because there is NO WAY to control the width of the window_length. This means that I cannot compare two time periods with different rates of item_code assignments, because the time periods are not equal :(
Please note that retrieving the counts of item_code and then normalising for the sum of those particular codes within a period of time (externally to cypher) would not lead to accurate percentages because the normalisation there would be with respect to that particular subset of item_code rather than the total.
Is there a way to perform a simultaneous count of r within a time period but then (somehow) re-use the already matched a,b subsets of nodes to now evaluate a partial count of those specific b's that (b:{item_code:the_code})-[r2:RELATOB]-(a) where a.item_date...?
If not, then I am going to move to the next fastest thing which is to perform two independent queries (one for the total count, one for the partials) and then do the division externally :/ .
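For reference, that two-pass fallback might look roughly like the sketch below (label, relationship, and property names follow the question; the division partial_count / total_count is then done outside Cypher):
// Pass 1: total assignments per window
with 1*30*24*3600 as window_length, "1980-01-01" as start_date, "1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
return date_step, count(r) as total_count order by date_step
// Pass 2: per-category counts per window (same WITH/UNWIND/MATCH/WHERE as pass 1)
with 1*30*24*3600 as window_length, "1980-01-01" as start_date, "1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
return date_step, b.item_category as the_code, count(r) as partial_count order by date_step, the_code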
The solution proposed by Tomaz Bratanic in the comment is (I think) along these lines:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
unwind ["code_A","code_B","code_c"] as the_code
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_category,"s","yyyy-MM-dd")<(date_step+window_length)
return the_code, date_step, tofloat(sum(case when b.item_category=code then 1 else 0 end)/count(r)) as perc_count order by date_step asc
This:
Is working
It does exactly what I was after (after some minor modifications)
It even fills in missing values with zero, because the ELSE 0 effectively forces a zero even when no count data exists.
But in realistic conditions it is at least 30 seconds slower (no it is not, please see the edit) than what I am currently using, which re-matches. (And no, it is not because of the extra data now returned as the missing values are filled in; this is raw query time.)
I thought that it might be worth attaching the query plans here:
This is the plan for the fast way of doing it (applying the same pattern twice):
This is the plan for the slow way of doing it (performing the count in one pass):
I might look at how the time scales with the amount of input data later on; maybe the two approaches scale at different rates, but at this point the "one-pass" already seems slower than the "two-pass" and, frankly, I cannot see how it could get any faster with more data. This is already a simple count of 12 months over 3 categories distributed among approximately 18k items.
Hope this might help others too.
EDIT:
While I had done this originally, there was another modification that I did not include, where the second unwind goes AFTER the match. This slashes the time by more than 20 seconds, to below the "double match", as the unwind now affects only the return rather than causing multiple executions of the same match. The query becomes:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
unwind ["code_A","code_B","code_C"] as the_code
return the_code, date_step, tofloat(sum(case when b.item_category=the_code then 1 else 0 end))/count(r) as perc_count order by date_step asc
And here is the execution plan for it too:
Original double match: approximately 55790 ms. One pass with both unwinds BEFORE the match: 82306 ms. One pass with the second unwind after the match: 23461 ms.

Cypher query to get subsets of different node labels, with relations

Let's assume this use case:
We have a few nodes (labeled Big), each having a simple integer ID property.
Each Big node has relationships with millions of nodes (labeled Small),
such as:
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query that expresses the following natural-language request:
For each Big node in the range of ids between 4-7, get me 10 of the Small nodes that belong to it.
The expected result would give 2 Big nodes, 20 Small nodes, and 20 relationships.
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of the Small nodes that belong to them
What I've tried but failed (it only shows 1 Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
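On more recent Neo4j versions (4.x and later), the same per-row limit can be expressed with a native CALL subquery instead of APOC; a sketch, assuming the same schema as above:
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL {
  WITH b
  MATCH (s:Small)-[:BELONGS_TO]->(b)
  RETURN s LIMIT 10
}
RETURN b, s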
Your query does not work because the LIMIT applies to the entire preceding result set.
You need to use the aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
WHERE 4 < b.bigID < 7
WITH b, collect(distinct s)[..10] AS smalls
RETURN b, smalls

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row, corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000), but past that it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but I didn't run right at this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
You will need to run the query about 13 times, changing {offset} each time to cover all the data. It would be convenient to write a small script, in any language that has a Neo4j client, to do this.
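If APOC is available, an alternative to hand-rolled paging is to let apoc.periodic.iterate split the work into batched transactions (a sketch, not part of the answer above; the batch size is arbitrary):
CALL apoc.periodic.iterate(
  'MATCH (a:BOOK), (b:BOOK) WHERE a.outeseq = b.outeseq - 1 RETURN a, b',
  'MERGE (a)-[:FORWARD_SEQ]->(b)',
  {batchSize: 10000, parallel: false})
YIELD batches, total
RETURN batches, total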
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that the timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
Once relationships carry the delta property, the graph becomes a weighted graph, so we can calculate the shortest path using the deltas as weights. Then we just save the length of the shortest path (the sum of the deltas) on a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) | weight+r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: the queries above are not supposed to be fully working; they just hint at possible approaches to the problem.

Comparing or Diffing Two Graphs in neo4j

I have two disconnected graphs in a neo4j database. They are very similar networks, but one is a version of the same graph from several months later.
Is there a way that I can easily compare these two graphs to see any additions, deletes or editing that has been done to the network?
If you want a pure Cypher solution to compare the structure of two graphs, you can try the following approach (based on Mark Needham's article on creating adjacency matrices from a graph: https://markhneedham.com/blog/2014/05/20/neo4j-2-0-creating-adjacency-matrices/).
The basic idea is to construct two adjacency matrices, one for each graph to be compared with a column and row for each node identifier (business identifier, not node id), and then to perform some algebra on the two matrices to find the differences.
The problem is that if the graphs don't contain the same node identifiers then the adjacency matrices will have different dimensions, making the actual comparison harder, so the trick is to produce two identically sized matrices and populate one with the adjacency matrix from the first graph and the second with the adjacency matrix from the second graph.
Consider these two graphs:
All the nodes in Graph 1 are labeled :G1 and all the nodes in Graph 2 are labeled :G2.
Step 1 is to find all the unique node identifiers, the 'name' property in this case, from both graphs:
match (g:G1)
with collect(g.name) as g1Names
match (g:G2)
with g1Names + collect(g.name) as collectedNames
unwind collectedNames as allNames
with collect(distinct allNames) as uniqueNames
uniqueNames now contains a list of all the unique identifiers in both graphs. (It is necessary to unwind the collected names and then collect them back up because the distinct operator doesn't work on a list - there is a lot more collecting and unwinding to come!)
Next, two new lists of unique identifiers are created to represent the two dimensions of the adjacency matrix for the first graph.
unwind uniqueNames as dim1
unwind uniqueNames as dim2
Then an optional match is performed to create a Cartesian product of each node with every other node in G1, the first graph.
optional match p = (g1:G1 {name: dim1})-->(g2:G1 {name: dim2})
The matched paths will either exist or return null from the above match statement. These are now converted into a count of edges between nodes or a zero if there was no connection (the essence of the adjacency matrix). The matched paths are sorted to keep the order of rows and columns in the matrix correct when it is created. uniqueNames is passed through as it will be used to construct the adjacency matrix for the second graph.
with uniqueNames, dim1, dim2, case when p is null then 0 else count(p) end as edgeCount
order by dim1, dim2
Next, the edges are rolled up into a list of values for the second dimension
with uniqueNames, dim1 as g1DimNames, collect(edgeCount) as g1Matrix
order by g1DimNames
The whole operation above is repeated for the second graph to generate the second adjacency matrix.
with uniqueNames, g1DimNames, g1Matrix
unwind uniqueNames as dim1
unwind uniqueNames as dim2
optional match p = (g1:G2 {name: dim1})-->(g2:G2 {name: dim2})
with g1DimNames, g1Matrix, dim1, dim2, case when p is null then 0 else count(p) end as edges
order by dim1, dim2
with g1DimNames, g1Matrix, dim1 as g2DimNames, collect(edges) as g2Matrix
order by g1DimNames, g2DimNames
At this point g1DimNames and g1Matrix form a Cartesian product with the g2DimNames and g2Matrix. This product is factored by removing duplicate rows with the filter statement
with filter( x in collect([g1DimNames, g1Matrix, g2DimNames, g2Matrix]) where x[0] = x[2]) as factored
The final step is to determine the differences between the two matrices, which is just a matter of finding the rows which are different in the factored result above.
with filter(x in factored where x[1] <> x[3]) as diffs
unwind diffs as result
return result
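For convenience, here are the fragments above assembled into a single query (exactly as described; note it relies on the filter() syntax of the Cypher versions current at the time the article was written):
match (g:G1)
with collect(g.name) as g1Names
match (g:G2)
with g1Names + collect(g.name) as collectedNames
unwind collectedNames as allNames
with collect(distinct allNames) as uniqueNames
unwind uniqueNames as dim1
unwind uniqueNames as dim2
optional match p = (g1:G1 {name: dim1})-->(g2:G1 {name: dim2})
with uniqueNames, dim1, dim2, case when p is null then 0 else count(p) end as edgeCount
order by dim1, dim2
with uniqueNames, dim1 as g1DimNames, collect(edgeCount) as g1Matrix
order by g1DimNames
with uniqueNames, g1DimNames, g1Matrix
unwind uniqueNames as dim1
unwind uniqueNames as dim2
optional match p = (g1:G2 {name: dim1})-->(g2:G2 {name: dim2})
with g1DimNames, g1Matrix, dim1, dim2, case when p is null then 0 else count(p) end as edges
order by dim1, dim2
with g1DimNames, g1Matrix, dim1 as g2DimNames, collect(edges) as g2Matrix
order by g1DimNames, g2DimNames
with filter(x in collect([g1DimNames, g1Matrix, g2DimNames, g2Matrix]) where x[0] = x[2]) as factored
with filter(x in factored where x[1] <> x[3]) as diffs
unwind diffs as result
return result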
We then end up with a result that shows what is different and how:
To interpret the results: The first two columns represent a subset of the first graph's adjacency matrix and the second two columns represent the corresponding row by row adjacency matrix for the second graph. The alpha characters represent the node names and the lists of digits represent the corresponding rows in the matrix for each original column, A to G in this case.
Looking at the "A" row, we can conclude node "A" owns nodes "B" and "C" in graph 1 and node "A" owns node "B" once and node "C" twice in graph 2.
For the "D" row, node "D" does not own any nodes in graph 1 and owns nodes "F" and "G" in graph 2.
There are at least a couple of caveats to this approach:
Creating Cartesian products is slow, even in small graphs. (I have been comparing XML schemas with this technique, and comparing two graphs containing about 200 nodes each takes around 30 seconds, compared with 14ms for the example above, on my fairly modestly sized server.)
Reading the result matrix is not easy when there are more than a trivial number of nodes, as it is hard to keep track of which column corresponds to which node. (To get round this, I have exported the results to a csv and then inserted the node names (from uniqueNames) into the top row of the spreadsheet.)
I guess diffing is most easily done using a text-based tool.
One approach I can think of is to export the two subgraphs to GraphML using https://github.com/jexp/neo4j-shell-tools and then apply the regular diff from unix.
Another one would be using dump in neo4j-shell and diff the results as above.
This largely depends on what you want the diff to be of and the constraints of the graphs themselves.
If nodes and relationships have an identifier property (not the internal Neo4j ID), then you could just pull down the nodes and relationships of each graph and track which are added, removed, or changed (diff the properties).
If relationships are not uniquely identified (by a property), but nodes are, their natural key is the start node, end node and type since duplicate relationships cannot exist.
If neither have managed identifiers, but properties are immutable, then those could be compared across nodes (which could be costly), and then the relationships in the same way.
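As a rough sketch of the identifier-based approach (the uid property and the :G1/:G2 labels are hypothetical stand-ins for whatever business key and labels your graphs actually use), you could export each graph's relationships as triples and diff the two outputs externally:
// Run once per graph (swap :G1 for :G2 on the second run), then diff the two result sets.
MATCH (a:G1)-[r]->(b:G1)
RETURN a.uid AS startId, type(r) AS relType, b.uid AS endId
ORDER BY startId, relType, endId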

Neo4j Cypher - Vary traversal depth conditional on number of nodes

I have a Neo4j database (version 2.0.0) containing words and their etymological relationships with other words. I am currently able to create "word networks" by traversing these word origins, using a variable depth Cypher query.
For client-side performance reasons (these networks are visualized in JavaScript), and because the number of relationships varies significantly from one word to the next, I would like to be able to make the depth traversal conditional on the number of nodes. My query currently looks something like this:
start a=node(id)
match p=(a)-[r:ORIGIN_OF*1..5]-(b)
where not b-->()
return nodes(p)
Going to a depth of 5 usually yields very interesting results, but at times delivers far too many nodes for my client-side visualization to handle. I'd like to check against, for example, sum(length(nodes(p))) and decrement the depth if that result exceeds a particular maximum value. Or, of course, any other way of achieving this goal.
I have experimented with adding a WHERE clause to the path traversal, but this is specific to individual paths and does not allow me to sum() the total number of nodes.
Thanks in advance!
What you're looking to do isn't straightforward in a single query. Assuming you are using labels and indexing on the word property, the following query should do what you want.
MATCH p=(a:Word { word: "Feet" })-[r:ORIGIN_OF*1..5]-(b)
WHERE NOT (b)-->()
WITH reduce(pathArr =[], word IN nodes(p)| pathArr + word.word) AS wordArr
MATCH (words:Word)
WHERE words.word IN wordArr
WITH DISTINCT words
MATCH (origin:Word { word: "Feet" })
MATCH p=shortestPath((words)-[*]-(origin))
WITH words, length(nodes(p)) AS distance
RETURN words
ORDER BY distance
LIMIT 100
I should mention that this most likely won't scale to huge datasets. It will most likely take a few seconds to complete if there are 1000+ paths extending from your origin word.
The query basically does a radial distance operation by collecting all distinct nodes from your paths into a word array. Then it measures the shortest path distance from each distinct word to the origin word and orders by the closest distance and imposes a maximum limit of results, for example 100.
Give it a try and see how it performs in your dataset. Make sure to index on the word property and to apply the Word label to your applicable word nodes.
What comes to my mind is a crude optimization of the graph:
What you need to do is add a property to each node recording how many connections it has at each depth from 1 to 5, i.e.:
start a=node(id)
match (a)-[r:ORIGIN_OF*1..1]-(b)
with a, count(*) as cnt
set a.reach1 = cnt
...
start a=node(id)
match (a)-[r:ORIGIN_OF*5..5]-(b)
where not b-->()
with a, count(*) as cnt
set a.reach5 = cnt
Then, before each run of your query above, check whether reachX is below your desired number of results and run the query with [r:ORIGIN_OF*X..X].
This has some consequences: either you have to re-run this optimization each time new items or updates arrive in your db, or you must recompute the reachX properties as part of every node insert or update.
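A sketch of how that pre-check might be used (the maxNodes threshold and parameter names are hypothetical, and it assumes the reach1..reach5 properties have been computed as above):
// Pick the deepest traversal whose reach count stays under the client-side limit,
// then substitute the returned value for X in the traversal query above.
start a=node({id})
return case
  when a.reach5 <= {maxNodes} then 5
  when a.reach4 <= {maxNodes} then 4
  when a.reach3 <= {maxNodes} then 3
  when a.reach2 <= {maxNodes} then 2
  else 1 end as depth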
