Comparing or Diffing Two Graphs in Neo4j

I have two disconnected graphs in a Neo4j database. They are very similar networks, but one is a version of the same graph from several months later.
Is there a way that I can easily compare these two graphs to see any additions, deletions, or edits that have been made to the network?

If you want a pure Cypher solution to compare the structure of two graphs, you can try the following approach, based on Mark Needham's article on creating adjacency matrices from a graph: https://markhneedham.com/blog/2014/05/20/neo4j-2-0-creating-adjacency-matrices/
The basic idea is to construct two adjacency matrices, one for each graph to be compared with a column and row for each node identifier (business identifier, not node id), and then to perform some algebra on the two matrices to find the differences.
The problem is that if the graphs don't contain the same node identifiers then the adjacency matrices will have different dimensions, making the actual comparison harder, so the trick is to produce two identically sized matrices and populate one with the adjacency matrix from the first graph and the second with the adjacency matrix from the second graph.
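The matrix-padding idea can be sketched outside Cypher as well. Below is a minimal Python illustration (the edge lists are invented, loosely following the A..G example that appears later) of building two identically sized adjacency matrices over the union of node identifiers and keeping only the rows that differ:

```python
from collections import defaultdict

def adjacency_matrix(edges, names):
    """Build a |names| x |names| edge-count matrix over a fixed name order."""
    counts = defaultdict(int)
    for src, dst in edges:
        counts[(src, dst)] += 1
    return [[counts[(a, b)] for b in names] for a in names]

def diff_graphs(edges1, edges2):
    # The union of node identifiers gives both matrices the same dimensions.
    names = sorted({n for e in edges1 + edges2 for n in e})
    m1 = adjacency_matrix(edges1, names)
    m2 = adjacency_matrix(edges2, names)
    # Report only the rows that differ between the two matrices.
    return [(name, row1, row2)
            for name, row1, row2 in zip(names, m1, m2) if row1 != row2]

g1 = [("A", "B"), ("A", "C")]
g2 = [("A", "B"), ("A", "C"), ("A", "C"), ("D", "F")]
print(diff_graphs(g1, g2))
```

Each result row pairs a node's adjacency row from the first graph with the corresponding row from the second, which is exactly what the Cypher query below produces.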
Consider these two graphs:
All the nodes in Graph 1 are labeled :G1 and all the nodes in Graph 2 are labeled :G2.
Step 1 is to find all the unique node identifiers, the 'name' property in this case, from both graphs:
match (g:G1)
with collect(g.name) as g1Names
match (g:G2)
with g1Names + collect(g.name) as collectedNames
unwind collectedNames as allNames
with collect(distinct allNames) as uniqueNames
uniqueNames now contains a list of all the unique identifiers in both graphs. (It is necessary to unwind the collected names and then collect them back up because the distinct operator doesn't work on a list - there is a lot more collecting and unwinding to come!)
Next, two new lists of unique identifiers are created to represent the two dimensions of the adjacency matrix for the first graph.
unwind uniqueNames as dim1
unwind uniqueNames as dim2
Then an optional match is performed to create a Cartesian product of each node with every other node in G1, the first graph.
optional match p = (g1:G1 {name: dim1})-->(g2:G1 {name: dim2})
The matched paths will either exist or return null from the above match statement. These are now converted into a count of edges between nodes or a zero if there was no connection (the essence of the adjacency matrix). The matched paths are sorted to keep the order of rows and columns in the matrix correct when it is created. uniqueNames is passed through as it will be used to construct the adjacency matrix for the second graph.
with uniqueNames, dim1, dim2, case when p is null then 0 else count(p) end as edgeCount
order by dim1, dim2
Next, the edges are rolled up into a list of values for the second dimension.
with uniqueNames, dim1 as g1DimNames, collect(edgeCount) as g1Matrix
order by g1DimNames
The whole operation above is repeated for the second graph to generate the second adjacency matrix.
with uniqueNames, g1DimNames, g1Matrix
unwind uniqueNames as dim1
unwind uniqueNames as dim2
optional match p = (g1:G2 {name: dim1})-->(g2:G2 {name: dim2})
with g1DimNames, g1Matrix, dim1, dim2, case when p is null then 0 else count(p) end as edges
order by dim1, dim2
with g1DimNames, g1Matrix, dim1 as g2DimNames, collect(edges) as g2Matrix
order by g1DimNames, g2DimNames
At this point g1DimNames and g1Matrix form a Cartesian product with the g2DimNames and g2Matrix. This product is factored by removing duplicate rows with the filter statement:
with filter( x in collect([g1DimNames, g1Matrix, g2DimNames, g2Matrix]) where x[0] = x[2]) as factored
The final step is to determine the differences between the two matrices, which is just a matter of finding the rows which are different in the factored result above.
with filter(x in factored where x[1] <> x[3]) as diffs
unwind diffs as result
return result
We then end up with a result that shows what is different and how:
To interpret the results: The first two columns represent a subset of the first graph's adjacency matrix and the second two columns represent the corresponding row by row adjacency matrix for the second graph. The alpha characters represent the node names and the lists of digits represent the corresponding rows in the matrix for each original column, A to G in this case.
Looking at the "A" row, we can conclude node "A" owns nodes "B" and "C" in graph 1 and node "A" owns node "B" once and node "C" twice in graph 2.
For the "D" row, node "D" does not own any nodes in graph 1 and owns nodes "F" and "G" in graph 2.
There are at least a couple of caveats to this approach:
Creating Cartesian products is slow, even in small graphs. (I have been comparing XML schemas with this technique; comparing two graphs containing about 200 nodes each takes around 30 seconds, compared with 14ms for the example above, on my fairly modestly sized server.)
Reading the result matrix is not easy when there are more than a trivial number of nodes, as it is hard to keep track of which column corresponds to which node. (To get round this, I have exported the results to a csv and then inserted the node names (from uniqueNames) into the top row of the spreadsheet.)

I guess diffing is most easily done using a text-based tool.
One approach I can think of is to export the two subgraphs to GraphML using https://github.com/jexp/neo4j-shell-tools and then apply the regular diff from unix.
Another one would be using dump in neo4j-shell and diff the results as above.
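To illustrate the text-diff idea, Python's standard difflib can stand in for unix diff once the two exports exist as text (the dump lines below are invented for the example):

```python
import difflib

# Two invented text dumps of the same graph at different points in time.
dump_v1 = """(A)-[:OWNS]->(B)
(A)-[:OWNS]->(C)
""".splitlines()
dump_v2 = """(A)-[:OWNS]->(B)
(A)-[:OWNS]->(C)
(D)-[:OWNS]->(F)
""".splitlines()

# unified_diff works like `diff -u`; lines starting with +/- are changes.
for line in difflib.unified_diff(dump_v1, dump_v2, lineterm=""):
    print(line)
```

This only works well if the export is deterministic (stable ordering of nodes and relationships); otherwise, sort the dump lines before diffing.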

This largely depends on what you want the diff to be of and the constraints of the graphs themselves.
If nodes and relationships have an identifier property (not the internal Neo4j ID), then you could just pull down the nodes and relationships of each graph and track which are added, removed, or changed (diff the properties).
If relationships are not uniquely identified (by a property), but nodes are, their natural key is the start node, end node and type since duplicate relationships cannot exist.
If neither has managed identifiers, but properties are immutable, then those properties could be compared across nodes (which could be costly), and then the relationships compared in the same way.
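The identifier-based comparison amounts to a set/dict diff over the pulled-down data. A minimal Python sketch, with the property dicts invented for the example:

```python
def diff_by_id(old, new):
    """old/new map a business identifier to that node's property dict."""
    added   = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    # A node is "changed" when it exists in both but its properties differ.
    changed = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, removed, changed

old = {"A": {"size": 1}, "B": {"size": 2}}
new = {"A": {"size": 1}, "B": {"size": 3}, "C": {"size": 4}}
print(diff_by_id(old, new))  # (['C'], [], ['B'])
```

The same function works for relationships if you key them by the (start node, end node, type) natural key mentioned above.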


Neo4j: Iterating from leaf to parent AND finding common children

I've migrated my relational database to neo4j and am studying whether I can implement some functionalities before I commit to the new system. I just read two neo4j books, but unfortunately they don't cover two key features I was hoping would be more self-evident. I'd be most grateful for some quick advice on whether these things will be easy to implement or whether I should stick to sql! Thx!
Features I need are:
1) I have run a script to assign :leaf label to all nodes that are leaves in my tree. In paths between a known node and its related leaf nodes, I aim to assign to every node a level property that reflects how many hops that node is from the known node (or leaf node - whatever I can get to work most easily).
I tried:
match path=(n:Leaf)-[:R*]->(:Parent {Parent_ID: $known_value})
with n, length(nodes(path)) as hops
set n.Level2=hops;
and
match path=(n:Leaf)-[:R*]->(:Parent {Parent_ID: $known_value})
with n, path, length(nodes(path)) as hops
foreach (r IN relationships(path) |
set r.Level=hops);
The first assigns property with value of full length of path to only leaf nodes. The second assigns property with value of full length of path to all relationships in path.
Should I be using shortestpath instead, create a bogus property with value =1 for all nodes and iteratively add weight of that property?
2) I need to find the common children for a given parent node. For example, my children each [:like] lots of movies, and I would like to create [:like] relationships from myself to just the movies that my children all like in common (so if 1 of 1 likes a movie, then I like it too, but if only 2 of 3 like a movie, nothing happens).
I found a solution with three paths here:
Need only common nodes across multiple paths - Neo4j Cypher
But I need a solution that works for any number of paths (starting from 1).
3) Then I plan to start at my furthest leaf nodes, create relationships to children's movies, and move level by level toward my known node and repeat create relationships, so that the top-most grandparent likes only the movies that all children [of all children of all children...] like in common and if there's one that everybody agrees on, that's the movie the entire extended family will watch Saturday night.
Can this be done with neo4j and how hard a task is it for someone with rudimentary Cypher? This is mostly how I did it in my relational database / Should I be looking at implementing this totally differently in graph database?
Most grateful for any advice. Thanks!
1.
shortestPath() may help when your already matched start and end nodes are not the root and the leaf, in that it won't continue to look for additional paths once the first is found. If your already matched start and end nodes are the root and the leaf when the graph is a tree structure (acyclic), there's no real reason to use shortestPath().
Typically when setting something like the depth of a node in a tree, you would use length(path), so the root would be at depth 0, its children at depth 1.
Usually depth is calculated with respect to the root node and not leaf nodes (as an intermediate node may be the ancestor of multiple leaf nodes at differing distances). Taking the depth as the distance from the root makes the depths consistent.
Your approach with setting the property on relationships will be a problem, as the same relationship can be present in multiple paths for multiple leaf nodes at varying depths. Your query could overwrite the property on the same relationship over and over until the last write wins. It would be better to match down to all nodes (leave out :Leaf in the query), take the last relationship in the path, and set its depth:
MATCH path=(:Parent {Parent_ID: $known_value})<-[:R*]-()
WITH length(path) as length, last(relationships(path)) as rel
SET rel.Level = length
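The depth-from-root rule this query implements is just a breadth-first traversal. A small Python sketch of the same idea, with an invented tree:

```python
from collections import deque

def depths_from_root(children, root):
    """BFS from the root; each node's depth is its hop count from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:  # a tree, but guard against cycles anyway
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth

tree = {"root": ["a", "b"], "a": ["leaf1"], "b": ["leaf2", "leaf3"]}
print(depths_from_root(tree, "root"))
```

As in the Cypher version, the root sits at depth 0 and every node gets exactly one consistent depth, no matter how many leaves sit below it.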
2.
So if all child nodes of a parent in the tree :like a movie then the parent should :like the movie. Something like this should work:
MATCH path=(:Parent {Parent_ID: $known_value})<-[:R*0..]-(n)
WITH n, size((n)<-[:R]-()) as childCount
MATCH (n)<-[:R]-()-[:like]->(m:Movie)
WITH n, childCount, m, count(m) as movieLikes
WHERE childCount = movieLikes
MERGE (n)-[:like]->(m)
The idea here is that for a movie, if the count of that movie node equals the count of the child nodes then all of the children liked the movie (provided that a node can only :like the same movie once).
This query can't be used to build up likes from the bottom up however, the like relationships (liking personally, as opposed to liking because all children liked it) would have to be present on all nodes first for this query to work.
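The counting trick is equivalent to intersecting the children's like-sets, which is easy to see in a quick Python sketch (names invented for the example):

```python
def common_likes(child_likes):
    """Return the movies liked by every child (intersection of their like sets)."""
    sets = [set(likes) for likes in child_likes.values()]
    if not sets:
        return set()
    return set.intersection(*sets)

children = {
    "child1": ["Up", "Cars", "Brave"],
    "child2": ["Up", "Cars"],
    "child3": ["Up", "Brave"],
}
print(common_likes(children))  # {'Up'}
```

A movie liked by 2 of 3 children (like "Cars" here) drops out, matching the "childCount = movieLikes" condition in the Cypher.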
3.
In order to do a bottom-up approach, you would need to force the query to execute in a particular order, and I believe the best way to do that is to first order the nodes to process in depth order, then use apoc.cypher.doIt(), a proc in APOC Procedures which lets you execute an entire Cypher query per row, to do the calculation.
This approach should work:
MATCH path=(:Parent {Parent_ID: $known_value})<-[:R*0..]-(n)
WHERE NOT n:Leaf // leaves should have :like relationships already created
WITH n, length(path) as depth, size((n)<-[:R]-()) as childCount
ORDER BY depth DESC
CALL apoc.cypher.doIt("
MATCH (n)<-[:R]-()-[:like]->(m:Movie)
WITH n, childCount, m, count(m) as movieLikes
WHERE childCount = movieLikes
MERGE (n)-[:like]->(m)
RETURN count(m) as relsCreated",
{n:n, childCount:childCount}) YIELD value
RETURN sum(value.relsCreated) as relsCreated
That said, I'm not sure this will do what you think it will do. Or rather, it will only work the way you think it will if the only :like relationships to movies are initially set on just the leaf nodes, and (prior to running this propagation query) no other intermediate node in the tree has any :like relationship to a movie.
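The essence of the bottom-up pass is simply processing nodes deepest-first, so that children's likes are settled before their parent's are computed. A Python sketch under that assumption (tree and likes invented for the example):

```python
def propagate_likes(children, likes, order):
    """Process nodes deepest-first so children's likes exist before the parent's."""
    likes = {n: set(v) for n, v in likes.items()}
    for node in order:  # order = non-leaf nodes sorted by depth, descending
        kids = children.get(node, [])
        if kids:
            likes[node] = set.intersection(*(likes.get(k, set()) for k in kids))
    return likes

children = {"gp": ["p1", "p2"], "p1": ["c1", "c2"], "p2": ["c3"]}
leaf_likes = {"c1": {"Up", "Cars"}, "c2": {"Up"}, "c3": {"Up", "Brave"}}
order = ["p1", "p2", "gp"]  # deepest non-leaf nodes first
result = propagate_likes(children, leaf_likes, order)
print(result["gp"])
```

Processing "gp" before "p1" and "p2" would give the wrong answer, which is exactly why the Cypher version needs ORDER BY depth DESC plus a per-row execution mechanism like apoc.cypher.doIt().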

Neo4j Cypher Aggregate function changes in WITH clause

I'm new to Neo4j, and having a problem with the average function.
I've got a test database of bank accounts (nodes) and payments between them (relationships).
I want to compute the average of the payments between each pair of accounts (ie between A&B, between A&C, between B&C, etc), and then find any payments that are $50 above the average.
My code looks like this:
MATCH (a)-[r:Payment]-(b)
WITH a, b, AVG(ToFloat(r.Amount)) AS Average, ToFloat(r.Amount) as Amount
WHERE Amount-Average>50
RETURN a, b, Amount-Average AS Difference
If I just leave a and Average in the WITH clause, it seems to compute the average correctly, but if I add in anything else (either r or the r.Amount clause), then the Average function output changes, and just returns the same value as "Amount" (So it would compute "Difference" as 0 for every relationship).
Could it be that the way I'm MATCHing the nodes and relationships doesn't correctly find the relationships between each pair of accounts and then average on them, which would then cause the error?
Thanks in advance!
This is a consequence of Cypher's implicit grouping when performing aggregations. The grouping key (the context over which the grouping happens) is implicit, formed by the non-aggregation variables present on the WITH or RETURN clause.
This is why the output changes when you include r or r.Amount: you would be calculating the average with respect to a single relationship, or a single amount (and the average of a single value is that value).
Since you want to evaluate and filter all amounts between the nodes based upon the average, you should collect the amounts when you take the average, and then filter/transform the contents for your return.
Also, you'll want to include a bit of filtering for a and b to ensure you don't return mirrored results (same results for the same nodes except the nodes for a and b are swapped), so we'll use a restriction on the node ids to ensure order in a single direction only:
MATCH (a)-[r:Payment]-(b)
WHERE id(a) < id(b) // ensure we don't get mirrored results
WITH a, b, AVG(ToFloat(r.Amount)) AS Average, collect(ToFloat(r.Amount)) as Amounts
WITH a, b, [amt in Amounts WHERE amt-Average > 50 | amt - Average] as Differences
RETURN a, b, Differences
If you want individual results for each row, then you can UNWIND the Differences list before you return.
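The aggregate-then-filter pattern translates directly to other languages: compute the average once per grouping key, keep the collected values, then filter them against that average. A Python sketch with invented payment data:

```python
from statistics import mean

# Payments keyed by an (account, account) pair, mirroring the collected amounts.
payments = {("A", "B"): [10.0, 20.0, 120.0], ("A", "C"): [30.0, 35.0]}

differences = {}
for pair, amounts in payments.items():
    avg = mean(amounts)  # aggregate once per pair...
    # ...then filter the collected amounts against that average.
    differences[pair] = [amt - avg for amt in amounts if amt - avg > 50]
print(differences)
```

Filtering each amount against an average computed over only that amount (the bug in the original query) would always give a difference of 0, so nothing would ever pass the `> 50` test.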

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property, naming the original "outeseq" and the second "ineseq", to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000) but past that, it's just an endless wait. My JVM has 16g max (if it can even use it on a Windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but didn't see any reason for it to "blow up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one node's timestamp to the next nearest node with the minimum delta, but didn't run right at this for fear that it would cause a scan of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest running them page by page over all the nodes. This seems feasible because outeseq and ineseq are unique and numeric, so you can sort nodes by those properties and run the query against one slice at a time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
It will take about 13 runs of the query, changing {offset} each time, to cover all the data. It would be convenient to write a script in any language that has a Neo4j client.
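The paging arithmetic for such a driver script is trivial but easy to get wrong by one. A small Python helper (the node count is taken from the question) shows the offsets you would substitute for {offset}:

```python
def page_offsets(total, page_size):
    """Offsets to plug into the SKIP clause, one query run per page."""
    return list(range(0, total, page_size))

offsets = page_offsets(370366, 30000)
print(len(offsets), offsets[:3], offsets[-1])
```

With 370,366 nodes and a page size of 30,000, this yields 13 offsets (0, 30000, ..., 360000), matching the "about 13 runs" estimate above.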
Updating Relationship's Properties
You can assign the timestamp delta to relationships using a SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
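For reference, the delta computation is just a pairwise difference over the ordered sequence. A Python sketch with invented timestamps:

```python
def sequence_deltas(timestamps):
    """Delta between each node's timestamp and its successor's (one per FORWARD_SEQ)."""
    return [abs(b - a) for a, b in zip(timestamps, timestamps[1:])]

print(sequence_deltas([100, 140, 175, 300]))  # [40, 35, 125]
```

There is one delta per relationship, i.e. one fewer than the number of nodes in the chain.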
Chaining Nodes With Minimal Delta
Once relationships have the delta property, the graph becomes a weighted graph, so we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (sum of deltas) into a relationship between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) | weight + r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.

Limiting nodes per label

I have a graph currently containing around several thousand nodes, with each node having between two and ten relationships. If we look at a single node and its connections, they would look somewhat like this:
The nodes with alphabetical characters are category nodes. All other nodes are content nodes that have an "associated with" relationship to these category nodes, and their colour denotes which label(s) is/are attached to them. For simplicity, every node has a single label, and each node is only connected to a single other node:
Blue: Categories
Green: Scientific Publications
Orange: General Articles
Purple: Blog Posts
Now, the simplest thing I'm trying to do is getting a certain number of related content nodes for a given node. The following returns all twenty related nodes:
START n = node(1)
MATCH (n)-->(category)<--(m)
RETURN m
However, I would like to filter this to 2 nodes per label per category (and afterwards play with ordering by nodes that have multiple categories overlapping with the starting node).
Currently I'm doing this by getting the results from the above query, and then manually looping through the results, but this feels like redundant work to me.
Is there a way to do this via Neo4j's Cypher query language?
This answer extends @Stefan's original answer to return the result for all the categories, not just one of them.
START p = node(1)
MATCH (p)-->(category)<--(m)
WITH category, labels(m) as label, collect(m)[0..2] as nodes
UNWIND label as lbl
UNWIND nodes AS n
RETURN category, lbl, n
To facilitate manual verification of the results, you can also add this line to the end, to sort the results. (This sorting should probably not be in your final code, unless you really need sorted results and are willing to expend the extra computing time):
ORDER BY id(category), lbl
Cypher has a labels function returning an array with all labels for a given node. Assuming you only have exactly one label per m node the following approach could work:
START n = node(1)
MATCH (n)-->(category)<--(m)
WITH labels(m)[0] as label, collect[m][0..2] as nodes
UNWIND nodes as n
RETURN n
The WITH statement builds up a separate collection of all nodes sharing the same label. Using the subscript operator [0..2], the collection keeps just the first two elements. UNWIND then converts the collection into separate rows for the result. From here on you can apply ordering.
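The "two per label per category" rule is a grouped truncation, which may be clearer in a Python sketch of the same logic (the rows are invented for the example):

```python
from collections import defaultdict

def two_per_label_per_category(rows):
    """rows: (category, label, node) triples; keep at most two nodes per (category, label)."""
    buckets = defaultdict(list)
    for category, label, node in rows:
        key = (category, label)
        if len(buckets[key]) < 2:  # mirrors collect(m)[0..2]
            buckets[key].append(node)
    return dict(buckets)

rows = [("A", "Blog", "n1"), ("A", "Blog", "n2"), ("A", "Blog", "n3"),
        ("A", "Article", "n4"), ("B", "Blog", "n5")]
print(two_per_label_per_category(rows))
```

Like the Cypher, this keeps the first two nodes encountered per group, so "which two" depends on the input order unless you sort first.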

Neo4j Cypher - Vary traversal depth conditional on number of nodes

I have a Neo4j database (version 2.0.0) containing words and their etymological relationships with other words. I am currently able to create "word networks" by traversing these word origins, using a variable depth Cypher query.
For client-side performance reasons (these networks are visualized in JavaScript), and because the number of relationships varies significantly from one word to the next, I would like to be able to make the depth traversal conditional on the number of nodes. My query currently looks something like this:
start a=node(id)
match p=(a)-[r:ORIGIN_OF*1..5]-(b)
where not b-->()
return nodes(p)
Going to a depth of 5 usually yields very interesting results, but at times delivers far too many nodes for my client-side visualization to handle. I'd like to check against, for example, sum(length(nodes(p))) and decrement the depth if that result exceeds a particular maximum value. Or, of course, any other way of achieving this goal.
I have experimented with adding a WHERE clause to the path traversal, but this is specific to individual paths and does not allow me to sum() the total number of nodes.
Thanks in advance!
What you're looking to do isn't straightforward in a single query. Assuming you are using labels and indexing on the word property, the following query should do what you want.
MATCH p=(a:Word { word: "Feet" })-[r:ORIGIN_OF*1..5]-(b)
WHERE NOT (b)-->()
WITH reduce(pathArr =[], word IN nodes(p)| pathArr + word.word) AS wordArr
MATCH (words:Word)
WHERE words.word IN wordArr
WITH DISTINCT words
MATCH (origin:Word { word: "Feet" })
MATCH p=shortestPath((words)-[*]-(origin))
WITH words, length(nodes(p)) AS distance
RETURN words
ORDER BY distance
LIMIT 100
I should mention that this most likely won't scale to huge datasets. It will most likely take a few seconds to complete if there are 1000+ paths extending from your origin word.
The query basically does a radial distance operation by collecting all distinct nodes from your paths into a word array. Then it measures the shortest path distance from each distinct word to the origin word and orders by the closest distance and imposes a maximum limit of results, for example 100.
Give it a try and see how it performs in your dataset. Make sure to index on the word property and to apply the Word label to your applicable word nodes.
What comes to my mind is a naive optimization of the graph: what you need to do is add information to each node showing how many connections it has at each depth from 1 to 5, i.e.:
start a=node(id)
match (a)-[r:ORIGIN_OF*1..1]-(b)
with count(*) as cnt
set a.reach1 = cnt
...
start a=node(id)
match (a)-[r:ORIGIN_OF*5..5]-(b)
where not b-->()
with count(*) as cnt
set a.reach5 = cnt
Then, before each run of your query above, check whether reachX is less than the number of results you want, and run the query with [r:ORIGIN_OF*X..X].
This has some consequences: either you have to re-run this optimization each time new items or updates happen in your db, or you must recompute the reachX properties as part of each node insert/update.
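Choosing the traversal depth from the precomputed reachX properties could then be done client-side. A Python sketch with invented reach counts:

```python
def pick_depth(reach, max_results):
    """reach[d] = node count reachable at depth d; pick the deepest affordable depth."""
    best = 1
    for depth in sorted(reach):
        if reach[depth] <= max_results:
            best = depth
    return best

reach = {1: 8, 2: 40, 3: 180, 4: 900, 5: 4200}
print(pick_depth(reach, 500))  # 3
```

With a client-side budget of 500 nodes, depth 3 (180 nodes) is the deepest traversal that stays under the limit, so the query would be run with [r:ORIGIN_OF*3..3] (or *1..3, depending on which variant you precomputed the counts for).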
