How is BFS on an adjacency list O(m+n)? - breadth-first-search

I'm trying to figure out how a BFS is O(m+n), where n is the number of vertices and m is the number of edges.
The algorithm is:
public void bfs()
{
    // BFS uses a Queue data structure
    Queue<Node> q = new LinkedList<>();
    q.add(this.rootNode);
    printNode(this.rootNode);
    rootNode.visited = true;
    while (!q.isEmpty())
    {
        Node n = q.remove();
        Node child = null;
        while ((child = getUnvisitedChildNode(n)) != null)
        {
            child.visited = true;
            printNode(child);
            q.add(child);
        }
    }
    // Clear the visited property of the nodes
    clearNodes();
}
In an adjacency list, we store the vertices in an array/hash table, and a linked list of the edges that each vertex forms with other vertices.
My main question is this: how do we implement getUnvisitedChildNode? It's clear that you mark nodes as visited, but when traversing you go through all the linked lists, so you count every edge twice, making the complexity O(2m+n), right? Does that just get reduced to O(m+n)?
Also, can we employ a similar strategy for an adjacency matrix? If I'm given a matrix of size n x n, and I want to know if a specific element exists, can I do a BFS to figure that out?
Thanks.

The O-notation "reduces" multiplicative constants to 1, so O(2m+n) gets reduced to O(m+n).
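To make the whole traversal O(m+n), getUnvisitedChildNode must not rescan a vertex's list from the start on every call. One way to see this (a sketch with made-up names, not the questioner's actual Node class) is to keep a per-vertex cursor into the adjacency list, so every list entry is examined exactly once across the whole run:

```java
import java.util.*;

// Illustrative sketch: BFS on an adjacency list where a per-vertex cursor
// plays the role of getUnvisitedChildNode. Each list entry is passed over
// at most once, so the scans cost O(2m) = O(m) in total, plus O(n) for the
// queue operations.
class BfsSketch {
    static List<Integer> bfs(List<List<Integer>> adj, int root) {
        int n = adj.size();
        boolean[] visited = new boolean[n];
        int[] cursor = new int[n];          // next unexamined neighbour per vertex
        List<Integer> order = new ArrayList<>();
        Deque<Integer> q = new ArrayDeque<>();
        visited[root] = true;
        q.add(root);
        order.add(root);
        while (!q.isEmpty()) {
            int u = q.peek();
            // "getUnvisitedChildNode": resume u's list where we left off
            while (cursor[u] < adj.get(u).size() && visited[adj.get(u).get(cursor[u])]) {
                cursor[u]++;
            }
            if (cursor[u] == adj.get(u).size()) {
                q.remove();                 // u's list is exhausted
            } else {
                int v = adj.get(u).get(cursor[u]++);
                visited[v] = true;
                order.add(v);
                q.add(v);
            }
        }
        return order;
    }
}
```

Because no cursor ever moves backwards, the two appearances of each undirected edge are together examined a constant number of times, which is exactly where the 2m comes from.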

Related

Cypher query for assigning property values in an arbitrary number of nodes

I am a Neo4J beginner, so, apologies in advance if my question is too trivial.
I am trying to create a Neo4J graph representing a set of consecutive steps in a game, as shown in this diagram.
You will see in the diagram that I start with zero points, and, at certain steps (but not in every step), additional points are accumulated.
I want to assign points to nodes that don't have points yet, according to the following principle: whenever a node does not have points, I want to assign to it a number of points equal to the points possessed by the closest previous node that has points assigned to it. In the sample diagram, step 2 would have 0 points (:Step {id: 2, points_so_far: 0}), and step 4 would have 1 point (:Step {id: 4, points_so_far: 1}). Note that there may be an arbitrary number of scoreless nodes between nodes that do have a score.
Any help in creating a respective Cypher query would be much appreciated!
Many thanks in advance!
Here is a way to do it:
match (s:Step) WHERE not exists(s.points_so_far)
match (prev:Step)<-[:HAS_PREVIOUS_STEP*]-(s) where exists(prev.points_so_far)
with s, head(collect(prev)) as prev
SET s.points_so_far = prev.points_so_far
How does it work?
First, find all nodes that have no points_so_far
match (s:Step) WHERE not exists(s.points_so_far)
With those nodes, find all previous steps that have points_so_far
match (prev:Step)<-[:HAS_PREVIOUS_STEP*]-(s) where exists(prev.points_so_far)
Get all the previous nodes with points, collect them in a list, and keep only the first one encountered
with s, head(collect(prev)) as prev
Set the value of the node to the value of the previous node
SET s.points_so_far = prev.points_so_far
Note:
This request uses a variable-length path (the * in <-[:HAS_PREVIOUS_STEP*]-), which has some performance cost.
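For intuition, the same fill-forward rule, applied to steps already in order, is a single pass over the sequence. Here is an illustrative Java sketch (my own code, not part of the Cypher answer), where null marks a step without points:

```java
import java.util.*;

// Fill-forward sketch: walk the steps in order and carry the last seen
// score into every scoreless step. null means "no points_so_far yet".
class FillForward {
    static List<Integer> fill(List<Integer> pointsSoFar) {
        List<Integer> out = new ArrayList<>();
        Integer last = 0;                   // the game starts with zero points
        for (Integer p : pointsSoFar) {
            if (p != null) last = p;        // a scored step resets the carry
            out.add(last);
        }
        return out;
    }
}
```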

Limiting nodes per label

I have a graph with currently around several thousand nodes, with each node having between two and ten relationships. If we look at a single node and its connections, they would look somewhat like this:
The nodes with alphabetical characters are category nodes. All other nodes are content nodes that have an "associated with" relationship with these category nodes, and their colour denotes which label(s) are attached to them. For simplicity, every node has a single label, and each node is only connected to a single other node:
Blue: Categories
Green: Scientific Publications
Orange: General Articles
Purple: Blog Posts
Now, the simplest thing I'm trying to do is getting a certain number of related content nodes for a given node. The following returns all twenty related nodes:
START n = node(1)
MATCH (n)-->(category)<--(m)
RETURN m
However, I would like to filter this to 2 nodes per label per category (and afterwards play with ordering by nodes that have multiple categories overlapping with the starting node).
Currently I'm doing this by getting the results from the above query, and then manually looping through the results, but this feels like redundant work to me.
Is there a way to do this via Neo4j's Cypher query language?
This answer extends @Stefan's original answer to return the result for all the categories, not just one of them.
START p = node(1)
MATCH (p)-->(category)<--(m)
WITH category, labels(m) as label, collect(m)[0..2] as nodes
UNWIND label as lbl
UNWIND nodes AS n
RETURN category, lbl, n
To facilitate manual verification of the results, you can also add this line at the end, to sort the results. (This sorting should probably not be in your final code, unless you really need sorted results and are willing to expend the extra computing time):
ORDER BY id(category), lbl
Cypher has a labels function returning an array with all labels for a given node. Assuming you only have exactly one label per m node the following approach could work:
START n = node(1)
MATCH (n)-->(category)<--(m)
WITH labels(m)[0] as label, collect(m)[0..2] as nodes
UNWIND nodes as n
RETURN n
The WITH statement builds up a separate collection of all nodes sharing the same label. Using the subscript operator [0..2], each collection keeps just its first two elements. UNWIND then converts the collection into separate rows in the result. From here on you can apply ordering.
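What the labels(m)[0] grouping and the [0..2] slice accomplish can also be sketched outside Cypher. The following Java snippet (hypothetical names, purely illustrative) groups nodes by their single label and keeps at most two per label:

```java
import java.util.*;

// Illustrative sketch of "2 nodes per label": group (label, node) pairs by
// label and keep only the first k per group, mirroring collect(m)[0..2].
class LimitPerLabel {
    static Map<String, List<String>> limit(List<Map.Entry<String, String>> labelledNodes, int k) {
        Map<String, List<String>> byLabel = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : labelledNodes) {
            List<String> bucket = byLabel.computeIfAbsent(e.getKey(), x -> new ArrayList<>());
            if (bucket.size() < k) {        // the [0..k] slice: first k per label
                bucket.add(e.getValue());
            }
        }
        return byLabel;
    }
}
```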

Comparing or Diffing Two Graphs in neo4j

I have two disconnected graphs in a neo4j database. They are very similar networks, but one is a version of the same graph from several months later.
Is there a way that I can easily compare these two graphs to see any additions, deletions, or edits that have been made to the network?
If you want a pure Cypher solution to compare the structure of two graphs, you can try the following approach (based on Mark Needham's article on creating adjacency matrices from graphs: https://markhneedham.com/blog/2014/05/20/neo4j-2-0-creating-adjacency-matrices/)
The basic idea is to construct two adjacency matrices, one for each graph to be compared with a column and row for each node identifier (business identifier, not node id), and then to perform some algebra on the two matrices to find the differences.
The problem is that if the graphs don't contain the same node identifiers then the adjacency matrices will have different dimensions, making the actual comparison harder, so the trick is to produce two identically sized matrices and populate one with the adjacency matrix from the first graph and the second with the adjacency matrix from the second graph.
Consider these two graphs:
All the nodes in Graph 1 are labeled :G1 and all the nodes in Graph 2 are labeled :G2.
Step 1 is to find all the unique node identifiers, the 'name' property in this case, from both graphs:
match (g:G1)
with collect(g.name) as g1Names
match (g:G2)
with g1Names + collect(g.name) as collectedNames
unwind collectedNames as allNames
with collect(distinct allNames) as uniqueNames
uniqueNames now contains a list of all the unique identifiers in both graphs. (It is necessary to unwind the collected names and then collect them back up because the distinct operator doesn't work on a list - there is a lot more collecting and unwinding to come!)
Next, two new lists of unique identifiers are created to represent the two dimensions of the adjacency matrix for the first graph.
unwind uniqueNames as dim1
unwind uniqueNames as dim2
Then an optional match is performed to create a Cartesian product of each node with every other node in G1, the first graph.
optional match p = (g1:G1 {name: dim1})-->(g2:G1 {name: dim2})
The matched paths will either exist or return null from the above match statement. These are now converted into a count of edges between nodes or a zero if there was no connection (the essence of the adjacency matrix). The matched paths are sorted to keep the order of rows and columns in the matrix correct when it is created. uniqueNames is passed through as it will be used to construct the adjacency matrix for the second graph.
with uniqueNames, dim1, dim2, case when p is null then 0 else count(p) end as edgeCount
order by dim1, dim2
Next, the edges are rolled up into a list of values for the second dimension.
with uniqueNames, dim1 as g1DimNames, collect(edgeCount) as g1Matrix
order by g1DimNames
The whole operation above is repeated for the second graph to generate the second adjacency matrix.
with uniqueNames, g1DimNames, g1Matrix
unwind uniqueNames as dim1
unwind uniqueNames as dim2
optional match p = (g1:G2 {name: dim1})-->(g2:G2 {name: dim2})
with g1DimNames, g1Matrix, dim1, dim2, case when p is null then 0 else count(p) end as edges
order by dim1, dim2
with g1DimNames, g1Matrix, dim1 as g2DimNames, collect(edges) as g2Matrix
order by g1DimNames, g2DimNames
At this point g1DimNames and g1Matrix form a Cartesian product with the g2DimNames and g2Matrix. This product is factored by removing duplicate rows with the filter statement
with filter( x in collect([g1DimNames, g1Matrix, g2DimNames, g2Matrix]) where x[0] = x[2]) as factored
The final step is to determine the differences between the two matrices, which is just a matter of finding the rows which are different in the factored result above.
with filter(x in factored where x[1] <> x[3]) as diffs
unwind diffs as result
return result
We then end up with a result that shows what is different and how:
To interpret the results: The first two columns represent a subset of the first graph's adjacency matrix and the second two columns represent the corresponding row by row adjacency matrix for the second graph. The alpha characters represent the node names and the lists of digits represent the corresponding rows in the matrix for each original column, A to G in this case.
Looking at the "A" row, we can conclude node "A" owns nodes "B" and "C" in graph 1 and node "A" owns node "B" once and node "C" twice in graph 2.
For the "D" row, node "D" does not own any nodes in graph 1 and owns nodes "F" and "G" in graph 2.
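The matrix construction and row comparison can also be sketched outside the database. This Java snippet (my own illustrative code, not part of the Cypher above) builds the two adjacency matrices over the shared name list and reports the rows that differ:

```java
import java.util.*;

// Sketch of the diff idea: build an adjacency matrix per graph over the same
// union of node names (so dimensions match), then compare rows. Edges are
// given as {from, to} name pairs; parallel edges increment the cell count.
class GraphDiff {
    static int[][] matrix(List<String> names, List<String[]> edges) {
        int n = names.size();
        int[][] m = new int[n][n];
        for (String[] e : edges) {
            m[names.indexOf(e[0])][names.indexOf(e[1])]++;
        }
        return m;
    }

    // Returns the names whose outgoing-edge row differs between the graphs.
    static List<String> diffRows(List<String> names, List<String[]> g1, List<String[]> g2) {
        int[][] m1 = matrix(names, g1), m2 = matrix(names, g2);
        List<String> diffs = new ArrayList<>();
        for (int i = 0; i < names.size(); i++) {
            if (!Arrays.equals(m1[i], m2[i])) diffs.add(names.get(i));
        }
        return diffs;
    }
}
```

Using the "A" row example above: graph 1 has A→B and A→C once each, graph 2 has A→C twice, so only row "A" is reported.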
There are at least a couple of caveats to this approach:
Creating Cartesian products is slow, even in small graphs. (I have been comparing XML schemas with this technique, and comparing two graphs containing about 200 nodes each takes around 30 seconds, compared with 14 ms for the example above, on my fairly modestly sized server.)
Reading the result matrix is not easy when there is more than a trivial number of nodes, as it is hard to keep track of which column corresponds to which node. (To get round this, I have exported the results to a CSV and then inserted the node names (from uniqueNames) into the top row of the spreadsheet.)
I guess diffing is most easily done using a text-based tool.
One approach I can think of is to export the two subgraphs to GraphML using https://github.com/jexp/neo4j-shell-tools and then apply the regular diff from unix.
Another one would be using dump in neo4j-shell and diff the results as above.
This largely depends on what you want the diff to be of and the constraints of the graphs themselves.
If nodes and relationships have an identifier property (not the internal Neo4j ID), then you could just pull down the nodes and relationships of each graph and track which are added, removed, or changed (diff the properties).
If relationships are not uniquely identified (by a property), but nodes are, their natural key is the start node, end node and type since duplicate relationships cannot exist.
If neither has managed identifiers, but properties are immutable, then those could be compared across nodes (which could be costly), and then the relationships compared in the same way.

Neo4j Cypher - Vary traversal depth conditional on number of nodes

I have a Neo4j database (version 2.0.0) containing words and their etymological relationships with other words. I am currently able to create "word networks" by traversing these word origins, using a variable depth Cypher query.
For client-side performance reasons (these networks are visualized in JavaScript), and because the number of relationships varies significantly from one word to the next, I would like to be able to make the depth traversal conditional on the number of nodes. My query currently looks something like this:
start a=node(id)
match p=(a)-[r:ORIGIN_OF*1..5]-(b)
where not b-->()
return nodes(p)
Going to a depth of 5 usually yields very interesting results, but at times delivers far too many nodes for my client-side visualization to handle. I'd like to check against, for example, sum(length(nodes(p))) and decrement the depth if that result exceeds a particular maximum value. Or, of course, any other way of achieving this goal.
I have experimented with adding a WHERE clause to the path traversal, but this is specific to individual paths and does not allow me to sum() the total number of nodes.
Thanks in advance!
What you're looking to do isn't straightforward in a single query. Assuming you are using labels and indexing on the word property, the following query should do what you want.
MATCH p=(a:Word { word: "Feet" })-[r:ORIGIN_OF*1..5]-(b)
WHERE NOT (b)-->()
WITH reduce(pathArr =[], word IN nodes(p)| pathArr + word.word) AS wordArr
MATCH (words:Word)
WHERE words.word IN wordArr
WITH DISTINCT words
MATCH (origin:Word { word: "Feet" })
MATCH p=shortestPath((words)-[*]-(origin))
WITH words, length(nodes(p)) AS distance
RETURN words
ORDER BY distance
LIMIT 100
I should mention that this most likely won't scale to huge datasets. It will most likely take a few seconds to complete if there are 1000+ paths extending from your origin word.
The query basically does a radial distance operation by collecting all distinct nodes from your paths into a word array. Then it measures the shortest path distance from each distinct word to the origin word and orders by the closest distance and imposes a maximum limit of results, for example 100.
Give it a try and see how it performs in your dataset. Make sure to index on the word property and to apply the Word label to your applicable word nodes.
What comes to my mind is a crude optimization of the graph:
What you need to do is add to each node a property showing how many connections it has at each depth from 1 to 5, i.e.:
start a=node(id)
match (a)-[r:ORIGIN_OF*1..1]-(b)
with count(*) as cnt
set a.reach1 = cnt
...
start a=node(id)
match (a)-[r:ORIGIN_OF*5..5]-(b)
where not b-->()
with count(*) as cnt
set a.reach5 = cnt
Then, before each run of your query above, check whether reachX < the number of results you want, and run the query with [r:ORIGIN_OF*X..X].
This has some consequences: either you have to rerun this optimization each time new items or updates land in your db, or after each new or updated node you must recompute its reachX properties as part of the update.
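As a client-side alternative to precomputed reachX properties, you could also probe depths top-down and shrink the depth until the node count fits under your cap. A sketch of that idea in Java (invented names, an in-memory adjacency list standing in for the database):

```java
import java.util.*;

// Sketch: breadth-limited reachability, then iterative shrinking of the
// depth until the reached-node count is at or under a display cap.
class DepthLimiter {
    static Set<Integer> reach(Map<Integer, List<Integer>> adj, int start, int depth) {
        Set<Integer> seen = new HashSet<>();
        seen.add(start);
        List<Integer> frontier = List.of(start);
        for (int d = 0; d < depth; d++) {
            List<Integer> next = new ArrayList<>();
            for (int u : frontier)
                for (int v : adj.getOrDefault(u, List.of()))
                    if (seen.add(v)) next.add(v);   // only newly seen nodes expand
            frontier = next;
        }
        return seen;
    }

    static Set<Integer> reachUnderCap(Map<Integer, List<Integer>> adj, int start,
                                      int maxDepth, int cap) {
        for (int d = maxDepth; d > 1; d--) {
            Set<Integer> r = reach(adj, start, d);
            if (r.size() <= cap) return r;          // first depth that fits wins
        }
        return reach(adj, start, 1);                // fall back to direct neighbours
    }
}
```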

Singly connected Graph?

A singly connected graph is a directed graph which has at most 1 path from u to v ∀ u,v.
I have thought of the following solution:
Run DFS from any vertex.
Now run DFS again, but this time starting from the vertices in order of decreasing finish time. Run this DFS only for vertices which were not visited in some previous DFS. If we find a cross edge in the same component or a forward edge, then the graph is not singly connected.
If all vertices are finished and there is no such cross or forward edge, then the graph is singly connected.
O(V+E)
Is this right? Or is there a better solution.
Update: at most 1 simple path.
A graph is not singly connected if one of the two following conditions satisfies:
In the same component, when you do the DFS, you get a road from a vertex to another vertex that has already finished its search (when it is marked BLACK)
When a node points to >=2 vertices in another component: if the 2 vertices have a connection, then it is not singly connected. But this requires you to keep a depth-first forest.
A singly connected component is any directed graph belonging to the same entity.
It may not necessarily be a DAG and can contain a mixture of cycles.
Every node has at least some link (incoming or outgoing) with at least one other node in the same component.
All we need to do is to check whether such a link exists for the same component.
Singly Connected Component could be computed as follows:
Convert the graph into its undirected equivalent
Run DFS and set the common leader of each node
Run an iteration over all nodes.
If all the nodes have the same common leader, the undirected version of the graph is singly connected.
Else, it consists of multiple singly connected subgraphs represented by their corresponding leaders.
Is this right?
No, it's not right. Consider the following graph, which is not singly connected. The first component comes from a DFS beginning with vertex b, and the second component comes from a DFS beginning with vertex a.
The right one:
Do the DFS; the graph is singly connected if all three of the following conditions are satisfied:
no forward edges
no cross edges in the same component
no more than 1 cross edge between any two components
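The three conditions above can be checked in a single DFS pass by classifying edges with discovery times and DFS-tree membership. Here is a sketch (my own code, implementing the test exactly as stated, with invented names):

```java
import java.util.*;

// Sketch: DFS edge classification for the singly-connected test described
// above. An edge u->v to an already-finished (black) vertex is a forward
// edge if v was discovered after u, otherwise a cross edge; cross edges are
// rejected inside one DFS tree, and between two trees only one is allowed.
class SinglyConnected {
    static boolean check(List<List<Integer>> adj) {
        int n = adj.size();
        int[] disc = new int[n], comp = new int[n];
        int[] state = new int[n];                   // 0 white, 1 gray, 2 black
        int[] time = {0};
        Set<Long> crossPairs = new HashSet<>();     // ordered tree pairs already linked
        for (int r = 0; r < n; r++)
            if (state[r] == 0 && !dfs(adj, r, r, disc, comp, state, time, crossPairs))
                return false;
        return true;
    }

    static boolean dfs(List<List<Integer>> adj, int u, int root, int[] disc, int[] comp,
                       int[] state, int[] time, Set<Long> crossPairs) {
        state[u] = 1; disc[u] = time[0]++; comp[u] = root;
        for (int v : adj.get(u)) {
            if (state[v] == 0) {
                if (!dfs(adj, v, root, disc, comp, state, time, crossPairs)) return false;
            } else if (state[v] == 2) {
                if (disc[u] < disc[v]) return false;            // forward edge
                if (comp[v] == root) return false;              // cross edge, same tree
                long pair = (long) root * adj.size() + comp[v]; // cross edge between trees
                if (!crossPairs.add(pair)) return false;        // a second one: two paths
            }
            // state[v] == 1 is a back edge: a cycle, but still one simple path
        }
        state[u] = 2;
        return true;
    }
}
```

On the diamond graph a→b, a→c, b→d, c→d (two simple paths from a to d), the second edge into d is caught as a same-tree cross edge; a plain chain passes.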
