How to find distinct nodes in a Neo4j/Cypher query


I'm trying to do some pattern matching in neo4j/cypher and I came across this issue:
There are two types of graphs I want to search for:
Star graphs: A graph with one center node and multiple outgoing relationships.
n-length line graphs: A line graph with length n where none of the nodes are repeats (I have some bidirectional edges and cycles in my graph)
So the main problem is that when I do something such as:
MATCH (a)-->(b), (a)-->(c), (a)-->(d)
MATCH (a)-->(b)-->(c)-->(d)
Cypher doesn't guarantee (when I tried it) that a, b, c, and d are all different nodes. For small graphs, this can easily be fixed with
WHERE not(a=b) AND not(a=c) AND ...
But I'm trying to match graphs of size 10+, so checking equality between all pairs of nodes isn't a viable option. Afaik, RETURN DISTINCT does not work either, since it doesn't check equality among variables, only across different rows. Is there any simple way I can specify the query to make the differently named nodes distinct?

Old question, but look to APOC Path Expander procedures for how to address these kinds of use cases, as you can change the traversal uniqueness behavior for expansion (the same way you can when using the traversal API...which these procedures use).
Cypher implicitly uses RELATIONSHIP_PATH uniqueness: within each returned path a relationship must be unique, so it cannot be traversed more than once in a single path.
While this is good for queries where you need all possible paths, it's not a good fit for queries where you want distinct nodes or a subgraph or to prevent repeating nodes in a path.
For an n-length path, let's say depth 6 with only outgoing relationships of any type, we can change the uniqueness to NODE_PATH, where a node must be unique per path, no repeats in a path:
MATCH (n)
WHERE id(n) = 12345
CALL apoc.path.expandConfig(n, {maxLevel:6, uniqueness:'NODE_PATH'}) YIELD path
RETURN path
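To make the NODE_PATH rule concrete, here is a minimal Python sketch of the same idea on an in-memory adjacency list. It is only an illustration of "no node repeats within a path", not how APOC is implemented, and it does not try to reproduce APOC's exact output; the function name and the toy graph are hypothetical.

```python
from typing import Dict, List


def node_unique_paths(graph: Dict[str, List[str]], start: str, max_level: int) -> List[List[str]]:
    """Enumerate outgoing paths from start (1..max_level hops) in which no node repeats."""
    results: List[List[str]] = []

    def expand(path: List[str]) -> None:
        if len(path) - 1 == max_level:  # path already has max_level hops
            return
        for nxt in graph.get(path[-1], []):
            if nxt in path:  # node already on this path: prune (NODE_PATH-style rule)
                continue
            new_path = path + [nxt]
            results.append(new_path)
            expand(new_path)

    expand([start])
    return results


# Hypothetical toy graph with a cycle a -> b -> c -> a and a bidirectional edge b <-> d.
toy = {"a": ["b"], "b": ["c", "d"], "c": ["a"], "d": ["b"]}
paths = node_unique_paths(toy, "a", 6)
print(paths)  # [['a', 'b'], ['a', 'b', 'c'], ['a', 'b', 'd']]
```

Even with a depth limit of 6, the cycle and the bidirectional edge never produce a path that revisits a node, which is the behavior the question asks for.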
If you want all reachable nodes up to a certain depth (or at any depth by omitting maxLevel), you can use NODE_GLOBAL uniqueness, or instead just use apoc.path.subgraphNodes():
MATCH (n)
WHERE id(n) = 12345
CALL apoc.path.subgraphNodes(n, {maxLevel:6}) YIELD node
RETURN node
NODE_GLOBAL uniqueness means that a node must be unique across all paths: it will only be visited once, and there will only be one path to a node from a given start node. This keeps the number of paths that need to be evaluated down significantly, but as a consequence not all relationships will be traversed: a relationship that expands to an already visited node is skipped.
You will not get relationships back with this procedure (you can use apoc.path.spanningTree() for that, although as previously mentioned not all relationships will be included, as we will only capture a single path to each node, not all possible paths to nodes). If you want all nodes up to a max level and all possible relationships between those nodes, then use apoc.path.subgraphAll():
MATCH (n)
WHERE id(n) = 12345
CALL apoc.path.subgraphAll(n, {maxLevel:6}) YIELD nodes, relationships
RETURN nodes, relationships
Richer options exist for label and relationship filtering, or filtering (whitelist, blacklist, endnode, terminator node) based on lists of pre-matched nodes.
We also support repeating sequences of relationships or node labels.
If you need filtering by node or relationship properties during expansion, then this won't be a good option, as that feature is not yet supported.

Related

Retrieve All Nodes That Can Be Reached By A Specific Node In A Directed Graph

Given a graph in Neo4j that is directed (but possible to have cycles), how can I retrieve all nodes that are reachable from a specific node with Cypher?
(Also: how long can I expect a query like this to take if my graph has 2 million nodes, and by extension 48 million nodes? A rough gauge will do eg. less than a minute, few minutes, an hour)

Cypher's uniqueness behavior is that relationships must be unique per path (each relationship can only be traversed once per path), but this isn't efficient for these kinds of use cases, where the goal is instead to find distinct nodes, so a node should only be visited once total (across all paths, not per path).
There are some path expander procedures in the APOC Procedures library that are directed at these use cases.
If you're trying to find all reachable nodes from a starting node, traversing relationships in either direction, you can use apoc.path.subgraphNodes() like so, using the movies graph as an example:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {}) YIELD node
RETURN node
If you only wanted reachable nodes going a specific direction (let's say outgoing) then you can use a relationshipFilter to specify this. You can also add in the type too if that's important, but if we only wanted reachable via any outgoing relationship the query would look like:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'>'}) YIELD node
RETURN node
In either case these approaches should work better than with Cypher alone, especially in any moderately connected graph, as there will only ever be a single path considered for every reachable node (alternate paths to an already visited node will be pruned, cutting down on the possible paths to explore during traversal, which is efficient as we don't care about these alternate paths for this use case).
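The pruning described above is essentially a breadth-first search with a global visited set. Here is a rough Python sketch over an in-memory adjacency list, following outgoing edges only; the function name and toy graph are made up for illustration and say nothing about APOC's internals.

```python
from collections import deque
from typing import Dict, List


def reachable_nodes(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Return nodes reachable from start via outgoing edges, each visited once."""
    visited = {start}
    order: List[str] = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, []):
            if nxt in visited:  # alternate path to a known node: prune it
                continue
            visited.add(nxt)
            order.append(nxt)
            queue.append(nxt)
    return order


# Hypothetical directed graph with a cycle (d -> a) and two routes to d.
movie_toy = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"], "e": ["a"]}
print(reachable_nodes(movie_toy, "a"))  # ['b', 'c', 'd']
```

Each node is enqueued at most once and each edge is examined at most once, so the work grows with the size of the reachable subgraph rather than with the number of distinct paths through it.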

Have a look here, where an algorithm is used for community detection.
You can use something like
match (n:Movie {title:"The Matrix"})-[r*1..50]-(m) return distinct id(m)
but that is slow (tested on the Neo4j movie dataset with 60k nodes, the query above already runs for more than 10 minutes). Memory usage will probably become an issue when you have a dataset consisting of millions of nodes. Beyond that, it also depends on how your dataset is connected, e.g. the number of relationships.

Neo4j: Iterating from leaf to parent AND finding common children

I've migrated my relational database to neo4j and am studying whether I can implement some functionalities before I commit to the new system. I just read two neo4j books, but unfortunately they don't cover two key features I was hoping would be more self-evident. I'd be most grateful for some quick advice on whether these things will be easy to implement or whether I should stick to sql! Thx!
Features I need are:
1) I have run a script to assign :leaf label to all nodes that are leaves in my tree. In paths between a known node and its related leaf nodes, I aim to assign to every node a level property that reflects how many hops that node is from the known node (or leaf node - whatever I can get to work most easily).
I tried:
match path=(n:Leaf)-[:R*]->(:Parent {Parent_ID: $known_value})
with n, length(nodes(path)) as hops
set n.Level2=hops;
and
match path=(n:Leaf)-[:R*]->(:Parent {Parent_ID: $known_value})
with n, path, length(nodes(path)) as hops
foreach (n IN relationships (path) |
set n.Level=hops);
The first assigns property with value of full length of path to only leaf nodes. The second assigns property with value of full length of path to all relationships in path.
Should I be using shortestpath instead, create a bogus property with value =1 for all nodes and iteratively add weight of that property?
2) I need to find the common children for a given parent node. For example, my children each [:like] lots of movies, and I would like to create [:like] relationships from myself to just the movies that my children all like in common (so if 1 of 1 likes a movie, then I like it too, but if only 2 of 3 like a movie, nothing happens).
I found a solution with three paths here:
Need only common nodes across multiple paths - Neo4j Cypher
But I need a solution that works for any number of paths (starting from 1).
3) Then I plan to start at my furthest leaf nodes, create relationships to children's movies, and move level by level toward my known node and repeat create relationships, so that the top-most grandparent likes only the movies that all children [of all children of all children...] like in common and if there's one that everybody agrees on, that's the movie the entire extended family will watch Saturday night.
Can this be done with neo4j and how hard a task is it for someone with rudimentary Cypher? This is mostly how I did it in my relational database / Should I be looking at implementing this totally differently in graph database?
Most grateful for any advice. Thanks!

1.
shortestPath() may help when your already matched start and end nodes are not the root and the leaf, in that it won't continue to look for additional paths once the first is found. If your already matched start and end nodes are the root and the leaf when the graph is a tree structure (acyclic), there's no real reason to use shortestPath().
Typically when setting something like the depth of a node in a tree, you would use length(path), so the root would be at depth 0, its children at depth 1.
Usually depth is calculated with respect to the root node and not leaf nodes (as an intermediate node may be the ancestor of multiple leaf nodes at differing distances). Taking the depth as the distance from the root makes the depths consistent.
Your approach with setting the property on relationships will be a problem, as the same relationship can be present in multiple paths for multiple leaf nodes at varying depths. Your query could overwrite the property on the same relationship over and over until the last write wins. It would be better to match down to all nodes (leave out :Leaf in the query), take the last relationship in the path, and set its depth:
MATCH path=(:Parent {Parent_ID: $known_value})<-[:R*]-()
WITH length(path) as length, last(relationships(path)) as rel
SET rel.Level = length
2.
So if all child nodes of a parent in the tree :like a movie then the parent should :like the movie. Something like this should work:
MATCH path=(:Parent {Parent_ID: $known_value})<-[:R*0..]-(n)
WITH n, size((n)<-[:R]-()) as childCount
MATCH (n)<-[:R]-()-[:like]->(m:Movie)
WITH n, childCount, m, count(m) as movieLikes
WHERE childCount = movieLikes
MERGE (n)-[:like]->(m)
The idea here is that for a movie, if the count of that movie node equals the count of the child nodes then all of the children liked the movie (provided that a node can only :like the same movie once).
This query can't be used to build up likes from the bottom up however, the like relationships (liking personally, as opposed to liking because all children liked it) would have to be present on all nodes first for this query to work.
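The counting idea behind the query above can be checked outside the database. This is a small Python sketch, assuming each child's likes are available as a set (so a child cannot like the same movie twice); the function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def movies_liked_by_all(children_likes: Dict[str, Set[str]]) -> Set[str]:
    """Return movies whose like-count equals the number of children."""
    child_count = len(children_likes)
    counts = Counter(movie for likes in children_likes.values() for movie in likes)
    return {movie for movie, n in counts.items() if n == child_count}


# Hypothetical data: three children, only "Up" is liked by all of them.
family = {"ann": {"Alien", "Up"}, "bob": {"Up", "Heat"}, "cy": {"Up", "Alien"}}
print(movies_liked_by_all(family))  # {'Up'}
```

With a single child the count condition is met by every movie that child likes, which matches the "1 of 1" case in the question.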
3.
In order to do a bottom-up approach, you would need to force the query to execute in a particular order, and I believe the best way to do that is to first order the nodes to process in depth order, then use apoc.cypher.doIt(), a proc in APOC Procedures which lets you execute an entire Cypher query per row, to do the calculation.
This approach should work:
MATCH path=(:Parent {Parent_ID: $known_value})<-[:R*0..]-(n)
WHERE NOT n:Leaf // leaves should have :like relationships already created
WITH n, length(path) as depth, size((n)<-[:R]-()) as childCount
ORDER BY depth DESC
CALL apoc.cypher.doIt("
MATCH (n)<-[:R]-()-[:like]->(m:Movie)
WITH n, childCount, m, count(m) as movieLikes
WHERE childCount = movieLikes
MERGE (n)-[:like]->(m)
RETURN count(m) as relsCreated",
{n:n, childCount:childCount}) YIELD value
RETURN sum(value.relsCreated) as relsCreated
That said, I'm not sure this will do what you think it will do. Or rather, it will only work the way you think it will if the only :like relationships to movies are initially set on just the leaf nodes, and (prior to running this propagation query) no other intermediate node in the tree has any :like relationship to a movie.
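To see what the deepest-first ordering achieves, here is a minimal Python sketch of the same propagation on an in-memory tree. It assumes, as in the caveat above, that only leaves start out with likes; the function name, tree shape, and data are hypothetical, and the recursion stands in for the ORDER BY depth DESC step.

```python
from typing import Dict, List, Set


def propagate_likes(
    children: Dict[str, List[str]],
    leaf_likes: Dict[str, Set[str]],
    root: str,
) -> Dict[str, Set[str]]:
    """Give every non-leaf node the movies that all of its children like."""
    likes: Dict[str, Set[str]] = {node: set(m) for node, m in leaf_likes.items()}

    def visit(node: str) -> Set[str]:
        kids = children.get(node, [])
        if not kids:  # leaf: keep whatever it already likes
            return likes.setdefault(node, set())
        # Children are resolved first, so deeper levels are settled before this one.
        common = set.intersection(*(visit(kid) for kid in kids))
        likes[node] = common
        return common

    visit(root)
    return likes


# Hypothetical tree: root has children p1 and p2; p1 has leaves l1, l2; p2 has leaf l3.
tree = {"root": ["p1", "p2"], "p1": ["l1", "l2"], "p2": ["l3"]}
leaves = {"l1": {"Up", "Heat"}, "l2": {"Up", "Alien"}, "l3": {"Up", "Heat"}}
result = propagate_likes(tree, leaves, "root")
print(result["root"])  # {'Up'}
```

If an intermediate node already had its own likes, this sketch would overwrite them, which is the same caveat that applies to the Cypher approach.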

How to get all connected nodes in neo4j

I want to get list of all connected nodes starting from node 0 as shown in the diagram

Based on your comment:
I want to get a list of all the connected nodes. For example in the
above case when I search for connected nodes for 0, it should return
nodes- 1,2,3
This query will do what you want:
MATCH ({id : 0})-[*]-(connected)
RETURN connected
The above query will return all nodes connected with a node with id=0 (I'm considering that the numbers inside the nodes are values of an id property) in any depth, both directions and considering any relationship type. Take a look in the section Relationships in depth of the docs.
While this will work fine for small graphs note that this is a very expensive operation. It will go through the entire graph starting from the start point ({id : 0}) considering any relationship type. This is really not a good idea for production environments.

If you wish to match the nodes that have a relationship to another node, you can use this:
MATCH (n) MATCH (n)-[r]-() RETURN n,r
It will return you all the nodes that have a relationship to another node or nodes, irrespective of the direction of the relationship.
If you wish to add a constraint you can do it this way:
MATCH (n:Label {id:"id"}) MATCH (n)-[r]-() RETURN n,r

For larger or more heavily interconnected graphs, APOC Procedures offers a more efficient means of traversal that returns all nodes in a subgraph.
As others have already mentioned, it's best to use labels on your nodes, and add either an index or a unique constraint on the label+property for fast lookup of your starting node.
Using a label of "Label", and a parameter of idParam, a query to get nodes of the subgraph with APOC would be:
MATCH (n:Label {id:$idParam})
CALL apoc.path.subgraphNodes(n, {minLevel:1}) YIELD node
RETURN node
Nodes will be distinct, and the starting node will not be returned with the rest.
EDIT
There's currently a restriction preventing usage of minLevel in subgraphNodes(). You can either filter out the starting node yourself, or use apoc.path.expandConfig() with uniqueness:'NODE_GLOBAL' to get the same effect.

neo4j query to exclude nodes related to nodes with certain properties

I am trying to write a neo4j query where I only want to present nodes that have no relation to nodes with a specific property. One way to think of it is where two separate graphs exist where one node has the property I want to exclude. I should get a result that only contains the graph of the set of nodes not connected to the node that has the property I want to exclude. This is what the graph looks like before my query
match (n) where not (n{property:'valueIWishToExclude'})--() return n
This is what the result of the query looks like
I only want to have the four connected nodes in my results. How can I set up a query that excludes the nodes that are not connected to the node with the property I wish to exclude?

In fact you need those nodes from which there is no path to the node that should be excluded. You can use the shortestPath function and ALL predicate:
match (ex) where ex.property = 'valueIWishToExclude'
with collect(ex) as exn
match (n) where (not n.property = 'valueIWishToExclude') and
ALL(e in exn where shortestPath( (n)-[*]-(e) ) is null)
return n

You are almost there, just add in the relationship in your query to only get the nodes that are related to each other
MATCH (n:label) -[:RELATED]->() where n.property<>'exclude'
RETURN n
That should return only the nodes connected to each other, as the other nodes do not have that relationship.
Let me know if that worked for you.

You may want to alter your wording a bit, what you're asking for in this question, and what you really want, are not the same thing.
In Neo4j (and most graph databases), the phrase "nodes that have no relation to..." means nodes that are not connected by a relationship to the node in question.
In that context, in your right graph (assuming the one node selected is the node marked as excluded), one node would fit the criteria and be returned as a possible result: the topmost node, since it doesn't have a relationship to the node you want to exclude. It is, however, two relationships removed from the excluded node.
You seem to be asking for something else, though. You seem to want nodes that are not in the same subgraph as the node to exclude. Or, alternately, nodes that have no path to the excluded node.
Make sure on future queries you're clear about what you're asking, or you'll get answers that have no relevance to what you really want.
One approach that will work is to first find all nodes within the subgraph of the excluded node, and then return all nodes that are not in those subgraph nodes.
You'll want to install APOC Procedures so you can make use of a fast means of obtaining nodes within the subgraph.
You'll also want to use labels in your graph, and maybe put an index on the property you're searching for as this will make your search fast. As it is now, your query must examine every node in your entire database to find nodes with the property in question, and that will become slower and slower as your graph grows.
Your query might look like this (using 'Label' as a stand-in for the node label):
MATCH (n:Label{propertyToExclude:'valueToExclude'})
CALL apoc.path.expandConfig(n, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
WITH COLLECT(DISTINCT LAST(NODES(path))) as subgraph
MATCH (n)
WHERE NOT n in subgraph
RETURN n
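The two steps of this approach (collect the excluded node's subgraph, then keep everything else) can be sketched in Python on a plain edge list. This is only an illustration of the logic; the function name and the sample graph are hypothetical, and relationships are treated as undirected, as in the query above.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def nodes_outside_subgraph(
    edges: Iterable[Tuple[int, int]],
    all_nodes: Iterable[int],
    excluded: int,
) -> Set[int]:
    """Return the nodes that have no path (in either direction) to excluded."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    # Step 1: breadth-first search with a global visited set.
    subgraph = {excluded}
    queue = deque([excluded])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current]:
            if nxt not in subgraph:
                subgraph.add(nxt)
                queue.append(nxt)

    # Step 2: everything that is not in the excluded node's subgraph.
    return set(all_nodes) - subgraph


# Hypothetical data: a square 1-2-3-4 and a separate chain 5-6-7, where 7 is excluded.
sample_edges = [(1, 2), (2, 3), (3, 4), (4, 1), (5, 6), (6, 7)]
print(nodes_outside_subgraph(sample_edges, range(1, 8), 7))  # {1, 2, 3, 4}
```

Note that node 5 is dropped even though it has no direct relationship to node 7, which is the "no path" rather than "no relationship" reading discussed above.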

How to filter results by node label in neo4j cypher?

I have a graph database that maps out connections between buildings and bus stations, where the graph contains other connecting pieces like roads and intersections (among many node types).
What I'm trying to figure out is how to filter a path down to only return specific node types. I have two related questions that I'm currently struggling with.
Question 1: How do I return the labels of nodes along a path?
It seems like a logical first step is to determine what type of nodes occur along the path.
I have tried the following:
MATCH p=(a:Building)-[:CONNECTED_TO*..5]-(b:Bus)
WITH nodes(p) AS nodes
RETURN DISTINCT labels(nodes);
However, I'm getting a type exception error that labels() expects data of type node and not Collection. I'd like to dynamically know what types of nodes are on my paths so that I can eventually filter my paths.
Question 2: How can I return a subset of the nodes in a path that match a label I identified in the first step?
Say I found that between (a:Building) and (b1:Bus) and (b2:Bus) I can expect to find (:Intersection) nodes and (:Street) nodes.
This is a simplified model of my graph:
(a:Building)--(:Street)--(:Street)--(b1:Bus)
\(:Street)--(:Intersection)--(:Street)--(b2:Bus)
I've written a MATCH statement that would look for all possible paths between (:Building) and (:Bus) nodes. What would I need to do next to filter to selectively return the Street nodes?
MATCH p=(a:Building)-[r:CONNECTED_TO*]-(b:Bus)
// Insert logic to only return (:Street) nodes from p
Any guidance on this would be greatly appreciated!

To get the distinct labels along matching paths:
MATCH p=(a:Building)-[:CONNECTED_TO*..5]-(b:Bus)
WITH NODES(p) AS nodes
UNWIND nodes AS n
WITH LABELS(n) AS ls
UNWIND ls AS label
RETURN DISTINCT label;
To return the nodes that have the Street label:
MATCH p=(a:Building)-[r:CONNECTED_TO*]-(b:Bus)
WITH NODES(p) AS nodes
UNWIND nodes AS n
WITH n
WHERE 'Street' IN LABELS(n)
RETURN n;

Cybersam's answers are good, but their output is simply a column of labels...you lose the path information completely. So if there are multiple paths from a :Building to a :Bus, the first query will only output all labels in all nodes in all patterns, and you can't tell how many paths exist, and since you lose path information, you cannot tell what labels are in some paths but not others, or common between some paths.
Likewise, the second query loses path information, so if there are multiple paths using different streets to get from a :Building to a :Bus, cybersam's query will return all streets involved in all paths. It is possible for it to output all streets in your graph, which doesn't seem very useful.
You need queries that preserve path information.
For 1, finding the distinct labels on nodes on each path I would offer this query:
MATCH p=(:Building)-[:CONNECTED_TO*..5]-(:Bus)
WITH NODES(p) AS nodes
WITH REDUCE(myLabels = [], node in nodes | myLabels + labels(node)) as myLabels
RETURN DISTINCT myLabels
For 2, this query preserves path information:
MATCH p=(:Building)-[:CONNECTED_TO*..5]-(:Bus)
WITH NODES(p) AS nodes
WITH FILTER(node in nodes WHERE (node:Street)) as pathStreets
RETURN pathStreets
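To show what "preserving path information" buys you, here is a small Python sketch that keeps one result per path instead of flattening everything into a single column. Paths are modeled as lists of (name, labels) pairs; the names, helper functions, and data are hypothetical. Unlike the REDUCE query above, which concatenates labels and may repeat them, this sketch de-duplicates labels within each path.

```python
from typing import List, Sequence, Tuple

Node = Tuple[str, Sequence[str]]  # (node name, labels on that node)


def labels_per_path(paths: List[List[Node]]) -> List[List[str]]:
    """For each path, return the sorted set of labels seen on its nodes."""
    return [sorted({label for _, labels in path for label in labels}) for path in paths]


def nodes_with_label_per_path(paths: List[List[Node]], wanted: str = "Street") -> List[List[str]]:
    """For each path, return the names of the nodes carrying the wanted label, in path order."""
    return [[name for name, labels in path if wanted in labels] for path in paths]


# Hypothetical paths mirroring the simplified model in the question.
p1 = [("a", ["Building"]), ("s1", ["Street"]), ("s2", ["Street"]), ("b1", ["Bus"])]
p2 = [("a", ["Building"]), ("s3", ["Street"]), ("i1", ["Intersection"]),
      ("s4", ["Street"]), ("b2", ["Bus"])]

print(labels_per_path([p1, p2]))
print(nodes_with_label_per_path([p1, p2]))  # [['s1', 's2'], ['s3', 's4']]
```

Because the results stay grouped by path, you can still tell that only the second path goes through an Intersection and which streets belong to which route.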
Note that these are both expensive operations, as they perform a cartesian product of all buildings and all busses, as in the queries in your description. I highly recommend narrowing down the buildings and busses you're matching upon, hopefully to very few or specific buildings at least.
I also encourage limiting how deep you're looking in your pattern. I get the idea that many, if not most, of your nodes in your graph are connected by :CONNECTED_TO relationships, and if we don't cap that to a reasonable amount, your query could be finding every single path through your entire graph, no matter how long or convoluted or nonsensical, and I don't think that's what you want.
