Neo4j - how can I filter the unique nodes before passing them to the next process

I'm pretty new to neo4j and I'm not exactly sure how I can achieve this.
Essentially I have 3 sets of nodes: Student, Pass, Mark
Student has an "ACQUIRED" relationship with the Pass node.
Student also has an "ACHIEVED" relationship with Mark.
What I want to do is find all the marks belonging to students who have passed at least once.
This is what I have so far:
MATCH (m:Mark)<-[r:ACHIEVED]-(s:Student)-[a:ACQUIRED]->(p:Pass)
WHERE p.status = 'True'
RETURN m, r, s
The problem with this is that some student nodes have passed multiple times and so they have multiple relationships with the Pass nodes. This makes it so that the marks they achieved get returned multiple times.
For example if one Student node has relationship with 4 Mark nodes and has passed twice (i.e., has relationship with 2 Pass nodes), then the returned output would be 8 Marks instead of 4 - it gets duplicated.
Is there any way to prevent this behaviour and return only unique results?

Simply add DISTINCT to the RETURN clause:
RETURN DISTINCT m, r, s
This removes the duplicate rows, so each mark is returned only once.
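To see why the rows multiply, here is a minimal Python sketch. It is plain Python, not a Neo4j API, and the student, mark, and pass names are made up. It mimics how the MATCH pattern yields one row per (mark, pass) combination for a student, and how DISTINCT collapses the repeated rows once the Pass node is no longer part of the output.

```python
from itertools import product

# Hypothetical data: one student with 4 marks who has passed twice.
student = "alice"
marks = ["m1", "m2", "m3", "m4"]
passes = ["p1", "p2"]

# The MATCH pattern produces one row per (mark, pass) combination.
rows = [(m, student, p) for m, p in product(marks, passes)]

# RETURN m, s: the pass is dropped from the output, but every row is still emitted.
returned = [(m, s) for m, s, p in rows]

# RETURN DISTINCT m, s: identical rows are collapsed into one.
distinct = sorted(set(returned))

print(len(returned), len(distinct))
```

The first count is the number of matched paths; the second is the number of unique (mark, student) rows.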

Related

NEO4J - Matching a path where middle node might exist or not

I have the following graph:
I would like to get all contractors, subcontractors and clients, starting from David.
So I thought of a query like this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Which returns the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b) such as (b:subcontractor) I won't hit millions of rows but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l
There are some things to consider with your query.
The relationship type is not specified. Is it the case that the only relationships from contractor nodes are works_for and hired? If not, you should constrain the relationship types being matched in your query. For example:
MATCH (a:contractor)-[:works_for|hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
The fact that (b) is unlabelled does not mean that every node in the graph will be matched. (b) is only reached by starting from a :contractor node and traversing zero or one relationship (works_for or hired if you specify the types, otherwise any relationship), and it must also have an outgoing works_for relationship to a :client node.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
It really depends on your use case.

Neo4j Cypher - Returning nodes and their nested nodes of the same type

I want to return a list of Item nodes, each with its list of nested Item nodes, for the Items contained within a Box. Because the relationships between an Item and its nested Items may differ (e.g. WHEELS, WINDOWS, LIGHTS), I would like a query that ignores the relationship type and returns every nested Item node together with its Item children. An Item has either at least one Item child or none (resulting in an empty children list).
I want to be able to do this with just a Box identifier (e.g. boxID) being passed.
NOTE: I'm new to Neo4j and Cypher so please reply with a (fairly) detailed answer of how the query works. I want to be able to understand how it works. Thanks!
E.g.
MATCH (iA: Item)-[r]->(iB: Item)-[r]->(b: Box)
WHERE b.boxID = $boxID
RETURN COLLECT(iB.itemID AS ItemID, ib.name as ItemName, COLLECT(iA.itemID as ItemID, iA.name as ItemName, COLLECT(...) ) AS ItemChildren)
The COLLECT(..) part confuses me. How do I return an Item node and all of its Item children, and all of each child's Item children, and so on until there are no more children? Is there a better way to MATCH all of the nodes?
That is very easy using a variable-length relationship pattern:
MATCH (b:Box)-[:CONTAINS]->(:ItemInstance)-[*]-(i:Item)
WHERE b.boxID = $boxID
RETURN COLLECT(DISTINCT i) AS ItemChildren
The DISTINCT option is needed because the variable-length relationship result can return the same item multiple times.
This query also acknowledges the relationship directionality shown in your diagram. The CONTAINS relationship pattern specifies the appropriate directionality, but the variable-length relationship (-[*]-) specifies no directionality since your data model does not use a consistent direction throughout the tree starting at an ItemInstance.
Caveat: unbounded variable-length relationships can take a very long time or even run out of memory, depending on how big your DB is and how many relationships each node has. This can be worked around by specifying a reasonable upper bound on the length.

Neo4J returns node twice when matched with other node

I want to get two sets of values from two relationships that have one node in common, and then return both sets.
I have tried the code below, but the first set, which has only one result, is duplicated because the second set has two results.
MATCH (sti:SingleTaskInstance) <- [:CONTAINS] - (cti:CollaborativeTaskInstance {cti_id: "RD1CT"})
- [:CONTAINS] -> (cti2:CollaborativeTaskInstance) return sti, cti2
Here is the result
We see that sti is duplicated, while it should be returned only once.
I have also tried using collect(distinct sti) on the set I do not want to duplicate, but it still does not work. Any suggestions are welcome.
In Cypher, you get one row of results for every possible path that matches the pattern. In your case, two paths matched the pattern, and both of them contain the same sti node, which is why it appears twice. This is by design. Results are not grouped implicitly; you need to do this yourself using aggregation functions.
If you want to collect the cti2 nodes per distinct sti node, then you'll need to collect() like so:
MATCH (sti:SingleTaskInstance) <- [:CONTAINS] - (cti:CollaborativeTaskInstance {cti_id: "RD1CT"}) - [:CONTAINS] -> (cti2:CollaborativeTaskInstance)
RETURN sti, collect(DISTINCT cti2)
We are collecting the distinct cti2 nodes just in case a cti2 node is reachable by multiple cti nodes (otherwise it might appear multiple times). When you aggregate, the non-aggregation variables become distinct, so you'll get distinct sti nodes by virtue of the aggregation.
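The grouping behaviour described above can be illustrated with a small Python sketch. This is only an analogy for how the non-aggregated column acts as a grouping key, not how Neo4j implements it; the function name and the node names are hypothetical.

```python
from collections import defaultdict

def collect_distinct(rows):
    """Mimic `RETURN key, collect(DISTINCT value)` over (key, value) rows."""
    groups = defaultdict(list)
    for key, value in rows:
        # DISTINCT: keep each value only once per grouping key.
        if value not in groups[key]:
            groups[key].append(value)
    return dict(groups)

# Hypothetical matched rows: the same sti appears once per matching path.
rows = [("sti1", "ctiA"), ("sti1", "ctiB"), ("sti1", "ctiA")]

print(collect_distinct(rows))
```

Three input rows become a single output row for sti1, holding the list of distinct cti2 values.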

How to cluster nodes together in Neo4j

My graph is 1M nodes. The data model is intentionally simple. There are Entities and IDType nodes. A single Entity may have 1:many IDType nodes. And an IDType node may be connected to 1:many Entities. This forms the graph.
The goal is to find all clusters of IDType's and Entities that are connected together into what I call a cluster of nodes (subgraph I guess some call it). Imagine if we had 1M nodes. I would like to find "clusters" like this in the graph data, I'm trying to figure out how to do that. I've written the cypher query that I believe does it, but it's not clear to me if it's doing what is intended.
The question: how do I efficiently traverse my graph and cluster together nodes so that there is a single row or group of rows that I can return as a row-based result set to my python driver program to then operate over that cluster. While this doesn't need to be the exact structure of my result, this is a sense of what I'm looking for.
cluster|nodes
1|2,3,4,5,6,7
2|10,11,12,13
3|15,17,19,20,21,25,27,28,33
Where the "cluster" is some arbitrary clustering of the list of nodes (frankly if I have a single line that's just a collection of clusters or some other way of telling they are all related, then I'm golden). The "nodes" number represents a unique integer-based property that we tag to every Entity node.
The query is below. The concept is that an "Entity" node can have 1 or many "ID" nodes and I'm trying to get all "Entity" and "ID" that are related to each other through the relationship "HAS_ID".
Conceptually, if there is a relationship that exists in the data like this Entity1-->ID1<--Entity2-->ID2<--Entity3-->ID3<--Entity4-->ID4<--Entity5 then I want to "cluster" them together so that I can create a unique number that represents this group of nodes. With my example, there are 5 entities, but there could just as easily be 2 entities, or 50 entities, which are all related to one another, that's why I'm thinking the variable length path is what I need.
Below is my attempt to do this in the graph. But 1) is it correct? 2) is it efficient, because it seems to run indefinitely? 3) how do I best "group" these together?
match
(n:Entity)-[e1:HAS_ID*]-(o)
where n.key <> o.key
return *
limit 10
;
I've also tried
match (n:Entity)-[e1:HAS_ID*]-(o)
where n.key <> o.key
with distinct n.key as key_1, o.key as key_2
return key_1, collect(key_2)
limit 100
;
This seems to do something close to what I want, but I'm still not getting a single group for a given key. In other words, I can have 5 rows returned that are all still related, where I'd rather have 1 row. Here's an example: you can see that key "49518" is on the first and second rows. I'd rather have one row that groups them all together.
49518 [49004, 49871, 49940, 50525, 49101, 49625, 50165, 50017, 49098, 50383]
49940 [49088, 49706, 50292, 50470, 49140, 49258, 49216, 49559, 50004, 50346, 49237, 49518, 49894, 49101, 49625, 50165, 50017, 49098, 50383]
Well, for one, your query doesn't match the relationship pattern you described.
Each of your arrows in your pattern is a [:HAS_ID] relationship, so if entities and IDs are always alternating between each relationship, then your current query would only match patterns like this:
(:Entity)-[:HAS_ID]->(:ID)<-[:HAS_ID]-(:Entity)-[:HAS_ID]->(:ID)<-[:HAS_ID]-(:Entity)
3 entities, 2 IDs, 4 relationships. That doesn't match your example pattern of 5 entities, 4 IDs, and 8 relationships. So at the very least, you'll want to alter your pattern to use *8.
As for efficiency...the thing you're trying to do seems rather inefficient, as it must attempt to find this pattern on every single :Entity node in your graph, trying every single :HAS_ID relationship it finds. If your entire graph is made of this same pattern of :Entity and :ID and :HAS_ID, then your query is going to be traversing your entire graph, not once but multiple times.
You are going to get duplicate results. Even if we assume that your entire graph is made up of isolated 5 entity / 4 ID / 8 relationship chains like a snake, as in your example (an entity either being at the end of the chain with one link to an ID, or somewhere in the middle with links to 2 IDs), then you'll be getting 2 matches for that same group of nodes, one matching from one end of the chain, the other matching the other end. And that's the simple case...I'm guessing your graph could be much more complex than this, allowing even more possibilities for many different patterns to match on the exact same group of nodes. A unique path using your pattern does not equate to a unique grouping of nodes.
At the very least, you'll probably want to match on a pattern and use RETURN DISTINCT NODES(p) to enforce unique sets of nodes, but I still think the matching may take quite a bit of time.
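A different approach, not part of the answer above: since the results go to a Python driver program anyway, one option is to fetch only the direct (entity key, ID key) pairs with a single-hop query (something like MATCH (n:Entity)-[:HAS_ID]->(o) RETURN n.key, o.key, assuming both node types carry a key property as in the question) and group them on the client with union-find. The sketch below assumes all pairs fit in memory; the function name and the sample data are hypothetical.

```python
from collections import defaultdict

def cluster_entities(pairs):
    """Group entity keys linked, directly or transitively, through shared IDs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the way directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Tag nodes so an entity key and an ID key with the same value never collide.
    for entity_key, id_key in pairs:
        union(("E", entity_key), ("I", id_key))

    clusters = defaultdict(list)
    for node in list(parent):
        kind, key = node
        if kind == "E":
            clusters[find(node)].append(key)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical pairs: entities 1-3 are chained through IDs "a" and "b".
pairs = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (10, "z")]
print(cluster_entities(pairs))
```

Each inner list is one cluster of entity keys, so every entity appears in exactly one row regardless of how many paths connect it to the others.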

simple cypher query unreasonably slow - what am I doing wrong?

I'm trying to get all the relationships connected to a given node that also have a property called 'name'. This is my Cypher:
MATCH (starting { number:'123' })<-[r]-() WHERE HAS(r.name) RETURN r
This is unimaginably slow! It takes Neo4j ages to compute even if there are only a few return values, and there are not many relationships connected to the node (1 to 10 at most).
Am I doing something wrong here?
Other Cypher queries work fine.
Thanks!
The number of relationships on that one node might be less relevant if you have not told Neo4j enough about your graph structure.
Firstly use labels, and secondly use indexes. The statement below creates an index on the number property of nodes labelled YourLabel.
CREATE INDEX ON :YourLabel(number)
Then hit the index to start the query, and use a type on your relationship too.
MATCH (:YourLabel{number:'123'})<-[r:RELATIONSHIP_TYPE]-()
WHERE HAS (r.name)
RETURN r
Now, instead of scanning every node for a number property with a value of 123, the query finds the starting node with a single index lookup.
To use the labels, create your nodes like this (will be added to index):
CREATE (s1:YourLabel{number:"1"})
