Filtering based on relationship properties - neo4j

Given the following graph: http://console.neo4j.org/?id=qeuv73
I'd like to design a cypher query, which will return the following nodes for 'first user':
node 7 ("dep 2")
The condition is that, for a node to be returned, the user has to have all of its dependencies. A user owns a node when there is a HAS relationship between the user and the node.
That is simple enough. The following query should do the trick:
MATCH (a:Dep)-[:REQUIRES]->(req:Dep)
WITH collect(req) AS requirements, a
MATCH (ub:Dep)<-[:HAS]-(:User)
WITH collect(ub) AS userDeps, requirements, a
WHERE ALL (req IN requirements WHERE req IN userDeps)
RETURN a
The problem starts when I want to introduce another condition: the value of the data property on the user's HAS relationship has to be equal to or greater than the data value on the corresponding REQUIRES relationship.
To put that into the example: "dep 2" fulfills both of the conditions, whereas "dep 1" does not, because it has a REQUIRES relationship to "dep 4" with data = 6, and the user's HAS relationship to that node has data = 5. The other dependency of "dep 1" is, however, fulfilled (as the levels are equal).
Could anyone help?
UPDATE:
In other words, I want to iterate through every Dep node, check all of the REQUIRES relationships of each one, and then return all the nodes whose requirements are all resolved for a particular user. Resolved means that the user has a HAS relationship with every requirement and that HAS.data >= REQUIRES.data.

Just for future reference.
If you have a task that seems too complex to handle with one (or even several combined) Cypher queries, you can extend Neo4j's functionality.
You can write your own module and deploy it with Neo4j. One way to do that is to use GraphAware Runtime.
https://github.com/graphaware/neo4j-framework/tree/master/runtime
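For what it's worth, the two conditions from the update may also be expressible in plain Cypher. The following is only a sketch and has not been run against the linked console graph. It assumes the user can be identified by a name property (hypothetical; the question does not show the property key) and that there is at most one HAS relationship per user and dependency. For each Dep it counts the REQUIRES relationships and compares that number with how many of them the user satisfies.

```cypher
// Assumption: the user is identified by a `name` property.
MATCH (u:User {name: 'first user'})
MATCH (a:Dep)-[r:REQUIRES]->(req:Dep)
// h is null when the user does not own the required node at all.
OPTIONAL MATCH (u)-[h:HAS]->(req)
WITH a,
     count(r) AS required,
     // A null h makes the comparison null, which falls through to ELSE 0.
     sum(CASE WHEN h.data >= r.data THEN 1 ELSE 0 END) AS satisfied
WHERE required = satisfied
RETURN a
```

A Dep without any REQUIRES relationship is not matched by the second MATCH and is therefore not returned, which mirrors the behaviour of the query in the question.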

Related

How to query Neo4j N levels deep with variable length relationship and filters on each level

I'm new(ish) to Neo4j and I'm attempting to build a tool that allows users on a UI to essentially specify a path of nodes they would like to query neo4j for. For each node in the path they can specify specific properties of the node and generally they don't care about the relationship types/properties. The relationships need to be variable in length because the typical use case for them is they have a start node and they want to know if it reaches some end node without caring about (all of) the intermediate nodes between the start and end.
Some restrictions the user has when building the path from the UI are that it can't have cycles, it can't have nodes that have more than one child with children, and nodes can't have more than one incoming edge. This is only enforced from their perspective, not in the query itself.
The issue I'm having is being able to specify filtering on each level of the path without getting strange behavior.
I've tried a lot of variations of my Cypher query such as breaking up the path into multiple MATCH statements, tinkering with the relationships and anything else I could think of.
Here is a Gist of a sample Cypher dump
cypher-dump
This query gives me the path that I'm trying to get; however, it doesn't specify name or type on n_four.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
RETURN path
This query is what I'd like to work; however, it leaves out the leaves at the third level, which I am having trouble understanding.
MATCH path = (n_one)-[*0..]->(n_two)-[*0..]->(n_three)-[*0..]->(n_four)
WHERE n_one.type IN ["JCL_JOB"]
AND n_two.type IN ["JCL_PROC"]
AND n_three.name IN ["INPA", "OUTA", "PRGA"]
AND n_three.type IN ["RESOURCE_FILE", "COBOL_PROGRAM"]
AND n_four.name IN ["TAB1", "TAB2", "COPYA"]
AND n_four.type IN ["RESOURCE_TABLE", "COBOL_COPYBOOK"]
RETURN path
What I've noticed is that when I "... RETURN n_four" in my query it is including nodes that are at the third level as well.
This behavior is caused by your (probably inappropriate) use of [*0..] in your MATCH pattern.
FYI:
[*0..] matches 0 or more relationships. For instance, (a)-[*0..]->(b) would succeed even if a and b are the same node (and there is no relationship from that node back to itself).
The default lower bound is 1. So [*] is equivalent to [*..] and [*1..].
Your 2 queries use the same MATCH pattern, ending in ...->(n_three)-[*0..]->(n_four).
Your first query does not specify any WHERE tests for n_four, so the query is free to return paths in which n_three and n_four are the same node. This lack of specificity is why the query is able to return 2 extra nodes.
Your second query specifies WHERE tests for n_four that make it impossible for n_three and n_four to be the same node. The query is now more picky, and so those 2 extra nodes are no longer returned.
You should not use [*0..] unless you are sure you want to optionally match 0 relationships. It can also add unnecessary overhead. And, as you now know, it also makes the query a bit trickier to understand.
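To see the zero-length case in isolation, a small sketch along these lines can be run against any graph. With both bounds set to 0 the pattern can only match a node against itself, so each returned row should show the same id on both ends and zero hops. The aliases startId, endId and hops are just illustrative names.

```cypher
// [*0..0] allows only the zero-length path, so a and b are the same node.
MATCH p = (a)-[*0..0]->(b)
RETURN id(a) AS startId, id(b) AS endId, length(p) AS hops
LIMIT 5
```

Replacing [*0..] with [*] (lower bound 1) on a hop forces the two nodes on that hop to be connected by at least one relationship.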

How to limit recursion depending on node and relationship properties for each connected node pair

Let's start from simple query that finds all coworkers recursively.
match (user:User {username: 'John'})
match (user)-[:WORKED_WITH *1..]-(coworkers:User)
return user, coworkers
Now I have to modify it in order to receive only those users that are connected through the first N relationships.
Every User node has a value of N in its properties, and every relationship has its creation date in its properties.
I suppose that it could be reasonable to create and maintain a separate set of relationships that satisfy this condition.
UPD: The limitation has to be applied only to those who know each other directly.
The limitation has to be applied to each node in the path. E.g., if the first user has 3 :WORKED_WITH relationships (on the first level) and a limit of 5, then everything is OK and we can continue to check connected users; if a user has 6 relationships and a limit of 5, only 5 of the relationships have to be used to move on.
I understand that it can be a slow query, but how can this be done without hand-written tools? One improvement would be to move all those limitations out of query execution into a preprocessing step and create an additional type of relationship that holds all of those limitations. It would require validation, because those relationships are not part of the state but a projection of the state.
The following query should work (as long as you do not have a lot of data). It uses DISTINCT to remove duplicates.
MATCH (user:User {username: 'John'})-[:WORKED_WITH*]-(coworker:User)
WITH DISTINCT user, coworker
ORDER BY coworker.createDate
RETURN COLLECT(coworker)[0..user.N] AS coworkers;
Note: since variable-length paths have exponential complexity, you would usually want to specify a reasonable upper bound (e.g., [:WORKED_WITH*..5]) to avoid the query running too long or causing an out-of-memory error. Also, since the LIMIT operator does not accept a variable as its argument, this query uses COLLECT(coworker)[0..user.N] to get the N coworkers with the earliest createDate -- which is also a bit expensive.
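As a sketch of the note above, here is the same query with an upper bound on the path length. The bound of 5 is only the example value from the note. The slice is taken in a separate step after collecting, since newer Neo4j versions may reject an expression that mixes an aggregate with a value such as user.N that is not a grouping key.

```cypher
MATCH (user:User {username: 'John'})-[:WORKED_WITH*..5]-(coworker:User)
WITH DISTINCT user, coworker
ORDER BY coworker.createDate
// Collect first, then slice, so user.N is used outside the aggregation.
WITH user, collect(coworker) AS ordered
RETURN ordered[0..user.N] AS coworkers;
```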
Now, if (as you suggested) you had created a specific type of relationship (e.g., FIRST_WORKED_WITH) between each User and its N earliest "coworkers", that would allow you to use the following trivial and fast query:
MATCH (user:User {username: 'John'})-[:FIRST_WORKED_WITH]->(coworker:User)
RETURN coworker;
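One way such relationships might be precomputed is sketched below. It assumes the creation date is stored on each WORKED_WITH relationship under a createDate property (the question says relationships carry a creation date but does not name the key) and that the limit is stored on the User node as N. FIRST_WORKED_WITH is the illustrative type name used above.

```cypher
MATCH (u:User)-[r:WORKED_WITH]-(other:User)
WITH u, other, r
ORDER BY r.createDate
// Per user, the direct neighbours in order of relationship creation.
WITH u, collect(other) AS ordered
// Keep only the first N and materialise them.
UNWIND ordered[0..u.N] AS first
MERGE (u)-[:FIRST_WORKED_WITH]->(first);
```

Because these relationships are derived data, the statement would have to be re-run (after deleting the old FIRST_WORKED_WITH relationships) whenever WORKED_WITH relationships or N values change.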

How to Model a relationship that adds a feature to a node?

This is a follow-up to this earlier question
How to model two nodes related through a third node in neo4j?
If the capabilities of a product are enhanced by a connects_to relationship with another product, how should that fact be captured?
Example: given
(shelf:Shelf {maxload:20}), if (node:L-bracket)-[connects-to]->(shelf), then shelf's maxload increases by 10. Now, if someone queries for a Shelf that supports maxload=30, I should be able to retrieve this combination of L-Bracket+Shelf as an option, in addition to the shelves that support maxload without L-bracket. This is one use-case.
The other is when the connects_to relationship adds an entirely new property to the Shelf node. The option I'm thinking of is adding a property to the relationship called 'provides feature' and then querying those as well when returning nodes, to see if a product has been enhanced by any of its connections.
Part 1 :
I should be able to retrieve this combination of L-Bracket+Shelf as
an option, in addition to the shelves that support maxload without
L-bracket.
This use case is handled with OPTIONAL MATCH :
MATCH (shelf:Shelf {maxload:30})
OPTIONAL MATCH (shelf)<-[:CONNECTS_TO]-(bracket:`L-Bracket`)
RETURN shelf, collect(bracket) as brackets
This returns a list of shelves with a collection of brackets for each of them (an empty collection if a shelf has no brackets). Note that a label containing a hyphen, such as L-Bracket, has to be quoted with backticks.
Part 2 :
The other is when the connects_to relationship adds an entirely new
property to the Shelf node. The option I'm thinking of is adding a
property to the relationship called 'provides feature' and then querying
those as well when returning nodes, to see if a product has been
enhanced by any of its connections.
You can simply use a PROVIDES_FEATURE relationship type; there is no need for a property on it. You can query for such relationships in the same way as in part 1.
To be a bit more general, suppose everything that can be connected to a shelf (not just an L-Bracket) was represented by an Accessory node that has type and extraLoad properties, like this:
(:Accessory {type: 'L-Bracket', extraLoad: 10})
This would allow accessories of different types and with differing extra load capacities.
With this model, you could find all Shelf/Accessory combinations that can hold a load of at least 30 this way:
MATCH (shelf:Shelf)
OPTIONAL MATCH (shelf)<-[:CONNECTS_TO]-(x:Accessory)
WITH shelf, COLLECT(x) AS accessories, SUM(x.extraLoad) AS extra
WHERE shelf.maxLoad + extra >= 30
RETURN shelf, accessories;
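To try the query above, a few nodes like the following would do. The names and values are made up for illustration. Note that property keys are case-sensitive, so the data has to use maxLoad exactly as the query spells it (the question wrote maxload).

```cypher
// Hypothetical sample data for the Accessory model.
CREATE (:Shelf {name: 'strong', maxLoad: 30}),
       (weak:Shelf {name: 'weak', maxLoad: 20}),
       (:Shelf {name: 'bare', maxLoad: 20}),
       (:Accessory {type: 'L-Bracket', extraLoad: 10})-[:CONNECTS_TO]->(weak);
```

With this data the query should return the 'strong' shelf with an empty accessories list and the 'weak' shelf together with its L-Bracket, while 'bare' is filtered out because 20 plus no extra load stays below 30.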

Cypher query optimisation - Utilising known properties of nodes

Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse, created with new TestGraphDatabaseFactory().newImpermanentDatabase().
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.
I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
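A restructured version along those lines might look like the sketch below. It is not the asker's generated output: it replaces START with a MATCH on id() (START is no longer available in later Neo4j versions), keeps nvari connected to nvarg as in the first query, and uses <> for the single-value test. Node id 5 and all labels are taken from the question.

```cypher
// Bind the known node first (node id 5 comes from the question).
MATCH (nvari:C4) WHERE id(nvari) = 5
MATCH (nvari)-[:IN_LOCATION]->(nvarg:LOCATION)
// Apply the name restrictions as soon as the variables are bound.
MATCH (nvarj:C2)-[:IN_LOCATION]->(nvarg)
WHERE nvarj.Name <> 'nf'
MATCH (nvarm:C3)-[:IN_LOCATION]->(nvarg)
WHERE NOT (nvarm.Name IN ['nb', 'nj'])
// Expand the rest of the pattern outwards from nvarg to the target.
MATCH (nvarg)-[:CONNECTED]->(nvarf:LOCATION)<-[:CONNECTED]-(nvare:LOCATION),
      (nvarg)-[:CONNECTED]->(nvarh:LOCATION),
      (nvare)-[:CONNECTED]->(nvark:LOCATION),
      (nvard:LOCATION)-[:CONNECTED]->(nvare),
      (nvard)-[:CONNECTED]->(nvarc:LOCATION)<-[:CONNECTED]-(nvara:LOCATION),
      (nvara)-[:CONNECTED]->(nvarb:LOCATION),
      (target:C5)-[:IN_LOCATION]->(nvara)
RETURN DISTINCT target
```

One caveat: Cypher only guarantees that a relationship is used once within a single MATCH clause, so splitting a pattern across several MATCH clauses can admit matches that the single-MATCH form would exclude.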
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAL MATCH clauses, you'd essentially be saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.

Cypher query to find a node based on a regexp on a property

I have a Neo4j database with COMPANY nodes and RELATED edges. COMPANY has a CODE property. I want to query the database and get the first node that matches the regexp COMPANY.CODE =~ '12345678.*', i.e., a COMPANY whose CODE starts with a given 8-character string literal.
After several attempts, the best I could come up with was the following query:
START p=node(*) where p.CODE =~ '12345678.*' RETURN p;
The result is the following exception:
org.neo4j.cypher.EntityNotFoundException:
The property 'CODE' does not exist on Node[0]
It looks like Node[0] is a special kind of node in the database, that obviously doesn't have my CODE property. So, my query is failing because I'm not choosing the appropriate type of node to query upon. But I couldn't figure out how to specify the type of node to query on.
What's the query that returns what I want?
I think I need an index on CODE to run this query, but I'd like to know whether there's a query that can do the job without using such an index.
Note: I'm using Neo4J version 1.9.2. Should I upgrade to 2.0?
You can avoid the exception by checking for the existence of the property,
START p=node(*)
WHERE HAS(p.CODE) AND p.CODE =~ '12345678.*'
RETURN p;
You don't need an index for the query to work, but an index may increase performance. If you don't want indices there are several other options. If you keep working with Neo4j 1.9.x you may group all nodes representing companies under one or more sorting nodes. When you query for company nodes you can then retrieve them from their sorting node and filter on their properties. You can partition your graph by grouping companies with a certain range of values for .code under one sorting node, and a different range under another; and you can extend this partitioning as needed if your graph grows.
If you upgrade to 2.0 (note: not released yet, so not suitable for production) you can benefit from labels. You can then assign a Company label to all those nodes, and this label can maintain its own index and uniqueness constraints.
The zero node, called 'reference node', will not remain in future versions of Neo4j.
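With labels (Neo4j 2.0 and later), the query from the question might look like the sketch below. It assumes the company nodes have been given a Company label, as suggested above. The index statement uses the 2.x/3.x syntax and is optional.

```cypher
// Optional: schema index on the labelled property (2.x/3.x syntax).
CREATE INDEX ON :Company(CODE);

// Only nodes labelled Company are considered, so the reference node is skipped.
MATCH (c:Company)
WHERE c.CODE =~ '12345678.*'
RETURN c
LIMIT 1;
```

LIMIT 1 returns a single match, as the question asks for the first node. Without an ORDER BY, which match comes first is not defined.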
