Neo4j Cypher: stacking nodes from sequential query results

Considering the existence of three types of nodes in a db, connected by the schema
(a:a)-[ra:madeWithB {ra.qty}]->(b:b)-[rb:madeWithC {rb.qty}]->(c:c)
with the user being able to have connection with each one of these types.
(user)-[:has {qty}]->(a:a)
(user)-[:has {qty}]->(b:b)
(user)-[:has {qty}]->(c:c)
What would be the best way to query the database to return a list of all the nodes the user :has, given that when the user :has an (a), the associated (b) and (c) nodes should also be returned, with their qty values multiplied along the path?
Real-world example: a user buys three fully furnished IKEA rooms (a nodes). The db knows what furniture is in them (b nodes) and what parts are needed for those items (nails and so on, c nodes). The user also buys some other furniture (i.e. more b nodes, not connected to any a but connected to more c) and some spare nails and other parts (i.e. more c nodes, not connected to any b).
So, knowing the list of a nodes plus the additional b and c nodes, print the list of all b (the sum of those contained in the three rooms plus the extras) and all c (the parts needed for all the furniture plus the extras), each with its associated qty.
NOTE: consider arbitrary length queries not to be an option when matching nodes.
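The arithmetic being asked for can be sketched outside the database. The following is a plain Python model, not a Cypher query: the item names, quantities, and the helper expand_holdings are all hypothetical, and it only assumes that qty on :has, madeWithB, and madeWithC multiply along the path as described above.

```python
from collections import Counter

# Hypothetical sample data (names and quantities are made up for illustration).
# made_with_b[a][b] = qty of b in one a; made_with_c[b][c] = qty of c in one b.
made_with_b = {"room": {"bed": 1, "chair": 2}}
made_with_c = {
    "bed": {"nail": 10},
    "chair": {"nail": 4, "screw": 2},
    "shelf": {"screw": 6},
}

def expand_holdings(has_a, has_b, has_c):
    """Return (total_b, total_c) for one user's direct holdings."""
    total_b = Counter(has_b)              # furniture bought directly
    for a, a_qty in has_a.items():        # furniture implied by each room
        for b, per_a in made_with_b.get(a, {}).items():
            total_b[b] += a_qty * per_a
    total_c = Counter(has_c)              # spare parts bought directly
    for b, b_qty in total_b.items():      # parts implied by all furniture
        for c, per_b in made_with_c.get(b, {}).items():
            total_c[c] += b_qty * per_b
    return dict(total_b), dict(total_c)

# Three rooms, one extra shelf, five spare nails.
total_b, total_c = expand_holdings({"room": 3}, {"shelf": 1}, {"nail": 5})
```

Each level is a fixed-length expansion (direct holdings, one hop, two hops), so nothing here relies on arbitrary-length matching; a query would need to combine the same three contributions and sum the quantities per node.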

Related

Neo4j Cypher Aggregate function changes in WITH clause

I'm new to Neo4j, and having a problem with the average function.
I've got a test database of bank accounts (nodes) and payments between them (relationships).
I want to compute the average of the payments between each pair of accounts (i.e. between A&B, between A&C, between B&C, etc.), and then find any payments that are more than $50 above the average.
My code looks like this:
MATCH (a)-[r:Payment]-(b)
WITH a, b, AVG(ToFloat(r.Amount)) AS Average, ToFloat(r.Amount) as Amount
WHERE Amount-Average>50
RETURN a, b, Amount-Average AS Difference
If I just leave a and Average in the WITH clause, it seems to compute the average correctly, but if I add in anything else (either r or the r.Amount clause), then the Average function output changes, and just returns the same value as "Amount" (So it would compute "Difference" as 0 for every relationship).
Could it be that the way I'm MATCHing the nodes and relationships doesn't correctly find the relationships between each pair of accounts and then average on them, which would then cause the error?
Thanks in advance!
This is a consequence of Cypher's implicit grouping when performing aggregations. The grouping key (the context over which the grouping happens) is implicit, formed by the non-aggregation variables present on the WITH or RETURN clause.
This is why the output changes when you include r or r.Amount: the average is then calculated per relationship, or per amount, and the average of a single value is that value.
Since you want to evaluate and filter all amounts between the nodes based upon the average, you should collect the amounts when you take the average, and then filter/transform the contents for your return.
Also, you'll want to include a bit of filtering for a and b to ensure you don't return mirrored results (same results for the same nodes except the nodes for a and b are swapped), so we'll use a restriction on the node ids to ensure order in a single direction only:
MATCH (a)-[r:Payment]-(b)
WHERE id(a) < id(b) // ensure we don't get mirrored results
WITH a, b, AVG(ToFloat(r.Amount)) AS Average, collect(ToFloat(r.Amount)) as Amounts
WITH a, b, [amt in Amounts WHERE amt-Average > 50 | amt - Average] as Differences
RETURN a, b, Differences
If you want individual results for each row, then you can UNWIND the Differences list before you return.
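To see the implicit grouping effect in isolation, here is a small Python model of the aggregation. It is only a sketch: the rows and the helper average_by are hypothetical, and it imitates the grouping behaviour rather than running any Cypher.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical payment rows: (a, b, amount), one per :Payment relationship.
rows = [("A", "B", 130.0), ("A", "B", 20.0), ("A", "B", 30.0), ("A", "C", 10.0)]

def average_by(rows, key):
    """Average the amount per grouping key, as an aggregation would."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {k: mean(v) for k, v in groups.items()}

# Grouping key (a, b, amount): each group holds identical amounts, so avg == amount.
per_row = average_by(rows, key=lambda r: r)
# Grouping key (a, b): the average covers all payments between the pair.
per_pair = average_by(rows, key=lambda r: (r[0], r[1]))

# Collect-then-filter, mirroring the list comprehension in the query above.
differences = {
    (a, b): [amt - per_pair[(a, b)] for (x, y, amt) in rows
             if (x, y) == (a, b) and amt - per_pair[(a, b)] > 50]
    for (a, b) in per_pair
}
```

With the amount in the key, every "average" equals its own amount and the difference is always 0, which matches the behaviour described in the question.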

How to calculate custom degree based on the node label or other conditions?

I have a scenario where I need to calculate a custom degree from the first node (:employee), where the degree should only be incremented when the next node's label is :natural or :relative, but not when it is :legal.
Example:
The thing is I'm having trouble generating this custom degree property as I needed it.
So far I've tried playing with FOREACH and CASE but had no luck. The closest I got to getting some sort of calculated custom degree is this:
match p = (:employee)-[*5..5]-()
WITH distinct nodes(p) AS nodes
FOREACH(i IN RANGE(0, size(nodes)) |
FOREACH(node IN [nodes[i]] |
SET node.degree = i
))
return *
limit 1
But even this isn't right, as despite having 5 distinct nodes, I get SIZE(nodes) = 6, as the :legal node is accounted for twice for some reason.
Does anyone know how to achieve my goal within a single cypher query?
Also, if you know why the :legal node is accounted for twice, please let me know. I suspect it is because it has 2 :natural nodes related to it, but I don't know the inner workings that make it appear twice.
More context:
:employee nodes are, well, employees of an organization
:relative nodes are relatives to an employee
:natural nodes are natural persons that may or may not be related to a :legal
:legal nodes are companies (legal persons) that may, or may not, be related to an :employee, :relative, :natural or another :legal on an IS_PARTNER relationship when, in real life, they are part of the board of directors or are shareholders of that company (:legal).
custom degree is what I aim to create and will define how close one node is to another given some conditions to this project (specified below).
All nodes have a total_contracts property, which is the total amount of money received through contracts.
The objective is to find any employees with relationships to another node that has total_contracts > 0 and are up to custom degree <= 3, as employees may be receiving money from external sources, when they shouldn't.
As for why I need this custom degree to ignore the distance when it is a :legal node: we treat a company as being at the same distance as the natural person who is a partner in it.
In the illustrated example above, the employee has a son, DIEGO, who is a shareholder of a company (ALLURE) and has 2 other business partners (JOSE and ROSIEL). When I ask what the degree of the son to the employee is, I should get 1, as they are directly related; when I ask what the degree of JOSE to the employee is, I should get 2, as JOSE is related to DIEGO through ALLURE and we shouldn't increment the custom degree when it is a company, only when it's a person.
The trick with this type of graph is making sure we avoid paths that loop back to the same nodes (which is definitely going to happen quite a lot because you're using multiple relationships between nodes instead of just one...you may want to make sure this is necessary in your model).
The easiest way to do that is via APOC Procedures, as you can adjust the uniqueness of traversals so that nodes are unique in each path.
So for example, for a specific start node (let's say the :employee has empId:1, just for the sake of mocking up a lookup of the node), we'll calculate a degree for all nodes within 5 hops of the starting node. The idea here is that we'll take the length of the path (the number of hops) minus the number of :legal nodes in the path (by filtering the nodes in the path for just :legal nodes, then getting the size of that filtered list).
MATCH (e:employee {empId:1})
CALL apoc.path.expandConfig(e, {minLevel:1, maxLevel:5, uniqueness:'NODE_PATH'}) YIELD path
WITH e, last(nodes(path)) as endNode,
length(path) - size([x in nodes(path) WHERE x:legal]) as customDegree
RETURN e, endNode, customDegree
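The "hops minus :legal nodes" calculation can be checked on a toy graph. This is a hedged Python sketch, not APOC: the node names and edges are assumptions modelled on the example in the question, and taking the smallest value per end node is an added choice (the query above returns one row per path instead).

```python
# Hypothetical graph modelled on the question's example; labels and edges are assumptions.
labels = {"EMP": "employee", "DIEGO": "relative", "ALLURE": "legal",
          "JOSE": "natural", "ROSIEL": "natural"}
edges = [("EMP", "DIEGO"), ("DIEGO", "ALLURE"),
         ("ALLURE", "JOSE"), ("ALLURE", "ROSIEL")]

neighbours = {n: set() for n in labels}
for u, v in edges:  # treat relationships as undirected
    neighbours[u].add(v)
    neighbours[v].add(u)

def custom_degrees(start, max_level=5):
    """Smallest (hops - :legal nodes on the path) for each node reachable from start."""
    best = {}

    def walk(path):
        end = path[-1]
        if len(path) > 1:
            hops = len(path) - 1
            legal = sum(1 for n in path if labels[n] == "legal")
            best[end] = min(best.get(end, hops), hops - legal)
        if len(path) - 1 < max_level:
            for nxt in neighbours[end]:
                if nxt not in path:  # no node repeats within a single path
                    walk(path + [nxt])

    walk([start])
    return best

degrees = custom_degrees("EMP")
```

Here DIEGO comes out at 1 and JOSE at 2, as the question expects; ALLURE itself also gets 1 because the company on the path does not add to the count.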

Finding actors not connected to Kevin Bacon, efficiently

Using Neo4j Cypher, what query would efficiently find actors who are not connected to Kevin Bacon? For simplicity, we can say that 'not connected' means that an actor cannot be reached from Kevin Bacon within 10 hops.
Here is what I have attempted:
MATCH (kb:Actor {name:'Kevin Bacon'})-[*1..10]-(h:Actor) with h
MATCH (a)-[:ACTS_IN]->(m)
WHERE a <> h
RETURN DISTINCT h.name
However, this query runs for 3 days. How can I do this more efficiently?
(A) Your first MATCH finds every actor that is connected within 10 hops to Kevin Bacon. The result of this clause is a number (M) of rows (and if an actor is connected in, say, 7 different ways to Kevin, then that actor is represented in 7 rows).
(B) Your second MATCH finds every actor that has acted in a movie. If this MATCH clause were standalone, then it would produce N rows, where N is the number of ACTS_IN relationships (and if an actor acted in, say, 9 movies, then that actor would be represented in 9 rows). However, since the clause comes right after another MATCH clause, you get a Cartesian product and the actual number of result rows is M*N.
So, your query requires a lot of storage and performs a (potentially large) number of redundant comparisons, and your results can contain duplicate names. To reduce the storage requirements and the number of actor comparisons (in your WHERE clause): you should cause the results of A and B to have distinct actors, and eliminate the cartesian product.
The following query should do that. It first collects a single list (in a single row) of every distinct actor that is connected within 10 hops to Kevin Bacon (as hs), and then finds all (distinct) actors not in that collection:
MATCH (kb:Actor {name:'Kevin Bacon'})-[*..10]-(h:Actor)
WITH COLLECT(DISTINCT h) AS hs
MATCH (a:Actor)
WHERE NOT a IN hs
RETURN a.name;
(This query also saves even more time by not bothering to test whether an actor has acted in a movie.)
The performance would still depend on how long it takes to perform the variable length path search in the first MATCH, however.
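The collect-then-exclude idea is a set difference, which the following Python sketch models on a tiny graph. The actor and movie names are hypothetical, and the breadth-first search is a swapped-in technique that visits each node once; it is not how Cypher evaluates a variable-length pattern. Unlike the query, the sketch also counts the start node as connected to itself.

```python
from collections import deque

# Hypothetical bipartite graph: actors and movies joined by ACTS_IN edges.
actors = {"Kevin Bacon", "Actor B", "Actor C", "Actor D"}
acts_in = [("Kevin Bacon", "Movie 1"), ("Actor B", "Movie 1"),
           ("Actor C", "Movie 2"), ("Actor D", "Movie 2")]

adjacent = {}
for actor, movie in acts_in:
    adjacent.setdefault(actor, set()).add(movie)
    adjacent.setdefault(movie, set()).add(actor)

def within_hops(start, max_hops):
    """Breadth-first search: every node at most max_hops away from start."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacent.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen

connected = within_hops("Kevin Bacon", 10) & actors   # like COLLECT(DISTINCT h)
not_connected = sorted(actors - connected)            # like WHERE NOT a IN hs
```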

Neo4j - Traversing from one node to another which are indirectly connected by parent Node

I have a specific case with two labels, Person and Company.
There are two Person nodes, X and Y, and a single Company node.
Both persons are connected to the Company by a HAS_EMPLOYEE relationship.
I want to find the relationship between X and Y, i.e. that they work for the same company.
How can I do that in Neo4j, given only nodes X and Y?
This will depend on if you're looking for a specific connection (via a :Company node), or just looking for any connection at all.
Let's say :Person nodes have a name, and that person nodes X and Y have the names 'x' and 'y', so we can match to them. Let's also say that you have an index on :Person(name) so we can look up the nodes quickly.
If the query we want is "do persons x and y share the same company", the query for this, returning the company in question, is:
match (x:Person{name:'x'})<-[:HAS_EMPLOYEE]-(comp:Company)-[:HAS_EMPLOYEE]->(y:Person{name:'y'})
return comp
But if we don't know how these persons are connected, or even if they're connected, then we'll likely want to run a shortestPath() match between the nodes, and see what connects them.
It helps to set an upper bound for this match. For now let's use 8 hops max.
match path=shortestPath((x:Person{name:'x'})-[*..8]-(y:Person{name:'y'}))
return path
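The two approaches (a fixed pattern through a shared company, and a bounded shortest-path search) can be compared in a small Python model. The company and person names and both helper functions are hypothetical, and the breadth-first search only illustrates the idea of a hop-limited shortest path; it is not Neo4j's shortestPath() implementation.

```python
from collections import deque

# Hypothetical data: company -> employees, following HAS_EMPLOYEE.
has_employee = {"Acme": {"x", "y"}, "Globex": {"y", "z"}}

def shared_companies(p1, p2):
    """Companies that employ both persons (the fixed two-hop pattern)."""
    return sorted(c for c, staff in has_employee.items()
                  if p1 in staff and p2 in staff)

def shortest_path(start, goal, max_hops=8):
    """Breadth-first search over undirected HAS_EMPLOYEE edges, bounded by max_hops."""
    adjacent = {}
    for company, staff in has_employee.items():
        for person in staff:
            adjacent.setdefault(company, set()).add(person)
            adjacent.setdefault(person, set()).add(company)
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        if len(path) - 1 < max_hops:
            for nxt in sorted(adjacent.get(path[-1], ())):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
    return None  # no connection within the hop limit
```

For example, shared_companies("x", "y") finds the direct link through one company, while shortest_path("x", "z") has to go through two companies and returns None if max_hops is set too low.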

Efficient duplicate node finding in neo4j

A feature request for the next Neo4j version: Neo4j already supports indices that keep properties in a sorted order, allowing fast lookups. Eg. for a person's first name, one might have an index that looks like:
Alice
Bob
Carol
Dave
Emily
(....)
so one can look up "Dave" with binary search (O(log n)) instead of linear scanning (O(n)).
However, one can also use an index to efficiently find duplicates (nodes which have the same value for some property). Eg., if one wants a list of every group of "person" nodes sharing the same first name, what Neo4j 2.3 seems to do (via EXPLAIN in Cypher) is run a comparison of each node's first name against every other first name, which is O(N^2). Eg. this query:
EXPLAIN MATCH (a:person) WITH a MATCH (b:person) WHERE a.name = b.name RETURN a, b LIMIT 5
shows a CartesianProduct step followed by a Filter step. But with an index on first names, one can do a linear scan over a list like:
Alice
Alice
Alice
Bob
Carol
Carol
Dave
Emily
Frank
Frank
Frank
(....)
comparing item #1 to #2, #2 to #3, and so on, to build an ordered list of all the duplicates in O(n) time per scan. Neo4j doesn't seem to support that, but it would be very useful for my application, so I'd like to put in a request.
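The linear scan described above can be sketched in Python. This only illustrates the proposed algorithm on an already sorted list; it says nothing about how Neo4j indexes work internally, and duplicate_groups is a hypothetical helper.

```python
from itertools import groupby

def duplicate_groups(sorted_names):
    """One pass over sorted values; groupby only compares neighbouring items."""
    groups = []
    for name, run in groupby(sorted_names):
        count = sum(1 for _ in run)
        if count > 1:
            groups.append((name, count))
    return groups

# The sorted list from the question, standing in for an ordered index.
index = ["Alice", "Alice", "Alice", "Bob", "Carol", "Carol",
         "Dave", "Emily", "Frank", "Frank", "Frank"]
dupes = duplicate_groups(index)
```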
I have a couple of suggestions for what you might try, but if you find them insufficient (and nobody else has any better ideas), I would suggest submitting new feature ideas to the Neo4j GitHub issues list.
So I was wondering if maybe Neo4j considers properties special. If you have an index on a label/property (which you can create with CREATE INDEX ON :person(name)), then comparing a property with a string should be pretty efficient. I tried passing the name through as just a variable and it seems to have fewer DB hits in my small test DB:
MATCH (a:person)
WITH a, a.name AS name
MATCH (b:person)
WHERE name = b.name
RETURN a, b LIMIT 5
That seems to give me fewer DB hits when I PROFILE it.
Another way to go about it, since you're talking about the same set of objects, is to group the nodes by name and then pull out the pairs for each group. Like so:
MATCH (a:person)
WITH a.name AS name, collect(a) AS people
UNWIND people AS a
UNWIND people AS b
WITH name, a, b
WHERE a <> b
RETURN a, b LIMIT 50
Here we collect up an array for each unique name (we could also lower/upper if we wanted to be case-insensitive) and then UNWIND twice to get a cartesian product of the array. Since we're working on a group-by-group basis, this should be much faster than comparing every node to every other node.
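The group-then-pair approach can be modelled in a few lines of Python. The (id, name) tuples and the helper duplicate_pairs are hypothetical stand-ins for person nodes; the point is only that pairs are built within each name group rather than across all nodes.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical person nodes as (id, name) tuples.
people = [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Carol"), (5, "Alice")]

def duplicate_pairs(nodes):
    """Group nodes by name, then emit ordered pairs within each group only."""
    by_name = defaultdict(list)          # like collect(a) per a.name
    for node in nodes:
        by_name[node[1]].append(node)
    pairs = []
    for group in by_name.values():
        # permutations(group, 2) matches the double UNWIND filtered by a <> b
        pairs.extend(permutations(group, 2))
    return pairs

pairs = duplicate_pairs(people)
```

As in the query, each duplicate pair appears twice, once as (a, b) and once as (b, a); names that occur only once produce no pairs at all.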
