How to make CREATE UNIQUE work with a subquery? - neo4j

I have a query like this:
MATCH left, right
WHERE (ID(right) IN [1, 2, 3] AND ID(left) IN [4, 5, 6])
WITH left, right
LIMIT 1
RETURN left, right
UNION MATCH left, right
WHERE (ID(right) IN [1, 2, 3] AND ID(left) IN [4, 5, 6])
WITH left, right
SKIP 4 LIMIT 1
RETURN left, right
UNION MATCH left, right
WHERE (ID(right) IN [1, 2, 3] AND ID(left) IN [4, 5, 6])
WITH left, right
SKIP 8 LIMIT 1
RETURN left, right
CREATE UNIQUE left-[rel:FRIEND]->right
RETURN rel;
In general, I'm just creating a dataset so that I can use it later in CREATE UNIQUE instruction.
Obviously, that doesn't work - query analyzer says that I can only use RETURN clause once.
My question is - how to compose a dataset in this case? I tried to assign an alias and use it in CREATE UNIQUE - can't get it to work either. What am I doing wrong? Is this scenario even possible?

I may misunderstand what you are after, but here's what occurs to me when I look at your query.
To begin with here's an adaptation of your query that uses SKIP and LIMIT without RETURN or UNION.
MATCH left, right
WHERE ID(left) IN [1,2,3] AND ID(right) IN [4,5,6]
WITH left, right
LIMIT 1
CREATE UNIQUE left-[rel:FRIEND]->right
WITH [rel] as rels //If you want to return the relationship later you can put it in a collection and bring it WITH
MATCH left, right
WHERE ID(left) IN [1,2,3] AND ID(right) IN [4,5,6]
WITH left, right, rels
SKIP 4 LIMIT 1
CREATE UNIQUE left-[rel:FRIEND]->right
WITH rels + [rel] as rels
MATCH left, right
WHERE ID(left) IN [1,2,3] AND ID(right) IN [4,5,6]
WITH left, right, rels
SKIP 8 LIMIT 1
CREATE UNIQUE left-[rel:FRIEND]->right
WITH rels + [rel] as rels
RETURN LENGTH(rels), rels // You can return the relationships here but SKIP/LIMIT does its job also if you don't return anything
But this query is a bit wild. It's really three queries, where two have been artificially squeezed in as sub queries of the first. It matches the same nodes anew in each sub query, and there really isn't anything gained by running the queries this way rather than separately (it's actually slower, because in each sub query you match also the nodes you know you will not use).
So my first suggestion is to use START instead of MATCH...WHERE when getting nodes by id. As it stands, the query binds every node in the database as "left", and then every node in the database as "right", and then it filters out all the nodes bound to "left" that don't fit the condition in the WHERE clause, and then the same for "right". Since this part of the query is repeated three times, all nodes in the database are bound a total of six times. That's expensive for creating three relationships. If you use START you can bind the nodes you want right away. This doesn't really answer your question, but it will be faster and the query will be cleaner. So, use START to get nodes by their internal id.
START left = node(1,2,3), right = node(4,5,6)
The second thing I think of is the difference between nodes and 'paths' or 'result items' when you match patterns. When you bind three nodes in "left" and three other nodes in "right", you don't have three result items, but nine. For each node bound in "left" you get three results, because there are three possible "right" to combine it with. If you wanted to relate every "left" to every "right", great. But I think what you are looking for are the result items (1),(4), (2),(5), (3),(6), and though it seems convenient to bind the three "left" nodes and the three "right" nodes in one query with collections of node ids, you then you have to do all that filtering to get rid of the 6 unwanted matches. The query gets complex and cumbersome, and its actually slower than running the queries separately. Another way to say this is to say that (1)-[:FRIEND]->(4) is a distinct pattern, not (relevantly) connected to the other patterns you are creating. It would be different if you wanted to create (1)-[:FRIEND]->(2)<-[:FRIEND]-(3), then you would want to handle those three nodes together. Maybe you're just exploring fringe uses of cypher, but I thought I should point it out. By the way, using SKIP and LIMIT in this way is a bit off key, they're not really intended for pattern matching and filtering. It's also unpredictable, unless you also use ORDER BY, since there is no guarantee that the results will be in a certain order. You don't know which result item it is that get's passed on. Anyway, in this case, I think it would be better to bind the nodes and create the relationship in three separate queries.
START left = node(1), right = node(4)
CREATE UNIQUE left-[rel:FRIEND]->right
RETURN rel
START left = node(2), right = node(5)
CREATE UNIQUE left-[rel:FRIEND]->right
RETURN rel
START left = node(3), right = node(6)
CREATE UNIQUE left-[rel:FRIEND]->right
RETURN rel
Since you already know that you want those three pairs, and not, say, (1),(4),(1),(5),(1),(6) it would make sense to query for just those pairs, and the easiest way is to query separately.
But thirdly, since the three queries are structurally identical, differing only in property value (if id is to be considered a property) you can simplify the query by generalizing or anonymizing that which distinguishes them, i.e. use parameters.
START left = node({leftId}), right = node({rightId})
CREATE UNIQUE left-[rel:FRIEND]->right
RETURN rel
parameters: {leftId:1, rightId:4}, {leftId:2, rightId:5}, {leftId:3, rightId:6}
Since the structure is identical, cypher can cache the execution plan. This makes for good performance, and the query is tidy, maintainable and can be easily extended if later you want to do the same operation on other pairs of nodes.

Related

Find shortest path with complex predicate (check simultaneously both node properties and relation property)

I've built a graph with 40 mln nodes and 40 mln relations with Neo4j.
Mostly I search for different shortest paths and queries are to be very fast. Right now it usually takes a few milliseconds per query.
For speed I encode all parameters in relations property val, so ordinary query looks like this:
MATCH (one:Obj{oid:'1'})
with one
MATCH (two:Obj{oid:'2'}), path=shortestPath((one) -[*0..50]-(two))
WHERE ALL (x IN RELATIONSHIPS(path) WHERE ((x.val > 50 and x.val<109) ))
return path
But one filter cannot be done this way, as it should evaluate (on each step) property of starting node, property of relation, property of ending node, for example:
Path: n1(==1)-r1(==2)-n2(==1)-r2(==5)-n3(==3)
On step1: properties of n1 and n2 equal 1 and relation's property equals 2, that's OK, going further
On step2: property of n2 equals 1, but property of n3 equals 3, so we stop. If it was 1, we would stop anyway, because relation r2 is not 2, but 5.
I've used RELATIONSHIPS and NODES predicates, but they seem to work separately.
Also, I guess this can be done with traversal API, but I'll have to rewrite a lot of my other code, so it is not desirable.
Am I missing some fast solution?
It looks like your basic query is running quite fast. If you want to filter at additional steps, you probably have to add additional optional match and with statements to accommodate the filters. Undesired elements should drop out.

neo4j - restrict query based on node's rank

I have a hierarchical structure of nodes, which all have a custom-assigned sorting property (numeric). Here's a simple Cypher query to recreate:
merge (p {my_id: 1})-[:HAS_CHILD]->(c1 { my_id: 11, sort: 100})
merge (p)-[:HAS_CHILD]->(c2 { my_id: 12, sort: 200 })
merge (p)-[:HAS_CHILD]->(c3 { my_id: 13, sort: 300 })
merge (c1)-[:HAS_CHILD]->(cc1 { my_id: 111 })
merge (c2)-[:HAS_CHILD]->(cc2 { my_id: 121 })
merge (c3)-[:HAS_CHILD]->(cc3 { my_id: 131 });
The problem I'm struggling with is that often I need to make decisions based on child node rank relative to some parent node, with regads to this sort identifier. So, for example, node c1 has rank 1 relative to node p (because it has the least sort property), c2 has rank 2, and c3 has rank 3 (the biggest sort).
The kind of decision I need to make based to this information: display children only of the first 2 cX nodes. Here's what I want to get:
cc1 and cc2 are present, but cc3 is not because c3 (its parent) is not the first or the second child of p. Here's a dumb query for that:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc) where c.sort <= 200
return p, c, cc
The problem is, these sort properties are custom-set and imported, so I have no way of knowing which value will be held for child number 2.
My current solution is to rank it during import, and since I'm using Oracle, that's quite simple -- I just need to use rank window function. But it seems awkward to me and I feel like there could be more elegant solution to that. I tried the next query and it works, but it looks weird and it's quite slow on bigger graphs:
match (p {my_id: 1 })-->(c)
optional match (c)-->(cc)
where size([ (p)-->(c1) where c1.sort < c.sort |c1]) < 2
return p, c, cc
Here's the plan for this query and the most expensive part is in fact the size expression:
The slowness you're seeing is likely because you're not performing an index lookup in your query, so it's performing an all nodes scan and accessing the my_id property of every node in your graph to find the one with id 1 (your p node).
You need to add labels on your nodes and use these labels in your queries (at least for your p node), and create an index (or in this case, probably a unique constraint) on the label for my_id so this lookup becomes fast.
You can confirm what's going on by doing a PROFILE of your query (if you can add the profile plan to your description, with all elements of the plan expanded that would help determine further optimizations).
As for your query, something like this should work (I'm using a :Node label as a standin for your actual label)
match (p:Node {my_id: 1 })-->(c)
with p, c
order by c.sort asc
with p, collect(c) as children // children are in order
unwind children[..2] as child // one row for each of the first 2 children
optional match (child)-->(cc) // only matched for the first 2 children
return p, children, collect(cc) as grandchildren
Note that this only returns nodes, not paths or relationships. The reason why you're getting the result graph in the graphical view is because, in the Browser Setting tab (the gear icon in the lower left menu) you have Connect result nodes checked at the bottom.

Pairs from a directed acyclic Neo4j graph

I have a DAG which for the most part is a tree... but there are a few cycles in it. I mention it in case it matters.
I have to translate the graph into pairs of relations. If:
A -> B
C
D -> 1
2 -> X
Y
Then I would produce ArB, ArC, arD, Dr1, Dr2, 2rX, 2rY, where r is some relationship information (in other words, the query cannot totally ignore it.)
Also, in my graph, node A has many cousins, so I need to 'anchor' my query to A.
My current attempt generates all possible pairs, so I get many unhelpful pairs such as ArY since A can eventually traverse to Y.
What is a query that starts (or ends) with A, that returns a list of pairs? I don't want to query Neo individually for each node - I want to get the list in one shot if possible.
The query would be great, doc pages that explain would be great. Any help is appreciated.
EDIT Here's what I have so far, using Frobber's post as inspiration:
1. MATCH p=(n {id:"some_id"})-[*]->(m)
2. WITH DISTINCT(NODES(p)) as zoot
3. MATCH (x)-[r]->(y)
4. WHERE x IN zoot AND y IN zoot
5. RETURN DISTINCT x, TYPE(r) as r, y
Where in line 1, I make a path that includes all the nodes under the one I care about.
In line 2, I start a new match that is intended to return my pairs
Line 3, I convert the path of nodes to a collection of nodes
Line 4, I accept only x and y nodes that were scooped up the first match. I am not sure why I have to include y in the condition, but it seems to matter.
Line 5, I return the results. I do not know why I need a distinct here. I thought the one on line 3 would do the trick.
So far, this is working for me. I have no insight into its performance in a large graph.
Here's an approach to try - this query is modeled off of the sample matrix data you can find online so you can play with it before adapting it to your schema.
MATCH p=(n:Crew)-[r:KNOWS*]-m
WHERE n.name='Neo'
WITH p, length(nodes(p)) AS nCount, length(relationships(p)) AS rCount
RETURN nodes(p)[nCount-2], relationships(p)[rCount-1], nodes(p)[nCount-1];
ORDER BY length(p) ASC;
A couple of notes about what's going on here:
Consider the "Neo" node (n.name="Neo") to be your "A" here. You're rooting this path traversal in some particular node you pick out.
We're matching paths, not nodes or edges.
We're going through all paths rooted at the A node, ordering by path length. This gets the near nodes before the distant nodes.
For each path we find, we're looking at the nodes and relationships in the path, and then returning the last pair. The second-to-last node (nodes(p)[nCount-2]) and the last relationship in the path (relationships(p)[rCount-1]).
This query basically returns the node, the relationship, and the connected node showing that you can get those items; from there you just customize the query to pull out whatever about those nodes/rels you might need pursuant to your schema.
The basic formula starts with matching p=(someNode {startingPoint: "A"})-[r:*]->(otherStuff); from there it's just processing paths as you go.

Neo4j cypher query a known path

I have a query that I'm not sure how to implement or if it's efficient to do in cypher. Anyway, here's what I'm trying to do.
I have basically this graph:
I want to get all the nodes/relationships from 1 to 3 (note: the empty node can be any number of nodes). I also want all the, if any, incoming edges from the last two nodes and only the last two nodes that are not in the original path. In this case the edges that are in red should also be added to result.
I already know the path that I want. So in this example I would have been given node ids 1, ..., 2, 3 and I think I know how to get the path of the first part.
MATCH (n)-->() WHERE n.nid IN ['1', '...', '2', '3'] RETURN n
I just can't figure out how to get the red edges for the last two nodes in the path. Also, I'm not given node ids 4 and 5. We can assume the edges connecting 1, ..., 2, 3 all have the same label and all the other edges have a different label.
I think I need to use merge but can't figure out how to do it yet.
Or if someone know's how to do this in gremlin, I'm all ears.
Does this work for you?
MATCH ({nid: '1'})-[:t*]->(n2 {nid: '2'})-[:t]->(n3 {nid: '3'})
OPTIONAL MATCH ()-[t42]->(n2)
WHERE (TYPE(t42) <> 't')
OPTIONAL MATCH ()-[t53]->(n3)
WHERE (TYPE(t53) <> 't')
RETURN COLLECT(t42) AS c42, COLLECT(t53) AS c53;
I give all the relationships on the left path (in your diagram) the type "t". (The term label is used for nodes, not relationships.). You said we can assume that the other relationships do not have that type, so this query takes advantage of that fact to filter out type "t" relationships from the result.
This query also makes the 4-2 and 5-3 relationships optional.

Any length relationship between two nodes that share the same label

I'm trying to find a query that will show me any length relationship that exists between two nodes that share the same index. Basically, if there is any overlap between for a specific label. My graph is Pretty simple and not particularly large:
(m:`Campaign`), (n:`Politician`), (o:`Assistant`), (p:`Staff`), (q:`Aid`), (s:`Contributor`)
(m)<-[:Campaigns_for]-(n)
(o)<-[:works_for]-(m)
(p)<-[:works_for]-(o)
(q)<-[:volunteers_for]-(p)
(m)<-[:contributes_to]-(s)
I want to find all the shared nodes and their relationships between Campaigns.
so far i have:
MATCH (n:`Campaign`)-[r*]-(m:`Campaign`)
RETURN n,count(r) as R,m
ORDER BY R DESC
but it's not returning everyhing I want, I want in addition to the counts, the labels of each relationship and the names of the nodes in between.
Assuming that "names of the nodes" means "return the name property of the node" (you could always substitute in "labels(n)" if you're after labels), then something like this might work, but you have some aggregation going on here so you may need to parse a bit:
MATCH p =(a:Campaign)-[r*]-(b:Campaign)
RETURN a, length(relationships(p)) AS count, b, extract(x IN relationships(p)| type(x)), extract(x IN nodes(p)| x.name)
ORDER BY count DESC
I'm also assuming that when you say "not returning everything you want", you mean that in addition to what's currently returned in your result set, you want just those other items you listed.
Keep in mind it might also be possible to have a cycle in your graph (not knowing too much about your particular graph), so, you may want to check your beginning and ending nodes.

Resources