I would like to show on the graph all connections (up to 6 hops) from a specific node (Company).
I simply used the query below:
MATCH path=(c:Company)-[*1..6]-()
where c.property1=$company
RETURN path, c
but the query takes a lot of time to execute.
Any suggestions on how to modify this query or speed it up?
Best regards!
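First of all, traversing 6 hops can be a lot, especially in a well-connected graph.
However, I would suggest the following query to optimize your search. Note that it returns only one shortest path to each node reachable within 6 hops, rather than every possible path: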
MATCH path=shortestPath((c:Company)-[*1..6]-(e))
where c.property1=$company AND NOT c = e
RETURN path, c
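To give a feel for the difference, here is a minimal Python sketch on a toy in-memory graph (a hypothetical adjacency dict, not Neo4j itself). It counts every simple path of up to 6 hops from a start node, which is roughly what the original pattern enumerates, and compares that with a breadth-first search that keeps a single shortest path per reachable node. Treating paths as node-simple is a simplification; Cypher actually enforces relationship uniqueness.

```python
from collections import deque


def count_simple_paths(graph, start, max_hops):
    """Count all paths of 1..max_hops edges from start that never revisit a node."""
    count = 0
    stack = [(start, frozenset([start]), 0)]
    while stack:
        node, seen, depth = stack.pop()
        if depth == max_hops:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                count += 1  # each extension is one more distinct path
                stack.append((nxt, seen | {nxt}, depth + 1))
    return count


def shortest_paths(graph, start, max_hops):
    """Breadth-first search keeping one shortest path per node within max_hops."""
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) - 1 == max_hops:
            continue
        for nxt in graph[node]:
            if nxt not in paths:
                paths[nxt] = paths[node] + [nxt]
                queue.append(nxt)
    del paths[start]  # mirror the "NOT c = e" condition
    return paths


def complete_graph(n):
    """Toy worst case: every node is connected to every other node."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


graph = complete_graph(7)
all_paths = count_simple_paths(graph, 0, 6)
one_per_node = shortest_paths(graph, 0, 6)
print(all_paths, len(one_per_node))
```

Even with only seven fully connected nodes, the number of enumerated paths is in the thousands, while the shortest-path variant returns one path per neighbour.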
I have a neo4j graph that looks like this:
Nodes:
Blue Nodes: Account
Red Nodes: PhoneNumber
Green Nodes: Email
Graph design:
(:PhoneNumber) -[:PART_OF]->(:Account)
(:Email) -[:PART_OF]->(:Account)
The problem I am trying to solve is to
Find any path that exists between Account1 and Account2.
This is what I have tried so far with no success:
MATCH p=shortestPath((a1:Account {accId:'1234'})-[]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=shortestPath((a1:Account {accId:'1234'})-[:PART_OF]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=shortestPath((a1:Account {accId:'1234'})-[*]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=(a1:Account {accId:'1234'})<-[:PART_OF*1..100]-(n)-[:PART_OF]->(a2:Account {accId:'5678'}) RETURN p;
Same queries as above without the shortest path function call.
By looking at the graph I can see there is a path between these 2 nodes but none of my queries yield any result. I am sure this is a very simple query but being new to Cypher, I am having a hard time figuring out the right solution. Any help is appreciated.
Thanks.
All those queries are along the right lines, but need some tweaking to make them work. In the longer term, though, to make it easier to search for connections between accounts, you'll probably want to refactor your graph.
Solution for Now: Making Your Query Work
The path between any two (n:Account) nodes in your graph is going to look something like this:
(a1:Account)<-[:PART_OF]-(:Email)-[:PART_OF]->(ai:Account)<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->(a2:Account)
Since you have only one type of relationship in your graph, the two nodes will thus be connected by an indeterminate number of patterns like the following:
<-[:PART_OF]-(:Email)-[:PART_OF]->
or
<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->
So, your two nodes will be connected through an indeterminate number of intermediate (:Account), (:Email), or (:PhoneNumber) nodes, all connected by -[:PART_OF]- relationships of alternating direction. Unfortunately, to my knowledge (and I'd love to be corrected here), using straight Cypher you can't search for a repeated pattern like this in your current graph. So you'll simply have to use an undirected search to find nodes (a1:Account) and (a2:Account) connected through -[:PART_OF]- relationships. At first glance, your query would look like this:
MATCH p=shortestPath((a1:Account { accId: $a1_id })-[:PART_OF*]-(a2:Account { accId: $a2_id }))
RETURN *
(Notice that I've used Cypher parameters here rather than the literal IDs you put in the original post.)
That's very similar to your query #3 but, as you said, it doesn't work. I'm guessing it either returns no result or throws an out-of-memory exception? The problem is that your graph contains cycles and that query will match a path of any length, so the matching algorithm can effectively go around in circles until it runs out of memory. So you want to set a limit, like you have in query #4, but without the directions (the fixed directions are why that query doesn't work).
So, let's set a limit. Your limit of 100 relationships is a little on the large side, especially in a cyclical graph (i.e., one with circles), and could potentially match in the region of 2^100 paths.
As a (very arbitrary) rule of thumb, any query with a potential undirected and unlabelled path length of more than 5 or 6 may begin to cause problems unless you're very careful with your graph design. In your example, it looks like these two nodes are connected via a path length of 8. We also know that for any two nodes, the given minimum path length will be two (i.e., two -[:PART_OF]- relationships, one into and one out of a node labelled either :Email or :PhoneNumber), and that any two accounts, if linked, will be linked via an even number of relationships.
So, ideally we'd set our relationship length between 2 and 10. However, Cypher's shortestPath() function only supports paths with a minimum length of either 0 or 1, so I've set it between 1 and 10 in the example below (even though we know that in reality the shortest path will have a length of at least two).
MATCH p=shortestPath((a1:Account { accId: $a1_id })-[:PART_OF*1..10]-(a2:Account { accId: $a2_id }))
RETURN *
Hopefully, this will work with your use case, but remember, it may still be very memory intensive to run on a large graph.
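The even-length observation and the effect of the hop limit can be checked on a toy copy of the graph. The sketch below is plain Python with made-up account, email and phone identifiers (all hypothetical); it ignores relationship direction, as the query does, and runs a breadth-first search capped at a maximum number of hops.

```python
from collections import deque

# Hypothetical PART_OF edges, each pointing from an email or phone to an account.
PART_OF = [
    ("email:a@x.com", "acc:1234"),
    ("email:a@x.com", "acc:1111"),
    ("phone:555-0100", "acc:1111"),
    ("phone:555-0100", "acc:5678"),
]


def build_undirected(edges):
    """Ignore direction, like -[:PART_OF*]- in the query."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        adj.setdefault(dst, set()).add(src)
    return adj


def shortest_path(adj, start, goal, max_hops):
    """Return the list of nodes on a shortest path, or None if none exists within max_hops."""
    if start not in adj or goal not in adj:
        return None
    parents = {start: None}
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        if depth[node] == max_hops:
            continue  # hop limit reached, do not expand further
        for nxt in sorted(adj[node]):
            if nxt not in parents:
                parents[nxt] = node
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return None


adj = build_undirected(PART_OF)
path = shortest_path(adj, "acc:1234", "acc:5678", max_hops=10)
hops = len(path) - 1
print(path, hops)
```

Because accounts only ever connect through an email or phone node, the hop count between two accounts comes out even, and a limit lower than the true distance yields no path at all.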
Longer Term Solution: Refactor Graph and/or Use APOC
Depending on your use case, a better or longer term solution would be to refactor your graph to be more specific about relationships to speed up query times when you want to find accounts linked only by email or phone number - i.e. -[:ACCOUNT_HAS_EMAIL]- and -[:ACCOUNT_HAS_PHONE]-. You may then also want to use APOC's shortest path algorithms or path finder functions, which will most likely return a faster result than using cypher, and allow you to be more specific about relationship types as your graph expands to take in more data.
I have a GraphAware time tree and a spatial R-tree set up to reference a large set of nodes in my graph. I am trying to search these records by time and space.
Individually I can gather results from these queries in about 5 seconds:
WITH
({start:1300542000000,end:1350543000000}) as tr
CALL ga.timetree.events.range(tr) YIELD node as n
RETURN count(n);
> ~ 500000 results
WITH
({lon:120.0,lat:20.0}) as smin, ({lon:122.0,lat:21.0}) as smax
CALL spatial.bbox('spatial_records', smin, smax) YIELD node as n
RETURN count(n);
> ~ 30000 results
When I try to filter these results, the performance drops drastically. Neo4j is already using a large amount of memory on my system, so I am under the impression that the memory footprint of this command is too large and that the query will never finish. (I am using the neo4j-shell to run these commands.)
WITH
({start:1300542000000,end:1350543000000}) as tr,
({lon:120.0,lat:20.0}) as smin, ({lon:122.0,lat:21.0}) as smax
CALL ga.timetree.events.range(tr) YIELD node as n
CALL spatial.bbox('spatial_records', smin, smax) YIELD node as m
WITH COLLECT(n) as nn, COLLECT(m) as mm
RETURN FILTER(x IN nn WHERE x IN mm);
I am wondering what the most efficient way to filter the results of these two procedure calls is. I attempted to use REDUCE, but couldn't quite figure out the syntax.
As a side question, given that this is the most common type of query that I will issue to my database, is this a good way to do things (that is, using the time tree and the R-tree to reference the same set of nodes)? I haven't found any other tools in Neo4j that support indexing both space and time in a single structure, so this is my current implementation.
The first procedure returns 500k nodes, and collecting is a costly operation, so yes, this would be very memory heavy. Chaining the two CALLs also runs the spatial procedure once for every row coming out of the time tree, so the row count multiplies before COLLECT even starts.
I would start from whichever call returns the fewest nodes, and then filter in Cypher rather than with a second procedure. Here I would replace the call to the timetree procedure with a range filter in Cypher.
Assuming you have an indexed timestamp property on your nodes:
WITH {lon:120.0,lat:20.0} AS smin, {lon:122.0,lat:21.0} AS smax
CALL spatial.bbox('spatial_records', smin, smax) YIELD node AS m
WITH m
WHERE m.timestamp > 1300542000000 AND m.timestamp < 1350543000000
RETURN m
I wouldn't recommend removing the timetree (otherwise I would be fired <- joke). In some cases the timetree will outperform range queries, especially when the resolution is high (millisecond) and you have a lot of very consecutive timestamps.
Otherwise, you seem to have a very good use case. It would be nice if you could send more details on the Neo4j Slack or privately (christophe at graphaware dot com). This could help Neo4j and GraphAware support more of this via procedures (such as passing in a collection of nodes and filtering out those outside the range, or a smooth combination with spatial), as long as it is generic enough.
In the meantime, as you are using open source products, you could easily create a procedure that combines the two procedures for your specific use case.
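As a rough illustration of the logic such a combined procedure would implement, here is a Python sketch over plain dictionaries standing in for nodes. The record layout and the `id`, `lon`, `lat` and `timestamp` keys are assumptions made for the example; none of this is the GraphAware or Neo4j Spatial API. It starts from the smaller spatial result and applies the time range as a filter, and also shows a set-based intersection for the case where both result lists are already in hand.

```python
def in_bbox(rec, smin, smax):
    """True if the record lies inside the bounding box (inclusive)."""
    return (smin["lon"] <= rec["lon"] <= smax["lon"]
            and smin["lat"] <= rec["lat"] <= smax["lat"])


def filter_space_then_time(records, smin, smax, start, end):
    """Apply the more selective spatial filter first, then the time range."""
    spatial_hits = (r for r in records if in_bbox(r, smin, smax))
    return [r for r in spatial_hits if start < r["timestamp"] < end]


def intersect_by_id(time_hits, spatial_hits):
    """Intersect two result lists with a set lookup instead of list-in-list scans."""
    spatial_ids = {r["id"] for r in spatial_hits}
    return [r for r in time_hits if r["id"] in spatial_ids]


# Hypothetical sample records.
records = [
    {"id": 1, "lon": 121.0, "lat": 20.5, "timestamp": 1320000000000},
    {"id": 2, "lon": 121.5, "lat": 20.2, "timestamp": 1400000000000},  # too late
    {"id": 3, "lon": 100.0, "lat": 20.5, "timestamp": 1320000000000},  # outside box
]
smin, smax = {"lon": 120.0, "lat": 20.0}, {"lon": 122.0, "lat": 21.0}
start, end = 1300542000000, 1350543000000

result = filter_space_then_time(records, smin, smax, start, end)
print([r["id"] for r in result])
```

The set-based version matters when both lists are large: checking membership in a list, as `x IN mm` does, scans the list for every element, whereas a set lookup does not.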
I've been struggling for days to find all paths (up to a maximum length) between two nodes while controlling Neo4j's path exploration by sorting the relationships to be explored (by one of their properties).
So to be clear, let's say I want to find the K best paths between two nodes, up to a maximum length M. The query would look like:
match (source{name:"source"}), (target{name:"target"}),
p = (source)-[*..M]->(target)
return p order by length(p) limit K;
So far so good. But let's say the relationships of the path have a property called "priority". What I want is to write a query that tells Neo4j, at each step of path exploration, which relationships should be explored first.
I know this is possible when I use the Java libraries and an embedded database (by implementing the PathExpander interface and passing it to the GraphAlgoFactory.allSimplePaths() function in Java).
But now I'm trying to find a way to do this with server-mode database access, using Bolt or the REST API.
Is there any way to do this in server mode? Or maybe by using the Java library functions while accessing the graph in server mode?
Use labels and an index to find your two start nodes.
Perhaps consider allShortestPaths to make it faster.
Try this:
match (source{name:"source"}), (target{name:"target"}),
p = (source)-[rels*..20]->(target)
return p, reduce(prio=0, r IN rels | prio + r.priority) as priority
order by priority ASC, length(p)
limit 100;
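For readers who want to see what the reduce and ORDER BY amount to, here is a small Python sketch on a hypothetical edge dictionary. It enumerates directed simple paths up to a maximum length, sums the priority of each path, and sorts by total priority and then by length. Like the query, it ranks paths after they have been found rather than steering the traversal, and treating paths as node-simple is a simplification of Cypher's relationship uniqueness.

```python
def paths_up_to(edges, source, target, max_len):
    """Yield (nodes, priorities) for directed simple paths from source to target."""
    out = {}
    for (a, b), prio in edges.items():
        out.setdefault(a, []).append((b, prio))
    stack = [([source], [])]
    while stack:
        nodes, prios = stack.pop()
        if nodes[-1] == target and prios:
            yield nodes, prios
            continue
        if len(prios) == max_len:
            continue
        for nxt, prio in out.get(nodes[-1], []):
            if nxt not in nodes:
                stack.append((nodes + [nxt], prios + [prio]))


def best_paths(edges, source, target, max_len, k):
    """Rank by summed priority (ascending), then by path length; keep the first k."""
    ranked = sorted(
        paths_up_to(edges, source, target, max_len),
        key=lambda item: (sum(item[1]), len(item[1])),
    )
    return [nodes for nodes, _ in ranked[:k]]


# Hypothetical graph: (from, to) -> priority of that relationship.
EDGES = {
    ("source", "a"): 1,
    ("a", "target"): 1,
    ("source", "target"): 5,
    ("source", "b"): 2,
    ("b", "target"): 2,
}

top = best_paths(EDGES, "source", "target", max_len=3, k=2)
print(top)
```

In this toy graph the direct edge is the shortest path but has the worst total priority, so it falls out of the top two.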
I had a very similar problem. I was trying to find the shortest path from one node to all other nodes. I had written a query similar to the one in the answer above (https://stackoverflow.com/a/38030536/783836) and couldn't get it to perform in any reasonable time.
Asking "Can Graph DBs perform well with unspecified end nodes?" pointed me to the solution: the Single Source Shortest Path algorithm.
In Neo4j you need to install the Graph Data Science Library and use this procedure: gds.alpha.shortestPath.deltaStepping.stream
I have the following cypher query being called multiple times.
start n=node:MyIndex(Name="ABC")
return n
Then somewhere else in the code
start m=node:MyIndex(NAME="XYZ")
return m
My database is hosted in Azure, so I am having latency/performance issues. To speed things up and reduce the number of round trips, I thought about combining multiple Cypher queries into a single one.
Actually, I am getting 10+ nodes in lookup but for simplicity I have decided to show example with just two nodes below.
start n=node:MyIndex(Name="ABC"), m=node:MyIndex(NAME="XYZ")
return n, m
My goal is to get what I can in one round trip instead of 10+. It works if the index lookup succeeds for all nodes. However, the Cypher query returns zero rows if even one index lookup fails. I was hoping that I would get a NULL equivalent in n or m for the missing node. However, no luck.
Please suggest what I am doing wrong and any workarounds to reduce the round trips. Many thanks!
You can use a parametrized query with lucene syntax, e.g.:
START n=node:MyIndex({query}) return n
and parametrize with
{'query':'Name:(ABC XYZ)'}
where the value after Name: is a space-separated list of the names you are looking for.
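A minimal Python sketch of building that parameter and of working out which names came back empty is shown below. The helper names are hypothetical, and it assumes the names contain no spaces or Lucene special characters, since nothing is escaped here.

```python
def build_lookup(names, field="Name"):
    """Build the parameter dict for: START n=node:MyIndex({query}) RETURN n

    Assumes names contain no spaces or Lucene special characters.
    """
    if not names:
        raise ValueError("at least one name is required")
    return {"query": "%s:(%s)" % (field, " ".join(names))}


def missing_names(requested, found):
    """Names that were requested but not present in the returned rows."""
    found_set = set(found)
    return [name for name in requested if name not in found_set]


params = build_lookup(["ABC", "XYZ"])
print(params)
```

Since the combined lookup simply returns fewer rows when a name is absent, comparing the requested names with the returned ones on the client side stands in for the NULL placeholders the question was hoping for.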
I use query
"START a=node("+str(node1)+"),
b =node("+str(node2)+")
MATCH p=shortestPath(a-[:cooperate*..200]-b)
RETURN length(p)"
to see the path between a and b. I have many nodes, so when I run the query, sometimes it runs fast and sometimes it runs slowly. I use Neo4j 1.9 Community. Can anyone help?
Query time is proportional to the amount of the graph searched. Your query allows for very deep searches, up to depth 200. If a and b are very close, you won't search much of the graph, and the query will return very fast. If a and b are separated by 200 edges, you will search a very large swathe of the graph (perhaps the whole graph?), which for a large graph will be much slower.
Is the graph changing between the two queries? Is it possible that these two nodes end up in different places in relation to each other between the queries, for example if you generate random data to populate the graph?
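The effect can be reproduced with a plain breadth-first search in Python on a toy graph. This is only a sketch: the chain-shaped graph is hypothetical, and Neo4j's own shortestPath implementation is more sophisticated than this one-directional search. It counts how many nodes get expanded before the target is found.

```python
from collections import deque


def bfs_cost(adj, start, goal, max_depth):
    """Return (path length or None, number of nodes expanded before stopping)."""
    depth = {start: 0}
    queue = deque([start])
    expanded = 0
    while queue:
        node = queue.popleft()
        if node == goal:
            return depth[node], expanded
        if depth[node] == max_depth:
            continue  # depth limit reached, do not expand
        expanded += 1
        for nxt in adj[node]:
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return None, expanded


def chain(n):
    """Toy graph: nodes 0..n-1 connected in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


adj = chain(1000)
near = bfs_cost(adj, 500, 501, max_depth=200)
far = bfs_cost(adj, 500, 700, max_depth=200)
print(near, far)
```

The same query shape with the same depth limit does very different amounts of work depending on how far apart the two nodes happen to be.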