Get start and end nodes of specific path in a large graph - neo4j

I have a large graph (1,068,029 nodes and 2,602,897 relationships). I work with it via the Python API and send requests to the graph as part of my program flow.
I have the following queries -
First query
MATCH
(start_node)--(o:observed_data)--(i:indicator)--(m:malware)--(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name
Second query
MATCH
(start_node)--(o1:observed_data)--(h:MD5)--(o2:observed_data)--(i:indicator)--(m:malware)--(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name
When I try to perform the first query with an id_list of 75,000 entries, it completes and returns the expected output, but when I try to perform the second query the graph gets stuck, even when I reduce the id_list to 20,000.
The full id_list is even larger than 75,000, so I split it into chunks to keep the graph's response time down, but if I split it into too many chunks I increase the number of requests to the graph and the program's run time.
My question is: is there a library function of some sort (APOC or something like that) that performs the same action in less time? Or do you have another solution that solves this problem without shrinking the id_list below 50,000?

The (start_node) in your MATCH patterns should specify a label (like (start_node:Foo)), to avoid having to scan every node in the DB. Also, you should create an index (or uniqueness constraint) for that start node.
You should make all the relationships in your MATCH patterns directional, if appropriate. That is, put an arrow on one end or the other of each relationship.
You should specify the relationship types in your patterns as well (like ()-[:BAR]->()), so that the query would not be forced to evaluate all relationship types.
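As a rough illustration of the three points above combined with the chunking the question already uses, here is a minimal Python sketch. It only builds (query, parameters) pairs and does not talk to a database. The label Foo and the relationship type BAR are the placeholders from this answer, not names from the real graph, and the arrow directions are guesses. The helper names chunked and build_requests are hypothetical.

```python
# Placeholder label (Foo), relationship type (BAR) and arrow directions:
# substitute whatever the real graph actually uses.
QUERY = """
MATCH (start_node:Foo)-[:BAR]->(o1:observed_data)-[:BAR]->(h:MD5)
      <-[:BAR]-(o2:observed_data)-[:BAR]->(i:indicator)
      -[:BAR]->(m:malware)-[:BAR]->(end_node:attack_pattern)
WHERE start_node.id IN $id_list
RETURN start_node.id, end_node.name
"""


def chunked(ids, size):
    """Yield consecutive slices of ids, each with at most `size` items."""
    for offset in range(0, len(ids), size):
        yield ids[offset:offset + size]


def build_requests(ids, size):
    """Pair the (fixed) query text with one parameter map per chunk."""
    return [(QUERY, {"id_list": chunk}) for chunk in chunked(ids, size)]
```

Each pair could then be handed to whatever client is in use, for example session.run(query, params) with the official Python driver. Passing the ids as a parameter rather than pasting them into the query text keeps the query string identical across chunks.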

Related

Why does this cypher query never finish

This is the query:
MATCH (t:Table)-[*]-(a:Attribute) RETURN t,a
Here is the complete graph:
Here is the query and what happens when I try to execute it:
The reason is that you are performing a variable-length relationship without an upper bound. Cypher will attempt to find every possible path in existence that can be made no matter how long the path, provided that the path begins with a :Table node and ends with an :Attribute node. While a relationship will only be traversed once per path, there's no restriction to using a different relationship to return to a previously traversed node and then using another as-of-yet-untraversed-relationship-in-the-path to leave it and continue traversing.
Even on a small graph, the number of possible paths explodes. You can see for yourself how the number of paths grows, and how the db will get slower as the number of possible paths to explore explodes.
MATCH (:Table)-[*..6]-(:Attribute)
RETURN count(*) as pathsFound
Now if that finishes quickly, increase the upper bound and run it again. Keep doing that to see how high you can go, and how high the count of paths found gets, before the db starts running into trouble.
I'll save you some time, though. I recreated your graph, and you hit the max possible paths when you have an upper bound of 23 hops, returning a count of 1371112 total distinct paths in your graph matching that pattern. The browser alone won't be able to cope with this many rows of data.
Here are two queries you can run to verify it (provided that this is your entire graph):
MATCH (:Table)-[*..23]-(:Attribute)
RETURN count(*) as totalPathsFound
and
MATCH path = (:Table)-[*..23]-(:Attribute)
RETURN length(path) as pathLength, count(*) as pathsFound
ORDER BY pathLength DESC
Note that expanding out and counting the number of possible paths isn't too strenuous, we can get that in a few seconds. But doing property access or additional computations that may multiplicatively increase the number of paths can be a problem, and streaming back this many rows of data, especially to a browser app, can be a problem.
More to the point, I don't think you really want to process over a million results anyway. What the query is actually doing is likely completely different than what you really want. So you may want to clarify what exactly you want the query to do, because the current approach isn't feasible.
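To get a feel for the growth described in this answer without touching the database, here is a small pure-Python sketch. It counts paths in which each relationship is used at most once but nodes may be revisited, which is the rule described above. The function name count_trails and the tiny example graph in the usage note are made up for illustration, and self-loops are not handled.

```python
from collections import defaultdict


def count_trails(edges, starts, ends, max_hops):
    """Count undirected paths of 1..max_hops relationships from any start
    node to any end node, using each relationship at most once per path."""
    adjacency = defaultdict(list)
    for index, (a, b) in enumerate(edges):
        adjacency[a].append((index, b))
        adjacency[b].append((index, a))
    end_set = set(ends)

    def walk(node, used, hops):
        # A path counts as soon as it has at least one hop and sits on an end node.
        found = 1 if hops >= 1 and node in end_set else 0
        if hops == max_hops:
            return found
        for index, neighbour in adjacency[node]:
            if index not in used:
                used.add(index)
                found += walk(neighbour, used, hops + 1)
                used.remove(index)
        return found

    return sum(walk(start, set(), 0) for start in starts)
```

On a triangle with edges [("T", "A"), ("A", "x"), ("x", "T")], one start "T" and one end "A", the count is 1 with a bound of 1 hop and 2 with a bound of 2. Adding a few more cross-links and raising max_hops step by step shows how quickly the count climbs.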

cypher performance for multiple hops

I'm running my Cypher queries on a very large social network (over 1B records). I'm trying to get all paths between two persons with variable relationship lengths. I get a reasonable response time running a query for a single relationship length (between 0.5 and 2 seconds) [the person ids are indexed].
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However, when I run the query with multiple lengths (i.e. 2 or more), my response time goes up to several minutes. Assuming that each person has on average the same number of connections, I should be running my queries for 2-3 minutes max (but I get up to 5+ minutes).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
I tried EXPLAIN, but it did not show extreme values for VarLengthExpand(All).
Maybe the traversal is not using the index for pr2.
Is there any way to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can add a LIMIT clause after your original RETURN clause to limit the number of returned paths.
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using the explain, try the "profile" instead.
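A back-of-the-envelope calculation shows why going from one hop to two costs far more than twice as much. The sketch below assumes, purely for illustration, that every person has the same number of connections, that every neighbour is a :person, and that the neighbourhood contains no cycles. The function name estimate_paths is hypothetical.

```python
def estimate_paths(avg_degree, max_hops):
    """Estimate how many paths a [*0..max_hops] expansion from one node yields,
    assuming a uniform degree and a tree-like (cycle-free) neighbourhood."""
    total = 1  # the zero-length path: the start node on its own
    frontier = 1
    for hop in range(1, max_hops + 1):
        # The first hop can use every relationship; later hops cannot
        # go back over the relationship they just arrived on.
        branching = avg_degree if hop == 1 else avg_degree - 1
        frontier *= branching
        total += frontier
    return total
```

With 100 connections per person this gives 101 paths for *0..1 and 10,001 for *0..2, so under these assumptions the second query returns roughly a hundred times as many paths as the first, not twice as many.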

Cypher: Find any path between nodes

I have a neo4j graph that looks like this:
Nodes:
Blue Nodes: Account
Red Nodes: PhoneNumber
Green Nodes: Email
Graph design:
(:PhoneNumber) -[:PART_OF]->(:Account)
(:Email) -[:PART_OF]->(:Account)
The problem I am trying to solve is to
Find any path that exists between Account1 and Account2.
This is what I have tried so far with no success:
MATCH p=shortestPath((a1:Account {accId:'1234'})-[]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=shortestPath((a1:Account {accId:'1234'})-[:PART_OF]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=shortestPath((a1:Account {accId:'1234'})-[*]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=(a1:Account {accId:'1234'})<-[:PART_OF*1..100]-(n)-[:PART_OF]->(a2:Account {accId:'5678'}) RETURN p;
Same queries as above without the shortest path function call.
By looking at the graph I can see there is a path between these 2 nodes but none of my queries yield any result. I am sure this is a very simple query but being new to Cypher, I am having a hard time figuring out the right solution. Any help is appreciated.
Thanks.
All those queries are along the right lines, but need some tweaking to make work. In the longer term, though, to get a better system to easily search for connections between accounts, you'll probably want to refactor your graph.
Solution for Now: Making Your Query Work
The path between any two (n:Account) nodes in your graph is going to look something like this:
(a1:Account)<-[:PART_OF]-(:Email)-[:PART_OF]->(ai:Account)<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->(a2:Account)
Since you have only one type of relationship in your graph, the two nodes will thus be connected by an indeterminate number of patterns like the following:
<-[:PART_OF]-(:Email)-[:PART_OF]->
or
<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->
So, your two nodes will be connected through an indeterminate number of intermediate (:Account), (:Email), or (:PhoneNumber) nodes, all connected by -[:PART_OF]- relationships of alternating direction. Unfortunately, to my knowledge (and I'd love to be corrected here), using straight cypher you can't search for a repeated pattern like this in your current graph. So, you'll simply have to use an undirected search to find nodes (a1:Account) and (a2:Account) connected through -[:PART_OF]- relationships. So, at first glance your query would look like this:
MATCH p=shortestPath((a1:Account { accId: {a1_id} })-[:PART_OF*]-(a2:Account { accId: {a2_id} }))
RETURN *
(notice here I've used cypher parameters rather than the literal values you put in the original post)
That's very similar to your query #3, but, like you said - it doesn't work. I'm guessing what happens is that it doesn't return a result, or returns an out of memory exception? The problem is that since your graph has circular paths in it, and that query will match a path of any length, the matching algorithm will literally go around in circles until it runs out of memory. So, you want to set a limit, like you have in query #4, but without the directions (which is why that query doesn't work).
So, let's set a limit. Your limit of 100 relationships is a little on the large side, especially in a cyclical graph (i.e., one with circles), and could potentially match in the region of 2^100 paths.
As a (very arbitrary) rule of thumb, any query with a potential undirected and unlabelled path length of more than 5 or 6 may begin to cause problems unless you're very careful with your graph design. In your example, it looks like these two nodes are connected via a path length of 8. We also know that for any two nodes, the given minimum path length will be two (i.e., two -[:PART_OF]- relationships, one into and one out of a node labelled either :Email or :PhoneNumber), and that any two accounts, if linked, will be linked via an even number of relationships.
So, ideally we'd set our relationship length between 2 and 10. However, cypher's shortestPath() function only supports paths with a minimum length of either 0 or 1, so I've set it between 1 and 10 in the example below (even though we know that in reality, the shortest path has a length of at least two).
MATCH p=shortestPath((a1:Account { accId: {a1_id} })-[:PART_OF*1..10]-(a2:Account { accId: {a2_id} }))
RETURN *
Hopefully, this will work with your use case, but remember, it may still be very memory intensive to run on a large graph.
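For intuition about what a bounded, undirected shortest-path search does on a graph shaped like this, here is a minimal breadth-first sketch in plain Python. It is an illustration of the idea only, not how Neo4j implements shortestPath(). The function name shortest_path and the node ids in the usage note are invented for the example.

```python
from collections import deque


def shortest_path(edges, start, goal, max_hops):
    """Breadth-first search over undirected edges, giving up beyond max_hops.
    Returns the list of nodes on a shortest path, or None if none is found."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    parent = {start: None}
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        if depth[node] == max_hops:
            continue  # hop limit reached: do not expand any further
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:  # each node is visited once
                parent[neighbour] = node
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return None
```

Given the edges ("email:a", "acc:1234"), ("email:a", "acc:9"), ("phone:p", "acc:9") and ("phone:p", "acc:5678"), searching from "acc:1234" to "acc:5678" with a limit of 10 returns a path of four relationships, an even number as argued above. With a limit of 3 it returns None.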
Longer Term Solution: Refactor Graph and/or Use APOC
Depending on your use case, a better or longer term solution would be to refactor your graph to be more specific about relationships to speed up query times when you want to find accounts linked only by email or phone number - i.e. -[:ACCOUNT_HAS_EMAIL]- and -[:ACCOUNT_HAS_PHONE]-. You may then also want to use APOC's shortest path algorithms or path finder functions, which will most likely return a faster result than using cypher, and allow you to be more specific about relationship types as your graph expands to take in more data.

Getting a "slice" of a linked-list in neo4j

So, I have a structure that resembles a linked-list. Each node has a prev field for an id to the previous node, and I link them together using a chain relationship. There are some cases when a node is not part of this chain, i.e., its "prev" points to another node, but nothing points to it, or only 1 node points to it.
I want to take a "slice" of this list, only including the nodes that are directly linked. ie, from the point of node A, back to node B, return all nodes in between.
This is what I have so far
match (fb {id: A}) - [:chain] -> (eb {id:B})
return fb
However, it returns no results... I think I need it to go recursive in some way, but I'm not sure how to indicate that. I've tried using :chain*, but this tends to process forever. I think I need a way to limit it..
How do I do this?
What about this?
MATCH (fb {id: A})-[:chain*1..10]->(eb {id:B})
RETURN fb
That should limit it to 10 levels. You can change that if you like, obviously, but a larger limit affects performance.
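The same bounded walk can be sketched in plain Python if the prev ids are available client-side. This is only an illustration under that assumption. The function name slice_chain and the dict-of-prev-ids representation are hypothetical, and unlike *1..10 it also accepts the zero-hop case where start and stop are the same node.

```python
def slice_chain(prev, start, stop, max_hops=10):
    """Follow prev pointers from start back towards stop.
    Returns the nodes from start to stop inclusive, or None if stop is
    not reached within max_hops (or the chain ends first)."""
    nodes = [start]
    current = start
    for _ in range(max_hops):
        if current == stop:
            return nodes
        current = prev.get(current)
        if current is None:
            return None  # ran off the end of the chain
        nodes.append(current)
    return nodes if current == stop else None
```

For prev = {"A": "n1", "n1": "n2", "n2": "B", "stray": "n1"}, slice_chain(prev, "A", "B") returns ["A", "n1", "n2", "B"]. The node "stray" is left out because nothing on the walk from A points to it. The hop limit also guards against an accidental cycle in the prev ids.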
EDIT: Was just reading this guide to performance tuning:
http://neo4j.com/developer/guide-performance-tuning/
One bit that caught my eye:
If you’re using queries that will have a relatively large working set
(ie. will be traversing long paths, looking at lots of properties, or
collecting large sets of results in order to do sorting, etc) then
you’ll need a larger working heap. If you have small queries that do
very limited traversals and return small amounts of data, you need
less. Assume 1-2GB to start and tune from there

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious whether they are expressed in an optimal way. I have a feeling that when I execute such a query, Cypher first matches everything that fits the expression (which takes a lot of memory) and only then starts to count things. I would rather go through every node, check the pattern and update the counter if necessary. That way such queries would not require a lot of memory. So how is such a query actually executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many relationships between nodes other than LIKES relationships, this can be quite an extensive search.
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.
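The node-by-node counting the question asks about, and the matching steps described in this answer, can be mimicked in a few lines of Python. This is a conceptual sketch only and says nothing about how the Cypher planner actually executes the query. The function name count_unreciprocated_likes and the (source, target) pair representation are assumptions made for the example.

```python
def count_unreciprocated_likes(likes, other_relationships=()):
    """Count LIKES pairs (n, k) for which k has no outgoing relationship
    of any type back to n. Each argument is an iterable of (source, target)."""
    likes = list(likes)
    outgoing = {}
    for source, target in likes + list(other_relationships):
        outgoing.setdefault(source, set()).add(target)

    count = 0
    for n, k in likes:
        # Mirrors WHERE NOT (k)-->(n): any relationship type counts.
        if n not in outgoing.get(k, ()):
            count += 1  # only a running counter is kept, not the matches
    return count
```

Only the adjacency sets and a single integer are held in memory here, which is the behaviour the question hoped for. Whether the database does the same depends on the plan it chooses, which PROFILE can show.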

Resources