Cypher / Efficiency about relationship cardinality - neo4j

Using Neo4j 2.X and Cypher, I want to query all Users that I know directly or via a friend.
I would expect something like this:
MATCH (me:User("123"))-[:KNOWS*1..2]-(friend) //does not work of course
I think about the shortestPath function, but wouldn't it be too expensive?
Moreover, if I have this query:
MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123")) // would load the whole in memory before filtering by knowledge !
WITH shortestPath((me)-[:KNOWS*..2]-(friend)) as path
WHERE path.length <= 2
OR
MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123")) // would load the whole in memory before filtering by knowledge !
MATCH path = shortestPath((me)-[:KNOWS*..2]-(friend))
WHERE path.length <= 2
Wouldn't that be more expensive (perhaps prohibitively so on a huge graph)?
Indeed, this would be better, if it worked:
MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123"))-[:KNOWS*1..2]-(friend)
since it would load only the relevant paths into memory.
I could also use an alternative like this:
OPTIONAL MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123"))-[:KNOWS]-(friend)
OPTIONAL MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123"))-[:KNOWS]-()-[:KNOWS]-(friend)
but imagine if I wanted three degrees of separation (for knowledge)... the query would be very redundant.
Is there a good syntax that would lead to a very efficient query?
What should I use?

I'm not sure I completely understand, but I think your first query would work if written like this:
MATCH (me:User{userId:123})-[:KNOWS*1..2]-(friend:User)
WHERE me <> friend
RETURN friend
It's hard to know what to write for the other queries, as the OWNS_BY and SOME_REL components seem unrelated to the friend-of-a-friend component. If you could relate the two halves of the query with a concrete example, I can explain an optimal approach.
Some key pointers are that you should:
Start your queries with what you think will match the minimum set of nodes (to constrain the work that has to be done).
Make sure all query components utilise labels and relationship types.
Create indexes on properties that you will be using in lookups.
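The pointers above can be sketched as follows. This is only an illustration: it assumes the userId property from the query above, and it uses the CREATE INDEX ON syntax of Neo4j 2.x and 3.x (later versions use a different index syntax).

```cypher
// Schema index so that finding the starting user does not scan every :User node
CREATE INDEX ON :User(userId);

// Start from the single indexed node, then expand at most two KNOWS hops.
// Labels and the relationship type keep the expansion as narrow as possible.
MATCH (me:User {userId: 123})-[:KNOWS*1..2]-(friend:User)
WHERE me <> friend
RETURN DISTINCT friend;
```

DISTINCT is there because the same friend can be reached over more than one path.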
An excellent resource for query optimisation is Wes Freeman's Pragmatic Optimisation.
The size of the graph does not need to make the queries more expensive, as you will mostly be working on a subgraph, which presumably has more tightly bounded size. Of course, if your queries need to span the entire graph, then its size will become an issue for speed!

Related

Cypher: Find any path between nodes

I have a neo4j graph that looks like this:
Nodes:
Blue Nodes: Account
Red Nodes: PhoneNumber
Green Nodes: Email
Graph design:
(:PhoneNumber) -[:PART_OF]->(:Account)
(:Email) -[:PART_OF]->(:Account)
The problem I am trying to solve is to
Find any path that exists between Account1 and Account2.
This is what I have tried so far with no success:
MATCH p=shortestPath((a1:Account {accId:'1234'})-[]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=shortestPath((a1:Account {accId:'1234'})-[:PART_OF]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=shortestPath((a1:Account {accId:'1234'})-[*]-(a2:Account {accId:'5678'})) RETURN p;
MATCH p=(a1:Account {accId:'1234'})<-[:PART_OF*1..100]-(n)-[:PART_OF]->(a2:Account {accId:'5678'}) RETURN p;
Same queries as above without the shortest path function call.
By looking at the graph I can see there is a path between these 2 nodes but none of my queries yield any result. I am sure this is a very simple query but being new to Cypher, I am having a hard time figuring out the right solution. Any help is appreciated.
Thanks.
All those queries are along the right lines, but need some tweaking to make them work. In the longer term, though, to get a better system for easily searching for connections between accounts, you'll probably want to refactor your graph.
Solution for Now: Making Your Query Work
The path between any two (n:Account) nodes in your graph is going to look something like this:
(a1:Account)<-[:PART_OF]-(:Email)-[:PART_OF]->(ai:Account)<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->(a2:Account)
Since you have only one type of relationship in your graph, the two nodes will thus be connected by an indeterminate number of patterns like the following:
<-[:PART_OF]-(:Email)-[:PART_OF]->
or
<-[:PART_OF]-(:PhoneNumber)-[:PART_OF]->
So, your two nodes will be connected through an indeterminate number of intermediate (:Account), (:Email), or (:PhoneNumber) nodes, all connected by -[:PART_OF]- relationships of alternating direction. Unfortunately, to my knowledge (and I'd love to be corrected here), using plain Cypher you can't search for a repeated pattern like this in your current graph. So, you'll simply have to use an undirected search to find nodes (a1:Account) and (a2:Account) connected through -[:PART_OF]- relationships. At first glance, your query would look like this:
MATCH p=shortestPath((a1:Account { accId: {a1_id} })-[:PART_OF*]-(a2:Account { accId: {a2_id} }))
RETURN *
(notice here I've used Cypher parameters rather than the literal IDs you put in the original post)
That's very similar to your query #3, but, like you said - it doesn't work. I'm guessing what happens is that it doesn't return a result, or returns an out of memory exception? The problem is that since your graph has circular paths in it, and that query will match a path of any length, the matching algorithm will literally go around in circles until it runs out of memory. So, you want to set a limit, like you have in query #4, but without the directions (which is why that query doesn't work).
So, let's set a limit. Your limit of 100 relationships is a little on the large side, especially in a cyclic graph (i.e., one containing cycles), and could potentially match in the region of 2^100 paths.
As a (very arbitrary) rule of thumb, any query with a potential undirected and unlabelled path length of more than 5 or 6 may begin to cause problems unless you're very careful with your graph design. In your example, it looks like these two nodes are connected via a path length of 8. We also know that for any two nodes, the given minimum path length will be two (i.e., two -[:PART_OF]- relationships, one into and one out of a node labelled either :Email or :PhoneNumber), and that any two accounts, if linked, will be linked via an even number of relationships.
So, ideally we'd set our relationship length between 2 and 10. However, Cypher's shortestPath() function only supports paths with a minimum length of either 0 or 1, so I've set it between 1 and 10 in the example below (even though we know that in reality, the shortest path will have a length of at least two).
MATCH p=shortestPath((a1:Account { accId: {a1_id} })-[:PART_OF*1..10]-(a2:Account { accId: {a2_id} }))
RETURN *
Hopefully, this will work with your use case, but remember, it may still be very memory intensive to run on a large graph.
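For reference, here is a sketch of the same bounded search written with the $name parameter syntax, which replaced the {name} form in later Neo4j versions (3.x onwards). The parameter names a1_id and a2_id are the ones used above. Binding the two endpoints in their own MATCH is a stylistic choice, not a requirement.

```cypher
// Look up both endpoints first (ideally via an index on :Account(accId))
MATCH (a1:Account {accId: $a1_id}), (a2:Account {accId: $a2_id})
// Undirected, typed, and bounded to at most 10 hops
MATCH p = shortestPath((a1)-[:PART_OF*1..10]-(a2))
RETURN p
```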
Longer Term Solution: Refactor Graph and/or Use APOC
Depending on your use case, a better or longer term solution would be to refactor your graph to be more specific about relationships to speed up query times when you want to find accounts linked only by email or phone number - i.e. -[:ACCOUNT_HAS_EMAIL]- and -[:ACCOUNT_HAS_PHONE]-. You may then also want to use APOC's shortest path algorithms or path finder functions, which will most likely return a faster result than using cypher, and allow you to be more specific about relationship types as your graph expands to take in more data.
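As a rough illustration of the refactoring idea, and assuming the graph has been remodelled with the hypothetical relationship types named above, a search for accounts linked only through shared email addresses might look like this ($name parameters assume Neo4j 3.x or later):

```cypher
// Assumed refactored model: (:Email)-[:ACCOUNT_HAS_EMAIL]-(:Account)
MATCH (a1:Account {accId: $a1_id}), (a2:Account {accId: $a2_id})
// Only email links are followed, so phone-number hops are never expanded
MATCH p = shortestPath((a1)-[:ACCOUNT_HAS_EMAIL*1..10]-(a2))
RETURN p
```

Restricting the relationship type shrinks the set of paths the search has to consider, which is the main point of the refactoring.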

Cypher: retrieve all attached nodes of more than one type?

Beginner Cypher question. I know how to get all the nodes of a particular type attached to a particular person in my database. Here I am retrieving all the friends of a particular person, within 10 hops:
MATCH (rebecca:Person {name:"Rebecca"})-[r*1..10]->(friends:Friend)
RETURN rebecca, friends
But how would I extend this to get nodes of two types: either the friends, or the neighbours, of Rebecca?
You can filter on the labels of the matched node:
MATCH (rebecca:Person {name:"Rebecca"})-[r*1..10]->(other)
WHERE ANY( x IN ["Friend","Neighbour"] WHERE x IN labels(other) )
RETURN rebecca, other
NB: The answer from InverseFalcon is perfectly valid, here it is just another way to do this filter.
Note that this is not really ideal: FRIEND and NEIGHBOUR are semantically best described as relationships, and you can see here that when you move away from the natural way of thinking in a graph (relationships matter!), your queries suffer for it.
There isn't an OR we can use on the label in the MATCH itself, so you may have to filter with a WHERE clause:
MATCH (rebecca:Person {name:"Rebecca"})-[r*1..10]->(friendOrNeighbor)
WHERE friendOrNeighbor:Friend OR friendOrNeighbor:Neighbor
RETURN DISTINCT rebecca, friendOrNeighbor
Keep in mind that variable-length relationship matches like this are meant to find all possible paths up to the given maximum length, so this is actually doing extra work that you may not need, and it may be slow if there are many relationships within that local graph.
You may want to consider apoc.path.expandConfig() from APOC Procedures. If you use 'NODE_GLOBAL' for uniqueness, and specify the upper bound with maxLevel: 10, it's a much more efficient means of getting the nodes you want faster.
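A minimal sketch of that approach, assuming the APOC library is installed and the labels are spelled as in the query above. The relationshipFilter value ">" (any type, outgoing) is added here to mirror the direction of the original pattern.

```cypher
MATCH (rebecca:Person {name: "Rebecca"})
CALL apoc.path.expandConfig(rebecca, {
  relationshipFilter: ">",      // any relationship type, outgoing only
  minLevel: 1,
  maxLevel: 10,
  uniqueness: "NODE_GLOBAL"     // visit each node at most once
}) YIELD path
// Keep only the paths whose end node carries one of the wanted labels
WITH rebecca, last(nodes(path)) AS other
WHERE other:Friend OR other:Neighbor
RETURN DISTINCT rebecca, other
```

With NODE_GLOBAL uniqueness each node is reached through a single path only, which is where the saving over the plain variable-length match comes from.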

What is a path in neo4j cypher v2.0 and higher?

I read in the neo4j 2.0 cypher-refcard that "Paths are no longer collections, use nodes(path) or rels(path)."
What is a path now? Why the change? What consequence for path MATCHing does the change have, for example?
A path is a path. @DaveBennett answers what they are from the JSON perspective. Inside Cypher they're a special kind of object, which you can access in various ways (e.g. through nodes and rels). I find this clearer and more intuitive; if it were a collection, what would it be a collection of? Inevitably mixed types (e.g. node, rel, node, rel). Better that it should be its own object type, to discourage people from doing things like indexing into the even-numbered items and making assumptions about what they are.
Expanding on the previous answer, this (I think) further makes sense because of the syntax cypher uses for path binding, i.e.
MATCH p=(a)-[r]-(b) RETURN p.
Clearly in this example p is something special. The syntax pretty clearly indicates that a has to be a node, and r is definitely a relationship. Paths just aren't either of those things.
From a programming language perspective, it's good for "collections" to be uniformly typed. E.g. a programmer can know how to deal with a Collection<String>, this means each item in the collection plays by the semantic rules of a String. Making a path a collection would then be problematic, because it can't be a collection of any one type. When iterating through a path/collection, what would you do with each item? The answer is it would depend on what the item is, which tends to make for messy code.
Again, better to have paths be their own thing. Want to iterate over all of the nodes in the path? That's what nodes(p) is for, which will give you a uniformly typed collection. Extra bonus that it makes your cypher code more readable.
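As a small illustration of working with a path as its own object, the following sketch binds a path and then pulls uniformly typed collections out of it. The pattern and the aliases are arbitrary. relationships(p) is the long form of rels(p).

```cypher
MATCH p = (a)-[*1..3]-(b)
RETURN nodes(p)         AS pathNodes,  // collection of nodes only
       relationships(p) AS pathRels,   // collection of relationships only
       length(p)        AS hops        // number of relationships in the path
LIMIT 5
```

Note that the hop count comes from the length(p) function rather than a property-style access such as p.length.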
In some ways I'm "back-explaining" what the neo4j devs did. I didn't make this design decision, and I wasn't involved in it, so I'm not giving you the neo4j official answer why. This is just my explanation for why the design decision was (IMHO) a very good idea. It follows design patterns you see everywhere else, with certain advantages.

Neo4j with all relations between all nodes

I'm parsing a cypher query to a .gexf (xml) file. Entering this query in the Neo4j admin gui returns all nodes with their interconnecting relationships (relations between all b-nodes)
START a=node(52681) MATCH(a)-[r]-(b) RETURN a,r,b
The neo4j webgui seems to make its own queries, since it draws all the relationships between the b-nodes and not just those between the a and b-nodes. The JSON response contains no data from which I can build an XML file with the relationships between the b-nodes.
I've resolved this so far by doing a separate query for each and every b-node:
MATCH (a)-[r]-(b) WHERE id(a)=52681 AND id(b)=12345
But that doesn't seem like very good design... I would like to get this done in one query only.
Also, I tend to overcomplicate things.
I don't think there's an easy/efficient way to do this.
Consider that the paths between each pair of nodes are likely variable in size, and therefore something like (a)-[r]-(b) will only get you the results you want if a and b are both one degree away.
If they are, however, all only one degree away (and assuming no self-loops, which would be easy enough to take care of anyway), something like
MATCH (a)-[r]-(b) RETURN a, r, b
...would likely do the trick, albeit in a horribly inefficient fashion. But if your paths between a and b are > 1 level deep, it obviously won't work.
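For the one-degree case, a sketch that stays inside the neighbourhood of the start node (using the node id from the question) could look like the following. It is only an illustration and has not been checked against the asker's data.

```cypher
// b1 and b2 are both direct neighbours of a; r2 is a relationship between them
MATCH (a)-[]-(b1)-[r2]-(b2)-[]-(a)
WHERE id(a) = 52681
  AND id(b1) < id(b2)   // report each neighbour pair once, not in both orders
RETURN DISTINCT b1, r2, b2
```

Because Cypher does not reuse a relationship within a single pattern match, r2 is always distinct from the two relationships that attach b1 and b2 to a.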
In that case, something like this might work, but again be horrible:
MATCH (a)-[r*]-(b) RETURN a, r, b
...but if the depth of your paths is anything more than a few levels, well...ouch.
When you start asking questions of the graph that span the entire graph and require working/traversing the entirety of it, the kinds of questions you're asking start to blow up a bit.
So, likely, the resolution you came up with is probably the only way to really tackle this.
That said, I'd love to know if anyone else has a different take on this.
HTH, if only a bit.

Create Unique Relationship is taking much amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
Iam consisting of nearly 1,80,000 names nodes, i had iterated the above process to create unique relationships above 100 times by changing the target. its taking too much amount of time.How can i resolve it..
i build the query with java and iterated.iam using neo4j 2.0.0.5 and java 1.7 .
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT (names)-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH (names:Name) will still bind all nodes in the database, so it'll probably still be slow.
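As one example of the 2.x schema tools mentioned above, a label index on the target lookup property might be created like this. The Target label and target_name property are taken from the query above. The syntax shown is the Neo4j 2.x form.

```cypher
// Lets the target lookup use an index instead of scanning all :Target nodes
CREATE INDEX ON :Target(target_name);

// The lookup that would benefit from the index
MATCH (target:Target)
WHERE target.target_name = "TARGET_1"
RETURN target;
```

This only speeds up finding the target. The scan over the name nodes is unaffected.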
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.
