What is a path in neo4j cypher v2.0 and higher? - neo4j

I read in the Neo4j 2.0 Cypher refcard that:
Paths are no longer collections, use nodes(path) or rels(path).
What is a path now? Why the change? What consequence for path MATCHing does the change have, for example?

A path is a path. #DaveBennett answers what they are from the JSON perspective. Inside Cypher they're a special kind of object, which you can access in various ways (e.g. through nodes and rels). I find this clearer and more intuitive: if a path were a collection, what would it be a collection of? Inevitably mixed types (e.g. node, rel, node, rel). Better that it should be its own object type, to discourage people from doing things like indexing into even-numbered items on the basis of assumptions about the layout.
Expanding on the previous answer, this (I think) further makes sense because of the syntax cypher uses for path binding, i.e.
MATCH p=(a)-[r]-(b) RETURN p.
Clearly in this example p is something special. The syntax pretty clearly indicates that a has to be a node, and r is definitely a relationship. Paths just aren't either of those things.
From a programming language perspective, it's good for "collections" to be uniformly typed. E.g. a programmer can know how to deal with a Collection<String>, this means each item in the collection plays by the semantic rules of a String. Making a path a collection would then be problematic, because it can't be a collection of any one type. When iterating through a path/collection, what would you do with each item? The answer is it would depend on what the item is, which tends to make for messy code.
Again, better to have paths be their own thing. Want to iterate over all of the nodes in the path? That's what nodes(p) is for, which will give you a uniformly typed collection. Extra bonus that it makes your cypher code more readable.
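To make the typing argument concrete, here is a minimal Python sketch (not Neo4j code) of a path as its own type with uniformly typed accessors. The class names Node, Rel and Path are made up for illustration; only the idea of nodes(p) and rels(p) comes from Cypher.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    name: str


@dataclass(frozen=True)
class Rel:
    type: str


class Path:
    """A path as its own type, not a mixed list of nodes and relationships."""

    def __init__(self, start: Node, steps: List[Tuple[Rel, Node]]):
        self._start = start
        self._steps = list(steps)

    def nodes(self) -> List[Node]:
        # Uniformly typed: every item is a Node.
        return [self._start] + [node for _, node in self._steps]

    def rels(self) -> List[Rel]:
        # Uniformly typed: every item is a Rel.
        return [rel for rel, _ in self._steps]

    def __len__(self) -> int:
        # Path length is the number of relationships.
        return len(self._steps)


p = Path(Node("a"), [(Rel("KNOWS"), Node("b")), (Rel("LIKES"), Node("c"))])
node_names = [n.name for n in p.nodes()]
rel_types = [r.type for r in p.rels()]
```

A caller never has to ask "is this item a node or a relationship?", which is the messy branching the mixed-collection design would force.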
In some ways I'm "back-explaining" what the neo4j devs did. I didn't make this design decision, and I wasn't involved in it, so I'm not giving you the neo4j official answer why. This is just my explanation for why the design decision was (IMHO) a very good idea. It follows design patterns you see everywhere else, with certain advantages.

Related

Neo4j data modeling: correct way to specify a source for a statement?

I'm working on a scientific database that contains model statements such as:
"A possible cause of Fibromyalgia is Microglial hyperactivity, as supported by these 10 studies: [...] and contradicted by 1 study [...]."
I need to specify a source for statements in Neo4j and be able to do lookups in both directions, like:
Find all statements supported by a study
Find all studies supporting a statement
The most immediate idea I had is to use the DOI of studies as unique identifiers in the relationship property. The big con of this idea is that I have to scan all the relationships to find the list of all statements supported by a study.
So, since it is impossible to make a link between a study and a relationship, I had the idea to make 2 links, at each extremity of the relationship. The obvious con is that it does not give information about the relationship, like "support" or "contradict".
So, I came to the conclusion that I need a node for the hypothesis:
However, it overloads the graph, and we are no longer in the classical node -relationship-> node design that makes property graphs so easy to understand.
Using RDF, it is possible to add properties to relationships using subgraphs, however there we enter semantic graphs and quad stores, which is a more complex tool.
So I'm wondering if there is a "correct" design pattern for Neo4j to support this type of need that I may not have imagined instead?
Thanks
Based on your requirements, I think putting support_study as a property of the edge will do the job:
Thus we could query the following as:
Find all statements supported by a study
MATCH ()-[e:has_cause{support_study: "doi_foo_bar"}]->()
RETURN e;
Find all studies supporting a statement
Given statement is “foo” is caused by “bar”
MATCH (v:disease{name: "foo"})-[e:has_cause]->(v1:sympton{name: "bar"})
RETURN DISTINCT e.support_study;
Note that this is mostly based on NebulaGraph, which:
speaks Cypher DQL (together with nGQL)
supports properties on edges
uses a 4-tuple (src, dst, edge_type, rank), rather than a key, to distinguish an edge. The rank is a design unique to NebulaGraph that allows multiple has_cause edge instances between one disease -> sympton pair; you could put the hash of the DOI or some other number in the rank field (or omit it, of course, in which case it will be 0)
is distributed and open source (Apache 2.0)
Note:
In NebulaGraph, index should be created on has_cause(support_study) and disease(name), ref: https://www.siwei.io/en/nebula-index-explained/ and https://docs.nebula-graph.io/3.2.0/3.ngql-guide/14.native-index-statements/
But, I think it applies to neo4j, too :)
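As a rough illustration of why the index on support_study matters, the Python sketch below (not a graph database) stores one edge per supporting study and builds a lookup in each direction. Without the first dictionary, finding the statements for a study means scanning every edge, which is the concern raised in the question. The DOIs and names are placeholders, not real data.

```python
from collections import defaultdict

# Hypothetical edges: (source, edge_type, target, support_study).
# One edge instance per supporting study, as the rank field would allow.
edges = [
    ("foo", "has_cause", "bar", "doi_foo_bar"),
    ("foo", "has_cause", "bar", "doi_other"),
    ("foo", "has_cause", "qux", "doi_foo_bar"),
]

statements_by_study = defaultdict(set)   # plays the role of an index on support_study
studies_by_statement = defaultdict(set)  # plays the role of matching on the endpoints

for src, edge_type, dst, doi in edges:
    statement = (src, edge_type, dst)
    statements_by_study[doi].add(statement)
    studies_by_statement[statement].add(doi)

supported = statements_by_study["doi_foo_bar"]
supporting = studies_by_statement[("foo", "has_cause", "bar")]
```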

Cypher query: Is it possible to "hide" an existing path with a "virtual relationship"?

We are working on a project trying to map a structure like Java code connections with Neo4j 2.1.5. We have succeeded in connecting Applications-Jars-Classes-Methods and can for example get a Cypher answer resulting in:
App1-->Jar1-->Class1-->Method1-->Method2-->Method3<--Class22<--Jar2<--App1
Now we would like to get a condensed answer showing which Jars are connected like this, "hiding" the existing path above:
Jar1--Jar2
Is it possible with Cypher to get this result without creating a new Relationship like
Jar1-[:PATH_EXISTS]-Jar2
We can't find anything related to collapsing/hiding paths in the manual or here on Stack Overflow.
Regards
Christofer
There are basically two ways of going about this.
The first is to explicitly create the new relationship, but I won't talk about this that much because it seems you've thought of that and rejected it. That method is easy, but more disk intensive (depending on the size of your graph)
The second is simply to query for the path when needed, with a variable length path like this:
MATCH (jar1 {myid: "something"})-[*]->(jar2 {myid: "somethingelse"})
RETURN jar2;
This will get you what you need, but it requires that this distant path be recomputed every time it's needed. So, it's easy, but it's compute intensive.
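What the variable-length match has to compute is essentially a reachability search. The Python sketch below (not Neo4j code) runs a breadth-first search over the edges from the question's example path. It also shows one thing worth checking against your own data: because that path changes direction at Method3, a strictly directed search from Jar1 does not reach Jar2, while an undirected one does. The function name reachable is made up for illustration.

```python
from collections import deque

# Directed edges taken from the example path in the question.
edges = [
    ("App1", "Jar1"), ("Jar1", "Class1"), ("Class1", "Method1"),
    ("Method1", "Method2"), ("Method2", "Method3"),
    ("Class22", "Method3"), ("Jar2", "Class22"), ("App1", "Jar2"),
]


def reachable(start, goal, directed=True):
    """Breadth-first search: is there any path from start to goal?"""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        if not directed:
            # Ignore direction by also adding the reverse edge.
            adjacency.setdefault(dst, set()).add(src)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for neighbour in adjacency.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


directed_hit = reachable("Jar1", "Jar2", directed=True)
undirected_hit = reachable("Jar1", "Jar2", directed=False)
```

The search is repeated on every call, which is the "compute intensive" side of the trade-off; storing an explicit relationship trades that cost for disk space.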
Now, more broadly what it sounds like you want is something like a graph inference engine. In the OWL/RDF world, people will create ontologies that describe different types of entities, and the relationships between them. One of the consequences of these relationships is that they can be transitive and can have implications on them. A classic example is that a person is an entity, and things like motherOf and fatherOf are relationships between. So if you have a path of fatherOf relationships between nodes, i.e. (A)-[:fatherOf]->(B)-[:fatherOf]->(C), the inference engine will return the "fact" that (A) and (C) are related by family. This would be a consequence of your ontological definition. That "fact" wouldn't actually be in the RDF store, it would simply be entailed by the facts.
In your case, you'd do something like writing an ontology that specified that all of the individual relationships you have in your graph are a specialization of some relationship type (like "related to"). You'd then ask the reasoner if a "related to" relationship exists between Jar1 and Jar2, and the answer would be yes because of your ontological definitions.
OK, so the bad news is that neo4j isn't RDF and doesn't do this. Also, doing this sort of thing is way harder than I'm making it sound; correct ontology modeling is an art unto itself, not unlike logic programming from the prolog world of the 1970s. But basically, that kind of inference is what it sounds like you're looking for.
What I think you might be able to hope for in some future release of neo4j is something akin to a database "view", or better schema support. I.e. it ought to be possible to specify that whenever a certain relationship pattern holds, some other relationship ought also be present.

Cypher / Efficiency about relationship cardinality

Using Neo4j 2.X and Cypher, I want to query all Users that I know directly or via a friend.
I would expect something like this:
MATCH (me:User("123"))-[:KNOWS*1..2]-(friend) //does not work of course
I think about the shortestPath function, but wouldn't it be too expensive?
Moreover, if I have this query:
MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123")) // would load the whole in memory before filtering by knowledge !
WITH shortestPath((me)-[:KNOWS*..2]-(friend)) as path
WHERE path.length <= 2
OR
MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123")) // would load the whole in memory before filtering by knowledge !
MATCH path = shortestPath((me)-[:KNOWS*..2]-(friend))
WHERE path.length <= 2
Wouldn't it be more expensive (maybe too expensive in the case of a huge graph)?
Indeed, this would be better, if it worked:
MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123"))-[:KNOWS*1..2]-(friend)
loading only the appropriate paths into memory.
I could also use an alternative like this:
OPTIONAL MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123"))-[:KNOWS]-(friend)
OPTIONAL MATCH (a)-[:SOME_REL]->(b)<-[:OWNS_BY]-(me:User("123"))-[:KNOWS]-()-[:KNOWS]-(friend)
but imagine if I wanted three degrees of separation (for knowledge)... the query would be very redundant.
Is there a good syntax that would lead to a very efficient query?
What should I use?
I'm not sure I completely understand, but I think your first query would work if written like this:
MATCH (me:User{userId:123})-[:KNOWS*1..2]-(friend:User)
WHERE me <> friend
RETURN friend
It's hard to know what to write for the other queries, as the OWNS_BY and SOME_REL components seem unrelated to the friend-of-a-friend component. If you could relate the two halves of the query with a concrete example, I can explain an optimal approach.
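For intuition about what [:KNOWS*1..2] with me <> friend returns, and why the work is bounded by the neighbourhood rather than the whole graph, here is a small Python sketch of a depth-limited expansion. It is an illustration of the idea, not how Neo4j executes the query; the function name within_hops and the sample names are made up.

```python
def within_hops(knows, me, max_hops):
    """Return everyone reachable from `me` in 1..max_hops undirected KNOWS steps."""
    adjacency = {}
    for a, b in knows:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {me}       # excludes `me` from the result, like WHERE me <> friend
    frontier = {me}
    found = set()
    for _ in range(max_hops):
        # Expand one hop, keeping only people not visited before.
        frontier = {n for person in frontier for n in adjacency.get(person, ())} - seen
        seen |= frontier
        found |= frontier
    return found


knows = [("me", "ann"), ("ann", "bob"), ("bob", "cat"), ("me", "dan")]
friends = within_hops(knows, "me", 2)
```

Raising max_hops to 3 is a one-character change here, just as *1..3 is in Cypher, which avoids the redundant OPTIONAL MATCH clauses from the question.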
Some key pointers are that you should
Start your queries with what you think will match the minimum set of nodes (to constrain the work that has to be done).
Make sure all query components utilise labels and relationship types.
Create indexes on properties that you will be using in lookups.
An excellent resource for query optimisation is Wes Freeman's Pragmatic Optimisation.
The size of the graph does not need to make the queries more expensive, as you will mostly be working on a subgraph, which presumably has more fixed bounds. Of course, if your queries need to span the entire graph, then its size will become an issue for speed!

Cypher match path based on previous relationship

I am trying to impose restrictions on my path match pattern.
I would like to match the next relationship based on the type of the previous used relationship.
Here is an example of a simplified Database:
(A)-1-(B)-2-(C)-1-(E)-2-(F)
       |           |
       3----(D)----3
Query 1:
start n=node(A), m=node(F)
match p=n-[r*]-m
return p
should return both paths
(A)-1-(B)-2-(C)-1-(E)-2-(F)
(A)-1-(B)-3-(D)-3-(E)-2-(F)
However, when running the query starting from node (F):
start m=node(F),n=node(A)
match p=m-[r*]-n
return p
The result should be only:
(F)-2-(E)-1-(C)-2-(B)-1-(A)
Path
(F)-2-(E)-3-(D)-3-(B)-1-(A)
should not be valid, since it violates these constraints:
Coming from a -1- type relationship you can proceed to either a -2- or -3- relationship.
Coming from a -2- or -3- type relationship you can only proceed to a -1- relationship.
These paths are valid:
()-1-()-2-()
()-1-()-3-()
()-2-()-1-()
()-3-()-1-()
These paths are not valid:
()-3-()-2-()
()-2-()-3-()
First, upvote for the very detailed, specific, and well laid out question.
Unfortunately, I don't think it's possible to do what you want to do with Cypher, I think you need the Traversal API to do this. Cypher is a declarative language; that is, you tell it what you want, and it goes and gets it for you. Using the traversal API is an imperative approach to query; that is, you tell neo4j exactly how to traverse the graph.
Your query here imposes constraints about the order in which relationships get traversed, and what makes a valid path. Nothing wrong with that, but I believe that imposing constraints on the order of traversal implicitly means you're telling cypher which way to traverse, and you just can't do that with cypher because it's declarative. Another common example of the declarative vs. imperative thing is breadth-first vs. depth-first search. If you're looking for certain nodes, you can't tell cypher to traverse breadth-first vs. depth-first; you just tell it which nodes you want, and it goes and gets them.
Now, paths can be treated like collections in cypher via the relationships() function. And you can use the filter and reduce functions to work with individual relationships. But your query is harder in that you need code that says something like "If the first relationship in a path is a 1, then the next must be a 2 or a 3". This is exactly the sort of thing that you can do with Evaluators in the Traversal API. Check the interface and you can see how you could write your own method that would implement exactly the logic you're talking about via Evaluator#evaluate(Path path).
As a general note, because declarative query (cypher) hides traversal details from you, IMHO it's always better to use declarative query for ease, if you can specify what you want declaratively. But there are cases where you have to control the order of traversal, and for that you need traversal. (I have had cases where I need all nodes connected to something else, via breadth-first search only, to a maximum depth of 3, along complex relationship criteria -- I couldn't use cypher for that).
To give you a way forward, check the link I provided on the traversal framework. Perhaps you can describe your query as a TraversalDescription, and then hand it off to neo4j to run.
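The rule itself is easy to state once you look at consecutive pairs of relationship types. The Python sketch below is not the Traversal API; it only mirrors the kind of check a custom Evaluator could run against the relationships of the path seen so far. The names ALLOWED_NEXT and is_valid_path are made up, and relationship types are represented as the integers 1, 2 and 3 from the question.

```python
# Which relationship types may follow a given type, per the question's rules.
ALLOWED_NEXT = {
    1: {2, 3},  # coming from -1- you may proceed to -2- or -3-
    2: {1},     # coming from -2- you may only proceed to -1-
    3: {1},     # coming from -3- you may only proceed to -1-
}


def is_valid_path(rel_types):
    """Check every consecutive pair of relationship types along a path."""
    return all(
        nxt in ALLOWED_NEXT[prev]
        for prev, nxt in zip(rel_types, rel_types[1:])
    )


valid_examples = [[1, 2], [1, 3], [2, 1], [3, 1]]
invalid_examples = [[3, 2], [2, 3]]
results = [is_valid_path(p) for p in valid_examples + invalid_examples]
```

In a traversal, the same check would only need to look at the last two relationships of the path at each step, so invalid branches can be pruned early instead of being filtered after the fact.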

Create Unique Relationship is taking much amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have nearly 180,000 names nodes, and I have run the above query more than 100 times, changing the target each time, to create the unique relationships. It is taking far too much time. How can I resolve this?
I build the query in Java and run it in a loop. I am using Neo4j 2.0.0.5 and Java 1.7.
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH (names:Name) will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.

Resources