Optimizing a Cypher query to improve performance - neo4j

The query I've written returns accurate results based on some random testing I've done. However, it takes a very long time to execute (7699.43 s).
I need help optimising this query.
count(Person) -> 67895
count(has_POA) -> 355479
count(POADocument) -> 40
count(issued_by) -> 40
count(Company) -> 21
count(PostCode) -> 9845
count(Town) -> 1673
count(in_town) -> 9845
count(offers_services_in) -> 17107
All the entity nodes are indexed on Id's (not Neo4j IDs). The PostCode nodes are also indexed on PostCode.
MATCH pa=(p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company),
(pc:PostCode), (t:Town)
WHERE r.recipient_postcode=pc.PostCode
AND (pc)-[:in_town]->(t)
AND NOT (c)-[:offers_services_in]->(t)
RETURN p as Person, r as hasPOA, t as Town, d as POA, c as Company
Much thanks in advance!
-Nancy

I made some changes in your query:
MATCH (p:Person)-[r:has_POA {recipient_postcode : {code} }]->(d:POADocument)-[:issued_by]->(c:Company),
(pc:PostCode {PostCode : {PostCode} })-[:in_town]->(t:Town)
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p as Person, r as hasPOA, t as Town, d as POA, c as Company
Since you are not using the entire path, I removed the pa variable.
I moved the pattern existence check ((pc)-[:in_town]->(t)) from WHERE to MATCH.
I used parameters instead of the equality check r.recipient_postcode = pc.PostCode in WHERE. Note that this restricts the query to a single postcode per run, and that both parameters ({code} and {PostCode}) must be set. If you are running the query in Neo4j Browser, you can set parameters with the :params command, e.g. :params {code : 10}.

Here is a simplified version of your current query.
MATCH (p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company)
MATCH (t:Town)<-[:in_town]-(pc:PostCode{PostCode:r.recipient_postcode})
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p as Person,r as hasPOA,t as Town, d as POA,c as Company
Your big performance hits are going to be on the Cartesian product between all the match sets, and the raw amount of data you are asking for.
In this simplified version, I'm using one less match, and the second match uses a variable from the first match to avoid generating a Cartesian product. I would also recommend using LIMIT and SKIP to page your results to limit data transfer.
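As a rough sketch of the paging suggestion, the simplified query could be run one page at a time. The page size of 100 and the ordering on an Id property are assumptions (the question only says the nodes are indexed on Ids, so the actual property name may differ); some stable ordering is needed for SKIP and LIMIT to return consistent pages.

```cypher
// First page of 100 rows; increase SKIP by the page size for later pages.
MATCH (p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company)
MATCH (t:Town)<-[:in_town]-(pc:PostCode {PostCode: r.recipient_postcode})
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p AS Person, r AS hasPOA, t AS Town, d AS POA, c AS Company
ORDER BY Person.Id   // assumed property name, adjust to the real Id property
SKIP 0 LIMIT 100
```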
If you can adjust your model, I would recommend converting the has_POA relation to an issued_POA node so that you can take advantage of Neo4j's relation finding on the 2 postcodes related to that instance, and making the second match a gimme instead of an extra indexed search (after you adjust the query to match the new model, of course).

Related

Checking if subgraph fulfill condition as a whole

I find it hard to explain, so consider the following picture
I'm trying to select all products that fulfill the warehouse requirements
In this example I need to select all products that have a maximum size of 5 AND maximum weight of 10.
To simplify, I only have MAX (no MIN or EQ) constraints, so the operator can be hardcoded.
I've tried to group the requirement subgraph using COLLECT and using the ALL operator, but failed.
Query to create the graph
CREATE
// NODES
(warehouse:WAREHOUSE{name:'My Warehouse'}),
(smallProduct:PRODUCT{name:'Small Product'}),
(largeProduct:PRODUCT{name:'Large Product'}),
(size:CONSTRAINT{name:'Size'}),
(weight:CONSTRAINT{name:'Weight'}),
// RELATIONSHIPS
(warehouse)-[:LIMIT{value:5}]->(size),
(warehouse)-[:LIMIT{value:5}]->(weight),
(smallProduct)-[:AMOUNT{value:3}]->(size),
(smallProduct)-[:AMOUNT{value:2}]->(weight),
(largeProduct)-[:AMOUNT{value:10}]->(size),
(largeProduct)-[:AMOUNT{value:4}]->(weight)
UPDATE
The following query apparently solves the problem:
MATCH (warehouse:WAREHOUSE)
MATCH rel = ((warehouse)-[limit:LIMIT]->(constraint:CONSTRAINT)<-[amount:AMOUNT]-(product:PRODUCT))
WITH warehouse, product, collect(relationships(rel)) as paths
WHERE all( p in paths WHERE p[0].value > p[1].value )
return product
I am wondering if there is a better solution.
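One possible alternative, offered only as a sketch, avoids collecting the relationship lists: count the constraints a product shares with the warehouse and compare that with the number of constraints it satisfies. It keeps the same strict comparison (limit greater than amount) as the query above; the aliases shared and satisfied are names introduced here for illustration.

```cypher
// For each warehouse/product pair, keep the product only if every
// shared constraint has a LIMIT value greater than the AMOUNT value.
MATCH (warehouse:WAREHOUSE)-[limit:LIMIT]->(:CONSTRAINT)<-[amount:AMOUNT]-(product:PRODUCT)
WITH warehouse, product,
     count(*) AS shared,
     sum(CASE WHEN limit.value > amount.value THEN 1 ELSE 0 END) AS satisfied
WHERE shared = satisfied
RETURN product
```

With the sample graph above, only 'Small Product' passes (3 < 5 and 2 < 5), while 'Large Product' fails on size (10 is not below 5).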

Nested unions in Cypher/Neo4j

I've this metagraph in Neo4j:
:Protein ---> :is_a ---------------> :Enzyme ---> :activated_by|inhibited_by ---> :Compound
\<-- :activated_by <---/
:Compound --> :consumed_by|:produced_by ---> :Transport
\<---- :catalyzed_by -------<---/
:Transport --> part_of ---> :Pathway
(yes, it's biology, yes, it's imported from BioPAX).
I want to use Cypher to return all pairs of (:Protein, :Compounds) that are linked by some graph path. While one simple way would be to follow each possible path between the two node types and issue one query per each of them, clearly a query using some pattern union would be more compact (maybe slower, I'm assessing the performance of different approaches).
But how to write such a query? I've read that Cypher has a UNION construct, but it isn't clear to me how to combine and nest them. E.g., regarding the subgraphs from enzymes to transports, I would like to be able to write something equivalent to this informal expression:
enz:Enzyme :activated|inhibited_by comp:Compound
join with: (
(comp :consumed_by|produced_by :Transport)
UNION (:Transport :catalyzed_by comp )
)
I've read there should be some way, but I didn't get much of it and I'm trying to understand if there is a relatively easy way (in SPARQL or SQL the above is rather simple).
Using Periodic Collect
In Cypher, you can break a query into steps with WITH, and you can join two lists by concatenating them together.
MATCH (e:Enzyme)-[:activated_by]->(compA:Compound), (e)-[:inhibited_by]->(compB:Compound)
WITH e, COLLECT(compA)+COLLECT(compB) as compList
UNWIND compList as comp
WITH DISTINCT e, comp // if a comp can appear in both lists
MATCH ... // Repeat above at each path step
Using Union
When using UNION to combine different queries, think of it as a comma-separated list of queries in which the word UNION takes the place of the commas. (Also, every query in this list has to have the same return columns.)
MATCH (e:Enzyme)-[r1:activated_by]->(comp:Compound)-[r2:consumed_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
UNION
MATCH (e:Enzyme)-[r1:inhibited_by]->(comp:Compound)-[r2:produced_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
// Just to show only return names have to match
UNION
WITH "rawr" as a
RETURN a as protein, 51 as compound, NULL as transport
This is good for combining the results of completely different queries, but since the queries you are combining are usually related, most of the time COLLECT will be more efficient, and gives you better control over the results.
Using OR
You can get the name of a relationship with the TYPE function and filter on that.
MATCH (e:Enzyme)-[r1]->(comp:Compound)-[r2]->(trans:Transport)
WHERE (TYPE(r1) = "activated_by" OR TYPE(r1) = "inhibited_by") AND (TYPE(r2) = "consumed_by" OR TYPE(r2) = "produced_by")
RETURN *
Note: To use OR on just the relationship type, you can also write -[:A|:B]->.
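For illustration, the TYPE()-based filter can be written with relationship-type alternation instead. This is a sketch that assumes the relationship types are named exactly as in the question's metagraph (activated_by, inhibited_by, consumed_by, produced_by).

```cypher
// Same filter expressed in the pattern itself: r1 may be either of the
// first two types, r2 either of the last two.
MATCH (e:Enzyme)-[r1:activated_by|inhibited_by]->(comp:Compound)-[r2:consumed_by|produced_by]->(trans:Transport)
RETURN *
```

Putting the types in the pattern lets the planner expand only relationships of those types instead of expanding everything and filtering afterwards.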
Using Path Filtering
As of Neo4j 3.1.x, Cypher is fairly good at free path finding (by that, I mean finding a valid path without searching all possible paths). The upper bound on the variable-length relationship pattern is not strictly necessary, but it is good for preventing or controlling runaway queries.
MATCH p=(e:Protein)-[r1*..10]->(c:Compound)
WHERE ALL(r in RELATIONSHIPS(p) WHERE TYPE(r) in ["activated_by","inhibited_by","produced_by","consumed_by"])
RETURN e as protein, c as compound
Other
If the relationship type doesn't actually matter, you can just use (e:Enzyme)-->(c:Compound) or, to ignore direction, (e:Enzyme)--(c:Compound).
If it is an option, I would recommend refactoring your schema to be more consistent (or add a relation type relevant to this matching criteria) so that you don't need to union results. (This will give you the best performance, since the Cypher planner will better know how to quickly find your results)
The query below will return all paths in which an Enzyme is activated or inhibited by a Compound that is consumed, produced, or catalyzed by a Transport.
The query's r2 relationship pattern is non-directional, since the directionality of catalyzed_by is opposite that of consumed_by and produced_by.
MATCH p=
(e:Enzyme)-[r1:activated_by|:inhibited_by]->
(comp:Compound)-[r2:consumed_by|:produced_by|:catalyzed_by]-
(trans:Transport)
RETURN p;
So, I still haven't found a satisfactory way to do what I initially asked, but I got quite close to it and wish to report on it.
First, my initial graph was a bit different than the one I actually have, so I'll base my next examples on the real one (sorry for the confusion):
I can work branches quite comfortably with paths in the WHERE clause (plus, in simpler cases, UNIONS):
// Branch 2
MATCH (prot:Protein), (enz:Enzyme), (tns:Transport) - [:part_of] -> (path:Path)
WHERE ( (enz) - [:ac_by|:in_by] -> (:Comp) - [:pd_by|:cs_by] -> (tns) // Branch 2.1
 OR (tns) - [:ca_by] -> (enz) ) // Branch 2.2 (pt1)
AND ( (prot) - [:is_a] -> (enz) OR (prot) <- [:ac_by] - (enz) ) // Branch 2.2 (pt2)
RETURN prot, path LIMIT 30
UNION
// Branch 1
MATCH (prot:Protein) - [:pd_by|:cs_by] -> (:Reaction) - [:part_of] -> (path:Path)
RETURN prot, path LIMIT 30
(I'm also sorry for all those abbreviations, e.g., pd_by is produced_by, ac_by is activated_by and so on).
This query yields results in about 1 minute. Too long and that's clearly due to the way the query is interpreted, as one can see from its plan:
I really can't understand why there are those huge Cartesian products. I've tried the WITH/UNWIND approach, but I've been unable to get the correct results (see my comments above, thanks @Tezra and @cybersam) and, even if I had been, the syntax is very difficult.

Neo4j query for complete paths

I have the following structure.
CREATE
(`0` :Sentence {text:'This is a sentence'}) ,
(`1` :Word {text:'This'}) ,
(`2` :Word {text:'is'}) ,
(`3` :Sentence {text:'Sam is a dog'}) ,
(`0`)-[:`RELATED_TO`]->(`1`),
(`0`)-[:`RELATED_TO`]->(`2`),
(`3`)-[:`RELATED_TO`]->(`2`)
schema example
So my question is this. I have a bunch of sentences that I have decomposed into word objects. These word objects are all unique, so a single word can point to several different sentences. If I search for one word, it's very easy to figure out all of the sentences that word is related to. How can I structure a query to figure out the same information for two words instead of one?
I would like to submit two or more words and find a path that includes all words submitted picking up all sentences of interest.
I just remembered an alternate approach that may work better. Compare the PROFILE on this query with the profiles for the others, see if it works better for you.
WITH {myListOfWords} as wordList
WITH wordList, size(wordList) as wordCnt
MATCH (s)-[:RELATED_TO]->(w:Word)
WHERE w.text in wordList
WITH s, wordCnt, count(DISTINCT w) as cnt
WHERE wordCnt = cnt
RETURN s
Unfortunately, it's not a very pretty approach: it basically comes down to collecting :Word nodes and using the ALL() predicate to ensure that the pattern you want holds true for all elements of the collection.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH collect(w) as words
MATCH (s:Sentence)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
What makes this ugly is that the planner isn't intelligent enough right now to infer that when you say MATCH (s:Sentence) WHERE ALL(word in words ... that the initial matches for s ought to come from the match from the first w in your words collection, so it starts out from all :Sentence nodes first, which is a major performance hit.
So to get around this, we have to explicitly match from the first of the words collection, and then use WHERE ALL() for the remaining.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH w, size(()-[:RELATED_TO]->(w)) as rels
WITH w ORDER BY rels ASC
WITH collect(w) as words
WITH head(words) as head, tail(words) as words
MATCH (s)-[:RELATED_TO]->(head)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
EDIT:
Added an optimization to order your w nodes by the degree of their incoming :RELATED_TO relationships (this is a degree lookup on very few nodes), as this will mean the initial match to your :Sentence nodes is the smallest possible starting set before you filter for relationships from the rest of the words.
As an alternative, you could consider using manual indexing (also called "legacy indexing") instead of using Word nodes and RELATED_TO relationships. Manual indexes support "fulltext" searches using lucene.
There are many apoc procedures that help you with this.
Here is an example that might work for you. In this example, I assume case-insensitive comparisons are OK, you retain the Sentence nodes (and their text properties), and you want to automatically add the text properties of all Sentence nodes to a manual index.
If you are using neo4j 3.2+, you have to add this setting to the neo4j.conf file to make some expensive apoc.index procedures (like apoc.index.addAllNodes) available:
dbms.security.procedures.unrestricted=apoc.*
Execute this Cypher code to initialize a manual index named "WordIndex" with the text from all existing Sentence nodes, and to enable automatic indexing from that point onwards:
CALL apoc.index.addAllNodes('WordIndex', {Sentence: ['text']}, {autoUpdate: true})
YIELD label, property, nodeCount
RETURN *;
To find (case insensitively) the Sentence nodes containing all the words in a collection (passed via a $words parameter), you'd execute a Cypher statement like the one below. The WITH clause builds the lucene query string (e.g., "foo AND bar") for you. Caveat: since lucene's special boolean terms (like "AND" and "OR") are always in uppercase, you should make sure the words you pass in are lowercased (or modify the WITH clause below to use the TOLOWER() function as needed).
WITH REDUCE(s = $words[0], x IN $words[1..] | s + ' AND ' + x) AS q
CALL apoc.index.search('WordIndex', q) YIELD node
RETURN node;

Neo4J - How to do post processing UNION (pagination)

I'm writing a cypher query to load data from my Neo4J DB, this is my data model
So basically, what I want is a query that returns a Journal with all of its properties and everything related to it. I've tried the simple query below, but it does not perform well at all, and the EC2 instance hosting the DB quickly runs out of memory:
MATCH p=(j:Journal)-[*0..]-(n) RETURN p
I managed to write a query using UNIONS
MATCH p=(j:Journal)<-[:BELONGS_TO]-(at:ArticleType) RETURN p
UNION
MATCH p=(j:Journal)<-[:OWNS]-(jo:JournalOwner) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO]-(s:Section) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(fc:FileCategory) RETURN p
UNION
MATCH p=(j:Journal)-[:CHARGED_BY]->(a:APC) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(sft:SupportedFileType) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) RETURN p
SKIP 0 LIMIT 100
The query works fine and its performance is not bad at all, the only problem I'm finding is in the limit, I've been googling around and I've seen that post-processing queries with UNIONS is not yet supported.
The referenced GitHub issue is not yet resolved, so post-processing of UNION is not yet possible.
Logically, the first thing I tried when I came across this issue was to put the pagination on each individual query, but this produced some weird behaviour that didn't make much sense to me.
So I tried to write the query without using UNIONS, I came up with this
MATCH (j:Journal)
WITH j LIMIT 10
MATCH pa=(j)<-[:BELONGS_TO]-(a:ArticleType)
MATCH po=(j)<-[:OWNS]-(o:JournalOwner)
MATCH ps=(j)<-[:BELONGS_TO]-(s:Section)
MATCH pf=(j)-[:ACCEPTS]->(f:FileCategory)
MATCH pc=(j)-[:CHARGED_BY]->(apc:APC)
MATCH pt=(j)-[:ACCEPTS]->(sft:SupportedFileType)
MATCH pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification)
RETURN pa, po, ps, pf, pc, pt, pl
This query however breaks my DB, I feel like I'm missing something essential for writing CQL queries...
I've also looked into COLLECT and UNWIND in this neo blog post but couldn't really make sense of it.
How can I paginate my query without removing the unions? Or is there any other way of writing the query so that pagination can be applied at the Journal level and the performance isn't affected?
--- EDIT ---
Here is the execution plan for my second query
You really don't need UNION for this, because when you approach this using UNION, you're getting all the related nodes for every :Journal node, and only AFTER you've made all those expansions from every :Journal node do you limit your result set. That is a ton of work that will only be excluded due to your LIMIT.
Your second query looks like the more correct approach, matching on :Journal nodes with a LIMIT, and only then matching on the related nodes to prepare the data for return.
You said that the second query breaks your DB. Can you run a PROFILE on the query (or an EXPLAIN, if the query never finishes execution), expand all elements of the plan, and add it to your description?
Also, if you leave out the final MATCH to :Classification, does the query behave correctly?
It would also help to know if you really need the paths returned, or if it's enough to just return the connected nodes.
EDIT
If you want each :Journal and all its connected data on a single row, you need to either be using COLLECT() after each match, or using pattern comprehension so the result is already in a collection.
This will also cut down on unnecessary queries. Your initial match (after the limit) generated 31k rows, so all subsequent matches executed 31k times. If you collect() or use pattern comprehension, you'll keep the cardinality down to your initial 10, and prevent redundant matches.
Something like this, if you only want collected paths returned:
MATCH (j:Journal)
WITH j LIMIT 10
WITH j,
[pa=(j)<-[:BELONGS_TO]-(a:ArticleType) | pa] as pa,
[po=(j)<-[:OWNS]-(o:JournalOwner) | po] as po,
[ps=(j)<-[:BELONGS_TO]-(s:Section) | ps] as ps,
[pf=(j)-[:ACCEPTS]->(f:FileCategory) | pf] as pf,
[pc=(j)-[:CHARGED_BY]->(apc:APC) | pc] as pc,
[pt=(j)-[:ACCEPTS]->(sft:SupportedFileType) | pt] as pt,
[pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) | pl] as pl
RETURN pa, po, ps, pf, pc, pt, pl
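The other option mentioned above, collecting after each match, might look roughly like the sketch below. It returns connected nodes rather than paths and covers only three of the relationships for brevity; the aliases articleTypes, owners and sections are names introduced here. OPTIONAL MATCH is used on the assumption that a journal lacking one kind of related node should still be returned.

```cypher
// Page at the Journal level first, then collect each kind of related node
// so the row count stays at one row per journal.
MATCH (j:Journal)
WITH j SKIP 0 LIMIT 10
OPTIONAL MATCH (j)<-[:BELONGS_TO]-(a:ArticleType)
WITH j, collect(a) AS articleTypes
OPTIONAL MATCH (j)<-[:OWNS]-(o:JournalOwner)
WITH j, articleTypes, collect(o) AS owners
OPTIONAL MATCH (j)<-[:BELONGS_TO]-(s:Section)
WITH j, articleTypes, owners, collect(s) AS sections
RETURN j, articleTypes, owners, sections
```

For repeatable pages, an ORDER BY on some stable Journal property would need to be added before the SKIP; which property to use depends on the data model.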

neo4j cypher: "stacking" nodes from query result

Considering the existence of three types of nodes in a db, connected by the schema
(a)-[ra {qty}]->(b)-[rb {qty}]->(c)
with the user being able to have some of each in their wishlist or whatever.
What would be the best way to query the database to return a list of all the nodes the user has on their wishlist, considering that when he has an (a) then in the result the associated (b) and (c) should also be returned after having multiplied some of their fields (say b.price and c.price) for the respective ra.qty and rb.qty?
NOTE: you can find the same problem without the variable length over here
Assuming you have users connected to the things they want like so:
(user:User)-[:WANTS]->(part:Part)
And that parts, like you describe, have dependencies on other parts in specific quantities:
CREATE
(a:Part), (b:Part), (c:Part),
(a)-[:CONTAINS {qty:2}]->(b),
(a)-[:CONTAINS {qty:3}]->(c),
(b)-[:CONTAINS {qty:2}]->(c)
Then you can find all parts, and how many of each, you need like so:
MATCH
(user:User {name:"Steven"})-[:WANTS]->(part),
chain=(part)-[:CONTAINS*1..4]->(subcomponent:Part)
RETURN subcomponent, sum( reduce( total=1, r IN relationships(chain) | total * r.qty) )
The 1..4 term says to look between 1 and 4 levels of sub-components down the tree. You can set that to whatever you like, including "1.." for unbounded depth.
The second term there is a bit complex. It helps to try the query without the sum to see what it does. Without that, the reduce will do the multiplying of parts that you want for each "chain" of dependencies. Adding the sum will then aggregate the result by subcomponent (inferred from your RETURN clause) and sum up the total count for that subcomponent.
Figuring out the price is then an exercise of multiplying the aggregate quantities of each part. I'll leave that as an exercise for the reader ;)
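One way that exercise might be approached, as a sketch only: it assumes every :Part node carries a numeric price property (the question mentions b.price and c.price) and that the relationship quantity is stored as qty, as in the sample data. The aliases totalQty and cost are introduced here for illustration.

```cypher
// Aggregate the required quantity per subcomponent, then price it.
MATCH
  (user:User {name:"Steven"})-[:WANTS]->(part:Part),
  chain=(part)-[:CONTAINS*1..4]->(subcomponent:Part)
WITH subcomponent,
     sum(reduce(total=1, r IN relationships(chain) | total * r.qty)) AS totalQty
RETURN subcomponent, totalQty, totalQty * subcomponent.price AS cost
```

With the three sample parts and a user who wants a, the chains are a->b (2), a->c (3) and a->b->c (2 * 2 = 4), so totalQty is 2 for b and 3 + 4 = 7 for c. The wanted part itself is not included because the pattern starts at depth 1; changing the range to 0..4 would add it with a quantity of 1, since reduce over an empty relationship list returns the initial value.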
You can try this out by running the queries in the online console at http://console.neo4j.org/

Resources