Neo4J - How to do post processing UNION (pagination) - neo4j

I'm writing a cypher query to load data from my Neo4J DB, this is my data model
So basically what I want is a query to return a Journal with all of its properties and everything related to it, Ive tried doing the simple query but it is not performant at all and my ec2 instance where the DB is hosted runs out of memory quickly
MATCH p=(j:Journal)-[*0..]-(n) RETURN p
I managed to write a query using UNIONS
`MATCH p=(j:Journal)<-[:BELONGS_TO]-(at:ArticleType) RETURN p
UNION
MATCH p=(j:Journal)<-[:OWNS]-(jo:JournalOwner) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO]-(s:Section) RETURN p
UNION

MATCH p=(j:Journal)-[:ACCEPTS]->(fc:FileCategory) RETURN p
UNION
MATCH p=(j:Journal)-[:CHARGED_BY]->(a:APC) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(sft:SupportedFileType) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) RETURN p
SKIP 0 LIMIT 100`
The query works fine and its performance is not bad at all, the only problem I'm finding is in the limit, I've been googling around and I've seen that post-processing queries with UNIONS is not yet supported.
The referenced github issue is not yet resolved, so post processing of UNION is not yet possible github link
Logically the first thing I tried when I came across this issue was to put the pagination on each individual query, but this had some weird behaviour that didn't make much sense to myself.
So I tried to write the query without using UNIONS, I came up with this
`MATCH (j:Journal)
WITH j LIMIT 10
MATCH pa=(j)<-[:BELONGS_TO]-(a:ArticleType)
MATCH po=(j)<-[:OWNS]-(o:JournalOwner)
MATCH ps=(j)<-[:BELONGS_TO]-(s:Section)
MATCH pf=(j)-[:ACCEPTS]->(f:FileCategory)
MATCH pc=(j)-[:CHARGED_BY]->(apc:APC)
MATCH pt=(j)-[:ACCEPTS]->(sft:SupportedFileType)
MATCH pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification)
RETURN pa, po, ps, pf, pc, pt, pl`
This query however breaks my DB, I feel like I'm missing something essential for writing CQL queries...
I've also looked into COLLECT and UNWIND in this neo blog post but couldn't really make sense of it.
How can I paginate my query without removing the unions? Or is there any other way of writing the query so that pagination can be applied at the Journal level and the performance isn't affected?
--- EDIT ---
Here is the execution plan for my second query

You really don't need UNION for this, because when you approach this using UNION, you're getting all the related nodes for every :Journal node, and only AFTER you've made all those expansions from every :Journal node do you limit your result set. That is a ton of work that will only be excluded due to your LIMIT.
Your second query looks like the more correct approach, matching on :Journal nodes with a LIMIT, and only then matching on the related nodes to prepare the data for return.
You said that the second query breaks your DB. Can you run a PROFILE on the query (or an EXPLAIN, if the query never finishes execution), expand all elements of the plan, and add it to your description?
Also, if you leave out the final MATCH to :Classification, does the query behave correctly?
It would also help to know if you really need the paths returned, or if it's enough to just return the connected nodes.
EDIT
If you want each :Journal and all its connected data on a single row, you need to either be using COLLECT() after each match, or using pattern comprehension so the result is already in a collection.
This will also cut down on unnecessary queries. Your initial match (after the limit) generated 31k rows, so all subsequent matches executed 31k times. If you collect() or use pattern comprehension, you'll keep the cardinality down to your initial 10, and prevent redundant matches.
Something like this, if you only want collected paths returned:
MATCH (j:Journal)
WITH j LIMIT 10
WITH j,
[pa=(j)<-[:BELONGS_TO]-(a:ArticleType) | pa] as pa,
[po=(j)<-[:OWNS]-(o:JournalOwner) | po] as po,
[ps=(j)<-[:BELONGS_TO]-(s:Section) | ps] as ps,
[pf=(j)-[:ACCEPTS]->(f:FileCategory) | pf] as pf,
[pc=(j)-[:CHARGED_BY]->(apc:APC) | pc] as pc,
[pt=(j)-[:ACCEPTS]->(sft:SupportedFileType) | pt] as pt,
[pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) | pl] as pl
RETURN pa, po, ps, pf, pc, pt, pl

Related

simple match query taking ages

I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result as there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE) RETURN m
shows a single node so I have made no mistake doing the query manually.
However, the issue is that the first question takes ages to finish and I do not understand why.
Consequently: What is the reason such trivial query takes so long even though the maximum result would be one?
Bonus: How to fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed because there's no restrictions given on any of the nodes in between n and m, they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:TYPE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o Is not a fair test of MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN m because it excludes the possibility of MATCH (n:TYPE {id:123})<-[:CONNECTION]<-(m:ANYTHING_ELSE)<-[n:CONNECTION]-(o:TYPE) RETURN m,o.
For your main query, you should be returning DISTINCT results MATCH (n:TYPE {id:123})<-[:CONNECTION*]<-(m:TYPE) RETURN DISTINCT m.
This is for 2 main reasons.
Without distinct, each node needs to be returned the number of times for each possible path to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher Planner hasn't learned pruning search yet. Example use MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]<-(m:TYPE) RETURN m

why is neo4j so slow on this cypher query?

I have a fairly deep tree that consists of an initial "transaction" node (call that the 0th layer of the tree), from which there are 50 edges to the next nodes (call it the 1st later of the tree), and then from each of those around 35 on average to the second layer, and so on...
The initial node is a :txnEvent and all the rest are :mEvent
mEvent nodes have 4 properties, one of them called channel_name
Now, I would like to retrieve all paths that go down to the 4th layer such that those paths contain a node with channel_name==A and also channel_name==B
This query:
match (n: txnEvent)-[r:TO*1..4]->(m:mEvent) return COUNT(*);
Is telling me there are only 1,667,444 paths to consider.
However, the following query:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
EXTRACT (n in nodes(p) | n.channel_name),
EXTRACT (n in nodes(p) | n.step),
EXTRACT (n in nodes(p) | n.event_type),
EXTRACT (n in nodes(p) | n.event_device),
EXTRACT (r in relationships(p) | r.weight )
Takes almost 1 minute to execute (neo4j's UI on port 7474)
For completness, neo4j is telling me:
"Started streaming 125517 records after 2 ms and completed after 50789 ms, displaying first 1000 rows."
So I'm wondering whether there's something obvious I'm missing. All of the properties that nodes have are indexed by the way. Is the query slow, or is it fast and the streaming is slow?
UDATE:
This query, that doesn't stream data back:
MATCH p = (n:txnEvent)-[:TO*1..4]->(m:mEvent)
WHERE ANY(k IN nodes(p) WHERE k.channel_name='A')
AND ANY(k IN nodes(p) WHERE k.channel_name='B')
RETURN
COUNT(*)
Takes 35s, so even though it's faster, presumably because no data is returned, I feel it's still quite slow.
UPDATE 2:
Ideally this data should go into a jupyter notebook with a python kernel.
Thanks for the PROFILE plan.
Keep in mind that the query you're asking for is a difficult one to process. Since you want paths where at least one node in the path has one property and at least one other node in the path has another property, there is no way to prune paths during expansion. Instead, every possible path has to be determined, and then every node in each of those 1.6 million paths has to be accessed to check for the property (and that has to be done twice for each path, for both properties). Thus the ~10 million db hits for the filter operation.
You could try expanding your heap and pagecache sizes (if you have the RAM to spare), but I don't see any easy ways to tune this query.
As for your question about the query time vs streaming, the problem is the query itself. The message you saw means that the first result was found extremely quickly so the first result was ready in the stream almost immediately. Results are added to the stream as they're found, but the volume of paths needing to be matched and filtered with no ability to prune paths during expansion means it took a very long time for the query to complete.

Multiple Match with Cypher is super slow

I started with Cypher yesterday and went into a problem. My hierarchy structure looks like:
ReleaseCycle->Line->ProductSet
So "ReleaseCycle" are the fewest elements and in the hierarchy tree at the very top. The children from "ReleaseCycle" are "Line" and the children from "Line" are "ProductSet" which includes the most Data. All of them are related to "releaseTask" and i want to return all the releaseTask.
My cypher call at the moment looks like:
MATCH (a1:Line)<-[:contains]-(b1:ReleaseCycle{id:'xyz'})<-[c1:releaseTask]-(d1:Task)
MATCH (a2:ProductSet)<-[:contains]-(b2:Line{id: a1.id})<-[c2:releaseTask]-(d2:Task)
MATCH (b3:ProductSet{id: a2.id})<-[c3:releaseTask]-(d3:Task)
RETURN COLLECT(DISTINCT {a:"cycle", task: d1, contributor: c1.contributor}),
COLLECT(DISTINCT {a:"Line", task: d2, contributor: c2.contributor}),
COLLECT(DISTINCT {a:"ps", task: d3, contributor: c3.contributor})
The problem is that it takes 5 seconds for the respond and it seems like there are entries missing. My result is only 800 but it should be 3000... Maybe because of the "Distinct" and Collect() but if i don't use them I get a timeout because the query takes too long.
Thanks for your help
If you PROFILE the query, you'll see that the number of records/rows being generated as the query executes is likely spiking. Cypher operations execute per record/row, so you're performing much more work than is needed.
The key with aggregation is to collect as early as possible (where it makes sense), and this will in turn reduce cardinality (the records/rows in the query) which means fewer operations needing execution and fewer redundant operations on the same nodes/data.
In your case, it makes sense to do this in steps, and to use pattern comprehension as shorthand to match a pattern and collect results.
You can also reuse variables throughout your query instead of having to rematch to the nodes you already have (though WITH allows you to redefine what variables are in scope).
Try this one out, and if it gives good results, PROFILE it to see the difference:
MATCH (release:ReleaseCycle{id:'xyz'})
WITH release, [(release)<-[rel:releaseTask]-(task:Task) |
{a:"cycle", task:task, contributor:rel.contributor}] as cycleTasks
MATCH (line:Line)<-[:contains]-(release)
WITH cycleTasks, line, [(line)<-[rel:releaseTask]-(task:Task) |
{a:"Line", task:task, contributor:rel.contributor}] as lineTasks
MATCH (prod:ProductSet)<-[:contains]-(line)
WITH cycleTasks, lineTasks, [(prod)<-[rel:releaseTask]-(task:Task) |
{a:"ps", task:task, contributor:rel.contributor}] as psTasks
RETURN cycleTasks, lineTasks, psTasks
Also, make sure you have indexes on the label/property combinations that you plan to use when looking up your starting node/nodes.

Nested unions in Cypher/Neo4j

I've this metagraph in Neo4j:
:Protein ---> :is_a ---------------> :Enzyme ---> :activated_by|inhibited_by ---> :Compound
\<-- :activated_by <---/
:Compound --> :consumed_by|:produced_by ---> :Transport
\<---- :catalyzed_by -------<---/
:Transport --> part_of ---> :Pathway
(yes, it's biology, yes, it's imported from BioPAX).
I want to use Cypher to return all pairs of (:Protein, :Compounds) that are linked by some graph path. While one simple way would be to follow each possible path between the two node types and issue one query per each of them, clearly a query using some pattern union would be more compact (maybe slower, I'm assessing the performance of different approaches).
But how to write such a query? I've read that Cypher has a UNION construct, but I haven't clear how to combine and nest them, e.g., regarding the subgraphs from enzymes to transports, I would like to be able to write something equivalent to this informal expression:
enz:Enzyme :activated|inhibited_by comp:Compund
join with: (
(comp :consumed_by|produced_by :Transport)
UNION (:Transport :catalyzed_by comp )
)
I've read there should be some way, but I didn't get much of it and I'm trying to understand if there is a relatively easy way (in SPARQL or SQL the above is rather simple).
Using Periodic Collect
In Cypher, you can break a query into steps with WITH, and you can join two lists by concatenating them together.
MATCH (e:Enzyme)-[:activated]->(compA:Compound), (e)-[:inhibited_by]->(compB:Compund)
WITH e, COLLECT(compA)+COLLECT(compB) as compList
UNWIND compList as comp
WITH DISTINCT e, comp // if a comp can appear in both lists
MATCH ... // Repeat above at each path step
Using Union
When using Union to combine different queries, think of it like a comma separated list of queries, but instead of commas, you use the word UNION instead (Also, every query in this list has to have the same return columns.
MATCH (e:Enzyme)-[r1:activated]->(comp:Compound)-[r2:consumed_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
UNION
MATCH (e:Enzyme)-[r1:inhibited_by]->(comp:Compound)-[r2:produced_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
// Just to show only return names have to match
UNION
WITH "rawr" as a
RETURN a as protein, 51 as compound, NULL as transport
This is good for combining the results of completely different queries, but since the queries you are combining are usually related, most of the time COLLECT will be more efficient, and gives you better control over the results.
Using OR
You can get the name of a relationship with the TYPE function and filter on that.
MATCH (e:Enzyme)-[r1]->(comp:Compound)-[r2]->(trans:Transport)
WHERE (TYPE(r1) = "activated" OR TYPE(r1) = "inhibited_by") AND (TYPE(r2) = "consumed_by" OR TYPE(r2) = "produced_by")
RETURN *
Note: For using OR on just they relation type, you can also use -[:A|:B]-> for OR.
Using Path Filtering
As of Neo4j 3.1.x, Cypher is fairly good at free path finding. (By that, I mean finding a valid path without searching all possible paths. The pattern for relation matching) The upper bound is not strictly necessary, but good for preventing/controlling runaway queries.
MATCH p=(e:Protein)-[r1*..10]->(c:Compound)
WHERE ALL(r in RELATIONSHIPS(p) WHERE TYPE(r) in ["activated","inhibited_by","produced_by","consumed_by"])
RETURN e as protein, c as compound
Other
If they relationship type doesn't actually matter, you can just use (e:Enzyme)-->(c:Compound) or to ignore direction (e:Enzyme)--(c:Compound)
If it is an option, I would recommend refactoring your schema to be more consistent (or add a relation type relevant to this matching criteria) so that you don't need to union results. (This will give you the best performance, since the Cypher planner will better know how to quickly find your results)
The query below will return all paths in which an Enzyme is activated or inhibited by a Compound that is consumed, produced, or catalyzed by a Transport.
The query's r2 relationship pattern is non-directional, since the directionality of catalyzed_by is opposite that of consumed_by and produced_by.
MATCH p=
(e:Enzyme)-[r1:activated_by|:inhibited_by]->
(comp:Compound)-[r2:consumed_by|:produced_by|:catalyzed_by]-
(trans:Transport)
RETURN p;
So, I haven't still found a satisfactory way to do what I initially asked, but I get quite close to it and I wish to report about.
First, my initial graph was a bit different than the one I actually have, so I'll base my next examples on the real one (sorry for the confusion):
I can work branches quite comfortably with paths in the WHERE clause (plus, in simpler cases, UNIONS):
// Branch 2

MATCH (prot:Protein), (enz:Enzyme), (tns:Transport) - [:part_of] -> (path:Path)

WHERE ( (enz) - [:ac_by|:in_by] -> (:Comp) - [:pd_by|:cs_by] -> (tns) // Branch 2.1

 OR (tns) - [:ca_by] -> (enz) ) //Branch 2.2 (pt1)
AND ( (prot) - [:is_a] -> (enz) OR (prot) <- [:ac_by] - (enz) ) // Branch 2.2 (pt2)

RETURN prot, path LIMIT 30

UNION

// Branch1
MATCH (prot:Protein) - [:pd_by|:cs_by] -> (:Reaction) - [:part_of] -> (path:Path)

RETURN prot, path LIMIT 30
(I'm also sorry for all those abbreviations, e.g., pd_by is produced_by, ac_by is activated_by and so on).
This query yields results in about 1 minute. Too long and that's clearly due to the way the query is interpreted, as one can see from its plan:
I really can't understand why there are those huge cartesian products. I've tried the WITH/UNWIND approach, but I've been unable to get the correct results (see my comments above, thanks #Tezra and #cybersam) and, even if I was, it's a very difficult syntax.

Querying multiple indexes not working if one condition fails in Neo4j

I am trying to search for a key word on all the indexes. I have in my graph database.
Below is the query:
start n=node:Users(Name="Hello"),
m=node:Location(LocationName="Hello")
return n,m
I am getting the nodes and if keyword "Hello" is present in both the indexes (Users and Location), and I do not get any results if keyword Hello is not present in any one of index.
Could you please let me know how to modify this cypher query so that I get results if "Hello" is present in any of the index keys (Name or LocationName).
In 2.0 you can use UNION and have two separate queries like so:
start n=node:Users(Name="Hello")
return n
UNION
start n=node:Location(LocationName="Hello")
return n;
The problem with the way you have the query written is the way it calculates a cartesian product of pairs between n and m, so if n or m aren't found, no results are found. If one n is found, and two ms are found, then you get 2 results (with a repeating n). Similar to how the FROM clause works in SQL. If you have an empty table called empty, and you do select * from x, empty; then you'll get 0 results, unless you do an outer join of some sort.
Unfortunately, it's somewhat difficult to do this in 1.9. I've tried many iterations of things like WITH collect(n) as n, etc., but it boils down to the cartesian product thing at some point, no matter what.

Resources