Nested unions in Cypher/Neo4j

I have this metagraph in Neo4j:
:Protein ---> :is_a ---------------> :Enzyme ---> :activated_by|inhibited_by ---> :Compound
\<-- :activated_by <---/
:Compound --> :consumed_by|:produced_by ---> :Transport
\<---- :catalyzed_by -------<---/
:Transport --> part_of ---> :Pathway
(yes, it's biology, yes, it's imported from BioPAX).
I want to use Cypher to return all pairs of (:Protein, :Compound) that are linked by some graph path. One simple way would be to follow each possible path between the two node types and issue one query for each, but a query using some kind of pattern union would clearly be more compact (maybe slower; I'm assessing the performance of different approaches).
But how do I write such a query? I've read that Cypher has a UNION construct, but it isn't clear to me how to combine and nest unions. For example, for the subgraphs from enzymes to transports, I would like to be able to write something equivalent to this informal expression:
enz:Enzyme :activated|inhibited_by comp:Compound
join with: (
(comp :consumed_by|produced_by :Transport)
UNION (:Transport :catalyzed_by comp )
)
I've read there should be some way, but I didn't get much of it and I'm trying to understand if there is a relatively easy way (in SPARQL or SQL the above is rather simple).

Using Periodic Collect
In Cypher, you can break a query into steps with WITH, and you can join two lists by concatenating them together.
MATCH (e:Enzyme)-[:activated]->(compA:Compound), (e)-[:inhibited_by]->(compB:Compound)
WITH e, COLLECT(compA)+COLLECT(compB) as compList
UNWIND compList as comp
WITH DISTINCT e, comp // if a comp can appear in both lists
MATCH ... // Repeat above at each path step
Using Union
When using UNION to combine different queries, think of it as a comma-separated list of queries in which the word UNION takes the place of the commas. (Also, every query in this list has to have the same return columns.)
MATCH (e:Enzyme)-[r1:activated]->(comp:Compound)-[r2:consumed_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
UNION
MATCH (e:Enzyme)-[r1:inhibited_by]->(comp:Compound)-[r2:produced_by]->(trans:Transport)
RETURN e as protein, comp as compound, trans as transport
// Just to show only return names have to match
UNION
WITH "rawr" as a
RETURN a as protein, 51 as compound, NULL as transport
This is good for combining the results of completely different queries, but since the queries you are combining are usually related, most of the time COLLECT will be more efficient, and gives you better control over the results.
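If the server is Neo4j 4.1 or later (an assumption; the thread does not say which version is in use), a UNION can also be nested inside a CALL { ... } subquery, which comes close to the informal "join with (A UNION B)" expression in the question. A minimal sketch, using the relationship names from the question's metagraph rather than the shortened ones in the examples above:

```cypher
// Assumes Neo4j 4.1+ (correlated CALL subqueries).
// Labels and relationship types are taken from the question's metagraph.
MATCH (enz:Enzyme)-[:activated_by|inhibited_by]->(comp:Compound)
CALL {
  // Branch 1: the compound is consumed or produced by a transport
  WITH comp
  MATCH (comp)-[:consumed_by|produced_by]->(trans:Transport)
  RETURN trans
  UNION
  // Branch 2: a transport points back to the compound via catalyzed_by
  WITH comp
  MATCH (trans:Transport)-[:catalyzed_by]->(comp)
  RETURN trans
}
RETURN DISTINCT enz, comp, trans
```

Each UNION branch has to import comp with its own leading WITH and return the same column. The outer DISTINCT is there because an enzyme that is both activated and inhibited by the same compound would otherwise produce the same row twice.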
Using OR
You can get the name of a relationship with the TYPE function and filter on that.
MATCH (e:Enzyme)-[r1]->(comp:Compound)-[r2]->(trans:Transport)
WHERE (TYPE(r1) = "activated" OR TYPE(r1) = "inhibited_by") AND (TYPE(r2) = "consumed_by" OR TYPE(r2) = "produced_by")
RETURN *
Note: To use OR on just the relationship type, you can also write -[:A|:B]->.
Using Path Filtering
As of Neo4j 3.1.x, Cypher is fairly good at free path finding. (By that, I mean finding a valid path without searching all possible paths.) In the variable-length relationship pattern below, the upper bound is not strictly necessary, but it is good for preventing or controlling runaway queries.
MATCH p=(e:Protein)-[r1*..10]->(c:Compound)
WHERE ALL(r in RELATIONSHIPS(p) WHERE TYPE(r) in ["activated","inhibited_by","produced_by","consumed_by"])
RETURN e as protein, c as compound
Other
If the relationship type doesn't actually matter, you can just use (e:Enzyme)-->(c:Compound) or, to ignore direction, (e:Enzyme)--(c:Compound).
If it is an option, I would recommend refactoring your schema to be more consistent (or adding a relationship type relevant to this matching criterion) so that you don't need to union results. (This will give you the best performance, since the Cypher planner will better know how to quickly find your results.)

The query below will return all paths in which an Enzyme is activated or inhibited by a Compound that is consumed, produced, or catalyzed by a Transport.
The query's r2 relationship pattern is non-directional, since the directionality of catalyzed_by is opposite that of consumed_by and produced_by.
MATCH p=
(e:Enzyme)-[r1:activated_by|:inhibited_by]->
(comp:Compound)-[r2:consumed_by|:produced_by|:catalyzed_by]-
(trans:Transport)
RETURN p;

So, I still haven't found a satisfactory way to do what I initially asked, but I got quite close and I want to report on it.
First, the graph I described initially was a bit different from the one I actually have, so I'll base my next examples on the real one (sorry for the confusion).
I can work branches quite comfortably with paths in the WHERE clause (plus, in simpler cases, UNIONS):
// Branch 2
MATCH (prot:Protein), (enz:Enzyme), (tns:Transport) - [:part_of] -> (path:Path)
WHERE ( (enz) - [:ac_by|:in_by] -> (:Comp) - [:pd_by|:cs_by] -> (tns) // Branch 2.1
  OR (tns) - [:ca_by] -> (enz) ) // Branch 2.2 (pt1)
  AND ( (prot) - [:is_a] -> (enz) OR (prot) <- [:ac_by] - (enz) ) // Branch 2.2 (pt2)
RETURN prot, path LIMIT 30
UNION
// Branch 1
MATCH (prot:Protein) - [:pd_by|:cs_by] -> (:Reaction) - [:part_of] -> (path:Path)
RETURN prot, path LIMIT 30
(I'm also sorry for all those abbreviations, e.g., pd_by is produced_by, ac_by is activated_by and so on).
This query yields results in about 1 minute. That is too long, and it's clearly due to the way the query is interpreted, as its plan shows.
I really can't understand why there are those huge Cartesian products. I've tried the WITH/UNWIND approach, but I've been unable to get the correct results (see my comments above, thanks @Tezra and @cybersam) and, even if I had, the syntax is very difficult.
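A likely reason for the Cartesian products is that the MATCH clause lists several comma-separated patterns that are not connected to each other: (prot:Protein), (enz:Enzyme) and the tns pattern are each enumerated on their own, and the WHERE clause only filters the combinations afterwards. A hedged sketch of one way around it, assuming the abbreviated labels and relationship types from the query above, is to write each combination of branches as a single connected pattern. Only the "Branch 2.1 plus is_a" combination is shown; the remaining combinations would be further UNION branches of the same shape:

```cypher
// One connected pattern per branch combination, so the planner can expand
// from a single anchor instead of combining three independent node scans.
MATCH (prot:Protein)-[:is_a]->(enz:Enzyme)
      -[:ac_by|in_by]->(:Comp)
      -[:pd_by|cs_by]->(tns:Transport)
      -[:part_of]->(path:Path)
RETURN DISTINCT prot, path
LIMIT 30
```

Running PROFILE on each branch separately should show whether the CartesianProduct operators have disappeared from the plan.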

Related

Neo4j - Variable length greater 11 runs forever and query never returns

I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code worked fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
select
child_id
from
dependencies
where
dependencies.parent_id = 16674850
union
select
dependencies.child_id
from
cte
left join dependencies on
cte.child_id = dependencies.parent_id
where
cte.child_id is not null
)
select * from cte;
That returns 1.726 elements which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1.445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the variable-length depth to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That makes it problematic when you just want the distinct reachable nodes: you don't care about expanding every distinct path, only about finding nodes and ignoring any alternate paths to nodes already visited.
Also, the planner is making a bad choice with that Cartesian product plan, which can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procedures optimized for expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
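Since the question also asks for the path to each dependency, a related procedure may be worth a try. This is only a sketch, assuming APOC is installed: apoc.path.spanningTree() takes the same config map as above and yields a single path to each reachable node, rather than every possible path.

```cypher
// Sketch: same filters as the subgraphNodes example, but YIELD path
// gives one path from the start node to each reachable :Version node.
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.spanningTree(a, {
  relationshipFilter: 'DEPENDS_ON>',
  labelFilter: '>Version',
  maxLevel: 11
}) YIELD path
RETURN last(nodes(path)) AS dependency, length(path) AS depth
ORDER BY depth
```

The path returned for a node is just one of possibly many routes to it, so this shows how a dependency is reached, not every way it is reached.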
I don’t have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work, as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
  apoc.coll.flatten(
    COLLECT(nodes(p))
  )
) AS uniqueNodes

Optimizing a Cypher query to improve performance

The query I've written returns accurate results based on some random testing I've done. However, the query execution takes really long (7699.43 s)
I need help optimising this query.
count(Person) -> 67895
count(has_POA) -> 355479
count(POADocument) -> 40
count(issued_by) -> 40
count(Company) -> 21
count(PostCode) -> 9845
count(Town) -> 1673
count(in_town) -> 9845
count(offers_services_in) -> 17107
All the entity nodes are indexed on Id's (not Neo4j IDs). The PostCode nodes are also indexed on PostCode.
MATCH pa = (p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company),
      (pc:PostCode), (t:Town)
WHERE r.recipient_postcode = pc.PostCode
  AND (pc)-[:in_town]->(t)
  AND NOT (c)-[:offers_services_in]->(t)
RETURN p as Person, r as hasPOA, t as Town, d as POA, c as Company
Much thanks in advance!
-Nancy
I made some changes in your query:
MATCH (p:Person)-[r:has_POA {recipient_postcode : {code} }]->(d:POADocument)-[:issued_by]->(c:Company),
(pc:PostCode {PostCode : {PostCode} })-[:in_town]->(t:Town)
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p as Person, r as hasPOA, t as Town, d as POA, c as Company
- Since you are not using the entire path, I removed the pa variable.
- I moved the pattern existence check ((pc)-[:in_town]->(t)) from WHERE to MATCH.
- I used parameters instead of the equality check r.recipient_postcode = pc.PostCode in WHERE. If you are running the query in Neo4j Browser, you can set the parameters by running the command :params {code : 10}.
Here is a simplified version of your current query.
MATCH (p:Person)-[r:has_POA]->(d:POADocument)-[:issued_by]->(c:Company)
MATCH (t:Town)<-[:in_town]-(pc:PostCode{PostCode:r.recipient_postcode})
WHERE NOT (c)-[:offers_services_in]->(t)
RETURN p as Person,r as hasPOA,t as Town, d as POA,c as Company
Your big performance hits are going to be on the Cartesian product between all the match sets, and the raw amount of data you are asking for.
In this simplified version, I'm using one less match, and the second match uses a variable from the first match to avoid generating a Cartesian product. I would also recommend using LIMIT and SKIP to page your results to limit data transfer.
If you can adjust your model, I would recommend converting the has_POA relation to an issued_POA node so that you can take advantage of Neo4j's relation finding on the 2 postcodes related to that instance, and making the second match a gimme instead of an extra indexed search (after you adjust the query to match the new model, of course).

Neo4j query for complete paths

I have the following structure.
CREATE
(`0` :Sentence {text:'This is a sentence'}) ,
(`1` :Word {text:'This'}) ,
(`2` :Word {text:'is'}) ,
(`3` :Sentence {text:'Sam is a dog'}) ,
(`0`)-[:`RELATED_TO`]->(`1`),
(`0`)-[:`RELATED_TO`]->(`2`),
(`3`)-[:`RELATED_TO`]->(`2`)
So my question is this. I have a bunch of sentences that I have decomposed into word objects. These word objects are all unique and therefore will point to different sentences. If I perform a search for one word, it's very easy to figure out all of the sentences that word is related to. How can I structure a query to figure out the same information for two words instead of one?
I would like to submit two or more words and find a path that includes all words submitted picking up all sentences of interest.
I just remembered an alternate approach that may work better. Compare the PROFILE on this query with the profiles for the others, see if it works better for you.
WITH {myListOfWords} as wordList
WITH wordList, size(wordList) as wordCnt
MATCH (s)-[:RELATED_TO]->(w:Word)
WHERE w.text in wordList
WITH s, wordCnt, count(DISTINCT w) as cnt
WHERE wordCnt = cnt
RETURN s
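The {myListOfWords} form is the older parameter syntax; Neo4j 4.0 removed it in favour of $name, which has been available since 3.0. A minimal sketch of the same counting approach in that syntax, assuming a hypothetical list parameter called $words and adding the :Sentence label from the question's model:

```cypher
// $words is a hypothetical parameter, e.g. set in Neo4j Browser with
//   :param words => ['This', 'is']
WITH $words AS wordList
WITH wordList, size(wordList) AS wordCnt
MATCH (s:Sentence)-[:RELATED_TO]->(w:Word)
WHERE w.text IN wordList
// Count how many of the requested words each sentence is linked to
WITH s, wordCnt, count(DISTINCT w) AS cnt
// Keep only sentences linked to every requested word
WHERE cnt = wordCnt
RETURN s
```

This relies on the list holding no duplicate words and on each word having a single :Word node, as the question states; otherwise cnt and wordCnt would not line up.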
Unfortunately, it's not a very pretty approach. It basically comes down to collecting :Word nodes and using the ALL() predicate to ensure that the pattern you want holds true for all elements of the collection.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH collect(w) as words
MATCH (s:Sentence)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
What makes this ugly is that the planner isn't intelligent enough right now to infer, when you say MATCH (s:Sentence) WHERE ALL(word in words ..., that the initial matches for s ought to come from the first w in your words collection. It starts out from all :Sentence nodes instead, which is a major performance hit.
So to get around this, we have to explicitly match from the first element of the words collection, and then use WHERE ALL() for the rest.
MATCH (w:Word)
WHERE w.text in {myListOfWords}
WITH w, size(()-[:RELATED_TO]->(w)) as rels
WITH w ORDER BY rels ASC
WITH collect(w) as words
WITH head(words) as head, tail(words) as words
MATCH (s)-[:RELATED_TO]->(head)
WHERE ALL(word in words WHERE (s)-[:RELATED_TO]->(word))
RETURN s
EDIT:
Added an optimization to order your w nodes by the degree of their incoming :RELATED_TO relationships (this is a degree lookup on very few nodes), as this will mean the initial match to your :Sentence nodes is the smallest possible starting set before you filter for relationships from the rest of the words.
As an alternative, you could consider using manual indexing (also called "legacy indexing") instead of using Word nodes and RELATED_TO relationships. Manual indexes support "fulltext" searches using Lucene.
There are many APOC procedures that help you with this.
Here is an example that might work for you. In this example, I assume case-insensitive comparisons are OK, you retain the Sentence nodes (and their text properties), and you want to automatically add the text properties of all Sentence nodes to a manual index.
If you are using neo4j 3.2+, you have to add this setting to the neo4j.conf file to make some expensive apoc.index procedures (like apoc.index.addAllNodes) available:
dbms.security.procedures.unrestricted=apoc.*
Execute this Cypher code to initialize a manual index named "WordIndex" with the text from all existing Sentence nodes, and to enable automatic indexing from that point onwards:
CALL apoc.index.addAllNodes('WordIndex', {Sentence: ['text']}, {autoUpdate: true})
YIELD label, property, nodeCount
RETURN *;
To find (case insensitively) the Sentence nodes containing all the words in a collection (passed via a $words parameter), you'd execute a Cypher statement like the one below. The WITH clause builds the Lucene query string (e.g., "foo AND bar") for you. Caveat: since Lucene's special boolean terms (like "AND" and "OR") are always in uppercase, you should make sure the words you pass in are lowercased (or modify the WITH clause below to use the TOLOWER() function as needed).
WITH REDUCE(s = $words[0], x IN $words[1..] | s + ' AND ' + x) AS q
CALL apoc.index.search('WordIndex', q) YIELD node
RETURN node;
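Manual (legacy) indexes were removed in Neo4j 4.0, so on newer servers the built-in full-text indexes are the closest equivalent. A rough sketch, assuming Neo4j 4.3 or later and a hypothetical index name sentenceText:

```cypher
// Neo4j 4.3+ syntax. On 3.5 to 4.2 the index would instead be created with
//   CALL db.index.fulltext.createNodeIndex('sentenceText', ['Sentence'], ['text'])
CREATE FULLTEXT INDEX sentenceText FOR (s:Sentence) ON EACH [s.text];

// Build a Lucene query such as "foo AND bar" from the $words list parameter
WITH reduce(q = $words[0], x IN $words[1..] | q + ' AND ' + x) AS q
CALL db.index.fulltext.queryNodes('sentenceText', q) YIELD node, score
RETURN node, score;
```

These are two separate statements; the index only needs to be created once. The same caveat about uppercase AND and OR in the search words applies here.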

Neo4J - How to do post processing UNION (pagination)

I'm writing a Cypher query to load data from my Neo4j DB. This is my data model:
Basically, what I want is a query that returns a Journal with all of its properties and everything related to it. I've tried the simple query below, but it does not perform well at all, and the EC2 instance hosting the DB quickly runs out of memory.
MATCH p=(j:Journal)-[*0..]-(n) RETURN p
I managed to write a query using UNIONS
MATCH p=(j:Journal)<-[:BELONGS_TO]-(at:ArticleType) RETURN p
UNION
MATCH p=(j:Journal)<-[:OWNS]-(jo:JournalOwner) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO]-(s:Section) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(fc:FileCategory) RETURN p
UNION
MATCH p=(j:Journal)-[:CHARGED_BY]->(a:APC) RETURN p
UNION
MATCH p=(j:Journal)-[:ACCEPTS]->(sft:SupportedFileType) RETURN p
UNION
MATCH p=(j:Journal)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) RETURN p
SKIP 0 LIMIT 100
The query works fine and its performance is not bad at all. The only problem is the LIMIT: I've been googling around and found that post-processing the result of a UNION is not yet supported.
The referenced GitHub issue is not yet resolved, so post-processing of UNION is not yet possible.
Logically, the first thing I tried when I came across this issue was to put the pagination on each individual query, but this produced some weird behaviour that didn't make much sense to me.
So I tried to write the query without using UNIONS, I came up with this
MATCH (j:Journal)
WITH j LIMIT 10
MATCH pa=(j)<-[:BELONGS_TO]-(a:ArticleType)
MATCH po=(j)<-[:OWNS]-(o:JournalOwner)
MATCH ps=(j)<-[:BELONGS_TO]-(s:Section)
MATCH pf=(j)-[:ACCEPTS]->(f:FileCategory)
MATCH pc=(j)-[:CHARGED_BY]->(apc:APC)
MATCH pt=(j)-[:ACCEPTS]->(sft:SupportedFileType)
MATCH pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification)
RETURN pa, po, ps, pf, pc, pt, pl
This query however breaks my DB, I feel like I'm missing something essential for writing CQL queries...
I've also looked into COLLECT and UNWIND in this neo blog post but couldn't really make sense of it.
How can I paginate my query without removing the unions? Or is there any other way of writing the query so that pagination can be applied at the Journal level and the performance isn't affected?
--- EDIT ---
Here is the execution plan for my second query
You really don't need UNION for this, because when you approach this using UNION, you're getting all the related nodes for every :Journal node, and only AFTER you've made all those expansions from every :Journal node do you limit your result set. That is a ton of work that will only be excluded due to your LIMIT.
Your second query looks like the more correct approach, matching on :Journal nodes with a LIMIT, and only then matching on the related nodes to prepare the data for return.
You said that the second query breaks your DB. Can you run a PROFILE on the query (or an EXPLAIN, if the query never finishes execution), expand all elements of the plan, and add it to your description?
Also, if you leave out the final MATCH to :Classification, does the query behave correctly?
It would also help to know if you really need the paths returned, or if it's enough to just return the connected nodes.
EDIT
If you want each :Journal and all its connected data on a single row, you need to either be using COLLECT() after each match, or using pattern comprehension so the result is already in a collection.
This will also cut down on unnecessary queries. Your initial match (after the limit) generated 31k rows, so all subsequent matches executed 31k times. If you collect() or use pattern comprehension, you'll keep the cardinality down to your initial 10, and prevent redundant matches.
Something like this, if you only want collected paths returned:
MATCH (j:Journal)
WITH j LIMIT 10
WITH j,
[pa=(j)<-[:BELONGS_TO]-(a:ArticleType) | pa] as pa,
[po=(j)<-[:OWNS]-(o:JournalOwner) | po] as po,
[ps=(j)<-[:BELONGS_TO]-(s:Section) | ps] as ps,
[pf=(j)-[:ACCEPTS]->(f:FileCategory) | pf] as pf,
[pc=(j)-[:CHARGED_BY]->(apc:APC) | pc] as pc,
[pt=(j)-[:ACCEPTS]->(sft:SupportedFileType) | pt] as pt,
[pl=(j)<-[:BELONGS_TO|:CHILD_OF*..]-(c:Classification) | pl] as pl
RETURN pa, po, ps, pf, pc, pt, pl
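For completeness: the limitation on post-processing a UNION was lifted in Neo4j 4.0, where the whole UNION can be wrapped in a CALL { ... } subquery and paginated afterwards. A minimal sketch, assuming Neo4j 4.0 or later and showing only three of the seven branches from the question:

```cypher
// Assumes Neo4j 4.0+. SKIP/LIMIT now apply to the combined result of
// all branches, not only to the last one.
CALL {
  MATCH p=(j:Journal)<-[:BELONGS_TO]-(:ArticleType) RETURN p
  UNION
  MATCH p=(j:Journal)<-[:OWNS]-(:JournalOwner) RETURN p
  UNION
  MATCH p=(j:Journal)-[:ACCEPTS]->(:FileCategory) RETURN p
}
RETURN p
SKIP 0 LIMIT 100
```

This paginates over paths rather than over journals, and every branch is still expanded for every :Journal node before the LIMIT applies, so the LIMIT-first approach above is probably still the better fit for journal-level pagination.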

Cypher Query not returning nonexistent relationships

I have a graph database where there are user and interest nodes which are connected by IS_INTERESTED relationship. I want to find interests which are not selected by a user. I wrote this query and it is not working
OPTIONAL MATCH (u:User{userId : 1})-[r:IS_INTERESTED] -(i:Interest)
WHERE r is NULL
Return i.name as interest
According to answers to similar questions on SO (like this one), the above query is supposed to work. However, in this case it returns null. But when running the following query, it works as expected:
MATCH (u:User{userId : 1}), (i:Interest)
WHERE NOT (u) -[:IS_INTERESTED] -(i)
return i.name as interest
The reason I don't want to run the above query is because Neo4j gives a warning:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this
will build a cartesian product between all those parts. This may
produce a large amount of data and slow down query processing. While
occasionally intended, it may often be possible to reformulate the
query that avoids the use of this cross product, perhaps by adding a
relationship between the different parts or by using OPTIONAL MATCH
(identifier is: (i))
What am I doing wrong in the first query where I use OPTIONAL MATCH to find nonexistent relationships?
1) MATCH looks for the pattern as a whole; if it cannot find the pattern in its entirety, it does not return anything.
2) I think that this query will be effective:
// Take all user interests
MATCH (u:User{userId: 1})-[r:IS_INTERESTED]-(i:Interest)
WITH collect(i) as interests
// Check what interests are not included
MATCH (ni:Interest) WHERE NOT ni IN interests
RETURN ni.name
When your OPTIONAL MATCH query does not find a match, both r AND i must be NULL. After all, since there is no relationship, there is no way to get the nodes that it points to.
A WHERE directly after the OPTIONAL MATCH is pulled into the evaluation.
If you want to post-filter you have to use a WITH in between.
MATCH (u:User{userId : 1})
OPTIONAL MATCH (u)-[r:IS_INTERESTED] -(i:Interest)
WITH r,i
WHERE r is NULL
Return i.name as interest
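One thing to watch in the query above: whenever r is NULL, i is NULL too, so filtering on r IS NULL cannot list the interests the user has not picked. A minimal sketch of an OPTIONAL MATCH variant that anchors on the Interest nodes instead, assuming the labels and the userId property from the question:

```cypher
// Start from every interest, then look for a link to user 1.
MATCH (i:Interest)
OPTIONAL MATCH (i)-[r:IS_INTERESTED]-(:User {userId: 1})
// The WITH is needed so that WHERE filters rows after the OPTIONAL MATCH
// instead of becoming part of its pattern.
WITH i, r
WHERE r IS NULL
RETURN i.name AS interest
```

Because i is bound by a plain MATCH before the OPTIONAL MATCH, it stays non-null in the rows where no relationship was found, and those are exactly the rows kept.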
