I have a JSON document with history-based entity counts and relationship counts. I want to use this data as a lookup for entities and relationships in Neo4j. The lookup data has around 3,000 rows. For the entity counts, I want to display the counts for two entities based on UUID. For relationships, I want to order by two relationship counts (related entities and related mutual entities).
For entities, I have started with the following:
// get JSON doc
with value.aggregations.ent.terms.buckets as data
unwind data as lookup1
unwind data as lookup2
MATCH (e1:Entity)-[r1:RELATED_TO]-(e2)
WHERE e1.uuid = '$entityId'
AND e1.uuid = lookup1.key
AND e2.uuid = lookup2.key
RETURN e1.uuid, lookup1.doc_count, r1.uuid, e2.uuid, lookup2.doc_count
ORDER BY lookup2.doc_count DESC // just to demonstrate
LIMIT 50
I'm noticing that the query takes about 10 seconds. What am I doing wrong, and how can I correct it?
Attaching explain plan:
Your query is very inefficient. You stated that data has 3,000 rows (let's call that number D).
So, your first UNWIND creates an intermediate result of D rows.
Your second UNWIND creates an intermediate result of D**2 (i.e., 9 million) rows.
If your MATCH (e1:Entity)-[r1:RELATED_TO]-(e2) clause finds N results, that generates an intermediate result of up to N*(D**2) rows.
Since your MATCH clause specifies a non-directional relationship pattern, it finds the same pair of nodes twice (in reverse order). So, N is actually twice as large as it needs to be.
Here is an improved version of your query, which should be much faster (with N/2 intermediate rows):
WITH apoc.map.groupBy(value.aggregations.ent.terms.buckets, 'key') as lookup
MATCH (e1:Entity)-[r1:RELATED_TO]->(e2)
WHERE e1.uuid = $entityId AND lookup[e1.uuid] IS NOT NULL AND lookup[e2.uuid] IS NOT NULL
RETURN e1.uuid, lookup[e1.uuid].doc_count AS count1, r1.uuid, e2.uuid, lookup[e2.uuid].doc_count AS count2
ORDER BY count2 DESC
LIMIT 50
The trick here is that the query uses apoc.map.groupBy to convert your buckets (a list of maps) into a single unified lookup map that uses the bucket key values as its property names. This allows the rest of the query to literally "look up" each uuid's data in the unified map.
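For instance, with made-up bucket data (the values here are purely illustrative):

```cypher
// apoc.map.groupBy converts a list of maps into a single map,
// keyed by each element's value for the given property
// returns {a: {key: 'a', doc_count: 3}, b: {key: 'b', doc_count: 7}}
RETURN apoc.map.groupBy([{key: 'a', doc_count: 3}, {key: 'b', doc_count: 7}], 'key') AS lookup
```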
I have a graph database in Neo4j with drugs and drug-drug interactions, among other entities. A ()-[:IS_PARTICIPANT_IN]->() relationship connects a drug to an interaction. I need to obtain those pairs of drugs a and b which are not involved in any :IS_PARTICIPANT_IN relationship other than the one between them, i.e. (a)-[:IS_PARTICIPANT_IN]->(ddi:DrugDrugInteraction)<-[:IS_PARTICIPANT_IN]-(b), with no other IS_PARTICIPANT_IN relationships involving either a or b.
For that purpose, I have tried the following Cypher query. However, it ends up exhausting the heap (raised to 8 GB), as the collect operations consume too much memory.
MATCH (drug1:Drug)-[r1:IS_PARTICIPANT_IN]->(ddi:DrugDrugInteraction)
MATCH (drug2:Drug)-[r2:IS_PARTICIPANT_IN]->(ddi)
WHERE drug1 <> drug2
OPTIONAL MATCH (drug2)-[r3:IS_PARTICIPANT_IN]->(furtherDDI:DrugDrugInteraction)
WHERE furtherDDI <> ddi
WITH drug1, drug2, ddi, COLLECT(ddi) AS ddis, furtherDDI, COLLECT(furtherDDI) AS additionalDDIs
WITH drug1, drug2, ddi, COUNT(ddis) AS n1, COUNT(additionalDDIs) AS n2
WHERE n1 = 1 AND n2 = 0
RETURN drug1.name, drug2.name, ddi.name ORDER BY drug1;
How can I improve my code so as to get the desired results without exceeding the heap size limit?
This should work:
MATCH (d:Drug)
WHERE SIZE((d)-[:IS_PARTICIPANT_IN]->()) = 1
MATCH (d)-[:IS_PARTICIPANT_IN]->(ddi)
RETURN ddi.name AS ddiName, COLLECT(d.name) AS drugNames
ORDER BY drugNames[0]
The WHERE clause uses a very efficient degreeness check to filter for Drug nodes that have only a single outgoing IS_PARTICIPANT_IN relationship. This check is efficient because it does not have to actually get any DrugDrugInteraction nodes.
After the degreeness check, the query performs a second MATCH to actually get the associated DrugDrugInteraction node. (I assume that the IS_PARTICIPANT_IN relationship only points at DrugDrugInteraction nodes, and have therefore omitted the label from the search pattern, for efficiency).
The RETURN clause uses the aggregating function COLLECT to collect the Drug names for each ddi name. (I assume that ddi nodes have unique names.)
By the way, this query will also work if there are any number of Drugs (not just 2) that participate in the same DrugDrugInteraction, and no other ones. Also, if a matched DrugDrugInteraction happens to have a related Drug that participates in other interactions, this query will not include that Drug in the result (since this query only pays attention to d nodes that passed the initial degreeness check).
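As a quick sanity check, consider this hypothetical data set (all names are made up):

```cypher
// c participates in two interactions, so it fails the degree check;
// a and b participate only in ddi1, so they are collected together
CREATE (a:Drug {name: 'a'}), (b:Drug {name: 'b'}), (c:Drug {name: 'c'}),
       (ddi1:DrugDrugInteraction {name: 'ddi1'}),
       (ddi2:DrugDrugInteraction {name: 'ddi2'}),
       (a)-[:IS_PARTICIPANT_IN]->(ddi1),
       (b)-[:IS_PARTICIPANT_IN]->(ddi1),
       (c)-[:IS_PARTICIPANT_IN]->(ddi1),
       (c)-[:IS_PARTICIPANT_IN]->(ddi2)
```

With this data, the query returns a single row: ddiName 'ddi1' and drugNames ['a', 'b'] (in some order); c is excluded because it participates in two interactions.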
I am making a named-entity graph in Neo4j 3.2.0. I have ARTICLE and ENTITY as node types, and the relation/edge between them is CONTAINS, which represents an occurrence of the entity in that article (as shown in the attached picture, "Simple graph for articles and entities"). So if an article contains one entity 5 times, there will be 5 edges between that article and that particular entity.
There are roughly 18 million articles and 40 thousand unique entities. The whole dataset is around 20 GB (including indices on IDs) and is loaded on a machine with 32 GB of RAM.
I am using this graph to suggest/recommend other entities, but my queries are taking too much time.
Use case 1: Find all entities present in articles that contain an entity from the list ["A", "B"] as well as the entities "X", "Y", and "Z", ordered by article count.
Here is the cypher query I am running.
MATCH(e:Entity)-[:CONTAINS]-(a:Article)
WHERE e.EID in ["A","B"]
WITH a
MATCH (:Entity {EID:"X"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Y"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Z"})-[:CONTAINS]-(a)
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type ,count(distinct(a)) as articleCount
ORDER BY articleCount desc
The query profile is attached.
This query gives me all first-level entity neighbours of articles having the X, Y, Z entities and at least one of A, B (I had to change the IDs in the query for content sensitivity).
I was just wondering if there is a better/fast way of doing it?
Another observation: if I keep adding filters (more MATCH clauses like X, Y, Z), performance deteriorates, despite the fact that the result set gets smaller and smaller.
You have a uniqueness constraint on :Entity(EID), so at least that optimization is already in place.
The following Cypher query is simpler, and generates a simpler execution plan. Hopefully, it also reduces the number of DB hits.
MATCH (e:Entity)-[:CONTAINS]-(a)
WHERE e.EID in ['A','B'] AND ALL(x IN ['X','Y','Z'] WHERE (:Entity {EID: x})-[:CONTAINS]-(a))
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type, COUNT(DISTINCT a) as articleCount
ORDER BY articleCount DESC;
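For completeness, the :Entity(EID) uniqueness constraint mentioned above can be created as follows (Neo4j 3.x syntax), should it ever be missing:

```cypher
// a uniqueness constraint also creates a backing index on :Entity(EID)
CREATE CONSTRAINT ON (e:Entity) ASSERT e.EID IS UNIQUE;
```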
There are two node types, Account and Transfer. A Transfer signifies movement of funds between Account nodes. Transfer nodes may have any number of input and output nodes. For example, three Accounts could each send $40 ($120 combined) to sixteen other Accounts in any way they please and it would work.
The Transfer object, as is, does not have the sum of the funds sent or received; those are only stored in the relationships themselves. I'd like to calculate this in the Cypher query and return it as part of the returned Transfer object, not separately (similar to a SQL JOIN).
I'm rather new to Neo4j + Cypher; So far, the query I've got is this:
MATCH (tf:Transfer {id:'some_id'})
MATCH (tf)<-[in:IN_TO]-(in_account:Account)
MATCH (tf)-[out:OUT_TO]->(out_account:Account)
RETURN tf,in_account,in,out_account,out, sum(in.value) as sum_in, sum(out.value) as sum_out
If I managed this database, I'd just precalculate the sums and store it in the Transfer properties - but that's not an option at this time.
tl;dr: I'd like to store sum_in and sum_out in the returned tf object.
Tore Eschliman's answer is very insightful, especially on the behavior of aggregated aliases.
I came up with a more hackish solution that could work in this case.
Example data set:
CREATE
(a1:Account),
(a2:Account),
(a3:Account),
(tf:Transfer),
(a1)-[:IN_TO {value: 110}]->(tf),
(a2)-[:IN_TO {value: 230}]->(tf),
(tf)-[:OUT_TO {value: 450}]->(a3)
Query:
MATCH (in_account:Account)-[in:IN_TO]->(tf:Transfer)
WITH tf, SUM(in.value) AS sum_in
SET tf.sum_in = sum_in
RETURN tf
UNION
MATCH (tf:Transfer)-[out:OUT_TO]->(out_account:Account)
WITH tf, SUM(out.value) AS sum_out
SET tf.sum_out = sum_out
RETURN tf
Results:
╒═══════════════════════════╕
│tf │
╞═══════════════════════════╡
│{sum_in: 340, sum_out: 450}│
└───────────────────────────┘
Note that UNION performs a set union (as opposed to UNION ALL, which performs a multiset/bag union), hence we will not have duplicates in the results.
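The difference is easy to see with a trivial pair of queries:

```cypher
// set union: the duplicate row is removed, so this returns a single row
RETURN 1 AS x UNION RETURN 1 AS x;

// bag union: duplicates are kept, so this returns two rows
RETURN 1 AS x UNION ALL RETURN 1 AS x;
```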
Update: as Tore Eschliman pointed out in the comments, this solution will modify the database. As a workaround, you can collect the results and abort the transaction afterwards.
When you use an aggregation like SUM, every non-aggregated alias in the same result row acts as a grouping key, so you have to leave the individual relationships and accounts out of your result row or you'll end up with per-row sums. This should help you get something closer to what you want, including a workaround for your dynamic property assignment:
CREATE (temp)
WITH temp
MATCH (tf:Transfer {id:'some_id'})
SET temp += PROPERTIES(tf)
WITH temp, tf
// aggregate the incoming side first, so the in/out rows never form a
// cross product that would inflate the sums
MATCH (tf)<-[in:IN_TO]-(in_account:Account)
WITH temp, tf, SUM(in.value) AS sum_in, COLLECT(in_account) AS in_accounts
MATCH (tf)-[out:OUT_TO]->(out_account:Account)
WITH temp, sum_in, in_accounts, SUM(out.value) AS sum_out, COLLECT(out_account) AS out_accounts
SET temp.sum_in = sum_in
SET temp.sum_out = sum_out
WITH temp, PROPERTIES(temp) AS props, in_accounts, out_accounts
DELETE temp
RETURN props, in_accounts, out_accounts
You are creating a dummy node to hold properties, because that's the only way to assign dynamic properties to existing Maps or Map-alikes, but the node won't ever be committed to the graph. This query should return a Map of the :Transfer's properties, with the in- and out-sums included, plus lists of the in- and out-accounts in case you need to do any additional work on them.
I have a "Gene" Label/node type with properties "value" and "geneName"
I have a separate Label/node type called Pathway with property "pathwayName".
I want to go through all the different geneNames and find the average of all the Gene values with that gene name. I need all those genes displayed as different rows. Bear in mind that I have a lot of geneNames, so I can't name them all in the query. I need to do this inside a certain Pathway.
MATCH (sample)-[:Measures]->(gene)-[:Part_Of]->(pathway)
WHERE pathway.pathwayName = 'Pyrimidine metabolism'
WITH sample, gene, Collect (distinct gene.geneName) AS temp
I have been trying to figure this out all day now and all I can manage to do is retrieve all the rows of geneNames. I'm lost from there.
RETURN extract(n IN temp | RETURN avg(gene.value))
Maybe?
This query should return the average gene value for each distinct gene name:
MATCH (sample)-[:Measures]->(gene)-[:Part_Of]->(pathway:Pathway)
WHERE pathway.pathwayName = 'Pyrimidine metabolism'
RETURN sample, gene.geneName AS name, AVG(gene.value) AS avg;
When you use an aggregation function (like AVG), it automatically uses distinct values for the non-aggregating values in the same WITH or RETURN clause (i.e., sample and gene.geneName in the above query).
For efficiency, I have also added the label to the pathway nodes so that neo4j can start off by scanning just Pathway nodes instead of all nodes. In addition, you should consider creating an index on :Pathway(pathwayName), so that the search for the Pathway is as fast as possible.
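A sketch of that index creation in Neo4j 3.x syntax (the label and property names are taken from the query above):

```cypher
// back the pathwayName lookup with an index
CREATE INDEX ON :Pathway(pathwayName);
```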
Currently I have a unique index on node with label "d:ReferenceEntity". It's taking approximately 11 seconds for this query to run, returning 7 rows. Granted T1 has about 400,000 relationships.
I'm not sure why this would take so long, considering we could build a map of all nodes connected to T1, giving constant-time lookups.
Am I missing some other index features that Neo4j can provide? Also, my entire dataset is in memory, so this shouldn't have anything to do with going to disk.
match(n:ReferenceEntity {entityId : "T1" })-[r:HAS_REL]-(d:ReferenceEntity) WHERE d.entityId in ["T2", "T3", "T4"] return n
:schema
Indexes
ON :ReferenceEntity(entityId) ONLINE (for uniqueness constraint)
Constraints
ON (referenceentity:ReferenceEntity) ASSERT referenceentity.entityId IS UNIQUE
Explain Plan:
You had used EXPLAIN instead of PROFILE to get that query plan, so it shows misleading estimated row counts. If you had used PROFILE, then the Expand(All) operation actually would have had about 400,000 rows, since that operation would actually iterate through every relationship. That is why your query takes so long.
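To see the real numbers, the original query can simply be prefixed with PROFILE (the actual row counts will depend on your data):

```cypher
// PROFILE executes the query and annotates each operator with actual rows and db hits,
// whereas EXPLAIN only shows the planner's estimates without running anything
PROFILE
MATCH (n:ReferenceEntity {entityId: "T1"})-[r:HAS_REL]-(d:ReferenceEntity)
WHERE d.entityId IN ["T2", "T3", "T4"]
RETURN n
```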
You can try this query, which tells Cypher to use the index on d as well as on n. (On my machine, I had to use the USING INDEX clause twice to get the desired plan.) It definitely pays to use PROFILE when tuning Cypher code.
MATCH (n:ReferenceEntity { entityId : "T1" })
USING INDEX n:ReferenceEntity(entityId)
MATCH (n)-[r:HAS_REL]-(d:ReferenceEntity)
USING INDEX d:ReferenceEntity(entityId)
WHERE d.entityId IN ["T2", "T3", "T4"]
RETURN n, d;
Here is the Profile Plan (In my DB, I had 2 relationships that satisfy the WHERE test):