I am building a named-entity graph in Neo4j 3.2.0. I have ARTICLE and ENTITY as node types, and the relation/edge between them is CONTAINS, which represents an occurrence of the entity in that article (as shown in the attached picture, "Simple graph for articles and entities"). So if an article contains an entity 5 times, there will be 5 edges between that article and that particular entity.
There are roughly 18 million articles and 40 thousand unique entities. The whole dataset is around 20 GB (including indices on IDs) and is loaded on a machine with 32 GB of RAM.
I am using this graph to suggest/recommend other entities, but my queries are taking too much time.
Use case 1: Find all entities present in articles that contain at least one entity from the list ["A", "B"] as well as the entities "X", "Y", and "Z", ordered by article count.
Here is the cypher query I am running.
MATCH (e:Entity)-[:CONTAINS]-(a:Article)
WHERE e.EID in ["A","B"]
WITH a
MATCH (:Entity {EID:"X"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Y"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Z"})-[:CONTAINS]-(a)
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type, count(distinct(a)) as articleCount
ORDER BY articleCount desc
The query profile is attached.
This query gives me all first-level entity neighbours of articles containing X, Y, Z and at least one of the entities A, B (I had to change the IDs in the query for content sensitivity).
I was just wondering if there is a better/faster way of doing this?
Another observation: if I keep adding filters (more MATCH clauses like the ones for X, Y, Z), performance deteriorates, even though the result set keeps getting smaller.
You have a uniqueness constraint on :Entity(EID), so at least that optimization is already in place.
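For anyone who does not yet have that constraint, it can be created in Neo4j 3.x like this (it is also backed by an index on EID):
CREATE CONSTRAINT ON (e:Entity) ASSERT e.EID IS UNIQUE;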
The following Cypher query is simpler, and generates a simpler execution plan. Hopefully, it also reduces the number of DB hits.
MATCH (e:Entity)-[:CONTAINS]-(a)
WHERE e.EID in ['A','B'] AND ALL(x IN ['X','Y','Z'] WHERE (:Entity {EID: x})-[:CONTAINS]-(a))
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type, COUNT(DISTINCT a) as articleCount
ORDER BY articleCount DESC;
Related
I have a JSON document with history-based entity counts and relationship counts. I want to use this lookup data for entities and relationships in Neo4j. The lookup data has around 3,000 rows. For the entity counts, I want to display the counts for two entities based on UUID. For relationships, I want to order by two relationship counts (related entities and related mutual entities).
For entities, I have started with the following:
// get JSON doc
with value.aggregations.ent.terms.buckets as data
unwind data as lookup1
unwind data as lookup2
MATCH (e1:Entity)-[r1:RELATED_TO]-(e2)
WHERE e1.uuid = '$entityId'
AND e1.uuid = lookup1.key
AND e2.uuid = lookup2.key
RETURN e1.uuid, lookup1.doc_count, r1.uuid, e2.uuid, lookup2.doc_count
ORDER BY lookup2.doc_count DESC // just to demonstrate
LIMIT 50
I'm noticing that the query is taking about 10 seconds. What am I doing wrong, and how can I correct it?
Attaching explain plan:
Your query is very inefficient. You stated that data has 3,000 rows (let's call that number D).
So, your first UNWIND creates an intermediate result of D rows.
Your second UNWIND creates an intermediate result of D**2 (i.e., 9 million) rows.
If your MATCH (e1:Entity)-[r1:RELATED_TO]-(e2) clause finds N results, that generates an intermediate result of up to N*(D**2) rows.
Since your MATCH clause specifies a non-directional relationship pattern, it finds the same pair of nodes twice (in reverse order). So, N is actually twice as large as it needs to be.
Here is an improved version of your query, which should be much faster (with N/2 intermediate rows):
WITH apoc.map.groupBy(value.aggregations.ent.terms.buckets, 'key') as lookup
MATCH (e1:Entity)-[r1:RELATED_TO]->(e2)
WHERE e1.uuid = $entityId AND lookup[e1.uuid] IS NOT NULL AND lookup[e2.uuid] IS NOT NULL
RETURN e1.uuid, lookup[e1.uuid].doc_count AS count1, r1.uuid, e2.uuid, lookup[e2.uuid].doc_count AS count2
ORDER BY count2 DESC
LIMIT 50
The trick here is that the query uses apoc.map.groupBy to convert your buckets (a list of maps) into a single unified lookup map that uses the bucket key values as its property names. This allows the rest of the query to literally "look up" each uuid's data in the unified map.
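To illustrate what that map looks like, here is a small sketch with made-up bucket values:
WITH [{key: 'uuid-1', doc_count: 42}, {key: 'uuid-2', doc_count: 7}] AS buckets
RETURN apoc.map.groupBy(buckets, 'key') AS lookup
// lookup is {`uuid-1`: {key: 'uuid-1', doc_count: 42}, `uuid-2`: {key: 'uuid-2', doc_count: 7}}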
Please check my Cypher below. I get results with this query when there are only a few records, but as the number of records increases it takes a very long time (about 1,601,152 ms). I found a suggestion to add USING INDEX, and I have applied it in the query:
PROFILE MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person)-[:WATCHED]->(ma:Movie)-[:HAS_TAG]->(t:Tag)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
USING INDEX a:App(app_id) WHERE p.person_id= '1'
AND NOT (p:Person)-[:WATCHED]-(mb)
RETURN DISTINCT(mb.movie_id), mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(DISTINCT(t.tag_id)) as Tag, count(DISTINCT(t.tag_id)) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50
Can you help me out what can I do?
I am trying to find 100 movies to recommend on the basis of tags: 100 movies that I have not watched and whose tags match the tags of movies I have watched.
The following query may work better for you [assuming you have indexes on both :App(app_id) and :Person(person_id)]. By the way, I presumed that in your query the identifier ma should have been m (or vice versa).
MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person {person_id: '1'})-[:WATCHED]->(m)
WITH a, p, COLLECT(m) AS movies
UNWIND movies AS movie
MATCH (movie)-[:HAS_TAG]->(t)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
WHERE NOT mb IN movies
WITH DISTINCT mb, t
RETURN mb.movie_id, mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(t.tag_id) as Tag, COUNT(t.tag_id) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50;
If you PROFILE this query, you should see that it performs NodeIndexSeek operations (instead of the much slower NodeByLabelScan) to quickly execute the first MATCH. The query also collects all the movies watched by the specified person and uses that collection later to speed up the WHERE clause (which no longer needs to hit the DB). In addition, the query omits labels from some of the node patterns (where doing so seemed unambiguous) to speed up processing further.
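If either of those indexes does not exist yet, it can be created with the Neo4j 3.x syntax:
CREATE INDEX ON :App(app_id);
CREATE INDEX ON :Person(person_id);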
I am sorry for the stupid question. I have two types of nodes in my Neo4j database, namely Recipe and Meal_Type. I am running a Cypher query that returns all relationships between the two types of nodes. The query is not that special; it is the default query that returns relationships with a limit of 200 nodes.
MATCH ()-[r]->() RETURN r LIMIT 200
It is running fine, but I need at least all Meal_Type nodes in the result, regardless of the rest of the result. Right now it is returning 3 (sometimes 4 or 5 on re-running the query) out of 11 Meal_Type nodes.
I think you should fetch all of the Meal_Type nodes first and then, with that result, fetch a set of corresponding Recipe nodes.
Here is an example of what I am talking about. Fetch all of the different meal types (unless, of course, you only have some specific ones you are interested in). Then, with those meal types, return a sampling of the corresponding recipes (200 ~= 19 * 11).
// match meal types
MATCH (mt:Meal_Type)
WITH mt
// find a sampling of the corresponding recipes
MATCH (mt)<-[:OF_TYPE]-(r:Recipe)
RETURN mt, collect(r)[0..18] AS recipe_sample
Really? I answered this yesterday on your previous question; this is just a variation.
This should do the trick, sorting the relationships by node label:
MATCH (n)-[r]-()
RETURN r
ORDER BY head(labels(n))
Currently I have a unique index on nodes with the label ReferenceEntity. It is taking approximately 11 seconds for this query to run, returning 7 rows. Granted, T1 has about 400,000 relationships.
I'm not sure why this should take so long, considering we could build a map of all nodes connected to T1, giving constant-time lookups.
Am I missing some other index features that Neo4j can provide? Also, my entire dataset is in memory, so it shouldn't have anything to do with going to disk.
match(n:ReferenceEntity {entityId : "T1" })-[r:HAS_REL]-(d:ReferenceEntity) WHERE d.entityId in ["T2", "T3", "T4"] return n
:schema
Indexes
ON :ReferenceEntity(entityId) ONLINE (for uniqueness constraint)
Constraints
ON (referenceentity:ReferenceEntity) ASSERT referenceentity.entityId IS UNIQUE
Explain Plan:
You used EXPLAIN instead of PROFILE to get that query plan, so it shows misleading estimated row counts. If you had used PROFILE, the Expand(All) operation would have shown about 400,000 rows, since that operation actually iterates through every relationship. That is why your query takes so long.
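For example, profiling your original statement (simply prefix it with PROFILE) will show the actual rows and DB hits per operator:
PROFILE
MATCH (n:ReferenceEntity {entityId: "T1"})-[r:HAS_REL]-(d:ReferenceEntity)
WHERE d.entityId IN ["T2", "T3", "T4"]
RETURN n;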
You can try the following query, which tells Cypher to use the index on d as well as on n. (On my machine, I had to use the USING INDEX clause twice to get the desired plan.) It definitely pays to use PROFILE to tune Cypher code.
MATCH (n:ReferenceEntity { entityId : "T1" })
USING INDEX n:ReferenceEntity(entityId)
MATCH (n)-[r:HAS_REL]-(d:ReferenceEntity)
USING INDEX d:ReferenceEntity(entityId)
WHERE d.entityId IN ["T2", "T3", "T4"]
RETURN n, d;
Here is the Profile Plan (In my DB, I had 2 relationships that satisfy the WHERE test):
Usually I build relationships between nodes while loading from CSV files. Here is a Cypher statement I used this time to build relationships between already-loaded nodes. There are 39K Language nodes and 2M Description nodes.
MATCH (d:Description), (l:Language)
WHERE d.description_language = l.language_name
CREATE (d)-[r:HAS_LANGUAGE]->(l);
After a long run, the error I got is:
Self-suppression not permitted
I have created indexes on the properties used to build the relationships.
Indexes
...
ON :Description(woka_id) ONLINE
ON :Description(description_language) ONLINE
ON :Language(language_id) ONLINE (for uniqueness constraint)
ON :Language(language_name) ONLINE (for uniqueness constraint)
...
What am I doing wrong here that causes the relationship creation to take so long (more than 10 hours)?
You are dealing with a very large cartesian product at the filter step:
WHERE d.description_language = l.language_name
You could try to MATCH the Descriptions, group them by their description_language and CREATE the relationships from there:
MATCH (d:Description)
WITH d.description_language AS dl, collect(d) as all_d_for_lang
MATCH (l:Language {language_name: dl})
UNWIND all_d_for_lang AS d
CREATE (d)-[:HAS_LANGUAGE]->(l)
If you look at the PROFILE of this query, you'll see there are fewer DB hits (limit the number of descriptions in the first MATCH for testing, as in the sketch below).
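For example, a capped test run could look like this (the LIMIT value is arbitrary; note that it still creates relationships for the limited batch):
PROFILE
MATCH (d:Description)
WITH d LIMIT 10000
WITH d.description_language AS dl, collect(d) AS all_d_for_lang
MATCH (l:Language {language_name: dl})
UNWIND all_d_for_lang AS d
CREATE (d)-[:HAS_LANGUAGE]->(l);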
In general, I think the best way would be to use your CSV files to generate the relationships when you create the nodes, i.e. do this on the application side, not in the database.
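If the original CSV is still at hand, one way to drive the relationship creation from it (rather than from a whole-graph MATCH) is LOAD CSV. A sketch, where the file name and column names are assumptions about your data:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///descriptions.csv' AS row
MATCH (d:Description {woka_id: row.woka_id})
MATCH (l:Language {language_name: row.description_language})
CREATE (d)-[:HAS_LANGUAGE]->(l);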
Since you are creating relationships from every Description node, and there are 2M of them, I would just grab the descriptions that are not yet matched and process them in smaller batches.
Something like...
match (d:Description)
where not (d)-[:HAS_LANGUAGE]->()
with d
limit 200000
match (l:Language {language_name: d.description_language})
create (d)-[:HAS_LANGUAGE]->(l)
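This handles one batch per run, so you would re-run the statement until the first MATCH finds no more unlinked descriptions. If the APOC library is installed, the same batching idea can also be expressed with apoc.periodic.iterate (a sketch; the batch size is arbitrary):
CALL apoc.periodic.iterate(
  "MATCH (d:Description) WHERE NOT (d)-[:HAS_LANGUAGE]->() RETURN d",
  "MATCH (l:Language {language_name: d.description_language}) CREATE (d)-[:HAS_LANGUAGE]->(l)",
  {batchSize: 10000});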