How can I optimise my neo4j cypher query? - neo4j

Please check my Cypher below, I am getting result with the query below() with low records but as records increases it take a long time about 1601152 ms:
i found suggestion to add USING INDEX and and I apply the USING INDEX in query.
PROFILE MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person)-[:WATCHED]->(ma:Movie)-[:HAS_TAG]->(t:Tag)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
USING INDEX a:App(app_id) WHERE p.person_id= '1'
AND NOT (p:Person)-[:WATCHED]-(mb)
RETURN DISTINCT(mb.movie_id) , mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(DISTINCT(t.tag_id)) as Tag, count(DISTINCT(t.tag_id)) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50
Can you help me out what can I do?
I am trying to find 100 movies for recommendation on basis of tags, as 100 movies which I do not watch and match with tags of Movies I watched.

The following query may work better for you [assuming you have indexes on both :App(app_id) and :Person(person_id)]. By the way, I presumed that in your query the identifier ma should have been m (or vice versa).
MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person {person_id: '1'})-[:WATCHED]->(m)
WITH a, p, COLLECT(m) AS movies
UNWIND movies AS movie
MATCH (movie)-[:HAS_TAG]->(t)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
WHERE NOT mb IN movies
WITH DISTINCT mb, t
RETURN mb.movie_id, mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(t.tag_id) as Tag, COUNT(t.tag_id) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50;
If you PROFILE this query, you should see that it performs NodeIndexSeek operations (instead of the much slower NodeByLabelScan) to quickly execute the first MATCH. The query also collects all the movies watched by the specified person and uses that collection later to speed up the WHERE clause (which no longer needs hit the DB). In addition, the query removed some labels from some of the node patterns (where doing so seemed likely to be unambiguous) to speed up processing further.

Related

find all items that wasn't bought by a person and count the times it was bought

I have a graph that looks like this.
I want to find all the items bought by the people, who bought the same items as Gremlin using cypher.
Basically I want to imitate the query in the gremlin examples that looks like this
g.V().has("name","gremlin")
.out("bought").aggregate("stash")
.in("bought").out("bought")
.where(not(within("stash")))
.groupCount()
.order(local).by(values,desc)
I was trying to do it like this
MATCH (n)-[:BOUGHT]->(g_item)<-[:BOUGHT]-(r),
(r)-[:BOUGHT]->(n_item)
WHERE
n.name = 'Gremlin'
AND NOT (n)-[:BOUGHT]->(n_item)
RETURN n_item.id, count(*) as frequency
ORDER by frequency DESC
but it seems it doesn't count frequencies properly - they seem to be twice as big.
4 - 4
5 - 2
3 - 2
While 3 and 5 was bought only once and 4 was bought 2 times.
What's the problem?
Cypher is interested in paths, and your MATCH finds the following:
2 paths to item 3 both through Rexter (via items 2 and 1)
2 paths to item 5 through Pipes (via items 1 and 2)
4 paths to item 4 via Rexter and Pipes (via items 1 and 2 for each person)
Basically the items are being counted multiple times because there are multiple paths to that same item per individual person via different common items with Gremlin.
To get accurate counts, you either need to match to distinct r users, and only then match out to items the r users bought (as long as they aren't in the collection of items bought by Gremlin), OR you need to do the entire match, but before doing the counts, get distinct items with respect to each person so each item per person only occurs once...then get the count per item (counts across all persons).
Here's a query that uses the second approach
MATCH (n:Person)-[:BOUGHT]->(g_item)
WHERE n.name = 'Gremlin'
WITH n, collect(g_item) as excluded
UNWIND excluded as g_item // now you have excluded list to use later
MATCH (g_item)<-[:BOUGHT]-(r)-[:BOUGHT]->(n_item)
WHERE r <> n AND NOT n_item in excluded
WITH DISTINCT r, n_item
WITH n_item, count(*) as frequency
RETURN n_item.id, frequency
ORDER by frequency DESC
You should be using labels in your graph, and you should use them in your query in order to leverage indexes and quickly find a starting point in the graph. In your case, an index on :Person(name), and usage of the :Person label in the query, should make this quick even as more nodes and more :Persons are added to the graph.
EDIT
If you're just looking for conciseness of the query, and don't have a large enough graph where performance will be an issue, then you can use your original query but add one extra line to get distinct rows of r and n_item before you count the item. This ensures that you only count an item per person once when you get the count.
Note that forgoes optimizations for handling excluded items (it will do a pattern match per item rather than aggregating the collection of bought items and doing a collection membership check), and it aggregates on items while doing property access rather than doing property access only after aggregating by the node.
MATCH (n:Person)-[:BOUGHT*2]-(r)-[:BOUGHT]->(n_item)
WHERE n.name = 'Gremlin'
WITH DISTINCT n, r, n_item
WHERE NOT (n)-[:BOUGHT]->(n_item)
RETURN n_item.id, count(*) as frequency
ORDER by frequency DESC
I am adding a quick shortcut in your match, using :BOUGHT*2 to indicate two :BOUGHT hops to r, since we don't really care about the item in-between.

can we use datetime filter in MATCH clause instead of WHERE clause on a node

I have some sample tweets stored as neo4j. Below query finds top hashtags from specific country. It is taking a lot of time because the time filter for status type nodes is in where clause and is slowing the response. Is it possible to move this filter to MATCH clause so that status nodes are filtered before relationships are found?
match (c:country{countryCode:"PK"})-[*0..4]->(s:status)-[*0..1]->(h:hashtag) where (s.createdAt >= datetime('2017-06-01T00:00:00') AND s.createdAt
>= datetime('2017-06-01T23:59:59')) return h.name,count(h.name) as hCount order by hCount desc limit 100
thanks
As mentioned in my comment, whether a predicate for a property is in the MATCH clause or the WHERE clause shouldn't matter, as this is just syntactical sugar and is interpreted the same way by the query planner.
You can use PROFILE or EXPLAIN to see the query plan to see what it's doing. PROFILE will give you more information but will have to actually execute the query. You can attempt to use planner hints to force the planner to plan the match differently which may yield a better approach.
You will want to ensure you have an index on :status(createdAt).
You can also try altering your match a little, and moving the portion connecting to the country in question into your WHERE clause instead. Also it's a good idea to get the count based upon the hashtag node itself (assuming there's only one :hashtag node for a given name) so you can order and limit before you do property access:
MATCH (s:status)-[*0..1]->(h:hashtag)
WHERE (s.createdAt >= datetime('2017-06-01T00:00:00') AND s.createdAt
>= datetime('2017-06-01T23:59:59'))
AND (:country{countryCode:"PK"})-[*0..4]->(s)
WITH h, count(h) as hCount
ORDER BY hCount DESC
LIMIT 100
RETURN h.name, hCount

Neo4J Cypher Query to find common linked nodes

I am making a named entity graph in Noe4j 3.2.0. I have ARTICLE and ENTITY as node types. And the relation/edge between them is CONTAINS; which represents the number of times the entity has occurred in that article (As shown in attached picture Simple graph for articles and entities ). So if an article has one entity for 5 times, there will be 5 edges between that article and particular entity.
There are roughly 18 million articles and 40 thousand unique entities. The whole data is around 20GB(including indices on ids) and is loaded on a machine with 32 GB RAM.
I am using this graph to suggest/recommend the other entities. But my queries are taking too much time.
Use Case1: Find all entities present in the articles which have an entity from list ["A", "B"] and also having an entity "X" and an entity "Y" and an entity "Z" in the order of articles count.
Here is the cypher query I am running.
MATCH(e:Entity)-[:CONTAINS]-(a:Article)
WHERE e.EID in ["A","B"]
WITH a
MATCH (:Entity {EID:"X"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Y"})-[:CONTAINS]-(a)
WITH a
MATCH (:Entity {EID:"Z"})-[:CONTAINS]-(a)
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type ,count(distinct(a)) as articleCount
ORDER BY articleCount desc
Query Profile is here: Query Profile
This query gives me all first level entity neighbours of articles having X,Y,Z and at least one of A,B entities (I had to change the IDs in the query for content sensitivity).
I was just wondering if there is a better/fast way of doing it?
Another observation is if I keep adding filters (more match clauses like X,Y,Z) the performance is deteriorated; despite the fact that result set is getting smaller and smaller.
You have a uniqueness constraint on :Entity(EID), so at least that optimization is already in place.
The following Cypher query is simpler, and generates a simpler execution plan. Hopefully, it also reduces the number of DB hits.
MATCH (e:Entity)-[:CONTAINS]-(a)
WHERE e.EID in ['A','B'] AND ALL(x IN ['X','Y','Z'] WHERE (:Entity {EID: x})-[:CONTAINS]-(a))
WITH a
MATCH (a)-[:CONTAINS]-(e2:Entity)
RETURN e2.EID as EID, e2.Text as Text, e2.Type as Type, COUNT(DISTINCT a) as articleCount
ORDER BY articleCount DESC;

Neo4j Cypher MATCH and LIMIT, does limit work on match or on return only

Let's say i issue this query
MATCH(n:node)
RETURN n LIMIT 5
Does cypher take into account the LIMIT clause when doing the matching?
Tests on my data set would point towards the answer being no.
The above query takes 7366ms in the neo shell.
versus
MATCH (a:Address) where id(a) IN [1589346,1589347,1589348,1589349,1589350]
RETURN a
which takes 81ms
I have also picked random ids so that they are not in the cache, with the same result for time.
[UPDATED]
With your initial query, the LIMIT clause should be limiting the amount of work done by the MATCH clause. I verified this using PROFILE in neo4j versions 2.3, 3.0, and 3.1.1.
For example, in version 3.1.1, I created 20 nodes labeled node. Your query produced the following plan:
Notice that the number of DB hits by the NodeByLabelScan operation was only 6 (it is always 1 more than the LIMIT value in my tests), not 20 or 21.
This, however, does not mean that the LIMIT clause will always limit the amount of work done by the preceding MATCH clause. And, even if it did, the actual amount of work done by the MATCH clause can be much greater than implied by the LIMIT value.
For example, suppose the query were as follows (and also suppose there is no index for :node(timestamp)):
MATCH (n:node)
WHERE n.timestamp > 1234567890
RETURN n LIMIT 5
If the total number of matching nodes in the DB was < 5, then the MATCH clause would have to scan all the node nodes.
If the total number of matching nodes in the DB was >=5, then the MATCH clause would have to keep scanning until it found 5 matching node nodes. In the worst case, the 5th matching node may not be found until all the node nodes have been scanned.

Optimizing Cypher Query - Neo4j

I have the following query
MATCH (User1 )-[:VIEWED]->(page)<-[:VIEWED]- (User2 )
RETURN User1.userId,User2.userId, count(page) as cnt
Its a relatively simple query to find co-page view counts between users.
Its just too slow, and I have to terminate it after some time.
Details
User consists of about 150k Nodes
Page consists of about 180k Nodes
User -VIEWS-> Page has about 380k Relationships
User has 7 attributes, and Page has about 5 attributes.
Both User and Page are indexed on UserId and PageId respectively.
Heap Size is 512mb (tried to run on 1g too)
What would be some of the ways to optimize this query as I think the count of the nodes and relationships are not a lot.
Use Labels
Always use Node labels in your patterns.
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Don't match on duplicate pairs of users
This query will be executed for all pairs of users (that share a viewed page) twice. Each user will be mapped to User1 and then each user will also be mapped to User2. To limit this:
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE id(u1) > id(u2)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Query for a specific user
If you can bind either side of the pattern the query will be much faster. Do you need to execute this query for all pairs of users? Would it make sense to execute it relative to a single user only? For example:
MATCH (u1:User {name: "Bob"})-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE NOT u1=u2
RETURN u1.userId, u2.userId, count(p) AS cnt;
As you are trying different queries you can prepend EXPLAIN or PROFILE to the Cypher query to see the execution plan and number of data hits. More info here.

Resources