I have the following query
MATCH (User1 )-[:VIEWED]->(page)<-[:VIEWED]- (User2 )
RETURN User1.userId,User2.userId, count(page) as cnt
Its a relatively simple query to find co-page view counts between users.
Its just too slow, and I have to terminate it after some time.
Details
User consists of about 150k Nodes
Page consists of about 180k Nodes
User -VIEWS-> Page has about 380k Relationships
User has 7 attributes, and Page has about 5 attributes.
Both User and Page are indexed on UserId and PageId respectively.
Heap Size is 512mb (tried to run on 1g too)
What would be some of the ways to optimize this query as I think the count of the nodes and relationships are not a lot.
Use Labels
Always use Node labels in your patterns.
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Don't match on duplicate pairs of users
This query will be executed for all pairs of users (that share a viewed page) twice. Each user will be mapped to User1 and then each user will also be mapped to User2. To limit this:
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE id(u1) > id(u2)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Query for a specific user
If you can bind either side of the pattern the query will be much faster. Do you need to execute this query for all pairs of users? Would it make sense to execute it relative to a single user only? For example:
MATCH (u1:User {name: "Bob"})-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE NOT u1=u2
RETURN u1.userId, u2.userId, count(p) AS cnt;
As you are trying different queries you can prepend EXPLAIN or PROFILE to the Cypher query to see the execution plan and number of data hits. More info here.
Related
I've recently started learning Cypher. I have a database containing four users and films. Users can have can have [:WATCHED] / [:WATCHLISTED] / [:FAVORITED] relationships with films.
I want to get the films which all four users have watched. Here's a working query I've written:
match (u1)-[:WATCHED]->(f)<-[:WATCHED]-(u2),
(u3)-[:WATCHED]->(f)<-[:WATCHED]-(u4)
return u1, u2, u3, u4, f
I wanted to know if there was a more efficient way to do this. Or any another way, which I can't of. I'm asking this out of curiosity.
You can do this for example :
MATCH (f:Film)
WHERE size((f)<-[:WATCHED]-()) = 4
RETURN f, [(f)<-[:WATCHED]-(u:User) | u] as watchers
Here I assume that there is only one relationship of type WATCHED between a user and a movie, even if the user has watched the movie many times.
To avoid having to hardcode a User node count, this query efficiently gets the count using the DB's internal statistics:
MATCH (u:User)
WITH COUNT(u) AS userCount
MATCH (f:Film)
WHERE SIZE((f)<-[:WATCHED]-()) = userCount
RETURN f;
This query does not return the users that watched the film, since that is literally all the users in the DB, and with a sufficiently large number of them your query can run out of memory -- or it can take a very long time for a client (like the neo4j Browser) to receive and process the results. I think the main point of a query like this is to find the films, not the users. If you really want to get all the users, a separate query will do: MATCH (u:Users) RETURN u.
You can use all:
https://neo4j.com/docs/developer-manual/current/cypher/functions/predicate/
This checks if a predicate is true for all elements.
I have some questions regarding Neo4j's Query profiling.
Consider below simple Cypher query:
PROFILE
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"}),
(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
and output is:
So according to Neo4j's Documentation:
3.7.2.2. Expand Into
When both the start and end node have already been found, expand-into
is used to find all connecting relationships between the two nodes.
Query.
MATCH (p:Person { name: 'me' })-[:FRIENDS_WITH]->(fof)-->(p) RETURN
> fof
So here in the above query (in my case), first of all, it should find both the StartNode & the EndNode before finding any relationships. But unfortunately, it's just finding the StartNode, and then going to expand all connected :HAS_CONTACT relationships, which results in not using "Expand Into" operator. Why does this work this way? There is only one :HAS_CONTACT relationship between the two nodes. There is a Unique Index constraint on :Consumer{mobileNumber}. Why does the above query expand all 7 relationships?
Another question is about the Filter operator: why does it requires 12 db hits although all nodes/ relationships are already retrieved? Why does this operation require 12 db calls for just 6 rows?
Edited
This is the complete Graph I am querying:
Also I have tested different versions of same above query, but the same Query Profile result is returned:
1
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"})
MATCH (m:Consumer{mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
2
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"}), (m:Consumer{mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
3
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"})
WITH n
MATCH (n)-[r:HAS_CONTACT]->(m:Consumer{mobileNumber: "xxxxxxxxxxx"})
RETURN n,m,r;
The query you are executing and the example provided in the Neo4j documentation for Expand Into are not the same. The example query starts and ends at the same node.
If you want the planner to find both nodes first and see if there is a relationship then you could use shortestPath with a length of 1 to minimize the DB hits.
PROFILE
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"}),
(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH Path=shortestPath((n)-[r:HAS_CONTACT*1]->(m))
RETURN n,m,r;
Why does this do this?
It appears that this behaviour relates to how the query planner performs a database search in response to your cypher query. Cypher provides an interface to search and perform operations in the graph (alternatives include the Java API, etc.), queries are handled by the query planner and then turned into graph operations by neo4j's internals. It make sense that the query planner will find what is likely to be the most efficient way to search the graph (hence why we love neo), and so just because a cypher query is written one way, it won't necessarily search the graph in the way we imagine it will in our head.
The documentation on this seemed a little sparse (or, rather I couldn't find it properly), any links or further explanations would be much appreciated.
Examining your query, I think you're trying to say this:
"Find two nodes each with a :Consumer label, n and m, with contact numbers x and y respectively, using the mobileNumber index. If you find them, try and find a -[:HAS_CONTACT]-> relationship from n to m. If you find the relationship, return both nodes and the relationship, else return nothing."
Running this query in this way requires a cartesian product to be created (i.e., a little table of all combinations of n and m - in this case only one row - but for other queries potentially many more), and then relationships to be searched for between each of these rows.
Rather than doing that, since a MATCH clause must be met in order to continue with the query, neo knows that the two nodes n and m must be connected via the -[:HAS_CONTACT]-> relationship if the query is to return anything. Thus, the most efficient way to run the query (and avoid the cartesian product) is as below, which is what your query can be simplified to.
"Find a node n with the :Consumer label, and value x for the index mobileNumber, which is connected via a -[:HAS_CONTACT]-> relationshop to a node m with the :Consumer label, and value y for its proprerty mobileNumber. Return both nodes and the relationship, else return nothing."
So, rather than perform two index searches, a cartesian product and a set of expand into operations, neo performs only one index search, an expand all, and a filter.
You can see the result of this simplification by the query planner through the presence of AUTOSTRING parameters in your query profile.
How to Change Query to Implement Search as Desired
If you want to change the query so that it must use an expand into relationship, make the requirement for the relationship optional, or use explicitly iterative execution. Both these queries below will produce the initially expected query profiles.
Optional example:
PROFILE
MATCH (n:Consumer{mobileNumber: "xxx"})
MATCH (m:Consumer{mobileNumber: "yyy"})
WITH n,m
OPTIONAL MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
Iterative example:
PROFILE
MATCH (n1:Consumer{mobileNumber: "xxx"})
MATCH (m:Consumer{mobileNumber: "yyy"})
UNWIND COLLECT(n1) AS n
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
I am doing some x operation in two different ways. But in the second method, the count of Match query is inappropriate which is no way acceptable. Please suggest where I am missing.
First Way:
profile
WITH [1234] AS sellers_list,
[12345] AS buyers_list
MATCH (buyer:Person) WHERE buyer.person_guid IN buyers_list
MATCH (seller:Person) WHERE seller.person_guid IN sellers_list
RETURN count(buyer),size(buyers_list),count(seller),size(sellers_list)
This results:
Second Way
profile
WITH [1234] AS sellers_list,
[12345] AS buyers_list
MATCH (seller_member:Person)-[r:TEAM_MEMBER]-(seller_teammate:Person)
WHERE seller_member.person_guid IN sellers_list
WITH FILTER(x IN COLLECT(seller_teammate.person_guid) WHERE NOT(x in sellers_list)) AS sellerteam, sellers_list, buyers_list
MATCH (seller_member:Person)-[r:EMPLOYED_BY]->(b:Organization)
MATCH (b)<-[s:EMPLOYED_BY]-(org_member:Person)
WHERE seller_member.person_guid=sellers[0]
WITH FILTER(x IN COLLECT(org_member.person_guid) WHERE NOT(x IN sellerteam)) AS org_members,sellers_list,sellerteam,buyers_list
WITH sellers+sellerteam+org_members AS all_org_members,sellers_list,sellerteam,org_members,buyers_list
MATCH (buyer:Person) WHERE buyer.person_guid IN buyers_list
MATCH (seller:Person) WHERE seller.person_guid IN all_org_members
RETURN count(buyer),size(buyers_list),count(seller),size(sellers_list)
This Results in:
In the second method, I did not alter the buyers_list anywhere, I was just count the seller team members and seller organization members that's it. But the count of buyers is changing. Why?
Profiling the above query shows this:
Looking at this image, the no of buyers is just 1, but why the count is returning 45k.
And, why the 90k db hits for 45k nodes? Any specific reason and how can I reduce the db hits here.
A key thing to remember is that queries in Neo4j build up rows and columns. When you perform a match between disconnected patterns, you tend to get a cartesian product against your current rows (and you can see that in your query plan). That said, a cartesian product isn't necessarily a mistake or bad. There's really no way to match to all those sellers from your list of guids without a cartesian product, and it is just a cartesian product against your single row.
If you returned all values immediately after your seller match, you would see that each row has a different seller, but all the other fields (including the buyer) are the same.
You'll want to get a count of distinct values, count(distinct buyer), which should give you your expected buyer count of 1.
As for the 90k hits, a NodeUniqueIndexSeek requires 2 db hits per lookup, and you performed a lookup on 45k values, so the math works out.
EDIT
If you are still suspicious, you can try out a large unique lookup in isolation (or as much isolation as you can while having to lookup 45k guids first).
MATCH (p:Person)
WITH p LIMIT 45000
WITH COLLECT(p.person_guid) as guids
// you can always take the above subquery, returning 1, to see the timing of just collecting guids
MATCH (p:Person)
WHERE p.person_guid in guids
RETURN COUNT(p) as count
Please check my Cypher below, I am getting result with the query below() with low records but as records increases it take a long time about 1601152 ms:
i found suggestion to add USING INDEX and and I apply the USING INDEX in query.
PROFILE MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person)-[:WATCHED]->(ma:Movie)-[:HAS_TAG]->(t:Tag)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
USING INDEX a:App(app_id) WHERE p.person_id= '1'
AND NOT (p:Person)-[:WATCHED]-(mb)
RETURN DISTINCT(mb.movie_id) , mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(DISTINCT(t.tag_id)) as Tag, count(DISTINCT(t.tag_id)) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50
Can you help me out what can I do?
I am trying to find 100 movies for recommendation on basis of tags, as 100 movies which I do not watch and match with tags of Movies I watched.
The following query may work better for you [assuming you have indexes on both :App(app_id) and :Person(person_id)]. By the way, I presumed that in your query the identifier ma should have been m (or vice versa).
MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person {person_id: '1'})-[:WATCHED]->(m)
WITH a, p, COLLECT(m) AS movies
UNWIND movies AS movie
MATCH (movie)-[:HAS_TAG]->(t)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
WHERE NOT mb IN movies
WITH DISTINCT mb, t
RETURN mb.movie_id, mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(t.tag_id) as Tag, COUNT(t.tag_id) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50;
If you PROFILE this query, you should see that it performs NodeIndexSeek operations (instead of the much slower NodeByLabelScan) to quickly execute the first MATCH. The query also collects all the movies watched by the specified person and uses that collection later to speed up the WHERE clause (which no longer needs hit the DB). In addition, the query removed some labels from some of the node patterns (where doing so seemed likely to be unambiguous) to speed up processing further.
I have a Neo4j graph (ver. 2.2.2) with large number of relationships. For examaple: 1 node "Group", 300000 nodes "Data", 300000 relationships from "Group" to all existing nodes "Data". I need to check if there is a relationship between set of Data nodes and specific Group node (for example for 200 nodes). But Cypher query I used is very slow. I tried many modifications of this cypher but with no result.
Cypher to create graph:
FOREACH (r IN range(1,300000) | CREATE (:Data {id:r}));
CREATE (:Group);
MATCH (g:Group),(d:Data) create (g)-[:READ]->(d);
Query 1: COST. 600003 total db hits in 730 ms.
Acceptable but I asked only for 1 node.
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000] AND id(g)=300000 RETURN id(d);
Query 2: COST. 600003 total db hits in 25793 ms.
Not acceptable.
You need to replace "..." with real numbers of nodes from 10000 to 10199
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] AND id(g)=300000 RETURN id(d);
Query 3: COST. 1000 total db hits in 309 ms.
This is only one solution I found to make query acceptable. I returned all ids of nodes "Group" and manualy filter result in my code to return only relationships to node with id 300000
You need to replace "..." with real numbers of nodes from 10000 to 10199
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] RETURN id(d), id(g);
Question 1: Total DB hits in query 1 is surprising but I accept that physical model of neoj defines how this query is executed - it needs to look into every existing relation from node "Group". I accept that. But why is so big difference in execution time between query 1 and query 2 if number of db hits is the same (and exucution plan is the same)? I'm only returning id of node, not large set of properties.
Question 2: Is a query 3 the only one solution to optimize this query?
Apparently there is an issue with Cypher in 2.2.x with the seekById.
You can prefix your query with PLANNER RULE in order to make use of the previous Cypher planner, but you'll have to split your pattern in two for making it really fast, tested e.g. :
PLANNER RULE
MATCH (d:Data) WHERE id(d) IN [30]
MATCH (g:Group) WHERE id(g) = 300992
MATCH (d)<-[:READ]-(g)
RETURN id(d)