Neo4j - Inappropriate Match counts - neo4j

I am performing the same operation in two different ways, but in the second one the counts returned by the MATCH query are wrong. Please suggest what I am missing.
First Way:
profile
WITH [1234] AS sellers_list,
[12345] AS buyers_list
MATCH (buyer:Person) WHERE buyer.person_guid IN buyers_list
MATCH (seller:Person) WHERE seller.person_guid IN sellers_list
RETURN count(buyer),size(buyers_list),count(seller),size(sellers_list)
This results:
Second Way
profile
WITH [1234] AS sellers_list,
[12345] AS buyers_list
MATCH (seller_member:Person)-[r:TEAM_MEMBER]-(seller_teammate:Person)
WHERE seller_member.person_guid IN sellers_list
WITH FILTER(x IN COLLECT(seller_teammate.person_guid) WHERE NOT(x in sellers_list)) AS sellerteam, sellers_list, buyers_list
MATCH (seller_member:Person)-[r:EMPLOYED_BY]->(b:Organization)
MATCH (b)<-[s:EMPLOYED_BY]-(org_member:Person)
WHERE seller_member.person_guid=sellers_list[0]
WITH FILTER(x IN COLLECT(org_member.person_guid) WHERE NOT(x IN sellerteam)) AS org_members,sellers_list,sellerteam,buyers_list
WITH sellers_list+sellerteam+org_members AS all_org_members,sellers_list,sellerteam,org_members,buyers_list
MATCH (buyer:Person) WHERE buyer.person_guid IN buyers_list
MATCH (seller:Person) WHERE seller.person_guid IN all_org_members
RETURN count(buyer),size(buyers_list),count(seller),size(sellers_list)
This Results in:
In the second method, I did not alter buyers_list anywhere; I only collected the seller's team members and the seller's organization members. But the count of buyers changes. Why?
Profiling the above query shows this:
Looking at the profile, the number of buyers is just 1, so why is the count returned as 45k?
And why are there 90k db hits for 45k nodes? Is there a specific reason, and how can I reduce the db hits here?

A key thing to remember is that queries in Neo4j build up rows and columns. When you perform a match between disconnected patterns, you tend to get a cartesian product against your current rows (and you can see that in your query plan). That said, a cartesian product isn't necessarily a mistake or bad. There's really no way to match to all those sellers from your list of guids without a cartesian product, and it is just a cartesian product against your single row.
If you returned all values immediately after your seller match, you would see that each row has a different seller, but all the other fields (including the buyer) are the same.
You'll want to get a count of distinct values, count(distinct buyer), which should give you your expected buyer count of 1.
As for the 90k hits, a NodeUniqueIndexSeek requires 2 db hits per lookup, and you performed a lookup on 45k values, so the math works out.
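To make the row arithmetic concrete, here is a minimal plain-Python model of what happens, not Neo4j code. The guid strings are made-up stand-ins, and the figure of 2 db hits per lookup is simply the one quoted above.

```python
from itertools import product

# Hypothetical stand-ins: 1 buyer and 45,000 sellers, identified by guid.
buyers = ["buyer-12345"]
sellers = [f"seller-{i}" for i in range(45_000)]

# Two disconnected MATCH clauses combine as a cartesian product of rows,
# so the single buyer is repeated on every seller row.
rows = list(product(buyers, sellers))

# count(buyer) counts non-null values, one per row.
count_buyer = sum(1 for buyer, _ in rows if buyer is not None)

# count(DISTINCT buyer) collapses the repeats.
count_distinct_buyer = len({buyer for buyer, _ in rows})

# 2 db hits per unique index seek, one seek per seller guid.
db_hits = 2 * len(sellers)

print(count_buyer, count_distinct_buyer, db_hits)
```

The buyer was only ever matched once; it is the row count that grew.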
EDIT
If you are still suspicious, you can try out a large unique lookup in isolation (or in as much isolation as you can, given that you have to look up 45k guids first).
MATCH (p:Person)
WITH p LIMIT 45000
WITH COLLECT(p.person_guid) as guids
// you can always take the above subquery, returning 1, to see the timing of just collecting guids
MATCH (p:Person)
WHERE p.person_guid in guids
RETURN COUNT(p) as count

Related

"Query Optimization" : Neo4j Query builds a cartesian product between disconnected patterns -

I need to build a graph of multiple nodes (more than 2) together with their relationships at the first, second, and third degree.
For that I am currently using this query:
WITH ["1258311979208519680","3294971891","1176078684270333952","117607868427845"] as ids
MATCH (n1:Target),(n2:Target) WHERE n1.id in ids and n2.id in ids and n1.id<>n2.id and n1.uid=103 and n2.uid=103
MATCH p = ((n1)-[*..3]-(n2)) RETURN p limit 30
Here the ids of 4 nodes are listed in WITH [ ], and [*..3] is used to draw the graph up to the third degree between the selected nodes.
WHAT THE ABOVE QUERY IS DOING
The query returns the mutual nodes: for example, with second degree [*..2], it returns a result whenever any 2 of the selected nodes share a mutual relation.
WHAT I WANT
1) First of all, I want to optimize the query. It takes a long time, and it produces a cartesian product, which slows it down.
2) The query above returns data if any 2 nodes have a mutual relationship. I want it to return only the mutual nodes attached to all selected nodes. That is, any node returned must be related to every selected target node.
Any suggestions for modifying and optimizing the query?
If you are looking to avoid the cartesian product issue with the given query
WITH ["1258311979208519680","3294971891","1176078684270333952","117607868427845"] as ids
MATCH (n1:Target),(n2:Target) WHERE n1.id in ids and n2.id in ids and n1.id<>n2.id and n1.uid=103 and n2.uid=103
MATCH p = ((n1)-[*..3]-(n2)) RETURN p limit 30
I suggest using this one instead:
MATCH (node1:Target) WHERE node1.id IN ["1258311979208519680","3294971891","1176078684270333952"]
MATCH (node2:Target) WHERE node2.id IN ["1258311979208519680","3294971891","1176078684270333952"]
and node1.id <> node2.id
MATCH p=(node1)-[*..2]-(node2)
RETURN p
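This avoids matching both nodes in a single comma-separated pattern. Run it with PROFILE to confirm, since the planner may still combine the two disconnected lookups with a cartesian product. Note that this version lists only three of the ids, drops the uid filter, and uses [*..2].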
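The second requirement in the question (return only nodes related to all selected targets) is a set intersection. Below is a minimal plain-Python sketch of that idea, not Cypher and not a tested solution for this graph. The adjacency data and the `common_neighbours` helper are hypothetical, and it only models direct (first-degree) connections.

```python
from functools import reduce

# Hypothetical adjacency: target id -> set of directly connected node ids.
neighbours = {
    "t1": {"a", "b", "c"},
    "t2": {"b", "c", "d"},
    "t3": {"c", "b", "e"},
}

def common_neighbours(selected, adjacency):
    """Return the nodes adjacent to every id in `selected`."""
    sets = [adjacency.get(node_id, set()) for node_id in selected]
    if not sets:
        return set()
    # Keep only nodes present in every neighbour set.
    return reduce(set.intersection, sets)

mutual = common_neighbours(["t1", "t2", "t3"], neighbours)
print(sorted(mutual))
```

A node that is adjacent to only some of the targets ("a", "d", "e" here) drops out, which is the behaviour asked for in point 2.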

Cypher - Neo4j Query Profiling

I have some questions regarding Neo4j's Query profiling.
Consider below simple Cypher query:
PROFILE
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"}),
(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
and output is:
So according to Neo4j's Documentation:
3.7.2.2. Expand Into
When both the start and end node have already been found, expand-into is used to find all connecting relationships between the two nodes.
Query.
MATCH (p:Person { name: 'me' })-[:FRIENDS_WITH]->(fof)-->(p) RETURN fof
So here in the above query (in my case), first of all, it should find both the StartNode & the EndNode before finding any relationships. But unfortunately, it's just finding the StartNode, and then going to expand all connected :HAS_CONTACT relationships, which results in not using "Expand Into" operator. Why does this work this way? There is only one :HAS_CONTACT relationship between the two nodes. There is a Unique Index constraint on :Consumer{mobileNumber}. Why does the above query expand all 7 relationships?
Another question is about the Filter operator: why does it require 12 db hits although all nodes and relationships have already been retrieved? Why does this operation need 12 db calls for just 6 rows?
Edited
This is the complete Graph I am querying:
I have also tested different versions of the same query, but the same query profile is returned:
1
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"})
MATCH (m:Consumer{mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
2
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"}), (m:Consumer{mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
3
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"})
WITH n
MATCH (n)-[r:HAS_CONTACT]->(m:Consumer{mobileNumber: "xxxxxxxxxxx"})
RETURN n,m,r;
The query you are executing and the example provided in the Neo4j documentation for Expand Into are not the same. The example query starts and ends at the same node.
If you want the planner to find both nodes first and see if there is a relationship then you could use shortestPath with a length of 1 to minimize the DB hits.
PROFILE
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"}),
(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH Path=shortestPath((n)-[r:HAS_CONTACT*1]->(m))
RETURN n,m,r;
Why does this do this?
It appears that this behaviour comes from how the query planner searches the database in response to your Cypher query. Cypher provides an interface for searching and operating on the graph (alternatives include the Java API, etc.); queries are handled by the query planner and then turned into graph operations by Neo4j's internals. It makes sense that the query planner will look for what is likely to be the most efficient way to search the graph (hence why we love neo), so just because a Cypher query is written one way, it won't necessarily search the graph the way we imagine it will.
The documentation on this seemed a little sparse (or rather, I couldn't find it), so any links or further explanations would be much appreciated.
Examining your query, I think you're trying to say this:
"Find two nodes each with a :Consumer label, n and m, with contact numbers x and y respectively, using the mobileNumber index. If you find them, try and find a -[:HAS_CONTACT]-> relationship from n to m. If you find the relationship, return both nodes and the relationship, else return nothing."
Running this query in this way requires a cartesian product to be created (i.e., a little table of all combinations of n and m - in this case only one row - but for other queries potentially many more), and then relationships to be searched for between each of these rows.
Rather than doing that, since a MATCH clause must be met in order to continue with the query, neo knows that the two nodes n and m must be connected via the -[:HAS_CONTACT]-> relationship if the query is to return anything. Thus, the most efficient way to run the query (and avoid the cartesian product) is as below, which is what your query can be simplified to.
"Find a node n with the :Consumer label, and value x for the index mobileNumber, which is connected via a -[:HAS_CONTACT]-> relationship to a node m with the :Consumer label, and value y for its property mobileNumber. Return both nodes and the relationship, else return nothing."
So, rather than perform two index searches, a cartesian product and a set of expand into operations, neo performs only one index search, an expand all, and a filter.
You can see the result of this simplification by the query planner through the presence of AUTOSTRING parameters in your query profile.
How to Change Query to Implement Search as Desired
If you want to change the query so that it must use an expand into relationship, make the requirement for the relationship optional, or use explicitly iterative execution. Both these queries below will produce the initially expected query profiles.
Optional example:
PROFILE
MATCH (n:Consumer{mobileNumber: "xxx"})
MATCH (m:Consumer{mobileNumber: "yyy"})
WITH n,m
OPTIONAL MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
Iterative example:
PROFILE
MATCH (n1:Consumer{mobileNumber: "xxx"})
MATCH (m:Consumer{mobileNumber: "yyy"})
UNWIND COLLECT(n1) AS n
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;

How to get the total count of users excluding location="Hyderabad" and including deviceBrand="lenovo"?

I am using neo4j version 3.0.3. I have executed the query below. It returns the count of users who have the HAS_VISITED_LOCATION relation, but I also want the total to include users who don't have the HAS_VISITED_LOCATION relation.
MATCH (c:Consumer)-[:HAS_VISITED_LOCATION]-(l:Location)
WHERE NOT l.AreaName="hyderabad"
MATCH(c)-[:HAS_DEVICE_BRAND]-(d:DeviceBrand{BrandName:"lenovo"})
RETURN count(c)
So you're asking for the count of all consumers who have the lenovo device brand and who have not visited hyderabad.
This query should do that:
MATCH (l:Location {AreaName:'hyderabad'})
MATCH (c:Consumer)-[:HAS_DEVICE_BRAND]->(:DeviceBrand{BrandName:"lenovo"})
WHERE NOT (c)-[:HAS_VISITED_LOCATION]->(l)
RETURN COUNT(DISTINCT c)
EDIT - New (but related) question on how to get consumers who have not visited hyderabad and who don't have the lenovo brand.
This new question is trickier in that it's matching on the absence of relationships.
The straightforward approach is to simply match on consumers where the consumer has not visited hyderabad and doesn't have the lenovo device brand:
MATCH (c:Consumer)
WHERE NOT (c)-[:HAS_VISITED_LOCATION]->(:Location {AreaName:'hyderabad'})
AND NOT (c)-[:HAS_DEVICE_BRAND]->(:DeviceBrand{BrandName:"lenovo"})
RETURN COUNT(c) as count
While this is correct, it may not be the most efficient query.
If we look at the logical representation of what you want, we might see an alternate approach:
NOT (visited hyderabad) AND NOT (has lenovo)
If we take the negation of your requirement:
NOT (NOT (visited hyderabad) AND NOT (has lenovo)) = (visited hyderabad) OR (has lenovo)
So an alternate query can be to find the count of the negation of what you want (the count of consumers who have visited hyderabad OR who have lenovo), and subtract it from the total consumer count to get the actual count you want.
You can try this query and see if it performs better than the straightforward approach:
// first get the total count of consumers, should be very fast
MATCH (c:Consumer)
WITH COUNT(c) as totalCount
MATCH (lenovo:DeviceBrand{BrandName:'lenovo'}), (hyderabad:Location{AreaName:'hyderabad'})
// union lenovo and hyderabad into one column through collecting and combining and unwinding
// (this is a workaround since Cypher can't do post-union processing)
WITH totalCount, COLLECT(lenovo) + COLLECT(hyderabad) as excludeNodes
UNWIND excludeNodes as excludeNode
// get all consumers attached to these nodes
MATCH (excludeNode)<-[:HAS_DEVICE_BRAND|:HAS_VISITED_LOCATION]-(c:Consumer)
WITH totalCount, COUNT(DISTINCT c) as excludeCount
RETURN totalCount - excludeCount as count
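The equivalence of the two approaches can be checked on a toy example. This is a plain-Python model with made-up consumer ids, not Neo4j code; the set union plays the role of COUNT(DISTINCT c), so a consumer who both visited hyderabad and has lenovo is only subtracted once.

```python
# Hypothetical data: consumer ids only.
all_consumers = {"c1", "c2", "c3", "c4", "c5"}
visited_hyderabad = {"c1", "c2"}
has_lenovo = {"c2", "c3"}

# Straightforward approach: NOT visited AND NOT lenovo, checked per consumer.
direct = {
    c for c in all_consumers
    if c not in visited_hyderabad and c not in has_lenovo
}

# Negated approach: count (visited OR lenovo) once each, subtract from the total.
excluded = visited_hyderabad | has_lenovo
by_subtraction = len(all_consumers) - len(excluded)

print(len(direct), by_subtraction)
```

Both counts agree because of De Morgan's law. Which query is faster on real data still has to be measured with PROFILE.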

Neo4j/Cypher matching first n nodes in the traversal branch

I have a graph: (:Sector)<-[:BELONGS_TO]-(:Company)-[:PRODUCE]->(:Product).
I'm looking for a query that does the following.
Start with (:Sector). Then match the first 50 companies in that sector, and for each company match the first 10 products.
The first limit is simple. But what about limiting the products?
Is it possible with Cypher?
UPDATE
As @cybersam suggested below, this query returns valid results:
MATCH (s:Sector)<-[:BELONGS_TO]-(c:Company)
WITH c
LIMIT 50
MATCH (c)-[:PRODUCE]->(p:Product)
WITH c, (COLLECT(p))[0..10] AS products
RETURN c, products
However, this solution doesn't scale, because it still traverses all products of each company. The slice is applied only after each company's products have been collected, so query performance will degrade as the number of products grows.
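The difference between slicing after collecting and stopping early can be modelled in plain Python. This is only an analogy, not Neo4j code: `products_of` is a hypothetical lazy traversal that counts how many products it touches, and the figure of 1000 products is made up.

```python
from itertools import islice

def products_of(company, touched):
    """Hypothetical traversal: lazily yields a company's products,
    counting each one visited in touched[0]."""
    for i in range(1000):
        touched[0] += 1
        yield f"{company}-product-{i}"

# Collect everything, then slice: every product is visited.
touched_collect = [0]
collected = list(products_of("acme", touched_collect))[:10]

# Stop after 10: the traversal is abandoned early.
touched_limit = [0]
limited = list(islice(products_of("acme", touched_limit), 10))

print(touched_collect[0], touched_limit[0], collected == limited)
```

Both variants return the same 10 products, but the first one pays for all of them. A per-row LIMIT is the Cypher-side equivalent of the second variant.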
Each returned row of this query will contain: a sector, one of its companies (at most 50 per sector), and a collection of up to 10 products for that company:
MATCH (s:Sector)<-[:BELONGS_TO]-(c:Company)
WITH s, (COLLECT(c))[0..50] AS companies
UNWIND companies AS company
MATCH (company)-[:PRODUCE]->(p:Product)
RETURN s, company, (COLLECT(p))[0..10] AS products;
Updating with some solutions using APOC Procedures.
This Neo4j knowledge base article on limiting results per row describes a few different ways to do this.
One way is to use apoc.cypher.run() to execute a limited subquery per row. Applied to the query in question, this would work:
MATCH (s:Sector)<-[:BELONGS_TO]-(c:Company)
WITH c
LIMIT 50
CALL apoc.cypher.run('MATCH (c)-[:PRODUCE]->(p:Product) WITH p LIMIT 10 RETURN collect(p) as products', {c:c}) YIELD value
RETURN c, value.products AS products
The other alternative mentioned is using APOC path expander procedures, providing the label on a termination filter and a limit:
MATCH (s:Sector)<-[:BELONGS_TO]-(c:Company)
WITH c
LIMIT 50
CALL apoc.path.subgraphNodes(c, {maxLevel:1, relationshipFilter:'PRODUCE>', labelFilter:'/Product', limit:10}) YIELD node
RETURN c, collect(node) AS products

Optimizing Cypher Query - Neo4j

I have the following query
MATCH (User1 )-[:VIEWED]->(page)<-[:VIEWED]- (User2 )
RETURN User1.userId,User2.userId, count(page) as cnt
It's a relatively simple query to find co-page view counts between users.
It's just too slow, and I have to terminate it after some time.
Details
User consists of about 150k Nodes
Page consists of about 180k Nodes
User -VIEWED-> Page has about 380k Relationships
User has 7 attributes, and Page has about 5 attributes.
Both User and Page are indexed on UserId and PageId respectively.
Heap Size is 512mb (tried to run on 1g too)
What are some ways to optimize this query? I don't think the number of nodes and relationships is very large.
Use Labels
Always use Node labels in your patterns.
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Don't match on duplicate pairs of users
This query will be executed for all pairs of users (that share a viewed page) twice. Each user will be mapped to User1 and then each user will also be mapped to User2. To limit this:
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE id(u1) > id(u2)
RETURN u1.userId, u2.userId, count(p) AS cnt;
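A small plain-Python model shows the effect of the id(u1) > id(u2) filter. The users, pages, and internal ids below are made up for illustration; this is not Neo4j code.

```python
from collections import Counter
from itertools import permutations

# Hypothetical data: user -> set of viewed pages, and user -> internal id.
views = {"u1": {"p1", "p2"}, "u2": {"p1", "p2"}, "u3": {"p2"}}
internal_id = {"u1": 1, "u2": 2, "u3": 3}

# Without the filter, every pair is matched in both directions.
both_directions = Counter()
for a, b in permutations(views, 2):
    shared = len(views[a] & views[b])
    if shared:
        both_directions[(a, b)] = shared

# With the filter, only the ordering where the first id is larger survives.
one_direction = {
    pair: n for pair, n in both_directions.items()
    if internal_id[pair[0]] > internal_id[pair[1]]
}

print(len(both_directions), len(one_direction))
```

The filtered result has half as many rows, and each pair keeps the same shared-page count.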
Query for a specific user
If you can bind either side of the pattern the query will be much faster. Do you need to execute this query for all pairs of users? Would it make sense to execute it relative to a single user only? For example:
MATCH (u1:User {name: "Bob"})-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE NOT u1=u2
RETURN u1.userId, u2.userId, count(p) AS cnt;
As you try different queries, you can prepend EXPLAIN or PROFILE to the Cypher query to see the execution plan and the number of db hits. More info here.

Resources