Neo4j: incorrect count in multiple MATCH query

When I execute this query:
match(u:User)-[ro:OWNS]->(p:PushDevice) where p.type='gcm'
match(com:Comment)
return count(com) as total_comments,count(ro) as device
it returns the same number for both total_comments and device, namely the total number of comments.

I feel like your query should work, though I'm more confident that this will work:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice) WHERE p.type='gcm'
WITH count(ro) AS device
MATCH (com:Comment)
RETURN count(com) as total_comments, device

Your query is generating a row for every combination of your MATCH results. If you just returned the ro and com values, this would be more clear. See this console for an example. That console has 2 comments and a single OWNS relationship, but the result shows 2 rows (both rows have the same OWNS relationship). So, your query is essentially counting the number of rows -- not what you expected.
Here is an example of a query that would work as you expected:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice {type:'gcm'})
WITH COUNT(ro) AS device
MATCH (com:Comment)
RETURN count(com) AS total_comments, device;
[EDITED]
This would also work logically, but is less performant (as it takes a cartesian product and then filters out duplicates):
MATCH (u:User)-[ro:OWNS]->(p:PushDevice { type: 'gcm' })
MATCH (com:Comment)
RETURN COUNT(DISTINCT com), COUNT(DISTINCT ro);
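To see why both counts come out equal, here is a small Python sketch that mimics the row semantics described above. The data (one OWNS relationship, two comments) mirrors the console example; the variable names are made up for illustration, and this only models the result rows, not how Neo4j executes the query internally.

```python
from itertools import product

# Hypothetical data mirroring the console example:
# one OWNS relationship and two Comment nodes.
owns_rels = ["owns_1"]
comments = ["comment_1", "comment_2"]

# Two consecutive MATCH clauses with no shared variable yield one row
# per combination of their results (a cartesian product).
rows = list(product(owns_rels, comments))

# count(x) counts the rows where x is not null, so both counts
# end up equal to the number of rows.
naive_device = sum(1 for ro, _ in rows if ro is not None)
naive_comments = sum(1 for _, com in rows if com is not None)

# count(DISTINCT x) removes the duplicates introduced by the product.
distinct_device = len({ro for ro, _ in rows})
distinct_comments = len({com for _, com in rows})

print(naive_device, naive_comments)        # 2 2
print(distinct_device, distinct_comments)  # 1 2
```

Aggregating the first MATCH before starting the second one (the WITH count(ro) AS device approach) avoids building the product in the first place, which is why it is the cheaper of the two fixes.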
Observation
The power of neo4j comes from its efficient handling of relationships. So, the most efficient queries tend to be for connected subgraphs (where all nodes are connected by relationships).
Since your query is not for a single connected subgraph, getting the answer you want is naturally going to be a bit more convoluted and can be inefficient.
If you determine that the suggested queries are too slow, you can try making 2 separate queries instead. That may also make your code easier to understand.

Related

Neo4j Cypher execution plan when query has WHERE and WITH clause

I have a Neo4j graph database that stores staffing relations and nodes. I need to write a Cypher query that finds the home and office address of a resource (or employee) along with their empId and name.
This is needed so that the staffing solution can staff resources according to their home location as well as proximity to their office.
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
WHERE employee.id = '70'
WITH employee, homeAddress
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
This query returns the desired results.
However, if I move the WHERE condition to the end, just before the RETURN clause:
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
WITH employee, homeAddress
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
WHERE employee.id = '70'
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
It again gives me the same result.
So which one is more optimized, given that the query execution plan is the same in both cases? I mean the same number of DB hits and returned records.
Now, if I remove the WITH clause:
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
WHERE employee.id = '70'
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
Then again the results are the same, and the execution plan is also the same.
Do I really need WITH in this case?
Any help would be greatly appreciated.
First, you can use PROFILE and EXPLAIN to inspect the performance of your query. That said, as long as you get the results you want in the time you want, the exact Cypher doesn't matter too much, since the behavior will change depending on the Cypher planner (version) running in the db. So as long as the query passes unit and load tests, the rest doesn't matter (assuming reasonably accurate tests).
Second, in general, less is more. Imagine you had to read your own Cypher and look up the info yourself on paper printouts. Isn't MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee {id:'70'})<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress) much easier to read when you want to tell exactly what you are looking for? The easier it is for the Cypher planner to read what you want, the more likely it is to plan the most efficient lookup strategy. Keeping your WHERE clause close to the relevant MATCH also helps the planner. So try to keep your queries as simple as possible, while still being accurate for what you want.
In your Cypher, the only part that really matters is the WITH. WITH creates a logical break in the query and a scope change for variables. As you aren't doing anything with the WITH, it's better to drop it. The only side effect it can produce in this case is tricking the planner into doing more work than necessary for the first match, only to filter it down later. If an Employee is expected to have more than one home address, then WITH employee, COLLECT(homeAddress) AS homeAddresses will reduce that match to one row per employee, making the next match cheaper. But since I'm sure both sides of the match should only yield one result, it doesn't matter what the planner does first. (In general, you use WITH to aggregate results down to fewer rows, to make the rest of the query cheaper, which shouldn't apply in this context.)
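As a rough illustration of the row-count argument, the Python sketch below contrasts a pass-through WITH employee, homeAddress with an aggregating WITH employee, COLLECT(homeAddress). The employee ids and addresses are invented for the example (including the assumption that employee '70' has two home addresses), and the code only models how many rows reach the next MATCH, not the planner itself.

```python
from collections import defaultdict

# Hypothetical rows produced by the first MATCH: (employee id, home address).
# Employee '70' is assumed to have two home addresses, purely for illustration.
first_match_rows = [
    ("70", "12 Elm St"),
    ("70", "9 Oak Ave"),
    ("71", "4 Pine Rd"),
]

# WITH employee, homeAddress: every row is passed through unchanged (a no-op).
passthrough_rows = list(first_match_rows)

# WITH employee, COLLECT(homeAddress): rows are grouped by employee,
# leaving one row per employee that carries a list of addresses.
grouped = defaultdict(list)
for employee, home in first_match_rows:
    grouped[employee].append(home)
collected_rows = list(grouped.items())

# The following MATCH is evaluated once per incoming row.
print(len(passthrough_rows))  # 3
print(len(collected_rows))    # 2
```

When each employee has exactly one home address, both variants pass along the same number of rows, which is the case where the WITH buys nothing.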
You should always put a WHERE clause as early as possible in a query. That will filter out data that the rest of the query will not have to deal with, avoiding possible unneeded work.
You should avoid writing a WITH clause that is just passing forward all the defined variables (and is not required syntactically), since it is essentially a no-op. It wastes (a little bit of) time for the planner to process, and makes the Cypher code a bit harder to understand.
This simpler version of your query should produce the same query plan:
MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee)<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
WHERE employee.id = '70'
RETURN
employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
And the following version (using the map projection syntax) is even simpler (with a similar query plan).
MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee)<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
WHERE employee.id = '70'
RETURN
employee{.empId, .name},
homeAddress{.street, .area, .city},
officeAddress{.street, .area, .city}
The results of the above query have a different structure, though:
╒═══════════════════════════╤══════════════════════════════════════╤══════════════════════════════════════╕
│"employee" │"homeAddress" │"officeAddress" │
╞═══════════════════════════╪══════════════════════════════════════╪══════════════════════════════════════╡
│{"name":"sam","empId":"70"}│{"area":1,"city":"foo","street":"123"}│{"area":2,"city":"bar","street":"345"}│
└───────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘

Neo4j more specific query slower than more generic one

I'm trying to count all values collected in one subtree of my graph. I thought that the more descriptive path from the root node I provide, the faster the query will run. Unfortunately this isn't true in my case and I can't figure out why.
Original, slow query:
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38397 total db hits in 2203 ms.
However, without matching the top-level node labeled Sandbox, the query is 10 times faster:
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
PROFILE returns 38478 total db hits in 159 ms
To be clear, in this case the result is the same, as I have just one Sandbox.
What is wrong with my first query? How should I model or query a hierarchy like that? I could save the sandbox name as a property on the Metric node, which executes faster but seems uglier to me.
Because the two queries are not identical.
(Side by side, for visual comparison:)
MATCH (s:Sandbox {name: "sandbox"})<--(root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
MATCH (root)-[:has_metric]->(n:Metric)-[:most_recent|:prev*0..]->(v:Value) return count(v)
So in the second query, Neo4j doesn't care about (root). You never use root, and root is already implied by [:has_metric], so Neo4j can just skip to finding ()-[:has_metric]->(n:Metric)-[:most_recent|prev]. In the first query, now we also have to find these Sandbox nodes! And on top of that, root has to be connected to that too! So Neo4j has to do extra work to prove that that is true. The extra column can also add more rows to the results being processed, which may add more validation checks on the rest of the query.
Long story short, the first query is slower because it is doing more validation work. In general, the results of the first query will be a subset of those of the second.

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious whether they are expressed in an optimal way. I have a feeling that when I execute such a query, Cypher first matches everything that fits the expression (which takes a lot of memory) and only then starts to count things. I would rather go through every node, check the pattern, and update the counter if necessary. That way such queries would not require a lot of memory. So how is such a query actually executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many other relationships between nodes other than LIKES relationship, this can be quite an extensive search.
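The walk-through above can be sketched in plain Python. The tiny graph and the user names are hypothetical, every node is assumed to be a User, and the loop only follows the simplified description given here (it ignores whatever optimizations Neo4j actually applies).

```python
# Hypothetical graph: (source, relationship type, target) triples between users.
relationships = [
    ("alice", "LIKES", "bob"),
    ("bob", "LIKES", "alice"),      # reciprocated through LIKES
    ("alice", "LIKES", "carol"),
    ("carol", "FOLLOWS", "alice"),  # any relationship type satisfies (k)-->(n)
    ("bob", "LIKES", "carol"),      # carol has nothing pointing back at bob
]

# Index outgoing relationships by their start node.
outgoing = {}
for src, rel_type, dst in relationships:
    outgoing.setdefault(src, []).append((rel_type, dst))


def count_unreciprocated_likes():
    """Count LIKES relationships n -> k where k has no relationship back to n."""
    count = 0
    for n, rels in outgoing.items():
        for rel_type, k in rels:
            if rel_type != "LIKES":
                continue
            # WHERE NOT (k)-->(n): test every outgoing relationship of k.
            if not any(dst == n for _, dst in outgoing.get(k, [])):
                count += 1
    return count


print(count_unreciprocated_likes())  # 1
```

The inner any(...) scan is the part that grows with the number of relationships on k, which is why densely connected nodes make this kind of query expensive. Note that the counter is updated as each match is checked, so nothing in this model requires holding all matches in memory at once.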
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.

Neo4j why picking a single node and a single edge take so long?

I am trying to test the speed of Neo4j, so I created an empty database and then populated it with 10,000 users.
Now I run the following query
MATCH (n) RETURN id(n) LIMIT 1;
Surprisingly, it takes 1069 ms!
Then I run the following query (note: I haven't created any edges)
MATCH ()-[r]-() RETURN id(r) LIMIT 1;
which takes 1153ms!
Then I run
MATCH (n) RETURN id(n) SKIP 9900 LIMIT 100
which takes 10427ms.
Is this normal? I think those operations, at least the last one, are quite frequent in an app. I am using a MacBook Air with a 1.7GHz Core i5.
What version of Neo4j are you using?
How do you measure? The Neo4j browser measures multiple roundtrips for additional data.
Also is that the first or a subsequent query?
None of those queries should be that slow. Perhaps you can share your Neo4j configuration?
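The first query (MATCH (n) RETURN id(n) LIMIT 1) should be really fast.
The second one (MATCH ()-[r]-() RETURN id(r) LIMIT 1) goes over all the nodes (or even over the cross product) in your graph trying to find a relationship, and in the end it doesn't find any.
The third one (MATCH (n) RETURN id(n) SKIP 9900 LIMIT 100) should also be really fast.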
Regarding your comment, if you know the first node, your search will be anchored and you don't have to scan all rels in the database.
MATCH (:User {name:"Han"})-[:FRIEND]->(friend)
RETURN friend

efficiency of where clause in cypher vs match

I'm trying to find 10 posts that were not LIKED by user "mike" using Cypher. Will putting a WHERE clause with a NOT relationship be more efficient than matching with an optional relationship and then checking in the WHERE clause whether that relationship is null? Specifically, I want to make sure it won't do the equivalent of a full table scan, and that this is a scalable query.
Here's what I'm using
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
WHERE not (user-[:LIKES]->posts)
RETURN posts SKIP 20 LIMIT 10;
Or can I do something where I filter on a MATCH optional relationship
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
MATCH user-[r?:LIKES]->posts
WHERE r IS NULL
RETURN posts SKIP 100 LIMIT 10;
Some quick tests on the console seem to show faster performance in the 2nd approach. Am I right to assume the 2nd query is faster? And, if so why?
I think in the first query the engine runs through all postId nodes and manually checks the condition NOT (user-[:LIKES]->posts) for each one,
whereas in the second example (assuming you use at least v1.9.02) the engine picks up only the post nodes that actually aren't connected to the user. This is just an optimization where the engine does not go through all postId nodes.
If possible, always use the MATCH clause in your queries instead of WHERE, and try to omit the asterisk in the declaration START n=node:index('name:*').
