Getting incorrect values with COUNT() in OPTIONAL MATCH query - neo4j

Ok so here is my data model:
(User)-[:ASSIGNED_TO]->(Account)
(Location)-[:BELONGS_TO]->(Account)
(User)-[:ASSIGNED_TO]->(Location)
In my database there is 1 account, 1 location and 16 users. Each user is :ASSIGNED_TO the account and also :ASSIGNED_TO the location. The location :BELONGS_TO the account.
I'm trying to select a specific account by id and also return the number of users and locations for that account. Here is my query:
MATCH (account:Account)
WHERE account.id = '123456'
WITH account
OPTIONAL MATCH (location:Location)-[:BELONGS_TO]->(account)
OPTIONAL MATCH (user:User)-[:ASSIGNED_TO]->(account)
RETURN account, count(location) as locationCount, count(user) as userCount
The result is the account, a userCount = 16 (correct) and a locationCount = 16 (incorrect; should be 1). If I add distinct to the location count, count(distinct location), I get the correct result (1) and if I remove the OPTIONAL MATCH for the users, I also get a location count of 1. I know it has something to do with the users having a relationship to the account and the location but I'm just trying to understand why the query without distinct doesn't work. Also, is there a better way to write this?

It is indeed a bit tricky. This is the query rewritten to show the pattern you are looking for:
MATCH (account:Account)
WHERE account.id = '123456'
MATCH (location:Location)-[:BELONGS_TO]->(account)<-[:ASSIGNED_TO]-(user:User)
RETURN account, count(location), count(user)
There is one account in the middle, but you don't know how many nodes sit on each side. The result set contains one row for every match of the whole pattern, that is, one row per (location, user) combination. Here that happens to be 1 × 16 = 16 rows, but with more locations there would have been more. So neither count is correct in general (you just get lucky for the users).
MATCH (account:Account)
WHERE account.id = '123456'
MATCH (location:Location)-[:BELONGS_TO]->(account)<-[:ASSIGNED_TO]-(user:User)
RETURN account, count(DISTINCT location), count(DISTINCT user)
DISTINCT solves the problem. Aggregated by account (there is only one, so no real grouping happens), the result set contains 16 rows and therefore 16 location values, all of them the same node. DISTINCT makes sure you count only the unique ones. The same applies to the users too!
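To see why both plain counts inflate as soon as there is more than one node on each side, here is a minimal sketch in plain Python. It does not talk to Neo4j; the two locations and three users are made-up toy data, and the list comprehension only imitates the rows a pattern match would produce.

```python
from itertools import product

# Toy data (hypothetical): one account with 2 locations and 3 users.
locations = ["loc1", "loc2"]
users = ["u1", "u2", "u3"]

# The pattern (location)->(account)<-(user) yields one row per combination.
rows = [{"location": loc, "user": usr} for loc, usr in product(locations, users)]

# count(x) counts rows where x is not null; count(DISTINCT x) counts unique values.
location_count = sum(1 for r in rows if r["location"] is not None)
user_count = sum(1 for r in rows if r["user"] is not None)
distinct_locations = len({r["location"] for r in rows})
distinct_users = len({r["user"] for r in rows})

print(location_count, user_count)          # 6 6
print(distinct_locations, distinct_users)  # 2 3
```

With one location and 16 users the same arithmetic gives 16 rows, which is why only the user count looked right in the question.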
Take a look at this query to see the difference:
MATCH (account:Account)
WHERE account.id = '123456'
MATCH (location:Location)-[:BELONGS_TO]->(account)
RETURN account.id as id, "location count" as type, count(location) as ct
UNION
MATCH (account:Account)
WHERE account.id = '123456'
MATCH (account)<-[:ASSIGNED_TO]-(user:User)
RETURN account.id as id, "user count" as type, count(user) as ct
Hope this helps.
Regards,
Tom

If you view your result as rows instead of as a graph, you can see that there are actually 16 rows of data. Each row contains the location, so count(location) returns the number of rows in which location is not null, not the number of different locations.
I prefer using DISTINCT to remove the duplicates. We have a service in production that uses DISTINCT in a similar scenario.

I was facing a similar problem and it helped to think of it from an RDBMS perspective.
Consider a table of Users like so (I'll use 4 for my example):
Users
-----
u1
u2
u3
u4
And consider 2 tables of Locations and Accounts each (with one record each, as in your case):
Locations
---------
loc1
Accounts
--------
acc1
Now, when Neo4j evaluates a query like MATCH (location:Location)-[:BELONGS_TO]->(account)<-[:ASSIGNED_TO]-(user:User), you can think of it as starting from the User nodes and the Location nodes and following relationships inwards to the Account nodes, where it then performs a join. (The actual execution plan may differ, but the resulting rows are the same.) Broken into intermediate queries, it would look like: MATCH (location:Location)-[:BELONGS_TO]->(account) and MATCH (account)<-[:ASSIGNED_TO]-(user:User). Evaluating those 2 queries would give us something like the following tables:
Location-Account
----------------
loc1 | acc1
User-Account
------------
u1 | acc1
u2 | acc1
u3 | acc1
u4 | acc1
Finally, Neo4j performs a join on the intermediate results, to return something like the following combined table:
User-Account-Location
---------------------
u1 | acc1 | loc1
u2 | acc1 | loc1
u3 | acc1 | loc1
u4 | acc1 | loc1
count(location) as per this table would be 4, while count(DISTINCT(location)) would be 1 :)
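The join described above can be sketched in a few lines of Python. This is only an illustration of the relational analogy, not how Neo4j is implemented; the helper name join_on_account is made up for this example.

```python
# Intermediate "tables" from the example above, as (node, account) pairs.
location_account = [("loc1", "acc1")]
user_account = [("u1", "acc1"), ("u2", "acc1"), ("u3", "acc1"), ("u4", "acc1")]

def join_on_account(user_rows, location_rows):
    """Inner join of the two row sets on the shared account column."""
    by_account = {}
    for loc, acc in location_rows:
        by_account.setdefault(acc, []).append(loc)
    return [(usr, acc, loc)
            for usr, acc in user_rows
            for loc in by_account.get(acc, [])]

combined = join_on_account(user_account, location_account)
count_location = len(combined)                                  # count(location)
count_distinct_location = len({loc for _, _, loc in combined})  # count(DISTINCT location)
print(count_location, count_distinct_location)  # 4 1
```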

Related

Why these two Cypher queries return different result?

I'm trying to learn Cypher and I have the data of a trust network. I wanted to query the people who trust the "15 most trusted people", so I wrote this query:
QUERY1:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})
return w1.Id as user1, w2.Id as user2
After that I wanted to change the last 2 lines of the query to this:
QUERY2:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})-[:TRUST]->(w3:USER)
return w1.Id as user1,w2.Id as user2, w3.Id as user3
After analyzing the result, I guessed that something was wrong.
So I hard-coded the id to a specific value, for example 575, and then count(p) is equal to 1937520. BUT if I run the last line of the query with the hard-coded Id as a separate query:
QUERY3:
MATCH r=(u1:USER)-[:TRUST]->(u2:USER {Id: "575"})-[:TRUST]->(u3:USER)
return count(r)
the count(r) is equal to 129168!
I checked that user "575" trusts 207 people and is trusted by 624 people, so the QUERY3 result seems correct: 207*624=129168. My question is: why the difference?
I can't understand what is wrong with QUERY2. My second question is: does it mean that the QUERY1 result is wrong too?
EDIT1:
Thanks for the answers, but I still had a problem with this, so I checked another scenario and got the following result:
If I write a query like this:
QUERY4:
MATCH (n) WITH n limit 15 return "1"
I'll get 15 "1"s printed in the output, which means the last part of QUERY2 executes 15 times whether I hard-code the Id or not, as if it were in a for loop. So the problem here was that I thought WITH X LIMIT N doSomeThing would execute like a foreach(x : X) loop if I use x, and would not if I don't use x. Stupid assumption...
This query might do what you intended.
MATCH (:USER)-[r:TRUST]->(u2:USER)
WITH u2, COUNT(r) AS score
ORDER BY score DESC
LIMIT 15
MATCH (w1:USER)-[:TRUST]->(u2)-[:TRUST]->(w3:USER)
RETURN w1.Id AS user1, u2.Id AS user2, w3.Id AS user3;
It first finds the 15 most-trusted users, then finds all the 2-level trust paths that those users are in the middle of, and finally returns the ids of the users in those paths.
Also, the second MATCH reuses the u2 nodes already found by the first MATCH, to speed up the processing of the second MATCH.
In QUERY3, you are matching u2 to a single user (user 575). QUERY3 is correct.
However, in QUERY2 the MATCH (line 1) returns a "row" for each u1 and u2 that, well, matches that pattern, and the WITH ... LIMIT 15 (lines 2-4) passes 15 rows on to the final MATCH. That final MATCH is evaluated once for every incoming row, so with the Id hard-coded to 575 you get the same 129168 paths 15 times over. That's what gives 1937520 results, which is exactly 15 * 129168.
Note that the WITH does aggregate: u2.Id is the only non-aggregated expression, so Cypher uses it as the grouping key, and COUNT(u2) is the number of matched rows for that id, that is, the number of users who trust that user (count(u1) would give the same number). Either way, the WITH passes on 15 rows because of the LIMIT 15 (line 4). For the hard-coded test, LIMIT 1 would do the trick. If you only need 15 distinct ids and not the score, we can also do this:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
WITH DISTINCT u2.Id AS id
LIMIT 15
And then the rest of QUERY2 (or QUERY1, for that matter). I eliminated the score variable, but if it's meant to be count(u1), it can be re-added with no problem.
I'll just break down Query 2 and the rest should make sense.
QUERY2:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})-[:TRUST]->(w3:USER)
return w1.Id as user1,w2.Id as user2, w3.Id as user3
Starting with
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
This matches every "u1 trusts u2" pair and then aggregates. Because u2.Id is the only non-aggregated expression in the WITH, it is the grouping key, so COUNT(u2) is the number of matches for each id rather than one total repeated on every row. After the ORDER BY and LIMIT 15 you have 15 rows, one per id.
So that just leaves
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})-[:TRUST]->(w3:USER)
So that matches each path p where a user w1 trusts user w2 (once for each id row from the first part), who in turn trusts a user w3.
The first part can be written more clearly by counting the incoming TRUST relationships and by keeping the u2 node itself instead of only its id:
MATCH (u1:USER)-[trusts:TRUST]->(u2:USER)
with u2, COUNT(trusts) AS score
order by score desc
limit 15
So now you have the 15 most trusted users, and you can verify this with RETURN u2.Id, score. To get the people who trust these people you would then just need to ask something like...
MATCH (u3:USER)-[:TRUST]->(u2)
and u3 will then be all users who trust someone from top 15 trusted people (u2).
As an additional note, if you are using the Neo4j web browser, try prepending the PROFILE keyword to your Cypher for some insight into what the query actually does.
Edit 1:
Now to explain what QUERY4, MATCH (n) WITH n limit 15 return "1", does. As I am sure you guessed, MATCH (n) WITH n limit 15 matches all nodes but limits the result to the first 15. In the RETURN part, you are saying "For each row, return the constant '1'", which gives you 15 rows that all hold the same value. This is what the DISTINCT keyword is for. Using RETURN DISTINCT "1" says "For each row, return the constant '1', but filter the result set to only have distinct rows", i.e. no two rows will have the same values. The non-distinct result is useful if you know there will be some duplicate rows but you want to see them anyway (maybe for a weight reference, or knowing that they are from 2 separate fields).
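As a rough illustration of the row-by-row behaviour, here is the same idea in plain Python. The list of 100 node names is a made-up stand-in for whatever MATCH (n) would return.

```python
# Stand-in for the nodes returned by MATCH (n); the values are hypothetical.
nodes = [f"node{i}" for i in range(100)]

limited = nodes[:15]                        # WITH n LIMIT 15
returned = ["1" for _ in limited]           # RETURN "1": one output row per input row
returned_distinct = sorted(set(returned))   # RETURN DISTINCT "1"

print(len(returned), returned_distinct)  # 15 ['1']
```

The RETURN expression never looks at n, yet it is still evaluated once for each of the 15 rows.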
As I mentioned in EDIT1, the problem here was that I thought WITH X LIMIT N doSomeThing would execute like a foreach(x : X) loop if I use x, and would not if I don't use x. Stupid assumption...

Return Neo4J Combined Relationships When Searching Across Several Relationship Types

I would like to query for various things and return a combined set of relationships. In the example below, I want to return all people named Joe living on Main St. I want to return both the has_address and has_state relationships.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street =~ ".*Main St.*"
RETURN r, r1;
But when I run this query in the Neo4J browser and look under the "Text" view, it seems to put r and r1 as columns in a table (something like this):
│r  │r1 │
╞═══╪═══╡
│{} │{} │
rather than as desired with each relationship on a different row, like:
Joe Smith | has_address | 1 Main Street
1 Main Street | has_state | NY
Joe Richards | has_address | 22 Main Street
I want to download this as a CSV file for filtering elsewhere. How do I re-write the query in Neo4J to get the desired result?
You may want to look at the Cypher cheat sheet, specifically the Relationship Functions.
That said, you have variables on all the nodes you need. You can output all the data you need on each row.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street =~ ".*Main St.*"
RETURN p.name AS name, a.street AS address, s.name AS state
That should be enough.
What you seem to be asking for above is a way to union r and r1, but in such a way that they alternate in-order, one row being r and the next being its corresponding r1. This is a rather atypical kind of query, and as such there isn't a lot of support for easily making this kind of output.
If you don't mind rows being out of order, it's easy to do, but your start and end nodes for each relationship are no longer the same type of thing.
MATCH (p:Person),
(p)-[r:has_address]-(a:Address),
(a)-[r1:has_state]-(s:State)
WHERE p.name =~ ".*Joe.*" AND a.street =~ ".*Main St.*"
WITH COLLECT(r) + COLLECT(r1) as rels
UNWIND rels AS rel
RETURN startNode(rel) AS start, type(rel) AS type, endNode(rel) as end
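To make the shape of that output concrete, here is a minimal Python sketch of the collect-then-unwind step. The matched rows are invented from the names in the question, and it assumes the relationships are stored as person to address and address to state, which is what decides startNode and endNode.

```python
# Hypothetical matched rows: (person, address, state), one per pattern match.
matches = [("Joe Smith", "1 Main Street", "NY"),
           ("Joe Richards", "22 Main Street", "NY")]

# r and r1 for each row, written as (start, type, end) triples.
r = [(p, "has_address", a) for p, a, _ in matches]
r1 = [(a, "has_state", s) for _, a, s in matches]

rels = r + r1                       # COLLECT(r) + COLLECT(r1)
for start, rel_type, end in rels:   # UNWIND rels AS rel
    print(f"{start} | {rel_type} | {end}")
```

All has_address rows come first and all has_state rows after them, which is the "out of order" result mentioned above.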

Neo4j similar paths

I want to write a query that takes all users (without a pre-condition such as user ids) and finds the most common similar paths (for example, the top 10 user flows).
For example:
User u1 has events: a,b,c,d
User u2 has events: b,d,e
Each event is a node with a property event-type.
The result should look like:
[a,b,e] - 100 users
[a,c,f] - 80 users
[b,d,t] - 50 users
.......
The data that generated the 1st aggregated row in the result could be, for example:
user 1: a,b,c,e
user 2: a,b,e,f
.........
user 100: a,c,t,b,g,e
I wonder if this link can help:
http://neo4j.com/docs/stable/rest-api-graph-algos.html#rest-api-execute-a-dijkstra-algorithm-with-equal-weights-on-relationships
Here is a Cypher query that returns all the Event nodes that user 1 and user 2 have in common (in a single row):
MATCH (u1:User {id: 1}) -[:HAS]-> (e:Event) <-[:HAS]- (u2:User {id: 2})
RETURN u1, u2, COLLECT(e);
[Added by MichaelHunger; modified by cybersam] For your additional question try:
// Specify the user ids of interest. This would normally be a query parameter.
WITH [1,2,3] as ids
MATCH (u1:User) -[:HAS]-> (e:Event)
// Only match events for users with one of the specified ids.
WHERE u1.id IN ids
// Count # of distinct user ids per event, and count # of input ids
WITH e, size(collect(distinct u1.id)) as n_users, size(ids) AS n_ids
// Only match when the 2 counts are the same
WHERE n_users = n_ids
RETURN e;
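The counting trick in that query (keep an event only when the number of distinct users who have it equals the number of requested ids) can be sketched in plain Python. The pairs below are made-up sample data, and the sketch assumes the ids list has no duplicates.

```python
# Hypothetical (user id, event) pairs standing in for (:User)-[:HAS]->(:Event).
has = [(1, "a"), (1, "b"), (1, "c"), (1, "d"),
       (2, "b"), (2, "d"), (2, "e"),
       (3, "b"), (3, "d"), (3, "a")]
ids = [1, 2, 3]

users_per_event = {}
for uid, event in has:
    if uid in ids:                                   # WHERE u1.id IN ids
        users_per_event.setdefault(event, set()).add(uid)

# Keep events whose distinct-user count equals the number of requested ids.
common = sorted(e for e, us in users_per_event.items() if len(us) == len(ids))
print(common)  # ['b', 'd']
```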

server requirements and optimizations for my data model ( 1.5 billion relationships )

My data model is fairly simple:
(n:User)-[:WANTS]->(c:Card)<-[:HAS]-(o:User)
Whenever a user updates a card in his wants list, I create outgoing :FOLLOWS connections to users who also have that card in their haves list. At the same time, I also create incoming :FOLLOWS connections from users who want cards in the user's have list like so:
// update my total wants
MATCH (u:User)-[w:WANTS]-()
WHERE u.id = 1
WITH u, SUM(w.qty) AS wqty
SET u.wqty = wqty
RETURN wqty;
// delete all my incoming and outgoing follows
MATCH (u1:User {id: 1})-[f:FOLLOWS]-() DELETE f;
// outgoing follows
MATCH (u1:User)-[w:WANTS]->(c:Card)<-[h:HAS]-(u2:User)
WHERE u1.id = 1 AND u1.id <> u2.id
WITH u1, u2, (CASE WHEN h.qty > w.qty THEN w.qty ELSE h.qty END) AS haves
WITH u1.id AS id1, u2.id AS id2, SUM(haves) as weight
MATCH (uf:User), (ut:User)
WHERE uf.id = id1 AND ut.id = id2
MERGE (uf)-[f:FOLLOWS {weight: weight}]->(ut)
ON MATCH SET f.weight = weight;
// incoming follows
MATCH (u1:User)-[h:HAS]->(c:Card)<-[w:WANTS]-(u2:User)
WHERE u1.id = 1 AND u1.id <> u2.id
WITH u1, u2, (CASE WHEN h.qty > w.qty THEN w.qty ELSE h.qty END) AS haves
WITH u1.id AS id1, u2.id AS id2, SUM(haves) as weight
MATCH (uf:User), (ut:User)
WHERE uf.id = id1 AND ut.id = id2
MERGE (uf)<-[f:FOLLOWS {weight: weight}]-(ut)
ON MATCH SET f.weight = weight;
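For readers who want to check the weight calculation, here is a rough Python equivalent of the CASE expression and SUM used in the two FOLLOWS queries above. The quantities are invented sample data.

```python
from collections import defaultdict

# Hypothetical quantities: wants[card] for user 1, has[(user, card)] for others.
wants = {"cardA": 2, "cardB": 1}
has = {(2, "cardA"): 5, (2, "cardB"): 1, (3, "cardA"): 1, (4, "cardC"): 3}

weight = defaultdict(int)
for (other, card), h_qty in has.items():
    if card in wants:
        # CASE WHEN h.qty > w.qty THEN w.qty ELSE h.qty END is min(h, w).
        weight[other] += min(h_qty, wants[card])

print(dict(weight))  # {2: 3, 3: 1}
```

User 4 gets no FOLLOWS weight because none of that user's cards are wanted.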
I decided to include this hard-coded :FOLLOWS relationship every time a user updates something in his inventory because I tried querying the trade potential based on the cards they had and the query was very expensive. This way the users will be able to check trade potentials by doing the following query:
MATCH (u1:User {id: 1})-[f1:FOLLOWS]->(u2:User)-[f2:FOLLOWS]->(u1)
RETURN u2.id, f1.weight AS num_cards_i_need, f2.weight AS num_cards_they_need
This works very fast for my test database, which only has the incoming/outgoing follow relationships calculated for 1 user.
Now on to the problem. I have a small number of nodes: 50k users and 14k cards. However, each user on average follows 30k other users, making roughly 1.5 billion relationships. The data store for this is expected to be around 20-30GB after I'm done loading it into Neo4j.
My question is, do I need to be able to load the whole database in memory in order to achieve fast reads as well as fast and frequent writes of the follows relationships? Let's say I didn't have the resources to rent some large-memory instance on Amazon and I'm limited to conventional server hardware: what optimizations do I need to do so that I can read and write the :FOLLOWS relationships very fast?
I obviously have memory for the nodestore and I also have some memory for relationshipstores for users->cards relationship, but not for users->users relationship. Can I choose which ones get loaded to memory so in effect they are "warm"?

Cypher: group by cheapest price

I'm currently experimenting a bit with Cypher. I have a simple setup of components being connected to a merchant by a relationship "sells" that has a property "price"
(merchant)-[:sells {price: 10}]->(component)
I made a cypher query which calculates the lowest price, if you buy products from the same merchant.
MATCH (sup)-[s:sells]->(component)
WITH SUM(s.price) AS total, sup
RETURN sup, total
ORDER BY total ASC
Now while this is working, I have an issue finding the cheapest price(s) in case 2 or more suppliers are tied. I'd like to get something like
-------------------------
| price | supplier      |
-------------------------
| 60    | conrad        |
|       | amazon        |
-------------------------
You can view my setup here:
http://console.neo4j.org/?id=wpz165
EDIT:
Ok, I found a way, although it isn't pretty.
MATCH (sup)-[s:sells]->(component)
WITH SUM(s.price) AS minprice, sup
ORDER BY minprice
LIMIT 1
MATCH (sup2)-[s2:sells]->(component2)
WITH SUM(s2.price) AS total2, sup2, minprice
WHERE total2 = minprice
RETURN minprice, sup2
How does this work? Well, the first part finds the lowest price (by ordering and only returning the first row). The second part runs the whole query again and filters out suppliers that don't have the lowest price... so the whole query is run two times.
any better ideas???
For my aesthetic this is less ugly though it does require three WITH clauses.
Sum price by supplier for all components
Find Minimum price
Return all suppliers with minimum price
MATCH (sup)-[s:sells]->(component)
WITH sup, SUM(s.price) AS price_sum
WITH MIN(price_sum) AS price_min
MATCH (sup2)-[s2:sells]->(component2)
WITH sup2, SUM(s2.price) AS price_sum2, price_min
WHERE price_sum2 = price_min
RETURN sup2, price_sum2
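The three steps (sum per supplier, find the minimum, keep every supplier that ties for it) can also be sketched in plain Python. The rows below are invented sample data; only the supplier names conrad and amazon and the total of 60 come from the question.

```python
from collections import defaultdict

# Hypothetical (supplier, component, price) rows for the sells relationships.
sells = [("conrad", "cpu", 40), ("conrad", "ram", 20),
         ("amazon", "cpu", 35), ("amazon", "ram", 25),
         ("ebay", "cpu", 50), ("ebay", "ram", 30)]

totals = defaultdict(int)
for supplier, _component, price in sells:
    totals[supplier] += price            # SUM(s.price) per supplier

price_min = min(totals.values())         # MIN(price_sum)
cheapest = sorted(s for s, t in totals.items() if t == price_min)
print(price_min, cheapest)  # 60 ['amazon', 'conrad']
```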
