Neo4j Cypher return multiple counts from a single query - neo4j

For analytical purposes, I need to return multiple counts from a single query.
For example, I have a User entity. User has an active property (true/false).
Is it possible to write a single Cypher query that returns the total number of users and also two additional counts, one for active and one for inactive users? If so, please show how.

Here are the counts of active and inactive users. The approach is similar to SQL: it uses the sum() function with a conditional "case when" expression.
MATCH (n:Person)
RETURN count(n) as user_counts,
sum(case when n.active then 1 end) as active,
sum(case when not n.active then 1 end) as inactive,
sum(case when n.active is NULL then 1 end) as no_info
Sample result using the Person nodes in the movie database:
╒═════════════╤════════╤══════════╤═════════╕
│"user_counts"│"active"│"inactive"│"no_info"│
╞═════════════╪════════╪══════════╪═════════╡
│133 │121 │7 │5 │
└─────────────┴────────┴──────────┴─────────┘
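The conditional-sum pattern can be mimicked outside the database to see why the three conditional counts add up to the total. This is a minimal Python sketch, not Neo4j code. The sample list is invented for illustration, and `None` is assumed to stand for a node that has no `active` property:

```python
# Hypothetical sample data; None models a node without the `active` property.
users = [
    {"name": "a", "active": True},
    {"name": "b", "active": True},
    {"name": "c", "active": False},
    {"name": "d", "active": None},
]

def conditional_counts(rows):
    """Mirror count(n) plus the three sum(case when ... then 1 end) columns."""
    return {
        "user_counts": len(rows),
        # Each row contributes 1 to exactly one of the three buckets.
        "active": sum(1 for r in rows if r["active"] is True),
        "inactive": sum(1 for r in rows if r["active"] is False),
        "no_info": sum(1 for r in rows if r["active"] is None),
    }

counts = conditional_counts(users)
print(counts)
```

Because every row falls into exactly one bucket, active + inactive + no_info equals user_counts, just as 121 + 7 + 5 = 133 in the sample result.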

We can simply use:
MATCH (p:Persons)
RETURN count(p) as total_user,
sum(case when not p.active then 1 end) as inactive_users,
sum(case when p.active then 1 end) as active_users,
sum(case when p.active is NULL then 1 end) as remaining_users

Related

Most common tuples in Neo4j-DB

I have nodes of orders and products in my database connected by a contains- relationship:
(:order)-[:contains]->(:product)
I'm wondering if it is possible to find the most common n-tuples of products that occur in the same order.
I'm afraid this is not possible, as I have over 1500 products, which makes the number of possible combinations tremendously high even for small n, e.g. 1500^4 ≈ 5*10^12.
I have written the following test query for n=3:
MATCH (o:order)-[r:contains]->(p:product)
WITH count(r) as NrProds, o
MATCH (o)-[r:contains]->(p:product)
WHERE NrProds > 3
WITH o
MATCH (o)-[r1:contains]->(p1:product),(o)-[r2:contains]->(p2:product),(o)-[r3:contains]->(p3:product)
WITH o,p1,p2,p3,count(r1) as r1,count(r2) as r2,count(r3) as r3
WITH o,p1,p2,p3,
CASE WHEN r1<r2 AND r1<r3 THEN r1
WHEN r2<r1 AND r2<r3 THEN r2
WHEN r3<r1 AND r3<r2 THEN r3
END as result
WITH result,o,p1,p2,p3
RETURN count(result) as NrPurchs, o.Id,p1.name,p2.name,p3.name ORDER BY NrPurchs DESC
First I make sure not to consider any orders with a product count of less than 3, as those make up a huge part of all orders. Then I match over the contains relationships in these orders.
My computer does not finish the query, which is not surprising given the large joins being created.
Is there a way of finding the tuples that does not involve querying over so many possibilities, so that my computer can finish the calculation?
As you pointed out, the number of possible product combinations is too high, so it is best to generate the tuples from the order side; that way only tuples that actually exist are generated.
You have one unnecessary MATCH before your filter. The following should be enough to filter out the orders with fewer than 3 products:
MATCH (o:order)-[r:contains]->(p:product)
WITH count(r) as NrProds, o
WHERE NrProds >= 3 // note >= sign for orders with 3 and more
WITH o
...
or if your order uses contains relationship only for products:
MATCH (o:order)
WITH o,size((o)-[:contains]->()) as NrProds
WHERE NrProds >= 3
WITH o
...
To avoid duplicates, filter out permutations of the same products by ordering them by id, name, etc. (This WHERE clause only works for unique names/ids. If you have duplicates, you need <= in there.)
...
MATCH (o)-[r1:contains]->(p1:product),(o)-[r2:contains]->(p2:product),(o)-[r3:contains]->(p3:product)
WHERE p1.name < p2.name AND p2.name < p3.name
...
then return each tuple with number of orders:
RETURN p1,p2,p3,COUNT(o) as c
(if you have multiple contains relationships between single order and product you should use COUNT(DISTINCT o))
Finally return only N tuples:
ORDER BY c DESC LIMIT {n}
Whole query:
MATCH (o:order)
WITH o,size((o)-[:contains]->()) as NrProds
WHERE NrProds >= 3
WITH o
MATCH
(o)-[r1:contains]->(p1:product),
(o)-[r2:contains]->(p2:product),
(o)-[r3:contains]->(p3:product)
WHERE p1.name < p2.name AND p2.name < p3.name
RETURN p1,p2,p3,COUNT(o) as c
ORDER BY c DESC LIMIT {n}
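The idea behind the whole query (start from each order, emit only the sorted product triples that actually occur, then count them) can be sketched in a few lines of Python. This is only an illustration of the counting logic and not how Neo4j evaluates the query; the `orders` mapping and the function name are hypothetical:

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: order id -> product names contained in that order.
orders = {
    1: ["apple", "bread", "cheese"],
    2: ["apple", "bread", "cheese", "milk"],
    3: ["bread", "milk"],
    4: ["cheese", "apple", "bread"],
}

def most_common_tuples(orders, size=3, top=1):
    """Count sorted product tuples per order, generating only existing ones."""
    counts = Counter()
    for products in orders.values():
        # Sorting plays the role of `p1.name < p2.name AND p2.name < p3.name`:
        # each set of products is emitted once, never as a permutation.
        unique_sorted = sorted(set(products))
        if len(unique_sorted) < size:
            continue  # same effect as the NrProds >= 3 filter
        counts.update(combinations(unique_sorted, size))
    return counts.most_common(top)

result = most_common_tuples(orders, size=3, top=1)
print(result)
```

The work is proportional to the tuples present in the orders, not to the 1500^n theoretical combinations.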

Why these two Cypher queries return different result?

I'm trying to learn Cypher, and I have the data of a trust network. I wanted to query the people who trust the "15 most trusted people", so I wrote this query:
QUERY1:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})
return w1.Id as user1, w2.Id as user2
After that, I changed the last 2 lines of the query to this:
QUERY2:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})-[:TRUST]->(w3:USER)
return w1.Id as user1,w2.Id as user2, w3.Id as user3
After analyzing the result, I guessed that something was wrong.
So I hard-coded id to a specific value, for example 575, and then count(p) is equal to 1937520. But if I run the last line of the query with the hard-coded Id as a separate query:
QUERY3:
MATCH r=(u1:USER)-[:TRUST]->(u2:USER {Id: "575"})-[:TRUST]->(u3:USER)
return count(r)
then count(r) is equal to 129168!
I checked that user "575" trusts 207 people and is trusted by 624 people, so the QUERY3 result seems correct: 207*624=129168. My question is: why?
I can't understand what is wrong with QUERY2. My second question is: does this mean that the QUERY1 result is wrong too?
EDIT1:
Thanks for the answers, but I still had a problem with this, so I checked another scenario and got the following result.
If I write a query like this:
QUERY4:
MATCH (n) WITH n limit 15 return "1"
I get 15 "1"s printed in the output. That means the last part of QUERY2 executes 15 times, whether I hard-code the Id or not, as if it were in a for loop. So the problem was that I thought WITH X LIMIT N doSomeThing would execute like a foreach(x : X) loop if I use x, and would not if I don't use x. Stupid assumption...
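The numbers in the question are consistent with this reading: a clause that follows WITH ... LIMIT 15 runs once per incoming row, so a MATCH with a hard-coded Id produces the same paths 15 times. A small Python sketch of the arithmetic, using only the figures quoted above (the function name is hypothetical):

```python
# Figures from the question: user "575" is trusted by 624 users
# and trusts 207 users.
trusted_by = 624
trusts = 207

def count_paths(incoming_rows):
    """Model a MATCH that is re-run once per incoming row.

    Each run finds every two-step path through the hard-coded user,
    i.e. trusted_by * trusts paths.
    """
    paths_per_row = trusted_by * trusts
    return incoming_rows * paths_per_row

standalone = count_paths(1)    # QUERY3: the MATCH runs once
after_limit = count_paths(15)  # QUERY2 with a hard-coded Id, after LIMIT 15
print(standalone, after_limit)
```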
This query might do what you intended.
MATCH (:USER)-[r:TRUST]->(u2:USER)
WITH u2, COUNT(r) AS score
ORDER BY score DESC
LIMIT 15
MATCH (w1:USER)-[:TRUST]->(u2)-[:TRUST]->(w3:USER)
RETURN w1.Id AS user1, u2.Id AS user2, w3.Id AS user3;
It first finds the 15 most-trusted users, then finds all the 2-level trust paths that those users are in the middle of, and finally returns the ids of the users in those paths.
Also, the second MATCH reuses the u2 nodes already found by the first MATCH, to speed up the processing of the second MATCH.
In QUERY3, you are matching u2 to a single user (user 575). QUERY3 is correct.
However, in QUERY2, that WITH (line 3) matches 15 different u1-u2 combinations. The MATCH (line 1) returns a "row" for each u1 and u2 that, well, matches that pattern. Then you are returning just the first 15 results, which I guess are 15 different u1 for u2=user{Id:575}. That's what gives 1937520 results, which is exactly 15 * 129168.
The problem in the WITH appears because you are not aggregating (not getting just 1 row for each u2). You 'return' (using WITH) one id variable for each u2 user, so count(u2) will always be 1. Maybe you wanted to write u1.Id or count(u1)? Anyway, WITHing u2.Id or u1.Id will return 15 results because of the LIMIT 15 (line 4). LIMIT 1 would do the trick, but we can also do this:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
WITH DISTINCT(u2.Id) AS id
LIMIT 15
And then the rest of QUERY2 (or QUERY1, for that matter). I eliminated the score variable, but if it's meant to be count(u1), it can be re-added with no problem.
I'll just break down Query 2 and the rest should make sense.
QUERY2:
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})-[:TRUST]->(w3:USER)
return w1.Id as user1,w2.Id as user2, w3.Id as user3
Starting with
MATCH (u1:USER)-[:TRUST]->(u2:USER)
with u2.Id as id, COUNT(u2) AS score
order by score desc
limit 15
You are basically creating a list of all "u1 trusts u2" rows, and COUNT(u2) = the number of u2 matched. So assuming "u1 trusts u2" has 100 matches, COUNT(u2) would put '100' in that column for each row. (You then order on what is now a constant, which does nothing, and limit to 15, so you now have an arbitrary list of 15 "u1 trusts u2" rows.)
So that just leaves
match p=(w1:USER)-[:TRUST]->(w2:USER {Id: id})-[:TRUST]->(w3:USER)
So that matches each path p where a user w1 trusts a user w2 (once for each id from the first part) who trusts a user w3.
So, fixing the first part: to get the top 15 trusted users, you need to count the number of incoming trusts:
MATCH (u1:USER)-[trusts:TRUST]->(u2:USER)
with u2, COUNT(trusts) AS score
order by score desc
limit 15
So now you have the 15 most trusted users, and you can verify this with return u2.Id, score. To get the people who trust these people, you would then just need to ask something like...
MATCH (u3:USER)-[:TRUST]->(u2)
and u3 will then be all users who trust someone from the top 15 trusted people (u2).
As an additional note, if you are using the Neo4j web browser, try prepending the PROFILE keyword to your Cypher query for some insight into what the query actually does.
Edit 1:
Now to explain what QUERY4 does: MATCH (n) WITH n limit 15 return "1". As I am sure you guessed, MATCH (n) WITH n limit 15 matches all nodes but limits the results to the first 15. In the RETURN part, you are saying "for each row, return the constant '1'", which gives you 15 distinct rows internally, but the returned rows are not distinct. This is what the DISTINCT keyword is for. Using RETURN DISTINCT "1" says "for each row, return the constant '1', but filter the result set to only have distinct rows", i.e. no 2 rows will have the same values. The non-distinct result is useful if you know there will be some duplicate rows but you want to see them anyway (maybe for a weight reference, or knowing that they are from 2 separate fields).
As I mentioned in EDIT1, the problem here was that I thought WITH X LIMIT N doSomeThing would execute like a foreach(x : X) loop if I use x, and would not if I don't use x. Stupid assumption...

NEO4J Query to exclude some nodes

With a data model that looks similar to this:
(u:User)-[:POSTED]->(p:Post {created_at: 123})-[:ABOUT]->(t:Topic {name: "Blue"})
What is the best way to find the distinct count of users who created a post with {created_at: 123} AND who don't have a post with {created_at: 124} about the topic "Blue"?
Closest I can get is to collect ids and then exclude them but that doesn't scale when you have a lot of nodes (millions).
[EDITED]
I also need the created_at times to be specifiable as ranges.
This query allows you to specify created_at ranges. In this example, the desirable range is [123..130], and the undesirable "Blue" range is [131..140]. In your actual query, the range endpoints should be specified by parameters.
MATCH (user:User)-[:POSTED]->(p1:Post)
WHERE 123 <= p1.created_at <= 130
WITH user
OPTIONAL MATCH (user)-[:POSTED]->(p2:Post)-[:ABOUT]->(:Topic{name:"Blue"})
WHERE 131 <= p2.created_at <= 140
WITH user, p2
WHERE p2 IS NULL
RETURN COUNT(DISTINCT user) AS userCount;
The OPTIONAL MATCH clause is there to match the undesirable "Blue" paths, and the WHERE p2 IS NULL clause will filter out user nodes that have any such paths.
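The OPTIONAL MATCH ... WHERE p2 IS NULL combination behaves like an anti-join: keep the users that have a post in the desirable range, then drop any of them that has a "Blue" post in the undesirable range. A minimal Python sketch of that logic only (it says nothing about how Neo4j executes the query); the sample rows and the function name are hypothetical:

```python
# Hypothetical rows: (user, created_at, topic name).
posts = [
    ("ann", 125, "Red"),
    ("ann", 135, "Blue"),   # ann is excluded by this post
    ("bob", 124, "Blue"),   # Blue, but not in the undesirable range
    ("cat", 129, "Green"),
    ("cat", 150, "Blue"),   # Blue, but outside the undesirable range
    ("dan", 135, "Blue"),   # dan has no post in the desirable range
]

def count_users(posts, wanted=(123, 130), unwanted=(131, 140), topic="Blue"):
    """Count users with a post in `wanted` and no `topic` post in `unwanted`."""
    in_wanted = {u for u, t, _ in posts if wanted[0] <= t <= wanted[1]}
    excluded = {
        u for u, t, name in posts
        if name == topic and unwanted[0] <= t <= unwanted[1]
    }
    # Set difference corresponds to keeping only rows where p2 IS NULL.
    return len(in_wanted - excluded)

user_count = count_users(posts)
print(user_count)
```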
Assuming you have an index on Post.created_at and Topic.name (for speed), this should work:
MATCH (user:User)-[:POSTED|CREATED]->(:Post{created_at:123})
WHERE NOT EXISTS ( (user)-[:POSTED|CREATED]->(:Post{created_at:124})-[:ABOUT]->(:Topic{name:"Blue"}) )
RETURN count(DISTINCT user) as userCount
It's worth profiling that query, and if it isn't using the created_at and name indexes, supply that to the query with USING INDEX after your MATCH.

Neo4j/Cypher - Multiple WHERE clauses return no results

I have the following Queries:
This one returns the expected results:
MATCH (u:User)-[pw_rel:PAYED_WITH]->(ext_acc:ExtAccount)
WHERE pw_rel.processed_at > 0
WITH u, COUNT(DISTINCT ext_acc) as accs
WHERE accs > 2
RETURN u
While this one doesn't return anything:
MATCH (u:User)-[pw_rel:PAYED_WITH]->(ext_acc:ExtAccount)
WITH u, pw_rel, ext_acc, COUNT(DISTINCT ext_acc) as accs
WHERE pw_rel.processed_at > 0 AND accs > 2
RETURN u
Why is that? What am I missing?
This has to do with aggregation, and what it means in the context of other columns. That is, the other non-aggregation columns present act as the grouping key.
In your first query, your only grouping key is u. This means your output is the count of all ext accounts for each user.
In your second query, your grouping key includes u, pw_rel, and ext_acc. So it returns the count of all ext accounts for each user / pw_rel / ext_acc combination (or row). That's not very helpful, as you're likely getting back a count of one on each row, so no rows will pass your WHERE clause requiring the count to be > 2. Even if you removed the ext_acc column from your WITH, the pw_rel column would still restrict you to counts of 1 (since there's only one ext_acc per user/pw_rel).
It should be easy to see that the other columns modify what the aggregate functions are grouping on, and the columns you'll get back.
count(ext_acc) - The total count of all ext accounts
u, count(ext_acc) - Users, and the count of ext accounts per user
u, pw_rel, count(ext_acc) - Users, payed-with relations, and the count of all external accounts per User and payed-with relation from that user
u, pw_rel, ext_acc, count(ext_acc) - Users, payed-with relations, ext accounts, and the count of all external accounts per User and payed-with relation from that user to a specific ext account.
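The effect of the grouping key can be reproduced with a small Python sketch. It assumes a hypothetical list of matched rows (user, pw_rel, ext_acc) and groups them by different keys before counting distinct accounts, mirroring the two queries in the question:

```python
from collections import defaultdict

# Hypothetical matched rows: (user, pw_rel id, ext_acc).
rows = [
    ("u1", "r1", "accA"),
    ("u1", "r2", "accB"),
    ("u1", "r3", "accC"),
    ("u2", "r4", "accA"),
]

def count_distinct_accounts(rows, key):
    """Group rows by `key` and count distinct ext accounts per group."""
    groups = defaultdict(set)
    for row in rows:
        groups[key(row)].add(row[2])
    return {k: len(accounts) for k, accounts in groups.items()}

# First query: the grouping key is the user only.
per_user = count_distinct_accounts(rows, key=lambda r: r[0])
# Second query: the grouping key is the whole (u, pw_rel, ext_acc) row.
per_row = count_distinct_accounts(rows, key=lambda r: r)

passing_first = [u for u, c in per_user.items() if c > 2]
passing_second = [k for k, c in per_row.items() if c > 2]
print(passing_first, passing_second)
```

With the full row as the key, every group holds exactly one account, so nothing can pass accs > 2.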

Top 10 users for each game in neo4j based on games they like

I am trying to get the top 10 users who have liked a particular game, for games with game_id from 1 to 25.
These games have a relationship with users called rating, with the property rating_val = 1 to 10.
How can I get 25 rows, one per game category, each with the group of users who have a rating_val from 1 to 10, in descending order?
Basically:
25 game categories with id 1 to 25
games_like is a relationship with rating_val from 1-10
users are nodes with id, name
This query is not working:
MATCH (u:user { user_id:"1" })
MATCH (o:user)
WHERE o <> u
OPTIONAL MATCH (u)-[r:games_like]->(d)<-[rw:games_like]-(o)
RETURN
toInt(r.rating_val)+toInt(rw.rating_val) as sum ,
collect(DISTINCT (r.rating_val)) AS user1,
collect(DISTINCT (rw.rating_val)) AS user2,
d
ORDER BY sum DESC
I'm a bit confused because your description seems to be saying you just want a join between users and games, but your query is matching two users against one game. Also, do your game nodes have a label?
From your description, this is the query I would write:
MATCH (game)<-[rel:games_like]-(user:user)
WITH game, user ORDER BY rel.rating_val DESC
RETURN game, collect(user)[0..10]
That returns you the top users for each game. If you want to limit it you could do:
MATCH (game)<-[rel:games_like]-(user:user)
WHERE
1 <= game.game_id AND game.game_id <= 25 AND
1 <= rel.rating_val AND rel.rating_val <= 10
WITH game, user ORDER BY rel.rating_val DESC
RETURN game, collect(user)[0..10]
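The "order first, then collect and slice" pattern in these queries is a top-N-per-group computation. A minimal Python sketch of the same idea, assuming hypothetical (game, user, rating_val) rows with numeric ratings:

```python
from collections import defaultdict

# Hypothetical rows: (game, user, rating_val).
ratings = [
    ("chess", "ann", 9),
    ("chess", "bob", 7),
    ("chess", "cat", 10),
    ("go", "ann", 3),
    ("go", "dan", 8),
]

def top_users_per_game(ratings, n=10):
    """Sort by rating descending, then keep the first n users per game."""
    ordered = sorted(ratings, key=lambda r: r[2], reverse=True)
    top = defaultdict(list)
    for game, user, _ in ordered:
        # Equivalent of collect(user)[0..n]: stop adding once n are kept.
        if len(top[game]) < n:
            top[game].append(user)
    return dict(top)

top2 = top_users_per_game(ratings, n=2)
print(top2)
```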
