Neo4j query performance with keyword DISTINCT

I have a naive question regarding the use of the keyword DISTINCT. Basically, I have a graph (User-[Likes]->Item) with millions of nodes. I want to find the distinct users that like a certain item. The following two queries have a significant performance difference, and I'm confused. I have created indexes on :Item(id) and :User(id).
Query 1:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return count(distinct u);
Query 2:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return distinct u;
The first query returns a result in seconds, but the second query kept running for over 5 minutes until I lost patience and stopped it. I thought the second query would be faster than the first since there is no count aggregation, so I don't understand the performance difference.

Your first query only returns the count of distinct values, which is a single row and an easy job for Neo4j.
Your second query, by contrast, returns every distinct node, so if your database has many distinct values it will take a long time. If you just want a glimpse of a few distinct values, you can add a LIMIT to your query.
E.g.:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return distinct u
limit 5;
It returns five (arbitrary) users who like the Item with id '001'.
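The difference between the two result shapes can be sketched with a toy Python model. This is only an illustration of "one number versus many rows, and stopping early with a limit"; it is not how Neo4j executes the query, and the `likers` stream and its sizes are made-up assumptions.

```python
from itertools import islice

def likers():
    """Hypothetical stream of user ids that like item '001' (with repeats)."""
    for i in range(100_000):
        yield i % 25_000

def distinct(rows):
    """Yield each value the first time it is seen, like RETURN DISTINCT."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row

# count(DISTINCT u): the whole stream is scanned, but only one number comes back.
distinct_count = len(set(likers()))

# RETURN DISTINCT u LIMIT 5: stop pulling rows once five have been produced.
first_five = list(islice(distinct(likers()), 5))
```

Without the `islice` call, the caller would have to receive every one of the distinct rows, which is the expensive part when there are very many of them.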

Related

Neo4j - most efficient way to check if nodes with a given label exist

I need to check if any node with the given label exists in my application. What's the most efficient approach to do so (in Java)? I was expecting
Transaction.getAllLabelsInUse()
to do the job, but it seems to also return true when any index or constraint exists for the given label.
My current workaround is running a query like this:
match (n:`label`) return n._id limit 1
assuming it would be a bit faster than
match (n:Crew) with n limit 1 return count(*)
The counts store can quickly service simple queries, such as getting the counts of all nodes of a label, so match (n:Crew) return count(n) will be very fast.
Take a look at our knowledge base article on getting fast counts from the counts store for other alternatives that leverage the counts store.

Cypher query to get subsets of different node labels, with relations

Let's assume this use case:
We have a few nodes (labeled Big), each having a simple integer ID property.
Each Big node has a relation with millions of nodes (labeled Small),
such as:
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to represent the following in natural language:
For each Big node with an id in the range 4-7, get me 10 of the Small nodes that belong to it.
The expected result would give 2 Big nodes, 20 Small nodes, and 20 relations.
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of Small nodes that belongs to them
What I've tried but failed (it only shows 1 Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the limit applies to the entire result of the preceding query.
You need to use the aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big) Where 4<b.bigID<7
With b,
collect(distinct s)[..10] as smalls
return b,
smalls
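The difference between a limit on the whole result and a slice per Big node can be sketched in Python. This is a toy model under assumptions: the `rows` list is invented data, and its ordering (all rows for Big 5 first) is assumed to mirror what the question observed; Neo4j does not guarantee any row order.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical (big_id, small_id) rows, as the MATCH might produce for bigID 5 and 6.
rows = [(big, f"s{big}-{n}") for big in (5, 6) for n in range(1000)]

# RETURN b, s LIMIT 10: the limit cuts the whole row stream, so only Big 5 appears.
global_limit = list(islice(rows, 10))

# WITH b, collect(DISTINCT s)[..10]: group by b first, then slice each group's list.
groups = defaultdict(list)
for big, small in rows:
    if small not in groups[big]:
        groups[big].append(small)
per_big = {big: smalls[:10] for big, smalls in groups.items()}
```

Note that the grouping loop still touches every row before slicing, which is the expansion cost the other answer warns about.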

Neo4j- incorrect count in multiple match query

When I am trying to execute this query
match(u:User)-[ro:OWNS]->(p:PushDevice) where p.type='gcm'
match(com:Comment)
return count(com) as total_comments,count(ro) as device
it returns the same number in both total_comments and device, which is the total number of comments.
I feel like your query should work, though I'm more confident that this will work:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice) WHERE p.type='gcm'
WITH count(ro) AS device
MATCH (com:Comment)
RETURN count(com) as total_comments, device
Your query is generating a row for every combination of your MATCH results. If you just returned the ro and com values, this would be more clear. See this console for an example. That console has 2 comments and a single OWNS relationship, but the result shows 2 rows (both rows have the same OWNS relationship). So, your query is essentially counting the number of rows -- not what you expected.
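The row multiplication can be modelled in a few lines of Python. This is a sketch using the same sizes as the console example described above (two comments, one OWNS relationship); the names `ro1`, `c1` and `c2` are placeholders.

```python
from itertools import product

owns = ["ro1"]            # hypothetical: one matching OWNS relationship
comments = ["c1", "c2"]   # hypothetical: two Comment nodes

# Two independent MATCH clauses yield every combination of their results.
rows = list(product(owns, comments))

# count(com) and count(ro) each count non-null values per row, so both see 2 rows.
total_comments = sum(1 for _ro, com in rows if com is not None)
device = sum(1 for ro, _com in rows if ro is not None)

# COUNT(DISTINCT ...) removes the duplicates that the product introduced.
distinct_comments = len({com for _ro, com in rows})
distinct_devices = len({ro for ro, _com in rows})
```

Here `total_comments` and `device` are both 2, while the distinct counts are 2 and 1.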
Here is an example of a query that would work as you expected:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice {type:'gcm'})
WITH COUNT(ro) AS device
MATCH (com:Comment)
RETURN count(com) AS total_comments, device;
[EDITED]
This would also work logically, but is less performant (as it takes a cartesian product and then filters out duplicates):
MATCH (u:User)-[ro:OWNS]->(p:PushDevice { type: 'gcm' })
MATCH (com:Comment)
RETURN COUNT(DISTINCT com), COUNT(DISTINCT ro);
Observation
The power of neo4j comes from its efficient handling of relationships. So, the most efficient queries tend to be for connected subgraphs (where all nodes are connected by relationships).
Since your query is not for a single connected subgraph, getting the answer you want is naturally going to be a bit more convoluted and can be inefficient.
If you determine that the suggested queries are too slow, you can try making 2 separate queries instead. That may also make your code easier to understand.

Explanation of WITH and COLLECT in following duplicate finding query

I was recently researching how to find duplicate nodes by a property and found the following results, which provided a very effective solution:
neo4j find all nodes with matching properties
An effective way to lookup duplicate nodes in Neo4j 1.8?
Since I'm using Neo4j v2.2.3 Community, I used the following style:
match (n:Label) with n.prop as prop, collect(n) as nodelist, count(*) as count where count > 1 return prop, nodelist, count
I'm having trouble understanding how this works. I've spent my career using relational databases and just don't get the grouping mechanism, which is obviously there since I got a list of nodes and their respective count.
Can someone please explain how this works or provide a reference to an explanation?
Here's the relevant documentation on aggregation in Cypher.
The important bit is this:
Aggregate functions take multiple input values and calculate an
aggregated value from them. Examples are avg that calculates the
average of multiple numeric values, or min that finds the smallest
numeric value in a set of values.
Aggregation can be done over all the matching subgraphs, or it can be
further divided by introducing key values. These are non-aggregate
expressions, that are used to group the values going into the
aggregate functions.
So, if the return statement looks something like this:
RETURN n, count(*)
We have two return expressions: n, and count(*). The first, n, is no
aggregate function, and so it will be the grouping key. The latter,
count(*) is an aggregate expression. So the matching subgraphs will be
divided into different buckets, depending on the grouping key. The
aggregate function will then run on these buckets, calculating the
aggregate values.
If you want to use aggregations to sort your result set, the
aggregation must be included in the RETURN to be used in your ORDER
BY.
Your query is reasonably straightforward:
match (n:Label)
with n.prop as prop,
collect(n) as nodelist,
count(*) as count
where count > 1
return prop, nodelist, count
You're just doing your aggregation in your WITH clause instead of in your RETURN clause. Both collect(n) and count(*) are aggregating functions here; the only non-aggregate expression, n.prop, is therefore the grouping key. So the matched nodes are divided into buckets by their prop value, collect(n) gathers each bucket's nodes into a list, count(*) counts them, and the WHERE keeps only the buckets holding more than one node, which are the duplicates.
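For readers coming from relational databases, the bucketing can be sketched in Python. This is a minimal model of the query above, not of Neo4j internals; the `nodes` list is made-up sample data.

```python
from collections import defaultdict

# Hypothetical nodes: (node id, value of n.prop)
nodes = [(1, "a"), (2, "b"), (3, "a"), (4, "c"), (5, "a"), (6, "b")]

# WITH n.prop AS prop, collect(n) AS nodelist: one bucket per grouping key.
buckets = defaultdict(list)
for node_id, prop in nodes:
    buckets[prop].append(node_id)

# count(*) AS count ... WHERE count > 1: keep only buckets with duplicates.
duplicates = [
    (prop, nodelist, len(nodelist))
    for prop, nodelist in buckets.items()
    if len(nodelist) > 1
]
```

With this sample data, `duplicates` holds the buckets for "a" (three nodes) and "b" (two nodes); "c" is dropped because its count is 1.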

What does the WITH clause do? Neo4j

I don't understand what the WITH clause does in Neo4j. I read the Neo4j Manual v2.2.2, but it is not very clear about the WITH clause, and there are not many examples. For example, I have the following graph where the blue nodes are football teams and the yellow ones are their stadiums.
I want to find stadiums where two or more teams play. I found this query and it works.
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with a, count(*) as foaf
where foaf > 1
return a
count(*) tells us the number of matching rows, but I don't understand what the WITH clause does.
WITH allows you to pass on data from one part of the query to the next. Whatever you list in WITH will be available in the next query part.
You can use aggregation, SKIP, LIMIT, ORDER BY with WITH much like in RETURN.
The only difference is that your expressions have to get an alias with AS alias to be able to access them in later query parts.
That means you can chain query parts where one computes some data and the next query part can use that computed data. In your case it is what GROUP BY and HAVING would be in SQL but WITH is much more powerful than that.
Here is another example:
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with distinct a
order by a.name limit 10
match (a)-[:IN_CITY]->(c:City)
return c.name
