Explanation of WITH and COLLECT in following duplicate finding query - neo4j

I was recently researching how to find duplicate nodes by a property and found the following questions, which provided a very effective solution:
neo4j find all nodes with matching properties
An effective way to lookup duplicate nodes in Neo4j 1.8?
Since I'm using Neo4j v2.2.3 Community, I used the following style:
match (n:Label) with n.prop as prop, collect(n) as nodelist, count(*) as count where count > 1 return prop, nodelist, count
I'm having trouble understanding how this works. I've spent my career using relational databases and just don't get the grouping mechanism, which is obviously there since I got a list of nodes and their respective count.
Can someone please explain how this works or provide a reference to an explanation?

Here's the relevant documentation on aggregation in Cypher.
The important bit is this:
Aggregate functions take multiple input values and calculate an
aggregated value from them. Examples are avg that calculates the
average of multiple numeric values, or min that finds the smallest
numeric value in a set of values.
Aggregation can be done over all the matching subgraphs, or it can be
further divided by introducing key values. These are non-aggregate
expressions, that are used to group the values going into the
aggregate functions.
So, if the return statement looks something like this:
RETURN n, count(*)
We have two return expressions: n, and count(*). The first, n, is no
aggregate function, and so it will be the grouping key. The latter,
count(*) is an aggregate expression. So the matching subgraphs will be
divided into different buckets, depending on the grouping key. The
aggregate function will then run on these buckets, calculating the
aggregate values.
If you want to use aggregations to sort your result set, the
aggregation must be included in the RETURN to be used in your ORDER
BY.
Your query is reasonably straightforward:
match (n:Label)
with n.prop as prop,
collect(n) as nodelist,
count(*) as count
where count > 1
return prop, nodelist, count
You're just doing your aggregation in your with block instead of in your return block. In this case, n.prop isn't an aggregate function, but collect(n) and count(*) both are. So n.prop becomes the grouping key: the matched nodes get divided into buckets, one per distinct value of prop, and collect(n) and count(*) are computed per bucket. The where count > 1 then keeps only the buckets that hold more than one node, much like GROUP BY prop with HAVING count(*) > 1 in SQL. Those buckets are your duplicates.
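The bucketing can be modeled outside the database. The following is a minimal Python sketch of what the with clause does conceptually, not how Neo4j executes it; the node dicts and their id and prop values are made up for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins for the rows produced by MATCH (n:Label):
# each dict plays the role of one node, "prop" is the property being checked.
nodes = [
    {"id": 1, "prop": "a"},
    {"id": 2, "prop": "b"},
    {"id": 3, "prop": "a"},
    {"id": 4, "prop": "c"},
    {"id": 5, "prop": "b"},
    {"id": 6, "prop": "a"},
]

# WITH n.prop AS prop, collect(n) AS nodelist, count(*) AS count
# The only non-aggregate expression (n.prop) is the grouping key,
# so there is one bucket per distinct value of prop.
buckets = defaultdict(list)
for n in nodes:
    buckets[n["prop"]].append(n)  # collect(n) gathers the rows of each bucket

# One output row per bucket; count(*) is the number of rows in the bucket.
rows = [(prop, nodelist, len(nodelist)) for prop, nodelist in buckets.items()]

# WHERE count > 1 filters the aggregated rows, like HAVING in SQL.
duplicates = [(prop, nodelist, count) for prop, nodelist, count in rows if count > 1]

for prop, nodelist, count in duplicates:
    print(prop, [n["id"] for n in nodelist], count)
```

Only the buckets for "a" and "b" survive the filter, because "c" occurs on a single node.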

Related

Neo4j - most efficient way to check if nodes with a given label exist

I need to check if any node with the given label exists in my application. What's the most efficient approach to do so (in Java)? I was expecting
Transaction.getAllLabelsInUse()
to do the job, but it seems to also return true when any index or constraint exists for the given label.
My current workaround is running a query like this:
match (n:`label`) return n._id limit 1
assuming it would be a bit faster than
match (n:Crew) with n limit 1 return count(*)
The counts store can quickly service simple queries, such as getting the counts of all nodes of a label, so match (n:Crew) return count(n) will be very fast.
Take a look at our knowledge base article on getting fast counts from the counts store for other alternatives that leverage the counts store.

Cypher query to get subsets of different node labels, with relations

Let's assume this use case:
We have a few nodes (labeled Big), each having a simple integer ID property.
Each Big node has a relationship with millions of nodes (labeled Small),
such as:
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to represent the following in natural language:
For each Big node with an id in the range 4-7, get me 10 of the Small nodes that belong to it.
The expected result would give 2 Big nodes, 20 Small nodes, and 20 relationships.
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of the Small nodes that belong to them
What I've tried but failed (it only shows 1 Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the limit applies to the entire previous flow.
You need to use the aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big) Where 4<b.bigID<7
With b,
collect(distinct s)[..10] as smalls
return b,
smalls
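The difference between limiting the whole result and slicing per group can be seen in a small Python model. This is only a sketch of the semantics; the row data is hypothetical, with 25 Small rows for each of the Big nodes 5 and 6.

```python
from collections import defaultdict
from itertools import islice

# Hypothetical (small_id, big_id) pairs standing in for the rows matched by
# (s:Small)-[:BELONGS_TO]->(b:Big) with 4 < b.bigID < 7.
rows = [(f"s{big}-{i}", big) for big in (5, 6) for i in range(25)]

# RETURN b, s LIMIT 10: the limit cuts the whole row stream,
# so only rows of the first Big node come back.
global_limit = list(islice(rows, 10))
bigs_seen = {big for _, big in global_limit}

# WITH b, collect(DISTINCT s)[..10] AS smalls:
# group by b first, then slice each collected list.
groups = defaultdict(list)
for small, big in rows:
    if small not in groups[big]:  # DISTINCT
        groups[big].append(small)
per_big = {big: smalls[:10] for big, smalls in groups.items()}

print(bigs_seen)
print({big: len(smalls) for big, smalls in per_big.items()})
```

Note that the grouping loop still touches every row before slicing, which is the expansion cost that the lazy per-row subquery approach avoids.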

Optimizing Cypher Query

I am currently starting to work with Neo4j and its query language Cypher.
I have multiple queries that follow the same pattern.
I am doing some comparison between a SQL database and Neo4j.
In my Neo4j database I have one type of label (person) and one type of relationship (FRIENDSHIP). The person has the properties personID, name, email, phone.
Now I want to find the friends of the n-th degree. I also want to filter out those persons that are also friends of a lower degree.
For example, if I want to search for the friends of the 3rd degree, I want to filter out those that are also friends of the first and/or second degree.
Here my query type:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP*3]-(friends:person)
WHERE NOT (me:person)-[:FRIENDSHIP]-(friends:person)
AND NOT (me:person)-[:FRIENDSHIP*2]-(friends:person)
RETURN COUNT(DISTINCT friends);
I found something similar somewhere.
This query works.
My problem is that this pattern of query is much too slow if I search for a higher degree of friendship and/or if the number of persons grows.
So I would really appreciate it if someone could help me optimize this.
If you just wanted to handle depths of 3, this should return the distinct nodes that are 3 degrees away but not also less than 3 degrees away:
MATCH (me:person {personID:'1'})-[:FRIENDSHIP]-(f1:person)-[:FRIENDSHIP]-(f2:person)-[:FRIENDSHIP]-(f3:person)
RETURN apoc.coll.subtract(COLLECT(f3), COLLECT(f1) + COLLECT(f2) + me) AS result;
The above query uses the APOC function apoc.coll.subtract to remove the unwanted nodes from the result. The function also makes sure the collection contains distinct elements.
The following query is more general, and should work for any given depth (by just replacing the number after *). For example, this query will work with a depth of 4:
MATCH p=(me:person {personID:'1'})-[:FRIENDSHIP*4]-(:person)
WITH NODES(p)[0..-1] AS priors, LAST(NODES(p)) AS candidate
UNWIND priors AS prior
RETURN apoc.coll.subtract(COLLECT(DISTINCT candidate), COLLECT(DISTINCT prior)) AS result;
The problem with Cypher's variable-length relationship matching is that it's looking for all possible paths to that depth. This can cause unnecessary performance issues when all you're interested in are the nodes at certain depths and not the paths to them.
APOC's path expander using 'NODE_GLOBAL' uniqueness is a more efficient means of matching to nodes at inclusive depths.
When using 'NODE_GLOBAL' uniqueness, nodes are only ever visited once during traversal. Because of this, when we set the path expander's minLevel and maxLevel to be the same, the results are nodes at that level that are not present at any lower level, which is exactly the result you're trying to get.
Try this query after installing APOC:
MATCH (me:person {personID:'1'})
CALL apoc.path.expandConfig(me, {uniqueness:'NODE_GLOBAL', minLevel:4, maxLevel:4}) YIELD path
// a single path for each node at depth 4 but not at any lower depth
RETURN COUNT(path)
Of course you'll want to parameterize your inputs (personID, level) when you get the chance.
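The idea behind visiting each node only once can be sketched as a plain breadth-first search in Python. This is an illustration of the principle, assuming breadth-first expansion; it is not APOC's implementation, and the friends adjacency map and the nodes_at_exact_depth helper are hypothetical.

```python
def nodes_at_exact_depth(graph, start, depth):
    """Return the nodes whose shortest distance from start equals depth."""
    visited = {start}      # every node is visited at most once
    frontier = [start]     # nodes at the current level
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, ()):
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return set(frontier)


# Hypothetical undirected FRIENDSHIP graph, keyed by personID.
friends = {
    "1": ["2", "3"],
    "2": ["1", "3", "4"],
    "3": ["1", "2", "5"],
    "4": ["2", "6"],
    "5": ["3", "6"],
    "6": ["4", "5", "7"],
    "7": ["6"],
}

print(nodes_at_exact_depth(friends, "1", 3))
```

Because a node is marked as visited the first time it is reached, anyone reachable in fewer hops never shows up at a deeper level, so no separate subtraction step is needed. The work is proportional to the nodes and relationships reached rather than to the number of distinct paths.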

Neo4j query performance with keyword distinct

I have a naive question regarding the use of the keyword DISTINCT. So basically I have a graph (User-[Likes]->Item) with millions of nodes. I want to find distinct users that like a certain item. The following two queries have a significant performance difference, and I'm confused. I did create indexes on :Item(id) and :User(id).
Query 1:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return count(distinct u);
Query 2:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return distinct u;
The first query returns a result in seconds, but the second query kept running for over 5 minutes until I lost patience and stopped it. I thought the second query would be faster than the first since there is no count aggregation operation, so I don't understand the performance difference.
Your first query is only returning the count of distinct values which is an easy job for neo4j.
Whereas your second query is returning all the nodes which are distinct, if your database has too many distinct values it would take a long time. If you just want to have a glimpse at a few distinct values you may add a limit to your query.
Eg:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return distinct u
limit 5;
It returns five arbitrary users who like the Item('001').

What does the WITH clause do? Neo4j

I don't understand what the WITH clause does in Neo4j. I read The Neo4j Manual v2.2.2 but it is not quite clear about the WITH clause. There are not many examples. For example, I have the following graph where the blue nodes are football teams and the yellow ones are their stadiums.
I want to find stadiums where two or more teams play. I found this query and it works.
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with a, count(*) as foaf
where foaf > 1
return a
count(*) tells us the number of matching rows. But I don't understand what the WITH clause does.
WITH allows you to pass on data from one part of the query to the next. Whatever you list in WITH will be available in the next query part.
You can use aggregation, SKIP, LIMIT, ORDER BY with WITH much like in RETURN.
The only difference is that your expressions have to get an alias with AS alias to be able to access them in later query parts.
That means you can chain query parts where one computes some data and the next query part can use that computed data. In your case it is what GROUP BY and HAVING would be in SQL but WITH is much more powerful than that.
Here is another example:
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with distinct a
order by a.name limit 10
match (a)-[:IN_CITY]->(c:City)
return c.name
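The chaining in that second example can be pictured as two stages, where the second stage sees only what the first one passed on. A rough Python sketch of the idea follows; the team, stadium, and city values are hypothetical sample data.

```python
# Hypothetical rows standing in for (n:Team)-[:PLAYS]->(a:Stadium) matches.
plays = [
    ("Inter", "San Siro"),
    ("Milan", "San Siro"),
    ("Roma", "Olimpico"),
    ("Lazio", "Olimpico"),
    ("Juventus", "Allianz Stadium"),
]
# Hypothetical (a:Stadium)-[:IN_CITY]->(c:City) relationships.
in_city = {"San Siro": "Milan", "Olimpico": "Rome", "Allianz Stadium": "Turin"}

# Query part 1: MATCH ... WITH DISTINCT a ORDER BY a.name LIMIT 10
# Only the stadiums are passed on; the team variable is dropped here.
stadiums = sorted({stadium for _, stadium in plays})[:10]

# Query part 2: MATCH (a)-[:IN_CITY]->(c:City) RETURN c.name
# This stage works purely on the output of part 1.
cities = [in_city[stadium] for stadium in stadiums]

print(stadiums)
print(cities)
```

Anything not listed in WITH (here, the teams) is no longer in scope for the rest of the query.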
