What does the WITH clause do in Neo4j?

I don't understand what the WITH clause does in Neo4j. I read The Neo4j Manual v2.2.2, but it is not very clear about the WITH clause, and there are not many examples. For example, I have the following graph, where the blue nodes are football teams and the yellow ones are their stadiums.
I want to find stadiums where two or more teams play. I found this query, and it works:
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with a, count(*) as foaf
where foaf > 1
return a
count(*) tells us the number of matching rows, but I don't understand what the WITH clause does.

WITH allows you to pass data on from one part of the query to the next. Whatever you list in WITH will be available in the next query part, and anything you do not list is no longer in scope.
You can use aggregation, SKIP, LIMIT and ORDER BY with WITH, much like in RETURN.
The only difference is that expressions other than plain variables must be given an alias (AS alias) so that later query parts can refer to them.
That means you can chain query parts, where one part computes some data and the next part uses that computed data. In your case, it does what GROUP BY and HAVING would do in SQL, but WITH is much more powerful than that.
Here is another example:
match (n:Team) -[r1:PLAYS]->(a:Stadium)
with distinct a
order by a.name limit 10
match (a)-[:IN_CITY]->(c:City)
return c.name
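To make the GROUP BY / HAVING comparison concrete, here is a rough Python analogue of the first query (the one with count(*)). The team and stadium names are made up for illustration, and the sketch only mimics how rows flow through WITH; it says nothing about how Neo4j actually executes the query.

```python
from collections import Counter

# Hypothetical rows produced by MATCH (n:Team)-[:PLAYS]->(a:Stadium):
# one (team, stadium) pair per matched path.
rows = [
    ("Inter", "San Siro"),
    ("Milan", "San Siro"),
    ("Juventus", "Allianz Stadium"),
    ("Roma", "Olimpico"),
    ("Lazio", "Olimpico"),
]

# WITH a, count(*) AS foaf
# a is the grouping key; count(*) counts the rows in each group.
foaf = Counter(stadium for _team, stadium in rows)

# WHERE foaf > 1
# Filter on the aggregated value (the HAVING step in SQL).
shared = sorted(stadium for stadium, n in foaf.items() if n > 1)

# RETURN a
print(shared)
```

Only the stadium and its count survive the WITH step; the team variable is dropped because it was not listed.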

Related

Neo4j query performance with keyword distinct

I have a naive question regarding the use of the keyword DISTINCT. Basically, I have a graph (User-[Likes]->Item) with millions of nodes, and I want to find the distinct users that like a certain item. The following two queries have a significant performance difference, and I'm confused. I did create indexes on :Item(id) and :User(id).
Query 1:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return count(distinct u);
Query 2:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return distinct u;
The first query returns a result in seconds, but the second query kept running for over 5 minutes until I lost patience and stopped it. I thought the second query would be faster than the first, since there is no count aggregation, so I don't understand the performance difference.
Your first query returns only the count of distinct values, which is an easy job for Neo4j.
Your second query, by contrast, returns every distinct node; if your database has many of them, that takes a long time. If you just want a glimpse of a few distinct values, you can add a LIMIT to your query.
E.g.:
profile match (a:Item {id:'001'})<-[:LIKES]-(u:User)
return distinct u
limit 5;
It returns five arbitrary users who like Item '001'.

Explanation of WITH and COLLECT in following duplicate finding query

I was recently researching how to find duplicate nodes by a property and found the following results, which provided a very effective solution:
neo4j find all nodes with matching properties
An effective way to lookup duplicate nodes in Neo4j 1.8?
Since I'm using Neo4j v2.2.3 Community, I used the following style:
match (n:Label) with n.prop as prop, collect(n) as nodelist, count(*) as count where count > 1 return prop, nodelist, count
I'm having trouble understanding how this works. I've spent my career using relational databases and just don't get the grouping mechanism, which is obviously there since I got a list of nodes and their respective count.
Can someone please explain how this works or provide a reference to an explanation?
Here's the relevant documentation on aggregation in cypher.
The important bit is this:
Aggregate functions take multiple input values and calculate an
aggregated value from them. Examples are avg that calculates the
average of multiple numeric values, or min that finds the smallest
numeric value in a set of values.
Aggregation can be done over all the matching subgraphs, or it can be
further divided by introducing key values. These are non-aggregate
expressions, that are used to group the values going into the
aggregate functions.
So, if the return statement looks something like this:
RETURN n, count(*)
We have two return expressions: n, and count(*). The first, n, is no
aggregate function, and so it will be the grouping key. The latter,
count(*) is an aggregate expression. So the matching subgraphs will be
divided into different buckets, depending on the grouping key. The
aggregate function will then run on these buckets, calculating the
aggregate values.
If you want to use aggregations to sort your result set, the
aggregation must be included in the RETURN to be used in your ORDER
BY.
Your query is reasonably straightforward:
match (n:Label)
with n.prop as prop,
collect(n) as nodelist,
count(*) as count
where count > 1
return prop, nodelist, count
You're just doing your aggregation in your WITH clause instead of in your RETURN clause. Both collect(n) and count(*) are aggregate functions, while n.prop is the only non-aggregate expression, so prop is the grouping key. The matched nodes are divided into buckets by their prop value; for each bucket, collect(n) builds the list of nodes and count(*) counts them. The WHERE count > 1 filter then keeps only the property values shared by more than one node, which is what you want when looking for duplicates.
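If it helps to see the bucketing spelled out, here is a minimal Python sketch of the same grouping. The node ids and property values are invented for illustration, and the code is only an analogy for the grouping-key rule quoted above.

```python
from collections import defaultdict

# Hypothetical (node_id, prop) rows standing in for MATCH (n:Label).
nodes = [(1, "alice"), (2, "bob"), (3, "alice"), (4, "carol"), (5, "bob"), (6, "alice")]

# WITH n.prop AS prop, collect(n) AS nodelist, count(*) AS count
# prop is the only non-aggregate expression, so it is the grouping key.
buckets = defaultdict(list)
for node_id, prop in nodes:
    buckets[prop].append(node_id)  # collect(n) within the bucket for this prop

# WHERE count > 1, then RETURN prop, nodelist, count
duplicates = [
    (prop, nodelist, len(nodelist))
    for prop, nodelist in buckets.items()
    if len(nodelist) > 1
]
print(duplicates)
```

Each distinct prop value yields one output row, which is the same behaviour as GROUP BY prop in a relational query.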

Is there anything like a "do while" match pattern that satisfies an aggregated value? (properties etc.)

I don't know if this makes sense using Cypher or graph traversal, but I was trying to do a sort of "shortest path" query based not on weighted relationships but on aggregated properties.
Assume I have nodes labeled People, and they all visit different homepages via a VISITS relationship to the homepage node. Each homepage node has hit stats depending on its popularity. Now I would like to match people that have a VISITS relationship to a homepage until I reach a maximum of X exposure (hits).
Why? Because then I know an "expected" exposure strategy for a certain group of people.
Something like:
Do
MATCH (n:People)-[:VISITS]-(sites)
while (reduce (x)<100000)
Of course, this "do while" is nothing I have seen in the Cypher syntax, but wouldn't it be useful? Or should this be done at the application level, by just returning a list in DESC order and doing the math in the application? Maybe it should also be combined with some fallback case if the loop can't be satisfied.
MATCH (n:People)-[:VISITS]-(sites)
WITH reduce(hits=0, x IN collect(sites.dailyhits)| hits + x) AS totalhits
RETURN totalhits;
This returns the correct aggregated hits value (over all sites), but I would like the function to run over each matched pattern until it reaches a given value and then return the match. (Of course I would miss other possible combinations of pages, because the match never traverses the entire graph, but at least I would get a list of pages that meets the requirement, if that makes sense.)
Thanks!
Not sure how you'd aggregate, but there are several aggregation functions (avg, sum, etc). And... you can pass these to a 2nd part of the cypher query, with a WITH clause.
That said: Cypher also supports the ability to sort a result (ORDER BY), and the ability to limit the number of results given (LIMIT). I don't know what you'd sort by, but... just for fun, let's sort it arbitrarily on something:
MATCH (n:People)-[v:VISITS]->(site:Site)
WHERE site.url= "http://somename.com"
RETURN n
ORDER BY v.VisitCount DESC
LIMIT 1000
This would cap your return set at 1,000 people, for people who visit a given site.
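The application-level approach the question mentions (return a descending list and do the math in the application) could look roughly like the sketch below. It assumes the query has already returned (site, dailyhits) rows; the site names, the numbers and the helper name sites_until are all hypothetical.

```python
from itertools import accumulate


def sites_until(sites, max_hits):
    """Pick the most popular sites while the running total of hits stays within max_hits."""
    # Equivalent of ORDER BY dailyhits DESC on the returned rows.
    ranked = sorted(sites, key=lambda s: s[1], reverse=True)
    # Running sum of hits, one value per ranked site.
    running = accumulate(hits for _name, hits in ranked)
    picked = []
    for site, total in zip(ranked, running):
        if total > max_hits:
            break  # the "while" condition no longer holds
        picked.append(site)
    return picked


# Hypothetical (site, dailyhits) rows returned by the query.
sites = [("a.example", 60000), ("b.example", 30000), ("c.example", 25000), ("d.example", 5000)]
print(sites_until(sites, 100000))
```

This greedy cut stops at the first site that would exceed the limit, so it can miss combinations of smaller sites that would still fit, which is the trade-off the question already anticipates.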

How to paginate query results with cypher?

Is it possible to paginate a Cypher query? For instance, I have a list of products, but I don't want to display/retrieve/cache all the results, as there can be a lot of them.
I'm looking for something similar to OFFSET / LIMIT in SQL.
Is Cypher's SKIP + LIMIT + ORDER BY a good option? http://docs.neo4j.org/chunked/stable/query-skip.html
SKIP and LIMIT combined is indeed the way to go. Using ORDER BY inevitably makes Cypher scan every node that is relevant to your query; the same goes for a WHERE clause. Performance should not be that bad, though.
It's like normal SQL; the syntax is as follows:
match (user:USER_PROFILE)-[:USAGE]->(uUsage)
where HAS(uUsage.impressionsPerHour) AND (uUsage.impressionsPerHour > 100)
RETURN user
ORDER BY user.hashID
SKIP 10
LIMIT 10;
This syntax works with the 2.x versions.
Neo4j apparently uses "index-backed ORDER BY" nowadays, which means that if your SKIP/LIMIT query orders by an indexed node property, Neo4j will not perform a full scan of all "relevant nodes" as others have mentioned (their responses were written long ago, so keep that in mind). The index lets Neo4j take advantage of the fact that it already stores indexed properties in sorted order, so your pagination will be even faster than without the index.
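For readers coming from SQL, the effect of ORDER BY followed by SKIP and LIMIT can be pictured as sorting and then slicing. The following Python sketch uses an invented product list and a hypothetical helper named page; it illustrates the semantics only, not Neo4j's execution.

```python
def page(rows, skip, limit, key):
    """Emulate ORDER BY key SKIP skip LIMIT limit on an in-memory list."""
    return sorted(rows, key=key)[skip:skip + limit]


# Hypothetical result rows: 25 products named product-00 .. product-24.
products = [{"name": f"product-{i:02d}"} for i in range(25)]

# Second page of 10 items, ordered by name (SKIP 10 LIMIT 10).
second_page = page(products, skip=10, limit=10, key=lambda p: p["name"])
print([p["name"] for p in second_page])
```

The sort step is what makes the pages stable: without an ORDER BY, consecutive SKIP/LIMIT queries are not guaranteed to see rows in the same order.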

efficiency of where clause in cypher vs match

I'm trying to find 10 posts that were not LIKED by user "mike" using Cypher. Will a WHERE clause with a NOT on the relationship be more efficient than matching an optional relationship and then checking whether that relationship is null in the WHERE clause? Specifically, I want to make sure it won't do the equivalent of a full table scan, and that this is a scalable query.
Here's what I'm using:
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
WHERE not (user-[:LIKES]->posts)
RETURN posts SKIP 20 LIMIT 10;
Or can I filter on an optional relationship in MATCH instead?
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
MATCH user-[r?:LIKES]->posts
WHERE r IS NULL
RETURN posts SKIP 100 LIMIT 10;
Some quick tests on the console seem to show faster performance with the 2nd approach. Am I right to assume the 2nd query is faster? And if so, why?
I think that in the first query the engine runs through all postId nodes and checks the condition not (user-[:LIKES]->posts) for each one.
In the second example (assuming you use at least v1.9.02), the engine picks up only the post nodes that actually aren't connected to the user. This is just an optimization whereby the engine does not go through all postId nodes.
If possible, always use the MATCH clause in your queries instead of WHERE, and try to omit the asterisk in declarations such as START n=node:index('name:*').

Resources