Neo4j Cypher: remove duplicates from a simple query that contains ordering

I'm very new to Neo4j and I can't get this simple query to work.
The data I have looks like this:
(a)-[:likes]->(b)
(a)-[:likes]->(c)
Now I'd like to extract a list of everyone who likes someone else.
I tried:
match (u)-[:likes]->(p) return u order by p.id desc;
This gives me a duplicate of (a).
I tried using distinct:
match (u)-[:likes]->(p) return distinct u order by p.id desc;
This gives me 'variable p undefined'.
I know that if I drop the ordering, distinct works and gives me (a) once.
But how can I use distinct and order by at the same time?

Consider why your query isn't working:
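Without DISTINCT, you get one row for each pairing of u and p. With DISTINCT, those rows collapse into one row per u, so p is no longer available to ORDER BY (hence the 'variable p undefined' error). Even if it were, how should a single u row be ordered when it matched several p's with different ids? There is no single p.id to sort on.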
If you change it to order by u.id instead, then it works just fine.
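If the order really does need to depend on the liked nodes, one option is to aggregate them so that each u is left with a single sort key. A minimal sketch, assuming that sorting each u by the highest id among the nodes it likes is what is wanted (the alias maxLikedId is made up for illustration):

```cypher
// Grouping by u yields one row per u, so DISTINCT is not needed;
// max(p.id) gives each row a single, well-defined sort key.
MATCH (u)-[:likes]->(p)
RETURN u, max(p.id) AS maxLikedId
ORDER BY maxLikedId DESC
```

Swapping max for min (or another aggregate) changes which liked node decides the position of u.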
I do encourage you to use labels, by the way, to restrict your query only to relevant nodes. You can also rework your query to prevent it from emitting duplicates and avoid the need for DISTINCT completely.
If we assume the nodes you're interested in are labeled with :Person, your query might be:
MATCH (p:Person)
WHERE EXISTS( (p)-[:likes]->() )
RETURN p ORDER BY p.id DESC
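The check for "likes someone" can also be written without the EXISTS() function. Two alternatives are sketched below, still assuming the :Person label and using an outgoing direction, since the question asks for people who like someone rather than people who are liked. The second form needs a Neo4j version that supports existential subqueries (4.0 or later, as far as I know).

```cypher
// Pattern predicate directly in WHERE: keeps each Person that has
// at least one outgoing :likes relationship, one row per Person.
MATCH (p:Person)
WHERE (p)-[:likes]->()
RETURN p ORDER BY p.id DESC;

// Existential subquery form of the same filter.
MATCH (p:Person)
WHERE EXISTS { MATCH (p)-[:likes]->() }
RETURN p ORDER BY p.id DESC;
```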

Related

What is the real benefit of Cypher's WITH clause?

I am following Neo4j's 'Intermediate Cypher Queries' course in the neo4j graph academy, and I've been introduced to the WITH clause, whose basic function is to define or re-define the scope of variables. Now for some reason I can't quite wrap my head around the use of the WITH clause with the examples they have given, especially in regard to pipelining. For example, in one of the exercises I am told to use WITH to aggregate intermediate results. Here is the correct answer:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[r:RATED]-(:User)
WHERE p.name = 'Tom Hanks'
WITH m, avg(r.rating) AS avgRating
RETURN m.title AS Movie, avgRating AS `AverageRating`
ORDER BY avgRating DESC
But, to my mind, the WITH clause doesn't really do much work. To convince myself, I re-wrote the query to get the same result without the WITH clause:
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[r:RATED]-(:User)
WHERE p.name = 'Tom Hanks'
RETURN m.title AS Movie, avg(r.rating) AS `AverageRating`
ORDER BY avg(r.rating) DESC
This works fine, with one less line of code. Perhaps the issue is just the example, and in much longer queries the 'WITH method' would come into its own. But, as it stands, I can't fully account for the real use of WITH. For example, they talk about pipelining results, but we specified 'm' right at the start in the MATCH clause, so why bother having a WITH clause with the 'm' variable in it again? As for 'avg(r.rating)', it just seems like we're wasting time renaming the result of a query when this is something we can do at the end, as I have done. So, what's really going on here? Can someone enlighten me?
The WITH clause is helpful when you want to do intermediate aggregations or do several aggregations in sequence. You could also do intermediate filtering. Think of it as an option to manipulate/transform data in the middle of a query statement.
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[r:RATED]-(:User)
WITH m, avg(r.rating) AS avgRating
WHERE avgRating > 8
RETURN m.title AS Movie, avgRating AS `AverageRating`
ORDER BY avgRating DESC
Here is one example where intermediate aggregation is combined with filtering. This wouldn't be possible without a WITH clause: the average rating has to be calculated before it can be filtered on, and RETURN has no WHERE of its own to filter results.
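WITH also allows sorting and limiting in the middle of a query and then continuing to match from the reduced set, which a single RETURN cannot do. A sketch, assuming the dataset also has a DIRECTED relationship from Person to Movie (that relationship and the cut-off of 3 are assumptions for illustration):

```cypher
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)<-[r:RATED]-(:User)
WHERE p.name = 'Tom Hanks'
// Stage 1: aggregate, then keep only the three best-rated movies.
WITH m, avg(r.rating) AS avgRating
ORDER BY avgRating DESC
LIMIT 3
// Stage 2: continue matching from just those three movies.
MATCH (m)<-[:DIRECTED]-(d:Person)
RETURN m.title AS Movie, avgRating, collect(d.name) AS Directors
ORDER BY avgRating DESC
```

Here 'm' and 'avgRating' are the only variables carried into the second stage. Movies without a matching director drop out of the result, so OPTIONAL MATCH may be preferable in that position.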

Neo4j count Query

match(m:master_node:Application)-[r]-(k:master_node:Server)-[r1]-(n:master_node)
where (m.name contains '' and (n:master_node:DeploymentUnit or n:master_node:Schema))
return distinct m.name,n.name
Hi, I am trying to get the total number of records for the above query. How can I change the query to use the count function and get the record count directly?
Thanks in advance
The following query uses the aggregating function COUNT. Distinct pairs of m.name, n.name values are used as the "grouping keys".
MATCH (m:master_node:Application)--(:master_node:Server)--(n:master_node)
WHERE EXISTS(m.name) AND (n:DeploymentUnit OR n:Schema)
RETURN m.name, n.name, COUNT(*) AS cnt
I assume that m.name contains '' in your query was an attempt to test for the existence of m.name. This query uses the EXISTS() function to test that more efficiently.
[UPDATE]
To determine the number of distinct n and m pairs in the DB (instead of the number of times each pair appears in the DB):
MATCH (m:master_node:Application)--(:master_node:Server)--(n:master_node)
WHERE EXISTS(m.name) AND (n:DeploymentUnit OR n:Schema)
WITH DISTINCT m.name AS n1, n.name AS n2
RETURN COUNT(*) AS cnt
Some things to consider for speeding up the query even further:
Remove unnecessary label tests from the MATCH pattern. For example, can we omit the master_node label test from any nodes? In fact, can we omit all label testing for any nodes without affecting the validity of the result? (You will likely need a label on at least one node, though, to avoid scanning all nodes when kicking off the query.)
Can you add a direction to each relationship (to avoid having to traverse relationships in both directions)?
Specify the relationship types in the MATCH pattern. This will filter out unwanted paths earlier. Once you do so, you may also be able to remove some node labels from the pattern as long as you can still get the same result.
Use the PROFILE clause to evaluate the number of DB hits needed by different Cypher queries.
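Put together, the suggestions above might look like the following sketch. The relationship types RUNS_ON and HOSTS and their directions are hypothetical placeholders, since the question does not show the real ones. Dropping the master_node label is only safe if every Application and Server node carries it anyway, and m.name IS NOT NULL stands in for the EXISTS(m.name) test.

```cypher
// PROFILE runs the query and reports DB hits per operator.
// RUNS_ON and HOSTS are made-up relationship types for illustration.
PROFILE
MATCH (m:Application)-[:RUNS_ON]->(:Server)-[:HOSTS]->(n)
WHERE m.name IS NOT NULL AND (n:DeploymentUnit OR n:Schema)
WITH DISTINCT m.name AS appName, n.name AS targetName
RETURN count(*) AS cnt
```

Comparing the DB hits of this variant with those of the original pattern shows whether each change actually helps on the data at hand.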
You can find examples of how to use count in the Neo4j docs.
In your case, the first example there, where count(*) is used to return a count of each returned item, should work.

aggregated frequency count in neo4j

I'd like to write a cypher query which will tell me how frequently a particular node property occurs in a set of matches. For example, in
MATCH (:left)-->(p:right)
I'd like to know how many times the right nodes p.id are "id 1" or "id 2" and so on.
Currently I'm returning all the matches and then (using a separate tool - python) counting the number of times each id occurs in the records.
I'm sure there must be a way to do this purely in cypher using DISTINCT, collect() and count(), but I've got myself stuck...
I think the query you are looking for is:
MATCH (:left)-->(p:right)
RETURN p.id, count(*)
Cheers
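Note that count(*) and count(DISTINCT p) answer different questions here: the first counts every matched path, the second counts distinct right nodes. A sketch that returns both per id so the difference is visible (the aliases are made up for illustration):

```cypher
MATCH (:left)-->(p:right)
RETURN p.id AS id,
       count(*) AS matches,           // one per matched (:left)-->(p) path
       count(DISTINCT p) AS nodes     // distinct :right nodes with this id
ORDER BY matches DESC
```

If ids are unique per node, nodes is always 1 and matches is the frequency the question asks for.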

Neo4j -- WHERE, ORDER BY, and LIMIT after UNION

I'm trying to do a query that involves a UNION, but filters with a WHERE, ORDER BY, and LIMIT after the union.
The basic idea is to find all posts STARRED or POSTED by users that another user FOLLOWS. For example, the posts s and p are the posts of interest.
MATCH (a:USER {id:0})-[:FOLLOWS]->(b:USER),
(b)-[:STARRED]->(s:POST),
(b)-[:POSTED]->(p:POST)
I'd like to return the union of the id property of both s and p after filtering, sorting, and limiting the results. Any relevant indexes to create that make this query efficient would be helpful as well.
If u is the union of s and p, I'd want to do something like:
WHERE u.time > 1431546036148
RETURN u.id ORDER BY u.time SKIP 0 LIMIT 20
I don't know how to get u from s and p, and I don't know what indexes to create to make this query efficient.
You can match on multiple relationship types, so you won't need a UNION.
I assume the time property is on the POST node:
MATCH (user:USER {id:0})-[:FOLLOWS]->(friend:USER)
MATCH (friend)-[:STARRED|:POSTED]->(p:POST)
WHERE p.time > 1431546036148
RETURN p
ORDER BY p.time
LIMIT 25
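To get closer to what the question asks for (ids only, no duplicates when a post is both starred and posted by followed users, 20 rows) and to cover the index part, something along these lines may work. The index names are made up, and the CREATE INDEX ... FOR syntax assumes Neo4j 4.x or later; older versions use CREATE INDEX ON :POST(time) instead.

```cypher
// Schema commands: run these once, separately from the query.
CREATE INDEX user_id FOR (u:USER) ON (u.id);
CREATE INDEX post_time FOR (p:POST) ON (p.time);

// time is returned alongside id so that ORDER BY can still see it
// after DISTINCT has been applied.
MATCH (:USER {id: 0})-[:FOLLOWS]->(friend:USER)
MATCH (friend)-[:STARRED|POSTED]->(p:POST)
WHERE p.time > 1431546036148
RETURN DISTINCT p.id AS id, p.time AS time
ORDER BY time
LIMIT 20;
```

The index on USER(id) speeds up finding the starting user. Whether the planner actually uses the index on POST(time) depends on the data, and PROFILE will show it.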

Select nodes that has all relationships in Neo4j

Suppose I have two kinds of nodes, Person and Competency. They are related by a KNOWS relationship. For example:
(:Person {id: 'thiago'})-[:KNOWS]->(:Competency {id: 'neo4j'})
How do I query this schema to find every Person that knows all nodes in a given set of Competency nodes?
Suppose that I need to find every Person that knows both "java" and "haskell"; I'm only interested in the people who know all of the listed Competency nodes.
I've tried this query:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id;
But I get back a list of every Person that knows either "java" or "haskell", with duplicate entries for those who know both.
Adding a count(c) at the end of the query eliminates the duplicates:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id, count(c);
Then, in this particular case, I can iterate over the result and filter out rows where the count is less than two to get the nodes I want.
I've found that I could also do it by appending consecutive match clauses that keep filtering the nodes, in this case:
match (p:Person)-[:KNOWS]->(:Competency {id:'haskell'})
match (p)-[:KNOWS]->(:Competency {id:'java'})
return p.id;
Is this the only way to express this query? Do I really need to build the query by concatenating strings? I'm looking for a fixed query with parameters.
with ['java','haskell'] as skills
match (p:Person)-[:KNOWS]->(c:Competency)
where c.id in skills
with p, count(*) as c1, size(skills) as c2
where c1 = c2
return p.id
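For the fixed, parameterized query the question asks for, the list can be passed in as a parameter instead of being written into the query text. A sketch, assuming a list parameter named $skills (a name chosen here for illustration) that contains no duplicate ids:

```cypher
// $skills is supplied by the client, e.g. ['java', 'haskell'].
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.id IN $skills
// count(DISTINCT c) guards against duplicate KNOWS relationships.
WITH p, count(DISTINCT c) AS known
WHERE known = size($skills)
RETURN p.id
```

The query text stays the same however many competencies are requested; only the parameter value changes.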
One thing you can do is count the total number of skills, then find the users whose number of skill relationships equals that total:
MATCH (n:Skill) WITH count(n) as skillMax
MATCH (u:Person)-[:HAS]->(s:Skill)
WITH u, count(s) as skillsCount, skillMax
WHERE skillsCount = skillMax
RETURN u, skillsCount
Chris
Untested, but this might do the trick:
match (p:Person)-[:KNOWS]->(c:Competency)
with p, collect(c.id) as cs
where all(x in ['java', 'haskell'] where x in cs)
return p.id;
How about this...
WITH ['java','haskell'] AS comp_col
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.id IN comp_col
WITH comp_col, p, count(*) AS total
WHERE total = size(comp_col)
RETURN p.id, total
Put the competencies you want in a collection.
Match all the people that have any of those competencies.
Count the competencies per person and keep the people whose count equals the size of the competency collection from the start.
I think this will work for what you need, but if you are building these queries programmatically, the best performance might come from successive match clauses. Especially if you know which competencies are most and least common when building your queries, you could order the matches so that the least common come first and the most common last. I think that would narrow down to your desired persons the fastest.
It would be interesting to see what the plan analyzer in the shell says about the different approaches.

Resources