Difference between the results of two independent queries neo4j - neo4j

I have 2 subqueries that returns 2 sets of users (each query return one set of users)
First query :
MATCH (:User {user_id: "69b3315a-ba4a-4021-94e1-0f494f9b957f"})-->(first_set_of_users)
RETURN first_set_of_users
Second query :
MATCH (:User {user_id: "69b3315a-ba4a-4021-94e1-0f494f9b957f"})<-[:LIKES]-(likers)-[:LIKES]->(v)
WITH DISTINCT v
MATCH (second_set_of_users)-[:LIKES]->(v)
RETURN second_set_of_users, COUNT(*) AS recoWeight
ORDER BY recoWeight DESC
What I want to finally return is all users from second_set_of_users minus the one in first_set_of_users and ORDER BY recoWeight DESC
How can I do that in just one query ? Everything I tried led to cartesian products of queries and took forever while each independent query takes less than a second.

MATCH (:User {user_id: "69b3315a-ba4a-4021-94e1-0f494f9b957f"})-->(first_set_of_users)
WITH collect(first_set_of_users) AS list_of_first_set_of_users
MATCH (:User {user_id: "69b3315a-ba4a-4021-94e1-0f494f9b957f"})<-[:LIKES]-(likers)-[:LIKES]->(v)
WITH DISTINCT v, list_of_first_set_of_users
MATCH (second_set_of_users)-[:LIKES]->(v)
WITH second_set_of_users, COUNT(*) AS recoWeight
WHERE NOT second_set_of_users IN list_of_first_set_of_users
RETURN second_set_of_users, recoWeight
ORDER BY recoWeight DESC
Explanation.
Using WITH clause we could pass the result of the first query into the second query.
And then using WHERE NOT IN we could filter the result of the second query.

Related

why DISTINCT is needed in this Cypher query?

The below query is taken from neo4j movie review dataset sandbox:
MATCH (u:User {name: "Some User"})-[r:RATED]->(m:Movie)
WITH u, avg(r.rating) AS mean
MATCH (u)-[r:RATED]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WHERE r.rating > mean
WITH u, g, COUNT(*) AS score
MATCH (g)<-[:IN_GENRE]-(rec:Movie)
WHERE NOT EXISTS((u)-[:RATED]->(rec))
RETURN rec.title AS recommendation, rec.year AS year, COLLECT(DISTINCT g.name) AS genres, SUM(score) AS sscore
ORDER BY sscore DESC LIMIT 10
what I can not understand is: why the DISTINCT keyword is required in the query's return statement?. Because the expected results from the last MATCH statement is something like this:
g1,x
g1,y
...
g2,z
g2,v
g2,m
...
gn,m
gn,b
gn,x
where g1,g2,..gn are the set of genres and x,y,z,v,m,b... are a set of movies (in addition there is a user and score column deleted for readability).
So according to my understanding what this query is returning: For each movie return its genres and the sum of their scores.
Assumptions:
Every Movie has a unique title. (This is required for the query to work as is.)
Every Genre has a unique name.
Every Movie has at most one IN_GENRE relationship to each distinct Genre.
Given the above assumptions, you are correct that the DISTINCT is not necessary. That is because the RETURN clause is using rec.title as one of the aggregation grouping keys.

Neo4J: query optimisation/model validation

New to Neo4J so apologies in advance if I am doing things horribly wrong. I am trying to show user articles in which they could be interested in based on the categories they have selected and tags they have liked independently.
My model in Neo4j is something like this
(:USER)-[:LIKES]->(:TAG)
(:ARTICLE)-[:PUBLISHED_BY]->(:PROVIDER)
(:ARTICLE)-[:HAS_CATEGORY]->(:CATEGORY)
(:USER)-[:DISLIKES]-(:ARTICLE)
(:USER)-[:INTERESTED_IN]->(:CATEGORY)
When I try to run the following query to get the desired results...I get them but the query is taking 16-18 seconds to execute.
MATCH (u:USER {id: $userid})-[:LIKES]->(t:TAG)
WITH u,t, collect(t.name) as tags
UNWIND tags as tag with u,tag
MATCH (c:CATEGORY)<-[*]-(a:ARTICLE)-[pub:PUBLISHED_BY]->(p:PROVIDER)
WHERE a.keywords contains tag OR c.id in $categoryArray
AND NOT (u)-[:DISLIKES]->(a)
RETURN DISTINCT a.id AS id, a.title AS title, pub.pubDate
ORDER BY pub.pubDate DESC LIMIT 250
Is there a faster and better way to get the desired results?
Note: I am using Neo4j 3.4.1 version on ubuntu machine with page-cache: 512mb and MIN & MAX heap size: 1500mb
It would be better if in your model articles are connected to tags.
This bit: a.keywords contains tag is not index supported, so it will lead to a full scan.
Also, from categories to articles might be a long chain, so add a rel-type there and add an upper limit. It might be better to check found articles against categories.
MATCH (u:USER {id: $userid})-[:LIKES]->(tag:TAG)
MATCH (a:ARTICLE)-[:HAS_TAG]->(tag)
WITH distinct u, a
WHERE any(c IN categories WHERE NOT shortestPath((c)<-[:IN_CATEGORY*]-(a)) IS NULL)
AND NOT (u)-[:DISLIKES]->(a)
MATCH (a)-[pub:PUBLISHED_BY]->(p:PROVIDER)
RETURN DISTINCT a.id AS id, a.title AS title, pub.pubDate
ORDER BY pub.pubDate DESC LIMIT 250
Also check the query plan with PROFILE to see any bottlenecks or unindexed fields (you can expand the boxes with the double arrow in the lower right corner)
Thanks #Michael I understand that having tags as separate nodes related to articles would make the search faster but the following query has brought down the search time from 16-18 seconds to 3-4 seconds at the moment
MATCH (u:USER {id: $userId})-[:INTERESTED_IN]->(c:CATEGORY)<-[*]-(a:ARTICLE)[pub:PUBLISHED_BY]->(p:PROVIDER) WHERE NOT (u)-[:DISLIKES]->(a) RETURN DISTINCT a.id, a.title, pub.pubDate ORDER BY pub.pubDate DESC LIMIT 150 UNION MATCH (u:USER {id: $userId})-[:LIKES]->(t:TAG) WITH u, t, collect(t.name) AS tags UNWIND tags AS tag MATCH (a:ARTICLE)-[pub:PUBLISHED_BY]-(:PROVIDER) WHERE a.keywords CONTAINS tag AND NOT (u)-[:DISLIKES]->(a) RETURN DISTINCT a.id, a.title, pub.pubDate ORDER BY pub.pubDate DESC LIMIT 150

Get the full graph of a query in Neo4j

Suppose tha I have the default database Movies and I want to find the total number of people that have participated in each movie, no matter their role (i.e. including the actors, the producers, the directors e.t.c.)
I have already done that using the query:
MATCH (m:Movie)<-[r]-(n:Person)
WITH m, COUNT(n) as count_people
RETURN m, count_people
ORDER BY count_people DESC
LIMIT 3
Ok, I have included some extra options but that doesn't really matter in my actual question. From the above query, I will get 3 movies.
Q. How can I enrich the above query, so I can get a graph including all the relationships regarding these 3 movies (i.e.DIRECTED, ACTED_IN,PRODUCED e.t.c)?
I know that I can deploy all the relationships regarding each movie through the buttons on each movie node, but I would like to know whether I can do so through cypher.
Use additional optional match:
MATCH (m:Movie)<--(n:Person)
WITH m,
COUNT(n) as count_people
ORDER BY count_people DESC
LIMIT 3
OPTIONAL MATCH p = (m)-[r]-(RN) WHERE type(r) IN ['DIRECTED', 'ACTED_IN', 'PRODUCED']
RETURN m,
collect(p) as graphPaths,
count_people
ORDER BY count_people DESC

Limit the results of a union cypher query

Let's say we have the example query from the documentation:
MATCH (n:Actor)
RETURN n.name AS name
UNION
MATCH (n:Movie)
RETURN n.title AS name
I know that if I do that:
MATCH (n:Actor)
RETURN n.name AS name
LIMIT 5
UNION
MATCH (n:Movie)
RETURN n.title AS name
LIMIT 5
I can reduce the returned results of each sub query to 5.How can I LIMIT the total results of the union query?
This is not yet possible, but there is already an open neo4j issue that requests the ability to do post-UNION processing, which includes what you are asking about. You can add a comment to that neo4j issue if you support having it resolved.
This can be done using UNION post processing by rewriting the query using the COLLECT function and the UNWIND clause.
First we turn the columns of a result into a map (struct, hash, dictionary), to retain its structure. For each partial query we use the COLLECT to aggregate these maps into a list, which also reduces our row count (cardinality) to one (1) for the following MATCH. Combining the lists is a simple list concatenation with the “+” operator.
Once we have the complete list, we use UNWIND to transform it back into rows of maps. After this, we use the WITH clause to deconstruct the maps into columns again and perform operations like sorting, pagination, filtering or any other aggregation or operation.
The rewritten query will be as below:
MATCH (n:Actor)
with collect ({name: n.title}) as row
MATCH (n:Movie)
with row + collect({name: n.title}) as rows
unwind rows as row
with row.name as name
return name LIMIT 5
This is possible in 4.0.0
CALL {
MATCH (p:Person) RETURN p
UNION
MATCH (p:Person) RETURN p
}
RETURN p.name, p.age ORDER BY p.name
Read more about Post-union processing here https://neo4j.com/docs/cypher-manual/4.0/clauses/call-subquery/

Issue combining two cypher queries with UNION

I ran into the following problem when combining two cypher queries on console.neo4j.org
The query:
MATCH (p1:Crew)-[r_pfq]->(fq:Crew)
WHERE fq.name IN ["Neo", "Morpheus"]
RETURN distinct(p1) AS person, count(r_pfq) AS friend_score, collect(fq.name) AS friends
ORDER BY friend_score DESC
LIMIT 10
works fine, as does
MATCH (f:Crew)<-[r_fqf]-(fq:Crew)
WHERE fq.name IN ["Neo", "Morpheus"]
WITH distinct(f), count(r_fqf) AS weight
ORDER BY weight DESC
LIMIT 10
MATCH f<--(p:Crew)
RETURN distinct(p) AS person, sum(weight) AS friend_score, collect(f.name) AS friends
ORDER BY friend_score DESC
LIMIT 10
Now when I try to combine the query results using the UNION command, i.e.
MATCH (p1:Crew)-[r_pfq]->(fq:Crew)
WHERE fq.name IN ["Neo", "Morpheus"]
RETURN distinct(p1) AS person, count(r_pfq) AS friend_score, collect(fq.name) AS friends
ORDER BY friend_score DESC
LIMIT 10
UNION
MATCH (f:Crew)<-[r_fqf]-(fq:Crew)
WHERE fq.name IN ["Neo", "Morpheus"]
WITH distinct(f), count(r_fqf) AS weight
ORDER BY weight DESC
LIMIT 10
MATCH f<--(p:Crew)
RETURN distinct(p) AS person, sum(weight) AS friend_score, collect(f.name) AS friends
ORDER BY friend_score DESC
LIMIT 10
I get the error
Error: org.neo4j.graphdb.NotFoundException: Unknown identifier `weight`.
Can anyone provide me with an explanation why these query results can not be combined and how to properly do so? Why is the identifier known when running both queries separately but unknown in a UNION-combined query?
EDIT
The following simpler query is basically equivalent, except that the second query in the UNION does not ORDER BY weight. This is because we are already ordering by the derived friend_score, so it seemed redundant. Also, in order for a variable to be included in the ORDER BY clause, it has to be in the RETURN clause -- but the first query in the UNION does not have a weight variable, which would have violated the requirements for a legal UNION statement.
In addition, there is a second WITH clause in the second query because you have to define the variables used in an ORDER BY clause (like friend_score) before the RETURN clause!
MATCH (p1:Crew)-[r_pfq]->(fq:Crew)
WHERE fq.name IN ["Neo", "Morpheus"]
RETURN DISTINCT (p1) AS person, count(r_pfq) AS friend_score, collect(fq.name) AS friends
ORDER BY friend_score DESC
LIMIT 10
UNION
MATCH (p:Crew)-->(f:Crew)<-[r_fqf]-(fq:Crew)
WHERE fq.name IN ["Neo", "Morpheus"]
WITH f, count(r_fqf) AS weight, p
WITH f, sum(weight) AS friend_score, p
RETURN DISTINCT (p) AS person, friend_score, collect(DISTINCT (f).name) AS friends
ORDER BY friend_score DESC
LIMIT 10

Resources