I'm new to Neo4j, so apologies in advance if I am doing things horribly wrong. I am trying to show users articles they might be interested in, based on the categories they have selected and, independently, the tags they have liked.
My model in Neo4j is something like this:
(:USER)-[:LIKES]->(:TAG)
(:ARTICLE)-[:PUBLISHED_BY]->(:PROVIDER)
(:ARTICLE)-[:HAS_CATEGORY]->(:CATEGORY)
(:USER)-[:DISLIKES]->(:ARTICLE)
(:USER)-[:INTERESTED_IN]->(:CATEGORY)
When I run the following query I get the desired results, but it takes 16-18 seconds to execute.
MATCH (u:USER {id: $userid})-[:LIKES]->(t:TAG)
WITH u,t, collect(t.name) as tags
UNWIND tags as tag with u,tag
MATCH (c:CATEGORY)<-[*]-(a:ARTICLE)-[pub:PUBLISHED_BY]->(p:PROVIDER)
WHERE a.keywords contains tag OR c.id in $categoryArray
AND NOT (u)-[:DISLIKES]->(a)
RETURN DISTINCT a.id AS id, a.title AS title, pub.pubDate
ORDER BY pub.pubDate DESC LIMIT 250
Is there a faster and better way to get the desired results?
Note: I am using Neo4j 3.4.1 on an Ubuntu machine with a 512 MB page cache and both min and max heap size set to 1500 MB.
It would be better if articles were connected to tags in your model.
The predicate a.keywords CONTAINS tag is not index-supported, so it leads to a full scan.
Also, the variable-length pattern from categories to articles might follow a long chain, so add a relationship type there and an upper limit. It might be better to find the articles first and then check them against the categories.
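// categories is built from $categoryArray, as used in the question
MATCH (c:CATEGORY) WHERE c.id IN $categoryArray
WITH collect(c) AS categories
MATCH (u:USER {id: $userid})-[:LIKES]->(tag:TAG)
// HAS_TAG is the proposed new relationship from articles to tags
MATCH (a:ARTICLE)-[:HAS_TAG]->(tag)
WITH DISTINCT u, a, categories
// HAS_CATEGORY is the relationship type from the model in the question
WHERE any(c IN categories WHERE NOT shortestPath((c)<-[:HAS_CATEGORY*]-(a)) IS NULL)
AND NOT (u)-[:DISLIKES]->(a)
MATCH (a)-[pub:PUBLISHED_BY]->(p:PROVIDER)
RETURN DISTINCT a.id AS id, a.title AS title, pub.pubDate
ORDER BY pub.pubDate DESC LIMIT 250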
Also check the query plan with PROFILE to spot bottlenecks or unindexed lookups (you can expand the plan boxes with the double arrow in the lower right corner).
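As a minimal sketch of the PROFILE suggestion, prefixing a query with PROFILE runs it and reports each operator with its rows and db hits. The query below is just the first step of the question's query, used for illustration; the same prefix works on the full query.

```cypher
// PROFILE executes the query and returns the plan alongside the results.
// A NodeByLabelScan on :USER in the plan would suggest that USER.id is not indexed.
PROFILE
MATCH (u:USER {id: $userid})-[:LIKES]->(t:TAG)
RETURN t.name
```

EXPLAIN can be used instead of PROFILE to see the plan without actually running the query.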
Thanks @Michael. I understand that having tags as separate nodes related to articles would make the search faster, but for now the following query has brought the search time down from 16-18 seconds to 3-4 seconds:
MATCH (u:USER {id: $userId})-[:INTERESTED_IN]->(c:CATEGORY)<-[*]-(a:ARTICLE)-[pub:PUBLISHED_BY]->(p:PROVIDER)
WHERE NOT (u)-[:DISLIKES]->(a)
RETURN DISTINCT a.id, a.title, pub.pubDate
ORDER BY pub.pubDate DESC LIMIT 150
UNION
MATCH (u:USER {id: $userId})-[:LIKES]->(t:TAG)
WITH u, t, collect(t.name) AS tags
UNWIND tags AS tag
MATCH (a:ARTICLE)-[pub:PUBLISHED_BY]-(:PROVIDER)
WHERE a.keywords CONTAINS tag AND NOT (u)-[:DISLIKES]->(a)
RETURN DISTINCT a.id, a.title, pub.pubDate
ORDER BY pub.pubDate DESC LIMIT 150
The query below is taken from the Neo4j movie review dataset sandbox:
MATCH (u:User {name: "Some User"})-[r:RATED]->(m:Movie)
WITH u, avg(r.rating) AS mean
MATCH (u)-[r:RATED]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WHERE r.rating > mean
WITH u, g, COUNT(*) AS score
MATCH (g)<-[:IN_GENRE]-(rec:Movie)
WHERE NOT EXISTS((u)-[:RATED]->(rec))
RETURN rec.title AS recommendation, rec.year AS year, COLLECT(DISTINCT g.name) AS genres, SUM(score) AS sscore
ORDER BY sscore DESC LIMIT 10
What I cannot understand is why the DISTINCT keyword is required in the query's RETURN clause. The expected result of the last MATCH clause is something like this:
g1,x
g1,y
...
g2,z
g2,v
g2,m
...
gn,m
gn,b
gn,x
where g1, g2, ..., gn are the genres and x, y, z, v, m, b, ... are movies (there are also user and score columns, omitted here for readability).
So, as I understand it, this query returns, for each movie, its genres and the sum of their scores.
Assumptions:
Every Movie has a unique title. (This is required for the query to work as is.)
Every Genre has a unique name.
Every Movie has at most one IN_GENRE relationship to each distinct Genre.
Given the above assumptions, you are correct that the DISTINCT is not necessary. That is because the RETURN clause is using rec.title as one of the aggregation grouping keys.
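The third assumption can be checked directly. The following is a minimal sketch, assuming the Movie and Genre labels and the title and name properties used in the question's query:

```cypher
// Lists every movie/genre pair that is linked by more than one IN_GENRE relationship.
MATCH (m:Movie)-[r:IN_GENRE]->(g:Genre)
WITH m, g, count(r) AS rels
WHERE rels > 1
RETURN m.title AS title, g.name AS genre, rels
```

If this returns no rows, no movie has duplicate IN_GENRE relationships to the same genre, so that assumption holds for the dataset.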
I am using Neo4j CE 3.1.1 and I have a relationship WRITES between authors and books. I want to find the N (say N=10 for example) books with the largest number of authors. Following some examples I found, I came up with the query:
MATCH (a)-[r:WRITES]->(b)
RETURN r,
COUNT(r) ORDER BY COUNT(r) DESC LIMIT 10
When I execute this query in the Neo4j browser I get 10 books, but these do not look like the ones written by most authors, as they show only a few WRITES relationships to authors. If I change the query to
MATCH (a)-[r:WRITES]->(b)
RETURN b,
COUNT(r) ORDER BY COUNT(r) DESC LIMIT 10
Then I get the 10 books with the most authors, but I don't see their relationship to authors. To do so, I have to write additional queries explicitly stating the name of a book I found in the previous query:
MATCH ()-[r:WRITES]->(b)
WHERE b.title="Title of a book with many authors"
RETURN r
What am I doing wrong? Why isn't the first query working as expected?
Aggregations are evaluated per group, and the groups are defined by the non-aggregation columns. With your MATCH, each relationship occurs on exactly one row of the results.
So your first query asks for each relationship on its own row, together with the count of that particular relationship, which is always 1.
You might rewrite this in a couple different ways.
One is to collect the authors and order on the size of the author list:
MATCH (a)-[:WRITES]->(b)
RETURN b, COLLECT(a) as authors
ORDER BY SIZE(authors) DESC LIMIT 10
You can always collect the author and its relationship, if the relationship itself is interesting to you.
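A minimal sketch of collecting each author together with its relationship, using a map per list entry. The map keys author and rel, and the alias authorships, are illustrative names, not anything from the question:

```cypher
MATCH (a)-[r:WRITES]->(b)
// Each list entry pairs an author node with its WRITES relationship to the book.
RETURN b, collect({author: a, rel: r}) AS authorships
ORDER BY size(authorships) DESC LIMIT 10
```

Here b is the only non-aggregation column, so there is one row per book and the list size equals that book's number of WRITES relationships.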
EDIT
If you happen to have labels on your nodes (you absolutely SHOULD have labels on your nodes), you can try a different approach by matching to all books, getting the size of the incoming :WRITES relationships to each book, ordering and limiting on that, and then performing the match to the authors:
MATCH (b:Book)
WITH b, SIZE(()-[:WRITES]->(b)) as authorCnt
ORDER BY authorCnt DESC LIMIT 10
MATCH (a)-[:WRITES]->(b)
RETURN b, a
You can collect on the authors and/or return the relationship as well, depending on what you need from the output.
You are very close: after sorting, it is necessary to rediscover the authors. For example:
MATCH (a:Author)-[r:WRITES]->(b:Book)
WITH b,
COUNT(r) AS authorsCount
ORDER BY authorsCount DESC LIMIT 10
MATCH (b)<-[:WRITES]-(a:Author)
RETURN b,
COLLECT(a) AS authors
ORDER BY size(authors) DESC
Suppose that I have the default Movies database and I want to find the total number of people who have participated in each movie, no matter their role (i.e. including the actors, the producers, the directors, etc.).
I have already done that using the query:
MATCH (m:Movie)<-[r]-(n:Person)
WITH m, COUNT(n) as count_people
RETURN m, count_people
ORDER BY count_people DESC
LIMIT 3
OK, I have included some extra options, but they don't really matter for my actual question. From the above query, I will get 3 movies.
Q. How can I extend the above query so that I get a graph including all the relationships for these 3 movies (i.e. DIRECTED, ACTED_IN, PRODUCED, etc.)?
I know that I can expand all the relationships of each movie through the buttons on each movie node, but I would like to know whether I can do so through Cypher.
Use an additional OPTIONAL MATCH:
MATCH (m:Movie)<--(n:Person)
WITH m,
COUNT(n) as count_people
ORDER BY count_people DESC
LIMIT 3
OPTIONAL MATCH p = (m)-[r]-(RN) WHERE type(r) IN ['DIRECTED', 'ACTED_IN', 'PRODUCED']
RETURN m,
collect(p) as graphPaths,
count_people
ORDER BY count_people DESC
I'm trying to write a query that involves a UNION but applies a WHERE, ORDER BY, and LIMIT after the union.
The basic idea is to find all posts STARRED or POSTED by users that another user FOLLOWS. For example, the posts s and p are the posts of interest.
MATCH (a:USER {id:0})-[:FOLLOWS]->(b:USER),
(b)-[:STARRED]->(s:POST),
(b)-[:POSTED]->(p:POST)
I'd like to return the union of the id property of both s and p after filtering, sorting, and limiting the results. Any relevant indexes to create that make this query efficient would be helpful as well.
If u is the union of s and p, I'd want to do something like:
WHERE u.time > 1431546036148
RETURN u.id ORDER BY u.time SKIP 0 LIMIT 20
I don't know how to get u from s and p, and I don't know what indexes to create to make this query efficient.
You can match on multiple relationship types, so you won't need a UNION.
I assume the time property is on the POST node:
MATCH (user:USER {id:0})-[:FOLLOWS]->(friend:USER)
MATCH (friend)-[:STARRED|:POSTED]->(p:POST)
WHERE p.time > 1431546036148
RETURN p
ORDER BY p.time
LIMIT 25
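On the index part of the question, the following is a hedged sketch using the Neo4j 3.x index syntax. It assumes USER nodes are looked up by id and that time lives on POST, as guessed above. Whether the planner actually uses the POST(time) index for this query depends on the plan it chooses, so it is worth checking with PROFILE. The query variant returns only ids, deduplicated, with the SKIP and LIMIT from the question.

```cypher
// Neo4j 3.x syntax; lets the planner find the starting user without a label scan.
CREATE INDEX ON :USER(id);
// Possibly useful for the time filter; not guaranteed to be picked by the planner.
CREATE INDEX ON :POST(time);

// Run as a separate statement once the indexes are online.
MATCH (:USER {id: 0})-[:FOLLOWS]->(friend:USER)-[:STARRED|:POSTED]->(p:POST)
WHERE p.time > 1431546036148
// A post can be reached through several friends or both relationship types.
WITH DISTINCT p
RETURN p.id
ORDER BY p.time SKIP 0 LIMIT 20
```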
Suppose I have two kinds of nodes, Person and Competency. They are related by a KNOWS relationship. For example:
(:Person {id: 'thiago'})-[:KNOWS]->(:Competency {id: 'neo4j'})
How do I query this schema to find every Person that knows all nodes in a given set of Competency nodes?
Suppose that I need to find every Person that knows "java" and "haskell", and I'm only interested in the nodes that know all of the listed Competency nodes.
I've tried this query:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id;
But I get back a list of every Person that knows either "java" or "haskell", with duplicate entries for those who know both.
Adding a count(c) at the end of the query eliminates the duplicates:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id, count(c);
Then, in this particular case, I can iterate over the result and filter out rows whose count is less than two to get the nodes I want.
I've found that I could do it by appending consecutive MATCH clauses that keep filtering the nodes until I get the result I want, in this case:
match (p:Person)-[:KNOWS]->(:Competency {id:'haskell'})
match (p)-[:KNOWS]->(:Competency {id:'java'})
return p.id;
Is this the only way to express this query? I mean, do I need to build the query by concatenating strings? I'm looking for a fixed query with parameters.
with ['java','haskell'] as skills
match (p:Person)-[:KNOWS]->(c:Competency)
where c.id in skills
// carry p itself (an unaliased p.id is not allowed in WITH) so it can be returned below
with p, count(*) as c1, size(skills) as c2
where c1 = c2
return p.id
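Since the question asks for a fixed query with parameters, the same counting idea can be sketched with a list parameter. Here $skills is a hypothetical parameter name (for example ['java', 'haskell']), assumed to contain no duplicates. The $name syntax applies to Neo4j 3.1 and later; older versions use {skills}.

```cypher
// $skills is a hypothetical list parameter supplied by the caller.
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.id IN $skills
// DISTINCT guards against duplicate KNOWS relationships to the same competency.
WITH p, count(DISTINCT c) AS matched
WHERE matched = size($skills)
RETURN p.id
```

The query text stays the same for any number of competencies; only the parameter value changes.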
One thing you can do is count the total number of skills, then find the users whose number of skill relationships equals that count:
MATCH (n:Skill) WITH count(n) as skillMax
MATCH (u:Person)-[:HAS]->(s:Skill)
WITH u, count(s) as skillsCount, skillMax
WHERE skillsCount = skillMax
RETURN u, skillsCount
Chris
Untested, but this might do the trick:
match (p:Person)-[:KNOWS]->(c:Competency)
with p, collect(c.id) as cs
where all(x in ['java', 'haskell'] where x in cs)
return p.id;
How about this...
WITH ['java','haskell'] AS comp_col
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.name in comp_col
WITH comp_col
, p
, count(*) AS total
WHERE total = size(comp_col)
RETURN p.name, total
Put the competencies you want in a collection.
Match all the people that have any of those competencies.
Count the competencies per person and keep the people whose count equals the size of the competency collection from the start.
I think this will work for what you need, but if you are building these queries programmatically, the best performance might come from successive MATCH clauses. Especially if you know which competencies are most and least common when building your queries, you could order the matches so that the least common come first and the most common come last. I think that would narrow down to your desired persons the fastest.
It would be interesting to see what the plan analyzer in the shell says about the different approaches.