Neo4j / Cypher query syntax feedback - neo4j

I'm developing a kind of reddit service to learn Neo4j.
Everything works fine, I just want to get some feedback on the Cypher query to get the most recent news stories, the author and number of comments, likes and dislikes.
I'm using Neo4j 2.0.
MATCH comments = (n:news)-[:COMMENT]-(o)
MATCH likes = (n:news)-[:LIKES]-(p)
MATCH dislikes = (n:news)-[:DISLIKES]-(q)
MATCH (n:news)-[:POSTED_BY]-(r)
WITH n, r, count(comments) AS num_comments, count(likes) AS num_likes, count(dislikes) AS num_dislikes
ORDER BY n.post_date
LIMIT 20
RETURN *
o, p, q, r are all nodes with the label user. Should the label be added to the query to speed it up?
Is there anything else you see that I could optimize?

I think you're going to want to get rid of the multiple matches. Cypher will filter on each one, filtering through one another, rather than getting all the information.
I would also avoid the paths like comments, and rather do the count on the nodes you are saving. When you do MATCH xyz = (a)-[:COMMENT]-(b) then xyz is a path, which contains the source, relationship and destination node.
MATCH (news:news)-[:COMMENT]-(comment),(news:news)-[:LIKES]-(like),(news:news)-[:DISLIKES]-(dislike),(news:news)-[:POSTED_BY]-(posted_by)
WHERE news.post_date > 0
WITH news, posted_by, count(comment) AS num_comments, count(like) AS num_likes, count(dislike) AS num_dislikes
ORDER BY news.post_date
LIMIT 20
RETURN *

I would do something like this.
MATCH (n:news)-[:POSTED_BY]->(r)
WHERE n.post_date > {recent_start_time}
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes,
ORDER BY n.post_date DESC
LIMIT 20
To speed it up and have not neo search over all your posts, I would probably index the post-date field (assuming it doesn't contain time information). And then send this query in for today, yesterday etc. until you have your 20 posts.
MATCH (n:news {post_date: {day}})-[:POSTED_BY]->(r)
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes,
ORDER BY n.post_date DESC
LIMIT 20

Related

Neo4J get N latest relationships per unique target node?

With the following graph:
How can I write a query that would return N latest relationships by the unique target node?
For an example, this query: MATCH (p)-[r:RATED_IN]->(s) WHERE id(p)={person} RETURN p,s,r ORDER BY r.measurementDate DESC LIMIT {N} with N = 1 would return the latest relationship, whether it is RATED_IN Team Lead or Programming, but I would like to get N latest by each type. Of course, with N = 2, I would like the 2 latest measurements per skill node.
I would like the latest relationship by a person for Team Lead and the latest one for Programming.
How can I write such a query?
-- EDIT --
MATCH (p:Person) WHERE id(p)=175
CALL apoc.cypher.run('
WITH {p} AS p
MATCH (p)-[r:RATED_IN]->(s)
RETURN DISTINCT s, r ORDER BY r.measurementDate DESC LIMIT 2',
{p:p}) YIELD value
RETURN p,value.r AS r, value.s AS s
Here's a Cypher knowledge base article on limiting MATCH results per row, with a few different suggestions on how to accomplish this given current limitations. Using APOC's apoc.cypher.run() to perform a subquery with a RETURN using a LIMIT will do the trick, as it gets executed per row (thus the LIMIT is per row).
Note that for the upcoming Neo4j 4.0 release at the end of the year we're going to be getting some nice Cypher goodies that will make this significantly easier. Stay tuned as we reveal more details as we approach its release!

Querying Relationship is very slow

I want to create a new table based on relationships, after retrieving results.The query works and outputs for few results.
match (p:Person)-[r:LIVES]->(t:COUNTRY)<-[r2:LIVES]-(p2:Person)
where p<>p2 and r.year = r2.year and r.year >=2015 and r2.year>=2015
return p,r,t,r2,p2 limit 25;
But the server stops responding when I want all results.
match (p:Person)-[r:LIVES]->(t:COUNTRY)<-[r2:LIVES]-(p2:Person)
where p<>p2 and r.year = r2.year and r.year >=2015 and r2.year>=2015
return p,r,t,r2,p2;
This is basically self-joining relationship "LIVES".I have searched a lot but not found a way to create indices on relationships.
Any suggestions?
Since it sound like you aren't bound to the results format, I would like to offer an alternative Cypher.
The thing is that in your cypher, p and p2 basically form a cartesian product of people in the same year and country. So really you just need to group people by year and country, filter out groups of one, and then every pair in that bag is your p and p2 from your original query.
MATCH (p:Person)-[r:LIVES]->(t:COUNTRY)
WHERE r.year >=2015
WITH t, DISTINCT r.year as Year, COLLECT({n:p, r:r}) as people
WHERE SIZE(people) > 1
RETURN t, people

Independent matches in cypher query

I have a Neo4j database with User, Content, and Topic nodes. I want to calculate the proportion of content consumed by a given user for a given topic.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
MATCH (c1:Content)<-[:CONTAINS]-(z)
RETURN toFloat(COUNT(DISTINCT(c))) / toFloat(COUNT(DISTINCT(c1)))
Two things strike me as really ugly here:
Firstly, is COUNT(DISTINCT()) a hack to get round the fact that the two MATCH queries cross-join?
Float division is ugly.
The second is something I can live with, but the first seems inefficient; is there a better way to express this idea?
The count of content should return the number of pieces of content a user consumed unless of course they consumed the same content more than once.
Instead of matching all of the content from the topic, if your model permits, you could just get the size of the outbound CONTAINS relationships.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
RETURN toFloat(count(distinct c))/ size((t)-[:CONTAINS]->()) as proportion
Your original query returns a cartesian product of the number of user-content-topic matches x the number of topic-content matches. As an alternative to the above, you could re-write your original query something like this. This gets the content that is consumed by a user for the topic, does the aggregation and then passes the topic and resulting count to the next clause in the query. This will work, however, using size((t)-[:CONTAINS]->()) will be more effiecient.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
WITH t, count(distinct c ) as distinct_content
MATCH (t)-[:CONTAINS]->(c1:Content)
RETURN toFloat(distinct_content) / count(c1)

Neo4j Cypher, need an optimized way of getting friend of friend of friend who are not in a 1 or 2 degree connection

I am new to Neo4j and trying to get friends of friends of friends (those who are 3 degrees away) and are also not in a 1 or 2 degree relation through a different path. I am using the below cypher which seems to take a lot of time
MATCH p = (origin:User {ID:51})-[:LINKED*3..3]-(fof:User)
WHERE NOT (origin)-[:LINKED*..2]-(fof)
RETURN fof.Nm
ORDER BY Nm LIMIT 1000
Profiling the query shows that the majority of time is taken by the "WHERE NOT" condition as it cross checks every resultant node against all the 1 and 2 degree nodes.
Am I doing something wrong here or is there a more optimized way of doing this?
Just to add, the property UsrID in label User is indexed.
There are probably a few ways you could do it. Here's one to try:
MATCH path = (origin:User {ID:51})-[:LINKED*3..3]-(fofof:User)
WHERE NOT(fofof IN (nodes(path)[0..-1]))
RETURN fofof.Nm
ORDER BY fofof.Nm LIMIT 1000
You could also be more explicit:
MATCH path = (origin:User {ID:51})-[:LINKED]-(f:User)-[:LINKED]-(fof:User)-[:LINKED]-(fofof:User)
WHERE fofof <> f AND fofof <> fof
RETURN fofof.Nm
ORDER BY fofof.Nm LIMIT 1000

Neo4j: How to limit subqueries

I just imported the English Wikipedia into Neo4j and am playing around. I started by looking up the pages that link into the Page "Berlin"
MATCH p=(p1:Page {title:"Berlin"})<-[*1..1]-(otherPage)
WITH nodes(p) as neighbors
LIMIT 500
RETURN DISTINCT neighbors
That works quite well. What I would like to achieve next is to show the 2nd degree of relationships. In order to be able to display them correctly, I would like to limit the number of first degree relationship nodes to 20 and then query the next level of relationship.
How does one achieve that?
I don't know the Wikipedia model, but I'm assuming that there are many different relationship types and that is why that -[*1..1]-, I think that is analogous to -[]- or even --. I doubt it has any serious impact though.
You can collect up the first level matches and limit them to 20 using a WITH with a LIMIT. You can then perform a second match using those (<20) other pages as the start point.
MATCH (p1:Page {title:"Berlin"})<-[*1..1]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[*1..1]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH otherPage, secondDegree
LIMIT 500
RETURN otherPage, COLLECT(secondDegree)
There are many ways to return the data, this just returns the first degree match with an array of the subsequent matches.
If the only type of relationship is :Link and you want to keep the start node then you can change the query to this:
MATCH (p1:Page {title:"Berlin"})<-[:Link]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[:Link]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH p1, otherPage, secondDegree
LIMIT 500
RETURN p1, otherPage, COLLECT(secondDegree)

Resources