I've got a Neo4j database in which hashtags and tweets are stored.
Every tweet has a topic property, which defines the topic it belongs to.
If I run the following query, I get the most popular hashtags in the db, no matter the topic:
MATCH (h:Hashtag)
RETURN h.text AS hashtag, size( (h)<--() ) AS degree ORDER BY degree DESC
I'd like to get the most popular tags for a single topic.
I tried this:
MATCH (h:Hashtag)<--(t:Tweet{topic:'test'})
RETURN h.text AS hashtag, size( (h)<--(t) ) AS degree ORDER BY degree DESC
and this:
MATCH (h:Hashtag)
RETURN h.text AS hashtag, size( (h)<--(t:Tweet{topic:'test'}) ) AS degree ORDER BY degree DESC
while the next one takes forever to run:
MATCH (h:Hashtag), (t:Tweet)
WHERE t.topic='test'
RETURN h.text AS hashtag, size( (h)<--(t) ) AS degree ORDER BY degree DESC
What should I do? Thanks.
In Cypher, when you return the result of an aggregation function, you get an implicit "group by" on whatever you return alongside it. SIZE() is not an aggregation (you would just get the size of the pattern for each row, with no grouping), but COUNT() is:
MATCH (t:Tweet {topic:'test'})-->(h:Hashtag)
RETURN h, COUNT(*) AS num ORDER BY num DESC LIMIT 10
This query returns counts of Tweet nodes, grouped by Hashtag.
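If you prefer the size()-style form from your question, the topic filter can also be folded straight into the pattern expression (labels and property names as in your question; the LIMIT 10 here is just to keep the result small). Note that, unlike a bare (h)<--(), this variant cannot be answered from the node's stored degree alone:
MATCH (h:Hashtag)
RETURN h.text AS hashtag, size( (h)<--(:Tweet {topic:'test'}) ) AS degree
ORDER BY degree DESC
LIMIT 10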
Hi there, I am on Neo4j and I am having some trouble. I have one query where I want to return the node (cuisine) with the highest percentage, like so:
// 1. Find the most_popular_cuisine
MATCH (n:restaurants)
WITH COUNT(n.cuisine) as total
MATCH (r:restaurants)
RETURN r.cuisine , 100 * count(*)/total as percentage
order by percentage desc
limit 1
I am trying to extend this even further by taking the top result and matching on it to get the nodes with just that property, like so:
WITH COUNT(n.cuisine) as total
MATCH (r:restaurants)
WITH r.cuisine as cuisine , count(*) as cnt
MATCH (t:restaurants)
WHERE t.cuisine = cuisine AND count(*) = MAX(cnt)
RETURN t
I think you might be better off refactoring your model a little bit so that :Cuisine is a label and each cuisine has its own node.
(:Restaurant)-[:OFFERS]->(:Cuisine)
or
(:Restaurant)-[:SPECIALIZES_IN]->(:Cuisine)
Then your query can look like this:
MATCH (cuisine:Cuisine)
RETURN cuisine, size((cuisine)<-[:OFFERS]-()) AS number_of_restaurants
ORDER BY number_of_restaurants DESC
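To get there from your current data, a one-off migration roughly like this could create the :Cuisine nodes and relationships. It assumes your existing nodes are labelled :restaurants with a cuisine property, as in your first query, and the name property on :Cuisine is simply a choice made here:
MATCH (r:restaurants)
WHERE r.cuisine IS NOT NULL
MERGE (c:Cuisine {name: r.cuisine})
MERGE (r)-[:OFFERS]->(c)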
I wasn't able to get r.cuisine AS cuisine, count(*) AS cnt to work in a WITH rather than a RETURN, so I had to resort to a slightly more long-winded approach.
There might be a more optimized way to do this, but this works too:
// Get all unique cuisines in a list
MATCH (n:Restaurants)
WITH COUNT(n.Cuisine) as total, COLLECT(DISTINCT(n.Cuisine)) as cuisineList
// Go through each cuisine and find the number of restaurants associated with each
UNWIND cuisineList as c
MATCH (r:Restaurants{Cuisine:c})
WITH total, r.Cuisine as c, count(r) as cnt
ORDER BY cnt DESC
WITH COLLECT({Cuisine: c, Count:cnt}) as list
// For the most popular cuisine, find all the restaurants offering it
MATCH (t:Restaurants{Cuisine:list[0].Cuisine})
RETURN t
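For what it's worth, if aggregation in a WITH does work on your Neo4j version, a shorter variant along these lines (same Restaurants label and Cuisine property as above) should give the same result:
MATCH (r:Restaurants)
WITH r.Cuisine AS cuisine, count(*) AS cnt
ORDER BY cnt DESC
LIMIT 1
MATCH (t:Restaurants {Cuisine: cuisine})
RETURN t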
I'm trying to model a large knowledge graph (using Neo4j 3.1.1).
My graph contains only two node labels (Topic, Properties) and a single relationship type (HAS_PROPERTIES).
There are about 85M nodes (47M :Topic, the rest :Properties).
I'm trying to get the most connected :Topic node. I'm using the following query:
MATCH (n:Topic)-[r]-()
RETURN n, count(DISTINCT r) AS num
ORDER BY num
This query, and almost any query I run (without filtering the results) that counts relationships and orders by that count, is extremely slow: it takes more than 10 minutes and still gives no response.
Am I missing indexes, or is there a better syntax?
Is there any chance I can execute this query in a reasonable time?
Use this:
MATCH (n:Topic)
RETURN n, size( (n)--() ) AS num
ORDER BY num DESC
LIMIT 100
This reads the degree from the node directly instead of expanding and counting all of its relationships.
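Since you mention a single relationship type, you can optionally restrict the degree to it; this is still a direct degree read rather than an expansion:
MATCH (n:Topic)
RETURN n, size( (n)-[:HAS_PROPERTIES]-() ) AS num
ORDER BY num DESC
LIMIT 100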
I just imported the English Wikipedia into Neo4j and am playing around. I started by looking up the pages that link into the Page "Berlin":
MATCH p=(p1:Page {title:"Berlin"})<-[*1..1]-(otherPage)
WITH nodes(p) as neighbors
LIMIT 500
RETURN DISTINCT neighbors
That works quite well. What I would like to achieve next is to show the 2nd degree of relationships. In order to display them correctly, I would like to limit the number of first-degree nodes to 20 and then query the next level of relationships.
How does one achieve that?
I don't know the Wikipedia model, but I'm assuming there are many different relationship types and that is why you use -[*1..1]-; I think that is analogous to -[]- or even --. I doubt it has any serious impact though.
You can collect up the first level matches and limit them to 20 using a WITH with a LIMIT. You can then perform a second match using those (<20) other pages as the start point.
MATCH (p1:Page {title:"Berlin"})<-[*1..1]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[*1..1]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH otherPage, secondDegree
LIMIT 500
RETURN otherPage, COLLECT(secondDegree)
There are many ways to return the data, this just returns the first degree match with an array of the subsequent matches.
If the only type of relationship is :Link and you want to keep the start node then you can change the query to this:
MATCH (p1:Page {title:"Berlin"})<-[:Link]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[:Link]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH p1, otherPage, secondDegree
LIMIT 500
RETURN p1, otherPage, COLLECT(secondDegree)
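If you would rather cap the second-degree pages per first-degree page instead of using one global LIMIT 500, one option (still assuming the :Link relationship type, with 25 as an arbitrary per-page cap) is to collect and slice:
MATCH (p1:Page {title:"Berlin"})<-[:Link]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[:Link]-(secondDegree:Page)
WHERE secondDegree <> p1
RETURN p1, otherPage, COLLECT(DISTINCT secondDegree)[..25] AS secondDegreePages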
I have an embedded Neo4j server with Ruby on Rails.
These are the configurations:
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=240M
neostore.propertystore.db.mapped_memory=230M
neostore.propertystore.db.strings.mapped_memory=1200M
neostore.propertystore.db.arrays.mapped_memory=130M
wrapper.java.initmemory=1024
wrapper.java.maxmemory=2048
There are around 15 lakh (1.5 million) movie nodes. The query below takes around 5 seconds to execute:
MATCH (movie:Movie)
WITH movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
RETURN movie, weight as weight
ORDER BY weight DESC
SKIP skip_count
LIMIT 10
Here skip_count varies as the user scrolls through the results.
And this other query, which aims to get the movies of a particular director, takes around 9 seconds:
MATCH (movie:Movie) , (director:Director)-[:Directed]->(movie)
WHERE director.name =~ '(?i)DIRECTOR_NAME'
WITH movie, toInt(movie.ratings_count) * toInt(movie.reviews_count) * toInt(movie.rating) AS total_weight
RETURN movie, total_weight
ORDER BY total_weight DESC, movie.rating DESC
LIMIT 10
How can I reduce the query execution time?
Regarding the first query:
You might make the weight ordering in the graph explicit by connecting all movie nodes with :NEXT_WEIGHT relationships in descending weight order, so that the movies form a linked list.
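Building that list once might look roughly like the following sketch; collecting all ~1.5M movies in a single query is memory-hungry, so in practice you may need to batch it:
MATCH (m:Movie)
WITH m
ORDER BY toInt(m.reviews_count) + toInt(m.ratings_count) DESC
WITH collect(m) AS movies
UNWIND range(0, size(movies) - 2) AS i
WITH movies[i] AS a, movies[i + 1] AS b
MERGE (a)-[:NEXT_WEIGHT]->(b)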
Your query would look like:
MATCH p=(:Movie {name:'<name of movie with highest weight>'})-[:NEXT_WEIGHT*..1000]-()
WHERE length(p)>skip_count AND length(p)<skip_count+limit
WITH p
ORDER BY length(p)
WITH last(nodes(p)) as movie
RETURN movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
Regarding the second query:
You should use an index to speed up the director lookup. Unfortunately, index lookups are currently only supported for exact matches, so either make sure the search string matches the stored upper/lower case exactly, or store a normalized version in another property:
MATCH (d:Director) SET d.lowerName = LOWER(d.name)
Make sure to have an index on the Director label and the lowerName property:
CREATE INDEX ON :Director(lowerName)
And your query should look like this, with the director name passed in already lowercased:
MATCH (director:Director)-[:Directed]->(movie)
WHERE director.lowerName = {directorName}
RETURN movie, toInt(movie.ratings_count) * toInt(movie.reviews_count) * toInt(movie.rating) AS total_weight
ORDER BY total_weight DESC, movie.rating DESC
LIMIT 10
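To check that the index is actually used, you can prefix the query with PROFILE (or EXPLAIN) and look for an index seek on :Director(lowerName) in the plan; the director name below is just an illustrative value:
PROFILE
MATCH (director:Director)-[:Directed]->(movie)
WHERE director.lowerName = 'james cameron'
RETURN count(movie)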
I'm developing a kind of Reddit service to learn Neo4j.
Everything works fine; I just want some feedback on the Cypher query that fetches the most recent news stories together with the author and the number of comments, likes and dislikes.
I'm using Neo4j 2.0.
MATCH comments = (n:news)-[:COMMENT]-(o)
MATCH likes = (n:news)-[:LIKES]-(p)
MATCH dislikes = (n:news)-[:DISLIKES]-(q)
MATCH (n:news)-[:POSTED_BY]-(r)
WITH n, r, count(comments) AS num_comments, count(likes) AS num_likes, count(dislikes) AS num_dislikes
ORDER BY n.post_date
LIMIT 20
RETURN *
o, p, q, r are all nodes with the label user. Should the label be added to the query to speed it up?
Is there anything else you see that I could optimize?
I think you're going to want to get rid of the multiple MATCH clauses. Cypher filters on each one in turn, each MATCH working on the rows produced by the previous one, rather than gathering all the information in one pass.
I would also avoid binding paths like comments and instead count the nodes you are matching. When you write MATCH xyz = (a)-[:COMMENT]-(b), xyz is a path, which contains the source node, the relationship and the destination node.
MATCH (news:news)-[:COMMENT]-(comment),
      (news:news)-[:LIKES]-(like),
      (news:news)-[:DISLIKES]-(dislike),
      (news:news)-[:POSTED_BY]-(posted_by)
WHERE news.post_date > 0
WITH news, posted_by, count(comment) AS num_comments, count(like) AS num_likes, count(dislike) AS num_dislikes
ORDER BY news.post_date
LIMIT 20
RETURN *
I would do something like this:
MATCH (n:news)-[:POSTED_BY]->(r)
WHERE n.post_date > {recent_start_time}
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
To speed this up and keep Neo4j from scanning over all your posts, I would probably index the post_date field (assuming it doesn't contain time information), and then send this query in for today, yesterday, etc. until you have your 20 posts:
MATCH (n:news {post_date: {day}})-[:POSTED_BY]->(r)
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
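For the post_date index mentioned above, the schema index syntax (available since Neo4j 2.0) would be:
CREATE INDEX ON :news(post_date)
With that in place, each per-day query only touches that day's :news nodes instead of scanning the whole label.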