I have a Neo4j database with User, Content, and Topic nodes. I want to calculate the proportion of content consumed by a given user for a given topic.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
MATCH (c1:Content)<-[:CONTAINS]-(z)
RETURN toFloat(COUNT(DISTINCT(c))) / toFloat(COUNT(DISTINCT(c1)))
Two things strike me as really ugly here:
Firstly, is COUNT(DISTINCT ...) a hack to get around the fact that the two MATCH clauses cross-join?
Secondly, the float division is ugly.
The second is something I can live with, but the first seems inefficient; is there a better way to express this idea?
The count of content should return the number of pieces of content a user consumed unless of course they consumed the same content more than once.
Instead of matching all of the content from the topic, if your model permits, you could just get the size of the outbound CONTAINS relationships.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
RETURN toFloat(count(distinct c)) / size((t)-[:CONTAINS]->()) as proportion
Your original query returns a Cartesian product: the number of user-content-topic matches times the number of topic-content matches. As an alternative to the above, you could rewrite your original query something like this. It gets the content consumed by the user for the topic, does the aggregation, and then passes the topic and the resulting count to the next clause in the query. This will work; however, using size((t)-[:CONTAINS]->()) will be more efficient.
MATCH (u:User)-[:CONSUMED]->(c:Content)<-[:CONTAINS]-(t:Topic)
WHERE ID(u) = 11158 AND ID(t) = 19853
WITH t, count(distinct c ) as distinct_content
MATCH (t)-[:CONTAINS]->(c1:Content)
RETURN toFloat(distinct_content) / count(c1)
Related
I'm trying to count the different types of relationships in my neo4j graph to add them as a "frequency" property on the corresponding edges (i.e. I have 4 relationships of type EX, so I would like my edges of type EX to have e.frequency = 4).
So far I have played around with this code:
MATCH ()-[e:EX]-()
WITH e, count(e) as amount
SET e.frequency = amount
RETURN e
For this piece of code, the returned e.frequency is 2 for all EX edges. Does anyone here know how to correct this?
It sounds like you want this information for quick access later for any relationship of that type. If you're planning on deleting or adding edges in your graph, you should realize that your data will get stale quickly, and a graph-wide query to update the property on every edge in the graph just doesn't make sense.
Thankfully Neo4j keeps a transactional count store of various statistics, including the number of relationships per relationship type.
It's easiest to get these via procedure calls, either in Neo4j itself or APOC Procedures.
If you have APOC installed, you can see the map of relationship type counts like this:
CALL apoc.meta.stats() YIELD relTypesCount
RETURN relTypesCount
If you know which type you want the count of, you can use dot notation into the relTypesCount map to get the value in question.
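For example, if you know the type up front (EX here is just an illustrative type name), dot access on the map might look like this:

```cypher
CALL apoc.meta.stats() YIELD relTypesCount
RETURN relTypesCount.EX as exCount
```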
If it's dynamic (either passed in as a parameter, or obtained after matching to a relationship in the query) you can use the map index notation to get the count in question like this:
CALL apoc.meta.stats() YIELD relTypesCount
MATCH ()-[r]->()
WITH relTypesCount, r
LIMIT 5
RETURN type(r) as type, relTypesCount[type(r)] as count
If you don't have APOC, you can make use of db.stats.retrieve('GRAPH COUNTS')
YIELD data, but you'll have to do some additional filtering to make sure you get the counts for ALL of the relationships of the given type, and exclude the counts that include the labels of the start or end nodes:
CALL db.stats.retrieve('GRAPH COUNTS') YIELD data
WITH [entry IN data.relationships WHERE NOT exists(entry.startLabel) AND NOT exists(entry.endLabel)] as relCounts
MATCH ()-[r]->()
WITH relCounts, r
LIMIT 5
RETURN type(r) as type, [rel in relCounts WHERE rel.relationshipType = type(r) | rel.count][0] as count
First, here is what your query is doing
// Match all EX edges (relationships), ignore direction
MATCH ()-[e:EX]-()
// Logical partition; with each edge, count how many times that instance occurred (it will always be 2, since (a)-[e:EX]->(b) is also matched in the reverse order as (b)<-[e:EX]-(a))
WITH e, count(e) as amount
// Set the property frequency on the instance of e to amount (2)
SET e.frequency = amount
// return the edges
RETURN e
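To see why grouping an undirected match by the edge always yields 2, here is a small Python sketch outside of Neo4j (the edge list is hypothetical, not your data):

```python
from collections import Counter

# A hypothetical set of stored edges (a)-[:EX]->(b)
edges = [("a", "b"), ("b", "c")]

# An undirected pattern ()-[e:EX]-() matches each stored edge twice:
# once in the stored direction, once reversed.
matches = []
for (s, t) in edges:
    matches.append((s, t))  # matched as stored
    matches.append((t, s))  # matched in the reverse direction

# Group by the underlying edge (order-insensitive), like WITH e, count(e)
counts = Counter(frozenset(m) for m in matches)
assert all(c == 2 for c in counts.values())
```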
So to filter out the duplicates (reverse-direction matches), you need to specify the direction on the MATCH: MATCH ()-[e:EX]->(). For the frequency part, you don't even need a match; you can just count the occurrences of the pattern with WITH SIZE(()-[:EX]->()) as c (SIZE because pattern matching returns a list, not a row set).
So
WITH SIZE(()-[:EX]->()) as c
MATCH ()-[e:EX]->()
SET e.frequency = c
return e
Although, frequency will be invalidated as soon as an EX edge is created or deleted, so I would just open your Cypher by asking for the edge count.
Also, in this trivial case, the best way to get the relationship count is with a MATCH - COUNT, because this form helps the Cypher planner recognize that it can just fetch the edge count from its internal metadata store.
MATCH ()-[e:EX]->()
WITH COUNT(e) as c
MATCH ()-[e:EX]->()
SET e.frequency = c
return e
I am using neo4j version 3.0.3. I have executed the below query. It is giving the results as the count of users who have the HAS_VISITED_LOCATION relation, but I want the total count of users who don't have the HAS_VISITED_LOCATION relation also.
MATCH (c:Consumer)-[:HAS_VISITED_LOCATION]-(l:Location)
WHERE NOT l.AreaName="hyderabad"
MATCH(c)-[:HAS_DEVICE_BRAND]-(d:DeviceBrand{BrandName:"lenovo"})
RETURN count(c)
So you're asking for the count of all consumers who have the lenovo device brand and who have not visited hyderabad.
This query should do that:
MATCH (l:Location {AreaName:'hyderabad'})
MATCH (c:Consumer)-[:HAS_DEVICE_BRAND]->(:DeviceBrand{BrandName:"lenovo"})
WHERE NOT (c)-[:HAS_VISITED_LOCATION]->(l)
RETURN COUNT(DISTINCT c)
EDIT - New (but related) question on how to get consumers who have not visited hyderabad and who don't have the lenovo brand.
This new question is trickier in that it's matching on the absence of relationships.
The straightforward approach is to simply match on consumers where the consumer has not visited hyderabad and doesn't have the lenovo device brand:
MATCH (c:Consumer)
WHERE NOT (c)-[:HAS_VISITED_LOCATION]->(l:Location {AreaName:'hyderabad'})
AND NOT (c)-[:HAS_DEVICE_BRAND]->(:DeviceBrand{BrandName:"lenovo"})
RETURN COUNT(c) as count
While this is correct, it may not be the most efficient query.
If we look at the logical representation of what you want, we might see an alternate approach:
NOT (visited hyderabad) AND NOT (has lenovo)
If we take the negation of your requirement:
NOT (NOT (visited hyderabad) AND NOT (has lenovo)) = (visited hyderabad) OR (has lenovo)
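That identity is just De Morgan's law; a quick Python check over all truth assignments confirms the rewrite:

```python
from itertools import product

# NOT (NOT A AND NOT B)  ==  A OR B, for every combination of A and B
for a, b in product([False, True], repeat=2):
    assert (not (not a and not b)) == (a or b)
```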
So an alternate query can be to find the count of the negation of what you want (the count of consumers who have visited hyderabad OR who have lenovo), and subtract it from the total consumer count to get the actual count you want.
You can try this query and see if it performs better than the straightforward approach:
// first get the total count of consumers, should be very fast
MATCH (c:Consumer)
WITH COUNT(c) as totalCount
MATCH (lenovo:DeviceBrand{BrandName:'lenovo'}), (hyderabad:Location{AreaName:'hyderabad'})
// union lenovo and hyderabad into one column through collecting, combining, and unwinding
// (this is a workaround since Cypher can't do post-union processing)
WITH totalCount, COLLECT(lenovo) + COLLECT(hyderabad) as excludeNodes
UNWIND excludeNodes as excludeNode
// get all consumers attached to these nodes
MATCH (excludeNode)<-[:HAS_DEVICE_BRAND|:HAS_VISITED_LOCATION]-(c:Consumer)
WITH totalCount, COUNT(DISTINCT c) as excludeCount
RETURN totalCount - excludeCount as count
I just imported the English Wikipedia into Neo4j and am playing around. I started by looking up the pages that link into the Page "Berlin"
MATCH p=(p1:Page {title:"Berlin"})<-[*1..1]-(otherPage)
WITH nodes(p) as neighbors
LIMIT 500
RETURN DISTINCT neighbors
That works quite well. What I would like to achieve next is to show the 2nd degree of relationships. In order to be able to display them correctly, I would like to limit the number of first degree relationship nodes to 20 and then query the next level of relationship.
How does one achieve that?
I don't know the Wikipedia model, but I'm assuming there are many different relationship types, which is why you used -[*1..1]-; I think that is analogous to -[]- or even --. I doubt it has any serious impact though.
You can collect up the first level matches and limit them to 20 using a WITH with a LIMIT. You can then perform a second match using those (<20) other pages as the start point.
MATCH (p1:Page {title:"Berlin"})<-[*1..1]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[*1..1]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH otherPage, secondDegree
LIMIT 500
RETURN otherPage, COLLECT(secondDegree)
There are many ways to return the data, this just returns the first degree match with an array of the subsequent matches.
If the only type of relationship is :Link and you want to keep the start node then you can change the query to this:
MATCH (p1:Page {title:"Berlin"})<-[:Link]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[:Link]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH p1, otherPage, secondDegree
LIMIT 500
RETURN p1, otherPage, COLLECT(secondDegree)
I'm developing a kind of reddit service to learn Neo4j.
Everything works fine, I just want to get some feedback on the Cypher query to get the most recent news stories, the author and number of comments, likes and dislikes.
I'm using Neo4j 2.0.
MATCH comments = (n:news)-[:COMMENT]-(o)
MATCH likes = (n:news)-[:LIKES]-(p)
MATCH dislikes = (n:news)-[:DISLIKES]-(q)
MATCH (n:news)-[:POSTED_BY]-(r)
WITH n, r, count(comments) AS num_comments, count(likes) AS num_likes, count(dislikes) AS num_dislikes
ORDER BY n.post_date
LIMIT 20
RETURN *
o, p, q, r are all nodes with the label user. Should the label be added to the query to speed it up?
Is there anything else you see that I could optimize?
I think you're going to want to get rid of the multiple matches. Cypher expands each one against the results of the previous ones, combining rows with one another, rather than just gathering all the information.
I would also avoid the paths like comments, and rather do the count on the nodes you are saving. When you do MATCH xyz = (a)-[:COMMENT]-(b) then xyz is a path, which contains the source, relationship and destination node.
MATCH (news:news)-[:COMMENT]-(comment),
      (news)-[:LIKES]-(like),
      (news)-[:DISLIKES]-(dislike),
      (news)-[:POSTED_BY]-(posted_by)
WHERE news.post_date > 0
WITH news, posted_by, count(comment) AS num_comments, count(like) AS num_likes, count(dislike) AS num_dislikes
ORDER BY news.post_date
LIMIT 20
RETURN *
I would do something like this.
MATCH (n:news)-[:POSTED_BY]->(r)
WHERE n.post_date > {recent_start_time}
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
To speed it up and keep Neo4j from searching over all your posts, I would probably index the post_date field (assuming it doesn't contain time information), and then send this query in for today, yesterday, etc. until you have your 20 posts.
MATCH (n:news {post_date: {day}})-[:POSTED_BY]->(r)
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
Is there a default way to match only the first n relationships, other than filtering with LIMIT n later?
I have this query:
START n=node({id})
MATCH n--u--n2
RETURN u, count(*) as cnt order by cnt desc limit 10;
but assuming the number of n--u relationships is very high, I want to relax this query and take, for example, the first 100 random relationships and then continue with u--n2...
This is for a collaborative filtering task, and assuming the users are more or less similar, I don't want to match all users u but only a random subset. This approach should be faster in performance - right now I get ~500ms query time but would like to drop it under 50ms.
I know I could break the above query into 2 separate ones, but still, in the first query it goes through all users and only later limits the output. I want to limit the max rels during the match phase.
You can pipe the current results of your query using WITH, then LIMIT those initial results, and then continue on in the same query:
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc
LIMIT 10;
The query above will give you the first 10 u nodes found, and then continue to find the first ten matching n2 nodes.
Optionally, you can leave off the second LIMIT and you will get all matching n2 nodes for the first ten u nodes (meaning you could have more than ten rows returned if they matched the first 10 u nodes).
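In other words, dropping the second LIMIT as described above would look like this:

```cypher
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc;
```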
This is not a direct solution to your question, but since I was running into a similar problem, my work-around might be interesting for you.
What I need to do is: get relationships by index (which might yield many thousands) and get the start node of these. Since the start node is always the same for that index query, I only need the very first relationship's start node.
Since I wasn't able to achieve that with Cypher (the query proposed by ean5533 does not perform any better), I am using a simple unmanaged extension (nice template).
@GET
@Path("/address/{address}")
public Response getUniqueIDofSenderAddress(@PathParam("address") String addr, @Context GraphDatabaseService graphDB) throws IOException
{
    try {
        RelationshipIndex index = graphDB.index().forRelationships("transactions");
        IndexHits<Relationship> rels = index.get("sender_address", addr);

        int unique_id = -1;
        // Only the first relationship's start node is needed, so close the
        // index hits and break out after the first iteration.
        for (Relationship rel : rels) {
            Node sender = rel.getStartNode();
            unique_id = (Integer) sender.getProperty("unique_id");
            rels.close();
            break;
        }
        return Response.ok().entity("Unique ID: " + unique_id).build();
    } catch (Exception e) {
        return Response.serverError().entity("Could not get unique ID.").build();
    }
}
For this case here, the speed up is quite nice.
I don't know your exact use case, but since Neo4j even supports HTTP streaming afaik, you should be able to convert your query to an unmanaged extension and still get the full performance.
E.g., "java-querying" all your qualifying nodes and emitting the partial results to the HTTP stream.