My data set is quite complex, but to simplify what I want: I have a set of discussions that are labeled with a date and time, and each discussion has a topic assigned. I am trying to find patterns: if topic 1 appears, what is it usually followed by? I am observing daily trends. I currently have the following query:
MATCH (start:Label1)
WHERE start.topic = 'Topic1'
WITH start
MATCH (start)-[r:FOLLOWED_BY]->(end:Label2)
WITH count(end) AS ecount, r AS rel, collect(end.topic) AS topic
RETURN DISTINCT(rel.day) AS day, topic, sum(ecount) ORDER BY day DESC;
Which returns:
250 Topic1 2
250 Topic2 1
250 Topic3 3
While I want the following:
250 Topic1[2] Topic2[1] Topic3[3]
How do I achieve this? I tried to use collect and I got an error along the lines of: "don't know how to compare that".
It's a bit hard to say without seeing the collect that you tried, but what about this?
// count follow-ups per (day, topic) first, then build one "Topic[count]" string per topic and collect them per day
MATCH (start:Label1 {topic: 'Topic1'})-[r:FOLLOWED_BY]->(end:Label2)
WITH r.day AS day, end.topic AS topic, count(end) AS ecount
RETURN day, collect(topic + "[" + toString(ecount) + "]") AS topics
ORDER BY day DESC;
I have a list of nodes with a startTime property. I need to determine if the list contains a clump of 3 or more nodes with a startTime within 10 minutes of each other. I don't need to get the nodes that are in the clump, I just need a boolean indicating the existence of such a clump.
I am at a loss; everything I have tried fails so badly that it is not worth posting.
I feel that I am missing something easy.
This should be doable.
First you'll need to match the nodes, order them by startTime, and collect the startTimes.
From there, you'll need to get the relevant pairings (each entry, and the entry 2 indices ahead, which would close out a group of 3), then check whether the start times of any such pair occur within 10 minutes of each other.
Assuming for the sake of example :Event nodes with a startTime property, you might use this query to get the results you want:
MATCH (e:Event)
WITH e
ORDER BY e.startTime ASC
WITH collect(e.startTime) AS times
WITH times, range(0, size(times) - 3) AS indices
// true if any window of 3 consecutive start times spans 10 minutes or less
RETURN any(index IN indices WHERE times[index + 2] <= times[index] + duration({minutes:10}))
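This assumes startTime is a temporal value (e.g. a datetime), so the duration() arithmetic and the <= comparison work directly. If startTime were stored as epoch milliseconds instead (just an assumption, the question doesn't say), the same idea works with plain integer math:
MATCH (e:Event)
WITH e
ORDER BY e.startTime ASC
WITH collect(e.startTime) AS times
WITH times, range(0, size(times) - 3) AS indices
// 10 minutes expressed in milliseconds
RETURN any(index IN indices WHERE times[index + 2] <= times[index] + 10 * 60 * 1000)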
I've read the questions about subqueries but still stuck with this use case.
I have Documents that contain one or more keywords and each document has linked user comments with a status property. I want to get just the most recent status (if it exists) returned for each document in the query. If I run a query like the following, I just get one row.
MATCH (d:Document)-[:Keys]->(k:Keywords)
WITH d,k
OPTIONAL MATCH (d)--(c:Comments)
WITH d, k, c
ORDER BY c.created DESC LIMIT 1
RETURN d.Title as Title, k.Word as Keyword, c.Status as Status
I have hundreds of documents I want to return with the latest status like:
Title Keyword Status
War in the 19th Century WWI Reviewed
War in the 19th Century Weapons Reviewed
The Great War WWI Pending
World War I WWI <null>
I have tried multiple queries using the WITH clause but no luck yet. Any suggestions would be appreciated.
This query should do what you probably intended:
MATCH (d:Document)-[:Keys]->(k:Keywords)
OPTIONAL MATCH (d)--(c:Comments)
WITH d, COLLECT(k.Word) AS Keywords, c
ORDER BY c.created DESC
RETURN d.Title AS Title, Keywords, COLLECT(c.Status)[0] AS Status
Since a comment is related to a document, and not a document/keyword pair, it makes more sense to return a collection of keywords for each Title/Status pair. Your original query, if it had worked, would have returned the same Title/Status pair multiple times, each time with a different keyword.
We've got a knowledge base article explaining how to limit results of a match per-row, that should give you a few good options.
EDIT
Here's a full example, using apoc.cypher.run() to perform the limited subquery.
MATCH (d:Document)-[:Keys]->(k:Keywords)
WITH d, COLLECT(k.Word) AS Keywords
// collect keywords first so each document is on a single row
CALL apoc.cypher.run('
WITH $d AS d
OPTIONAL MATCH (d)--(c:Comments)
RETURN c
ORDER BY c.created DESC
LIMIT 1
', {d:d}) YIELD value
RETURN d.Title AS Title, Keywords, value.c.Status AS Status
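On more recent Neo4j versions (4.x and later) a plain CALL subquery can apply the same per-document limit without APOC. This is only a sketch along the same lines, not something from the original answer:
MATCH (d:Document)-[:Keys]->(k:Keywords)
WITH d, COLLECT(k.Word) AS Keywords
CALL {
  WITH d
  OPTIONAL MATCH (d)--(c:Comments)
  RETURN c
  ORDER BY c.created DESC
  LIMIT 1
}
RETURN d.Title AS Title, Keywords, c.Status AS Status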
I have a very simple Cypher query which gives me poor performance.
I have approx. 2 million users and 60 book categories, with around 28 million relationships from users to categories.
When I do this cypher:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);
It returns 8.5k rows within 2 to 2.5 minutes (first run).
And when I do this cypher:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It returns 55k rows within 3 to 6 minutes (first run).
I already have indexes on User id and email, but I still don't think this performance is acceptable. Any idea how I can improve this?
First of all, you can profile your query to find out what happens under the hood.
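For example, prefixing the original query with PROFILE shows the operators used and the db hits per step:
PROFILE
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);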
Currently it looks like the query scans all nodes in the database to complete.
Reasons:
Neo4j supports indexes only for the '=' operation (or 'IN')
To complete the query, it traverses the relationships one by one, checking each one for a valid timestamp
There is no straightforward way to deal with this problem.
You should look into creating a proper graph structure to deal with time-specific queries more efficiently. There are several ways to represent time in graph databases.
You can take a look at the graphaware/neo4j-timetree library.
Can you explain your model a bit?
Where are the books and the "reading" event in it?
Afaik all you want to know is which book categories have been read recently (in the last month)?
You could create a second relationship type, RECENTLY_READ, which expires (is deleted) by a batch job once it is older than 30 days. (That can be two simple Cypher statements which create and delete those relationships.)
WITH (1000*60*60*24*30) AS month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
// keep only the newest READ timestamp per (user, category) pair
WITH a, b, max(read.timestamp) AS latest
MERGE (a)-[rr:RECENTLY_READ]->(b)
SET rr.timestamp = latest;
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
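Once those two statements run as a periodic batch job, the monthly lookup from the question reduces to a plain traversal over RECENTLY_READ with no timestamp filtering (a sketch building on the suggestion above):
MATCH (u:User)-[:RECENTLY_READ]->(bc:BookCategory)
RETURN DISTINCT bc.id;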
There is another way to achieve exactly what you want here, but unfortunately it's not possible in Cypher.
With a relationship index on timestamp on your READ relationship, you can run a Lucene NumericRangeQuery via Neo4j's Java API.
But I wouldn't really recommend going down this route.
There are 2 students: a and b. Student a likes the subjects chem, phy, bio and b likes phy, math, bio. They have a workshop every week, and I would like to know the count of each of their subjects according to the workshops attended, in descending order.
Currently I am using this query to get the subjects a student likes based on the workshops attended:
MATCH (s:student{id:"1",name:"a"} )-[:workshop_attended]-(b:workshop)-[y:subject_likes]-(c:subjects) RETURN c,count(c) as total
Btw, is the above query correct w.r.t. the count and subjects?
Now I would like to know how many common subjects they have liked together by attending the workshops, and also the count of each of those subjects. How can I do it? I tried this but I always got 0 rows.
MATCH (s:student{id:"1",name:"a"} )-[:workshop_attended]-(b:workshop)-[y:subject_likes]-(c:subjects),
(s2:student{id:"2",name:"b"} )-[:workshop_attended]-(b2:workshop)-[y2:subject_likes]-(c:subjects)
RETURN c,count(c) as total
Also I tried:
MATCH (s:student{id:"1",name:"a"} )-[:workshop_attended]-(b:workshop)-[y:subject_likes]-(c:subjects),
(s2:student{id:"2",name:"b"} )-[:workshop_attended]-(b2:workshop)-[y2:subject_likes]-(l:subjects)
RETURN c,count(c),l,count(l) as total
Even that is wrong; also I get more rows for some reason. I really appreciate any help.
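For what it's worth, one way to express the common-subjects part (only a sketch using the labels and relationship types from the question, not an answer from the original thread) is to bind the same subjects node from both students' workshop paths and count the rows per student:
// subjects reachable from student a, with how many times they were liked via a's workshops
MATCH (:student {id:"1", name:"a"})-[:workshop_attended]-(:workshop)-[:subject_likes]-(c:subjects)
WITH c, count(*) AS countA
// keep only those subjects that student b also likes via attended workshops
MATCH (:student {id:"2", name:"b"})-[:workshop_attended]-(:workshop)-[:subject_likes]-(c)
RETURN c, countA, count(*) AS countB
ORDER BY countB DESC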
I need to model a forum with Neo4j. I have "forums" nodes which have messages and, optionally, these messages have replies: forum-->message-->reply
The cypher query I am using to retrieve the messages of a forum and their replies is:
start forum=node({forumId}) match forum-[*1..]->msg
where (msg.parent=0 and msg.ts<={ts} or msg.parent<>0)
return msg ORDER BY msg.ts DESC limit 10
This query retrieves the messages with time<=ts and all their replies (a message has parent=0 and a reply has parent<>0)
My problem is that I need to retrieve pages of 10 messages (limit 10) independently of the number or replies.
For example, if I had 20 messages and the first one with 100 replies, it would only return 10 rows: the first message and 9 replies but I need the first 10 messages and the 100 replies of the first one.
How can I limit the result based on the number of messages and not their replies?
The ts property is indexed, but is this query efficient when mixing it with other where clauses?
Do you know a better way to model this kind of forum with Neo?
Supposing you switch to labels and avoid IDs (as they can be recycled and therefore are not stable identifiers):
MATCH (forum:FORUM)<--(message:MESSAGE {parent:0})
WHERE forum.name = '%s' // where %s identifies the forum in a *stable* way
WITH message // splitting into a sub-query lets LIMIT apply only to the main messages
ORDER BY message.ts DESC
LIMIT 10
OPTIONAL MATCH (message)<-[:REPLIES_TO]-(replies)
RETURN message, replies
The only important change here is to split the message and reply matching into two sub-queries, so that the LIMIT clause applies to the first sub-query only.
However, you need to link the relevant replies to the matched main messages in the second subquery (I introduced a fictional relationship REPLIES_TO to link replies to messages).
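If you would rather have one row per main message, you could also collect the replies (again just a sketch, reusing the same fictional REPLIES_TO relationship):
MATCH (forum:FORUM)<--(message:MESSAGE {parent:0})
WHERE forum.name = '%s'
WITH message
ORDER BY message.ts DESC
LIMIT 10
OPTIONAL MATCH (message)<-[:REPLIES_TO]-(reply)
RETURN message, collect(reply) AS replies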
And when you need to fetch page 2, 3, 4, etc., you need an extra parameter (the smallest message timestamp of the previous page; let's call it previous_timestamp).
The first sub-query WHERE clause becomes:
WHERE forum.name = '%s' AND message.ts < previous_timestamp
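Put together, the query for a subsequent page might look like this (a sketch, where previous_timestamp stands for the parameter described above):
MATCH (forum:FORUM)<--(message:MESSAGE {parent:0})
WHERE forum.name = '%s' AND message.ts < previous_timestamp
WITH message
ORDER BY message.ts DESC
LIMIT 10
OPTIONAL MATCH (message)<-[:REPLIES_TO]-(replies)
RETURN message, replies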