Neo4j 3.0: retrieve data within a certain period

I'm using Neo4j 3.0, and it seems that the HAS function was removed.
The type of my relationship is a date, and I'm looking to retrieve all relationships between two dates such that my property "a" is greater than a certain value.
How can I do this in Neo4j 3.0?
Thank you
[EDITED to add info from comments]
I have tried this:
MATCH p=(n:origin)-[r]->()
WHERE r>'2015-01'
RETURN AVG(r.amount) AS totalAmount;
I created a relationship per date, and each one has a property, amount. I'm looking to compute the average amount for a certain period; for example, the average amount since 2015-04.

To answer the issue raised by your first sentence: in Neo4j 3.x, the HAS() function was replaced by EXISTS().
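For example (using a hypothetical name property):
// Neo4j 2.x:  MATCH (n) WHERE HAS(n.name) RETURN n
// Neo4j 3.x:
MATCH (n) WHERE EXISTS(n.name) RETURN n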
[UPDATE 1]
This version of your query should work:
MATCH p=(n:origin)-[r]->()
WHERE TYPE(r) > '2015-01'
RETURN AVG(r.amount) AS totalAmount;
However, it is a bad idea to give your relationships different types based on a date. It is better to just use a date property.
[UPDATE 2]
If you changed your data model to add a date property to your relationships (to which I will give the type FOO), then the following query will find the average amount, per p, of all the relationships whose date is after 2015-01 (assuming that all your dates follow the same strict YYYY-MM pattern):
MATCH p=(n:origin)-[r:FOO]->()
WHERE r.date > '2015-01'
RETURN p, AVG(r.amount) AS avg_amount;
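If your current relationships encode the date in their type, a one-time migration along these lines should move that string into a property (a rough sketch; it assumes the type strings are the YYYY-MM dates, and it recreates the relationships because Cypher cannot rename a relationship type in place):
MATCH (n:origin)-[r]->(m)
// Copy the old type string into a date property on a new FOO relationship
CREATE (n)-[r2:FOO {date: TYPE(r), amount: r.amount}]->(m)
// Remove the old per-date relationship
DELETE r;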

Related

Datetime comparison query doesn't return any results

I'm trying to get a simple date-time comparison to work, but the query doesn't return any results.
The query is
MATCH (n:Event) WHERE n.start_datetime > datetime("2019-06-01T18:40:32.142+0100") RETURN n.start_datetime
According to this documentation page, this type of comparison should work. I've also tried creating the datetime object explicitly, for instance with datetime({year: 2019, month: 7}).
I've checked that start_datetime is in fact well formatted, by checking that values such as start_datetime.year were correct, and couldn't find any error.
Given that all the records in the database are from 2021, the above query should return every event, yet it returns nothing.
Doing the query using only the year comparison instead of the full datetime comparison works:
MATCH (n:Event) WHERE n.start_datetime.year > datetime("2019-06-01T18:40:32.142+0100").year RETURN n.start_datetime
Double-check the data type of start_datetime. It can be in either epoch seconds or epoch milliseconds. You need to convert the epoch value to a datetime, so that both sides of the comparison are the same data type. The reason your 2nd query works (.year) is that .year returns an integer value.
Run the query below to get samples:
MATCH (n:Event)
RETURN DISTINCT n.start_datetime LIMIT 5
If the values are 10 digits long, they are in epoch seconds. If so, run the query below:
MATCH (n:Event)
WHERE n.start_datetime is not null
AND datetime({epochSeconds: n.start_datetime}) > datetime("2019-06-01T18:40:32.142+0100")
RETURN n.start_datetime
LIMIT 25
It turns out the error was due to the timezone. Neo4j had saved the properties as LocalDateTime, which apparently can't be compared to ZonedDateTime.
I used py2neo for most of the node management, and the solution was to give a specific timezone to the Python property. In my case, this was done using:
datetime.datetime.fromtimestamp(kwargs["end"], pytz.UTC)
After that, I was able to do the comparisons.
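If reloading the data is not an option, a query-side workaround is to compare like with like, for example by using a LocalDateTime literal on the right-hand side (a sketch; note that this silently drops the +0100 offset, so the compared instant shifts by an hour):
MATCH (n:Event)
// Both sides are now LocalDateTime, so the comparison is valid
WHERE n.start_datetime > localdatetime("2019-06-01T18:40:32.142")
RETURN n.start_datetime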
Hope this saves a couple of hours for future developers.

Neo4j 3.4: how does the `date` function work?

I upgraded a Neo4j v3.3 instance to v3.4 to try out the new spatial and temporal functions.
I'm trying very simple queries, once with the date function and once without. The results are different.
match (r:Model) where r.open_date>"2018-04-26" return count(r);
Result is 19283.
match (r:Model) where r.open_date>date("2018-04-26") return count(r);
Result is 0.
What is the correct way to use the new functions?
[EDITED]
The new temporal types, like Date and Duration, are distinct types, and it does not make sense to compare them directly to strings or numbers.
Assuming r.open_date has the right format, this should work:
MATCH (r:Model)
WHERE DATE(r.open_date) > DATE("2018-04-26")
RETURN COUNT(r);
Also, the following query may be more performant (since a second DATE object does not need to be constructed):
MATCH (r:Model)
WHERE TOSTRING(DATE(r.open_date)) > "2018-04-26"
RETURN COUNT(r);
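If you control the data, it may be cleaner to convert the property to a native Date once, so that later comparisons need no per-row conversion at all (a sketch; it assumes every open_date string is strict YYYY-MM-DD):
// Hypothetical one-time migration from string to native Date
MATCH (r:Model)
WHERE r.open_date IS NOT NULL
SET r.open_date = DATE(r.open_date);
After that, the WHERE r.open_date > DATE("2018-04-26") comparison from the question works directly.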

Neo4j Cypher NULL values and descending sorting

I have auditing fields for all of my entities:
createDate
updateDate
I always initialize createDate during entity creation, but updateDate can remain NULL until the first update.
I have to implement a sorting feature over these fields.
With createDate everything works fine, but with updateDate I have issues.
With a mixed set of NULLs and dates in updateDate, a descending sort puts the NULLs first, and this is not what I'm expecting here.
I understand that, according to the Neo4j documentation, this is expected behavior - "When sorting the result set, null will always come at the end of the result set for ascending sorting, and first when doing descending sort." But I don't know how to implement proper sorting from the user's perspective, where the user sees the most recently updated documents at the top of the list. Some time ago I even created a GitHub issue for this feature: https://github.com/opencypher/openCypher/issues/238
One workaround I can see here is to also populate updateDate together with createDate during entity creation, but I really hate this solution.
Are there any other solutions to implement this properly?
You can try using the coalesce() function. It will return the first non-null value in the list of expressions passed to it.
MATCH (n:Node)
RETURN n
ORDER BY coalesce(n.updateDate, 0) DESC
EDIT:
From comments:
on the database level it is something like this: "updateDate": "2017-09-07T22:27:11.012Z". On the SDN4 level it is a Java java.util.Date type.
In this case you can replace the 0 with a date representing a Start-Of-Time constant (like "1970-01-01T00:00:00.000Z").
MATCH (n:Node)
RETURN n
ORDER BY coalesce(n.updateDate, "1970-01-01T00:00:00.000Z") DESC
I'd just use the createDate as the updateDate when updateDate IS NULL:
MATCH (n:Node)
RETURN n
ORDER BY coalesce(n.updateDate, n.createDate) DESC
You may want to consider storing your ISO 8601 timestamp strings as (millisecond) integers instead. That could make most queries that involve datetime manipulation more efficient (or even possible), and it would also use less DB space than the equivalent string.
One way to do that conversion is to use the APOC function apoc.date.parse. For example, this converts 2017-09-07T22:27:11.012Z to an integer (in millisecond units):
apoc.date.parse('2017-09-07T22:27:11.012Z', 'ms', "yyyy-MM-dd'T'HH:mm:ss.SSSX")
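A one-time migration could apply that conversion to the stored properties themselves (a sketch; it assumes every updateDate string follows the exact format above, and that APOC is installed):
MATCH (n:Node)
WHERE n.updateDate IS NOT NULL
// Replace the ISO 8601 string with its epoch-millisecond integer
SET n.updateDate = apoc.date.parse(n.updateDate, 'ms', "yyyy-MM-dd'T'HH:mm:ss.SSSX");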
With this change to your data model, you could also initialize updateDate to 0 at node creation time. This would allow you to avoid having to use COALESCE(n.updateDate, 0) for sorting purposes (as suggested by @Bruno Peres), and the 0 value would serve as an indication that the node was never updated.
(The drawback would be that all nodes would have an updateDate property, even the ones that were never updated.)
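A creation-time sketch of that convention (the :Node label and id are hypothetical; timestamp() returns epoch milliseconds):
// Initialize updateDate to 0, meaning "never updated"
CREATE (n:Node {id: 1, createDate: timestamp(), updateDate: 0});
// ...and refresh it on every update:
MATCH (n:Node {id: 1})
SET n.updateDate = timestamp();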

Neo4j Cypher performance on a simple match

I have a very simple Cypher query which gives me poor performance.
I have approx. 2 million users and 60 book categories, with around 28 million READ relationships from users to categories.
When I run this Cypher query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN DISTINCT bc.id;
It returns 8.5k rows within 2 to 2.5 minutes (first run).
And when I run this Cypher query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It returns 55k rows within 3 to 6 minutes (first run).
I already have indexes on User id and email, but I still don't think this performance is acceptable. Any idea how I can improve this?
First of all, you can profile your query to find out what happens under the hood.
It currently looks like the query scans all nodes in the database to complete.
Reasons:
Neo4j supports indexes only for the '=' operation (or 'IN')
To complete the query, it traverses all nodes one by one, checking whether each has a valid timestamp
There is no straightforward way to deal with this problem.
You should look into creating a proper graph structure to deal with time-specific queries more efficiently. There are several ways to represent time in graph databases.
You can take a look at the graphaware/neo4j-timetree library.
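The core idea of a time tree is to attach events to a hierarchy of year/month/day nodes, so a period can be matched structurally instead of by scanning timestamps. A rough sketch (the labels, ids, and ReadEvent node are hypothetical, not the library's actual schema):
// Build (or reuse) the calendar nodes for 2015-04-26
MERGE (y:Year {value: 2015})
MERGE (y)-[:HAS_MONTH]->(m:Month {value: 4})
MERGE (m)-[:HAS_DAY]->(d:Day {value: 26})
// Attach a read event to that day instead of storing only a timestamp
WITH d
MATCH (u:User {id: 42}), (bc:BookCategory {id: 7})
CREATE (u)-[:READ]->(e:ReadEvent)-[:OF_CATEGORY]->(bc),
       (e)-[:ON_DAY]->(d);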
Can you explain your model a bit?
Where are the books and the "reading" event in it?
As far as I understand, all you want to know is which book categories have been read recently (in the last month)?
You could create a second relationship type, RECENTLY_READ, which expires (is deleted) by a batch job when it is older than 30 days. (That can be two simple Cypher statements which create and delete those relationships.)
WITH (1000*60*60*24*30) AS month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
MERGE (a)-[rr:RECENTLY_READ]->(b)
// MERGE cannot take a WHERE clause; keep only the newest timestamp via CASE
SET rr.timestamp = CASE WHEN coalesce(rr.timestamp, 0) < read.timestamp
                        THEN read.timestamp ELSE rr.timestamp END;
WITH (1000*60*60*24*30) AS month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
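With those two statements running periodically, the original aggregation becomes a cheap structural match, with no timestamp scan at read time (a sketch):
MATCH (u:User)-[:RECENTLY_READ]->(bc:BookCategory)
RETURN DISTINCT bc.id;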
There is another way to achieve exactly what you want here, but unfortunately it's not possible in Cypher.
With a relationship index on timestamp on your READ relationship, you can run a Lucene NumericRangeQuery via Neo4j's Java API.
But I wouldn't really recommend going down this route.

Neo4j: Sum relationship properties where node properties equal Value A and Value B (intersection)

Basically my question is: how do I sum relationship properties where related nodes have properties equal to Value A and Value B?
For example:
I have a simple DB has the following relationship:
(site)-[:HAS_MEMBER]->(user)-[:POSTED]->(status)-[:TAGGED_WITH]->(tag)
On [:TAGGED_WITH] I have a property called "TimeSpent". I can easily SUM up all the time spent for a particular day and user by using the following query:
MATCH (user)-[:POSTED]->(updates)-[r:TAGGED_WITH]->(tags)
WHERE user.name = "Josh Barker" AND updates.date = 20141120
RETURN tags.name, SUM(r.TimeSpent) as totalTimeSpent;
This returns a nice table with tags and the associated time spent on each (e.g., #Meeting 4.5). However, the question arises when I want to do more advanced searches and say "Show me all the meetings for ProjectA" (i.e. #Meeting #ProjectA). Basically, I am looking for a query that gets all of the relationships where a single status has BOTH tags (and only if it has both). Then I can SUM those numbers up to get a total for the time I spent in meetings for #ProjectA.
How do I do this?
// r1 and r2 must be distinct variables: one relationship variable cannot bind two different relationships
MATCH (updates)-[r1:TAGGED_WITH]->(tag1 {name: 'Meeting'}),
      (updates)-[r2:TAGGED_WITH]->(tag2 {name: 'ProjectA'})
RETURN SUM(r1.TimeSpent) AS totalTimeSpent, COUNT(updates);
This should find all updates tagged with both of those things, and sum all time spent across all of those updates.
To create a generic solution where you may want one or more tags, you could use something like this, passing in the array of tags as a parameter (and using the size of the array instead of the hard-coded 2):
MATCH (user)-[:POSTED]->(update)-[r:TAGGED_WITH]->(tag)
WHERE user.name = "Josh Barker" AND update.date = 20141120 AND tag.name IN ['Meeting', 'ProjectA']
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
WHERE SIZE(tags) = 2
RETURN update, totalTimeSpent
As long as tag.name is indexed, this should be fast.
Edit - Remove User constraint
MATCH (update)-[r:TAGGED_WITH]->(tag)
WHERE tag.name IN ['Meeting', 'ProjectA']
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
WHERE SIZE(tags) = 2
RETURN update, totalTimeSpent
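For completeness, the fully parameterized version hinted at above might look like this (a sketch using the $param syntax of Neo4j 3.x; $tags is assumed to be something like ['Meeting', 'ProjectA']):
MATCH (update)-[r:TAGGED_WITH]->(tag)
WHERE tag.name IN $tags
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
// Keep only statuses that carry every requested tag
WHERE SIZE(tags) = SIZE($tags)
RETURN update, totalTimeSpent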
