I have a very simple Cypher query that performs poorly.
I have approximately 2 million User nodes and 60 BookCategory nodes, with around 28 million relationships from users to categories.
When I run this query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);
It returns 8.5k rows in 2 to 2.5 minutes (on the first run).
And when I run this query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It returns 55k rows in 3 to 6 minutes (on the first run).
I already have indexes on User id and email, but I still don't think this performance is acceptable. Any idea how I can improve it?
First of all, you can profile your query to find out what happens under the hood.
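As a minimal sketch, profiling the first query from the question only requires prefixing it with PROFILE. The query text below is taken unchanged from the question; the operators and db hit counts reported will depend on your data and Neo4j version.

```cypher
// PROFILE runs the query and reports the operators used and the db hits per step.
PROFILE
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN DISTINCT bc.id;
```

If you only want to see the plan without executing the query, EXPLAIN can be used in place of PROFILE.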
Currently it looks like the query scans every node in the database to complete.
Reasons:
Neo4j supports index lookups only for the '=' operation (or 'IN'), so a range condition such as '>=' cannot use one
To complete the query, it traverses every node one by one and checks each READ relationship for a valid timestamp
There is no straightforward way to deal with this problem.
You should look into creating a proper graph structure to handle time-specific queries more efficiently. There are several ways to represent time in graph databases.
You can take a look at the graphaware/neo4j-timetree library.
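One way to picture such a structure, as a rough sketch only: store each read as an event node attached to a Day node, so that a time filter becomes an exact-match lookup on days. The labels ReadEvent and Day, the relationship types PERFORMED, OF and ON, the date format and the literal ids are all assumptions made for illustration. This is plain Cypher, not the API of the neo4j-timetree library, and it uses the older CREATE INDEX ON syntax. Run each statement separately.

```cypher
// Hypothetical model: index the day so lookups by date are exact matches.
CREATE INDEX ON :Day(date);

// Record one read as an event node linked to its day (ids and date are placeholders).
MATCH (u:User {id: 1}), (bc:BookCategory {id: 7})
MERGE (d:Day {date: '2015-06-01'})
CREATE (u)-[:PERFORMED]->(e:ReadEvent {timestamp: timestamp()})-[:OF]->(bc)
CREATE (e)-[:ON]->(d);

// Categories read on a given set of days: IN on an indexed property, no range filter.
MATCH (d:Day)<-[:ON]-(:ReadEvent)-[:OF]->(bc:BookCategory)
WHERE d.date IN ['2015-06-01', '2015-06-02']
RETURN DISTINCT bc.id;
```

The application would have to build the list of days for "the last 30 days" itself, which is the trade-off for avoiding the range comparison.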
Can you explain your model a bit?
Where are the books and the "reading"-Event in it?
As far as I can tell, all you want to know is which book categories have been read recently (in the last month)?
You could create a second relationship type, RECENTLY_READ, which expires (is deleted by a batch job) once it is older than 30 days. (That can be two simple Cypher statements, one that creates and one that deletes those relationships.)
WITH (1000*60*60*24*30) AS month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
// MERGE does not accept a WHERE clause, so reduce to the latest read per pair first
WITH a, b, max(read.timestamp) AS latest
MERGE (a)-[rr:RECENTLY_READ]->(b)
SET rr.timestamp = latest;
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
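With those two statements run regularly, the original question could be answered without any timestamp comparison at read time. A minimal sketch, assuming the RECENTLY_READ relationships are kept up to date by the batch job described above:

```cypher
// Every RECENTLY_READ relationship is at most about 30 days old by construction,
// so no timestamp filter is needed here.
MATCH (:User)-[:RECENTLY_READ]->(bc:BookCategory)
RETURN DISTINCT bc.id;
```

The result is only as fresh as the last run of the batch job, so how often to schedule it depends on how much staleness you can tolerate.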
There is another way to achieve exactly what you want here, but unfortunately it's not possible in Cypher.
With a relationship index on the timestamp property of your READ relationships, you can run a Lucene NumericRangeQuery through Neo4j's Java API.
But I wouldn't really recommend going down this route.
Related
I am loading simple CSV data into Neo4j. The data looks as follows:
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I am loading the data from a URL using the following query:
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count the total orders for a compound and store the result as a node property in the following way:
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want, but I have the following 2 questions:
1. How can I create the orders property for the Compound node in the first step itself, i.e. perform the aggregation straight away while I am loading the data?
2. For a Compound node I am also setting the category property. In theory, it could also be modelled as category -contains-> compound by creating a Category node. But what advantage would that give me? I can already run my queries and get the expected output without creating this additional node.
Thank you for your answer.
1. I don't think that's possible. LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those, and then use them to create the real nodes, but that would be way more complicated (see Virtual Nodes/Rels).
2. That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run a query where the category is a criterion (e.g. MATCH (c:Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to create a node for it.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
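If you do decide to model categories as nodes, a small sketch of the conversion could look like this. The Category label, its name property and the reuse of the CONTAINS relationship type follow the wording of the question but are otherwise assumptions; adjust them to your own naming.

```cypher
// Turn the existing category property into Category nodes linked to each compound.
MATCH (c:Compound)
WHERE c.category IS NOT NULL
MERGE (cat:Category {name: c.category})
MERGE (cat)-[:CONTAINS]->(c);

// A category-driven query then starts from one Category node and follows relationships.
MATCH (:Category {name: 'carbon'})-[:CONTAINS]->(c:Compound)
RETURN c.name;
```

An index on :Category(name) would help both the MERGE and the lookup once the number of categories grows.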
I have 2 CSV files which I am trying to load into a Neo4j database using Cypher: drivers.csv, which holds every Formula 1 driver, and lap_times.csv, which stores every lap ever raced in F1.
I have managed to load in all of the nodes, although the lap times file is very large so it took quite a long time! I then tried to add the relationships afterwards, but there are so many to add that I gave up waiting (it was taking multiple days and still had not loaded in fully).
I’m pretty sure there is a way to load in the nodes and relationships at the same time, which would allow me to use periodic commit for the relationships which I cannot do right now. Essentially I just need to combine the 2 commands into one and after some attempts I can’t seem to work out how to do it?
// load in the lap_times.csv, changing the variable names - about half million nodes (takes 3-4 days)
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv'
AS row
MERGE (lt: lapTimes {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
RETURN lt;
// add a relationship between laptimes, drivers and races - takes 3-4 days
MATCH (lt:lapTimes),(d:Driver),(r:race)
WHERE lt.raceId = r.raceId AND lt.driverId = d.driverId
MERGE (d)-[rel8:LAPPING_AT]->(lt)
MERGE (r)-[rel9:TIMED_LAP]->(lt)
RETURN type(rel8), type(rel9)
Thanks in advance for any help!
You should review the documentation for indexes here:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/
Basically, indexes, once created, allow quick lookups of nodes of a given label, for the given property or properties. If you DON'T have an index and you do a MATCH or MERGE of a node, then for every row of that MATCH or MERGE, it has to do a label scan of all nodes of the given label and check all of their properties to find the nodes, and that becomes very expensive, especially when loading CSVs because those operations are likely happening for each row in the CSV.
For your :lapTimes nodes (though we would recommend you use singular labels in most cases), if there are none of them in your graph to start with, then a CREATE instead of a MERGE is fine. You may want a composite index on :lapTimes(raceId, driverId, lap), since that should uniquely identify the node, if you need to look it up later. Using CREATE instead of MERGE here should process much much faster.
Your second query should be MATCHing on :lapTimes nodes (label scan), and from each doing an index lookup on the :race and :driver nodes, so indexes are key here for performance.
You need indexes on: :race(raceId) and :Driver(driverId).
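As a sketch, the indexes mentioned so far could be created as follows, using the labels exactly as they appear in the question. This uses the older CREATE INDEX ON syntax that matches the rest of this thread; newer Neo4j versions use the CREATE INDEX FOR (n:Label) ON (n.prop) form instead, and composite indexes need a version that supports them.

```cypher
// Lookup indexes needed by the relationship-building query.
CREATE INDEX ON :race(raceId);
CREATE INDEX ON :Driver(driverId);

// Optional composite index for finding a single lap node later.
CREATE INDEX ON :lapTimes(raceId, driverId, lap);
```

Run each statement on its own and wait until the indexes are ONLINE before starting the load or the relationship query.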
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)
You might consider CREATE instead of MERGE for the relationships, if you know there are no duplicate entries.
I removed your RETURN because returning the types isn't useful information.
Also, consider using consistent case for your node labels, and make sure you use the same case in the labels in your graph and in the indexes you create.
Also, you would probably want to batch these changes instead of trying to process them all at once.
If you install APOC Procedures you can make use of apoc.periodic.iterate(), which can be used to batch changes, which will be faster and easier on your heap. You will still need indexes first.
CALL apoc.periodic.iterate("
MATCH (lt:lapTimes)
WITH lt, lt.raceId as raceId, lt.driverId as driverId
MATCH (d:Driver), (r:race)
WHERE r.raceId = raceId AND d.driverId = driverId
RETURN lt, d, r",
"MERGE (d)-[:LAPPING_AT]->(lt)
MERGE (r)-[:TIMED_LAP]->(lt)", {}) YIELD batches, total, errorMessages
RETURN batches, total, errorMessages
Single CSV load
If you want to handle everything all at once in a single CSV load, you can do that, but again you will need indexes first. Here's what you'll need at a minimum:
CREATE INDEX ON :Driver(driverId);
CREATE INDEX ON :Race(raceId);
After those are created, you can use this, assuming you are starting from scratch (I fixed the case of your labels and made them singular):
USING PERIODIC COMMIT 25000
LOAD CSV WITH HEADERS from 'file:///lap_times.csv' AS row
MERGE (d:Driver {driverId:row.driverId})
MERGE (r:Race {raceId:row.raceId})
CREATE (lt:LapTime {raceId: row.raceId, driverId: row.driverId, lap: row.lap, position: row.position, time: row.time, milliseconds: row.milliseconds})
CREATE (d)-[:LAPPING_AT]->(lt)
CREATE (r)-[:TIMED_LAP]->(lt)
I'm using Neo4j 3.0 and it seems that the HAS function was removed.
The type of my relationship is a date, and I'm looking to retrieve all relationships between two dates where my property "a" is greater than a certain value.
How can I test this using Neo4j?
Thank you
[EDITED to add info from comments]
I have tried this:
MATCH p=(n:origin)-[r]->()
WHERE r>'2015-01'
RETURN AVG(r.amount) as totalamout;
I created one relationship per date, each with an amount property, and I want to compute the average amount for a certain period. For example, the average amount since 2015-04.
To answer the issue raised by your first sentence: in neo4j 3.x, the HAS() function was replaced by EXISTS().
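A minimal sketch of the replacement, using the amount property and the origin label from the question. Note that this applies to the 3.x line discussed here; later Neo4j versions moved property existence checks to IS NOT NULL.

```cypher
// Neo4j 3.x: EXISTS() takes the place of the removed HAS() for property checks.
MATCH (n:origin)-[r]->()
WHERE exists(r.amount)
RETURN count(r) AS relationships_with_amount;
```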
[UPDATE 1]
This version of your query should work:
MATCH p=(n:origin)-[r]->()
WHERE TYPE(r) > '2015-01'
RETURN AVG(r.amount) as totalamout;
However, it is a bad idea to give your relationships different types based on a date. It is better to just use a date property.
[UPDATE 2]
If you changed your data model to add a date property to your relationships (to which I will give the type FOO), then the following query will find the average amount, per p, of all the relationships whose date is after 2015-01 (assuming that all your dates follow the same strict YYYY-MM pattern):
MATCH p=(n:origin)-[r:FOO]->()
WHERE r.date > '2015-01'
RETURN p, AVG(r.amount) as avg_amout;
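To cover the "between two dates" and "greater than a certain value" parts of the question, a sketch along the same lines could add an upper bound and a property threshold. The FOO type and the date property come from the answer above; the end date and the threshold of 100 are placeholder values chosen for illustration.

```cypher
// String comparison works here only because every date follows the same YYYY-MM pattern.
MATCH (n:origin)-[r:FOO]->()
WHERE r.date >= '2015-04' AND r.date <= '2015-09'
  AND r.amount > 100
RETURN avg(r.amount) AS avg_amount;
```

Dropping p from the RETURN clause gives one overall average for the period rather than one row per path.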
I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So can someone give me advice on how to improve the performance of my Cypher query, or on how to tune Neo4j the right way?
I am using neo4j-enterprise-2.3.1 which is running on c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with others users - LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties such as countryCode, birthday and gender.
I imported all our users and relationships from an RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from neo4j-import tool said that :
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it’s a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in Neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
[image: execution plan]
When I executed this query for a list of users, I got this result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest run is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
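To illustrate the pre-calculation idea in Cypher terms only (this is not how the linked extension works), similar users could be computed offline and stored as relationships. The SIMILAR_TO type, the weight property and the cut-off of 100 are assumptions; the parameter uses the {param} syntax of the Neo4j 2.3 release in the question.

```cypher
// Offline step, run per user: store the strongest co-likers as SIMILAR_TO relationships.
MATCH (me:User {userId: {source_user_id}})-[:LIKE|:MATCH]->()<-[:LIKE|:MATCH]-(similar:User)
WHERE similar <> me
WITH me, similar, count(*) AS weight
ORDER BY weight DESC
LIMIT 100
MERGE (me)-[s:SIMILAR_TO]->(similar)
SET s.weight = weight;
```

The online query could then start from (me)-[:SIMILAR_TO]->(similar) instead of recomputing the two-hop expansion on every request.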
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what is found via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
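As a sketch of that suggestion, here is only the first stage of the query with the :User label and the USING INDEX hint removed for similar. The parameter names are those from the question; whether this is actually faster depends on your data, as noted above.

```cypher
// No :User label on similar and no index hint for it, which leaves the planner
// free to anchor on me and filter only the nodes reached by traversal.
MATCH (me:User {userId: {source_user_id}})-[:LIKE|:MATCH]->()<-[:LIKE|:MATCH]-(similar)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) AS weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
RETURN similar, weight;
```

Running this with PROFILE before and after the change shows whether the plan now starts from the single me node.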
So I created a date dimension from this article
a link
I modified it and added a datestamp property to the Day node, formatted as Month/Day/Year (string).
I added indexes on Year.year, Month.month, Day.day and Day.datestamp.
When I run this query:
MATCH p=(day2:Day {datestamp:'1/1/2015'})-[:NEXT*]->(day {day:2})
return length(p)
limit 5
It takes 1667 ms to execute
When I modify the query to this:
MATCH p=(day2:Day {datestamp:'1/1/2015'})-[:NEXT*]->(day {datestamp:'1/2/2015'})
return length(p)
After it runs for about a minute, it ends with an Unknown Error message.
My schema is:
Indexes
ON :Day(day) ONLINE
ON :Day(datestamp) ONLINE
ON :Month(month) ONLINE
ON :Year(year) ONLINE
No constraints
Any ideas what I'm doing wrong?
I think I figured it out.
It looks like the first query, which runs in 1667 ms, only completes because of LIMIT 5: it finds 5 records and stops further execution.
The other one keeps going and going until it runs out of juice.
I think the solution in this case is a constraint declaring datestamp unique, which should prevent further execution.
Still interesting, considering there are only about 2600+ records connected with HAS_NEXT, so traversing those relationships shouldn't take this long to find out that only 1 record matches that query.
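A sketch of what that could look like, in the older constraint syntax that matches the schema listing above. Whether the constraint alone stops the runaway traversal would need testing. The second statement also adds a :Day label to the end node, since a node pattern without a label cannot use the :Day(datestamp) index, and caps the path length; the cap of 10 hops is an arbitrary assumption.

```cypher
// An index and a uniqueness constraint cannot coexist on the same property,
// so drop the plain index first.
DROP INDEX ON :Day(datestamp);
CREATE CONSTRAINT ON (d:Day) ASSERT d.datestamp IS UNIQUE;

// Label both ends and cap the hop count (10 is an arbitrary assumption).
MATCH p=(:Day {datestamp:'1/1/2015'})-[:NEXT*..10]->(:Day {datestamp:'1/2/2015'})
RETURN length(p)
LIMIT 1;
```

Run the schema statements separately from the query, and use PROFILE on the query to confirm that both ends are found through the index.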