I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but its performance does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So can someone advise me on how to improve the performance of my Cypher query, or how to tune Neo4j properly?
I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties such as countryCode, birthday and gender.
I imported all our users and relationships from an RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in Neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
[Image: execution plan]
When I executed this query for a list of users, I got the following result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest run is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths first and then filtering what is reached via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
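To make the trade-off above concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not the Neo4j API: the class `ExpandThenFilter`, the `User` type and the sample data are all hypothetical. `seekFirst` stands in for the index seek that builds the large set A, while `expandFirst` follows the me -> liked <- similar paths and only then applies the property filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy in-memory model with hypothetical names; this is NOT the Neo4j API.
class ExpandThenFilter {
    static final class User {
        final int birthYear;
        final String countryCode;

        User(int birthYear, String countryCode) {
            this.birthYear = birthYear;
            this.countryCode = countryCode;
        }
    }

    static boolean passesFilter(User u, int fromYear, int toYear, String countryCode) {
        return u.birthYear >= fromYear && u.birthYear <= toYear
                && u.countryCode.equals(countryCode);
    }

    // Plan 1: filter every user by property first (the large set "A").
    // Connectivity to `me` would still have to be checked for each id returned.
    static List<Integer> seekFirst(Map<Integer, User> users,
                                   int fromYear, int toYear, String countryCode) {
        List<Integer> candidates = new ArrayList<>();
        for (Map.Entry<Integer, User> e : users.entrySet()) {
            if (passesFilter(e.getValue(), fromYear, toYear, countryCode)) {
                candidates.add(e.getKey());
            }
        }
        return candidates;
    }

    // Plan 2: expand me -> liked <- similar, then filter only the users reached.
    // Returns similar user id -> number of shared likes (the "weight").
    static Map<Integer, Integer> expandFirst(int me, Map<Integer, User> users,
                                             Map<Integer, Set<Integer>> likes,
                                             int fromYear, int toYear, String countryCode) {
        // Reverse adjacency: target id -> ids of the users who like that target.
        Map<Integer, Set<Integer>> likedBy = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : likes.entrySet()) {
            for (Integer target : e.getValue()) {
                likedBy.computeIfAbsent(target, k -> new HashSet<>()).add(e.getKey());
            }
        }
        Map<Integer, Integer> weights = new TreeMap<>();
        for (Integer target : likes.getOrDefault(me, Collections.<Integer>emptySet())) {
            for (Integer other : likedBy.getOrDefault(target, Collections.<Integer>emptySet())) {
                if (other != me && passesFilter(users.get(other), fromYear, toYear, countryCode)) {
                    weights.merge(other, 1, Integer::sum);
                }
            }
        }
        return weights;
    }

    // Hypothetical sample data: user 1 is "me".
    static Map<Integer, User> sampleUsers() {
        Map<Integer, User> users = new HashMap<>();
        users.put(1, new User(1990, "DE"));  // me
        users.put(2, new User(1991, "DE"));  // shares two likes with me
        users.put(3, new User(1992, "DE"));  // passes the filter, not connected
        users.put(4, new User(1993, "DE"));  // passes the filter, not connected
        users.put(6, new User(1970, "FR"));  // connected, fails the filter
        users.put(10, new User(1988, "FR")); // liked profile
        users.put(11, new User(1989, "FR")); // liked profile
        return users;
    }

    static Map<Integer, Set<Integer>> sampleLikes() {
        Map<Integer, Set<Integer>> likes = new HashMap<>();
        likes.put(1, new HashSet<>(Arrays.asList(10, 11)));
        likes.put(2, new HashSet<>(Arrays.asList(10, 11)));
        likes.put(6, new HashSet<>(Arrays.asList(10)));
        return likes;
    }

    public static void main(String[] args) {
        System.out.println("seek first:   " + seekFirst(sampleUsers(), 1985, 1995, "DE").size() + " candidates");
        System.out.println("expand first: " + expandFirst(1, sampleUsers(), sampleLikes(), 1985, 1995, "DE"));
    }
}
```

In this toy data the property filter alone yields four candidates, most of them unconnected to `me`, whereas the expansion only ever looks at the two other users who share a liked profile. Whether the same ratio holds in the real graph is exactly what has to be measured.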
I am tasked with prototyping Neo4j as a replacement for our existing data mart, which stores data in Redshift/Postgres schemas. I have loaded a Neo4j instance running on an EC2 m5.xlarge server to model a marketing campaign, and I am trying to get simple counts of who saw my spots in a given demographic. The numbers produced are exactly the same as in my existing data mart, but I am surprised to see that performance is much slower. The same query to get a simple count of impressions by television network returns in 48 seconds, compared to 1.5 seconds in Redshift. The question is: am I doing something wrong in my Cypher (i.e. too many joins), or is this expected behavior? Here is a diagram of the model:
Here is my Cypher to get the results in 48s:
match (c:Campaign{campaign_id:98})<-[:PART_OF]-(sa)
, (sa)-[:AIRED_ON]->(n)
, (n)-[:BELONGS_TO]->(ng:NetworkGroup{network_group_id:2})
, (sa)<-[:EXPOSED_WITH]-(e)
, (e)<-[se:CONTAINS_ENTITY]-(s:Sample{sample_id:2000005})
, (e)-[:MEMBER_OF]->(a:DemographicAudience{audience_id:2})
return c.campaign_id as `campaign_id`
, a.audience_id as `audience_id`
, a.audience_name as `audience_name`
, s.sample_id as `sample_id`
, n.network_id as `network_id`
, n.network_name as `network_name`
, n.network_call_sign as `network_call_sign`
, count(distinct sa.spot_airing_id) as `spot_airings`
, sum(se.weight) as `spot_impressions`
In addition, I believe all the constraints needed for optimization have been added:
Indexes
ON :DemographicAudience(audience_id) ONLINE (for uniqueness constraint)
ON :Campaign(campaign_id) ONLINE (for uniqueness constraint)
ON :Entity(entity_id) ONLINE (for uniqueness constraint)
ON :Network(network_id) ONLINE (for uniqueness constraint)
ON :NetworkGroup(network_group_id) ONLINE (for uniqueness constraint)
ON :Sample(sample_id) ONLINE (for uniqueness constraint)
ON :SpotAiring(spot_airing_id) ONLINE (for uniqueness constraint)
Constraints
ON ( audience:DemographicAudience ) ASSERT audience.audience_id IS UNIQUE
ON ( campaign:Campaign ) ASSERT campaign.campaign_id IS UNIQUE
ON ( entity:Entity ) ASSERT entity.entity_id IS UNIQUE
ON ( network:Network ) ASSERT network.network_id IS UNIQUE
ON ( networkgroup:NetworkGroup ) ASSERT networkgroup.network_group_id IS UNIQUE
ON ( sample:Sample ) ASSERT sample.sample_id IS UNIQUE
ON ( spotairing:SpotAiring ) ASSERT spotairing.spot_airing_id IS UNIQUE
Running Neo4J 3.3.1 Community on AWS: https://aws.amazon.com/marketplace/pp/B073S5MDPV/?ref_=_ptnr_intuz_ami_neoj4
I should also mention that quite a bit of data is loaded: 24,154,440 nodes and 33,220,694 relationships. Most of these are relationships to entities.
My understanding was that Neo4j should go toe to toe with any RDBMS and even outperform it as data grows. I'm hoping I am just being naive with my rookie Cypher skills. Any help would be appreciated.
Thanks.
Keep in mind that in Neo4j indexes are used to find starting points in the graph, and once those starting points are found, relationship traversal is used to expand out to find the paths which fit the pattern.
In this case we have several unique nodes, so we can save on some operations by ensuring that we match to all of these nodes first, and we should see Expand(Into) in our query plan instead of Expand(All) and then Filter. I have a hunch that the planner is only using index lookup on a single node, and the rest of them aren't using the index but using property access and filtering, which is less efficient.
If the planner doesn't look up all your unique nodes first, before expansion, we can force it to by introducing a LIMIT 1 after the initial match.
Lastly, it's a good idea to aggregate using the node itself rather than its properties if the properties in question are unique. This will use the underlying graph id for the node for comparison purposes rather than having to do more expensive property access.
Give this a try and see how it compares:
MATCH (c:Campaign{campaign_id:98}), (s:Sample{sample_id:2000005}), (a:DemographicAudience{audience_id:2}), (ng:NetworkGroup{network_group_id:2})
WITH c,s,a,ng
LIMIT 1
MATCH (c)<-[:PART_OF]-(sa)
, (sa)-[:AIRED_ON]->(n)
, (n)-[:BELONGS_TO]->(ng)
, (sa)<-[:EXPOSED_WITH]-(e)
, (e)<-[se:CONTAINS_ENTITY]-(s)
, (e)-[:MEMBER_OF]->(a)
WITH c, a, s, n, count(distinct sa) as `spot_airings`, sum(se.weight) as `spot_impressions`
RETURN c.campaign_id as `campaign_id`
, a.audience_id as `audience_id`
, a.audience_name as `audience_name`
, s.sample_id as `sample_id`
, n.network_id as `network_id`
, n.network_name as `network_name`
, n.network_call_sign as `network_call_sign`
, `spot_airings`
, `spot_impressions`
I have a big graph model and I need to write the result of the following query to a CSV file.
Match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) return u.id as userId,i.product_number as itemId
When I EXPLAIN the query, this is the result I get:
It shows that the estimated result is around 9M rows. My problems are:
1) It takes a lot of time to get a response. From neo4j-shell it takes 38 minutes! Is this normal? BTW, I have all the schema indexes in place and they are all ONLINE.
2) When I use Spring Data Neo4j (SDN) to fetch the result, it throws a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error, which happens when SDN tries to convert the loaded data to our @QueryResult object.
I tried to optimize the query in many different ways, but nothing changed! My impression is that I am doing something wrong. Does anyone have any idea how I can solve this problem? Should I go for batch reads/writes?
P.S. I am using Neo4j Community Edition 3.0.1, and this is my sysinfo:
And these are my server configs:
dbms.jvm.additional=-Dunsupported.dbms.udc.source=tarball
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=3G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=3G
neostore.propertystore.db.strings.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=500M
neostore.propertystore.db.index.mapped_memory=500M
Although Neo4j will stream results to you as it matches them, SDN has to collect the output into a single @QueryResult object. To avoid OOM problems you'll need to either ensure your application has enough heap memory to load all 9M rows, or use the neo4j-shell, or use a purpose-built streaming interface such as https://www.npmjs.com/package/cypher-stream. (Caveat emptor: I haven't tried this, but it looks like it should do the trick.)
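The difference between collecting and streaming can be sketched in plain Java. This is only an illustration under assumptions: `StreamingCsv` and `writeRows` are hypothetical names, and the `Iterator` stands in for whatever cursor a streaming client hands back. Each row is written as soon as it is pulled, so memory use does not grow with the size of the result.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical helper; not part of SDN or of any Neo4j driver.
class StreamingCsv {
    // Writes each {userId, itemId} row as soon as it is pulled from the iterator,
    // so only one row is held in memory at a time. Returns the number of rows written.
    static long writeRows(Iterator<long[]> rows, Writer out) throws IOException {
        long written = 0;
        while (rows.hasNext()) {
            long[] row = rows.next();
            out.write(row[0] + "," + row[1] + "\n");
            written++;
        }
        out.flush();
        return written;
    }

    // Small demo on two in-memory rows; a real caller would pass a cursor over the query result.
    static String demo() {
        StringWriter out = new StringWriter();
        Iterator<long[]> rows = Arrays.asList(new long[]{1, 100}, new long[]{2, 200}).iterator();
        try {
            writeRows(rows, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```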
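Your config settings are not correct for Neo4j 3.0.1; the neostore.*.mapped_memory keys you list are not the settings 3.0 uses for heap and page cache.
You have to set the heap in conf/neo4j-wrapper.conf, e.g. 8G,
and the page cache in conf/neo4j.conf (looking at your store, you only need 2G for the page cache).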
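As a rough sketch of what that could look like on a 3.0.x install, assuming the stock setting names and that the wrapper file documents its heap values in megabytes (please check the comments in your own files before copying):

```properties
# conf/neo4j-wrapper.conf (Neo4j 3.0.x): JVM heap, values assumed to be in MB
dbms.memory.heap.initial_size=8192
dbms.memory.heap.max_size=8192

# conf/neo4j.conf: page cache used for the store files
dbms.memory.pagecache.size=2g
```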
Also as you can see it will create 8+ million rows.
You might have more luck with this query:
Match (u:USER)-[:PURCHASED]->(:ORDER)-[:HAS]->(i:ITEM)
with distinct u,i
return u.id as userId,i.product_number as itemId
It also doesn't make sense to return 8M rows to neo4j-shell, to be honest.
If you want to measure it, replace the RETURN with WITH and add a RETURN count(*)
Match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM)
with distinct u,i
WITH u.id as userId,i.product_number as itemId
RETURN count(*)
Another optimization could be to go via item and user and do a hash-join in the middle for a global query like this:
Match (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
USING JOIN ON o
with distinct u,i
WITH u.id as userId,i.product_number as itemId
RETURN count(*)
The other thing that I'd probably do to reduce the number of returned results is to try aggregation.
Match (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
with distinct u,i
WITH u, collect(distinct i) as products
WITH u.id as userId,[i in products | i.product_number] as items
RETURN count(*)
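The effect of that aggregation can be sketched in plain Java. `PurchasesPerUser` and its sample rows are hypothetical: duplicate (user, item) pairs collapse first, like `with distinct u,i`, and the items are then collected per user, so the result has one entry per user instead of one row per purchase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of "distinct, then collect per user"; not Neo4j code.
class PurchasesPerUser {
    // rows: {userId, itemId} pairs, possibly with duplicates
    // (the same item bought by the same user in several orders).
    static Map<Integer, TreeSet<Integer>> itemsPerUser(List<int[]> rows) {
        Map<Integer, TreeSet<Integer>> result = new TreeMap<>();
        for (int[] row : rows) {
            // The set drops duplicate items, the map groups them by user.
            result.computeIfAbsent(row[0], k -> new TreeSet<>()).add(row[1]);
        }
        return result;
    }

    static Map<Integer, TreeSet<Integer>> demo() {
        return itemsPerUser(Arrays.asList(
                new int[]{1, 10}, new int[]{1, 10}, new int[]{1, 11}, new int[]{2, 10}));
    }

    public static void main(String[] args) {
        // Four input rows become two entries, one per user.
        System.out.println(demo());
    }
}
```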
Thanks to Vince's and Michael's comments, I found a solution!
After some experiments it became clear that the server response time is actually good: 1.5 minutes for 9 million rows! The problem is with SDN, as Vince mentioned: the OOM happens when SDN tries to convert the data to a @QueryResult object. Increasing the heap memory of our application is not a permanent solution, as we will have more rows in the future. So we decided to use neo4j-jdbc-driver for big data queries, and it works like a jet! Here is the code example we used:
// Needs the neo4j-jdbc driver on the classpath; imports: java.sql.Connection, java.sql.DriverManager,
// java.sql.ResultSet, java.sql.Statement. "writer" is a java.io.BufferedWriter opened earlier.
Class.forName("org.neo4j.jdbc.Driver");
try (Connection con = DriverManager.getConnection("jdbc:neo4j:bolt://HOST:PORT", "USER", "PASSWORD")) {
    // Querying
    String query = "match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) return u.id as userId,i.product_number as itemId";
    con.setAutoCommit(false); // important for large datasets
    try (Statement st = con.createStatement()) {
        st.setFetchSize(50); // important for large datasets: hint to fetch rows in small batches
        try (ResultSet rs = st.executeQuery(query)) {
            while (rs.next()) {
                // Write each row immediately instead of collecting the whole result
                writer.write(rs.getInt("userId") + "," + rs.getInt("itemId"));
                writer.newLine();
            }
        }
    }
    writer.close();
}
Just make sure you use con.setAutoCommit(false) and st.setFetchSize(50) if you know that you are going to load a large dataset. Thanks, everyone!
Nodes with the Location label have an index on the name property, i.e. :Location(name).
Profiling the following query gives me a smart plan, with a NodeHashJoin between the two sides of the graph on either side of Trip nodes. Very clever. Works great.
PROFILE MATCH (rosen:Location)<-[:OCCURS_AT]-(ev:Event)<-[:HAS]-(trip:Trip)-[:OPERATES_ON]->(date:Date)
WHERE rosen.name STARTS WITH "U Rosent" AND
ev.scheduled_departure_time > "07:45:00" AND
date.date = '2015-11-20'
RETURN rosen.name, ev.scheduled_departure_time, trip.headsign
ORDER BY ev.scheduled_departure_time
LIMIT 20;
However, just changing one line of the query from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE id(rosen) = 4752371 AND
seems to alter the entire behavior of the query plan, which now appears to become more "sequential", losing the parallel execution of (Trip)-[:OPERATES_ON]->(Date)
Much slower. 6x more DB hits in total.
Question
Why does changing the retrieval of one, seemingly-unrelated Location node via a different index/mechanism alter the behavior of the whole query?
(I'm not sure how best to convey more information about the graph model, but please advise, and I'd be happy to add details that are missing)
Edit:
It gets better. Changing that query line from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE rosen.name = "U Rosenthaler Platz." AND
results in the same loss of parallelism in the query plan!
Seems odd that a LIKE query is faster than an = ?
I have a very simple Cypher query which gives me poor performance.
I have approx. 2 million users and 60 book categories, with around 28 million relationships from users to categories.
When I run this query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);
It returns 8.5k rows within 2 - 2.5 minutes (the first time).
And when I run this query:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It returns 55k rows within 3 to 6 minutes (the first time).
I already have indexes on User id and email, but I still don't think this performance is acceptable. Any idea how I can improve this?
First of all, you can profile your query to find out what happens under the hood.
Currently it looks like the query scans all nodes in the database in order to complete.
Reasons:
Neo4j supports indexes only for the '=' operation (or 'IN')
To complete the query, it traverses all nodes one by one, checking for each whether its READ relationship has a valid timestamp
There is no straightforward way to deal with this problem.
You should look into creating a proper graph structure to deal with time-specific queries more efficiently. There are several ways to represent time in graph databases.
You can take a look at the graphaware/neo4j-timetree library.
Can you explain your model a bit?
Where are the books and the "reading" event in it?
AFAIK all you want to know is which book categories have been read recently (in the last month)?
You could create a second type of relationship, RECENTLY_READ, which expires (is deleted by a batch job) once it is older than 30 days. (That can be two simple Cypher statements which create and delete those relationships.)
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
MERGE (a)-[rr:RECENTLY_READ]->(b)
// MERGE does not accept a WHERE clause, so filter in a WITH before the SET
WITH rr, read
WHERE coalesce(rr.timestamp,0) < read.timestamp
SET rr.timestamp = read.timestamp;
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
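If the batch job is driven from application code, the same 30-day cutoff has to be computed there as well. The following is a minimal sketch with hypothetical names (`RecentWindow`, `cutoffMillis`, `isRecent`). One thing worth noting: Cypher integers are 64-bit, but in Java the literal product 1000*60*60*24*30 is evaluated as a 32-bit int and overflows, so the window is computed as a long here.

```java
import java.time.Duration;

// Hypothetical helper for a batch job; not part of any Neo4j library.
class RecentWindow {
    // 30 days in milliseconds as a long: 2,592,000,000.
    static final long WINDOW_MILLIS = Duration.ofDays(30).toMillis();

    // The same product evaluated with int arithmetic silently wraps around.
    static int overflowedIntWindow() {
        return 1000 * 60 * 60 * 24 * 30;
    }

    // Oldest timestamp that still counts as "recent" at the given time.
    static long cutoffMillis(long nowMillis) {
        return nowMillis - WINDOW_MILLIS;
    }

    // Mirrors the Cypher predicate: read.timestamp >= timestamp() - month
    static boolean isRecent(long readTimestampMillis, long nowMillis) {
        return readTimestampMillis >= cutoffMillis(nowMillis);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println("window (long): " + WINDOW_MILLIS);
        System.out.println("window (int):  " + overflowedIntWindow());
        System.out.println("cutoff:        " + cutoffMillis(now));
    }
}
```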
There is another way to achieve what you exactly want to do here, but it's unfortunately not possible in Cypher.
With a relationship-index on timestamp on your read relationship you can run a Lucene-NumericRangeQuery in Neo4j's Java API.
But I wouldn't really recommend going down this route.
I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say People, in a tabular format that has indexed properties for sorting and filtering.
I'm guessing the type of a node can be stored as a link, say Bob -> type -> Person, which would allow us to retrieve all People.
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc. (NOTE: some of this data will be properties, some links to other nodes, which could be denormalised as properties for the sake of table display and simplicity)
Show me all People sorted by Age
Show me all People with Age < 30
Also, a quick how-to for the above (or a link to the place in the docs describing it) would be lovely.
Thanks very much!
Oh, and if the above isn't a good idea, please suggest a storage solution which allows both graph-like and relational-like retrieval.
If you want to operate on these person nodes, you can put them into an index (the default is Lucene) and then retrieve and sort the nodes using Lucene (see for instance "How do I sort Lucene results by field value using a HitCollector?" on how to do a custom sort in Java). This will get you, for instance, People sorted by Age. The code in Neo4j could look like this:
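// Legacy index API; imports: org.neo4j.graphdb.*, org.neo4j.graphdb.index.*,
// org.neo4j.index.lucene.QueryContext, org.apache.lucene.search.Sort, org.apache.lucene.search.SortField
Index<Node> personIndex;
Transaction tx = neo4j.beginTx();
try {
    IndexManager idxManager = neo4j.index();
    personIndex = idxManager.forNodes("persons");
    personIndex.add(meNode, "name", meNode.getProperty("name"));
    personIndex.add(youNode, "name", youNode.getProperty("name"));
    tx.success();
} finally {
    tx.finish();
}
// Prepare a custom Lucene query context with the Neo4j API
QueryContext query = new QueryContext("name:*").sort(new Sort(new SortField("name", SortField.STRING, true)));
IndexHits<Node> results = personIndex.query(query);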
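To show why an ordered index helps with both requests from the question ("sorted by Age" and "Age < 30"), here is a small sketch in plain Java. It uses a `TreeMap` as a stand-in for the index; `AgeIndex` and the sample people are hypothetical and none of this is Neo4j or Lucene API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for an ordered index on the age property.
class AgeIndex {
    private final TreeMap<Integer, List<String>> namesByAge = new TreeMap<>();

    void add(String name, int age) {
        namesByAge.computeIfAbsent(age, k -> new ArrayList<>()).add(name);
    }

    // "Show me all People sorted by Age": the keys are already kept in order.
    List<String> sortedByAge() {
        List<String> out = new ArrayList<>();
        for (List<String> names : namesByAge.values()) {
            out.addAll(names);
        }
        return out;
    }

    // "Show me all People with Age < limit": only the matching key range is visited.
    List<String> youngerThan(int limit) {
        List<String> out = new ArrayList<>();
        for (List<String> names : namesByAge.headMap(limit, false).values()) {
            out.addAll(names);
        }
        return out;
    }

    static AgeIndex demo() {
        AgeIndex index = new AgeIndex();
        index.add("Bob", 42);
        index.add("Alice", 29);
        index.add("Carol", 35);
        index.add("Dave", 25);
        return index;
    }

    public static void main(String[] args) {
        System.out.println(demo().sortedByAge());
        System.out.println(demo().youngerThan(30));
    }
}
```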
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index("name:E*") MATCH people-[r]->() RETURN people.name, r.age ORDER BY r.age ASC
in order to return data on both the node and the relationships.
Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.