Neo4j performance - is it really fast?

I want to run some tests on Neo4j and compare its performance with other databases, in this case PostgreSQL.
The Postgres database has about 2,000,000 'contents' distributed across 3,000 'categories'. (This means there is a 'content' table, a 'category' table, and a 'content-to-category' join table, since one content can be in more than one category.)
So, mapping this to a Neo4j DB, I'm creating 'content' and 'category' nodes and their relationships (content to category, and content to content, because contents can have related contents):
category -> category (categories can have sub-categories)
content -> category
content -> content (related)
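In Cypher (1.9 syntax, which has no labels, hence the type property), that model might be created roughly like this, with all values illustrative:

CREATE (parent {type: 'category', code: 'electronics'}),
       (child {type: 'category', code: 'phones'}),
       (a {type: 'content', title: 'A', state: 'published'}),
       (b {type: 'content', title: 'B', state: 'published'}),
       (child)-[:BELONGS_TO]->(parent),
       (a)-[:BELONGS_TO]->(child),
       (a)-[:RELATED_TO]->(b)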
Do you think this 'schema' is OK for this type of domain?
Migrating all the data from PostgreSQL to Neo4j is taking forever (about 4-5 days). It is just searching for nodes and creating/updating them accordingly. (The search uses indexes, and each node insert/update takes about 500ms.)
Am I doing something wrong?
Migration is done, so I went on to try some querying...
I ended up with about 2,000,000 content nodes, 3,000 category nodes, and more than 4,000,000 relationships.
(Please note that I'm new to this whole Neo4j world, so I have no idea how to optimize Cypher queries...)
One of the queries I wanted to test is: get the 10 latest published contents of a given 'definition' in a given category (this includes contents that are in sub-categories of the given category).
Experimenting a little, I ended up with something like this:
START c = node:node_auto_index('type:category AND code:category_code'),
      n = node:node_auto_index('type:content AND state:published AND definitionCode:definition_name')
MATCH (c)<-[r:BELONGS_TO*]-(n)
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
This takes around 3 seconds, excluding the first run, which takes a lot more... is this normal?
What am I doing wrong?
Please note that I'm using Neo4j 1.9.2 and auto-indexing some node properties (type, code, state, definitionCode and published_stamp included; title is not auto-indexed).
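(For reference, node auto-indexing in Neo4j 1.x is configured in conf/neo4j.properties; a setup matching the properties above would look roughly like this:)

node_auto_indexing=true
node_keys_indexable=type,code,state,definitionCode,published_stamp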
Also, returning just 'c' from the previous query (START c = node:node_auto_index('type:category AND code:category_code') RETURN c;) is fast (again excluding the first run, which takes around 20-30ms).
Also, I'm not sure if this is the right way to use indexes...
Thank you in advance (sorry if something doesn't make sense; ask me and I'll try to explain better).

Have you looked at the batch import facilities: http://www.neo4j.org/develop/import? You really should look at that for the initial import - it will take minutes instead of days.
I will ask some of our technical folks to get back to you on some of the other stuff. You really should not be seeing this.
Rik

How many nodes are returned by this?
START n = node:node_auto_index('type:content AND state:published AND definitionCode:definition_name')
RETURN count(*)
I would try to let the graph do the work.
How deep are your hierarchies usually?
Usually you limit variable-length relationships to avoid a combinatorial explosion.
I would also use a different relationship-type between content and category than for the category tree.
Can you point out your current relationship-types?
START c = node:node_auto_index('type:category AND code:category_code')
MATCH (c)<-[:BELONGS_TO*..5]-(n)
WHERE n.type = 'content' AND n.state = 'published' AND n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
Can you try that?
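If you split the relationship types as suggested, the query could look roughly like this (CHILD_OF and IN_CATEGORY are illustrative names, not the poster's actual types):

START c = node:node_auto_index('type:category AND code:category_code')
MATCH (c)<-[:CHILD_OF*0..5]-(sub)<-[:IN_CATEGORY]-(n)
WHERE n.state = 'published' AND n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6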
For import it is easiest to generate CSV from your SQL and import that using http://github.com/jexp/batch-import

Are you running Linux, maybe on an ext4 filesystem?
You might want to set the barrier=0 mount option, as described here: http://structr.org/blog/neo4j-performance-on-ext4
Further discussion of this topic: https://groups.google.com/forum/#!topic/neo4j/nflUyBsRKyY

Related

Correct order of operations in neo4j - LOAD, MERGE, MATCH, WITH, SET

I am loading simple CSV data into Neo4j. The data is as follows:
uniqueId     compound     value  category
ACT12_M_609  mesulfen     21     carbon
ACT12_M_609  MNAF         23     carbon
ACT12_M_609  nifluridide  20     suphate
ACT12_M_609  sulfur       23     carbon
I am loading the data from a URL using the following query:
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE (t:Transaction {transactionId: row.uniqueId})
MERGE (c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category = row.category
ON CREATE SET r.price = row.value
Next I aggregate to count the total orders for each compound and set it as a property on the node, in the following way:
MATCH (c:Compound)<-[:CONTAINS]-(t:Transaction)
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want, but I have the following 2 questions:
How can I create the orders property for the Compound node in the first step itself? I.e., while loading the data I would like to perform the aggregation straight away.
For a Compound node I am also setting the category property. Theoretically, it could also be modelled as (category)-[:CONTAINS]->(compound) by creating a Category node. But what advantage would I gain by doing that? I can execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible: LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those, and then use those to create the real nodes, but that would be way more complicated (see Virtual Nodes/Rels).
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run a query where the category is a criterion (e.g. MATCH (c:Category {category_id: 12})-[r]-(:Compound)), it might be more performant to model it as a node with a label.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
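For example, the alternative model could be loaded with a sketch like this (IN_CATEGORY is an illustrative relationship type):

LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE (c:Compound {name: row.compound})
MERGE (cat:Category {name: row.category})
MERGE (c)-[:IN_CATEGORY]->(cat)

A category lookup then becomes a traversal, e.g. MATCH (:Category {name: 'carbon'})<-[:IN_CATEGORY]-(c) RETURN c.name.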

Cypher Performance with Multiple Relations

I am tasked with prototyping Neo4j as a replacement for our existing data mart, which stores data in Redshift/Postgres schemas. I have loaded an instance of Neo4j on an EC2 m5.xlarge server to model a marketing campaign, and I am trying to get simple counts of who saw my spots in a given demographic. The numbers produced are exactly the same as in my existing data mart, but I am surprised to see that performance is much slower: the same query to get a simple count of impressions by television network returns in 48 seconds, compared to 1.5 seconds in Redshift. The question is: am I doing something wrong in my Cypher (i.e. too many joins), or is this expected behavior? Here is a diagram of the model: [model diagram image]
Here is my Cypher to get the results in 48s:
match (c:Campaign{campaign_id:98})<-[:PART_OF]-(sa)
, (sa)-[:AIRED_ON]->(n)
, (n)-[:BELONGS_TO]->(ng:NetworkGroup{network_group_id:2})
, (sa)<-[:EXPOSED_WITH]-(e)
, (e)<-[se:CONTAINS_ENTITY]-(s:Sample{sample_id:2000005})
, (e)-[:MEMBER_OF]->(a:DemographicAudience{audience_id:2})
return c.campaign_id as `campaign_id`
, a.audience_id as `audience_id`
, a.audience_name as `audience_name`
, s.sample_id as `sample_id`
, n.network_id as `network_id`
, n.network_name as `network_name`
, n.network_call_sign as `network_call_sign`
, count(distinct sa.spot_airing_id) as `spot_airings`
, sum(se.weight) as `spot_impressions`
In addition, I believe all necessary constraints are added to optimize:
Indexes
ON :DemographicAudience(audience_id) ONLINE (for uniqueness constraint)
ON :Campaign(campaign_id) ONLINE (for uniqueness constraint)
ON :Entity(entity_id) ONLINE (for uniqueness constraint)
ON :Network(network_id) ONLINE (for uniqueness constraint)
ON :NetworkGroup(network_group_id) ONLINE (for uniqueness constraint)
ON :Sample(sample_id) ONLINE (for uniqueness constraint)
ON :SpotAiring(spot_airing_id) ONLINE (for uniqueness constraint)
Constraints
ON ( audience:DemographicAudience ) ASSERT audience.audience_id IS UNIQUE
ON ( campaign:Campaign ) ASSERT campaign.campaign_id IS UNIQUE
ON ( entity:Entity ) ASSERT entity.entity_id IS UNIQUE
ON ( network:Network ) ASSERT network.network_id IS UNIQUE
ON ( networkgroup:NetworkGroup ) ASSERT networkgroup.network_group_id IS UNIQUE
ON ( sample:Sample ) ASSERT sample.sample_id IS UNIQUE
ON ( spotairing:SpotAiring ) ASSERT spotairing.spot_airing_id IS UNIQUE
Running Neo4J 3.3.1 Community on AWS: https://aws.amazon.com/marketplace/pp/B073S5MDPV/?ref_=_ptnr_intuz_ami_neoj4
I should also mention that quite a bit of data is loaded: 24,154,440 nodes and 33,220,694 relationships. Most of these are relationships to entities.
My understanding was that Neo4j should go toe to toe with any RDBMS and even outperform it as the data grows. I'm hoping I am just being naive with my rookie Cypher skills. Any help would be appreciated.
Thanks.
Keep in mind that in Neo4j indexes are used to find starting points in the graph, and once those starting points are found, relationship traversal is used to expand out to find the paths which fit the pattern.
In this case we have several unique nodes, so we can save on some operations by ensuring that we match to all of these nodes first, and we should see Expand(Into) in our query plan instead of Expand(All) and then Filter. I have a hunch that the planner is only using index lookup on a single node, and the rest of them aren't using the index but using property access and filtering, which is less efficient.
In the case that the planner doesn't look up all your unique nodes first, before expansion, we can force it by introducing a LIMIT 1 after the initial match.
Lastly, it's a good idea to aggregate using the node itself rather than its properties if the properties in question are unique. This will use the underlying graph id for the node for comparison purposes rather than having to do more expensive property access.
Give this a try and see how it compares:
MATCH (c:Campaign{campaign_id:98}), (s:Sample{sample_id:2000005}), (a:DemographicAudience{audience_id:2}), (ng:NetworkGroup{network_group_id:2})
WITH c,s,a,ng
LIMIT 1
MATCH (c)<-[:PART_OF]-(sa)
, (sa)-[:AIRED_ON]->(n)
, (n)-[:BELONGS_TO]->(ng)
, (sa)<-[:EXPOSED_WITH]-(e)
, (e)<-[se:CONTAINS_ENTITY]-(s)
, (e)-[:MEMBER_OF]->(a)
WITH c, a, s, n, count(distinct sa) as `spot_airings`, sum(se.weight) as `spot_impressions`
RETURN c.campaign_id as `campaign_id`
, a.audience_id as `audience_id`
, a.audience_name as `audience_name`
, s.sample_id as `sample_id`
, n.network_id as `network_id`
, n.network_name as `network_name`
, n.network_call_sign as `network_call_sign`
, `spot_airings`
, `spot_impressions`
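To verify which plan you actually get, you can PROFILE just the first part; with the uniqueness constraints in place you would expect a NodeUniqueIndexSeek for each of the four nodes (a quick sanity check, sketch only):

PROFILE
MATCH (c:Campaign {campaign_id: 98}), (s:Sample {sample_id: 2000005}),
      (a:DemographicAudience {audience_id: 2}), (ng:NetworkGroup {network_group_id: 2})
RETURN c, s, a, ng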

Neo4j: removing direction from a relationship takes significantly more time

I have two queries which take significantly different amounts of time. What don't I understand about relationships in Cypher that causes this?
less than 1 second with 5 results:
allshortestpaths((et)-[*]->(st))
less than 1 second with 3 results:
allshortestpaths((et)<-[*]-(st))
Takes forever (timeout):
allshortestpaths((et)-[*]-(st))
Why does this take forever? I would assume that this just has to return 8 results!
Complete sample query:
profile match (s:Stop)--(st:Stoptime),
(e:Stop)--(et:Stoptime)
where s.name IN [ 'Schlump', 'U Schlump']
and e.name IN [ 'Hamburg Hbf', 'Hauptbahnhof Nord']
match p = allshortestpaths((et)-[*]->(st))
return p
With allshortestpaths((et)-[*]->(st)) you will have paths that look like this: (a)-->(b)-->(c)-->(d)
With allshortestpaths((et)<-[*]-(st)) you will have paths that look like this: (a)<--(b)<--(c)<--(d)
With allshortestpaths((et)-[*]-(st)) you will have paths that look like this:
(a)-->(b)-->(c)-->(d)
(a)<--(b)<--(c)<--(d)
(a)-->(b)<--(c)-->(d)
(a)<--(b)<--(c)-->(d)
(a)-->(b)-->(c)<--(d)
(a)-->(b)<--(c)<--(d)
Do you see the differences? The last one is a more complex query than the previous ones; that's why it can take a long time, especially when you don't specify the relationship type or the max depth of the path.
So here it's possible that you are asking Neo4j to traverse your whole database...
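If your data allows it, bounding the depth and naming the relationship type keeps the undirected search tractable, e.g. (the :NEXT type and the depth of 10 are illustrative; adjust to your model):

MATCH p = allshortestpaths((et)-[:NEXT*..10]-(st))
RETURN p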

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So can someone give me advice on how to improve the performance of my Cypher query, or how to tune Neo4j the right way?
I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties like countryCode, birthday and gender.
I imported all our users and relationships from the RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties, and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it’s huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships..
I made 3 indexes in neo4j :
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users: [execution plan image]
When I executed this query for a list of users, I got this result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1: plan with the expanded boxes: [execution plan image]
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
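As a rough illustration of the pre-calculation idea (a sketch, not what the extension actually does; the SIMILAR type and the weight cutoff are invented, and on a graph this size you would batch it per user):

// persist a SIMILAR relationship per user, so reads no longer re-aggregate
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(other:User)
WITH me, other, count(*) AS weight
WHERE weight > 10
MERGE (me)-[s:SIMILAR]->(other)
SET s.weight = weight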
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what we got via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
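Concretely, the first stage would then start only from me and expand, roughly like this (a sketch of the suggestion, untested; the rest of the query stays unchanged):

// no :User label and no USING INDEX on similar - filter after traversal
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar)
WHERE similar.birthday >= {target_age_gte} AND
      similar.birthday <= {target_age_lte} AND
      similar.countryCode = {target_country_code} AND
      similar.gender = {source_gender}
WITH similar, count(*) AS weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
RETURN similar, weight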
Of course, this all has to be verified by testing on your actual data.

Neo4j Facet Fields like Solr

I'm using Neo4j in a community e-commerce site built in PHP, via the REST interface.
I need to get all categories related to the search results, like Amazon does. This feature is available in other engines like Solr (another Lucene-based implementation) as faceted search.
How can I do a faceted search in Neo4j? Or what's the best way (performance-wise) to recreate this feature?
All modules related to this feature are excluded from the Neo4j core package. I want to know if someone has tried to do something like this without traversing all the nodes in the graph, grabbing some properties, and doing a groupCount of the values. With 200k nodes, the traversal took 10 seconds just to get the categories.
This is my Gremlin approach.
(new Neo4jVertexSequence(
    g.getRawGraph().index().forNodes('products').query(
        new org.neo4j.index.lucene.QueryContext('category:?')
    ), g
))._().groupBy{it.category}.cap.next();
Results in 90 rows and took 54 seconds.
Books = 12002
Movies_Music_Games = 19233
Electronics_Computers = 60540
Home_Garden_Tools = 9123
Grocery_Health_Beauty = 15643
Toys_Kids_Baby = 15099
Clothing_Shoes_Jewelry = 12543
Sports_Outdoors = 10342
Automotive_Industrial = 9638
... (more rows)
Of course, I can't cache these results, because this is for "non-input search" (arbitrary user queries). If the user makes a query like "iphone", the query looks like:
(new Neo4jVertexSequence(
    g.getRawGraph().index().forNodes('products').query(
        new org.neo4j.index.lucene.QueryContext('search:"iphone" AND category:?')
    ), g
))._().groupBy{it.category}.cap.next();
What about your domain model? Did you just put everything in the index? Usually you would model your categories as nodes and have your products related to the category nodes:
(product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
In your query you would just traverse this little tree and count the relationships of type :HAS_CATEGORY starting from each category node.
START categories = node(x)
MATCH (product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
RETURN category.name, count(*)
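To facet within a full-text search, the same traversal can be combined with a legacy index lookup for the search term (a sketch following the question's index setup):

START product = node:products('search:"iphone"')
MATCH (product)-[:HAS_CATEGORY]->(category)
RETURN category.name, count(*)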
