RelationshipGroupStore mapped_memory setting for BatchInserter? - neo4j

I'm inserting relationships into Neo4j via the BatchInserter (as part of the initial data insertion). The process quickly slows down to about 1,000,000 relationships per hour. The rest of the data (12,390,251 relationships, 6 million nodes, 74 million properties) goes in in less than an hour.
I've identified the culprit as the RelationshipGroupStore file. It starts off growing in 2 MB increments, but eventually slows down to growing by only a few bytes at a time.
I'm curious whether this is related to memory mapping, but there doesn't seem to be a mapped-memory option for the RelationshipGroupStore in the BatchInserter (and probably not in the rest of the kernel either).
I've tried putting the following into the configuration, but it did not seem to have an effect:
neostore.relationshipgroupstore.db.mapped_memory=2G
Any ideas? Should I not do this in the BatchInserter (50 million relationships for my current dataset, many more for others)? I also believe this was much faster with older versions of Neo4j. As you can imagine, 50 million relationships over 6 million nodes creates lots of highly connected nodes, which Neo4j 2.1.x has handled very well when it comes to query speed. I'm just trying to speed up insertion now.
Neo4j 2.1.5
Java: Various Versions, 1.7, 1.8
OS: Ubuntu, Mac OS X, CentOS
Neo4j Configuration:
neostore.propertystore.db.index.mapped_memory=1M
neostore.propertystore.db.strings.mapped_memory=1G
neostore.propertystore.db.index.keys.mapped_memory=1M
store_dir=data.db
dump_configuration=true
use_memory_mapped_buffers=true
neostore.propertystore.db.arrays.mapped_memory=512M
neostore.propertystore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=2G
neostore.nodestore.db.mapped_memory=1G
JVM Settings
"-Xms10G"
"-Xms10G"
"-XX:+UseConcMarkSweepGC"
"-XX:+UseBiasedLocking"
"-XX:+AggressiveOpts"
"-XX:+UseCompressedOops"
"-XX:+UseFastAccessorMethods"
"-XX:+DoEscapeAnalysis"
"-Xss4096k"
"-d64"
"-server"
Here are the logs showing the speed decrease:
2014-Oct-19 00:42:43 -0500 - 1000000 correlations recorded
2014-Oct-19 01:06:36 -0500 - 2000000 correlations recorded
2014-Oct-19 01:33:45 -0500 - 3000000 correlations recorded
2014-Oct-19 02:02:27 -0500 - 4000000 correlations recorded
2014-Oct-19 02:35:37 -0500 - 5000000 correlations recorded
2014-Oct-19 03:11:37 -0500 - 6000000 correlations recorded
2014-Oct-19 03:55:40 -0500 - 7000000 correlations recorded
2014-Oct-19 04:45:53 -0500 - 8000000 correlations recorded
2014-Oct-19 05:46:36 -0500 - 9000000 correlations recorded
2014-Oct-19 06:50:18 -0500 - 10000000 correlations recorded
2014-Oct-19 08:05:46 -0500 - 11000000 correlations recorded
And then I aborted it. I've tried forcing a GC after every 1,000,000 relationships, and "throttling" by sleeping for 20 seconds, but neither seemed to help. I've since removed the forced GC but left the sleep in.
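To put numbers on the slowdown, the gaps between consecutive log lines can be computed directly. The following is a minimal sketch in plain Java, assuming the timestamp format shown in the log above; the class and method names are made up for illustration and are not part of Neo4j.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Neo4j): measures how long each batch of
// 1,000,000 relationships took, based on the timestamps in the insert log.
final class InsertLogIntervals {

    // Matches timestamps such as "2014-Oct-19 00:42:43 -0500".
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MMM-dd HH:mm:ss Z", Locale.ENGLISH);

    // Seconds elapsed between each pair of consecutive timestamps.
    static List<Long> secondsBetween(List<String> timestamps) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < timestamps.size(); i++) {
            OffsetDateTime previous = OffsetDateTime.parse(timestamps.get(i - 1), FORMAT);
            OffsetDateTime current = OffsetDateTime.parse(timestamps.get(i), FORMAT);
            gaps.add(Duration.between(previous, current).getSeconds());
        }
        return gaps;
    }

    public static void main(String[] args) {
        // Timestamps copied from the log above.
        List<String> log = List.of(
                "2014-Oct-19 00:42:43 -0500", "2014-Oct-19 01:06:36 -0500",
                "2014-Oct-19 01:33:45 -0500", "2014-Oct-19 02:02:27 -0500",
                "2014-Oct-19 02:35:37 -0500", "2014-Oct-19 03:11:37 -0500",
                "2014-Oct-19 03:55:40 -0500", "2014-Oct-19 04:45:53 -0500",
                "2014-Oct-19 05:46:36 -0500", "2014-Oct-19 06:50:18 -0500",
                "2014-Oct-19 08:05:46 -0500");
        List<Long> gaps = secondsBetween(log);
        for (int i = 0; i < gaps.size(); i++) {
            // The first gap covers the second million, and so on.
            System.out.printf("million #%d took %.1f minutes%n", i + 2, gaps.get(i) / 60.0);
        }
    }
}
```

Printing the gaps this way makes it easy to see whether the time per million grows steadily or jumps at a particular point.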
Any help is appreciated.
Thanks!

Related

Neo4j GDS algorithms execution time

I run the Dijkstra source-target shortest path algorithm in Neo4j (community edition) for 7 different graphs. The sizes of these graphs are as follows: 6,301 nodes - 8,846 nodes - 10,876 nodes - 22,687 nodes - 26,518 nodes - 36,682 nodes - 62,586 nodes.
For all of these graphs, the result (the path) is received in 2 ms, and the query completes after a different amount of time for each. Is it OK that the time is the same for all these graphs regardless of their sizes?
The same is happening when running the Yen algorithm.
If the time provided by the Neo4j browser is inaccurate, how can I measure the execution time accurately?
Update (tracking the execution time):
Thanks in advance.

Neo4j node creation speed

I have a fresh neo4j setup on my laptop, and creating new nodes via the REST API seems to be quite slow (~30-40 ms average). I've Googled around a bit, but can't find any real benchmarks for how long it "should" take; there's this post, but that only lists relative performance, not absolute performance. Is neo4j inherently limited to only adding ~30 new nodes per second (outside of batch mode), or is there something wrong with my configuration?
Config details:
Neo4j version 2.2.5
Server is on my mid-end 2014 laptop, running Ubuntu 15.04
OpenJDK version 1.8
Calls to the server are also from my laptop (via localhost:7474), so there shouldn't be any network latency involved
I'm calling neo4j via Clojure/Neocons; method used is "create" in the class clojurewerkz.neocons.rest.nodes
Using Cypher seems to be even slower; eg. calling "PROFILE CREATE (you:Person {name:"Jane Doe"}) RETURN you" via the HTML interface returns "Cypher version: CYPHER 2.2, planner: RULE. 5 total db hits in 54 ms."
Neo4j performance characteristics are a tricky area.
Measuring performance
First of all, it depends a lot on how the server is configured. Measuring anything on a laptop is the wrong way to do it.
Before measuring performance, you should check the following:
You have appropriate server hardware (requirements).
Client and server are on the same local network.
Neo4j is properly configured (memory mapping, web server thread pool, Java heap size, etc.).
The server is properly configured (Linux TCP stack, maximum number of open files, etc.).
The server is warmed up. Neo4j is written in Java, so you should do an appropriate warm-up before measuring (e.g. generate some load for ~15 minutes).
And the last one: Enterprise Edition. Neo4j Enterprise Edition has some advanced features that can improve performance a lot (e.g. the HPC cache).
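The warm-up point in the checklist above can be illustrated with a small timing harness. This is only a sketch under stated assumptions: the class name, the run counts and the stand-in workload are hypothetical, and a real measurement would replace the workload with the actual request against the server.

```java
// Hypothetical timing harness (not part of Neo4j): runs a task a number of
// times untimed so the JVM can warm up, then times the remaining runs.
final class WarmedUpTimer {

    // Returns the average wall-clock time per run in milliseconds,
    // measured only over the runs that follow the warm-up phase.
    static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns <= 0) {
            throw new IllegalArgumentException("measuredRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // warm-up: results are deliberately discarded
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / measuredRuns;
    }

    public static void main(String[] args) {
        // Stand-in workload (assumption); replace with the call being measured.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
            if (sb.length() == 0) {
                throw new IllegalStateException("unreachable");
            }
        };
        System.out.printf("cold: %.3f ms per run%n", averageMillis(task, 0, 100));
        System.out.printf("warm: %.3f ms per run%n", averageMillis(task, 5_000, 100));
    }
}
```

Comparing the cold and warm figures gives a rough idea of how much of a measured latency is start-up cost rather than steady-state behavior.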
Neo4j internally
Neo4j internally is:
Storage
Core API
Traversal API
Cypher API
Everything is performed without any additional network requests. Neo4j server is built on top of this solid foundation.
So, when you make a request to Neo4j server, you are measuring:
Latency between client and server
JSON serialization costs
Web server (Jetty)
Additional modules that manage locks, transactions, etc.
And Neo4j itself
So, the bottom line is that Neo4j is pretty fast by itself when used in embedded mode, but going through Neo4j server involves additional costs.
Numbers
We ran internal Neo4j tests and measured several cases.
Create nodes
Here we use the vanilla transactional Cypher REST API.
Threads: 2
Nodes per transaction: 1000
Execution time: 1635
Total nodes created: 7000000
Nodes per second: 7070
Threads: 5
Nodes per transaction: 750
Execution time: 852
Total nodes created: 7000000
Nodes per second: 8215
Huge database sync
This one uses a custom-developed unmanaged extension, with a binary protocol between server and client and some concurrency.
But this is still Neo4j server (in fact, a Neo4j cluster).
Node count: 80.32M (80 320 000)
Relationship count: 80.30M (80 300 000)
Property count: 257.78M (257 780 000)
Consumed time: 2142 seconds
Per second:
Nodes - 37497
Relationships - 37488
Properties - 120345
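The per-second figures above follow from dividing each count by the consumed time. A minimal sketch of that arithmetic in Java, using the counts and the 2142 seconds quoted above (the class name is hypothetical):

```java
// Hypothetical helper: derives per-second rates from a total count and a
// duration, as in the "Huge database sync" figures quoted above.
final class SyncThroughput {

    // Integer division truncates, which matches the rounded-down figures above.
    static long perSecond(long items, long seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return items / seconds;
    }

    public static void main(String[] args) {
        long seconds = 2142;
        System.out.println("Nodes/s:         " + perSecond(80_320_000L, seconds));
        System.out.println("Relationships/s: " + perSecond(80_300_000L, seconds));
        System.out.println("Properties/s:    " + perSecond(257_780_000L, seconds));
    }
}
```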
These numbers show the true power of Neo4j.
My numbers
I tried to measure performance just now.
Fresh and unconfigured database (2.2.5), Ubuntu 14.04 (VM).
Results:
$ ab -p post_loc.txt -T application/json -c 1 -n 10000 http://localhost:7474/db/data/node
This is ApacheBench, Version 2.3 <$Revision: 1604373 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: Jetty(9.2.4.v20141103)
Server Hostname: localhost
Server Port: 7474
Document Path: /db/data/node
Document Length: 1245 bytes
Concurrency Level: 1
Time taken for tests: 14.082 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 14910000 bytes
Total body sent: 1460000
HTML transferred: 12450000 bytes
Requests per second: 710.13 [#/sec] (mean)
Time per request: 1.408 [ms] (mean)
Time per request: 1.408 [ms] (mean, across all concurrent requests)
Transfer rate: 1033.99 [Kbytes/sec] received
101.25 kb/s sent
1135.24 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.2 0 19
Processing: 1 1 1.3 1 53
Waiting: 0 1 1.2 1 53
Total: 1 1 1.3 1 54
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 2
95% 2
98% 3
99% 4
100% 54 (longest request)
This run creates 10000 nodes with no properties using the REST API, in 1 thread.
As you can see, even on my laptop in a Linux VM with default settings, Neo4j is able to create nodes in 4 ms or less (99% of requests).
Note: I have warmed up database before (created and deleted 100K nodes).
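For readers who collect their own latency samples instead of using ab, a percentile like the "99%" row can be computed with the nearest-rank method. This is a sketch under that assumption; ab's own calculation may differ in detail, and the sample values in main are invented for illustration, not taken from the run above.

```java
import java.util.Arrays;

// Hypothetical helper: nearest-rank percentile over latency samples.
final class LatencyPercentile {

    // Returns the smallest sample such that at least p percent of all
    // samples are less than or equal to it (nearest-rank definition).
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Invented sample latencies in milliseconds (assumption).
        long[] samples = {1, 1, 1, 1, 1, 1, 1, 2, 2, 54};
        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p90 = " + percentile(samples, 90) + " ms");
        System.out.println("p100 = " + percentile(samples, 100) + " ms");
    }
}
```

Reporting a high percentile alongside the mean helps separate typical request times from rare outliers such as the single longest request.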
Bolt
If you are looking for the best Neo4j performance, you should follow the development of Bolt, the new binary protocol for Neo4j server.
More info: here, here and here.
One other thing to try is to run ./bin/neo4j-shell. Since there's no HTTP connection it can help you understand how much is Neo4j and how much is from the HTTP interface.
When I do that on 2.2.2 my CREATEs are generally around 10ms.
I'm not sure what the ideal is and if there is configuration which can improve the performance.

Mahout recommender returns no results

My Mahout recommender is returning no results although from the looks of the evaluator output it seems like it should:
2014-10-15 18:33:36,704 INFO GenericDataModel - Processed 90 users
2014-10-15 18:33:36,735 INFO AbstractDifferenceRecommenderEvaluator - Beginning evaluation using 0.99 of GenericDataModel[users:1116,1117,1118...]
2014-10-15 18:33:36,767 INFO GenericDataModel - Processed 89 users
2014-10-15 18:33:36,767 INFO AbstractDifferenceRecommenderEvaluator - Beginning evaluation of 75 users
2014-10-15 18:33:36,767 INFO AbstractDifferenceRecommenderEvaluator - Starting timing of 75 tasks in 8 threads
2014-10-15 18:33:36,782 INFO StatsCallable - Average time per recommendation: 15ms
2014-10-15 18:33:36,782 INFO StatsCallable - Approximate memory used: 876MB / 1129MB
2014-10-15 18:33:36,782 INFO StatsCallable - Unable to recommend in 0 cases
2014-10-15 18:33:36,845 INFO AbstractDifferenceRecommenderEvaluator - Evaluation result: 1.0599938694784354
I'm assuming that "Unable to recommend in 0 cases" means that it was able to recommend in all cases.
I then iterate over the user ID set and only see
2014-10-15 18:33:36,923 DEBUG GenericUserBasedRecommender - Recommending items for user ID '1164'
2014-10-15 18:33:36,923 DEBUG Recommendations are: []
for each Id.
Am I reading the debug log correctly?
Thanks.
I'm not exactly sure what that log message means. It looks like that stat is printed every 1000 iterations, so it must refer to the past 1000 requests rather than all time, but that's just a guess.
In any case, you will very seldom be able to recommend to all users. There will often be users who do not get recommendations because they have too little usage history, or usage history that does not overlap with other users'. There will also be new users for whom you have no preference history, and they will get no collaborative filtering recs either. Remember that you are only using preferences that have been expressed; this does not mean you have all users represented.
You should always have some fallback method to make recommendations for this case, even recently popular or promoted items would be better than nothing.
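One way to read the fallback advice is sketched below in plain Java. It is an illustration only, not Mahout code: the class and method names are hypothetical, and the list of popular items is assumed to be computed elsewhere.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Mahout): pads a possibly empty list of
// personalized recommendations with generally popular items.
final class FallbackRecommendations {

    // Returns up to howMany item IDs: personalized items first, then popular
    // items that are not already included, preserving order without duplicates.
    static List<Long> recommend(List<Long> personalized, List<Long> popular, int howMany) {
        Set<Long> result = new LinkedHashSet<>();
        for (Long id : personalized) {
            if (result.size() >= howMany) break;
            result.add(id);
        }
        for (Long id : popular) {
            if (result.size() >= howMany) break;
            result.add(id);
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<Long> popular = List.of(10L, 20L, 30L); // assumed precomputed
        // A user with no collaborative filtering results still gets items.
        System.out.println(recommend(List.of(), popular, 2));
        // A user with one personalized result gets it first, then popular items.
        System.out.println(recommend(List.of(20L), popular, 3));
    }
}
```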

Running the mahout example for collaborative filtering: where are the results?

I am experimenting a bit with Mahout; I started by building everything and having a look at the examples. I am mostly interested in collaborative filtering, so I started with the example on finding recommendations from the BookCrossing dataset. I managed to get everything working, and the sample runs without errors. However, the output is something like this:
INFO: Creating FileDataModel for file /tmp/taste.bookcrossing.
INFO: Reading file info...
INFO: Read lines: 433647
INFO: Processed 10000 users
INFO: Processed 20000 users
INFO: Processed 30000 users
INFO: Processed 40000 users
INFO: Processed 50000 users
INFO: Processed 60000 users
INFO: Processed 70000 users
INFO: Processed 77799 users
INFO: Beginning evaluation using 0.9 of BookCrossingDataModel
INFO: Processed 10000 users
INFO: Processed 20000 users
INFO: Processed 22090 users
INFO: Beginning evaluation of 4245 users
INFO: Starting timing of 4245 tasks in 2 threads
INFO: Average time per recommendation: 296ms
INFO: Approximate memory used: 115MB / 167MB
INFO: Unable to recommend in 1 cases
INFO: Average time per recommendation: 67ms
INFO: Approximate memory used: 107MB / 167MB
INFO: Unable to recommend in 2363 cases
INFO: Average time per recommendation: 72ms
INFO: Approximate memory used: 146MB / 167MB
INFO: Unable to recommend in 5095 cases
INFO: Average time per recommendation: 71ms
INFO: Approximate memory used: 113MB / 167MB
INFO: Unable to recommend in 7596 cases
INFO: Average time per recommendation: 71ms
INFO: Approximate memory used: 130MB / 167MB
INFO: Unable to recommend in 10896 cases
INFO: Evaluation result: 1.0895580110095793
When I check the code, I can see that it does this:
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
File ratingsFile = TasteOptionParser.getRatings(args);
DataModel model =
    ratingsFile == null ? new BookCrossingDataModel(true) : new BookCrossingDataModel(ratingsFile, true);
IRStatistics evaluation = evaluator.evaluate(
    new BookCrossingBooleanRecommenderBuilder(),
    new BookCrossingDataModelBuilder(),
    model,
    null,
    3,
    Double.NEGATIVE_INFINITY,
    1.0);
log.info(String.valueOf(evaluation));
So that seems to be correct, but I would like to see more details from the generated suggestions and/or similarities. The object returned is of type IRStatistics, which exposes only some numbers on the statistics of the results. Should I look somewhere else? Is this recommender not intended for getting any actual recommendations?
You are not actually generating recommendations, here you are just performing an evaluation.
This example from the Mahout in Action book (link) should give you an idea on how to actually get recommendations.
The example only requests recommendations for one user. In your case you would iterate through all the users, get each user's recommendations, and then decide what to do with them, such as writing them to a file.
Also the example doesn't use the data model builder or the recommender builder, but it shouldn't be hard for you to figure it out by looking at the method signatures.

Where do I check the results of Mahout's jester example?

After I run:
mahout org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner
Running on hadoop, using HADOOP_HOME=/usr
HADOOP_CONF_DIR=/etc/hadoop/conf
11/04/23 23:52:18 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner.props found on classpath, will use command-line arguments only
11/04/23 23:52:18 INFO file.FileDataModel: Creating FileDataModel for file src/main/java/org/apache/mahout/cf/taste/example/jester/jester-data-1.csv
11/04/23 23:52:18 INFO file.FileDataModel: Reading file info...
11/04/23 23:52:18 INFO file.FileDataModel: Read lines: 7074
11/04/23 23:52:18 INFO model.GenericDataModel: Processed 7074 users
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of FileDataModel[dataFile:/usr/local/mahout-distribution-0.4/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/jester-data-1.csv]
11/04/23 23:52:19 INFO model.GenericDataModel: Processed 2155 users
11/04/23 23:52:19 INFO slopeone.MemoryDiffStorage: Building average diffs...
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 855 users
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 855 tasks in 4 threads
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Average time per recommendation: 2ms
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Approximate memory used: 9MB / 56MB
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Unable to recommend in 0 cases
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 154472.97849261735
11/04/23 23:52:19 INFO jester.JesterRecommenderEvaluatorRunner: 154472.97849261735
11/04/23 23:52:19 INFO driver.MahoutDriver: Program took 740 ms
I have no idea where to check the results.
Thanks!
That is the result. You are running an evaluation on one recommender implementation, one which scores how well that one recommender predicts ratings. It shows the average difference between actual and predicted rating.
What result are you looking for?
However, something looks pretty wrong here: 154472.97849261735 is way too large. When I run it, I get an average difference of 3.41 (on a scale of 10).
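For reference, an average difference of this kind can be computed as in the sketch below. It assumes the score is a mean absolute difference between actual and predicted ratings, as described above; it is not Mahout's implementation, and the class name and ratings are invented for illustration.

```java
// Hypothetical helper (not Mahout code): mean absolute difference between
// actual and predicted ratings.
final class AverageAbsoluteDifference {

    static double of(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    public static void main(String[] args) {
        // Invented ratings (assumption), purely to show the calculation.
        double[] actual = {4.0, 2.0, 5.0};
        double[] predicted = {3.0, 2.0, 7.0};
        System.out.println(of(actual, predicted));
    }
}
```

A score computed this way cannot exceed the width of the rating scale, which is why a value in the hundreds of thousands points to a problem with the input data rather than with the recommender.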
I would run with the latest code from Subversion, ideally. 0.4 is 6 months old, although I don't know of any bugs here. You also don't need to run this via the driver program, though it works.
Really, I suspect your jester-data-1.csv file is wrong somehow. Best to follow up on user@mahout.apache.org.
