Neo4j in-memory configuration, multithreading, and slow writes

How do I improve write performance in Neo4j? I have Neo4j set up on a server and am currently running it in embedded mode. Based on configurations I've found online, I believe my settings keep the entire contents of my graph database in memory:
neostore.nodestore.db.mapped_memory=0
neostore.relationship.db.mapped_memory=0
neostore.propertystore.db.mapped_memory=0
neostore.propertystore.db.strings.mapped_memory=0
neostore.propertystore.db.arrays.mapped_memory=0
neostore.propertystore.db.index.keys.mapped_memory=0
neostore.propertystore.db.index.mapped_memory=0
node_auto_indexing=true
node_keys_indexable=type,id
cache_type=strong
use_memory_mapped_buffers=false
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
Please let me know if this is incorrect. The problem I am encountering is that persisting information to the graph database is not very quick compared with doing the same thing in MySQL (e.g. adding 250 items takes about 3 s, while in MySQL it takes 1 s). I read online that having multiple indexes can slow down writes, so I am looking into that right now to see whether it is the culprit. But I wanted to make sure that my configuration is in line with what is needed to run a graph database in memory.
Second question on this topic: if my configuration is good and my database is indeed in memory, is there a way to optimize persisting data, in case the indexes aren't the silver bullet? When we run our test of this functionality with 10 threads instead of one, the execution times seem to stack up (e.g. thread 1 finishes in 1 s, thread 2 in 2 s, thread 3 in 3 s, etc.). Is there some special multithreading configuration that I am missing that would improve performance when multiple threads hit the database at the same time?
Neo4J version
1.9.1-enterprise
My Jvm configs are
-Xms25G -Xmx25G -XX:+UseNUMA -XX:+UseSerialGC
My Machine Specs:
File system type ext3

Your cache arguments are invalid.
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
These can only be used with the GCR cache. Setting the cache type isn't going to load everything into memory for you at startup; you will have to write code to do that. Something like this:
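// Warm the cache by touching every node, property, and relationship once.
// graphDb is the running GraphDatabaseService instance.
GlobalGraphOperations ggo = GlobalGraphOperations.at(graphDb);
for (Node n : ggo.getAllNodes()) {
    for (String propertyKey : n.getPropertyKeys()) {
        n.getProperty(propertyKey);
    }
    for (Relationship relationship : n.getRelationships()) {
        // Intentionally empty: iterating is what loads each relationship.
    }
}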
Be careful with the strong cache: if you have a lot of nodes and relationships, the cache will eventually become large, and garbage collection over it will cause long pauses in your system.
My recommendation would be to use memory-mapped files, since these are handled by the OS and live outside the heap. They are nowhere near as fast as the object cache, but they do provide a speedup when you have to read from the Neo4j store.

Related

Talend- Memory issues. Working with big files

Before the admins eat me alive, let me say in my defense that I cannot comment on the original posts because I do not have the privileges to do so, so I have to ask about this again.
I have issues running a job in Talend (Open Studio for Big Data!). I have an archive of 3 GB. I do not consider this too much, since my computer has 32 GB of RAM.
When I tried to run my job, I first got a heap memory error, then a garbage collector error, and now it doesn't even give me an error (it just does nothing and then stops).
I found these solutions:
a) Talend performance
@Kailash commented that parallel execution is only available if I subscribe to one of the Talend Platform solutions. My comment/question: so is there no other similar option for parallelizing a job with a 3 GB archive?
b) Talend 10 GB input and lookup out of memory error
@54l3d mentioned that one option is to split the lookup file into manageable chunks (maybe 500M), then perform the join in many stages, one for each chunk. My comment/cry for help/question: how can I do that? I do not know how to split the lookup; can someone explain this to me a little more graphically?
c) How to push a big file data in talend?
I also went through (c), but I don't have any comment about it.
The job I am performing (thanks to @iMezouar) looks like this:
1) I have a MySQLInput reading from a MySQL database (3 GB)
2) I used tFirstRows to make the process easier (not working)
3) I used tSplitRow to transform the data from many similar columns into a single column.
4) MySQLOutput
Thanks again for reading, and double thanks for answering.
From what I understand, your query returns a lot of data (3 GB), and that is causing an error in your job. I suggest the following:
1. Filter the data on the database side: replace tSampleRow with a WHERE clause in your tMysqlInput component so that fewer rows are retrieved into Talend.
2. By default, the MySQL JDBC driver retrieves all rows into memory, so use the stream option in tMysqlInput's advanced settings in order to stream rows.

Garbage collection tuning/performance degradation for neo4j bulk .csv import

I am running a bulk import of data into a Neo4j instance (I have run against the 2.2.0 community and enterprise editions as well as 2.1.7 community) running in server mode. My application creates a bunch of nodes in memory and periodically stops to write a series of .csv files and send Cypher to the Neo4j instance to load the files. (This was done because of performance issues when running the application against the plain old REST API.)
Overall, I'm looking to upload something like 150-5000 million nodes, so this is, in principle, the type of thing that Neo4j claims to be able to handle relatively well.
Well, anyway, what I'm noticing when I run this against production data is that the application runs in two states: one where the CSV upload processes between 2k and 8k nodes per second, and one where it processes between 80 and 200 nodes per second. The two states are interwoven when you look at the upload as a time series, and as time goes on, it spends increasingly long periods in the slow state.
Nodes are created through a series of
MERGE (:{NODE_TYPE} {csvLine.key = n.primaryKey}) on create set [PROPERTY LIST];
statements, and I have indexes on everything that I'm merging against. This doesn't feel like a degradation in the insert statements, because the slowdown is not linear but bimodal; it feels like garbage collection pauses in the Neo4j instance. What is the best way to tune the Neo4j JVM garbage collector for frequent bulk inserts?
neo4j.properties:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
#neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.mapped_memory=100M
#neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
neo4j-wrapper.conf:
wrapper.java.additional=-XX:+UseConcMarkSweepGC
wrapper.java.additional=-XX:+CMSClassUnloadingEnabled
wrapper.java.additional=-XX:-OmitStackTraceInFastThrow
wrapper.java.additional=-XX:hashCode=5
wrapper.java.initmemory=8194
wrapper.java.maxmemory=8194
This felt like the sweet spot for both the overall heap memory and the neostore stuff. Increasing the overall heap degraded performance. That said, the neo4j garbage collection logs frequently have that GC (Allocation Failure) message.
EDIT: in response to Michael Hunger:
The machine has 64 GB of RAM, and nothing seems to be maxed out. It also seems that only a small number of cores are in use at any time. Garbage collector profiling shows that the garbage collector runs quite frequently.
The exact cypher statements are, for example:
USING PERIODIC COMMIT 110000
LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event_2015_4_773476.csv' AS csvLine
MERGE (s:Event {primaryKey: csvLine.primaryKey})
ON CREATE SET
  s.checkSum = csvLine.checkSum,
  s.epochTime = toInt(csvLine.epochTime),
  s.epochTimeCreated = toInt(csvLine.epochTimeCreated),
  s.epochTimeUpdated = toInt(csvLine.epochTimeUpdated),
  s.eventDescription = csvLine.eventDescription,
  s.fileName = csvLine.fileName,
  s.ip = csvLine.ip,
  s.lineNumber = toInt(csvLine.lineNumber),
  s.port = csvLine.port,
  s.processPid = csvLine.processPid,
  s.rawEventLine = csvLine.rawEventLine,
  s.serverId = csvLine.serverId,
  s.status = toInt(csvLine.status);
USING PERIODIC COMMIT 110000
LOAD CSV WITH HEADERS FROM 'file:///home/jschirmer/Event__File_2015_4_773476.csv' AS csvLine
MATCH (n:SC_CSR{primaryKey: csvLine.Event_id}), (s:File{fileName: csvLine.File_id})
MERGE n-[:DATA_SOURCE]->s;
There are several such statements being made.
I have tried running a single transaction at a time as well as running several (~3) such statements in parallel (which gives a roughly 2x improvement). I've tried tuning the periodic commit frequency and the size of the file. Performance seems to peak when the CSV file is roughly 100k lines, which means that, really, the periodic commit can be off.
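Writing the CSV in files of roughly 100k lines, as described above, could be sketched like this in plain Ruby. This is only an illustration under stated assumptions: split_csv and the chunk_00000.csv naming scheme are hypothetical, the first line is assumed to be the header, and no field is assumed to contain an embedded newline (a line-based split would break such rows).

```ruby
# Split a CSV file into chunk files of at most `lines_per_chunk` data rows,
# repeating the header line at the top of every chunk.
# Returns the chunk file paths in order. (Hypothetical helper.)
def split_csv(path, out_dir, lines_per_chunk: 100_000)
  chunk_paths = []
  File.open(path) do |input|
    header = input.gets
    return chunk_paths if header.nil? # empty file: nothing to split

    # Slicing the line enumerator keeps only one chunk in memory at a time.
    input.each_line.each_slice(lines_per_chunk).with_index do |rows, index|
      chunk_path = File.join(out_dir, format("chunk_%05d.csv", index))
      File.open(chunk_path, "w") do |out|
        out.write(header)
        rows.each { |row| out.write(row) }
      end
      chunk_paths << chunk_path
    end
  end
  chunk_paths
end
```

Each resulting file can then be loaded with its own LOAD CSV statement.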
I have not profiled the statements. I will do that, but I thought that the eager merge problem was avoided by using MERGE ... ON CREATE statements.
In general your config looks OK. How much RAM does your machine have?
For the things you merge against, I'd recommend a constraint instead of an index.
What's your tx size? And how many concurrent tx do you run?
Instead of your generic merge statement (which wouldn't compile), can you share the concrete statements?
Did you profile the statements? Perhaps you are running into the eager pipe problem:
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Do you use periodic commit?
How large are your CSV files?
See: http://neo4j.com/developer/guide-import-csv/

ActiveRecord bulk data, memory grows forever

I am using ActiveRecord to bulk migrate some data from a table in one database to a different table in another database. About 4 million rows.
I am using find_each to fetch in batches. Then I do a little bit of logic to each record fetched, and write it to a different db. I have tried both directly writing one-by-one, and using the nice activerecord-import gem to batch write.
However, in either case, my ruby process memory usage is growing quite a bit throughout the life of the export/import. I would think that using find_each, I'm getting batches of 1000, there should only be 1000 of them in memory at a time... but no, each record I fetch seems to be consuming memory forever, until the process is over.
Any ideas? Is ActiveRecord caching something somewhere that I can turn off?
update 17 Jan 2012
I think I'm going to give up on this. I have tried:
* Making sure everything is wrapped in an ActiveRecord::Base.uncached do block
* Adding ActiveRecord::IdentityMap.enabled = false (I think that should turn off the identity map for the current thread, although it's not clearly documented, and I think the identity map isn't on by default in current Rails anyhow)
Neither of those seems to have much effect; memory is still leaking.
I then added a periodic explicit:
GC.start
That seems to slow down the rate of memory leak, but the memory leak still happens (eventually exhausting all memory and bombing).
So I think I'm giving up, and deciding it is not currently possible to use AR to read millions of rows from one db and insert them into another. Perhaps there is a memory leak in MySQL-specific code being used (that's my db), or somewhere else in AR, or who knows.
I would suggest queueing each unit of work into a Resque queue. I have found that Ruby has some quirks when iterating over large arrays like these.
Have one main thread that queues up the work by ID, then have multiple Resque workers hitting that queue to get the work done.
I have used this method on approx 300k records, so it would most likely scale to millions.
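The shape of that approach can be sketched in plain Ruby. This is not Resque: as an assumption for illustration, Ruby's built-in thread-safe Queue stands in for the Resque queue and threads stand in for the worker processes, and process_by_id is a hypothetical name. The point is that only IDs are queued, so each worker loads one record at a time.

```ruby
# Stand-in for a Resque queue: an in-process, thread-safe queue of record IDs.
# With Resque itself the queue lives in Redis and the workers are separate
# processes; this sketch only shows the producer/worker split.
def process_by_id(ids, worker_count: 4, &work)
  queue = Queue.new
  ids.each { |id| queue << id } # enqueue only the IDs, not loaded records
  queue.close                   # pop returns nil once the queue is drained

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      while (id = queue.pop)
        results << work.call(id) # each worker handles one record at a time
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end
```

For example, process_by_id(1..1000) { |id| migrate(id) } would hand each ID to a hypothetical migrate method; the order of the returned results is not guaranteed.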
Change line #86 to bulk_queue = [], since bulk_queue.clear only sets the length of the array to 0, making it impossible for the GC to clear it.
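The code that answer refers to is not shown, so the following is only a sketch of the suggested pattern: a buffer that starts a fresh array on every flush instead of calling clear. BatchBuffer and the writer block are hypothetical names, and whether this changes memory behaviour in the original script is the answerer's claim, not something this sketch verifies.

```ruby
# Buffers rows and hands them to a writer in fixed-size batches.
# `writer` is any block; in a real migration it might wrap a bulk insert.
class BatchBuffer
  def initialize(batch_size, &writer)
    @batch_size = batch_size
    @writer = writer
    @rows = []
  end

  # Add one row, flushing automatically when the batch is full.
  def <<(row)
    @rows << row
    flush if @rows.size >= @batch_size
    self
  end

  # Hand the current batch to the writer and start a new, empty array.
  def flush
    return if @rows.empty?

    batch = @rows
    @rows = []          # fresh array instead of @rows.clear
    @writer.call(batch) # the buffer keeps no reference to the old array
    nil
  end
end
```

One practical effect of reassigning is that the writer can keep or discard the batch it was given without the buffer mutating it afterwards.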

Cache (large and static) data with class variables

First, let me explain the situation. I've got the following:
A "Node" class with the following attributes:
node_id (unique)
node_name (unique)
And a "NodeConnection" class with the following attributes:
node_from
node_to
We'll have around 1 to 3 million nodes and something around 3 to 10 million NodeConnections.
After the nodes and connections are imported once, they won't change.
On each request to the Rails application, we'll have to look up around 10 to 100 node_ids by possible node_names, and a few hundred to a few thousand node_connections.
We currently prototyped this without any caching (so, a LOT of database queries) and response times were horrible (like 2 minutes).
So we switched over to caching the nodes and connections via memcached.
We got a performance boost, but performance was still lacking (because we're calling Cache.read for every NodeConnection, that's a few thousand calls per request).
Now we tried caching via class variables and got a huge performance boost (response times within a few hundred ms).
# Pseudocode below
class Node
  def nodes
    @@nodes ||= get_nodes
  end

  def node_connections
    @@node_connections ||= get_node_connections
  end
end
So, I'd like to ask about the pros and cons of this solution.
Cons I've found so far:
Every Rails instance has to build up its own cache (its own class variables) -> higher total memory usage
Initializing the cache is time-consuming (1-3 minutes), so we can't do this within a request
Are there any other solutions for caching large (>100MB) and static (the data won't change during the application's lifetime) data efficiently, so that all Rails instances on the same machine can access the cache very fast (!)?
It sounds like a very specific situation, but in order to avoid the need for a per-process in-memory cache (i.e. your class variables) to naturally warm up, I'd be investigating the feasibility of scripting the warm-up process and running it from inside an initializer... your app may take longer to start up, but your users would not have to wait.
EDIT | Note that if you were using something like Unicorn, which supports pre-loading application code before forking worker processes, you could minimize the impact of such initialization.
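A minimal sketch of the scripted warm-up idea, in plain Ruby without Rails. NodeCache, warm_up!, node_id, and connections_from are hypothetical names, and the input is assumed to be arrays of [node_id, node_name] and [node_from, node_to] pairs loaded once from the database. The sketch builds frozen lookup hashes eagerly instead of memoizing on first request.

```ruby
# Process-wide, read-only cache built once at boot (hypothetical class).
class NodeCache
  class << self
    # Build both lookup tables up front. Intended to be called once, e.g.
    # from an initializer, before the process starts serving requests.
    def warm_up!(nodes, connections)
      @ids_by_name = nodes.each_with_object({}) { |(id, name), h| h[name] = id }.freeze
      @targets_by_source = connections
        .group_by(&:first)
        .transform_values { |pairs| pairs.map(&:last).freeze }
        .freeze
      nil
    end

    # node_id for a node_name, or nil when the name is unknown.
    def node_id(name)
      ids_by_name[name]
    end

    # All node_to values reachable from the given node_from.
    def connections_from(node_id)
      targets_by_source.fetch(node_id, [])
    end

    private

    def ids_by_name
      @ids_by_name || raise("NodeCache.warm_up! has not been called")
    end

    def targets_by_source
      @targets_by_source || raise("NodeCache.warm_up! has not been called")
    end
  end
end
```

Because the hashes are built before any request arrives, a preforking server that loads the application before forking would build them once in the parent process rather than once per worker; how much of that memory actually stays shared between workers depends on the Ruby version and is not guaranteed.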

rails high memory usage

I am planning on using delayed job to run some background analytics. In my initial test I saw a tremendous amount of memory usage, so I created a very simple task that runs every 2 minutes just to observe how much memory is being used.
The task is very simple, and the analytics_eligible? method always returns false given where the data is now, so none of the heavy-hitting code is being called. I have around 200 Posts in my sample data in development. Post has_one analytics_facet.
Regardless of the internal logic/business here, the only thing this task is doing is calling the analytics_eligible? method 200 times every 2 minutes. In a matter of 4 hours my physical memory usage is at 110MB and virtual memory at 200MB. Just for doing something this simple! I can't even begin to imagine how much memory this will eat if it's doing real analytics on 10,000 Posts with real production data!! Granted, it may not run every 2 minutes, more like every 30; still, I don't think it will fly.
This is running ruby 1.9.7, rails 2.3.5 on Ubuntu 10.x 64 bit. My laptop has 4GB memory, dual core CPU.
Is rails really this bad or am I doing something wrong?
Delayed::Worker.logger.info('RAM USAGE Job Start: ' + `pmap #{Process.pid} | tail -1`[10,40].strip)
Post.not_expired.each do |p|
  if p.analytics_eligible?
    #this method is never called
    Post.find_for_analytics_update(p.id).update_analytics
  end
end
Delayed::Worker.logger.info('RAM USAGE Job End: ' + `pmap #{Process.pid} | tail -1`[10,40].strip)
Delayed::Job.enqueue PeriodicAnalyticsJob.new(), 0, 2.minutes.from_now
Post Model
def analytics_eligible?
  vf = self.analytics_facet
  if self.total_ratings > 0 && vf.nil?
    return true
  elsif !vf.nil? && vf.last_update_tv > 0
    ratio = self.total_ratings / vf.last_update_tv
    if (ratio - 1) >= Constants::FACET_UPDATE_ELIGIBILITY_DELTA
      return true
    end
  end
  return false
end
ActiveRecord is fairly memory-hungry - be very careful when doing selects, and be mindful that Ruby automatically returns the last statement in a block as the return value, potentially meaning that you're passing back an array of records that get saved as a result somewhere and thus aren't eligible for GC.
Additionally, when you call "Post.not_expired.each", you're loading all your not_expired posts into RAM. A better solution is find_in_batches, which specifically only loads X records into RAM at a time.
Fixing it could be something as simple as:
def do_analytics
  Post.not_expired.find_in_batches(:batch_size => 100) do |batch|
    batch.each do |post|
      if post.analytics_eligible?
        #this method is never called
        Post.find_for_analytics_update(post.id).update_analytics
      end
    end
  end
  GC.start
end
do_analytics
A few things are happening here. First, the whole thing is scoped in a method so that variables left over from the block iterators can't hold onto references. Next, find_in_batches retrieves batch_size objects from the DB at a time, and as long as you aren't keeping references to them, they become eligible for garbage collection after each iteration, which keeps total memory usage down. Finally, we call GC.start at the end of the method; this forces the GC to start a sweep (which you wouldn't want to do in a realtime app, but since this is a background job, it's okay if it takes an extra 300ms to run). It also has the very distinct benefit of returning nil, which means that the result of the method is nil, so we can't accidentally hang on to AR instances returned from the finder.
Using something like this should ensure that you don't end up with leaked AR objects, and should vastly improve both performance and memory usage. You'll want to make sure you aren't leaking elsewhere in your app (class variables, globals, and class references are the worst offenders), but I suspect that this'll solve your problem.
All that said, this is a cron problem (periodic recurring work), rather than a DJ problem, in my opinion. You can have a one-shot analytics parser that runs your analytics every X minutes with script/runner, invoked by cron, which very neatly cleans up any potential memory leaks or misuses per-run (since the whole process terminates at the end)
Loading data in batches and using the garbage collector aggressively as Chris Heald has suggested is going to give you some really big gains, but another area people often overlook is what frameworks they're loading in.
Loading a default Rails stack will give you ActionController, ActionMailer, ActiveRecord and ActiveResource all together. If you're building a web application you may not be using all of these, but you're probably using most.
When you're building a background job, you can avoid loading things you don't need by creating a custom environment for that:
# config/environments/production_bg.rb
config.frameworks -= [ :action_controller, :active_resource, :action_mailer ]
# (Also include config directives from production.rb that apply)
Each of these frameworks will just be sitting around waiting for an email that will never be sent, or a controller that will never be called. There's simply no point in loading them. Adjust your database.yml file, set your background job to run in the production_bg environment, and you'll have a much cleaner slate to start with.
Another thing you can do is use ActiveRecord directly without loading Rails at all. This might be all that you need for this particular operation. I've also found using a light-weight ORM like Sequel makes your background job very light-weight if you're doing mostly SQL calls to reorganize records or delete old data. If you need access to your models and their methods you will need to use ActiveRecord, though. Sometimes it's worth re-implementing simple logic in pure SQL for reasons of performance and efficiency, though.
When measuring memory usage, the only number to be concerned with is "real" memory. The virtual amount contains shared libraries and the cost of these is spread amongst every process using them even though it is counted in full for each one.
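As a rough sketch of logging only the "real" (resident) figure, the helper below reads the VmRSS line from /proc/self/status instead of parsing pmap output. It assumes a Linux system; parse_vm_rss_kb and current_rss_kb are hypothetical names, and the second returns nil where /proc is not available.

```ruby
# Extract the resident set size in kB from the text of /proc/<pid>/status.
# Returns nil when no VmRSS line is present.
def parse_vm_rss_kb(status_text)
  match = status_text.match(/^VmRSS:\s+(\d+)\s+kB/)
  match && match[1].to_i
end

# Resident memory of the current process in kB, or nil without /proc.
def current_rss_kb
  path = "/proc/self/status"
  File.readable?(path) ? parse_vm_rss_kb(File.read(path)) : nil
end
```

Logging current_rss_kb at the start and end of a job gives a before/after comparison of resident memory without the shared-library share that inflates the virtual figure.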
In the end, if running something important takes 100MB of memory but you can get it down to 10MB with three weeks of work, I don't see why you'd bother. 90MB of memory costs at most about $60/year on a managed provider which is usually far less expensive than your time.
Ruby on Rails embraces the philosophy of being more concerned with your productivity and your time than about memory usage. If you want to trim it back, put it on a diet, you can do it but it will take a bit of effort.
If you are experiencing memory issues, one solution is to use another background processing technology, like Resque. It is the background processing system used by GitHub.
Thanks to Resque's parent / child architecture, jobs that use too much memory release that memory upon completion. No unwanted growth.
How?
On certain platforms, when a Resque worker reserves a job it immediately forks a child process. The child processes the job then exits. When the child has exited successfully, the worker reserves another job and repeats the process.
You can find more technical details in the README.
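The fork-per-job idea quoted above can be sketched in a few lines of plain Ruby. This is not Resque's implementation; run_in_child is a hypothetical helper, and it assumes a platform where Process.fork is available (it is not on Windows or JRuby).

```ruby
# Run one job in a forked child process and wait for it to finish.
# Memory the job allocates belongs to the child and is returned to the OS
# when the child exits, so the parent worker does not grow.
# Returns true when the child exited with status 0.
def run_in_child
  pid = Process.fork do
    yield    # the job body runs only in the child
    exit!(0) # leave immediately, skipping at_exit handlers copied from the parent
  end
  _, status = Process.wait2(pid)
  status.success?
end
```

A worker loop would call run_in_child once per reserved job and only reserve the next job after it returns.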
It is a fact that Ruby consumes (and leaks) memory. I don't know if you can do much about it, but at the least I recommend that you take a look at Ruby Enterprise Edition.
REE is an open source port which promises "33% less memory" among all the other good things. I have used REE with Passenger in production for almost two years now and I'm very pleased.

Resources