490KB insert crashes BigCouch - erlang

I have a BigCouch cluster with Q=256, N=3, R=2, W=2. Everything seems to be up and running and I can read and write small test documents. The application is in Python and uses the CouchDB library. The cluster has 3 nodes, each on CentOS on vxware with 3 cores and 6GB RAM each. BigCouch 0.4.0, CouchDB 1.1.1, Erlang R14B04, Linux version CentOS Linux release 6.0 (Final) on EC2 and CentOS release 6.2 (Final) on vmware 5.0.
Starting the application attempts to do a bulk insert with 412 documents and a total of 490KB data. This works fine with N=1 so there isn't an issue with the contents. But when N=3 I seem to randomly get one of these results:
write completes in about 9 sec
write completes in about 24 sec (nothing in between)
write fails after about 30sec (some documents were inserted)
Erlang crashes after about 30sec (some documents were inserted)
vmstat shows near 100% CPU utilization, top shows this is mostly the Erlang process, truss shows this is mostly spent in "futex" calls. Disk usage jumps up and down during the operation, but CPU remains pegged.
The logs show lovely messages like:
"could not load validation funs {{badmatch, {error, timeout}}, [{couch_db, '-load_validation_funs/1-fun-1-', 1}]}"
"Error in process <0.13489.10> on node 'bigcouch-test02#bigcouch-test02.oceanobservatories.org' with exit value: {{badmatch,{error,timeout}},[{couch_db,'-load_validation_funs/1-fun-1-',1}]}"
And of course there are Erlang dumps.
From reading about other people's use of BigCouch, this certainly isn't a large update. Our VMs seem beefy enough for the job. I can reproduce with cURL and a JSON file, so it isn't the application. (Can post that too if it helps.)
Can anyone explain why 9 cores and 18GB RAM can't handle a (3x) 490KB write?
more info in case it helps:
bigcouch.log entries including longer crash report
JSON entries that repeatably cause the failure
erl_crash.dump from an EC2 machine m1.small trying to allocate 500mb heap
can reproduce with commands:
download above JSON entries as file.json
url=http://yourhost:5984
curl -X PUT $url/test
curl -X POST $url/test/_bulk_docs -d #file.json -H "Content-Type: application/json"

Got a suggestion that Q=256 may be the issue and found that BigCouch does slow down a lot as Q grows. This is a surprise to me -- I would think the hashing and delegation would be pretty lightweight. But perhaps it dedicates too many resources to each DB shard.
As Q grows from too small to do allow any real cluster growth to maybe big enough for BigData, the time to do my 490kb update grows from uncomfortably slow to unreasonably slow and finally into the realm of BigCouch crashes. Here is the time to insert as Q varies with N=3, R=W=2, 3-nodes as originally described:
Q sec
4 6.4
8 7.7
16 10.8
32 16.9
64 37.0 <-- specific suggestion in adam#cloudant's webcast
This seems like an achilles heel for BigCouch: in spite of suggestions to overshard to support later growth of your cluster, you can't have enough shards unless you already have a moderate-sized cluster or some powerful hardware!

Related

Neo4j randomly high CPU

Neo4j 3.5.12 Community Edition
Ubuntu Server 20.04.2
RAM: 32 Gb
EC2 instance with 4 or 8 CPUs (I change it to accommodate for processing at the moment)
Database files: 6.5Gb
Python, WSGI, Flask
dbms.memory.heap.initial_size=17g
dbms.memory.heap.max_size=17g
dbms.memory.pagecache.size=11g
I'm seeing high CPU use on the server in what appears to be a random pattern. I've profiled all the queries for the pages that I know that people are visiting at those times and they are all optimised with executions under 50ms in all cases. The CPU use doesn't seem linked with user numbers which are very low at most times anyway (max 40 concurrent users). I've checked all queries in cron jobs too.
I reduced the database notes significantly and that made no difference to performance.
I warm the database by preloading all nodes into ram with MATCH (n) OPTIONAL MATCH (n)-[r]->() RETURN count(n.prop) + count(r.prop);
The pattern is that there will be a few minutes of very low CPU use (as I would expect from this setup with these user numbers) and then processing on most CPU cores goes up to the high 90%s and the machine becomes unresponsive to new requests. Changing to an 8CPU instance sorts it, but shouldn't be needed for this level of traffic.
I would like to profile the queries with query logging, but the community edition doesn't support that.
Thanks.
Run a CPU profiler such as perf to record where CPU time is spent. You can then visualize it as a FlameGraph or, since your bursts only occur at random intervals, visualize it over time with Netflix' FlameScope
Since Neo4j is a Java application, it might also be worthwhile to have a look at async-profiler which is priceless when it comes to profiling Java applications (and it generates similar FlameGraphs and can output log files compatible with FlameScope or JMC)

neo4j 3.5.x GC running over and over again, even after just starting the server

Our application uses neo4j 3.5.x (tried both community and enterprise editions) to store some data.
No matter how we setup memory in conf/neo4j.conf (tried with lots of combinations for initial/max heap settings from 4 to 16 GB), the GC process runs periodically every 3 seconds, putting the machine to its knees, slowing the whole system down.
There's a combination (8g/16g) that seems to make stuff more stable, but a few minutes (20-30) after our system is being used, GC kicks again on neo4j and goes into this "deadly" loop.
If we restart the neo4j server w/o restarting our system, as soon as our system starts querying neo4j, GC starts again... (we've noticed this behavior consistently).
We've had a 3.5.x community instance which was working fine from last week (when we've tried to switch to enterprise). We've copied over the data/ folder from enterprise to community instance and started the community instance... only to have it behave the same way the enterprise instance did, running GC every 3 seconds.
Any help is appreciated. Thanks.
Screenshot of jvisualvm with 8g/16g of heap
On debug.log, only these are significative:
2019-03-21 13:44:28.475+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-03-21 13:45:15.136+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-03-21 13:45:15.140+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:50376]: client produced 301 messages on the worker queue, auto-read is being disabled.
And I have a neo4j.log excerpt from around the time the jvisualvm screenshot shows but it's 3500 lines long... so here it is on Pastebin:
neo4j.log excerpt from around the time time the jvisualvm screenshot was taken
JUST_PUT_THIS_TO_KEEP_THE_SO_EDITOR_HAPPY_IGNORE...
Hope this helps, I have also the logs for the Enterprise edition if needed, though they are a bit more 'cahotic' (neo4j restarts in between) and I have no jvisualvm screenshot for them

How to configure Neo4j to run in a minimal memory environment?

For demo purposes, I am running Neo4j in a low memory environment -- A laptop with 4GB of RAM, 1644MB is use for video memory, leaving only 2452 MB available for use.. It's also running SQL Server, our WCF services, and our clients.. So there's little memory for Neo4j.
I'm running LOAD CSV cypher scripts via REST from a C# service. There are more than 20 scripts, and theyt work well in a server environment. I've written code to paginate, so that they run in smaller batches. I've reduced the batch size very low ( 25 csv rows ) and a given script may do 300 batches, but I continue to get "Java heap space" errors at some point.
I've tried configuring Neo4j with a relatively large heap space ( 640MB ) which is all the available RAM size plus setting the cache_type to none, and it gets much further before I get the java heap space error. What I don't understand is in that case, why does it grow that much? Also until I restart the neo4j service, I get these java heap space errors quickly. The batch size doesn't seem to impact how much memory is used appreciably.
However, after doing that, and I run the application with these settings, the query performance becomes very slow due to the cache settings.
I am running this on a Windows 7 laptop with 4G RAM -- using Neo4j 2.2.1 Community Edition.
Thoughts?
Perhaps you can share your LOAD CSV statement and the other queries you run.
I think you just run into this:
http://markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
So PROFILE or EXPLAIN your queries and make it not to use that much intermediate state. We can help if you share your statements.
And you should use PERIODIC COMMIT 100.
Something like:
heap=512M
dbms.pagecache.memory=200M
keep_logical_logs=false
cache_type=none
http://console.neo4j.org runs neo4j in memory putting up to 50 instances in a single gigabyte of memory. So it should be doable.

Neo4j 2.0.4 browser cannot query large datasets

Whenever I try to run cypher queries in Neo4j browser 2.0 on large (anywhere from 3 to 10GB) batch-imported datasets, I receive an "Unknown Error." Then Neo4j server stops responding, and I need to exit out using Task Manager. Prior to this operation, the server shuts down quickly and easily. I have no such issues with smaller batch-imported datasets.
I work on a Win 7 64bit computer, using the Neo4j browser. I have adjusted the .properties file to allow for much larger memory allocations. I have configured my JVM heap to 12g, which should be fine for 64bit JDK. I just recently doubled my RAM, which I thought would fix the issue.
My CPU usage is pegged. I have the logs enabled but I don't know where to find them.
I really like the visualization capabilities of the 2.0.4 browser, does anyone know what might be going wrong?
Your query is taking a long time, and the web browser interface reports "Unknown Error" after a certain timeout period. The query is still running, but you won't see the results in the browser. This drove me nuts too when it first happened to me. If you run the query in the neo4j shell you can verify whether or not this is the problem, because the shell won't time out.
Once this timeout occurs, you can find that the whole system becomes quite non-responsive, especially if you re-run the query, because now you have two extremely long queries running in parallel!
Depending on the type of query, you may be able to improve performance. Sometimes it's as simple as limiting the number of returned nodes (in cases where you only need to find one node or path).
Hope this helps.
Grace and peace,
Jim

rails - tire - elasticsearch : need to restart elasticsearch server

I use rails - tire - elasticsearch, everything is mainly working very well, but just from time to time, my server start to be very slow. So I have to restart elasticsearch service and then everything go fine again.
I have impression that it happens after bulk inserts (around 6000 products). Can it be linked? Inserts last like 2 min max, but still after server has problem
EDIT :
finally it is not linked to bulk inserts
I have only this line in log
[2013-06-29 01:15:32,767][WARN ][monitor.jvm ] [Jon Spectre] [gc][ParNew][26438][9941] duration [3.4s], collections [1]/[5.2s], total [3.4s]/[57.7s], memory [951.6mb]->[713.7mb]/[989.8mb], all_pools {[Code Cache] [10.6mb]->[10.6mb]/[48mb]}{[Par Eden Space] [241.1mb]->[31mb]/[273mb]}{[Par Survivor Space] [32.2mb]->[0b]/[34.1mb]}{[CMS Old Gen] [678.3mb]->[682.6mb]/[682.6mb]}{[CMS Perm Gen] [35mb]->[35mb]/[166mb]}
Does someone understand this ?
This is just stabbing in the dark, but from what you report, there might be a bad memory setting for your java virtual machine.
ElasticSearch is built with Java and so runs on a JVM. Each JVM process has a defined set of memory to allocate when you start it up. When the available memory is not enough, it crashes, so it has to do garbage collection to free up space. When you run a Java process on the memory limit, it is occupied with a lot of GC runs and will get very slow.
You can have a look at the java jmx management console for what the process is doing and how much memory it has.

Resources