Direction of DSE Solr cluster capacity planning - datastax-enterprise

Getting started with the latest DSE, trying to set up an initial DSE Solr cluster and wanting to make sure basic capacity needs are met. I have done some initial capacity testing following the directions here:
http://www.datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/srch/srchCapazty.html
My test single-node setup is on AWS: m3.xl, 80GB RAID 0 across the two 40GB SSDs, latest DSE installed.
I have inserted a total of 6MM example records and run some Solr searches similar to what production would be running.
Have the following numbers for my 6MM records:
6MM Records
7.6GB disk (Cassandra + solr)
2.56GB solr index size
96.2MB solr field cache(totalReadableMemSize)
25.57MB solr Heap
I am trying to plan out an initial starter cluster and would like to plan for around 250MM records stored and indexed to start. Read load will be pretty minimal in the early days, so I'm not too worried about read throughput yet.
Extrapolating the 6MM numbers out to 250MM per the capacity planning doc, the base requirements for the dataset look like:
250MM Records
106GB solr index size
317GB disk (Cassandra + solr)
4GB solr field cache(totalReadableMemSize)
1.1GB solr Heap
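For reference, the 250MM figures above are straight linear extrapolations of the 6MM measurements; a quick sketch of the arithmetic (assuming sizes scale linearly with record count, which is only approximately true in practice):

```python
# Measurements from the 6MM-record test node, in GB.
measured = {
    "disk": 7.6,            # Cassandra + Solr on disk
    "solr_index": 2.56,
    "field_cache": 0.0962,  # totalReadableMemSize
    "solr_heap": 0.02557,
}

# Scale factor from the test set to the planned 250MM records.
scale = 250_000_000 / 6_000_000  # ~41.7x

projected = {name: round(gb * scale, 1) for name, gb in measured.items()}
print(projected)
# {'disk': 316.7, 'solr_index': 106.7, 'field_cache': 4.0, 'solr_heap': 1.1}
```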
Some questions I'm looking for guidance on, and to check that I am understanding the docs correctly:
Should I be targeting ~360GB+ storage to be safe and not exceed 80% disk capacity on average as data set grows?
Should I use nodes that can allocate 6GB for Solr + X GB for Cassandra? (i.e., if the entire Solr heap + field cache for 250MM is around 6GB and I partition across 3 nodes with replication)
With ~6GB for solr, how much should I try to dedicate to Cassandra proper?
Anything else to consider with planning (will be running on AWS)?
UPDATED (11/6) - Notes/suggestions from phact
With Cassandra + Solr running together, I will target the prescribed 14GB heap for each node's base operation, moving to 30GB-memory nodes on AWS, leaving 16GB for the OS, Solr index, and Solr field cache.
I added the Solr index size to the numbers above; if the suggestion is to keep most/all of the index in memory, it seems I might need to target AT LEAST 8 nodes to start, with 30GB of memory per node.
That seems like a good amount of extra overhead for Solr nodes just to keep the index in memory, so I might have to reconsider the approach.
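The node estimate above can be reproduced with a back-of-the-envelope page-cache calculation (30GB RAM and 14GB heap are the per-node targets above; the index figure is the 250MM projection, and replication is left out, so treat the result as a floor):

```python
import math

total_index_gb = 106   # projected Solr index size for 250MM records
ram_gb = 30            # per-node memory target
heap_gb = 14           # prescribed DSE heap (Cassandra + Solr share one JVM)

page_cache_gb = ram_gb - heap_gb   # what's left for the OS page cache

# Minimum nodes needed so each node's slice of the index fits in page cache.
min_nodes = math.ceil(total_index_gb / page_cache_gb)
print(min_nodes)  # 7 -- replication and OS overhead push this toward 8+
```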

DSE heap on a Solr node
The recommended heap size for a DSE node running Solr is 14GB. This is because Solr and Cassandra actually run in the same JVM; you don't have to allocate memory for Solr separately.
AWS M3.xl
m3.xl's with 15GB of RAM will be a bit tight with a 14GB heap. However, if your workload is relatively light, you can probably get away with a 12GB heap on your Solr nodes.
OS page cache
You do want to make sure you can at least fit your Solr indexes in the OS page cache (the memory left over after subtracting out your heap, assuming this is a dedicated box). Ideally you will also have room for Cassandra to keep some of your frequently read rows in page cache.
A quick and dirty way of figuring out how big your indexes are is to check the size of your index directory on the file system. Make sure to forecast/extrapolate if you're expecting your data to grow. You can also check the index size for each of your cores as follows:
http://localhost:8983/solr/admin/cores?action=STATUS&memory=true
Note - each node only needs to hold its own index in memory, not the entire cluster's index.
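The "size of your index directory" check can be scripted; a minimal sketch (the Solr data path varies by install, so the example path below is just an assumption):

```python
import os

def dir_size_bytes(path):
    """Sum file sizes under path, like `du -sb` -- a quick-and-dirty
    way to measure an on-disk Solr index."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total

# Hypothetical path -- adjust to wherever your DSE install keeps Solr data:
# print(dir_size_bytes("/var/lib/cassandra/data/solr.data") / 2**30, "GiB")
```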
Storage
Yes, you do want to ensure your disks are not over-utilized or you may face issues during compaction. In theory (worst-case scenario) size-tiered compaction could require up to 50% of your disk to be free. This is not common, though; see more details here.

Related

Neo4j randomly high CPU

Neo4j 3.5.12 Community Edition
Ubuntu Server 20.04.2
RAM: 32 GB
EC2 instance with 4 or 8 CPUs (I change it to accommodate for processing at the moment)
Database files: 6.5 GB
Python, WSGI, Flask
dbms.memory.heap.initial_size=17g
dbms.memory.heap.max_size=17g
dbms.memory.pagecache.size=11g
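As a rough budget sketch, the fixed allocations above account for most of the box (the JVM also needs native memory beyond the heap, which this ignores):

```python
ram_gb = 32
heap_gb = 17         # dbms.memory.heap.max_size
page_cache_gb = 11   # dbms.memory.pagecache.size

# Remainder for the OS, the Python/Flask processes, and JVM native overhead.
leftover_gb = ram_gb - heap_gb - page_cache_gb
print(leftover_gb)  # 4
```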
I'm seeing high CPU use on the server in what appears to be a random pattern. I've profiled all the queries for the pages that I know people are visiting at those times, and they are all optimised, with executions under 50ms in all cases. The CPU use doesn't seem linked to user numbers, which are very low at most times anyway (max 40 concurrent users). I've checked all queries in cron jobs too.
I reduced the number of database nodes significantly and that made no difference to performance.
I warm the database by preloading all nodes into ram with MATCH (n) OPTIONAL MATCH (n)-[r]->() RETURN count(n.prop) + count(r.prop);
The pattern is that there will be a few minutes of very low CPU use (as I would expect from this setup with these user numbers) and then processing on most CPU cores goes up to the high 90%s and the machine becomes unresponsive to new requests. Changing to an 8CPU instance sorts it, but shouldn't be needed for this level of traffic.
I would like to profile the queries with query logging, but the community edition doesn't support that.
Thanks.
Run a CPU profiler such as perf to record where CPU time is spent. You can then visualize it as a FlameGraph or, since your bursts only occur at random intervals, visualize it over time with Netflix's FlameScope.
Since Neo4j is a Java application, it might also be worthwhile to have a look at async-profiler, which is priceless when it comes to profiling Java applications (it generates similar FlameGraphs and can output log files compatible with FlameScope or JMC).

ruby requests more memory when there are plenty free heap slots

We have a server running
Sidekiq 4.2.9
rails 4.2.8
MRI 2.1.9
This server periodically performs imports from external APIs, runs some calculations on them, and saves the resulting values to the database.
About 3 weeks ago the server started hanging. As I see from New Relic (and when SSH'ed into it), it consumes more and more memory over time, eventually occupying all available RAM, and then the server hangs.
I've read some articles about how Ruby GC works, but still can't understand why at ~5:30 AM the heap size jumps from ~2.3M to 3M slots when there are still 1M free heap slots available (GC settings are default).
similar behavior, 3:35PM:
So, the questions are:
how to make Ruby fill free heap slots instead of requesting new slots from OS ?
how to make it release free heap slots to the system ?
how to make Ruby fill free heap slots instead of requesting new slots from OS ?
Your graph does not have full fidelity. It is a lot to assume that GC.stat was called by New Relic or whatnot at just the right time.
It is incredibly likely that you ran out of slots, the heap grew, and, since heaps don't shrink in Ruby, you are stuck with a somewhat bloated heap.
To alleviate some of the pain you can limit RUBY_GC_HEAP_GROWTH_MAX_SLOTS to a sane number; something like 100,000 will do. I am trying to lobby for setting a default here in core.
Also
Create a persistent log of jobs that run and when they ran (duration and so on), and gather GC.stat before and after each job runs
Split up your jobs by queue, run one queue on one server and the other queue on another, and see which queue and which job is responsible for the problem
Profile the various jobs you have using flamegraph or other profiling tools
Reduce the number of concurrent jobs you run as an experiment, or place a mutex between certain job types. It is possible that one "job a" at a time is OK-ish, while 20 concurrent "job a"s at a time will bloat memory.

Neo4J Memory tuning having little effect

I am currently running some simple Cypher queries (count etc.) on a large dataset (>10G) and am having some issues with tuning Neo4j.
The machine running the queries has 4TB of RAM and 160 cores, and is running Ubuntu 14.04 / Neo4j version 2.3. Originally I left all the settings at their defaults, as it is stated that free memory will be dynamically allocated as required. However, as the queries take several minutes to complete, I assumed this was not the case. So I have set various combinations of the following parameters within neo4j-wrapper.conf:
wrapper.java.initmemory=1200000
wrapper.java.maxmemory=1200000
dbms.memory.heap.initial_size=1200000
dbms.memory.heap.max_size=1200000
dbms.jvm.additional=-XX:NewRatio=1
and the following within neo4j.properties:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=50G
neostore.relationshipstore.db.mapped_memory=50G
neostore.propertystore.db.mapped_memory=50G
neostore.propertystore.db.strings.mapped_memory=50G
neostore.propertystore.db.arrays.mapped_memory=1G
following every guide/Stackoverflow post I could find on the topic, but I seem to have exhausted the available material with little effect.
I am running queries through the shell using the following command: neo4j-shell -c < "queries/$1.cypher", but have also tried explicitly passing the conf files with -config $NEO4J_HOME/conf/neo4j-wrapper.conf (restarting the server every time I make a change).
I imagine that I have missed something silly which is causing the issue, as there are many reports of neo4j working well with data of this size, but cannot think what it could be. As such any help would be greatly appreciated.
Type :schema in the Neo4j browser to check whether you have indexes.
Share a couple of your queries.
In the neo4j.properties file, you need to set dbms.pagecache.memory to about 1.5x the size of your database files. In your example, you can set it to 15g.
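That 1.5x figure is the answerer's rule of thumb rather than an official formula; as a tiny helper it would look like:

```python
import math

def pagecache_setting(store_size_gb, headroom=1.5):
    """Suggest a dbms.pagecache.memory value large enough to hold the
    whole store in RAM with room for growth (headroom is a heuristic)."""
    return f"{math.ceil(store_size_gb * headroom)}g"

print(pagecache_setting(10))  # 15g -- matching the suggestion above
```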

How to configure Neo4j to run in a minimal memory environment?

For demo purposes, I am running Neo4j in a low-memory environment: a laptop with 4GB of RAM, of which 1644MB is used for video memory, leaving only 2452MB available. It's also running SQL Server, our WCF services, and our clients, so there's little memory left for Neo4j.
I'm running LOAD CSV Cypher scripts via REST from a C# service. There are more than 20 scripts, and they work well in a server environment. I've written code to paginate so that they run in smaller batches. I've reduced the batch size very low (25 CSV rows), and a given script may do 300 batches, but I continue to get "Java heap space" errors at some point.
I've tried configuring Neo4j with a relatively large heap (640MB, which is all the available RAM) plus setting cache_type to none, and it gets much further before I get the Java heap space error. What I don't understand is why, in that case, memory grows that much. Also, until I restart the Neo4j service, I get these Java heap space errors quickly. The batch size doesn't seem to appreciably impact how much memory is used.
However, with these settings the query performance becomes very slow due to the disabled cache.
I am running this on a Windows 7 laptop with 4G RAM -- using Neo4j 2.2.1 Community Edition.
Thoughts?
Perhaps you can share your LOAD CSV statement and the other queries you run.
I think you have just run into this:
http://markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
So PROFILE or EXPLAIN your queries and rework them so they don't build up that much intermediate state. We can help if you share your statements.
You should also use USING PERIODIC COMMIT 100.
Something like:
heap=512M
dbms.pagecache.memory=200M
keep_logical_logs=false
cache_type=none
http://console.neo4j.org runs Neo4j in memory, fitting up to 50 instances in a single gigabyte of memory, so it should be doable.

rails - tire - elasticsearch : need to restart elasticsearch server

I use Rails + Tire + Elasticsearch. Everything is mainly working very well, but from time to time my server starts to get very slow, so I have to restart the Elasticsearch service and then everything is fine again.
I have the impression that it happens after bulk inserts (around 6000 products). Can that be linked? The inserts last 2 minutes max, but the server still has problems well after they finish.
EDIT :
finally, it turns out it is not linked to bulk inserts
I have only this line in the log:
[2013-06-29 01:15:32,767][WARN ][monitor.jvm ] [Jon Spectre] [gc][ParNew][26438][9941] duration [3.4s], collections [1]/[5.2s], total [3.4s]/[57.7s], memory [951.6mb]->[713.7mb]/[989.8mb], all_pools {[Code Cache] [10.6mb]->[10.6mb]/[48mb]}{[Par Eden Space] [241.1mb]->[31mb]/[273mb]}{[Par Survivor Space] [32.2mb]->[0b]/[34.1mb]}{[CMS Old Gen] [678.3mb]->[682.6mb]/[682.6mb]}{[CMS Perm Gen] [35mb]->[35mb]/[166mb]}
Does someone understand this ?
This is just stabbing in the dark, but from what you report, there might be a bad memory setting for your Java virtual machine.
Elasticsearch is built with Java and so runs on a JVM. Each JVM process has a defined amount of memory it may allocate, set when you start it up. When memory runs low, the JVM has to run garbage collection to free up space, and a process running at its memory limit spends so much time in GC that it becomes very slow; if GC cannot free enough, it eventually dies with an OutOfMemoryError. Your warning line shows exactly this: the CMS Old Gen pool is at its ceiling (682.6mb of 682.6mb) even after collection, and the whole heap is capped at ~990mb, so the heap is simply too small for your working set. Raising the Elasticsearch heap size should help.
You can have a look at the Java JMX management console (JConsole) to see what the process is doing and how much memory it has.
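To make the diagnosis concrete, the pool figures in the question's warning line can be pulled apart programmatically; each pool is reported as [before]->[after]/[max]:

```python
import re

# Abbreviated copy of the monitor.jvm warning from the question.
line = ("[gc][ParNew] duration [3.4s], memory [951.6mb]->[713.7mb]/[989.8mb], "
        "{[CMS Old Gen] [678.3mb]->[682.6mb]/[682.6mb]}")

# Extract before/after/max for the CMS Old Gen pool.
before, after, limit = re.search(
    r"\[CMS Old Gen\] \[([\d.]+)mb\]->\[([\d.]+)mb\]/\[([\d.]+)mb\]",
    line).groups()

print(after, limit)  # 682.6 682.6
# The old generation is completely full even after a collection --
# the classic sign of an undersized heap.
```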
