We are having an issue on our production Elasticsearch cluster where Elasticsearch seems to be consuming, over time, all of the RAM on each server. Each box has 128GB of RAM so we run two instances, 30GB is allocated to each for the JVM Heap. The remaing 68G is left for the OS and Lucene. We rebooted each of the servers last week and the RAM was started off just right using 24% of the RAM for each Elasticsearch process. It's now been almost a week and our memory consumption has gone up to around 40% per Elasticsearch instance. I have attached our config file in hopes that someone may be able to help figure out why Elasticsearch is growing out past the limit we have set for memory utilization.
Currently we are running ES 1.3.2 but will be upgrading to 1.4.2 next week with our next release.
Here is a view of top (extra fields removed for clarity) from right after the reboot:
PID USER %MEM TIME+
2178 elastics 24.1 1:03.49
2197 elastics 24.3 1:07.32
and one today:
PID USER %MEM TIME+
2178 elastics 40.5 2927:50
2197 elastics 40.1 3000:44
elasticserach-0.yml:
cluster.name: PROD
node.name: "PROD6-0"
node.master: true
node.data: true
node.rack: PROD6
cluster.routing.allocation.awareness.force.rack.values:
PROD4,PROD5,PROD6,PROD7,PROD8,PROD9,PROD10,PROD11,PROD12
cluster.routing.allocation.awareness.attributes: rack
node.max_local_storage_nodes: 2
path.data: /es_data1
path.logs:/var/log/elasticsearch
bootstrap.mlockall: true
transport.tcp.port:9300
http.port: 9200
http.max_content_length: 400mb
gateway.recover_after_nodes: 17
gateway.recover_after_time: 1m
gateway.expected_nodes: 18
cluster.routing.allocation.node_concurrent_recoveries: 20
indices.recovery.max_bytes_per_sec: 200mb
discovery.zen.minimum_master_nodes: 10
discovery.zen.ping.timeout: 3s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: XXX
index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.debug: 2s
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s
action.auto_create_index: .marvel-*
action.disable_delete_all_indices: true
indices.cache.filter.size: 10%
index.refresh_interval: -1
threadpool.search.type: fixed
threadpool.search.size: 48
threadpool.search.queue_size: 10000000
cluster.routing.allocation.cluster_concurrent_rebalance: 6
indices.store.throttle.type: none
index.reclaim_deletes_weight: 4.0
index.merge.policy.max_merge_at_once: 5
index.merge.policy.segments_per_tier: 5
marvel.agent.exporter.es.hosts: ["1.1.1.1:9200","1.1.1.1:9200"]
marvel.agent.enabled: true
marvel.agent.interval: 30s
script.disable_dynamic: false
and here is /etc/sysconfig/elasticsearch-0 :
# Directory where the Elasticsearch binary distribution resides
ES_HOME=/usr/share/elasticsearch
# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=30g
# Heap new generation
#ES_HEAP_NEWSIZE=
# max direct memory
#ES_DIRECT_SIZE=
# Additional Java OPTS
#ES_JAVA_OPTS=
# Maximum number of open files
MAX_OPEN_FILES=65535
# Maximum amount of locked memory
MAX_LOCKED_MEMORY=unlimited
# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144
# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch
# Elasticsearch data directory
DATA_DIR=/es_data1
# Elasticsearch work directory
WORK_DIR=/tmp/elasticsearch
# Elasticsearch conf directory
CONF_DIR=/etc/elasticsearch
# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE=/etc/elasticsearch/elasticsearch-0.yml
# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=elasticsearch
# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true
Please let me know if there is any other data I can provide. Thanks in advance for any help.
total used free shared buffers cached
Mem: 129022 119372 9650 0 219 46819
-/+ buffers/cache: 72333 56689
Swap: 28603 0 28603
What you are seeing isn't heap blow out, heap will always be restricted by what you set in the config. free -m and top report on OS related use, so the use there would most likely be the OS caching FS calls.
This will not cause a java OOM.
If you are experiencing java OOM, which is directly related to the java heap running out of space, then there is something else at play. Your logs may provide some info around that.
Related
I am trying to run a spark job with PySpark through Jupyter notebook running in Docker. Workers are located on separate machines in the same network. I am performing a take operation on RDD:
data.take(number_of_elements)
When the number_of_elements is 2000 everything works fine. When it is 20000 an exception occurs. From my point of view it breaks when the size of the result exceeds 2GB (or it seems for me so). The idea about 2GB comes from that spark can send results smaller than 2GB in one block and when the result is bigger than 2GB another mechanism starts to work and something breaks there (see here). Here is the exception from executor log:
19/11/05 10:27:14 INFO CodeGenerator: Code generated in 205.7623 ms
19/11/05 10:27:40 INFO PythonRunner: Times: total = 25421, boot = 3, init = 1751, finish = 23667
19/11/05 10:27:42 INFO MemoryStore: Block taskresult_4 stored as bytes in memory (estimated size 927.7 MB, free 6.4 GB)
19/11/05 10:27:42 INFO Executor: Finished task 0.0 in stage 3.0 (TID 4). 972788748 bytes result sent via BlockManager)
19/11/05 10:27:49 ERROR TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1585998572000, chunkIndex=0}, buffer=org.apache.spark.storage.BlockManagerManagedBuffer#4399ad49} to /10.0.0.9:56222; closing connection
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.apache.spark.util.io.ChunkedByteBufferFileRegion.transferTo(ChunkedByteBufferFileRegion.scala:64)
at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:355)
at io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:224)
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:382)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:934)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:362)
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:901)
at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1321)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:115)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:776)
at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:768)
at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:749)
at io.netty.channel.DefaultChannelPipeline.flush(DefaultChannelPipeline.java:983)
at io.netty.channel.AbstractChannel.flush(AbstractChannel.java:248)
at io.netty.channel.nio.AbstractNioByteChannel$1.run(AbstractNioByteChannel.java:284)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
As we can see from the log executor tries to send result to 10.0.0.9:56222. It fails because the port is not opened in docker compose. 10.0.0.9 is an IP address of a master node but port 56222 is random though I explicitly set up all ports I can find in documentation to disable random port selection:
spark = SparkSession.builder\
.master('spark://spark.cyber.com:7077')\
.appName('My App')\
.config('spark.task.maxFailures', '16')\
.config('spark.driver.port', '20002')\
.config('spark.driver.host', 'spark.cyber.com')\
.config('spark.driver.bindAddress', '0.0.0.0')\
.config('spark.blockManager.port', '6060')\
.config('spark.driver.blockManager.port', '6060')\
.config('spark.shuffle.service.port', '7070')\
.config('spark.driver.maxResultSize', '14g')\
.getOrCreate()
I mapped these ports with docker compose:
version: "3"
services:
jupyter:
image: jupyter/pyspark-notebook:latest
ports:
- "4040-4050:4040-4050"
- "6060:6060"
- "7070:7070"
- "8888:8888"
- "20000-20010:20000-20010"
You should probably configure you spark driver memory to follow your docker container memory settings
I added
.config('spark.driver.memory', '14g')
as #ML_TN proposed and everything works now.
From my point of view it is strange that the memory setting affects the ports that spark uses.
→ or totally ignored strings like name of new DB for testing purposes.
Firstly tries to add something about ~250 to 250 already added hosts and Z-server shutted down. I've restarted it and inside docker logs I saw this:
6:20191014:091840.201 using configuration file: /etc/zabbix/zabbix_server.conf
6:20191014:091840.223 current database version (mandatory/optional): 04020000/04020001
6:20191014:091840.223 required mandatory version: 04020000
6:20191014:091840.484 __mem_malloc: skipped 7 asked 108424 skip_min 304 skip_max 12192
6:20191014:091840.484 [file:dbconfig.c,line:94] __zbx_mem_realloc(): out of memory (requested 108424 bytes)
6:20191014:091840.484 [file:dbconfig.c,line:94] __zbx_mem_realloc(): please increase CacheSize configuration parameter
6:20191014:091840.484 === memory statistics for configuration cache ===
Solution for those problem was to increase CacheSize in zabbix_server.conf . Okay, that's not a problem and after this Im push a new config to Z-server and restart it... → and z-server stops already after start and logs says the same problem. After reading config in container I saw what string what I corrected to matching my wishes are missing O_o. Strings are deleted.
My config:
LogType=console
DBHost=postgres-server
DBName=zabbix_pwd
DBSchema=public
DBUser=zabbix
DBPassword=zabbix
DBPort=5432
StartPollers=5
StartIPMIPollers=5
StartPollersUnreachable=5
SNMPTrapperFile=/var/lib/zabbix/snmptraps/snmptraps.log
StartSNMPTrapper=1
CacheSize=512M
HistoryCacheSize=512M
HistoryIndexCacheSize=512M
TrendCacheSize=512m
ValueCacheSize=256M
AlertScriptsPath=/usr/lib/zabbix/alertscripts
ExternalScripts=/usr/lib/zabbix/externalscripts
FpingLocation=/usr/sbin/fping
Fping6Location=/usr/sbin/fping6
SSHKeyLocation=/var/lib/zabbix/ssh_keys
SSLCertLocation=/var/lib/zabbix/ssl/certs/
SSLKeyLocation=/var/lib/zabbix/ssl/keys/
SSLCALocation=/var/lib/zabbix/ssl/ssl_ca/
LoadModulePath=/var/lib/zabbix/modules/
And what I've getting after starting z-server:
LogType=console
DBHost=postgres-server
DBName=zabbix_pwd
DBSchema=public
DBUser=zabbix
DBPassword=zabbix
DBPort=5432
SNMPTrapperFile=/var/lib/zabbix/snmptraps/snmptraps.log
StartSNMPTrapper=1
AlertScriptsPath=/usr/lib/zabbix/alertscripts
ExternalScripts=/usr/lib/zabbix/externalscripts
FpingLocation=/usr/sbin/fping
Fping6Location=/usr/sbin/fping6
SSHKeyLocation=/var/lib/zabbix/ssh_keys
SSLCertLocation=/var/lib/zabbix/ssl/certs/
SSLKeyLocation=/var/lib/zabbix/ssl/keys/
SSLCALocation=/var/lib/zabbix/ssl/ssl_ca/
LoadModulePath=/var/lib/zabbix/modules/
Any suggestions to how-to rule the world and don't be captured by doctors ?
With docker you need to send conf parameters in the docker-compose.yml file, or in your docker run command using the -e :
For example from my docker yml file:
zabbix-server:
image: zabbix/zabbix-server-pgsql:ubuntu-4.2.6
environment:
ZBX_MAXHOUSEKEEPERDELETE: 5000
ZBX_STARTPOLLERS: 15
ZBX_CACHESIZE: 8M
ZBX_STARTDBSYNCERS: 4
ZBX_HISTORYCACHESIZE: 16M
ZBX_TRENDCACHESIZE: 4M
ZBX_VALUECACHESIZE: 8M
ZBX_LOGSLOWQUERIES: 3000
Another way to work with zabbix:
https://hub.docker.com/r/monitoringartist/zabbix-3.0-xxl/
I have a redis instance (3.2) on docker (official image) which is pretty much unused, except the script I launched every second, to unqueue potential items in a ZSET.
Here is my script:
local latestSchedule = redis.call('ZRANGEBYSCORE', KEYS[1], '-inf', 123456789, 'LIMIT', '0', '1')
if latestSchedule[1] == nil then return nil end
redis.call('ZREM', KEYS[1], latestSchedule[1])
return latestSchedule[1]
Even though this ZSET is most of the time empty, Redis is eating more and more memory, up to 128MB, until it restarts and goes up again.
Am I missing something?
Is redis memory usage usually growing without doing anything?
Is my script not well suited for unqueuing from a ZSET?
Should I watch somewhere else?
As per Karthikeyan Gopall request, here is the INFO, just before it reaches 128MB:
# Server
redis_version:3.2.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:5382f69a4e75566b
redis_mode:standalone
os:Linux 3.16.0-4-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:1
run_id:4e22b73f22436677376b4d097746c2a30ba2b9bc
tcp_port:6379
uptime_in_seconds:21140
uptime_in_days:0
hz:10
lru_clock:6816977
executable:/data/redis-server
config_file:
# Clients
connected_clients:5
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:33392560
used_memory_human:31.85M
used_memory_rss:125399040
used_memory_rss_human:119.59M
used_memory_peak:33473544
used_memory_peak_human:31.92M
total_system_memory:1787236352
total_system_memory_human:1.66G
used_memory_lua:67447808
used_memory_lua_human:64.32M
maxmemory:134217728
maxmemory_human:128.00M
maxmemory_policy:noeviction
mem_fragmentation_ratio:3.76
mem_allocator:jemalloc-4.0.3
# Persistence
loading:0
rdb_changes_since_last_save:162
rdb_bgsave_in_progress:0
rdb_last_save_time:1466413629
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:1039856
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:21174
total_commands_processed:339098
instantaneous_ops_per_sec:10
total_net_input_bytes:34329347
total_net_output_bytes:11702705
instantaneous_input_kbps:0.96
instantaneous_output_kbps:0.14
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:24
evicted_keys:0
keyspace_hits:200
keyspace_misses:84628
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:27.75
used_cpu_user:21.72
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=47,expires=47,avg_ttl=17584379
And just when it restarts:
# Server
redis_version:3.2.0
redis_git_sha1:00000000
redis_git_dirty:0
redis_build_id:5382f69a4e75566b
redis_mode:standalone
os:Linux 3.16.0-4-amd64 x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.9.2
process_id:1
run_id:4e22b73f22436677376b4d097746c2a30ba2b9bc
tcp_port:6379
uptime_in_seconds:21140
uptime_in_days:0
hz:10
lru_clock:6816977
executable:/data/redis-server
config_file:
# Clients
connected_clients:5
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
# Memory
used_memory:33392560
used_memory_human:31.85M
used_memory_rss:125399040
used_memory_rss_human:119.59M
used_memory_peak:33473544
used_memory_peak_human:31.92M
total_system_memory:1787236352
total_system_memory_human:1.66G
used_memory_lua:67447808
used_memory_lua_human:64.32M
maxmemory:134217728
maxmemory_human:128.00M
maxmemory_policy:noeviction
mem_fragmentation_ratio:3.76
mem_allocator:jemalloc-4.0.3
# Persistence
loading:0
rdb_changes_since_last_save:162
rdb_bgsave_in_progress:0
rdb_last_save_time:1466413629
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_current_size:1039856
aof_base_size:0
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:0
# Stats
total_connections_received:21174
total_commands_processed:339098
instantaneous_ops_per_sec:10
total_net_input_bytes:34329347
total_net_output_bytes:11702705
instantaneous_input_kbps:0.96
instantaneous_output_kbps:0.14
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:24
evicted_keys:0
keyspace_hits:200
keyspace_misses:84628
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:0
migrate_cached_sockets:0
# Replication
role:master
connected_slaves:0
master_repl_offset:0
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:27.75
used_cpu_user:21.72
used_cpu_sys_children:0.00
used_cpu_user_children:0.00
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=47,expires=47,avg_ttl=17584379
Your understanding is wrong. Here maxmemory_human:128.00M means the maximum memory redis can take as per your configuration (you can change this in redis.conf file, your current value will be 134217728 bytes ie, 128 MB). If your memory usage goes beyond this range, redis will start throwing out of memory error as per your eviction policy (maxmemory_policy:noeviction)
You need to see used_memory_human:31.85M for the current memory used by redis.
# Keyspace
db0:keys=47,expires=47,avg_ttl=17584379
With 46 more keys in your server I guess this is a normal memory.
You can see more details about each values in info command in this link http://redis.io/commands/INFO.
Hope this helps.
I'm running the service that I developed by myself.
Ruby on Rails.3.2.11, Passenger, and Apache2 are being used.
It seemed working fine until there are over 100 registered users accessing to the service at the same time.
When it happens, my service completely freezes and there won't be any response(Just keep loading forever)
So, all I can do is restarting Apache. It solves the problem for a moment but it occurs again and again!
I thought that handling about 100 users won't be that big problem in Ruby on Rails App.
But I'm guessing that my unique feature is preventing that.
There are 2 things that I care about.
All the registered user's last_active_at(datetime) will be updated when every load
(Every page, and Every time)
All the registered user's point will be increased by 100 when it's his first access in a day(If user access to the service, he can earn 100 points. But only once in a day)
This will be checked in every page, too. Just like last_active_at
The codes for that is just like this
application_controller.rb
class ApplicationController < ActionController::Base
before_filter :record_user_activity
def record_user_activity
if current_user
#Retrieving current_user
#myself_user = User.includes(:profile).find(current_user)
#Checking if current_user hasn't received bonus for today yet
if #myself_user.point_added_at.nil? || !#myself_user.point_added_at.today?
#Checking if current_user shows his online status to public(If so he can earn 100 points)
if #myself_user.profile.activity_invisible.blank?
plus_point(#myself_user, 100)
flash[:alert] = '100 points for today's bonus is added!'
#myself_user.touch :point_added_at
#myself_user.save
end
end
#Updating last_active_at(datetime)
if #myself_user.profile.activity_invisible.blank?
#myself_user.touch :last_active_at
#myself_user.save
else
#myself_user.touch :updated_at
#myself_user.save
end
end
end
end
And this is the result of performance monitoring.
Please, tell me what would be the bottle neck problem, and how to solve it!
Thanks!
UPDATE:
my.cnf
# The following options will be passed to all MySQL clients
[client]
#password = your_password
port = 3306
socket = /var/lib/mysql/mysql.sock
# Here follows entries for some specific programs
# The MySQL server
[mysqld]
port = 3306
socket = /var/lib/mysql/mysql.sock
skip-external-locking
key_buffer_size = 16M
max_allowed_packet = 1M
table_open_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M
character_set-server=utf8
innodb_buffer_pool_size=384M
innodb_log_file_size=128M
# Don't listen on a TCP/IP port at all. This can be a security enhancement,
# if all processes that need to connect to mysqld run on the same host.
# All interaction with mysqld must be made via Unix sockets or named pipes.
# Note that using this option without enabling named pipes on Windows
# (via the "enable-named-pipe" option) will render mysqld useless!
#
#skip-networking
# Replication Master Server (default)
# binary logging is required for replication
log-bin=mysql-bin
# binary logging format - mixed recommended
binlog_format=mixed
# required unique id between 1 and 2^32 - 1
# defaults to 1 if master-host is not set
# but will not function as a master if omitted
server-id = 1
# Replication Slave (comment out master section to use this)
#
# To configure this host as a replication slave, you can choose between
# two methods :
#
# 1) Use the CHANGE MASTER TO command (fully described in our manual) -
# the syntax is:
#
# CHANGE MASTER TO MASTER_HOST=<host>, MASTER_PORT=<port>,
# MASTER_USER=<user>, MASTER_PASSWORD=<password> ;
#
# where you replace <host>, <user>, <password> by quoted strings and
# <port> by the master's port number (3306 by default).
#
# Example:
#
# CHANGE MASTER TO MASTER_HOST='125.564.12.1', MASTER_PORT=3306,
# MASTER_USER='joe', MASTER_PASSWORD='secret';
#
# OR
#
# 2) Set the variables below. However, in case you choose this method, then
# start replication for the first time (even unsuccessfully, for example
# if you mistyped the password in master-password and the slave fails to
# connect), the slave will create a master.info file, and any later
# change in this file to the variables' values below will be ignored and
# overridden by the content of the master.info file, unless you shutdown
# the slave server, delete master.info and restart the slaver server.
# For that reason, you may want to leave the lines below untouched
# (commented) and instead use CHANGE MASTER TO (see above)
#
# required unique id between 2 and 2^32 - 1
# (and different from the master)
# defaults to 2 if master-host is set
# but will not function as a slave if omitted
#server-id = 2
#
# The replication master for this slave - required
#master-host = <hostname>
#
# The username the slave will use for authentication when connecting
# to the master - required
#master-user = <username>
#
# The password the slave will authenticate with when connecting to
# the master - required
#master-password = <password>
#
# The port the master is listening on.
# optional - defaults to 3306
#master-port = <port>
#
# binary logging - not required for slaves, but recommended
#log-bin=mysql-bin
# Uncomment the following if you are using InnoDB tables
#innodb_data_home_dir = /var/lib/mysql
#innodb_data_file_path = ibdata1:10M:autoextend
#innodb_log_group_home_dir = /var/lib/mysql
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
innodb_buffer_pool_size = 768M
#innodb_additional_mem_pool_size = 2M
# Set .._log_file_size to 25 % of buffer pool size
#innodb_log_file_size = 100M
#innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 2
#innodb_lock_wait_timeout = 50
[mysqldump]
quick
max_allowed_packet = 16M
[mysql]
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
#safe-updates
default_character_set=utf8
[myisamchk]
key_buffer_size = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M
[mysqlhotcopy]
interactive-timeout
UPDATE2:
[mysqld]
port = 3306
socket = /var/lib/mysql/mysql.sock
skip-external-locking
key_buffer_size = 256M
join_buffer_size = 1M
thread_cache = 8
thread_concurrency = 8
thread_cache_size = 60
query_cache_size = 32M
max_connections = 200
max_allowed_packet = 1M
table_open_cache = 256
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 1M
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M
character_set-server=utf8
innodb_buffer_pool_size=384M
innodb_log_file_size=128M
Passenger defaults to a max of 6 concurrent processess. 6 does not sound like a lot, but in general, even with 100 users at the same time, you will not need 100 processes at the same time.
http://www.modrails.com/documentation/Users%20guide%20Apache.html#PassengerMaxPoolSize
You can increase this to 12 in passenger.
Note that each process in passenger will take up ram. Significantly more ram.
Here are 2 alternatives:
1) Move to a threaded web server, Puma. The default concurrency with Puma is 25.
2) Move the processing offline
* use Sidekiq or Resque to store the record-activity offline
Or, do all of them.
I would create a test environment, and use blitz.io to test your setup and find when your system will show slowdowns, and then stoppages.
Posting this as an Answer since unable to comment on original question.
The behavior you describe is consistent with thread or database connection management issues. Could you tell us this size of your database connection pool (e.g., 100?)? Is it possible your application is not releasing their database connections? If all the db connections in the pool are used up and not released, it would result in similar behavior you are describing.
I have a rails app running which is using redis quite a lot - however - I'm seeing quite a few Redis::TimeoutError occurring here and there, from time to time. There is no pattern in the circumstances. It occurs both in the web app and in the background jobs (which is being processed using sidekiq) - not often but from time to time.
Now I have no idea how to track down the root cause of this and hence no idea how to fix it.
Here is a little background on my setup:
The redis instance is running on a separate physical server which is connected to both my web server and background server in a private local 1Gbit network. All servers are running ubuntu 12.04. The redis version is 2.6.10. I'm connecting from my rails app (which is 3.2) using an initializer like so:
require 'redis'
require 'redis/objects'
REDIS = Redis.new(:url => APP_CONFIG['REDIS_URL'])
Redis.current = REDIS
This is the output of redis-cli INFO:
# Server
redis_version:2.6.10
redis_git_sha1:00000000
redis_git_dirty:0
redis_mode:standalone
os:Linux 3.2.0-38-generic x86_64
arch_bits:64
multiplexing_api:epoll
gcc_version:4.6.3
process_id:28475
run_id:d89bbb1b81d3169c4228cf23c0988ae437d496a1
tcp_port:6379
uptime_in_seconds:14913365
uptime_in_days:172
lru_clock:1507056
# Clients
connected_clients:233
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:19
# Memory
used_memory:801637360
used_memory_human:764.50M
used_memory_rss:594706432
used_memory_peak:4295394784
used_memory_peak_human:4.00G
used_memory_lua:31744
mem_fragmentation_ratio:0.74
mem_allocator:jemalloc-3.3.0
# Persistence
loading:0
rdb_changes_since_last_save:23166
rdb_bgsave_in_progress:0
rdb_last_save_time:1378219310
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:4
rdb_current_bgsave_time_sec:-1
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
# Stats
total_connections_received:932395
total_commands_processed:3088408103
instantaneous_ops_per_sec:837
rejected_connections:0
expired_keys:31428
evicted_keys:3007
keyspace_hits:124093049
keyspace_misses:53060192
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:17651
# Replication
role:master
connected_slaves:1
slave0:192.168.0.2,6379,online
# CPU
used_cpu_sys:54000.21
used_cpu_user:73692.52
used_cpu_sys_children:36229.79
used_cpu_user_children:420655.84
# Keyspace
db0:keys=1498962,expires=1310
In my redis config I have the following set:
\fidaemonize yes
pidfile /var/run/redis/redis-server.pid
timeout 0
loglevel notice
databases 1
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
slave-serve-stale-data yes
slave-read-only yes
slave-priority 100
maxclients 1000
maxmemory 4GB
maxmemory-policy volatile-lru
appendonly no
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
lua-time-limit 5000
slowlog-log-slower-than 10000
slowlog-max-len 128
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
That could come from many issues :
because you use the SAVE command (it is setup in your conf) generating a lot of I/O and hammering the server, especially if you use EBS volumes on Amazon.
because you have a Redis slave (same as before, doing SAVE before mirroring).
because you use a KEY * which is very slow on a lot of indexes.
Try "slowlog" command on the redis server to see if there are some "slow query".
Write some logs when "TimeoutError" happens, to see if the "error redis command" in the "slow log".
adjust your timeout setting on the client side。
It might be a problem on the client side if server performs normally. Each redis client instance, not the server, also has a timeout setting, and the default setting is very short - something like a few milliseconds. So if the server does not respond within that time, a Redis::TimeoutError will be raised by the client.
First thing you can try is to set a longer timeout value, and see if things get better.
redis_url = 'redis://user:password#host:port/'
redis = Redis.connect(:url => redis_url, :timeout => 0.7)
Even with longer timeout setting, there is no guarantee that timeout would not happen, but then it'd be a problem of the design of your system.
Are you rolling your own code to connect to redis or just letting sidekiq handle it? I think you should really just design your connection code to reconnect if the connection has been lost. You can rescue Redis::BaseConnectionError and reconnect.