I have had a Debian box running tasks with Celery and RabbitMQ for about a year. Recently I noticed that tasks were not being processed, so I logged into the system and saw that Celery could not connect to RabbitMQ. I restarted rabbitmq-server, and although Celery stopped complaining, it was no longer executing new tasks. The odd thing was that RabbitMQ was devouring CPU and memory like crazy. Restarting the server did not solve the problem. After spending a couple of hours looking for a solution online to no avail, I decided to rebuild the server.
I rebuilt the server with Debian 7.5, RabbitMQ 2.8.4, and Celery 3.1.13 (Cipater). For about an hour everything worked beautifully, until Celery again started complaining that it couldn't connect to RabbitMQ!
[2014-08-06 05:17:21,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Trying again in 6.00 seconds...
I restarted RabbitMQ with service rabbitmq-server start and got the same issue again:
RabbitMQ started swelling up again, constantly pounding the CPU and slowly taking over all RAM and swap:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21823 rabbitmq 20 0 908m 488m 3900 S 731.2 49.4 9:44.74 beam.smp
Here's the output of rabbitmqctl status:
Status of node 'rabbit@li370-61' ...
[{pid,21823},
{running_applications,[{rabbit,"RabbitMQ","2.8.4"},
{os_mon,"CPO CXC 138 46","2.2.9"},
{sasl,"SASL CXC 138 11","2.2.1"},
{mnesia,"MNESIA CXC 138 12","4.7"},
{stdlib,"ERTS CXC 138 10","1.18.1"},
{kernel,"ERTS CXC 138 10","2.15.1"}]},
{os,{unix,linux}},
{erlang_version,"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:30] [kernel-poll:true]\n"},
{memory,[{total,489341272},
{processes,462841967},
{processes_used,462685207},
{system,26499305},
{atom,504409},
{atom_used,473810},
{binary,98752},
{code,11874771},
{ets,6695040}]},
{vm_memory_high_watermark,0.3999999992280962},
{vm_memory_limit,414559436},
{disk_free_limit,1000000000},
{disk_free,48346546176},
{file_descriptors,[{total_limit,924},
{total_used,924},
{sockets_limit,829},
{sockets_used,3}]},
{processes,[{limit,1048576},{used,1354}]},
{run_queue,0},
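Two things stand out in the status output above: total memory use is above vm_memory_limit, and every available file descriptor is in use (924 of 924). A minimal sketch of that comparison in Python, with the figures copied by hand from the output. The dictionary keys and the function name are illustrative only and are not part of rabbitmqctl.

```python
# Figures copied by hand from the `rabbitmqctl status` output above.
status = {
    "memory_total": 489341272,     # {memory,[{total,...}]}
    "vm_memory_limit": 414559436,  # {vm_memory_limit,...}
    "fd_total_limit": 924,         # {file_descriptors,[{total_limit,...}]}
    "fd_total_used": 924,          # {total_used,...}
}


def diagnose(s):
    """Return a list of human-readable warnings for a status snapshot."""
    warnings = []
    if s["memory_total"] > s["vm_memory_limit"]:
        over = s["memory_total"] - s["vm_memory_limit"]
        warnings.append("memory over limit by %d bytes" % over)
    if s["fd_total_used"] >= s["fd_total_limit"]:
        warnings.append("file descriptors exhausted")
    return warnings


print(diagnose(status))
```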
Some entries from /var/log/rabbitmq:
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=INFO REPORT==== 8-Aug-2014::00:11:36 ===
vm_memory_high_watermark set. Memory used:422283840 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
memory resource limit alarm set on node 'rabbit@li370-61'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 8-Aug-2014::00:11:43 ===
started TCP Listener on [::]:5672
=INFO REPORT==== 8-Aug-2014::00:11:44 ===
vm_memory_high_watermark clear. Memory used:290424384 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:11:44 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'
=INFO REPORT==== 8-Aug-2014::00:11:59 ===
vm_memory_high_watermark set. Memory used:414584504 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:11:59 ===
memory resource limit alarm set on node 'rabbit@li370-61'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 8-Aug-2014::00:12:00 ===
vm_memory_high_watermark clear. Memory used:411143496 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:12:00 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'
=INFO REPORT==== 8-Aug-2014::00:12:01 ===
vm_memory_high_watermark set. Memory used:415563120 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:12:01 ===
memory resource limit alarm set on node 'rabbit@li370-61'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 8-Aug-2014::00:12:07 ===
Server startup complete; 0 plugins started.
=ERROR REPORT==== 8-Aug-2014::00:15:32 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
50000000,46946492416,100,10000,
#Ref<0.0.1.79456>,false}
** Reason for termination ==
** {unparseable,[]}
=INFO REPORT==== 8-Aug-2014::00:15:37 ===
Disk free limit set to 50MB
=ERROR REPORT==== 8-Aug-2014::00:16:03 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
50000000,46946426880,100,10000,
#Ref<0.0.1.80930>,false}
** Reason for termination ==
** {unparseable,[]}
=INFO REPORT==== 8-Aug-2014::00:16:05 ===
Disk free limit set to 50MB
UPDATE:
It seems the problem was solved when I installed the newest version of RabbitMQ (3.3.4-1) from the rabbitmq.com repository. Originally I had the version from the Debian repositories (2.8.4) installed. So far rabbitmq-server is working smoothly. I will update this post if the issue comes back.
UPDATE:
Unfortunately, after about 24 hours the issue reappeared: RabbitMQ shut down, and restarting the process made it consume resources until it shut down again within minutes.
Finally I found the solution. These posts helped me figure it out:
RabbitMQ on EC2 Consuming Tons of CPU
and
https://serverfault.com/questions/337982/how-do-i-restart-rabbitmq-after-switching-machines
What happened was that RabbitMQ was holding on to all the task results, which were never freed, to the point that it became overloaded. I cleared all the stale data in /var/lib/rabbitmq/mnesia/rabbit/, restarted RabbitMQ, and it works fine now.
To make sure this does not happen again, my solution was to disable storing results altogether by setting CELERY_IGNORE_RESULT = True in the Celery config file.
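For reference, a minimal sketch of what that setting might look like in a Celery 3.1-style config module. The module name celeryconfig.py is an assumption, and the commented-out expiry setting is an alternative that was not part of the fix described above.

```python
# celeryconfig.py (hypothetical module name)

# Do not store task return values at all, so no result messages
# accumulate in the broker.
CELERY_IGNORE_RESULT = True

# Alternative if some results are needed: keep them, but expire them
# after one hour (value in seconds) instead of the default of one day.
# CELERY_TASK_RESULT_EXPIRES = 3600
```

Individual tasks can also opt out by passing ignore_result=True to the task decorator, which leaves results enabled for the tasks that actually need them.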
You can also reset the queue:
Warning: This clears all data and configuration! Use with caution.
sudo service rabbitmq-server start
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
You might need to run these commands right after rebooting if your system is not responding.
You are running out of memory because of Celery. I had a similar issue, and the cause was the queues used by the Celery result backend.
You can check how many queues there are with the rabbitmqctl list_queues command; pay attention to whether that number keeps growing. If it does, review how you use Celery.
As for Celery: if you are not consuming the results as asynchronous events, don't configure a backend to store those unused results.
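To watch the queue count over time, the output of rabbitmqctl list_queues can be parsed with a few lines of Python. This is only a sketch: it assumes the tab-separated "name, messages" layout, skips any banner or header lines, and the sample queue names are made up for illustration.

```python
import subprocess


def parse_list_queues(output):
    """Parse `rabbitmqctl list_queues name messages` output.

    Returns a dict mapping queue name to message count. Lines without a
    tab-separated integer in the last column (banners, headers) are skipped.
    """
    queues = {}
    for line in output.splitlines():
        name, sep, count = line.rpartition("\t")
        if sep and count.strip().isdigit():
            queues[name] = int(count)
    return queues


def current_queues():
    """Run rabbitmqctl (usually needs root) and parse its output."""
    out = subprocess.check_output(
        ["rabbitmqctl", "list_queues", "name", "messages"]
    )
    return parse_list_queues(out.decode("utf-8", "replace"))


# Hypothetical sample output; real queue names will differ.
sample = "Listing queues ...\ncelery\t12\na1b2c3d4\t1\n...done.\n"
queues = parse_list_queues(sample)
print("%d queues, %d messages" % (len(queues), sum(queues.values())))
```

Calling current_queues() from a cron job and logging len(queues) would show whether the number keeps growing.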
I experienced a similar issue and it turned out to be due to some rogue RabbitMQ client applications.
The issue seems to have been that, due to an unhandled error, the rogue application was continuously trying to connect to the RabbitMQ broker.
Once the client applications were restarted, everything went back to normal (the application stopped malfunctioning and therefore stopped trying to connect to RabbitMQ in an endless loop).
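One common way to keep a misbehaving client from hammering the broker is to retry with a capped exponential backoff and give up after a fixed number of attempts. A minimal sketch, assuming a hypothetical connect callable that opens the broker connection and raises OSError on failure (this is not the API of any particular RabbitMQ client library):

```python
import random
import time


def backoff_delays(base=1.0, cap=60.0, attempts=8):
    """Yield capped exponential delays: base, 2*base, 4*base, ... up to cap."""
    for attempt in range(attempts):
        yield min(cap, base * (2 ** attempt))


def connect_with_backoff(connect, base=1.0, cap=60.0, attempts=8,
                         sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    `connect` is a hypothetical callable that raises OSError on failure.
    """
    last_error = None
    for delay in backoff_delays(base, cap, attempts):
        try:
            return connect()
        except OSError as exc:
            last_error = exc
            # Add a little jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 10.0))
    raise ConnectionError("gave up after %d attempts" % attempts) from last_error
```

The point of the attempts limit is that a client with a persistent error stops on its own instead of reconnecting forever.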
Another possible cause: the management plugin.
I'm running RabbitMQ 3.8.1 with the management plugin enabled.
On a 10-core server I saw up to 1000% CPU usage with 3 idle consumers, one queue, and not a single message being sent.
When I disabled the management plugin by running rabbitmq-plugins disable rabbitmq_management, usage dropped to 0% with occasional spikes of 200%.
After a fresh installation, my ejabberd (15.11) server still keeps crashing.
ejabberd-15.11/logs/crash.log
Offender: [{pid,{restarting,<0.366.0>}},{name,ejabberd_listener},{mfargs,{ejabberd_listener,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]
2015-12-26 08:26:54 =ERROR REPORT====
Error in process <0.631.1> on node 'ejabberd@archie' with exit value: {badarg,[{ets,lookup,[local_config,{hosts,global}],[]},{ejabberd_config,get_option,3,[{file,"src/ejabberd_config.erl"},{line,749}]},{ejabberd_system_monitor,process_large_heap,2,[{file,"src/ejabberd_system_monitor.er...
2015-12-26 08:26:54 =ERROR REPORT====
Error in process <0.632.1> on node 'ejabberd@archie' with exit value: {badarg,[{ets,lookup,[local_config,{hosts,global}],[]},{ejabberd_config,get_option,3,[{file,"src/ejabberd_config.erl"},{line,749}]},{ejabberd_system_monitor,process_large_heap,2,[{file,"src/ejabberd_system_monitor.er...
2015-12-26 08:26:54 =ERROR REPORT====
[{application_master,shutdown_error},{ejabberd_app,{prep_stop,[[]]}},{error_info,{badarg,[{ets,lookup,[local_config,{listen,global}],[]},{ejabberd_config,get_option,3,[{file,"src/ejabberd_config.erl"},{line,749}]},{ejabberd_listener,stop_listeners,0,[{file,"src/ejabberd_listener.erl"},{line,380}]},{ejabberd_app,prep_stop,1,[{file,"src/ejabberd_app.erl"},{line,84}]},{application_master,prep_stop,2,[{file,"application_master.erl"},{line,376}]},{application_master,loop_it,4,[{file,"application_master.erl"},{line,368}]}]}}]
ejabberd-15.11/logs/error.log
Failed TCP accept: emfile
Failed TCP accept: emfile
Your deployment is broken. You should look for the root cause (check the logs). In your situation the ets module reports the local_config table as missing, so something went very wrong earlier. It may be related to your custom local modules.
Check the log for previous errors.
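One earlier error that is visible in error.log is "Failed TCP accept: emfile", which corresponds to the EMFILE errno: the process hit its open-file limit. As a rough way to inspect such limits on Linux, here is a sketch in Python (not an ejabberd tool; the /proc path is Linux-specific and the function names are illustrative):

```python
import errno
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open descriptors of a process via /proc (Linux only).

    Returns None where /proc is unavailable or not readable.
    """
    try:
        return len(os.listdir("/proc/%s/fd" % pid))
    except OSError:
        return None


soft, hard = fd_limits()
print("EMFILE means:", os.strerror(errno.EMFILE))
print("soft limit:", soft, "hard limit:", hard, "open now:", open_fd_count())
```

Note that this reports the limits of the Python process itself. What matters is the limit of the running ejabberd process, which on Linux can be read from /proc/<pid>/limits, and whose descriptor count can be obtained by passing its pid to open_fd_count (with sufficient permissions).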
My RabbitMQ service crashes as soon as I start it. The service had been running fine for the last two years, but it suddenly stopped working. I looked at the log and saw this:
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
Starting RabbitMQ 3.0.2 on Erlang R16B
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
Limiting to approx 8092 file handles (7280 sockets)
=WARNING REPORT==== 8-Oct-2015::11:40:56 ===
Only 2048MB of 4095MB memory usable due to limited address space.
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
Memory limit set to 819MB of 4095MB total.
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
Disk free limit set to 1000MB
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 8-Oct-2015::11:40:56 ===
started TCP Listener on [::]:5672
=ERROR REPORT==== 8-Oct-2015::11:40:56 ===
Error in process
So I decided to run the following command:
rabbitmqctl set_vm_memory_high_watermark 0.5
It gives the following error:
Setting memory threshold on rabbit@sn4324324 to 0.5 ...
Error: unable to connect to node rabbit@sn4324324: nodedown
DIAGNOSTICS
===========
nodes in question: [rabbit@sn4324324]
hosts, their running nodes and ports:
- sn4324324: [{rabbitmqctl601389,64542}]
current node details:
- node name: rabbitmqctl601389@sn4324324
- home dir: C:\Users\TestUser
- cookie hash: /GF4XhumN66/5SsNp0a8gQ==
How can I set the value for vm_memory_high_watermark?
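As background for the numbers involved: the "Memory limit set to 819MB" line in the log is consistent with RabbitMQ's default watermark of 0.4 applied to the 2048MB of usable address space, not to the 4095MB total. A small worked example in Python (the function name is made up for illustration, and the exact rounding RabbitMQ applies is an assumption):

```python
def memory_limit_mb(usable_mb, watermark):
    """Memory limit implied by a watermark fraction of usable memory."""
    return usable_mb * watermark


# From the log above: 2048MB usable, default watermark 0.4.
print(int(memory_limit_mb(2048, 0.4)))  # 819, matching the log line
# The value attempted with rabbitmqctl, 0.5, would give:
print(int(memory_limit_mb(2048, 0.5)))  # 1024
```

The nodedown error itself shows that rabbitmqctl could not reach a running node, so the command can only take effect once the broker stays up.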
My Riak node is not responding to ping, yet everything seems OK:
(myclient@127.0.0.1)2> {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 10018).
{ok,<0.217.0>}
(myclient@127.0.0.1)3> riakc_pb_socket:ping(Pid).
** exception exit: disconnected
(myclient@127.0.0.1)4>
=ERROR REPORT==== 21-Dec-2014::04:41:22 ===
** Generic server <0.217.0> terminating
** Last message in was {req_timeout,#Ref<0.0.0.306>}
** When Server state == {state,"127.0.0.1",10018,false,false,undefined,
undefined,
{[],[]},
1,[],infinity,100}
** Reason for termination ==
** disconnected
What am I not doing right?
Thanks in advance.
Port 10018 is the HTTP listening port for a Riak cluster set up in development mode, but you're using the protocol buffers client. Try port 10017 instead.
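If you are unsure which protocol a port speaks, one quick check is to send it a plain HTTP request and see whether the reply looks like HTTP. The following is a Python sketch, not part of the Riak client; the function name and the /ping path are assumptions for illustration, and any reply starting with an HTTP status line counts as a match.

```python
import socket


def speaks_http(host, port, timeout=2.0):
    """Return True if the listener at host:port answers like an HTTP server."""
    request = (b"GET /ping HTTP/1.0\r\nHost: "
               + host.encode("ascii") + b"\r\n\r\n")
    reply = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # Read until we have enough bytes to see the "HTTP/" prefix.
            while len(reply) < 5:
                chunk = sock.recv(16)
                if not chunk:
                    break
                reply += chunk
    except OSError:
        # Refused, timed out, or reset: not an HTTP answer.
        return False
    return reply.startswith(b"HTTP/")


# Example (ports taken from the answer above):
# speaks_http("127.0.0.1", 10018)  # expected True if this is the HTTP port
# speaks_http("127.0.0.1", 10017)  # expected False for a protocol buffers port
```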
I was about to install Sensu with Chef, but RabbitMQ does not seem to be working. The rabbitmq-server service does not start, even though the installation of Erlang and RabbitMQ was successful.
The RabbitMQ error says
Error: unable to connect to node rabbit@localhost: nodedown
and
rabbitmq service has already started
So I checked for RabbitMQ processes with the ps command.
ps aux |grep rabbitmq
Sure enough, one process is running as the rabbitmq user:
/usr/lib64/erlang/erts-6.1/bin/epmd -daemon
I killed that process and restarted the rabbitmq-server service. However, rabbitmq-server failed to start again, and the log showed the same error.
I also removed Erlang and RabbitMQ and reinstalled them, but the result was the same.
The details follow.
Server
OS: CentOS 6.5
Related installed packages
erlang.x86_64 17.1-1.1.el6
rabbitmq-server.noarch 3.1.5-1.el6
Original log
# /etc/init.d/rabbitmq-server status
Status of node rabbit@localhost ...
Error: unable to connect to node rabbit@localhost: nodedown
DIAGNOSTICS
===========
nodes in question: [rabbit@localhost]
hosts, their running nodes and ports:
- localhost: [{rabbitmqctl23036,37270}]
current node details:
- node name: rabbitmqctl23036@localhost
- home dir: /var/lib/rabbitmq
- cookie hash: Tghu0ucbQ4pq3Sc0JJBbAg==
# tail /var/log/rabbitmq/rabbit\@localhost.log
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
Starting RabbitMQ 3.1.5 on Erlang 17
Copyright (C) 2007-2013 GoPivotal, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
node : rabbit@localhost
home dir : /var/lib/rabbitmq
cookie hash : 9qNy1Q7BP12PVVcbSnZwRw==
log : /var/log/rabbitmq/rabbit@localhost.log
sasl log : /var/log/rabbitmq/rabbit@localhost-sasl.log
database dir : /var/lib/rabbitmq/mnesia/rabbit@localhost
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
Limiting to approx 924 file handles (829 sockets)
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
Memory limit set to 802MB of 2006MB total.
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
Disk free limit set to 1000MB
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
started TCP Listener on [::]:5672
=INFO REPORT==== 6-Aug-2014::14:59:15 ===
Error description:
{case_clause,{error,{already_started,<0.193.0>}}}
Log files (may contain more information):
/var/log/rabbitmq/rabbit@localhost.log
/var/log/rabbitmq/rabbit@localhost-sasl.log
Stack trace:
[{rabbit_networking,start_listener0,4,[]},
{rabbit_networking,'-start_listener/4-lc$^0/1-0-',4,[]},
{rabbit_networking,start_listener,4,[]},
{rabbit_networking,'-boot_ssl/0-lc$^0/1-0-',1,[]},
{rabbit_networking,boot_ssl,0,[]},
{rabbit_networking,boot,0,[]},
{rabbit,'-run_boot_step/1-lc$^1/1-1-',1,[]},
{rabbit,run_boot_step,1,[]}]
=INFO REPORT==== 6-Aug-2014::14:59:16 ===
stopped TCP Listener on [::]:5672
=INFO REPORT==== 6-Aug-2014::14:59:16 ===
Error description:
{could_not_start,rabbit,
{bad_return,
{{rabbit,start,[normal,[]]},
{'EXIT',
{rabbit,failure_during_boot,
{case_clause,{error,{already_started,<0.193.0>}}}}}}}}
Log files (may contain more information):
/var/log/rabbitmq/rabbit@localhost.log
/var/log/rabbitmq/rabbit@localhost-sasl.log
It looks like the port that RabbitMQ uses (5672, to be exact) is already taken, or in your case may still be taken. If you kill an application that has a socket open on some port, you do not give it time to close the connection properly. The system will eventually notice this and free the resource, but that may take some time. So you could either wait a little or change the RabbitMQ configuration.
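To see whether something is still accepting connections on that port before starting the service again, a simple TCP probe is enough. A minimal Python sketch (the function name is illustrative; it only tells you that some process is listening, not which one):

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused or timed out: nothing reachable on that port.
        return False


# 5672 is the AMQP port from the log above.
print("5672 in use:", is_listening("127.0.0.1", 5672))
```

If this prints True while rabbitmq-server is supposedly stopped, another process (or a leftover broker instance) still holds the port.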
Hope this will help with some of your problems.
I am trying to start Tsung slaves on EC2 machines.
Keys are in place, and the test with
erl -rsh ssh -sname root -setcookie mycookie
slave:start('i-d6807c9d',root,"-setcookie mycookie").
> {ok,'root@i-d6807c9d'}
is working.
When I now run Tsung, I get the following error message: no_rsh.
The Erlang documentation says that no_rsh means "There is no rsh program on the computer.".
=INFO REPORT==== 18-Sep-2012::10:51:15 ===
ts_config_server:(5:<0.52.0>) SYSINFO:Current path: /usr/lib/erlang/lib/tsung-1.4.2/ebin/tsung.beam
=INFO REPORT==== 18-Sep-2012::10:51:15 ===
ts_job_notify:(5:<0.64.0>) No listen port defined, can't open listening socket
=INFO REPORT==== 18-Sep-2012::10:51:15 ===
ts_os_mon:(5:<0.49.0>) os_mon disabled
=INFO REPORT==== 18-Sep-2012::10:51:15 ===
ts_mon:(5:<0.53.0>) Activate clients with text backend
=INFO REPORT==== 18-Sep-2012::10:51:15 ===
ts_config_server:(0:<0.73.0>) Can't start newbeam on host 'i-d6807c9d' (reason: no_rsh) ! Aborting!
Does anyone know how to get this running?
thx
I found the mistake.
I had used the Tsung parameter "-r" to set the remote connector to a debug script mentioned here: Tsung error: can't start newbeam on host.
This script was not available on the EC2 instance. Running the test without the "-r" option worked.