RabbitMQ (beam.smp) and high CPU/memory load issue

ErlangDebianRabbitmqCeleryMnesia

Erlang Problem Overview


I have a debian box running tasks with celery and rabbitmq for about a year. Recently I noticed tasks were not being processed so I logged into the system and noticed that celery could not connect to rabbitmq. I restarted rabbitmq-server and even though celery was not complaining anymore it was not executing new tasks now. The odd thing was that rabbitmq was devouring cpu and memory resources like crazy. Restarting server would not solve the problem. After spending couple hours looking for solution online to no avail I decided to rebuild the server.

I rebuilt new server with Debian 7.5, rabbitmq 2.8.4, celery 3.1.13 (Cipater). For about an hour or so everything worked beautifully again until celery started complaining again that it can't connect to rabbitmq!

[2014-08-06 05:17:21,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Trying again in 6.00 seconds...

I restarted rabbitmq service rabbitmq-server start and same issue gain:

rabbitmq started again swelling up constantly pounding on cpu and slowly taking over all ram and swap:

PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
21823 rabbitmq  20   0  908m 488m 3900 S 731.2 49.4   9:44.74 beam.smp

Here's the result on rabbitmqctl status:

Status of node 'rabbit@li370-61' ...
[{pid,21823},
 {running_applications,[{rabbit,"RabbitMQ","2.8.4"},                        {os_mon,"CPO  CXC 138 46","2.2.9"},                        {sasl,"SASL  CXC 138 11","2.2.1"},                        {mnesia,"MNESIA  CXC 138 12","4.7"},                        {stdlib,"ERTS  CXC 138 10","1.18.1"},                        {kernel,"ERTS  CXC 138 10","2.15.1"}]},
 {os,{unix,linux}},
 {erlang_version,"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:30] [kernel-poll:true]\n"},
 {memory,[{total,489341272},          {processes,462841967},          {processes_used,462685207},          {system,26499305},          {atom,504409},          {atom_used,473810},          {binary,98752},          {code,11874771},          {ets,6695040}]},
 {vm_memory_high_watermark,0.3999999992280962},
 {vm_memory_limit,414559436},
 {disk_free_limit,1000000000},
 {disk_free,48346546176},
 {file_descriptors,[{total_limit,924},                    {total_used,924},                    {sockets_limit,829},                    {sockets_used,3}]},
 {processes,[{limit,1048576},{used,1354}]},
 {run_queue,0},

Some entries from /var/log/rabbitmq:

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
                                                                write_threshold}

=INFO REPORT==== 8-Aug-2014::00:11:36 ===
vm_memory_high_watermark set. Memory used:422283840 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
memory resource limit alarm set on node 'rabbit@li370-61'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Aug-2014::00:11:43 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 8-Aug-2014::00:11:44 ===
vm_memory_high_watermark clear. Memory used:290424384 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:11:44 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'

=INFO REPORT==== 8-Aug-2014::00:11:59 ===
vm_memory_high_watermark set. Memory used:414584504 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:11:59 ===
memory resource limit alarm set on node 'rabbit@li370-61'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Aug-2014::00:12:00 ===
vm_memory_high_watermark clear. Memory used:411143496 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:12:00 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'

=INFO REPORT==== 8-Aug-2014::00:12:01 ===
vm_memory_high_watermark set. Memory used:415563120 allowed:414559436

=WARNING REPORT==== 8-Aug-2014::00:12:01 ===
memory resource limit alarm set on node 'rabbit@li370-61'.

**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************

=INFO REPORT==== 8-Aug-2014::00:12:07 ===
Server startup complete; 0 plugins started.

=ERROR REPORT==== 8-Aug-2014::00:15:32 ===
** Generic server rabbit_disk_monitor terminating 
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
                               50000000,46946492416,100,10000,
                               #Ref<0.0.1.79456>,false}
** Reason for termination == 
** {unparseable,[]}

=INFO REPORT==== 8-Aug-2014::00:15:37 ===
Disk free limit set to 50MB

=ERROR REPORT==== 8-Aug-2014::00:16:03 ===
** Generic server rabbit_disk_monitor terminating 
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
                               50000000,46946426880,100,10000,
                               #Ref<0.0.1.80930>,false}
** Reason for termination == 
** {unparseable,[]}

=INFO REPORT==== 8-Aug-2014::00:16:05 ===
Disk free limit set to 50MB

UPDATE: Seems like problem was solved when installed newest version of rabbitmq (3.3.4-1) from rabbitmq.com repository. Originally I had one installed (2.8.4) from Debian repositories. So far rabbitmq-server is working smoothly. I will update this post if issue comes back.

UPDATE: Unfortunately after about 24 hours the issue reappeared where rabbitmq shut down and restarting the process would make it consume resources until it shuts down again within minutes.

Erlang Solutions


Solution 1 - Erlang

Finally I found the solution. These posts helped to figure this out. https://stackoverflow.com/questions/6362829/rabbitmq-on-ec2-consuming-tons-of-cpu and https://serverfault.com/questions/337982/how-do-i-restart-rabbitmq-after-switching-machines

What happened was rabbitmq was holding on to all the results that were never freed to the point it became overloaded. I cleared all the stale data in /var/lib/rabbitmq/mnesia/rabbit/, restarted rabbit and it works fine now.

My solution was to disable storing results alltogether with CELERY_IGNORE_RESULT = True in the Celery config file to assure this does not happen again.

Solution 2 - Erlang

You can also reset the queue:

Warning: This clears all data and configuration! Use with caution.

sudo service rabbitmq-server start
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app

You might need to run these command right after rebooting if your system is not responding.

Solution 3 - Erlang

You are running out the memory resources because of celery, I got a similar issue and it was a problem with the queues used by celery backend result.

You can check how many queues there are using rabbitmqctl list_queues command, take attention if that number grows forever. In that case check out your celery use.

About celery, if you are not getting the results as asycronous events dont configure a backend for store those unused results.

Solution 4 - Erlang

I experienced a similar issue and it turned out to be due to some rogue RabbitMQ client applications. The issue seems to have been that due to some un-handled error, the rogue application was continuously trying to make a connection to the RabbitMQ broker. Once the client applications were restarted, everything went back to normal (since the application stopped malfunctioning and there for stopped trying to make a connection to RabbitMQ in an endless loop)

Solution 5 - Erlang

Another possible cause: The management-plugin.

I'm running RabbitMQ 3.8.1 with enabled management-plugin. On a 10-core server I had up to 1000% CPU-usage with 3 idle consumers and not a single message being sent, and one queue.

When I disabled the management-plugin by executing rabbitmq-plugins disable rabbitmq_management the usage dropped to 0% with occasional spikes of 200%.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content TypeOriginal AuthorOriginal Content on Stackoverflow
Questionmarcin_kossView Question on Stackoverflow
Solution 1 - Erlangmarcin_kossView Answer on Stackoverflow
Solution 2 - ErlangKostyantynView Answer on Stackoverflow
Solution 3 - ErlangpfreixesView Answer on Stackoverflow
Solution 4 - ErlanggeoandView Answer on Stackoverflow
Solution 5 - ErlangantyView Answer on Stackoverflow