Possible memory issue crashing HBase Thrift Server

I'm running Cloudera CDH4 with HBase and the HBase Thrift Server. Several times a day, the Thrift Server crashes.
In /var/log/hbase/hbase-hbase-thrift-myserver.out, there is this:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 8151"...
In /var/log/hbase/hbase-hbase-thrift-myserver.log, there are no error messages at the end of the file. There are only a lot of DEBUG messages stating that one of the nodes is caching a particular file.
I can't figure out any configuration options for the HBase Thrift Server. There are no obvious files in /etc/, just /etc/hbase/conf and its HBase files.
Any ideas on debugging?

We had this exact same problem with our HBase Thrift setup, and ended up using a watchdog script that restarts Thrift if it's not running.
Are you hitting your HBase server hard, several times a day? That could cause this. There isn't really a way around it: Thrift does seem to take up (or leak) a lot of memory every time it's used, so you need a watchdog script (a bare-bones sketch follows the cron entry below).
If a watchdog script feels too heavy-duty, you could use a simple cron job to restart Thrift at frequent intervals to make sure it stays up.
The following cron entry restarts Thrift every two hours:
0 */2 * * * hbase-daemon.sh restart thrift
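If you do go the watchdog route, here is a bare-bones sketch; the pgrep pattern, log path, and script location are assumptions, so adjust them for your installation:
#!/bin/bash
# Hypothetical watchdog: restart the HBase Thrift server if it is not running.
# Assumes the Thrift server shows up in the process list under
# org.apache.hadoop.hbase.thrift and that hbase-daemon.sh is on the PATH.
if ! pgrep -f "org.apache.hadoop.hbase.thrift" > /dev/null; then
  echo "$(date): HBase Thrift server down, restarting" >> /var/log/hbase/thrift-watchdog.log
  hbase-daemon.sh start thrift
fi
Run it from cron every minute or two, e.g. */2 * * * * /usr/local/bin/thrift-watchdog.sh.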

Using /etc/hbase/conf/hbase-env.sh, I increased my heap size, and this addressed the crashing issue.
# The maximum amount of heap to use, in MB. Default is 1000.
export HBASE_HEAPSIZE=8000
Thanks to Harsh J on the CDH Users mailing list for helping me figure this out. As he pointed out, my lack of log messages indicates a kill -9 is probably taking place:
Indeed if a shutdown handler message is missing in the log tail
pre-crash, there may have been a kill -9 passed to the process via the
OOM handler.

Increasing the heap size may not always be the solution.
As per this Cloudera blog post, the Thrift server might be receiving invalid data.
I would suggest enabling the framed transport and the compact protocol.
There's a catch: if you enable these on the server, the clients must use the same transport and protocol.
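For reference, a sketch of what that might look like in /etc/hbase/conf/hbase-site.xml; the property names hbase.regionserver.thrift.framed and hbase.regionserver.thrift.compact are my assumption of the relevant settings, so verify them against your HBase/CDH version:
<!-- Add inside the <configuration> element of /etc/hbase/conf/hbase-site.xml -->
<property>
  <name>hbase.regionserver.thrift.framed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.regionserver.thrift.compact</name>
  <value>true</value>
</property>
Thrift clients then need to be created with the framed transport and compact protocol as well, otherwise they will fail to talk to the server.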

Related

Docker container stopping in the Jelastic environment

When stopping a Docker container in a native Docker environment, Docker by default sends the SIGTERM signal to the container's init process (PID 1), which should be the actual application, and that application should then handle the shutdown properly. However, when running the container in Jelastic this does not seem to be the case; instead of gracefully terminating the SQL server, the server crashes every time.
I did try writing and enabling a systemd service that gets the SQL server's PID and then sends SIGTERM to it, but it doesn't seem to run; judging from the logs there are no service shutdown messages at all, just startup messages.
So what changes would be required to the container or the environment for the server to receive the SIGTERM signal and have enough time, maybe a few seconds, to do a graceful shutdown?
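In case it helps while waiting for a platform fix, a common container-side pattern is to make PID 1 a tiny entrypoint that traps SIGTERM and forwards it to the server, so the signal is handled whenever it actually is delivered. A rough sketch, with start_sql_server as a placeholder for the real start command:
#!/bin/bash
# Hypothetical entrypoint sketch: forward SIGTERM to the server process and
# wait for it to shut down gracefully. start_sql_server is a placeholder.
start_sql_server &
child=$!
trap 'kill -TERM "$child"; wait "$child"' TERM INT
wait "$child"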
Thank you for reporting the issue. We tried to reproduce the problem in our test lab and got exactly the same result. We agree that the issue is serious, so we are going to fix it with the highest priority. Please accept our apologies for the inconvenience. I want to note that, per our primary design, we also expect the process to be terminated first with a SIGTERM signal; only after not receiving a termination result for some period of time should the system send SIGKILL, having concluded that the process cannot be terminated gracefully. Our engineers will explore the issue in more depth and deliver a fix shortly.
Thank you!

SSH and -bash: fork: Cannot allocate memory (Ubuntu, Rails, Passenger, Redis, Sidekiq)

I'm running a Rails app (dev server) with Passenger on an Amazon AWS t2.micro instance, but I'm constantly getting the -bash: fork: Cannot allocate memory error.
I'm running a Redis server on it with a Sidekiq concurrency of 50. Normally the site runs fine, but when I start 2-3 Sidekiq processes simultaneously to do some batch processing, the site takes a long time to redirect and eventually crashes with
502 Bad Gateway
nginx/1.10.0
Then I have to restart nginx every time to get the site running again. This is my dev server, so I don't want to spend more money on upgrading to a t2.small (as of now, that's our last option); it's a dev server and will only be used twice in 15 days. Is there any other way I can solve this? Previously I had the same concurrency of 120 as in production, but then I changed it to 50. That helped a bit, but there are still memory problems.
Here are a few stats from htop.
These stats are from while the server is idle; when I run a few tasks with Sidekiq it crashes with a 502.
I checked a few posts suggesting swap memory, but I'm not sure that's preferable on a t2.micro. Is it advisable for this server setup? As you can see in the picture, I don't have any swap memory. Is it okay to add swap memory to tackle this issue, or is there any better option?
Your server is short on memory. To fix it, either:
buy more RAM, or
mount a swap file or partition.
Then try again.
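If you go the swap route, here is a minimal sketch for adding a 1 GB swap file (the size and path are just examples):
# Create and enable a 1 GB swap file (run as root)
sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# Make it persistent across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab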
In my case, Redis used 2.5 GB of memory; the server has 4.5 GB in total, of which 3 GB was used and 1.5 GB was free, and Redis kept throwing this error.
Solution: add vm.overcommit_memory=1 to /etc/sysctl.conf, then apply it with:
sudo sysctl -p /etc/sysctl.conf
Refer to: redis bgsave failed because fork Cannot allocate memory

Sidekiq worker is leaking memory

Using the sidekiq gem, I have a Sidekiq worker that runs a process (a git-tf clone of a big repository) using IO.popen and tracks its stdout to check the progress of the clone.
When I run the worker, I see Sidekiq's memory usage grow over time until I get a kernel OOM and the process gets killed. The subprocess (a Java process) takes only 5% of the total memory.
How can I debug/track down the memory leak in my code? And is the Sidekiq memory figure the total of my workers' memory plus the popen subprocess?
And does anyone have any idea how to fix it?
EDIT
This is the code of my worker -
https://gist.github.com/yosy/5227250
EDIT 2
I ran the code without Sidekiq and there are no memory leaks; this is something strange with Sidekiq and big repositories in TFS.
I didn't find the cause of the memory leak in Sidekiq, but I found a way to move away from Sidekiq.
I have modified git-tf to have a server command that accepts commands from a Redis queue; it removes a lot of complexity from my code.
The modified version of git-tf is here:
https://github.com/yosy/gittf
I will add documentation about the server command later, once I fix some bugs.

How to detect and prevent spawning failing Unicorn workers

Situation: I am using Rails + Unicorn, deploying with Capistrano. Sometimes the Rails app fails to start in production mode (though it is not real production, but a staging env). This usually happens due to errors in deploy scripts or configuration (and thus is usually not detectable by tests). When this happens, the unicorn master process kills the worker that failed and spawns a new one, which also fails, and so on and so forth. During all that time unicorn consumes lots of CPU and pollutes the logs with the same message.
Manual way (not good): Go to your home page to see if it works. Look at htop. Tail the logs. Kill unicorn manually. Cons: easy to forget, logs are polluted, and CPU is loaded while you are reacting.
Another solution: Use unicorn's preload_app true. This will cause the master process to fail fast. Cons: higher memory consumption in the happy scenario.
Best practice: - ???
Is there any way to cleverly detect that unicorn master uselessly tries to spawn failing children and stop it?
You have something like "unicorn start" in your Capistrano script, right? Make your Capistrano script ping Unicorn right after invoking that command. If Unicorn does not return an expected response within a timeout, then you know that something went wrong, and you can choose to roll back the deploy or perform some other action.
As for how to ping Unicorn, that depends. If you have Unicorn listening on a TCP socket then you can use curl. If you have Unicorn listening on a Unix domain socket then you have to write a little script that connects to it, like this:
require 'socket'
# Connect to Unicorn's Unix domain socket and issue a minimal HTTP request.
sock = UNIXSocket.new('/path-to-unicorn.sock')
sock.write("HEAD / HTTP/1.0\r\n")
sock.write("Host: www.foo.com\r\n")
sock.write("Connection: close\r\n")
sock.write("\r\n")
# Exit non-zero if the response does not contain what we expect.
if sock.read !~ /something/
  exit 1
end
But it sounds like Phusion Passenger Enterprise solves your problem beautifully. It has this feature called "deployment error resistance". When you deploy a new version and Phusion Passenger detects that it cannot spawn any processes for your new codebase, it will stop trying to spawn your new version and keep the processes for the old versions around indefinitely, until you manually give the signal that it's okay to spawn processes for the new version. In the meantime it will log all errors into the log file so that you can analyze the problem.
I would suggest brushing up on your bash skills. The functionality you need is already in Unicorn, as it leverages the Unix-y master/worker process model.
You need an init.d script, or at the very least godrb or monit. I recommend the init.d script route AND monitoring. It's more complex, but it can more easily be leveraged by your monitoring software and also gives you an automatic start on reboot.
The gist of it is:
Send the USR2 signal to the unicorn master process; this will fork a new master process.
Then send WINCH to the old master process; this will kill each of its workers.
Then you can send the old master process the QUIT signal.
Unicorn Signals
This will spin up a new master process running the new code and label the old one as (old). If it fails, the old one should be returned to its prior state and you shouldn't suffer an outage, just a restart error. This is the beauty of unicorn: you can get almost instantaneous deploys of your code.
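A rough sketch of that signal sequence from the shell, assuming the master's pid file lives at /path/to/unicorn.pid and that the old master's pid file gets renamed to unicorn.pid.oldbin during the USR2 handoff (check your unicorn configuration for the actual paths):
# Fork a new master running the new code
kill -USR2 $(cat /path/to/unicorn.pid)
# Give the new master time to boot, then wind down the old master's workers
sleep 15
kill -WINCH $(cat /path/to/unicorn.pid.oldbin)
# Finally ask the old master to quit
kill -QUIT $(cat /path/to/unicorn.pid.oldbin)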
I'm using a lot of hedge words because I did this work on my apps over a year ago so there are a lot of cobwebs upstairs. Hope this helps!
This is by no means a complete or correct script. It's a good starting point, though ... feel free to update the gist if you can improve upon it! :-)
Example Unicorn Control Script

How many threads should Jenkins run?

I have a Jenkins server that keeps running out of memory and cannot create native threads. I've upped the memory and installed the Monitoring plugin.
There are about 150 projects on the server, and I've been watching the thread count creep up all day. It is now around 990. I expect when it hits 1024, which is the user limit for threads, Jenkins will run out of memory again.
[edit]: I have hit 1016 threads and am now getting the out of memory error
Is this an appropriate number of threads for Jenkins to be running? How can I tell Jenkins to destroy threads when it is finished with them?
tl;dr:
There was a post-build action running a bash script that didn't return anything via stderr or stdout to Jenkins. Therefore, every time the build ran, threads would be created and get stuck waiting. I resolved this issue by having the bash script return an exit status.
long answer
I am running Jenkins on CentOS and have installed it via the RPM. In terms of modifying the Winstone servlet container, you can change its options in Jenkins's init configuration in /etc/sysconfig/jenkins. However, the handlerCount options (listed in the other answer below) only control the number of HTTP threads that are created, not the number of threads overall.
That would be a solution if my threads were hanging on accessing an HTTP API of Jenkins as part of a post-commit action. However, using the ever-handy Monitoring plugin mentioned in my question, I inspected the stuck threads.
The threads were stuck on something in the com.trilead.ssh2.channel package. The getChannelData method has a while(true) loop that looks for output on the stderr or stdout of an SSH stream. The thread was getting stuck in that loop because nothing was coming through. I learned this on GrepCode.
This was because the post-build action was to connect to a server via SSH and execute a bash script that would inspect a git repo. However, the git repo was misconfigured and the git command would error, but the exit 1 status did not bubble up through the bash script (partially due to a malformed if-elif-else statement).
The script completed and the build was considered a success, but somehow the thread handling the SSH connection from Jenkins was left hanging due to this git error.
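For illustration, a sketch of the kind of script that surfaces the git result as output plus a non-zero exit status instead of swallowing it (the repo path is a placeholder, not the actual build script):
#!/bin/bash
# Hypothetical post-build check: report the git result to Jenkins explicitly.
if git -C /path/to/repo fetch origin; then
  echo "repo reachable"
  exit 0
else
  echo "git fetch failed" >&2
  exit 1
fi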
But thank you for your help on this question!
If you run Jenkins "out of the box", it uses the Winstone servlet container. You can pass command-line arguments to it as described here. Some of those parameters can limit the number of threads:
--handlerCountStartup: sets the number of worker threads to spawn at startup. Default is 5.
--handlerCountMax: sets the maximum number of worker threads to allow. Default is 300.
--handlerCountMaxIdle: sets the maximum number of idle worker threads to allow. Default is 50.
Now, I tried this some time ago and was not 100% convinced that it worked, so no guarantees, but it is worth a try.
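On the CentOS RPM install these flags are typically set through /etc/sysconfig/jenkins; the exact variable names below are assumptions (some packages expose dedicated JENKINS_HANDLER_* variables, others a generic JENKINS_ARGS), so check the file you actually have:
# /etc/sysconfig/jenkins (illustrative values; variable names may differ per package)
JENKINS_HANDLER_STARTUP="5"
JENKINS_HANDLER_MAX="100"
JENKINS_HANDLER_IDLE="20"
Restart Jenkins afterwards for the change to take effect.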
