PHP-FPM / FastCGI + exit() causing CPU spikes

I have been having intermittent issues on some servers running Archlinux / php-fpm 5.3.9 in FastCGI on Cherokee 1.2.101. I am using a caching plugin that builds and serves static cache files using logic like:
$cache_file = md5($host . $uri) . '.cache';
if( file_exists($cache_file) ) {
    $cache_file_contents = file_get_contents($cache_file);
    exit( $cache_file_contents );
}
// else build/save the $cache_file
A few processes will end up in the php-fpm slow log hanging on that exit() call. At that time the load spikes, 100% CPU usage goes (almost) entirely to the webserver, and PHP pages start returning 500 Internal Server errors. Sometimes the server recovers on its own; other times I need to restart php-fpm and Cherokee.
I have the FastCGI settings for PHP-FPM configured to do a
Even though this is a VPS, I would tentatively rule out I/O wait on the filesystem, as the cache file should already be loaded. I have not been able to catch it in the act to test with vmstat.
I have pm.max_requests set to 500 but wonder if the exit() call is interfering with the cycling of processes.
The php-fpm log shows a lot of WARNING: [pool www] seems busy (you may need to increase pm.start_servers, or pm.min/max_spare_servers) messages. This seems to be a normal part of php-fpm regulating the number of child processes in the pool, though.
Any tips on troubleshooting would be appreciated. Here are 3 things I found that raised some red flags:
http://www.php.net/manual/en/function.exit.php#96930
https://serverfault.com/questions/84962/php-via-fastcgi-terminated-by-calling-exit#85008
Errors when calling exit() function for fastCGI?

This might be related to passing the output over to the server (track I/O as well). You could keep FPM out of the way entirely by having your webserver serve the static cache files itself. Apart from that, I'd suggest you use this PHP chunk instead to lessen memory use and I/O a little:
if (file_exists($cache_file))
{
    readfile($cache_file);
    exit;
}
See readfile.
If you don't want to use exit (I personally have never encountered an issue using it with FastCGI in PHP), you should clean up your code so that exit is not necessary: e.g. you could return instead, or look at your code flow to see why you need exit and eliminate the need for it.
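For instance, a rough sketch of what that restructuring might look like (the serve_from_cache helper is hypothetical, not part of the original plugin):
// Hypothetical helper: emit the cached file if present and report success,
// instead of terminating the request with exit().
function serve_from_cache($cache_file)
{
    if (file_exists($cache_file)) {
        readfile($cache_file);
        return true;
    }
    return false;
}

if (serve_from_cache($cache_file)) {
    return; // returning from the top level of index.php ends the request without exit()
}
// ... otherwise build/save the $cache_file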

I ended up using the Pythonic Exception wrapping method cited in the comments at http://www.php.net/manual/en/function.exit.php
In the main index.php
class SystemExit extends Exception {}
try {
    /* Resume loading web-app */
}
catch (SystemExit $e) {}
In the cache logic from the question, replacing exit( $cache_file_contents ) with:
while (@ob_end_flush());
flush();
echo $cache_file_contents;
throw new SystemExit();
This has eliminated the php-fpm slow-log entries that showed hangs on that exit(). I'm not entirely convinced it solved the underlying problem, but it has cleaned up the log files.


How do I intercept the unbuffered output of a Proc::Async in Raku?

With a snippet like
# Contents of ./run
my $p = Proc::Async.new: #*ARGS;
react {
whenever Promise.in: 5 { $p.kill }
whenever $p.stdout { say "OUT: { .chomp }" }
whenever $p.ready { say "PID: $_" }
whenever $p.start { say "Done" }
}
executed like
./run raku -e 'react whenever Supply.interval: 1 { .say }'
I expected to see something like
PID: 1234
OUT: 0
OUT: 1
OUT: 2
OUT: 3
OUT: 4
Done
but instead I see
PID: 1234
OUT: 0
Done
I understand that this has to do with buffering: if I change that command into something like
# The $|++ disables buffering
./run perl -E '$|++; while(1) { state $i; say $i++; sleep 1 }'
I get the desired output.
I know that TTY IO::Handle objects are unbuffered, and that in this case the $*OUT of the spawned process is not one. And I've read that IO::Pipe objects are buffered "so that a write without a read doesn't immediately block" (although I cannot say I entirely understand what this means).
But no matter what I've tried, I cannot get the unbuffered output stream of a Proc::Async. How do I do this?
I've tried binding an open IO::Handle using $proc.bind-stdout but I still get the same issue.
Note that doing something like $proc.bind-stdout: $*OUT does work, in the sense that the Proc::Async object no longer buffers, but it's also not a solution to my problem, because I cannot tap into the output before it goes out. It does suggest to me that if I can bind the Proc::Async to an unbuffered handle, it should do the right thing. But I haven't been able to get that to work either.
For clarification: as suggested with the Perl example, I know I can fix this by disabling the buffering on the command I'll be passing as input, but I'm looking for a way to do this from the side that creates the Proc::Async object.
You can set the .out-buffer of a handle (such as $*OUT or $*ERR) to 0:
$ ./run raku -e '$*OUT.out-buffer = 0; react whenever Supply.interval: 1 { .say }'
PID: 11340
OUT: 0
OUT: 1
OUT: 2
OUT: 3
OUT: 4
Done
Proc::Async itself isn't performing buffering on the received data. However, spawned processes may do their own depending on what they are outputting to, and that's what is being observed here.
Many programs make decisions about their output buffering (among other things, such as whether to emit color codes) based on whether the output handle is attached to a TTY (a terminal). The assumption is that a TTY means a human is watching the output, and thus low latency is preferable to throughput, so buffering is disabled (or restricted to line buffering). If, on the other hand, the output is going to a pipe or a file, then the assumption is that latency is not so important, and buffering is used to achieve a significant throughput win (far fewer system calls to write data).
When we spawn something with Proc::Async, the standard output of the spawned process is bound to a pipe - which is not a TTY. Thus the invoked program may use this to decide to apply output buffering.
If you're willing to have another dependency, then you can invoke the program via something that fakes up a TTY, such as unbuffer (part of the expect package, it seems). Here's an example of a program that is suffering from buffering:
my $proc = Proc::Async.new: 'raku', '-e',
    'react whenever Supply.interval(1) { .say }';
react whenever $proc.stdout {
    .print
}
We only see a 0 and then have to wait a long time for more output. Running it via unbuffer:
my $proc = Proc::Async.new: 'unbuffer', 'raku', '-e',
    'react whenever Supply.interval(1) { .say }';
react whenever $proc.stdout {
    .print
}
Means that we see a number output every second.
Could Raku provide a built-in solution to this some day? Yes - by doing the "magic" that unbuffer itself does (I presume allocating a pty - kind of a fake TTY). This isn't trivial - although it is being explored by the libuv developers; at least so far as Rakudo on MoarVM goes, the moment there's a libuv release available offering such a feature, we'll work on exposing it.

Redis - monitoring maximum memory before inserts fail?

This Q/A does not address the actual issue: how can a client (e.g. redis-py) detect that Redis is running out of memory, constrained not by the machine but by the maxmemory configuration? Which command should the program use to detect that memory is about to be full, before inserts start failing?
My first guess is: call INFO and check whether used_memory_peak < the maxmemory setting. Is this correct?
(As an aside, for running out of actual machine memory, and given fragmentation, which setting would you use? None of the returned INFO fields seem to help there.)
Or should I just attempt an insert and see if it fails (but that would be after the fact)?
Trial and error turned out to be good enough. Testing by running
while true; do redis-cli lpush mm longstringhere; done
ends with maxmemory - used_memory < 0.1 MB and insert failures:
(error) OOM command not allowed when used memory > 'maxmemory'.
So I now poll via the redis-py client, and once the difference drops below a 1 MB threshold I raise an error. Make sure the memory added by your largest command also stays below that threshold, otherwise you will still hit the limit on an insert.
I also wanted to calculate the approximate percentage of used memory so that I get notified much earlier, e.g. at 90% of maxmemory, and for that this approach works fine.
Info dump:
# Memory
used_memory:3126272
used_memory_human:2.98M
used_memory_rss:5292032
used_memory_rss_human:5.05M
used_memory_peak:4914296
used_memory_peak_human:4.69M
used_memory_peak_perc:63.62%
used_memory_overhead:696654...
Furthermore, maxmemory is not a hard cap: usage can still creep up when you keep going, e.g. by adding members to an existing set.
used_memory:3162584
used_memory_human:3.02M
Code to get the percentage (0-100):
rmem_info = pipe.info(section='memory')
{'redis_mem_percent': math.ceil(rmem_info['used_memory'] / rmem_info['maxmemory'] * 100)}
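For completeness, here is a minimal sketch of the polling approach described above using redis-py; the 90% threshold, the poll interval, and raising an exception are my assumptions rather than part of the original answer:
import math
import time

import redis

r = redis.Redis(host="localhost", port=6379)

def redis_mem_percent(client):
    # INFO memory reports used_memory and the configured maxmemory (0 = no limit)
    mem = client.info(section="memory")
    maxmem = mem.get("maxmemory", 0)
    if not maxmem:
        return 0  # no maxmemory configured, nothing to monitor
    return math.ceil(mem["used_memory"] / maxmem * 100)

while True:
    pct = redis_mem_percent(r)
    if pct >= 90:  # assumed threshold: alert well before inserts fail with OOM
        raise RuntimeError("Redis memory at %d%% of maxmemory" % pct)
    time.sleep(5)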

Why is my Ruby script utilizing 90% of my CPU?

I wrote an admin script that tails a Heroku log and, every n seconds, summarizes averages and notifies me if I cross a certain threshold (yes, I know and love New Relic, but I want to do custom stuff).
Here is the entire script.
I have never been a master of IO and threads, so I wonder if I am making a silly mistake. I have a couple of daemon threads running while(true){} loops, which could be the culprit. For example:
# read new lines
f = File.open(file, "r")
f.seek(0, IO::SEEK_END)
while true do
  select([f])
  line = f.gets
  parse_heroku_line(line)
end
I use one daemon to watch for new lines of a log, and the other to periodically summarize.
Does someone see a way to make it less processor-intensive?
This probably runs hot because you never really block while reading from the temporary file. IO::select is a thin layer over POSIX select(2). It looks like you're trying to block until the file is ready for reading, but select(2) considers EOF to be ready ("a file descriptor is also ready on end-of-file"), so you always return right away from select then call gets which returns nil at EOF.
You can get a truer EOF reading and nice blocking behavior by avoiding the thread which writes to the temp file and instead using IO::popen to fork the %x[heroku logs --ps router --tail --app pipewave-cedar] log tailer, connected to a ruby IO object on which you can loop over gets, exiting when gets returns nil (indicating the log tailer finished). gets on the pipe from the tailer will block when there's nothing to read and your script will only run as hot as it takes to do your line parsing and reporting.
EDIT: I'm not set up to actually try your code, but you should be able to replace the log tailer thread and your temp file read loop with this code to get the behavior described above:
IO.popen( %w{ heroku logs --ps router --tail --app my-heroku-app } ) do |logf|
  while line = logf.gets
    parse_heroku_line(line) if line =~ /^/
  end
end
I also notice your reporting thread does not do anything to synchronize access to @total_lines, @total_errors, etc. So you have some minor race conditions where you can get inconsistent values from the instance vars that the parse_heroku_line method updates.
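A minimal sketch of one way to address that, assuming the counters are plain instance variables updated by the parser and read by the reporting thread (the record_line helper is hypothetical):
require 'thread'  # for Mutex on older Rubies; harmless on newer ones

@stats_mutex  = Mutex.new
@total_lines  = 0
@total_errors = 0

# hypothetical helper: serialize updates so the reporting thread never
# sees a half-updated pair of counters
def record_line(error = false)
  @stats_mutex.synchronize do
    @total_lines  += 1
    @total_errors += 1 if error
  end
end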
select is about whether a read would block. f is just a plain old file, so when you get to the end, reads don't block; they just return nil instantly. As a result, select returns instantly rather than waiting for something to be appended to the file, as I assume you're expecting. Because of this you're sitting in a tight busy loop, so high CPU is to be expected.
If you are at EOF (you could either check f.eof? or see whether gets returns nil), then you could either start sleeping (perhaps with some sort of back-off, as sketched below) or use something like listen to be notified of filesystem changes.
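For example, a rough sketch of the sleep-on-EOF variant (the back-off interval here is an arbitrary choice):
# read new lines, backing off briefly at EOF instead of spinning on select
f = File.open(file, "r")
f.seek(0, IO::SEEK_END)

loop do
  line = f.gets
  if line
    parse_heroku_line(line)
  else
    sleep 0.5  # at EOF: wait a little before checking for new data
  end
end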

ServiceController seems to be unable to stop a service

I'm trying to stop a Windows service on a local machine (the service is Topshelf.Host, if that matters) with this code:
serviceController.Stop();
serviceController.WaitForStatus(ServiceControllerStatus.Stopped, timeout);
timeout is set to 1 hour, but the service never actually gets stopped. The strange thing is that in the Services MMC snap-in I see it in the "Stopping" state at first, but after a while it reverts back to "Started". However, when I try to stop it manually, an error occurs:
Windows could not stop the Topshelf.Host service on Local Computer.
Error 1061: The service cannot accept control messages at this time.
Am I missing something here?
I know I am quite late to answer this, but I faced a similar issue, i.e. the error "The service cannot accept control messages at this time.", and would like to add this as a reference for others.
You can try killing this service using PowerShell (run PowerShell as administrator):
# Get the PID of the service from its service name ('service name' below)
$ServicePID = (get-wmiobject win32_service | where { $_.name -eq 'service name'}).processID
# Now, with this PID, you can kill the service's process
taskkill /f /pid $ServicePID
Either your service is busy processing some big operation or it is in transition between states, and hence is not able to accept any more input. Just think of it as having bitten off more than it can chew.
If you are sure that you haven't fed it anything big, just go to Task Manager and kill the process for this service, or restart your machine.
I had the exact same problem with a Topshelf-hosted service. The cause was a long service start time, more than 20 seconds. This left the service in a state where it was unable to process further requests.
I was able to reproduce the problem only when the service was started from the command line (net start my_service).
Proper initialization for a Topshelf service with a long start time looks like the following:
namespace Example.My.Service
{
    using System;
    using System.Threading.Tasks;
    using Topshelf;

    internal class Program
    {
        public static void Main()
        {
            HostFactory.Run(
                x =>
                {
                    x.Service<MyService>(
                        s =>
                        {
                            MyService testServerService = null;
                            s.ConstructUsing(name => testServerService = new MyService());
                            s.WhenStarted(service => service.Start());
                            s.WhenStopped(service => service.Stop());
                            s.AfterStartingService(
                                context =>
                                {
                                    if (testServerService == null)
                                    {
                                        throw new InvalidOperationException("Service not created yet.");
                                    }

                                    testServerService.AfterStart(context);
                                });
                        });
                    x.SetServiceName("my_service");
                });
        }
    }

    public sealed class MyService
    {
        private Task starting;

        public void Start()
        {
            this.starting = Task.Run(() => InitializeService());
        }

        private void InitializeService()
        {
            // TODO: Provide service initialization code.
        }

        [CLSCompliant(false)]
        public void AfterStart(HostControl hostStartedContext)
        {
            if (hostStartedContext == null)
            {
                throw new ArgumentNullException(nameof(hostStartedContext));
            }

            if (this.starting == null)
            {
                throw new InvalidOperationException("Service start was not initiated.");
            }

            while (!this.starting.Wait(TimeSpan.FromSeconds(7)))
            {
                hostStartedContext.RequestAdditionalTime(TimeSpan.FromSeconds(10));
            }
        }

        public void Stop()
        {
            // TODO: Provide service shutdown code.
        }
    }
}
I've seen this issue as well, specifically when a service is in a start-pending state and I send it a stop programmatically, which succeeds but does nothing. Also, sometimes I see stop commands to a running service fail with this same exception but then still actually stop the service. I don't think the API can be trusted to do what it says. This error-message explanation is quite helpful:
http://technet.microsoft.com/en-us/library/cc962384.aspx
I ran into a similar issue and found out it was due to one of the services getting stuck in a start-pending, stop-pending, or stopped state.
Rebooting the server or trying to restart services did not work.
To solve this, I ran Task Manager on the server and, in the "Details" tab, located the stuck services and killed them by ending the task. After ending the task I was able to restart the services without a problem.
In brief:
1. Go to Task Manager.
2. Click on the "Details" tab.
3. Locate your service.
4. Right-click on it and stop/kill the process.
That is it.
I know this was opened a while ago, but the Windows command prompt option is missing, so just for the sake of completeness:
Open Task Manager and find the respective process and its PID, e.g. PID = 111.
Alternatively, you can narrow it down by executable, e.g. image name = notepad.exe.
In the command prompt, use the TASKKILL command.
Example: TASKKILL /F /PID 111 or TASKKILL /F /IM notepad.exe
I had this exact issue internally when starting and stopping a service using PowerShell (via Octopus Deploy). The root cause for the service not responding to messages appeared to be related to devs accessing files/folders within the root service install directory via an SMB connection (looking at a config file with Notepad/Explorer).
If the service gets stuck in that situation, the only option is to kill it and sever the connections using Computer Management. After that, the service could be redeployed fine.
May not be the exact root cause, but something we now check for.
I faced a similar issue. This error sometimes occurs because the service can no longer accept control messages, which may be due to disk-space issues on the server where that particular service's log file lives.
If this occurs, you can consider the below option as well.
Go to the location where the service exe and its log file are located.
Free up some space.
Kill the service's process via Task Manager.
Start the service.
I just fought this problem while moving code from an old multi-partition box to a newer single-partition box. On service stop I was writing to D:, and since that drive didn't exist anymore I got a 1061 error. Any long operation during OnStop will cause this, though, unless you spin the call off to another thread with a callback delegate (see the sketch below).
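As a rough sketch of that idea (illustrative names only, not the original service code), the slow shutdown work can run on a separate task while OnStop keeps the Service Control Manager informed:
using System;
using System.ServiceProcess;
using System.Threading.Tasks;

public class MyService : ServiceBase
{
    protected override void OnStop()
    {
        // run the long cleanup off the SCM callback thread
        Task stopping = Task.Run(() => WriteFinalLogs());

        while (!stopping.Wait(TimeSpan.FromSeconds(7)))
        {
            // keep asking the Service Control Manager for more time so the
            // service is not reported as unable to accept control messages
            RequestAdditionalTime(10000);
        }
    }

    private void WriteFinalLogs()
    {
        // placeholder for the slow shutdown work (flushing logs, writing to disk, etc.)
    }
}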

Rails logging error: "Error during failsafe response: Shifting failed." ... is there an elegant solution to this?

I've configured my Rails 2.3.8 logger in the environment.rb to rotate daily:
config.logger = Logger.new("#{RAILS_ROOT}/logs/#{RAILS_ENV}.log", 'daily')
and every day in the morning I get the usual:
Error during failsafe response: Shifting failed.
Is there a decent/elegant/better solution to this?
What I've done in the past is just set up a cron job to notice when this happens and to drop a Passenger restart.txt file in the app's tmp/ directory.
Thanks.
It's pretty common on UNIX/Linux to use a program named logrotate to perform log file rotation. Slicehost have a couple of nice articles on how to use it.
For a Phusion Passenger deployment you can use a configuration like the example below. Obviously adjust the directories and rotation frequency as appropriate.
/home/deploy/public_html/railsapp/shared/log/*.log {
  weekly
  missingok
  rotate 30
  compress
  delaycompress
  notifempty
  sharedscripts
  postrotate
    touch /home/deploy/public_html/railsapp/current/tmp/restart.txt
  endscript
}
This happens when many requests come in simultaneously and it is time for Rails to rotate the logs. A stream is trying to write to the file (the logger.rb code has a line that reads @dev.stat.size), and when the file does not exist (because it is being rotated) it throws a fatal exception, and basically the server stops responding to requests (it doesn't necessarily shut down, but it bombs out on requests).
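If you hand rotation over to logrotate as above, it also makes sense (an assumption on my part, not stated in the original answers) to drop the built-in daily shifting so Rails never tries to rotate the file itself:
# environment.rb: plain logger with no built-in rotation; logrotate owns rotation
config.logger = Logger.new("#{RAILS_ROOT}/logs/#{RAILS_ENV}.log")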
