Apache2 - error running shared postrotate script

So something happened not too long ago such that in one of our Xen development environments (running Debian etch), apache was bailing every Sunday morning at a conspicuously consistent time.

This fairly obviously pointed to something going on inside of cron with the likely suspect being log rotation. Even more of a no-brainer given cron’s weekly complaints:
/etc/cron.daily/logrotate:
apache2: Could not reliably determine the server's fully qualified domain name, using 10.1.4.24 for ServerName
apache2: Could not reliably determine the server's fully qualified domain name, using 10.1.4.24 for ServerName
(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down
Unable to open logs
error: error running shared postrotate script for /var/log/apache2/*.log
run-parts: /etc/cron.daily/logrotate exited with return code 1

So a bit of googling on this brought up surprisingly few good explanations as to exactly why it was going on. There are, of course, a lot of voodoo, crap explanations all over discussion boards, but little in the way of concrete evidence. Said voodoo even mentioned in this blog post as deadends describing the exact same thing I was experiencing. It wasn’t too long before I stumbled across Debian Bug report #301702 with a host of duplicates all describing the same problem but with little solidfying something such as a repro case, so I set out to do just that - repro the problem. Here’s how I did it.

First of all, cron quickly points to the following logrotate configuration (in package apache2.2-common or some variant) as a good starting point for investigation:
/var/log/apache2/*.log {
weekly
missingok
rotate 52
compress
delaycompress
notifempty
create 640 root adm
sharedscripts
postrotate
if [ -f /var/run/apache2.pid ]; then
/etc/init.d/apache2 restart > /dev/null
fi
endscript
}

But what on earth is going on such that Apache bails on a restart? In all of the other development and test environments we use, including production, this was only happening in a single one. The answer turns out to be that the problem is load related, or at least that’s how I was able to repro the problem. Slow apache children in general, however, are the cause of the problem.

I could not get this to repro simply by running the logrotate cron tasks as they would run normally (except for the –force, of course):
$ sudo /usr/sbin/logrotate --verbose --force /etc/logrotate.conf
or
$ sudo /usr/sbin/logrotate --verbose --force /etc/logrotate.d/apache2
In the above cases, apache would restart with no problems. So I introduced some load onto the machine. I did this basically by running updatedb (because this is also cron’d to run at the same time, in my case) as well as forking off a couple of while(1){}’s to use a bit of cpu.

With the introduction of load, I could repro the failure of apache to start during the logrotate, but I still didn’t understand what was going on with the log files. I had tunnel vision on the “Unable to open logs” red herring from the original cron complaint. Turns out the real problem is just a slow child and arguably poor behavior of the apache2 init.d scripts as well as the etch logrotate.d/apache2 configuration posted above.

In the repro scenario, I noticed apache complaining about slow children just before SIGKILLing them:
[Wed Jan 02 21:19:55 2008] [notice] Apache/2.2.3 configured -- resuming normal operations
[Wed Jan 02 21:20:15 2008] [warn] child process 18762 still did not exit, sending a SIGTERM
[Wed Jan 02 21:20:19 2008] [warn] child process 18766 still did not exit, sending a SIGTERM
... snip ...
[Wed Jan 02 21:20:21 2008] [error] child process 18765 still did not exit, sending a SIGKILL
[Wed Jan 02 21:20:22 2008] [error] could not make child process 18765 exit, attempting to continue anyway
[Wed Jan 02 21:20:22 2008] [notice] caught SIGTERM, shutting down

The introduction of load was causing children to be delayed in exiting. The /etc/init.d/apache2 scripts in etch attempt to kill the main apache2 process and once it is ascertained that it can no longer receive a signal (via kill -0, such as if the process is already gone), the init scripts continue on their way and attempt to start apache back up. At this point, they assume that apache and all of its children are gone, which is not necessarily the case. This is where the real problem lies. Long story short, those children which haven’t exited quite yet are still holding onto their sockets. As you can see in a trivial example where you try to run apache when it’s already running:
rwoodrum@slard:~$ sudo apache2
(98)Address already in use: make_sock: could not bind to address [::]:80
(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80
no listening sockets available, shutting down
Unable to open logs

There’s that pesky “Unable to open logs” message that was throwing my feeble brain off course.

The solution? A couple of things…

  • In Debian lenny/sid+, the logrotate.d/apache2 package config no longer calls /etc/init.d/apache2 restart and instead uses the graceful restart by way of /etc/init.d/apache2 reload. Apache is able to function much more cleanly in this fashion by holding onto its connections, not bailing on errors, etc.
  • In Debian lenny/sid+ the /etc/init.d/apache2 init scripts have been modified (hacked?) to add a long sleep time between the stop and start. Certainly not foolproof, but apache2 really should be able to exit in 10 whole seconds.

So now I have satisifed my who, what, when, where, and why.

Comments

One Response to “Apache2 - error running shared postrotate script”

  1. Rob Groen on October 20th, 2009 1:17 am

    This fixed my problem.

    Thanks for the solution. This was the only posting I could find with a good solution.

Leave a Reply