Here's a failure mode I've now seen on multiple production systems: everything looks fine, deploys are green, the app responds — and somewhere, a queue worker died days ago. Exports hang at 0%. Emails don't send. Nobody gets paged, because nothing is "down."
How it happens
The pattern is almost always the same:
- MySQL restarts, or the network blips for thirty seconds.
- The worker process throws a connection exception and exits.
- Supervisor tries to restart it — but the database is still down for another few seconds, so the restart fails too.
- After a handful of rapid failures, Supervisor gives up and marks the process FATAL.
- The database comes back. The worker does not. Supervisor does not retry FATAL processes. Ever.
From that moment, your queue is a write-only data structure. Jobs pile up; users see spinners.
The fix is configuration, not code
The core mistake is leaving Supervisor's retry defaults in place. They're tuned for "process has a bug," not "dependency had a blip." What you want:
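[program:laravel-queue]
command=php /var/www/app/artisan queue:work --tries=3 --max-time=3600
autostart=true        ; start the worker when supervisord starts
autorestart=true      ; restart the worker whenever it exits
startretries=30       ; Supervisor's default is 3
startsecs=10          ; must stay up 10s to count as a successful start
stopwaitsecs=60       ; grace period on stop before the worker is killed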
The two lines that matter:
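- startretries=30: survive a multi-minute outage instead of giving up after the default 3 retries, which are exhausted within a few seconds. Supervisor waits a little longer before each retry, so 30 of them span several minutes.
- startsecs=10: a worker that exits within 10s counts as a failed start, which keeps genuinely broken deploys from flapping forever.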
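To put rough numbers on that, here is a back-of-the-envelope model of how long Supervisor keeps trying before it marks a process FATAL. It assumes the wait before each retry grows by one second (my understanding of Supervisor's BACKOFF behavior, not something measured here) and that a failing worker exits almost immediately. `retry_window_seconds` is a hypothetical helper, not part of Supervisor.

```python
def retry_window_seconds(startretries: int, failed_run_seconds: float = 0.0) -> float:
    """Approximate time from the first failed start until Supervisor gives up.

    Assumptions: the wait before retry k is k seconds (linear backoff), and
    every failed attempt runs for `failed_run_seconds` before it exits.
    """
    if startretries < 0:
        raise ValueError("startretries must be non-negative")
    # Sum of waits 1 + 2 + ... + startretries.
    backoff_total = startretries * (startretries + 1) / 2
    # The initial attempt plus each retry spends some time running before it dies.
    run_total = (startretries + 1) * failed_run_seconds
    return backoff_total + run_total


if __name__ == "__main__":
    for retries in (3, 30):
        window = retry_window_seconds(retries)
        print(f"startretries={retries}: ~{window:.0f}s before FATAL")
```

Under those assumptions, 3 retries cover about 6 seconds of outage, while 30 cover close to 8 minutes. Treat the figures as an order-of-magnitude guide and check them against your own Supervisor version.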
The --max-time=3600 flag in the command makes each worker exit after an hour, so Supervisor starts a fresh one. Long-lived PHP processes accumulate state you don't want (stale config, leaked memory, dropped DB handles).
Detect it anyway
Config reduces the odds; it doesn't make the failure impossible. Two cheap monitors:
- Heartbeat job: dispatch a trivial job every 5 minutes that touches a timestamp; alert if the timestamp goes stale. This tests the whole path — dispatch, queue, worker, database.
- Queue depth alert: if the jobs table count exceeds N for more than M minutes, page someone.
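Both monitors reduce to a few lines of decision logic. The sketch below shows only that logic. It assumes you already have some way to read the heartbeat timestamp (wherever the heartbeat job writes it) and the current depth (for example a row count on the jobs table). `heartbeat_is_stale`, `QueueDepthAlert`, and the thresholds are hypothetical names and values chosen for illustration.

```python
import time
from typing import Optional


def heartbeat_is_stale(last_beat: float, now: float, max_age_seconds: float = 600.0) -> bool:
    """True when the last heartbeat is older than max_age_seconds.

    With a heartbeat job every 5 minutes, 10 minutes of silence means at
    least one beat was missed.
    """
    return (now - last_beat) > max_age_seconds


class QueueDepthAlert:
    """Fires only when depth stays above `threshold` for `sustain_seconds`."""

    def __init__(self, threshold: int, sustain_seconds: float) -> None:
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self._over_since: Optional[float] = None

    def observe(self, depth: int, now: float) -> bool:
        """Record one depth sample; return True if someone should be paged."""
        if depth <= self.threshold:
            self._over_since = None  # backlog drained, reset the clock
            return False
        if self._over_since is None:
            self._over_since = now  # first sample above the threshold
        return (now - self._over_since) >= self.sustain_seconds


if __name__ == "__main__":
    now = time.time()
    # Pretend the heartbeat last fired 15 minutes ago.
    print("heartbeat stale:", heartbeat_is_stale(now - 900, now))

    alert = QueueDepthAlert(threshold=100, sustain_seconds=300)
    # Three samples, 3 minutes apart, all well above the threshold.
    for offset in (0, 180, 360):
        print(offset, alert.observe(depth=500, now=now + offset))
```

Requiring the backlog to persist, rather than alerting on a single high reading, keeps a short burst of exports from paging anyone.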
And write the recovery runbook before you need it: supervisorctl status, supervisorctl start laravel-queue:*, check storage/logs, requeue stuck exports. At 2am, nobody improvises well.
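The first runbook step can also run unattended. This is a minimal sketch that scans the text printed by supervisorctl status for FATAL processes, assuming the usual column layout of name, state, then description. `fatal_processes` is a hypothetical helper and the sample lines are illustrative, not captured from a real system.

```python
from typing import List


def fatal_processes(status_output: str) -> List[str]:
    """Return the names of processes whose state column reads FATAL."""
    names = []
    for line in status_output.splitlines():
        fields = line.split()
        # Expected shape: <name> <STATE> <description...>
        if len(fields) >= 2 and fields[1] == "FATAL":
            names.append(fields[0])
    return names


if __name__ == "__main__":
    # Illustrative sample in the shape supervisorctl status prints.
    sample = (
        "laravel-queue:laravel-queue_00   RUNNING   pid 4121, uptime 2:14:07\n"
        "laravel-queue:laravel-queue_01   FATAL     Exited too quickly (process log may have details)\n"
    )
    print(fatal_processes(sample))
```

In a real check you would feed it the command's output, for example via `subprocess.run(["supervisorctl", "status"], capture_output=True, text=True).stdout`, and alert whenever the returned list is non-empty.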
The meta-lesson
"It works" and "it keeps working" are different engineering problems. The first one is the demo. The second one is the job.