
Your Laravel queue workers will die quietly

June 6, 2026 · 2 min read

Here's a failure mode I've now seen on multiple production systems: everything looks fine, deploys are green, the app responds — and somewhere, a queue worker died days ago. Exports hang at 0%. Emails don't send. Nobody gets paged, because nothing is "down."

How it happens

The pattern is almost always the same:

  1. MySQL restarts, or the network blips for thirty seconds.
  2. The worker process throws a connection exception and exits.
  3. Supervisor tries to restart it — but the database is still down for another few seconds, so the restart fails too.
  4. After a handful of rapid failures, Supervisor gives up and marks the process FATAL.
  5. The database comes back. The worker does not. Supervisor does not retry FATAL processes. Ever.

From that moment, your queue is a write-only data structure. Jobs pile up; users see spinners.

The fix is configuration, not code

The core mistake is leaving Supervisor's retry defaults in place. They're tuned for "process has a bug," not "dependency had a blip." What you want:

[program:laravel-queue]
command=php /var/www/app/artisan queue:work --tries=3 --max-time=3600
autostart=true
autorestart=true
startretries=30
startsecs=10
stopwaitsecs=60

The two lines that matter:

  • startretries=30: survive a multi-minute outage instead of giving up after the default 3 retries, which are used up within a few seconds.
  • startsecs=10: a worker that exits within 10s counts as a failed start (the default is 1s), so a genuinely broken deploy burns through its retries and lands in FATAL instead of flapping forever.
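
To sanity-check how long a given startretries value buys you, here is a back-of-the-envelope sketch. It assumes Supervisor waits one second longer before each successive retry (1s, 2s, 3s, ...), which is my reading of its backoff behaviour rather than a documented guarantee. The function name and the seconds_per_attempt parameter are made up for illustration.

```python
def retry_window_seconds(startretries: int, seconds_per_attempt: float = 0.0) -> float:
    """Rough time between the first failed start and the FATAL state.

    Assumption: the wait before retry n is n seconds (linear backoff).
    seconds_per_attempt is how long each doomed start runs before exiting,
    e.g. the time the worker spends failing to connect to the database.
    """
    backoff_total = sum(range(1, startretries + 1))  # 1 + 2 + ... + startretries
    return backoff_total + startretries * seconds_per_attempt


if __name__ == "__main__":
    for retries in (3, 30):
        print(f"startretries={retries}: ~{retry_window_seconds(retries):.0f}s of backoff")
```

Under that assumption, 3 retries cover about 6 seconds of waiting and 30 retries cover about 465 seconds, a bit under eight minutes. Treat the numbers as an estimate and compare them against the longest outage you actually expect.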

The command also passes --max-time=3600 so each worker exits after about an hour and Supervisor starts a fresh one. Long-lived PHP processes accumulate state you don't want (stale config, leaked memory, dropped DB handles).

Detect it anyway

Config reduces the odds; it doesn't make the failure impossible. Two cheap monitors:

  • Heartbeat job: dispatch a trivial job every 5 minutes that touches a timestamp; alert if the timestamp goes stale. This tests the whole path — dispatch, queue, worker, database.
  • Queue depth alert: if the row count of the jobs table exceeds N for more than M minutes, page someone. (This assumes the database queue driver; other drivers need their own depth query.)
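
The decision logic behind both monitors is small enough to sketch. This is a minimal illustration, not a monitoring product: the function names, the 10-minute staleness threshold (two missed 5-minute beats), and the idea that you sample queue depth periodically are all assumptions of mine. How you read the heartbeat timestamp and the queue depth depends on your stack.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_is_stale(last_beat: datetime, now: datetime,
                       max_age: timedelta = timedelta(minutes=10)) -> bool:
    """True if the heartbeat job has not touched its timestamp recently."""
    return now - last_beat > max_age


def depth_alert(samples, threshold: int, min_duration: timedelta) -> bool:
    """True if depth has stayed above threshold for at least min_duration.

    samples is a list of (timestamp, depth) pairs, oldest first. Any sample
    at or below the threshold resets the clock, so a short spike that drains
    on its own does not page anyone.
    """
    breach_start = None
    for ts, depth in samples:
        if depth > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of the current breach
        else:
            breach_start = None  # queue drained, start over
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= min_duration


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(heartbeat_is_stale(now - timedelta(minutes=12), now))
```

The heartbeat check is the more valuable of the two because it fails whenever any link in the chain fails. The depth check mostly tells you how bad the backlog already is.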

And write the recovery runbook before you need it: supervisorctl status, supervisorctl start laravel-queue:*, check storage/logs, requeue stuck exports. At 2am, nobody improvises well.
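
The first runbook step can also be automated. The sketch below scans the text printed by supervisorctl status for processes in a given state. It assumes each line starts with the process name followed by the state, which matches the output I have seen. The function name, the sample text, and the second process name are illustrative only.

```python
def processes_in_state(status_output: str, state: str = "FATAL") -> list[str]:
    """Return names of processes whose second column equals the given state."""
    names = []
    for line in status_output.splitlines():
        parts = line.split()
        # Expected shape: <name> <STATE> <description...>
        if len(parts) >= 2 and parts[1] == state:
            names.append(parts[0])
    return names


# Illustrative sample, not captured from a real system.
SAMPLE = """\
laravel-queue                    FATAL     Exited too quickly (process log may have details)
other-worker                     RUNNING   pid 4242, uptime 1:02:03
"""

if __name__ == "__main__":
    print(processes_in_state(SAMPLE))
```

In practice you would feed it the real command output (for example via subprocess.run with capture_output=True) and alert on any non-empty result, so a FATAL worker gets noticed in minutes rather than days.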

The meta-lesson

"It works" and "it keeps working" are different engineering problems. The first one is the demo. The second one is the job.