Heartbeat Mechanism
The heartbeat mechanism ensures jobs aren’t lost when a worker crashes. While processing, Monque periodically updates a timestamp on claimed jobs. On startup, Monque can recover jobs that were left stuck in processing.
The Problem
When a worker crashes mid-processing:
- Job remains in processing status
- claimedBy still points to crashed instance
- No other worker can pick it up
- Job is effectively lost
The Solution: Heartbeats + Stale Recovery
While a job is processing, Monque periodically updates lastHeartbeat to signal liveness. Separately, Monque can recover jobs that have been processing for longer than lockTimeout by resetting them back to pending on startup.
It helps to think of these fields as serving different purposes:
- lockedAt: Used for stale recovery. lockTimeout is an absolute limit on how long a job may remain locked/processing before it is considered stale and eligible for recovery.
- lastHeartbeat: Used for monitoring and debugging. It lets you confirm a worker is still actively updating jobs, but it is not what the startup recovery logic uses to decide staleness.
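The distinction between the two fields can be sketched as a small predicate. This is a minimal illustration, not Monque's implementation: the `ClaimedJob` interface and the `isStale` function are hypothetical names, and only the field names come from this page.

```typescript
// Hypothetical shape: only the fields discussed above, not Monque's full job type.
interface ClaimedJob {
  state: "pending" | "processing";
  lockedAt: Date | null;
  lastHeartbeat: Date | null;
}

// A job is stale when it has been locked for longer than lockTimeout.
// lastHeartbeat is deliberately ignored, mirroring the description above.
function isStale(job: ClaimedJob, lockTimeoutMs: number, now: Date): boolean {
  if (job.state !== "processing" || job.lockedAt === null) return false;
  return now.getTime() - job.lockedAt.getTime() > lockTimeoutMs;
}

const checkTime = new Date("2024-01-01T01:00:00Z");
const exampleJob: ClaimedJob = {
  state: "processing",
  lockedAt: new Date("2024-01-01T00:00:00Z"), // locked 60 minutes ago
  lastHeartbeat: new Date("2024-01-01T00:59:45Z"), // heartbeat 15 seconds ago
};

// Stale under a 30-minute lockTimeout even though the heartbeat is fresh.
const staleAt30Min = isStale(exampleJob, 30 * 60 * 1000, checkTime);
// Not stale under a 2-hour lockTimeout.
const staleAt2Hours = isStale(exampleJob, 2 * 60 * 60 * 1000, checkTime);
```

The example job has a recent heartbeat but is still considered stale at the 30-minute limit, which is the behavior the two bullets describe.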
How It Works
sequenceDiagram
participant W as Worker
participant DB as MongoDB
participant S as New Worker (startup)
W->>DB: Claim job (set lastHeartbeat)
loop Every 30 seconds
W->>DB: Update lastHeartbeat
end
Note over W: Worker crashes!
S->>DB: initialize(): Find jobs with old lockedAt
S->>DB: Reset to pending
Note over DB: Job available for retry
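The sequence above can be replayed against an in-memory job to make the timing concrete. This is a rough simulation under stated assumptions: `SimJob`, `claim`, `heartbeat`, and `recoverStale` are hypothetical helpers, timestamps are plain millisecond numbers, and no database is involved.

```typescript
// In-memory stand-in for a job document; field names follow this page.
interface SimJob {
  id: string;
  state: "pending" | "processing";
  claimedBy?: string;
  lockedAt?: number; // millisecond timestamps keep the arithmetic obvious
  lastHeartbeat?: number;
}

const LOCK_TIMEOUT_MS = 30 * 60 * 1000;

function claim(job: SimJob, worker: string, nowMs: number): void {
  job.state = "processing";
  job.claimedBy = worker;
  job.lockedAt = nowMs;
  job.lastHeartbeat = nowMs;
}

// Only the owning worker may refresh lastHeartbeat; lockedAt is never touched.
function heartbeat(job: SimJob, worker: string, nowMs: number): void {
  if (job.claimedBy === worker && job.state === "processing") {
    job.lastHeartbeat = nowMs;
  }
}

// What a new worker does on startup: reset jobs whose lock is older than the timeout.
function recoverStale(jobs: SimJob[], nowMs: number): number {
  let recovered = 0;
  for (const job of jobs) {
    const expired = job.lockedAt !== undefined && nowMs - job.lockedAt > LOCK_TIMEOUT_MS;
    if (job.state === "processing" && expired) {
      job.state = "pending";
      delete job.claimedBy;
      delete job.lockedAt;
      delete job.lastHeartbeat;
      recovered++;
    }
  }
  return recovered;
}

const simJob: SimJob = { id: "job-1", state: "pending" };
claim(simJob, "W", 0);
heartbeat(simJob, "W", 30000);
heartbeat(simJob, "W", 60000);
// Worker W crashes here, so no further heartbeats arrive.
const recoveredEarly = recoverStale([simJob], 10 * 60 * 1000); // lock still valid
const recoveredLate = recoverStale([simJob], 31 * 60 * 1000); // lock expired
```

A startup 10 minutes after the claim recovers nothing, while a startup 31 minutes after the claim resets the job to pending.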
Configuration
Options Explained
| Option | Default | Description |
|---|---|---|
| heartbeatInterval | 30000 (30s) | How often Monque updates lastHeartbeat while processing (monitoring/debugging) |
| lockTimeout | 1800000 (30min) | Maximum time since a job was claimed (lockedAt) before it is considered stale (absolute duration limit, not a heartbeat timeout) |
| recoverStaleJobs | true | Whether to recover stale jobs on startup |
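The options and defaults in the table can be written down as a typed object. This is a sketch only: the `HeartbeatOptions` interface and `resolveHeartbeatOptions` helper are hypothetical, and how the options are actually passed to Monque is not shown on this page, so check the constructor documentation before copying.

```typescript
// Hypothetical option shape; it covers only the three options from the table above.
interface HeartbeatOptions {
  heartbeatInterval: number; // ms between lastHeartbeat updates
  lockTimeout: number; // ms since lockedAt before a job counts as stale
  recoverStaleJobs: boolean; // recover stale jobs during initialize()
}

// Defaults as listed in the table.
const DEFAULT_HEARTBEAT_OPTIONS: HeartbeatOptions = {
  heartbeatInterval: 30000,
  lockTimeout: 1800000,
  recoverStaleJobs: true,
};

// Merge user overrides over the defaults.
function resolveHeartbeatOptions(
  overrides: Partial<HeartbeatOptions> = {},
): HeartbeatOptions {
  return { ...DEFAULT_HEARTBEAT_OPTIONS, ...overrides };
}

// Example: a workload whose jobs can legitimately run for up to two hours.
const longJobOptions = resolveHeartbeatOptions({ lockTimeout: 2 * 60 * 60 * 1000 });
```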
Heartbeat Lifecycle
1. Job Claimed
When a worker claims a job:
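A minimal sketch of what the claim might write, expressed as a MongoDB-style filter and update pair. `buildClaimOperation` is a hypothetical helper, and the filter is simplified: the real claim very likely matches on more than status. Only the field names are taken from this page.

```typescript
// Hypothetical builder for the claim operation; field names come from this page.
function buildClaimOperation(instanceId: string, now: Date) {
  return {
    // Simplified: only a pending job can be claimed.
    filter: { status: "pending" },
    update: {
      $set: {
        status: "processing",
        claimedBy: instanceId,
        lockedAt: now, // starts the lockTimeout clock
        lastHeartbeat: now, // first liveness signal
      },
    },
  };
}

const claimOp = buildClaimOperation("worker-a", new Date("2024-01-01T00:00:00Z"));
```

A pair like this is the shape a driver call such as `findOneAndUpdate(filter, update)` expects, which keeps the claim atomic per document.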
2. During Processing
Every heartbeatInterval milliseconds:
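One way to picture a heartbeat tick is as an update that touches only lastHeartbeat on jobs owned by the current instance. Both `buildHeartbeatOperation` and `startHeartbeat` are hypothetical helpers written for illustration; Monque runs its own timer internally.

```typescript
// Hypothetical builder for one heartbeat tick.
function buildHeartbeatOperation(instanceId: string, now: Date) {
  return {
    // Touch only jobs this instance owns and is still processing.
    filter: { claimedBy: instanceId, status: "processing" },
    // lockedAt is left alone, so heartbeats never extend lockTimeout.
    update: { $set: { lastHeartbeat: now } },
  };
}

// Minimal timer wrapper: runs `tick` every intervalMs until the returned function is called.
function startHeartbeat(tick: () => void, intervalMs: number): () => void {
  const timer = setInterval(tick, intervalMs);
  return () => clearInterval(timer);
}

const heartbeatOp = buildHeartbeatOperation("worker-a", new Date("2024-01-01T00:00:30Z"));
```

Because the update never writes lockedAt, a healthy heartbeat does not postpone stale recovery; lockTimeout remains an absolute limit.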
3. Job Completion
When a job completes or fails:
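This page does not spell out what is written at this point, so the following is an assumption rather than Monque's documented behavior: the final status is recorded and the claim fields are cleared so that heartbeats stop applying to the job. `buildReleaseOperation`, the string job id, and the status value passed in are all hypothetical.

```typescript
// Assumption: finishing a job records its final status and clears the claim fields.
function buildReleaseOperation(jobId: string, instanceId: string, finalStatus: string) {
  return {
    // Scope the write to the owning instance so a recovered job is not overwritten.
    filter: { _id: jobId, claimedBy: instanceId },
    update: {
      $set: { status: finalStatus },
      // Assumed cleanup; $unset ignores the values, only the keys matter.
      $unset: { claimedBy: "", lockedAt: "", lastHeartbeat: "" },
    },
  };
}

// "completed" is an illustrative status name, not confirmed by this page.
const releaseOp = buildReleaseOperation("job-1", "worker-a", "completed");
```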
Stale Job Recovery
On Startup
During initialize() (when recoverStaleJobs: true), Monque recovers jobs that were left stuck in processing:
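The recovery step can be sketched as a filter on status (equality) and lockedAt (range), plus an update that resets the job to pending. `buildStaleRecoveryOperation` is a hypothetical helper, and clearing the claim fields with `$unset` is an assumption; this page only states that the job is reset to pending.

```typescript
// Hypothetical builder for the startup recovery operation.
function buildStaleRecoveryOperation(lockTimeoutMs: number, now: Date) {
  // Any lock taken before this instant has exceeded lockTimeout.
  const staleBefore = new Date(now.getTime() - lockTimeoutMs);
  return {
    // Equality on status, range on lockedAt; lastHeartbeat plays no part.
    filter: { status: "processing", lockedAt: { $lt: staleBefore } },
    update: {
      $set: { status: "pending" },
      // Assumption: claim fields are cleared so any worker can claim the job again.
      $unset: { claimedBy: "", lockedAt: "", lastHeartbeat: "" },
    },
  };
}

// With the default 30-minute lockTimeout, a startup at 01:00 recovers locks older than 00:30.
const recoveryOp = buildStaleRecoveryOperation(1800000, new Date("2024-01-01T01:00:00Z"));
```

A pair like this would typically be applied with a bulk call such as `updateMany(filter, update)`, whose result reports how many jobs were reset.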
Monitoring Recovery
Disabling Stale Recovery
In some cases, you may want manual control:
Best Practices
1. Tune Timeouts for Your Workload
2. Monitoring Heartbeats
Monque manages heartbeats automatically in the background. If a heartbeat update fails (e.g., database connectivity issues), Monque emits a job:error event.
Note that stale recovery is based on lockedAt + lockTimeout; lastHeartbeat is primarily an observability signal to verify worker liveness.
To verify heartbeats are updating normally during debugging, you can check the job status using the public API:
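How the job document is fetched is not reproduced here. Assuming you already have it in hand, a small helper can turn lastHeartbeat into an age you can log or alert on. `JobSnapshot`, `heartbeatAgeMs`, and `looksUnresponsive` are hypothetical names, and the threshold of two intervals is an arbitrary choice for illustration.

```typescript
// Hypothetical subset of a job document, limited to fields named on this page.
interface JobSnapshot {
  status: string;
  claimedBy?: string;
  lastHeartbeat?: Date;
}

// Milliseconds since the last heartbeat, or null if the job is not being processed.
function heartbeatAgeMs(job: JobSnapshot, now: Date): number | null {
  if (job.status !== "processing" || !job.lastHeartbeat) return null;
  return now.getTime() - job.lastHeartbeat.getTime();
}

// Arbitrary rule of thumb: more than two missed intervals suggests the worker stopped updating.
function looksUnresponsive(job: JobSnapshot, heartbeatIntervalMs: number, now: Date): boolean {
  const age = heartbeatAgeMs(job, now);
  return age !== null && age > 2 * heartbeatIntervalMs;
}

const checkedAt = new Date("2024-01-01T00:10:00Z");
const snapshot: JobSnapshot = {
  status: "processing",
  claimedBy: "worker-a",
  lastHeartbeat: new Date("2024-01-01T00:09:15Z"), // 45 seconds before the check
};
const snapshotAge = heartbeatAgeMs(snapshot, checkedAt);
const snapshotUnresponsive = looksUnresponsive(snapshot, 30000, checkedAt);
```

This is purely an observability aid: a job flagged here is not recovered any sooner, because recovery depends on lockedAt and lockTimeout.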
3. Handle Long-Running Jobs
For jobs that may exceed lockTimeout:
4. Log Stale Recoveries
Troubleshooting
Jobs Recovered Too Aggressively
Symptom: Jobs in progress are being reset
Solution: Increase lockTimeout:
Jobs Never Recovered
Symptom: Stale jobs remain stuck
Checks:
- Verify recoverStaleJobs: true
- Check lockTimeout isn’t too high
- Ensure initialize() was called
High Stale Recovery Count
Symptom: Many jobs recovered on each startup
Possible Causes:
- Frequent worker crashes
- Jobs taking longer than lockTimeout
- Network instability causing heartbeat failures
Investigation:
Index Support
Monque creates indexes to support both recovery and observability queries:
The stale recovery query shown in this document filters by status (equality) and lockedAt (range), so the recovery index is ordered to start with { status: 1, lockedAt: 1 }. The lastHeartbeat suffix mainly supports additional monitoring/debugging access patterns (and leaves room for recovery strategies that also consider heartbeat age), rather than implying that lockTimeout is a heartbeat timeout.
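The key order described above can be written as an index specification. Monque creates its indexes itself, so this is illustrative only: `IndexableCollection`, `RECOVERY_INDEX_KEYS`, and `ensureRecoveryIndex` are hypothetical names, and the structural interface stands in for a driver collection so that no import is needed.

```typescript
// Structural type covering the one driver method used here.
interface IndexableCollection {
  createIndex(keys: Record<string, 1 | -1>): Promise<string>;
}

// Key order follows the paragraph above: equality field, range field, then the suffix.
const RECOVERY_INDEX_KEYS: Record<string, 1 | -1> = {
  status: 1,
  lockedAt: 1,
  lastHeartbeat: 1,
};

// Resolves to the index name reported by the collection.
async function ensureRecoveryIndex(collection: IndexableCollection): Promise<string> {
  return collection.createIndex(RECOVERY_INDEX_KEYS);
}
```

Placing the equality field before the range field lets the recovery query narrow to processing jobs first and then scan a contiguous lockedAt range.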
Multi-Instance Behavior
Each scheduler instance:
- Maintains its own heartbeat interval
- Only updates heartbeats for jobs it owns (claimedBy: thisInstanceId)
- Can recover stale jobs from any crashed instance
Next Steps
- Atomic Claim Pattern - How job claiming works
- Change Streams - Real-time notifications
- Graceful Shutdown - Clean worker termination