Job Errors

Jobs are software and will fail for many reasons: bugs, network issues, etc. Faktory recognizes this and provides automatic error handling for all jobs by default.

The Faktory worker process fetches a job and executes it.

If the job does not raise an error, it is considered a success. The worker will ACK it to report success.
If the job does raise an error, the worker will send FAIL with error information to Faktory. This kicks off the error process.
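
The success and failure paths above can be sketched as a small helper that runs a job and builds the line a worker would send back. This is a minimal illustration, not the code of any real Faktory client: `result_command` and the handler are hypothetical names, and the FAIL fields (`errtype`, `message`, `backtrace`) are assumed here rather than taken from this page.

```python
import json
import traceback


def result_command(job, handler):
    """Run handler on the job's args and return the command line to send."""
    try:
        handler(*job["args"])
    except Exception as exc:
        # Failure path: report the error details along with the job id.
        # The field names below are assumptions for illustration.
        payload = {
            "jid": job["jid"],
            "errtype": type(exc).__name__,
            "message": str(exc),
            "backtrace": traceback.format_exc().splitlines(),
        }
        return "FAIL " + json.dumps(payload)
    # Success path: acknowledge the job by id.
    return "ACK " + json.dumps({"jid": job["jid"]})
```

Any exception raised by the handler becomes a FAIL; a normal return becomes an ACK.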

The Process

Faktory provides retries with exponential backoff: it will retry the job N times, waiting longer before each successive retry. By default Faktory will retry a job 25 times, which spreads the retries over about 21 days. In other words, if the failure is caused by a bug in your software, you have three weeks to deploy a fix. Once you deploy the fix, the job executes successfully and everyone is happy.

The wait formula is:

15 + count ^ 4 + (rand(30) * (count + 1))

15 establishes a minimum wait time.
count^4 is the backoff term, which grows with each attempt: it is 0 when count is 0 (the first retry) and 20^4 (160,000 sec, or about two days) when count is 20.
rand(30) gives us a random "smear". Sometimes people enqueue 1000s of jobs at one time, which all fail for the same reason. This ensures we don't retry 1000s of jobs all at the exact same time and take down a system.
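
To make the formula concrete, here is a minimal sketch that evaluates it and adds up the total wait over the default 25 retries. It assumes that rand(30) means a random integer from 0 to 29 and that count runs from 0 to 24; the function names are illustrative only.

```python
import random


def retry_delay(count, rng=random):
    """Seconds to wait before the next retry, following the wait formula."""
    # randrange(30) yields an integer in 0..29 (assumed meaning of rand(30)).
    return 15 + count ** 4 + rng.randrange(30) * (count + 1)


def total_wait_bounds(retries=25):
    """Smallest and largest possible total wait, in seconds, over all retries."""
    low = sum(15 + c ** 4 for c in range(retries))
    high = low + sum(29 * (c + 1) for c in range(retries))
    return low, high
```

Under these assumptions, `total_wait_bounds()` gives a total between 1,763,395 and 1,772,820 seconds, or roughly 20.4 to 20.5 days, which is consistent with the "three weeks" figure above.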

Job Death

After retrying N times, Faktory assumes the job will continue to fail forever and will stop retrying. It moves the job into the Dead Set. Jobs in the Dead Set are not touched by Faktory but can be manually executed from the Web UI. If you have a fix which takes a while to develop, you can trigger a retry after deploying the fix.

FAQ

How do I configure the number of retries?

Set "retry": 6 in the job payload, where 6 is the chosen retry count. After that count, the job will go to the Dead Set as normal.
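
As an illustration, a job payload with a custom retry count might be built like this. The `build_job` helper and the `SendEmail` job type are hypothetical, and the payload is reduced to a few fields (`jid`, `jobtype`, `args`) assumed for the example.

```python
import json
import uuid


def build_job(jobtype, args, retry=None):
    """Build a job payload; leave "retry" out to get the default retry count."""
    job = {"jid": uuid.uuid4().hex, "jobtype": jobtype, "args": list(args)}
    if retry is not None:
        job["retry"] = retry
    return job


# Serialize a job that should be retried at most 6 times.
print(json.dumps(build_job("SendEmail", [42], retry=6)))
```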

How do I disable retry completely?

Set "retry": 0 in the job payload. The job will be discarded if it fails. Set "retry": -1 if you want failed jobs to be saved to the Dead Set.
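
The retry settings described in this FAQ can be summarized as a small decision function. This is a sketch of the behavior as this page describes it, not Faktory's actual implementation; in particular, the exact counting (the job dies on the failure after its last allowed retry) is an assumption.

```python
DEFAULT_RETRIES = 25


def after_failure(retry, failures):
    """Decide what happens to a job that has just failed.

    retry:    the job's "retry" value, or None if the payload omits it
    failures: how many times the job has failed so far, including this one
    """
    if retry is None:
        retry = DEFAULT_RETRIES
    if retry == 0:
        return "discard"  # no retries; the failed job is dropped
    if retry == -1:
        return "dead"     # no retries; the failed job goes to the Dead Set
    if failures > retry:
        return "dead"     # all retries used up (assumed counting)
    return "retry"
```

For example, with "retry": 6 the first six failures each schedule a retry, and the seventh failure sends the job to the Dead Set.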

Do worker crashes trigger retries?

Yes. Faktory re-enqueues any job left unfinished by a worker crash once the job's reservation times out. This is treated identically to a FAIL.

