Tuesday, October 6, 2009

Gearman and Poison Messages/Jobs

"Gearman provides a generic application framework to farm out work to other machines or processes that are better suited to do the work." (gearman.org)

gearmand, the server, includes an important option that should be enabled for all installations:
-j, --job-retries
This option protects against poison messages/jobs, and it is not enabled by default.
Poison Message: A message in the queue that has reached the threshold of allowable retries.
For Gearman, a job becomes a poison job when workers repeatedly receive it but disconnect from the server without reporting the job's completion, whether success or failure.

A disconnect can occur when the worker fails while processing a job, for example by running out of memory or hitting a "Segmentation fault" when resizing an image, or even because of an error in the Gearman PHP extension.

Now why is this important?

Let's run a scenario:
  1. Worker connects to gearmand
  2. Worker receives job
  3. Worker begins processing job
  4. Worker dies due to an error and disconnects
  5. Gearmand senses disconnect
  6. Gearmand redispatches job to another worker
  7. Lather, rinse, repeat until all workers crash
If you set --job-retries=3, then as soon as the above scenario happens three times, gearmand will delete the job and log the error, saving the rest of your workers from a similar fate.
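To make the effect of the retry cap concrete, here is a minimal Python sketch of the scenario above. It is a toy model, not gearmand's actual implementation: the `dispatch` function, its parameters, and the job names are all made up for illustration, and it assumes a crashed worker stays dead rather than being restarted.

```python
from collections import deque


def dispatch(jobs, worker_count, job_retries=None):
    """Toy model of a job server handing jobs to workers (hypothetical).

    jobs: list of (job_id, is_poison) pairs; a poison job crashes any
          worker that picks it up.
    job_retries: failed attempts allowed before the job is dropped;
                 None means no limit (the default behaviour described above).
    Returns (workers_left, dropped_job_ids, completed_job_ids).
    """
    queue = deque((job_id, is_poison, 0) for job_id, is_poison in jobs)
    workers_left = worker_count
    dropped, completed = [], []

    while queue and workers_left > 0:
        job_id, is_poison, failures = queue.popleft()
        if not is_poison:
            completed.append(job_id)
            continue
        # The worker dies mid-job and disconnects without reporting back.
        workers_left -= 1
        failures += 1
        if job_retries is not None and failures >= job_retries:
            # Retry threshold reached: drop the job instead of redispatching.
            dropped.append(job_id)
        else:
            # Redispatch the same job to the next available worker.
            queue.appendleft((job_id, is_poison, failures))

    return workers_left, dropped, completed


jobs = [("resize-1", True), ("resize-2", False)]
no_limit = dispatch(jobs, worker_count=5)
with_limit = dispatch(jobs, worker_count=5, job_retries=3)
```

In this model, `no_limit` ends with zero workers left and nothing completed, while `with_limit` loses three workers, drops "resize-1", and still completes "resize-2" with the two workers that survive.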

What if I want to handle that deleted job?
(for example, emailing the client that their job failed)

Unfortunately, for now you will have to scan the log for deleted jobs and handle them manually.

However, aligned with Ender Tech's open source commitment, I have submitted a feature request to have the message redispatched to a "Dead Letter Queue" or "Poison Queue":

https://bugs.launchpad.net/gearmand/+bug/442539

This way, you can assign a worker to handle the dead jobs without scanning the log, which allows for fast, robust, and consistent job handling.
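As a rough illustration of what the requested behaviour could look like, here is a small in-memory Python sketch. Everything in it is an assumption made for the example: gearmand does not offer this feature as of this writing, and the `ToyServer` class, the "dead_letter" queue name, and the `notify` callback are hypothetical.

```python
from collections import deque

DEAD_LETTER_QUEUE = "dead_letter"  # hypothetical queue name


class ToyServer:
    """In-memory stand-in for a job server with a dead letter queue."""

    def __init__(self, job_retries):
        self.job_retries = job_retries
        self.queues = {}

    def submit(self, queue_name, payload):
        job = {"origin": queue_name, "payload": payload, "failures": 0}
        self.queues.setdefault(queue_name, deque()).append(job)

    def grab(self, queue_name):
        """Return the next job in the queue, or None if it is empty."""
        queue = self.queues.get(queue_name)
        return queue.popleft() if queue else None

    def worker_disconnected(self, job):
        """Called when a worker vanishes without reporting a result."""
        job["failures"] += 1
        if job["failures"] >= self.job_retries:
            # Threshold reached: move the job aside instead of deleting it.
            target = DEAD_LETTER_QUEUE
        else:
            # Below the threshold: put the job back for another worker.
            target = job["origin"]
        self.queues.setdefault(target, deque()).append(job)


def handle_dead_jobs(server, notify):
    """Drain the dead letter queue, calling notify(origin, payload) per job."""
    handled = 0
    job = server.grab(DEAD_LETTER_QUEUE)
    while job is not None:
        notify(job["origin"], job["payload"])
        handled += 1
        job = server.grab(DEAD_LETTER_QUEUE)
    return handled


server = ToyServer(job_retries=3)
server.submit("resize_image", "photo-17.jpg")
for _ in range(3):
    # Three workers in a row pick up the job and crash.
    server.worker_disconnected(server.grab("resize_image"))

notices = []
handled = handle_dead_jobs(server, lambda origin, payload: notices.append((origin, payload)))
```

Here the `notify` callback only records the failure in a list; in practice it could be the place to email the client that their job failed.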
