Skip to content

Dealing With Steps Running Longer than 24 Hours per Honeybadger

Andrew Berger edited this page Dec 1, 2022 · 8 revisions

e.g. https://app.honeybadger.io/projects/58890/faults/64252829#notice-summary

This usually happens for preservationIngestWF transfer-object step or for accessionWF publish step.

These can be caused by

  • A. a step that is taking a loooooooong time to execute, per workflow server
  • B. the step is restarted over and over again
  • C. workflow service never received the "completed" signal (or other indication that the step has stopped actively processing).

A.the step is still running

How do we know we are in this situation?

  • the step is currently running
    • check via https://robot-console-prod.stanford.edu/ -- is there an active worker for the step running the druid? Go to the appropriate queue and determine if any of the running jobs are for the druid in question.

How do we address this situation?

  • wait if the step is still running (e.g. is the transfer-object step still moving data?)
    • steps that involve reading or writing files can take a very long time for druids with large files. You can look at the "preservation size" for the druid in Argo to see how large it is. If the total size is small and the step is still running, that's a sign that something may be wrong.
  • if the step has stalled, stop it and re-enqueue it
    • NOTE: this is dangerous for transfer-object as material to be preserved may have been partially removed from old location. Ask sdr-manager for help.
    • stop it by killing the forked resque worker running the job, and then enqueuing it. (See instructions for enqueuing workflow-step in C. below)

B. The step is restarted over and over again

How do we know we are in this situation?

Look in the sidekiq console on the server where the step runs (dor-services-app or the techmd server for technical-metadata) and see if the step is repeating. See if there are associated errors in Honeybadger that would cause the job to repeat.

How do we address this situation?

Fix the problem that is causing the job to fail and let sidekiq retry again.

C. Workflow service never received the "completed" signal (or other indication that the step has stopped actively processing).

This can happen for a few reasons:

  1. Most common: when the robot can't reach the workflow server to signal the step completed. Maybe there was a network outage or some other issue along those lines.

Solution: enqueue the workflow step afresh.

  1. Less common in prod, more common in stage/QA: a change was made to a workflow definition where a previously-required step was deprecated and removed from the workflow. Any druid that happened to have a deprecated step in progress (stalled or otherwise) can no longer complete the deprecated steps.

Solution: find the deprecated steps by looking up the druid in the workflow database and set them to "skipped".

How do we know we are in this situation?

  • the step is not currently running
    • check robot logs
    • check via https://robot-console-prod.stanford.edu/ -- is there an active worker for the step running the druid? Go to the appropriate queue and determine if any of the running jobs are for the druid in question.
  • the step has not been run for some time
    • check robot logs
    • check timestamps in workflow step

One way to check workflow step timestamps:

from rails console on workflow-server-prod:

[2] pry(main)> WorkflowStep.where(druid: 'druid:vf193ym6861',  workflow: 'accessionWF').order(:version, :id).pluck(:version,:process,:status,:id)

=> [[1, "start-accession", "completed", 169172449],
 [1, "descriptive-metadata", "skipped", 169172450],
 [1, "rights-metadata", "skipped", 169172451],
 [1, "content-metadata", "skipped", 169172452],
 [1, "technical-metadata", "completed", 169172453],
 [1, "remediate-object", "completed", 169172454],
 [1, "shelve", "completed", 169172455],
 [1, "publish", "started", 169172456],             # <-- HERE is the step!
 [1, "provenance-metadata", "waiting", 169172457],
 [1, "sdr-ingest-transfer", "waiting", 169172458],
 [1, "sdr-ingest-received", "waiting", 169172459],
 [1, "reset-workspace", "waiting", 169172460],
 [1, "end-accession", "waiting", 169172461]]

[3] pry(main)> wfs = WorkflowStep.find(169172456)

=> #<WorkflowStep:0x0000000002cec6e0
 id: 169172456,
 druid: "druid:vf193ym6861",
 workflow: "accessionWF",
 process: "publish",
 status: "started",
 error_msg: nil,
 error_txt: nil,
 attempts: 0,
 lifecycle: "published",
 elapsed: 0.1e1,
 version: 1,
 note: "common-accessioning-prod-e.stanford.edu",
 lane_id: "low",
 created_at: Mon, 29 Jun 2020 10:06:55 UTC +00:00,
 updated_at: Mon, 29 Jun 2020 10:07:01 UTC +00:00,  # <--  THIS TIMESTAMP
 active_version: true>

How do we address this situation?

When the step needs to be re-enqueued:

From rails console on workflow-server:

  1. On the workflow-server, find the step:

One way:

[2] pry(main)> WorkflowStep.where(druid: 'druid:vf193ym6861',  workflow: 'accessionWF').order(:version, :id).pluck(:version,:process,:status,:id)

=> [[1, "start-accession", "completed", 169172449],
 [1, "descriptive-metadata", "skipped", 169172450],
 [1, "rights-metadata", "skipped", 169172451],
 [1, "content-metadata", "skipped", 169172452],
 [1, "technical-metadata", "completed", 169172453],
 [1, "remediate-object", "completed", 169172454],
 [1, "shelve", "completed", 169172455],
 [1, "publish", "started", 169172456],            # <-- HERE is the step!
 [1, "provenance-metadata", "waiting", 169172457],
 [1, "sdr-ingest-transfer", "waiting", 169172458],
 [1, "sdr-ingest-received", "waiting", 169172459],
 [1, "reset-workspace", "waiting", 169172460],
 [1, "end-accession", "waiting", 169172461]]

[3] pry(main)> wfs = WorkflowStep.find(169172456)
  1. Enqueue the step
[3] pry(main)> QueueService.enqueue(wfs)
  1. Ensure the step gets completed, or errors

Often easiest in Argo:

argo item view workflow steps

argo item workflow view showing completed step

  1. If it errors, treat it like any other step error -- does it need SDR manager attention? Can you fix it? Etc.

Setting a deprecated step to "skipped"

WARNING: do not use this method if a step is still part of a current workflow. Setting a step manually to completed in the workflow database without actually running the step can cause a number of problems down the line, as some workflow steps depend on the actions taken by other steps. If in doubt as to whether a step has been deprecated, check the Argo workflow view to see if the step appears in workflow model. If still in doubt, ask for another set of eyes in the #dlss-infrastructure channel.

In this example (from stage/QA) a druid was caught in progress during a workcycle where the wasCrawlDisseminationWF was modified to replace several workflow steps with new ones, as part of a migration from openwayback to pywb as the technology underlying SWAP.

  1. Look at the list of steps in the active workflow and determine what is "waiting", "queued", or "started". Steps that are "skipped" or "completed" can be ignored.
3.0.0 :004 > WorkflowStep.where(druid: 'druid:bq185xg0026', workflow: "wasCrawlDisseminationWF").pluck(:version,:process,:status)
 => 
[[1, "cdxj-generator", "completed"],
 [1, "cdxj-merge", "completed"],
 [1, "start", "completed"],
 [1, "cdx-merge-sort-publish", "waiting"], # <-- DEPRECATED step
 [1, "path-indexer", "waiting"], # <-- DEPRECATED step
 [1, "warc-extractor", "completed"],
 [1, "cdx-generator", "queued"]] # <-- DEPRECATED step

Note that none of the deprecated steps are part of the current workflow view for that druid in Argo:

Screen Shot 2022-12-01 at 11 59 02 AM

None of the deprecated steps will complete on their own, as the workflow no longer processes those steps.

(NOTE: this is only a problem if a workflow is in progress at the time the steps are deprecated. New workflows on new versions will run only the new steps. Old workflows will remain complete.)

  1. Manually update the status of the relevant steps to "skipped". In terms of completing the workflow overall, it does not seem to matter if the steps are marked "completed" or "skipped" but "skipped" is more accurate in this situation.
3.0.0 :006 > WorkflowStep.where(druid: 'druid:bq185xg0026', workflow: "wasCrawlDisseminationWF", process: "cdx-merge-sort-publish").update(status: "skipped")
 => 
[#<WorkflowStep:0x0000557f9c9beb98
  id: 106213,
  druid: "druid:bq185xg0026",
  workflow: "wasCrawlDisseminationWF",
  process: "cdx-merge-sort-publish",
  status: "skipped",
  error_msg: nil,
  error_txt: nil,
  attempts: 0,
  lifecycle: nil,
  elapsed: nil,
  version: 1,
  note: nil,
  lane_id: "default",
  created_at: Mon, 25 Jul 2022 13:10:17.328979000 UTC +00:00,
  updated_at: Wed, 30 Nov 2022 21:01:12.478007024 UTC +00:00,
  active_version: true,
  completed_at: Wed, 30 Nov 2022 21:01:12.477899236 UTC +00:00>] 
  1. Check Argo and see if the whole workflow is now listed as completed:
Screen Shot 2022-12-01 at 12 02 31 PM

If the workflow is still active, re-check the list of steps and repeat the above actions.

NOTE: In some instances you may need to both manually complete deprecated steps and re-enqueue a current step. In rare instances, you may find a step still listed in progress (waiting, queued, started) from an earlier version of the druid, though most historical problems like that have been addressed as of December 2022.