Workflows Exercise 1.4: Handling a DAG That Fails

The objective of this exercise is to help you learn how DAGMan deals with job failures. DAGMan is built to help you recover from such failures.

Background

DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG making it easy to continue when the problem is fixed.

Breaking Things

Recall that DAGMan decides that a jobs fails if its exit code is non-zero. Let's modify our montage job so that it fails. Work in the same directory where you did the last DAG. Edit montage.sub to add a -h to the arguments. It will look like this (the change is highlighted in red):

executable              = /usr/bin/montage
arguments               = -h tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg
transfer_input_files    = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm
output                  = montage.out
error                   = montage.err
log                     = montage.log
request_memory          = 1GB
request_disk            = 1GB
request_cpus            = 1
queue

Submit the DAG again:

username@learn $ condor_submit_dag goatbrot.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : goatbrot.dag.condor.sub
Log of DAGMan debugging messages                 : goatbrot.dag.dagman.out
Log of Condor library output                     : goatbrot.dag.lib.out
Log of Condor library error messages             : goatbrot.dag.lib.err
Log of the life of condor_dagman itself          : goatbrot.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 77.
-----------------------------------------------------------------------

Use watch to watch the jobs until they finish. In a separate window, use tail --lines=500 -f goatbrot.dag.dagman.out to watch what DAGMan does.

06/22/12 17:57:41 Setting maximum accepts per cycle 8.
06/22/12 17:57:41 ******************************************************
06/22/12 17:57:41 ** condor_scheduniv_exec.77.0 (CONDOR_DAGMAN) STARTING UP
06/22/12 17:57:41 ** /usr/bin/condor_dagman
06/22/12 17:57:41 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
06/22/12 17:57:41 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
06/22/12 17:57:41 ** $CondorVersion: 7.7.6 Apr 16 2012 BuildID: 34175 PRE-RELEASE-UWCS $
06/22/12 17:57:41 ** $CondorPlatform: x86_64_rhap_5.7 $
06/22/12 17:57:41 ** PID = 26867
06/22/12 17:57:41 ** Log last touched time unavailable (No such file or directory)
06/22/12 17:57:41 ******************************************************
06/22/12 17:57:41 Using config source: /etc/condor/condor_config
06/22/12 17:57:41 Using local config sources: 
06/22/12 17:57:41    /etc/condor/config.d/00-chtc-global.conf
06/22/12 17:57:41    /etc/condor/config.d/01-chtc-submit.conf
06/22/12 17:57:41    /etc/condor/config.d/02-chtc-flocking.conf
06/22/12 17:57:41    /etc/condor/config.d/03-chtc-jobrouter.conf
06/22/12 17:57:41    /etc/condor/config.d/04-chtc-blacklist.conf
06/22/12 17:57:41    /etc/condor/config.d/99-osg-ss-group.conf
06/22/12 17:57:41    /etc/condor/config.d/99-roy-extras.conf
06/22/12 17:57:41    /etc/condor/condor_config.local

Below is where DAGMan realizes that the montage node failed:

06/22/12 18:08:42 Event: ULOG_EXECUTE for Condor Node montage (82.0.0)
06/22/12 18:08:42 Number of idle job procs: 0
06/22/12 18:08:42 Event: ULOG_IMAGE_SIZE for Condor Node montage (82.0.0)
06/22/12 18:08:42 Event: ULOG_JOB_TERMINATED for Condor Node montage (82.0.0)
06/22/12 18:08:42 Node montage job proc (82.0.0) failed with status 1.
06/22/12 18:08:42 Number of idle job procs: 0
06/22/12 18:08:42 Of 5 nodes total:
06/22/12 18:08:42  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
06/22/12 18:08:42   ===     ===      ===     ===     ===        ===      ===
06/22/12 18:08:42     4       0        0       0       0          0        1
06/22/12 18:08:42 0 job proc(s) currently held
06/22/12 18:08:42 Aborting DAG...
06/22/12 18:08:42 Writing Rescue DAG to goatbrot.dag.rescue001...
06/22/12 18:08:42 Note: 0 total job deferrals because of -MaxJobs limit (0)
06/22/12 18:08:42 Note: 0 total job deferrals because of -MaxIdle limit (0)
06/22/12 18:08:42 Note: 0 total job deferrals because of node category throttles
06/22/12 18:08:42 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
06/22/12 18:08:42 Note: 0 total POST script deferrals because of -MaxPost limit (0)
06/22/12 18:08:42 **** condor_scheduniv_exec.77.0 (condor_DAGMAN) pid 26867 EXITING WITH STATUS 1

DAGMan notices that one of the jobs failed because it's exit code was non-zero. DAGMan ran as much of the DAG as possible and logged enough information to continue the run when the situation is resolved. Do you see the part where it wrote the resuce DAG?

Look at the rescue DAG file. It's called a partial DAG because it indicates what part of the DAG has already been completed.

username@learn $ cat goatbrot.dag.rescue001
# Rescue DAG file, created after running
#   the goatbrot.dag DAG file
# Created 6/22/2012 23:08:42 UTC
# Rescue DAG version: 2.0.1 (partial)
#
# Total number of Nodes: 5
# Nodes premarked DONE: 4
# Nodes that failed: 1
#   montage,<ENDLIST>

DONE g1
DONE g2
DONE g3
DONE g4

From the comment near the top, we know that the montage node failed. Let's fix it by getting rid of the offending -h argument. Change montage.sub to look like:

executable              = /usr/bin/montage
arguments               = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle-from-dag.jpg
transfer_input_files    = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm
output                  = montage.out
error                   = montage.err
log                     = montage.log
request_memory          = 1GB
request_disk            = 1GB
request_cpus            = 1
queue

Now we can re-submit our original DAG and DAGMan will pick up where it left off. It will automatically notice the rescue DAG. If you didn't fix the problem, DAGMan would generate another rescue DAG.

username@learn $ condor_submit_dag goatbrot.dag
Running rescue DAG 1
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : goatbrot.dag.condor.sub
Log of DAGMan debugging messages                 : goatbrot.dag.dagman.out
Log of Condor library output                     : goatbrot.dag.lib.out
Log of Condor library error messages             : goatbrot.dag.lib.err
Log of the life of condor_dagman itself          : goatbrot.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 83.
-----------------------------------------------------------------------

username@learn $ tail -f goatbrot.dag.dagman.out
06/23/12 11:30:53 ******************************************************
06/23/12 11:30:53 ** condor_scheduniv_exec.83.0 (CONDOR_DAGMAN) STARTING UP
06/23/12 11:30:53 ** /usr/bin/condor_dagman
06/23/12 11:30:53 ** SubsystemInfo: name=DAGMAN type=DAGMAN(10) class=DAEMON(1)
06/23/12 11:30:53 ** Configuration: subsystem:DAGMAN local:<NONE> class:DAEMON
06/23/12 11:30:53 ** $CondorVersion: 7.7.6 Apr 16 2012 BuildID: 34175 PRE-RELEASE-UWCS $
06/23/12 11:30:53 ** $CondorPlatform: x86_64_rhap_5.7 $
06/23/12 11:30:53 ** PID = 28576
06/23/12 11:30:53 ** Log last touched 6/22 18:08:42
06/23/12 11:30:53 ******************************************************
06/23/12 11:30:53 Using config source: /etc/condor/condor_config
...

Here is where DAGMAN notices that there is a rescue DAG

06/23/12 11:30:53 Parsing 1 dagfiles
06/23/12 11:30:53 Parsing goatbrot.dag ...
%RED%06/23/12 11:30:53 Found rescue DAG number 1; running goatbrot.dag.rescue001 in combination with normal DAG file
06/23/12 11:30:53 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
06/23/12 11:30:53 USING RESCUE DAG goatbrot.dag.rescue001
06/23/12 11:30:53 Dag contains 5 total jobs

Shortly thereafter it sees that four jobs have already finished.

06/23/12 11:31:05 Bootstrapping...
06/23/12 11:31:05 Number of pre-completed nodes: 4
06/23/12 11:31:05 Registering condor_event_timer...
06/23/12 11:31:06 Sleeping for one second for log file consistency
06/23/12 11:31:07 MultiLogFiles: truncating log file /home/roy/condor/goatbrot/montage.log

Here is where DAGMan resubmits the montage job and waits for it to complete.

06/23/12 11:31:07 Submitting Condor Node montage job(s)...
06/23/12 11:31:07 submitting: condor_submit 
      -a dag_node_name' '=' 'montage 
      -a +DAGManJobId' '=' '83 
      -a DAGManJobId' '=' '83 
      -a submit_event_notes' '=' 'DAG' 'Node:' 'montage 
      -a DAG_STATUS' '=' '0 
      -a FAILED_COUNT' '=' '0 
      -a +DAGParentNodeNames' '=' '"g1,g2,g3,g4" 
      montage.sub
06/23/12 11:31:07 From submit: Submitting job(s).
06/23/12 11:31:07 From submit: 1 job(s) submitted to cluster 84.
06/23/12 11:31:07   assigned Condor ID (84.0.0)
06/23/12 11:31:07 Just submitted 1 job this cycle...
06/23/12 11:31:07 Currently monitoring 1 Condor log file(s)
06/23/12 11:31:07 Event: ULOG_SUBMIT for Condor Node montage (84.0.0)
06/23/12 11:31:07 Number of idle job procs: 1
06/23/12 11:31:07 Of 5 nodes total:
06/23/12 11:31:07  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
06/23/12 11:31:07   ===     ===      ===     ===     ===        ===      ===
06/23/12 11:31:07     4       0        1       0       0          0        0
06/23/12 11:31:07 0 job proc(s) currently held
06/23/12 11:40:22 Currently monitoring 1 Condor log file(s)
06/23/12 11:40:22 Event: ULOG_EXECUTE for Condor Node montage (84.0.0)
06/23/12 11:40:22 Number of idle job procs: 0
06/23/12 11:40:22 Event: ULOG_IMAGE_SIZE for Condor Node montage (84.0.0)
06/23/12 11:40:22 Event: ULOG_JOB_TERMINATED for Condor Node montage (84.0.0)

This is where the montage finished.

06/23/12 11:40:22 Node montage job proc (84.0.0) completed successfully.
06/23/12 11:40:22 Node montage job completed
06/23/12 11:40:22 Number of idle job procs: 0
06/23/12 11:40:22 Of 5 nodes total:
06/23/12 11:40:22  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
06/23/12 11:40:22   ===     ===      ===     ===     ===        ===      ===
06/23/12 11:40:22     5       0        0       0       0          0        0
06/23/12 11:40:22 0 job proc(s) currently held

And here DAGMan decides that the work is all done.

06/23/12 11:40:22 All jobs Completed!
06/23/12 11:40:22 Note: 0 total job deferrals because of -MaxJobs limit (0)
06/23/12 11:40:22 Note: 0 total job deferrals because of -MaxIdle limit (0)
06/23/12 11:40:22 Note: 0 total job deferrals because of node category throttles
06/23/12 11:40:22 Note: 0 total PRE script deferrals because of -MaxPre limit (0)
06/23/12 11:40:22 Note: 0 total POST script deferrals because of -MaxPost limit (0)
06/23/12 11:40:22 **** condor_scheduniv_exec.83.0 (condor_DAGMAN) pid 28576 EXITING WITH STATUS 0

Success! Now go ahead and clean up.

Bonus Challenge

If you have time, add an extra node to the DAG. Copy our original "simple" program, but make it exit with a 1 instead of a 0. DAGMan would consider this a failure, but you'll tell DAGMan that it's really a success. This is reasonable--many real world programs use a variety of return codes, and you might need to help DAGMan distinguish success from failure.

Write a POST script that checks the return value. Check the HTCondor manual to see how to describe your post script.