Simple Example of a DAGMan Workflow
This guide walks you step-by-step through the construction and submission of a simple DAGMan workflow. We recommend this guide if you are interested in automating your job submissions.
Overview
In this guide:
- Introduction
- Structure of the DAG
- The Minimal DAG Input File
- The Submit Files
- Running the Simple DAG
- Monitoring the Simple DAG
- Wrapping Up
For full details on the various DAGMan features, see the HTCondor manual pages.
1. Introduction
Consider the case of two HTCondor jobs that use the submit files A.sub and B.sub.
Let's say that A.sub generates an output file (output.txt) that B.sub will analyze.
To run this workflow manually, we would:
- Submit the first HTCondor job with condor_submit A.sub.
- Wait for the first HTCondor job to complete successfully.
- Submit the second HTCondor job with condor_submit B.sub.
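The manual procedure above could be scripted roughly as follows. This is a minimal sketch, assuming that A.sub writes its job log to A.log (as in the submit file shown later in this guide) and that the condor_wait command is available on your access point; the function name run_manually is made up for illustration. Because condor_wait only reports that the job has left the queue, the check for output.txt is used here as a rough stand-in for "completed successfully".

```shell
# Sketch of the manual workflow; run_manually is a hypothetical helper name.
run_manually() {
    # Step 1: submit the first job.
    condor_submit A.sub || return 1

    # Step 2: block until the job recorded in A.log leaves the queue.
    # Assumes A.sub sets "log = A.log" and that A.log holds only this job.
    condor_wait A.log || return 1

    # condor_wait does not check the exit code of the job, so use the
    # expected output file as a rough stand-in for "success".
    [ -s output.txt ] || { echo "output.txt missing or empty" >&2; return 1; }

    # Step 3: submit the second job.
    condor_submit B.sub
}
```

Calling run_manually from the directory that holds the submit files would tie up a terminal session for as long as the first job is idle or running.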
If the first HTCondor job using A.sub is fairly short, then manually running this workflow is not a big deal.
But if the first HTCondor job takes a long time to complete (perhaps it takes several hours to run, or has to wait for special resources), this can be very inconvenient.
Instead, we can use DAGMan to automatically submit B.sub once the first HTCondor job using A.sub has completed successfully.
This guide walks through the process of creating such a DAGMan workflow.
2. Structure of the DAG
In this scenario, our workflow could be described as a DAG consisting of two nodes (A.sub and B.sub) connected by a single edge (output.txt).
To represent this relationship, we will define nodes A and B, corresponding to A.sub and B.sub, respectively, and connect them with a line pointing from A to B, like in this figure:
In order to use DAGMan to run this workflow, we need to communicate this structure to DAGMan via the .dag input file.
3. The Minimal DAG Input File
Let's call the input file simple.dag.
At minimum, the contents of the simple.dag input file are
# simple.dag
# Define the DAG jobs
JOB A A.sub
JOB B B.sub
# Define the connections
PARENT A CHILD B
In a DAGMan input file, a node is defined using the JOB keyword, followed by the name of the node and the name of the corresponding submit file.
In this case, we have created a node named A and instructed DAGMan to use the submit file A.sub for executing that node.
We have similarly created node B and instructed DAGMan to use the submit file B.sub.
(While there is no requirement that the name of the node match the name of the corresponding submit file, it is convenient to use a consistent naming scheme.)
To connect the nodes, we use the PARENT .. CHILD .. syntax.
Since node B requires that node A has completed successfully, we say that node A is the PARENT while node B is the CHILD.
Note that we do not need to define why node B is dependent on node A, only that it is.
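To double-check how the JOB and PARENT .. CHILD .. lines will be read, the nodes and edges can be listed with a few lines of awk. This is only an illustrative sketch: describe_dag is a made-up helper, not an HTCondor command, and it only recognizes the upper-case keywords used in this guide.

```shell
# describe_dag is a hypothetical helper, not an HTCondor command.
# It lists the nodes and edges declared in a DAG input file.
# Simplification: only upper-case JOB / PARENT / CHILD keywords are handled.
describe_dag() {
    awk '
        # JOB <node name> <submit file>
        $1 == "JOB" { printf "node %s uses %s\n", $2, $3 }

        # PARENT <parents...> CHILD <children...>
        $1 == "PARENT" {
            child_at = 0
            for (i = 2; i <= NF; i++) if ($i == "CHILD") child_at = i
            if (child_at == 0) next
            # Every listed parent is connected to every listed child.
            for (p = 2; p < child_at; p++)
                for (c = child_at + 1; c <= NF; c++)
                    printf "edge %s -> %s\n", $p, $c
        }
    ' "$1"
}
```

For the simple.dag file shown above, describe_dag simple.dag prints "node A uses A.sub", "node B uses B.sub", and "edge A -> B", one per line.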
4. The Submit Files
Now let's define simple examples of the submit files A.sub and B.sub.
Node A
First, the submit file A.sub uses the executable A.sh, which will generate the file called output.txt.
We have explicitly told HTCondor to transfer back this file by using the transfer_output_files command.
# A.sub
executable = A.sh
log = A.log
output = A.out
error = A.err
transfer_output_files = output.txt
+JobDurationCategory = "Medium"
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue
The executable file simply saves the hostname of the machine running the script:
#!/bin/bash
# A.sh
hostname > output.txt
sleep 1m # so we can see the job in "running" status
Node B
Second, the submit file B.sub uses the executable B.sh to print a message using the contents of the output.txt file generated by A.sh.
We have explicitly told HTCondor to transfer output.txt as an input file for this job, using the transfer_input_files command.
Thus we have finally defined the "edge" that connects nodes A and B: the use of output.txt.
# B.sub
executable = B.sh
log = B.log
output = B.out
error = B.err
transfer_input_files = output.txt
+JobDurationCategory = "Medium"
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue
The executable file contains the command for printing the desired message, which will be printed to B.out.
#!/bin/bash
# B.sh
echo "The previous job was executed on the following machine:"
cat output.txt
sleep 1m # so we can see the job in "running" status
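Before submitting anything, the hand-off between the two scripts can be tried locally. The sketch below mimics what A.sh and B.sh do (minus the sleep) inside a scratch directory; simulate_handoff is a made-up helper name and no HTCondor commands are involved, so it says nothing about file transfer or scheduling.

```shell
# simulate_handoff is a hypothetical helper; it runs locally, without HTCondor.
simulate_handoff() {
    local workdir="$1"
    (
        cd "$workdir" || exit 1

        # What A.sh does: record the hostname in output.txt.
        hostname > output.txt

        # What B.sh does: print a message plus the contents of output.txt.
        # HTCondor would capture this standard output in B.out.
        {
            echo "The previous job was executed on the following machine:"
            cat output.txt
        } > B.out
    )
}

scratch=$(mktemp -d)
simulate_handoff "$scratch" && cat "$scratch/B.out"
```

The second line of the printed message should be the name of the machine where the sketch was run, which is the same information node B reports about node A in the real workflow.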
The directory structure
Based on the contents of simple.dag, DAGMan is expecting that the submit files A.sub and B.sub are in the same directory as simple.dag.
The submit files in turn are expecting A.sh and B.sh to be in the same directory as A.sub and B.sub.
Thus, we have the following directory structure:
DAG_simple/
|-- A.sh
|-- A.sub
|-- B.sh
|-- B.sub
|-- simple.dag
It is possible to organize each job into its own directory, but for now we will use this simple, flat organization.
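A quick pre-flight check can confirm that the flat layout is complete before the DAG is submitted. The following is a minimal sketch; check_layout is a hypothetical helper, and the file list is hard-coded to the five files used in this guide.

```shell
# check_layout is a hypothetical pre-flight helper, not an HTCondor command.
# It reports which of the files used in this guide are missing from a directory.
check_layout() {
    local dir="${1:-.}" missing=0 f
    for f in A.sh A.sub B.sh B.sub simple.dag; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Exit status 0 means every file was found.
    return "$missing"
}
```

For example, check_layout DAG_simple && echo "layout looks complete" prints the confirmation only when all five files are present.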
5. Running the Simple DAG
To run the DAG workflow described by simple.dag, we use the HTCondor command condor_submit_dag:
condor_submit_dag simple.dag
The DAGMan utility will then parse the input file and generate an assortment of related files that it will use for monitoring and managing your workflow. Here is the output of running the above command:
[user@ap40 DAG_simple]$ condor_submit_dag simple.dag
Loading classad userMap 'checkpoint_destination_map' ts=1699037029 from /etc/condor/checkpoint-destination-mapfile
-----------------------------------------------------------------------
File for submitting this DAG to HTCondor : simple.dag.condor.sub
Log of DAGMan debugging messages : simple.dag.dagman.out
Log of HTCondor library output : simple.dag.lib.out
Log of HTCondor library error messages : simple.dag.lib.err
Log of the life of condor_dagman itself : simple.dag.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 562265.
-----------------------------------------------------------------------
The output shows the list of standard files that are created with every DAG submission along with brief descriptions. A couple of additional files, some of them temporary, will be created during the lifetime of the DAG.
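When scripting around condor_submit_dag, it can be handy to capture the cluster ID of the DAGMan scheduler job. The sketch below assumes the output format shown above (a line of the form "1 job(s) submitted to cluster 562265."); dag_cluster_id is a made-up helper name, and the exact wording of the output may differ between HTCondor versions.

```shell
# dag_cluster_id is a hypothetical helper; it reads condor_submit_dag
# output on standard input and prints the cluster ID it finds.
dag_cluster_id() {
    # Keep only the digits that follow "submitted to cluster".
    sed -n 's/.*job(s) submitted to cluster \([0-9][0-9]*\)\..*/\1/p'
}
```

Used as condor_submit_dag simple.dag | dag_cluster_id, this would print 562265 for the submission shown above.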
6. Monitoring the Simple DAG
You can see the status of the DAG in your queue just like with any other HTCondor job submission.
[user@ap40 DAG_simple]$ condor_q
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:26:51
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
user simple.dag+562265 12/14 11:26 _ _ 1 2 562279.0
There are a couple of things to note about the condor_q output:
- The BATCH_NAME for the DAGMan job is the name of the input DAG file, simple.dag, plus the Job ID of the DAGMan scheduler job (562265 in this case): simple.dag+562265.
- The total number of jobs for simple.dag+562265 corresponds to the total number of nodes in the DAG (2).
- Only 1 node is listed as "Idle", meaning that DAGMan has only submitted 1 job so far. This is consistent with the fact that node A has to complete before DAGMan can submit the job for node B.
Note that if you are very quick to run your condor_q command after running your condor_submit_dag command, then you may see only the DAGMan scheduler job. It may take a few seconds for DAGMan to start up and submit the HTCondor job associated with the first node.
To see more detailed information about the DAG workflow, use condor_q -nob -dag.
For example,
[user@ap40 DAG_simple]$ condor_q -dag -nob
-- Schedd: ap40.uw.osg-htc.org : <128.105.68.92:9618?... @ 12/14/23 11:27:03
ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD
562265.0 user 12/14 11:26 0+00:00:37 R 0 0.5 condor_dagman -p 0 -f -l . -Loc
562279.0 |-A 12/14 11:26 0+00:00:00 I 0 0.0 A.sh
In this case, the first entry is the DAGMan scheduler job that you created when you first submitted the DAG.
The following entries correspond to the nodes whose jobs are currently in the queue.
Nodes that have not yet been submitted by DAGMan or that have completed and thus left the queue will not show up in your condor_q output.
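The per-node rows of this output can be condensed into a short status list. This is a sketch under the assumption that the columns appear exactly as in the example above (ID, OWNER/NODENAME, SUBMITTED date and time, RUN_TIME, ST, ...); node_states is a hypothetical helper, and the layout may vary with HTCondor version or configuration.

```shell
# node_states is a hypothetical helper; it reads "condor_q -dag -nob"
# output on standard input and prints "<node name> <status>" per node.
# Assumes the column layout shown above: ID, OWNER/NODENAME,
# SUBMITTED (date and time), RUN_TIME, ST, ...
node_states() {
    awk '
        # Node rows are recognizable by the "|-" prefix in column 2.
        substr($2, 1, 2) == "|-" { print substr($2, 3), $6 }
    '
}
```

With the queue in the state shown above, condor_q -dag -nob | node_states would print "A I", that is, node A is idle; the DAGMan scheduler job itself is skipped because its row has no "|-" prefix.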
7. Wrapping Up
After waiting enough time, this simple DAG workflow should complete without any issues. But of course, that will not be the case for every DAG, especially as you start to create your own. DAGMan has many more features for managing and submitting DAG workflows, including handling errors, combining DAG workflows, and restarting failed DAG workflows.
For now, we recommend that you continue exploring DAGMan by going through our Intermediate DAGMan Tutorial. There is also our guide Overview: Submit Workflows with HTCondor's DAGMan, which contains links to more resources in the More Resources section.
Finally, the definitive guide to DAGMan and DAG workflows is HTCondor's DAGMan Documentation.