A More Complex DAG
Objective
The objective of this exercise is to run a real set of jobs with DAGMan.
Make your job submission files
We'll run our goatbrot example. If you haven't read about it yet, please do so now. We are going to make a DAG with four simultaneous jobs (goatbrot) and one final node to stitch their output together (montage). That makes five jobs in total. We'll run goatbrot with more iterations (100,000) so it takes longer to run.
Create your five submit files. The goatbrot jobs are very similar to each other, but they have slightly different parameters (arguments) and output files; an optional shell sketch for generating them appears after the four files below.
You have already placed the goatbrot executable in your bin directory: ~/bin/goatbrot. Condor does not deal well with ~/ as the home directory, so in the submit files we will use the full path /home/jovyan/bin/ instead, which points to the same directory.
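Before writing the submit files, it is worth a quick (optional) check that the executable really is where they expect it:
$ ls -l /home/jovyan/bin/goatbrot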
goatbrot1.sub
executable = /home/jovyan/bin/goatbrot
arguments = -i 100000 -c -0.75,0.75 -w 1.5 -s 500,500 -o tile_0_0.ppm
log = goatbrot.log
output = goatbrot.out.0.0
error = goatbrot.err.0.0
should_transfer_files = YES
when_to_transfer_output = ONEXIT
queue
goatbrot2.sub
executable = /home/jovyan/bin/goatbrot
arguments = -i 100000 -c 0.75,0.75 -w 1.5 -s 500,500 -o tile_0_1.ppm
log = goatbrot.log
output = goatbrot.out.0.1
error = goatbrot.err.0.1
should_transfer_files = YES
when_to_transfer_output = ONEXIT
queue
goatbrot3.sub
executable = /home/jovyan/bin/goatbrot
arguments = -i 100000 -c -0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_0.ppm
log = goatbrot.log
output = goatbrot.out.1.0
error = goatbrot.err.1.0
should_transfer_files = YES
when_to_transfer_output = ONEXIT
queue
goatbrot4.sub
executable = /home/jovyan/bin/goatbrot
arguments = -i 100000 -c 0.75,-0.75 -w 1.5 -s 500,500 -o tile_1_1.ppm
log = goatbrot.log
output = goatbrot.out.1.1
error = goatbrot.err.1.1
should_transfer_files = YES
when_to_transfer_output = ONEXIT
queue
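If you would rather not type four nearly identical submit files by hand, a short shell loop can generate them. This is only an optional sketch; it reproduces the four files shown above, with the tile indices and centre coordinates taken directly from those files:
#!/bin/bash
# Optional: generate goatbrot1.sub ... goatbrot4.sub from a template.
# Each entry is "row column centre", matching the hand-written files above.
n=1
for spec in "0 0 -0.75,0.75" "0 1 0.75,0.75" "1 0 -0.75,-0.75" "1 1 0.75,-0.75"
do
  set -- $spec   # $1 = row, $2 = column, $3 = centre of the tile
  cat > goatbrot${n}.sub <<EOF
executable = /home/jovyan/bin/goatbrot
arguments = -i 100000 -c ${3} -w 1.5 -s 500,500 -o tile_${1}_${2}.ppm
log = goatbrot.log
output = goatbrot.out.${1}.${2}
error = goatbrot.err.${1}.${2}
should_transfer_files = YES
when_to_transfer_output = ONEXIT
queue
EOF
  n=$((n+1))
done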
montage.sub
You should notice a few things about the montage submission file:
- The transfer_input_files statement refers to the files created by the other jobs.
- We do not transfer the montage program itself, because it is already available on the VM.
- We need to write a wrapper script named montage (see below) which sets up the environment for the montage program and then runs it.
universe = vanilla
executable = montage
arguments = tile_0_0.ppm tile_0_1.ppm tile_1_0.ppm tile_1_1.ppm -mode Concatenate -tile 2x2 mandle.gif
should_transfer_files = YES
when_to_transfer_output = ONEXIT
transfer_input_files = tile_0_0.ppm,tile_0_1.ppm,tile_1_0.ppm,tile_1_1.ppm
transfer_executable = true
output = montage.out
error = montage.err
log = montage.log
queue
montage
#!/bin/bash
# Set up the software environment that provides the montage program,
# then pass all of our arguments straight through to it.
source /cvmfs/sft.cern.ch/lcg/views/setupViews.sh LCG_105a x86_64-ubuntu2204-gcc11-opt
montage "$@"
As before, you will need to make the montage script executable with chmod +x montage.
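A quick optional check that the wrapper is in place and executable:
$ chmod +x montage
$ test -x montage && echo "montage wrapper is executable"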
Make your DAG
In a file called goatbrot.dag, you have your DAG specification:
JOB g1 goatbrot1.sub
JOB g2 goatbrot2.sub
JOB g3 goatbrot3.sub
JOB g4 goatbrot4.sub
JOB montage montage.sub
PARENT g1 g2 g3 g4 CHILD montage
Ask yourself: do you know how we ensure that all the goatbrot
commands can run simultaneously and all of them will complete before we run the montage job?
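As a hint, the single PARENT ... CHILD line above is equivalent to listing each dependency on its own line. Nothing orders the four goatbrot nodes with respect to one another, so DAGMan can run them at the same time, and montage only becomes runnable once all four of its parents have finished:
PARENT g1 CHILD montage
PARENT g2 CHILD montage
PARENT g3 CHILD montage
PARENT g4 CHILD montage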
Running the DAG
Submit your DAG:
$ condor_submit_dag goatbrot.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor : goatbrot.dag.condor.sub
Log of DAGMan debugging messages : goatbrot.dag.dagman.out
Log of Condor library output : goatbrot.dag.lib.out
Log of Condor library error messages : goatbrot.dag.lib.err
Log of the life of condor_dagman itself : goatbrot.dag.dagman.log
Submitting job(s).
1 job(s) submitted to cluster 71.
-----------------------------------------------------------------------
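If you are curious about what was actually submitted, you can look at the submit file that condor_submit_dag generated for the DAGMan job itself (it is the first file listed in the output above):
$ cat goatbrot.dag.condor.sub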
Watch your DAG
Watch with condor_q:
$ watch -n 1 condor_q YOUR_USER_ID -nobatch
To quit the watch command, press Ctrl-C.
Here we see DAGMan running:
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
68.0 kagross 8/19 11:38 0+00:00:10 R 0 0.3 condor_dagman
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
DAGMan has submitted the goatbrot jobs, but they haven't started running yet (note that the I
status stands for Idle):
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
68.0 kagross 8/19 11:38 0+00:00:10 R 0 0.3 condor_dagman
69.0 kagross 8/19 11:38 0+00:00:00 I 0 0.0 goatbrot -i 100000
70.0 kagross 8/19 11:38 0+00:00:00 I 0 0.0 goatbrot -i 100000
71.0 kagross 8/19 11:38 0+00:00:00 I 0 0.0 goatbrot -i 100000
72.0 kagross 8/19 11:38 0+00:00:00 I 0 0.0 goatbrot -i 100000
6 jobs; 0 completed, 0 removed, 4 idle, 2 running, 0 held, 0 suspended
They're running! (All four jobs are in state R
- running)
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
68.0 kagross 8/19 11:38 0+00:00:15 R 0 0.3 condor_dagman
69.0 kagross 8/19 11:38 0+00:00:05 R 0 0.0 goatbrot -i 100000
70.0 kagross 8/19 11:38 0+00:00:05 R 0 0.0 goatbrot -i 100000
71.0 kagross 8/19 11:38 0+00:00:05 R 0 0.0 goatbrot -i 100000
72.0 kagross 8/19 11:38 0+00:00:05 R 0 0.0 goatbrot -i 100000
5 jobs; 0 completed, 0 removed, 0 idle, 5 running, 0 held, 0 suspended
Two of the jobs have finished, while the others are still running (remember that completed jobs disappear from condor_q
output):
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
68.0 kagross 8/19 11:38 0+00:00:20 R 0 0.3 condor_dagman
71.0 kagross 8/19 11:38 0+00:00:10 R 0 0.0 goatbrot -i 100000
72.0 kagross 8/19 11:38 0+00:00:10 R 0 0.0 goatbrot -i 100000
3 jobs; 0 completed, 0 removed, 0 idle, 3 running, 0 held, 0 suspended
They finished, but DAGMan hasn't noticed yet. It only checks periodically:
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
68.0 kagross 8/19 11:38 0+00:00:30 R 0 0.3 condor_dagman
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
DAGMan submitted and ran the montage job. It ran so fast that I didn't capture it running. DAGMan will finish up soon:
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
68.0 kagross 8/19 11:38 0+00:01:01 R 0 0.3 condor_dagman
1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
Now it's all done:
-- Submitter: [email protected] : <172.16.200.1:9645> : frontal.cci.ucad.sn
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Examine your results. For some reason, goatbrot prints everything to stderr, not stdout.
$ cat goatbrot.err.0.0
Complex image:
    Center: -0.75 + 0.75i
    Width: 1.5
    Height: 1.5
    Upper Left: -1.5 + 1.5i
    Lower Right: 0 + 0i

Output image:
    Filename: tile_0_0.ppm
    Width, Height: 500, 500
    Theme: beej
    Antialiased: no

Mandelbrot:
    Max Iterations: 100000
    Continuous: no

Goatbrot:
    Multithreading: not supported in this build

Completed: 100.0%
Examine your log files (goatbrot.log and montage.log) and your DAGMan output file (goatbrot.dag.dagman.out). Do they look as you expect? Can you see the progress of the DAG in the DAGMan output file?
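One quick way to follow that progress without reading the whole file is to pull out DAGMan's periodic node-status summaries. This is just a sketch; the exact wording of these lines can vary between HTCondor versions:
$ grep -A 3 "nodes total" goatbrot.dag.dagman.out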
Does your final Mandelbrot image (mandle.gif) look correct? To view it, download it from the left sidebar as before, then display it with Firefox.
Clean up your results. Be careful here: you want to delete the goatbrot.dag.* bookkeeping files, not the goatbrot.dag file itself.
$ rm goatbrot.dag.*
$ rm goatbrot.out.*
$ rm goatbrot.err.*
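If you also want to remove the intermediate tile images (optional; keep mandle.gif if you still want to look at it):
$ rm tile_*.ppm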
On your own
- Re-run your DAG. When jobs are running, try condor_q -dag. What does it do differently?
- Challenge, if you have time: make a bigger DAG by making more tiles in the same area.