Lab Exercise: DAGMan

Purpose:

During this lab the user will become familiar with using DAGMan to run and monitor jobs. 

  1. Running Simple DAGMan Jobs
  2. More Complex DAGMan Jobs
  3. Recovering Failed DAGMan Jobs

 

Running Simple DAGMan Jobs

  1. Start by creating a directory called lab8 and cd into it:
$ cd

$ mkdir lab8

$ cd lab8

$
  2. This new lab8 directory should be used to contain any files we create during the remainder of this lab exercise.
     
  3. Create a file called lab8.sh and copy the following code into it.
#! /bin/sh

echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
exit 0
  4. Set the permissions on the file so that it may be executed by Condor.
$ chmod +x lab8.sh

$
  5. Create a file called lab8.submit and copy the following into it:
executable=lab8.sh
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
universe=globus
arguments=Example.$(Cluster).$(Process) 10
output=z.lab8.output.$(Process)
error=z.lab8.error.$(Process)
log=z.lab8.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue
  6. Create a new file called lab8.dag. This will be our very simple DAG instruction file for DAGMan to run.

Copy the following information into the file:

Job Simple lab8.submit  
  7. Submit the DAG using condor_submit_dag and watch it run using condor_q.
$ condor_submit_dag lab8.dag


-----------------------------------------------------------------------
File for submitting this DAG to Condor  : lab8.dag.condor.sub
Log of DAGMan debugging messages        : lab8.dag.dagman.out
Log of Condor library debug messages    : lab8.dag.lib.out
Log of the life of condor_dagman itself : lab8.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: lab8.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 43.
-----------------------------------------------------------------------
$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  43.0   btest         12/21 10:07   0+00:00:10 R  0   2.3  condor_dagman -f -
  44.0   btest         12/21 10:07   0+00:00:00 I  0   0.0  lab8.sh Example.

2 jobs; 1 idle, 1 running, 0 held

$ condor_q -globus

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          STATUS       MANAGER  HOST                EXECUTABLE
  44.0   btest          UNSUBMITTED  fork     ligo-server.ncsa.u  /home/btest/dag/pr

$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  43.0   btest         12/21 10:07   0+00:00:20 R  0   2.3  condor_dagman -f -
  44.0   btest         12/21 10:07   0+00:00:02 R  0   0.0  lab8.sh Example.

2 jobs; 0 idle, 2 running, 0 held

$ condor_q -globus

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          STATUS       MANAGER  HOST                EXECUTABLE
  44.0   btest          ACTIVE       fork     ligo-server.ncsa.u  /home/btest/dag/pr

$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD
  43.0   btest         12/21 10:07   0+00:00:30 R  0   2.3  condor_dagman -f -
  44.0   btest         12/21 10:07   0+00:00:11 C  0   0.0  lab8.sh Example.

1 jobs; 0 idle, 1 running, 0 held

$ condor_q -globus

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          STATUS       MANAGER  HOST                EXECUTABLE
  44.0   btest          DONE         fork     ligo-server.ncsa.u  /home/btest/dag/pr

$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER          SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

$
  8. Once the job is finished you should have a number of new files in your directory. These files include output from the Condor job and from DAGMan itself. Take some time to look at the contents of these files (a few example commands follow the listing below).
$ ls -la

total 48
drwxrwxr-x 2 mfreemon mfreemon 4096 Mar 9 23:38 .
drwx------ 23 mfreemon mfreemon 4096 Mar 9 22:14 ..
-rw-rw-r-- 1 mfreemon mfreemon 23   Mar 9 23:36 lab8.dag
-rw-rw-r-- 1 mfreemon mfreemon 482  Mar 9 23:37 lab8.dag.condor.sub
-rw-rw-r-- 1 mfreemon mfreemon 608  Mar 9 23:38 lab8.dag.dagman.log
-rw-r--r-- 1 mfreemon mfreemon 2814 Mar 9 23:38 lab8.dag.dagman.out
-rw------- 1 mfreemon mfreemon 0    Mar 9 23:37 lab8.dag.dummy_log
-rw-rw-r-- 1 mfreemon mfreemon 29   Mar 9 23:38 lab8.dag.lib.out
-rwxrwxr-x 1 mfreemon mfreemon 298  Mar 8 22:57 lab8.sh
-rw-rw-r-- 1 mfreemon mfreemon 306  Mar 9 23:35 lab8.submit
-rw-r--r-- 1 mfreemon mfreemon 31   Mar 9 23:38 z.lab8.error.0
-rw-r--r-- 1 mfreemon mfreemon 861  Mar 9 23:38 z.lab8.log
-rw-r--r-- 1 mfreemon mfreemon 347  Mar 9 23:38 z.lab8.output.0
$
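For example, the following commands (the file names match the listing above) show the job's standard output, its standard error, and DAGMan's own progress log:

$ cat z.lab8.output.0
$ cat z.lab8.error.0
$ cat lab8.dag.dagman.out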

 

 

More Complex DAGMan Jobs

  1. Let's try a more complex example.  We will need several files for this.  Perform the following copies.
$ cp lab8.sh lab8a.sh

$ cp lab8.sh lab8b.sh

$ cp lab8.sh lab8c.sh

$ cp lab8.submit lab8a.submit

$ cp lab8.submit lab8b.submit

$ cp lab8.submit lab8c.submit

$

  2. In each of the new *.submit files you will need to modify the output and error lines. Each submit file should point to its own uniquely named files so that information is not overwritten. Change the lines to something like this in each file (lab8a.submit is shown; a sed shortcut for all of these edits appears after the next step):
output=z.lab8a.output  
error=z.lab8a.error
  3. Also, in the file lab8b.submit, change the arguments line to the following:
arguments=lab8b 120  

Leave the log entries alone. DAGMan requires that all nodes output their logs in the same location. Condor will ensure that the different jobs will not overwrite each other's entries in the log. (Newer versions of DAGMan lift this requirement, and allow each job to use its own log file -- but you may want to use one common log file anyway because it's convenient to have all of your job status information in a single place.)

log=z.lab8.log  
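If you would rather not edit the three copies by hand, the same changes can be made with sed. This is just a sketch (it assumes GNU sed's -i option and the file names used above); editing the files in a text editor works equally well:

$ for n in a b c; do sed -i "s/^output=.*/output=z.lab8${n}.output/; s/^error=.*/error=z.lab8${n}.error/" lab8${n}.submit; done
$ sed -i 's/^arguments=.*/arguments=lab8b 120/' lab8b.submit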
  4. Create a file called: complex.dag

Copy the following information into the file.

Job Setup lab8.submit
Job WorkNode1 lab8a.submit
Job WorkNode2 lab8b.submit
Job CollectResults lab8c.submit
PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 WorkNode2 CHILD CollectResults  

These instructions tell DAGMan that there are 4 jobs to run. Each job is referenced by its job name, and DAGMan passes the corresponding submit file to Condor. The PARENT lines tell DAGMan the order in which the jobs must run. First, Setup runs. If it completes successfully, its two children, WorkNode1 and WorkNode2, are run. Both of these nodes must complete successfully for CollectResults to run.
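The dependencies can equivalently be written one edge per line, which is close to the form DAGMan itself uses when it writes a rescue DAG later in this exercise:

PARENT Setup CHILD WorkNode1
PARENT Setup CHILD WorkNode2
PARENT WorkNode1 CHILD CollectResults
PARENT WorkNode2 CHILD CollectResults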

  5. Create a new file called: watch_condor_q.sh

Copy the following code into the file.

#! /bin/sh
while true; do
    echo ....
    echo .... Output from condor_q
    echo ....
    condor_q
    echo ....
    echo .... Output from condor_q -globus
    echo ....
    condor_q -globus
    echo ....
    echo .... Output from condor_q -dag
    echo ....
    condor_q -dag
    sleep 10
done

Set the execute bit:

$ chmod +x watch_condor_q.sh

$

This script will loop over calls to condor_q so we can more easily monitor the progress of our DAG job.
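If your system has the standard watch utility, a lighter-weight alternative to the loop above is the following one-liner (adjust the interval to taste):

$ watch -n 10 condor_q -dag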

  6. Now submit the DAG.
$ condor_submit_dag complex.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor  : complex.dag.condor.sub
Log of DAGMan debugging messages        : complex.dag.dagman.out
Log of Condor library debug messages    : complex.dag.lib.out
Log of the life of condor_dagman itself : complex.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: complex.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 20.
-----------------------------------------------------------------------
$
  7. Once it is running, start the watch script. You may need to increase the size of your window so you can see all of the output.
$ ./watch_condor_q.sh
 

Watch as each job is submitted and run. You will notice that the order is followed: first Setup runs, then WorkNode1 and WorkNode2. WorkNode2 will stay in the queue longer since it has a 120-second delay in it. CollectResults will not run until WorkNode2 finishes.

  8. After the job completes, type Ctrl-C to kill the watch_condor_q.sh script.

Now look at your output.

$ ls -ltra
-rw------- 1 btest btest    0 Dec 22 08:09 complex.dag.dummy_log
-rw-rw-r-- 1 btest btest  508 Dec 22 08:09 complex.dag.condor.sub
-rw-r--r-- 1 btest btest   31 Dec 22 08:09 z.lab8.error.0
-rw-r--r-- 1 btest btest  330 Dec 22 08:09 z.lab8.output.0
-rw-r--r-- 1 btest btest   31 Dec 22 08:10 z.lab8a.error.0
-rw-r--r-- 1 btest btest   31 Dec 22 08:10 z.lab8b.error.0
-rw-r--r-- 1 btest btest  330 Dec 22 08:10 z.lab8a.output.0
-rw-r--r-- 1 btest btest  321 Dec 22 08:12 z.lab8b.output.0
-rw-r--r-- 1 btest btest   31 Dec 22 08:12 z.lab8c.error.0
-rw-r--r-- 1 btest btest  330 Dec 22 08:12 z.lab8c.output.0
-rw-r--r-- 1 btest btest 3381 Dec 22 08:12 z.lab8.log
-rw-rw-r-- 1 btest btest   29 Dec 22 08:12 complex.dag.lib.out
-rw-r--r-- 1 btest btest 6072 Dec 22 08:12 complex.dag.dagman.out
-rw-rw-r-- 1 btest btest  608 Dec 22 08:12 complex.dag.dagman.log
$

Take some time to look at all of the output files and verify your results.
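One way to check the order in which the nodes finished is to pull the completion messages out of DAGMan's debug log (treat this as a sketch; the exact wording can vary between Condor versions):

$ grep 'completed successfully' complex.dag.dagman.out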

 

 

Recovering Failed DAGMan Jobs

  1. DAGMan can handle a situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG, making it easy to continue once the problem is fixed.
     
  2. Create a file called bad_script.sh and type the following into it:
#! /bin/sh

echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"  
echo "RESULT: 1 FAILURE"
exit 1

Set the execute bit:

$ chmod +x bad_script.sh

$

When it completes, this script will exit with a value of 1. DAGMan treats any non-zero exit value as an error. All scripts that are run with DAGMan need to exit with a value of zero when they finish successfully. If this is not done, unexpected results will occur.
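You can confirm this locally before involving Condor at all. This is just a quick sanity check run on the submit host; the argument values (LocalTest, 1) are arbitrary:

$ ./bad_script.sh LocalTest 1 > /dev/null 2>&1
$ echo $?
1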

  3. Now create a file called: bad.submit. This is our Condor submit file. Copy the following into it.
executable=bad_script.sh
output=z.bad.work2.output
error=z.bad.work2.error
log=z.bad.log
notification=never
universe=globus
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor  
arguments=WorkerNode2 60
queue
  4. We need to create one more script. This script will also show off one of the features of DAGMan. We will call this script postscript_checker.sh. It should look like the following:
#! /bin/sh
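# grep exits 0 when the pattern is found and non-zero otherwise; since it is
# the last command in this script, its exit status becomes the script's exit
# status, which is what DAGMan uses to decide whether the POST step succeeded.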
grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null  

Set the execute bit:

$ chmod +x postscript_checker.sh

$

DAGMan allows us to specify both PRE and POST scripts to be run along with a job. We simply name the job we want using its job name and then use either the PRE or POST keyword. In this case we will use POST. Our POST script looks at the output file of the job we point it at; if the success line is not found, the script exits with a non-zero status and DAGMan marks that node as failed.
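For reference, the general form of these lines in a DAG file is shown below. The prepare_inputs.sh PRE script is only a hypothetical example, not something created in this lab:

Script PRE  Setup prepare_inputs.sh
Script POST Setup postscript_checker.sh z.lab8.output.0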

  5. Finally, create a file called: bad.dag. This will be our DAGMan instruction file. Copy the following into it.
Job Setup lab8.submit
Job WorkNode1 bad.submit
Job WorkNode2 lab8b.submit
Job CollectResults lab8c.submit
PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 WorkNode2 CHILD CollectResults
Script POST Setup postscript_checker.sh z.lab8.output.0
Script POST WorkNode1 postscript_checker.sh z.bad.work2.output
Script POST WorkNode2 postscript_checker.sh z.lab8b.output
Script POST CollectResults postscript_checker.sh z.lab8c.output  
  6. Notice it is almost the same as our last DAG file. This time, however, WorkNode1 runs our bad script. We have also added lines to make sure that our POST script is run after each job, and we pass each job's output file to the POST script as a parameter. Submit the DAG and watch it run using watch_condor_q.sh.
$ condor_submit_dag bad.dag

-----------------------------------------------------------------------
File for submitting this DAG to Condor  : bad.dag.condor.sub
Log of DAGMan debugging messages        : bad.dag.dagman.out
Log of Condor library debug messages    : bad.dag.lib.out
Log of the life of condor_dagman itself : bad.dag.dagman.log

Condor Log file for all Condor jobs of this DAG: bad.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 63.
-----------------------------------------------------------------------

$ ./watch_condor_q.sh
 

You should see the various jobs and scripts running. Both WorkNode jobs run to completion, but the POST script for WorkNode1 fails, so CollectResults never runs and the DAG fails. Look in the file bad.dag.dagman.out. You should see something like:

$ cat bad.dag.dagman.out

3/10 12:16:24 Job WorkNode1 completed successfully.
3/10 12:16:24 Running POST script of Job WorkNode1...
3/10 12:16:24 Of 4 nodes total:
3/10 12:16:24  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
3/10 12:16:24   ===     ===      ===     ===     ===        ===      ===
3/10 12:16:24     1       0        1       1       0          1        0
3/10 12:16:29 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkNode1 (18.0.0)
3/10 12:16:29 POST Script of Job WorkNode1 failed with status 1
3/10 12:16:29 Of 4 nodes total:
3/10 12:16:29  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
3/10 12:16:29   ===     ===      ===     ===     ===        ===      ===
3/10 12:16:29     1       0        1       0       0          1        1
3/10 12:17:29 Event: ULOG_JOB_TERMINATED for Condor Job WorkNode2 (19.0.0)
3/10 12:17:29 Job WorkNode2 completed successfully.
3/10 12:17:29 Running POST script of Job WorkNode2...
3/10 12:17:29 Of 4 nodes total:
3/10 12:17:29  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
3/10 12:17:29   ===     ===      ===     ===     ===        ===      ===
3/10 12:17:29     1       0        0       1       0          1        1
3/10 12:17:34 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkNode2 (19.0.0)
3/10 12:17:34 POST Script of Job WorkNode2 completed successfully.
3/10 12:17:34 Of 4 nodes total:
3/10 12:17:34  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
3/10 12:17:34   ===     ===      ===     ===     ===        ===      ===
3/10 12:17:34     2       0        0       0       0          1        1
3/10 12:17:34 ERROR: the following job(s) failed:
3/10 12:17:34 ---------------------- Job ----------------------
3/10 12:17:34       Node Name: WorkNode1
3/10 12:17:34          NodeID: 1
3/10 12:17:34     Node Status: STATUS_ERROR
3/10 12:17:34           Error: POST Script failed with status 1
3/10 12:17:34 Job Submit File: bad.submit
3/10 12:17:34     POST Script: postscript_checker.sh z.bad.work2.output
3/10 12:17:34   Condor Job ID: (18.0.0)
3/10 12:17:34       Q_PARENTS: 0, 
3/10 12:17:34       Q_WAITING: 
3/10 12:17:34      Q_CHILDREN: 3, 
3/10 12:17:34 ---------------------------------------   
3/10 12:17:34 Aborting DAG...
3/10 12:17:34 Writing Rescue DAG to bad.dag.rescue...
3/10 12:17:34 **** condor_scheduniv_exec.16.0 (condor_DAGMAN) EXITING WITH STATUS 1
  7. The DAG failed, as we expected, when our POST script checked the output of the bad.submit job. Also take note of the line:
3/10 12:17:34 Writing Rescue DAG to bad.dag.rescue...
  8. This is the DAGMan rescue file. It can be used to rerun our DAG once we fix the errors that caused the original failure. Open up the rescue file and take a look at it.
$ cat bad.dag.rescue

# Rescue .dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
# WorkNode1,<ENDLIST>

JOB Setup lab8.submit DONE
SCRIPT POST Setup postscript_checker.sh z.lab8.output.0

JOB WorkNode1 bad.submit
SCRIPT POST WorkNode1 postscript_checker.sh z.bad.work2.output

JOB WorkNode2 lab8b.submit DONE
SCRIPT POST WorkNode2 postscript_checker.sh z.lab8b.output

JOB CollectResults lab8c.submit
SCRIPT POST CollectResults postscript_checker.sh z.lab8c.output


PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 CHILD CollectResults
PARENT WorkNode2 CHILD CollectResults
$

Take note of the tag DONE.  The DONE tag tells DAGMan that these jobs have completed and do not need to be rerun.  When you submit the rescue DAG, DONE nodes will be skipped.
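A quick way to see which nodes the rescue DAG marks as already complete (the header comment also matches, but the JOB ... DONE lines are the ones that matter):

$ grep DONE bad.dag.rescue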

  9. Fix bad_script.sh so that the job will complete successfully.
#! /bin/sh

echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"  
echo "RESULT: 0 SUCCESS"
exit 0
  10. Submit the job once again, only this time use bad.dag.rescue as the DAG file. Watch the job run using watch_condor_q.sh.
$ condor_submit_dag bad.dag.rescue

-----------------------------------------------------------------------
File for submitting this DAG to Condor  : bad.dag.rescue.condor.sub
Log of DAGMan debugging messages        : bad.dag.rescue.dagman.out
Log of Condor library debug messages    : bad.dag.rescue.lib.out
Log of the life of condor_dagman itself : bad.dag.rescue.dagman.log

Condor Log file for all Condor jobs of this DAG: bad.dag.rescue.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 27.
-----------------------------------------------------------------------

$ ./watch_condor_q.sh
 

If all goes well, this time the DAG will complete. Look at the files that have been created in your directory. You will note that they now reference bad.dag.rescue. Look in the file bad.dag.rescue.dagman.out. You should see that all the submitted jobs completed successfully. Also, you should see that only the jobs WorkNode1 and CollectResults were run; the other jobs had already completed successfully on the first run and did not need to be rerun. Since CollectResults depends on both WorkNode1 and WorkNode2, it had to wait until both of them had completed successfully before it could run.
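
As a final check, the same kind of grep used earlier will show which nodes completed in this pass (the exact message wording can vary between Condor versions):

$ grep 'completed successfully' bad.dag.rescue.dagman.out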