Lab Exercise: DAGMan
$ cd
$ mkdir lab8
$ cd lab8
$
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
exit 0
$ chmod +x lab8.sh
$
executable=lab8.sh
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
universe=globus
arguments=Example.$(Cluster).$(Process) 10
output=z.lab8.output.$(Process)
error=z.lab8.error.$(Process)
log=z.lab8.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue
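The $(Cluster) and $(Process) macros in the submit file are expanded by Condor at submit time. The sketch below shows the resulting values; cluster number 43 is just an assumed example, and with a single queue statement the process number is always 0.

```shell
#!/bin/sh
# Illustrative sketch only: Condor expands $(Cluster) and $(Process)
# when the job is submitted. 43 and 0 are made-up example values.
cluster=43
process=0
echo "arguments: Example.$cluster.$process 10"
echo "output:    z.lab8.output.$process"
echo "error:     z.lab8.error.$process"
```

This is why the output and error files you see later carry a trailing `.0`: they come from process 0 of the submitted cluster.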
Copy the following information into the file:
Job Simple lab8.submit
$ condor_submit_dag lab8.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor          : lab8.dag.condor.sub
Log of DAGMan debugging messages                : lab8.dag.dagman.out
Log of Condor library debug messages            : lab8.dag.lib.out
Log of the life of condor_dagman itself         : lab8.dag.dagman.log
Condor Log file for all Condor jobs of this DAG : lab8.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 43.
-----------------------------------------------------------------------
$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
43.0     btest    12/21 10:07   0+00:00:10  R  0   2.3  condor_dagman -f -
44.0     btest    12/21 10:07   0+00:00:00  I  0   0.0  lab8.sh Example.

2 jobs; 1 idle, 1 running, 0 held
$ condor_q -globus

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     STATUS       MANAGER  HOST                EXECUTABLE
44.0     btest    UNSUBMITTED   fork     ligo-server.ncsa.u  /home/btest/dag/pr
$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
43.0     btest    12/21 10:07   0+00:00:20  R  0   2.3  condor_dagman -f -
44.0     btest    12/21 10:07   0+00:00:02  R  0   0.0  lab8.sh Example.

2 jobs; 0 idle, 2 running, 0 held
$ condor_q -globus

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     STATUS       MANAGER  HOST                EXECUTABLE
44.0     btest    ACTIVE        fork     ligo-server.ncsa.u  /home/btest/dag/pr
$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD
43.0     btest    12/21 10:07   0+00:00:30  R  0   2.3  condor_dagman -f -
44.0     btest    12/21 10:07   0+00:00:11  C  0   0.0  lab8.sh Example.

1 jobs; 0 idle, 1 running, 0 held
$ condor_q -globus

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     STATUS       MANAGER  HOST                EXECUTABLE
44.0     btest    DONE          fork     ligo-server.ncsa.u  /home/btest/dag/pr
$ condor_q

-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
 ID      OWNER     SUBMITTED     RUN_TIME   ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
$
$ ls -la
total 48
drwxrwxr-x  2 mfreemon mfreemon 4096 Mar  9 23:38 .
drwx------ 23 mfreemon mfreemon 4096 Mar  9 22:14 ..
-rw-rw-r--  1 mfreemon mfreemon   23 Mar  9 23:36 lab8.dag
-rw-rw-r--  1 mfreemon mfreemon  482 Mar  9 23:37 lab8.dag.condor.sub
-rw-rw-r--  1 mfreemon mfreemon  608 Mar  9 23:38 lab8.dag.dagman.log
-rw-r--r--  1 mfreemon mfreemon 2814 Mar  9 23:38 lab8.dag.dagman.out
-rw-------  1 mfreemon mfreemon    0 Mar  9 23:37 lab8.dag.dummy_log
-rw-rw-r--  1 mfreemon mfreemon   29 Mar  9 23:38 lab8.dag.lib.out
-rwxrwxr-x  1 mfreemon mfreemon  298 Mar  8 22:57 lab8.sh
-rw-rw-r--  1 mfreemon mfreemon  306 Mar  9 23:35 lab8.submit
-rw-r--r--  1 mfreemon mfreemon   31 Mar  9 23:38 z.lab8.error.0
-rw-r--r--  1 mfreemon mfreemon  861 Mar  9 23:38 z.lab8.log
-rw-r--r--  1 mfreemon mfreemon  347 Mar  9 23:38 z.lab8.output.0
$
$ cp lab8.sh lab8a.sh
$ cp lab8.sh lab8b.sh
$ cp lab8.sh lab8c.sh
$ cp lab8.submit lab8a.submit
$ cp lab8.submit lab8b.submit
$ cp lab8.submit lab8c.submit
$
output=z.lab8a.output
error=z.lab8a.error
arguments=lab8b 120

Leave the log entries alone. DAGMan requires that all nodes write their log entries to the same location, and Condor ensures that the different jobs do not overwrite each other's entries in the log. (Newer versions of DAGMan lift this requirement and allow each job to use its own log file, but one common log file is often convenient anyway, since it keeps all of your job status information in a single place.)
log=z.lab8.log 
Copy the following information into the file.
Job Setup lab8.submit
Job WorkNode1 lab8a.submit
Job WorkNode2 lab8b.submit
Job CollectResults lab8c.submit
PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 WorkNode2 CHILD CollectResults

These instructions tell DAGMan that there are four jobs to run. Each job is referenced by its node name, and DAGMan passes the corresponding submit file to Condor. The PARENT lines tell DAGMan the order in which the jobs must run: first Setup runs; if it completes successfully, its two children, WorkNode1 and WorkNode2, are run; both of those must complete successfully before CollectResults can run.
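The dependency structure can be read back out of the DAG file mechanically. This sketch writes a temporary copy of the DAG above and prints each child node with the parents it waits for (simplified: real DAG files support more syntax than this handles).

```shell
#!/bin/sh
# Parse the PARENT ... CHILD ... lines of a DAG file and print each
# child node together with the parents it must wait for.
cat > demo.dag <<'EOF'
Job Setup lab8.submit
Job WorkNode1 lab8a.submit
Job WorkNode2 lab8b.submit
Job CollectResults lab8c.submit
PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 WorkNode2 CHILD CollectResults
EOF
deps=$(awk '$1 == "PARENT" {
    for (i = 2; i <= NF; i++) if ($i == "CHILD") { sep = i; break }
    for (c = sep + 1; c <= NF; c++) {
        line = $c " depends on:"
        for (p = 2; p < sep; p++) line = line " " $p
        print line
    }
}' demo.dag)
echo "$deps"
rm -f demo.dag
```

For the DAG above this reports that WorkNode1 and WorkNode2 each depend on Setup, and CollectResults depends on both WorkNode1 and WorkNode2.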
Copy the following code into the file.
#! /bin/sh
while true; do
echo ....
echo .... Output from condor_q
echo ....
condor_q
echo ....
echo .... Output from condor_q -globus
echo ....
condor_q -globus
echo ....
echo .... Output from condor_q -dag
echo ....
condor_q -dag
sleep 10
done

Set the execute bit:
$ chmod +x watch_condor_q.sh
$

This script loops over calls to condor_q so we can more easily monitor the progress of our DAG.
$ condor_submit_dag complex.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor : complex.dag.condor.sub
Log of DAGMan debugging messages : complex.dag.dagman.out
Log of Condor library debug messages : complex.dag.lib.out
Log of the life of condor_dagman itself : complex.dag.dagman.log
Condor Log file for all Condor jobs of this DAG: complex.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 20.
-----------------------------------------------------------------------
$
$ ./watch_condor_q.sh
Watch as each job is submitted and run. You will notice that the dependency order is followed: first Setup runs, then WorkNode1 and WorkNode2. WorkNode2 stays in the queue longer because it sleeps for 120 seconds. CollectResults will not run until WorkNode2 finishes.
Now look at your output.
$ ls -ltra
-rw-------  1 btest btest    0 Dec 22 08:09 complex.dag.dummy_log
-rw-rw-r--  1 btest btest  508 Dec 22 08:09 complex.dag.condor.sub
-rw-r--r--  1 btest btest   31 Dec 22 08:09 z.lab8.error.0
-rw-r--r--  1 btest btest  330 Dec 22 08:09 z.lab8.output.0
-rw-r--r--  1 btest btest   31 Dec 22 08:10 z.lab8a.error.0
-rw-r--r--  1 btest btest   31 Dec 22 08:10 z.lab8b.error.0
-rw-r--r--  1 btest btest  330 Dec 22 08:10 z.lab8a.output.0
-rw-r--r--  1 btest btest  321 Dec 22 08:12 z.lab8b.output.0
-rw-r--r--  1 btest btest   31 Dec 22 08:12 z.lab8c.error.0
-rw-r--r--  1 btest btest  330 Dec 22 08:12 z.lab8c.output.0
-rw-r--r--  1 btest btest 3381 Dec 22 08:12 z.lab8.log
-rw-rw-r--  1 btest btest   29 Dec 22 08:12 complex.dag.lib.out
-rw-r--r--  1 btest btest 6072 Dec 22 08:12 complex.dag.dagman.out
-rw-rw-r--  1 btest btest  608 Dec 22 08:12 complex.dag.dagman.log
$

Take some time to look at all of the output files and verify your results.
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1

Set the execute bit:
$ chmod +x bad_script.sh
$

When this script completes, it exits with a value of 1, which DAGMan treats as an error. All scripts run under DAGMan must exit with a value of zero when they finish successfully; if they do not, unexpected results will occur.
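Only the exit status matters to DAGMan, not the text a script prints. A minimal sketch of the convention:

```shell
#!/bin/sh
# DAGMan treats exit status 0 as success and any nonzero status as
# failure, regardless of what the script printed along the way.
sh -c 'echo "RESULT: 0 SUCCESS"; exit 0' > /dev/null \
    && echo "exit 0: DAGMan marks the node as done"
sh -c 'echo "RESULT: 1 FAILURE"; exit 1' > /dev/null \
    || echo "exit 1: DAGMan marks the node as failed"
```

This is also why a wrapper script must be careful not to mask a failing command's status with a later command that succeeds.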
executable=bad_script.sh
output=z.bad.work2.output
error=z.bad.work2.error
log=z.bad.log
notification=never
universe=globus
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
arguments=WorkerNode2 60
queue
#! /bin/sh
grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null

Set the execute bit:
$ chmod +x postscript_checker.sh
$

DAGMan allows us to specify both PRE and POST operations to be performed along with a job. We simply reference the job by its node name and then use either the PRE or POST tag; in this case we use the POST tag. Our post script examines the output of the job we indicate, and if it finds that the job failed, it exits with a nonzero status, telling DAGMan to mark the node as failed.
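You can verify the checker's behavior by hand before wiring it into the DAG. This sketch recreates its grep against two sample output files; the file names are made up for the demonstration.

```shell
#!/bin/sh
# Recreate postscript_checker.sh's test against two sample job outputs.
# grep exits 0 when the pattern is found and nonzero otherwise, and that
# exit status is exactly what DAGMan sees from the POST script.
printf 'RESULT: 0 SUCCESS\n' > good.output
printf 'RESULT: 1 FAILURE\n' > bad.output

check() { grep 'RESULT: 0 SUCCESS' "$1" > /dev/null 2>&1; }

check good.output && echo "good.output: POST exits 0, node succeeds"
check bad.output  || echo "bad.output:  POST exits 1, node fails"
rm -f good.output bad.output
```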
Job Setup lab8.submit
Job WorkNode1 bad.submit
Job WorkNode2 lab8b.submit
Job CollectResults lab8c.submit
PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 WorkNode2 CHILD CollectResults
Script POST Setup postscript_checker.sh z.lab8.output
Script POST WorkNode1 postscript_checker.sh z.bad.work2.output
Script POST WorkNode2 postscript_checker.sh z.lab8b.output
Script POST CollectResults postscript_checker.sh z.lab8c.output
$ condor_submit_dag bad.dag
-----------------------------------------------------------------------
File for submitting this DAG to Condor : bad.dag.condor.sub
Log of DAGMan debugging messages : bad.dag.dagman.out
Log of Condor library debug messages : bad.dag.lib.out
Log of the life of condor_dagman itself : bad.dag.dagman.log
Condor Log file for all Condor jobs of this DAG: bad.dag.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 63.
-----------------------------------------------------------------------

$ ./watch_condor_q.sh
You should see the various jobs and scripts running. Both WorkNode jobs should run to completion and then the DAG should fail. Look in the file: bad.dag.dagman.out. You should see something like:
$ cat bad.dag.dagman.out
3/10 12:16:24 Job WorkNode1 completed successfully.
3/10 12:16:24 Running POST script of Job WorkNode1...
3/10 12:16:24 Of 4 nodes total:
3/10 12:16:24  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
3/10 12:16:24   ===   ===      ===    ===     ===        ===      ===
3/10 12:16:24     1     0        1      1       0          1        0
3/10 12:16:29 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkNode1 (18.0.0)
3/10 12:16:29 POST Script of Job WorkNode1 failed with status 1
3/10 12:16:29 Of 4 nodes total:
3/10 12:16:29  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
3/10 12:16:29   ===   ===      ===    ===     ===        ===      ===
3/10 12:16:29     1     0        1      0       0          1        1
3/10 12:17:29 Event: ULOG_JOB_TERMINATED for Condor Job WorkNode2 (19.0.0)
3/10 12:17:29 Job WorkNode2 completed successfully.
3/10 12:17:29 Running POST script of Job WorkNode2...
3/10 12:17:29 Of 4 nodes total:
3/10 12:17:29  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
3/10 12:17:29   ===   ===      ===    ===     ===        ===      ===
3/10 12:17:29     1     0        0      1       0          1        1
3/10 12:17:34 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkNode2 (19.0.0)
3/10 12:17:34 POST Script of Job WorkNode2 completed successfully.
3/10 12:17:34 Of 4 nodes total:
3/10 12:17:34  Done   Pre   Queued   Post   Ready   Un-Ready   Failed
3/10 12:17:34   ===   ===      ===    ===     ===        ===      ===
3/10 12:17:34     2     0        0      0       0          1        1
3/10 12:17:34 ERROR: the following job(s) failed:
3/10 12:17:34 ---------------------- Job ----------------------
3/10 12:17:34       Node Name: WorkNode1
3/10 12:17:34          NodeID: 1
3/10 12:17:34     Node Status: STATUS_ERROR
3/10 12:17:34           Error: POST Script failed with status 1
3/10 12:17:34 Job Submit File: bad.submit
3/10 12:17:34     POST Script: postscript_checker.sh z.bad.work2.output
3/10 12:17:34   Condor Job ID: (18.0.0)
3/10 12:17:34       Q_PARENTS: 0,
3/10 12:17:34       Q_WAITING:
3/10 12:17:34      Q_CHILDREN: 3,
3/10 12:17:34 ---------------------------------------
3/10 12:17:34 Aborting DAG...
3/10 12:17:34 Writing Rescue DAG to bad.dag.rescue...
3/10 12:17:34 **** condor_scheduniv_exec.16.0 (condor_DAGMAN) EXITING WITH STATUS 1
Note in particular the line reporting that a Rescue DAG was written:

12/23 08:34:01 Writing Rescue DAG to bad.dag.rescue...
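When a dagman.out file is long, grep can pull out just the failure summary. A sketch against a few sample lines (abbreviated from the log above):

```shell
#!/bin/sh
# Extract POST-script failures and the error summary from a
# dagman.out-style log. Sample lines are inlined for the demonstration.
cat > sample.dagman.out <<'EOF'
3/10 12:16:29 POST Script of Job WorkNode1 failed with status 1
3/10 12:17:34 POST Script of Job WorkNode2 completed successfully.
3/10 12:17:34 ERROR: the following job(s) failed:
EOF
failures=$(grep -E 'failed with status|ERROR:' sample.dagman.out)
echo "$failures"
rm -f sample.dagman.out
```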
$ cat bad.dag.rescue
# Rescue .dag DAG file
#
# Total number of Nodes: 4
# Nodes premarked DONE: 2
# Nodes that failed: 1
# WorkNode1,<ENDLIST>
JOB Setup lab8.submit DONE
SCRIPT POST Setup postscript_checker prog_a.output
JOB WorkNode1 bad.submit
SCRIPT POST WorkNode1 postscript_checker results.work2.output
JOB WorkNode2 lab8b.submit DONE
SCRIPT POST WorkNode2 postscript_checker prog_c.output
JOB CollectResults lab8c.submit
SCRIPT POST CollectResults postscript_checker prog_d.output
PARENT Setup CHILD WorkNode1 WorkNode2
PARENT WorkNode1 CHILD CollectResults
PARENT WorkNode2 CHILD CollectResults
$

Take note of the DONE tag. It tells DAGMan that these jobs have already completed and do not need to be rerun. When you submit the rescue DAG, nodes marked DONE are skipped.
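One way to see at a glance which nodes a rescue DAG will skip is to filter for the DONE tag. This sketch uses an inline sample that mirrors the rescue file above:

```shell
#!/bin/sh
# List the nodes premarked DONE in a rescue DAG; DAGMan skips these.
cat > sample.rescue <<'EOF'
JOB Setup lab8.submit DONE
JOB WorkNode1 bad.submit
JOB WorkNode2 lab8b.submit DONE
JOB CollectResults lab8c.submit
EOF
done_nodes=$(awk '$1 == "JOB" && $NF == "DONE" { print $2 }' sample.rescue)
echo "Nodes DAGMan will skip:" $done_nodes
rm -f sample.rescue
```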
Before resubmitting, edit bad_script.sh so that it reports success and exits with a value of zero:

#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
exit 0
$ condor_submit_dag bad.dag.rescue
-----------------------------------------------------------------------
File for submitting this DAG to Condor : bad.dag.rescue.condor.sub
Log of DAGMan debugging messages : bad.dag.rescue.dagman.out
Log of Condor library debug messages : bad.dag.rescue.lib.out
Log of the life of condor_dagman itself : bad.dag.rescue.dagman.log
Condor Log file for all Condor jobs of this DAG: bad.dag.rescue.dummy_log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 27.
-----------------------------------------------------------------------

$ ./watch_condor_q.sh
If all goes well, this time the DAG will complete. Look at the files that have been created in your directory; you will note that they now reference bad.dag.rescue. Look in the file bad.dag.rescue.dagman.out. You should see that all of the submitted jobs completed successfully, and that only WorkNode1 and CollectResults were actually run. The other jobs had already completed successfully on the first run and did not need to be rerun. Since CollectResults depends on both WorkNode1 and WorkNode2, it had to wait until both had completed successfully before running.