Lab Exercise: CondorG
- The environment variable GLOBUS_HOSTNAME
- Condor configuration parameters NETWORK_INTERFACE and CONDOR_HOST
$ globus-hostname
192.168.0.203
$
If it is not, export it as follows:
$ export GLOBUS_HOSTNAME=192.168.0.203
$ echo $GLOBUS_HOSTNAME
192.168.0.203
$
Update the values for NETWORK_INTERFACE and CONDOR_HOST to your IP address. The other lines should remain unchanged.
.
.
NETWORK_INTERFACE = 192.168.0.203
.
.
CONDOR_HOST = 192.168.0.203
.
.
$ condor_master
$ ps -ef | grep condor
mfreemon 24678 1 0 22:34 ? 00:00:00 condor_master
mfreemon 24679 24678 0 22:34 ? 00:00:00 condor_collector -f
mfreemon 24680 24678 0 22:34 ? 00:00:00 condor_negotiator -f
mfreemon 24681 24678 8 22:34 ? 00:00:05 condor_startd -f
mfreemon 24682 24678 0 22:34 ? 00:00:00 condor_schedd -f
mfreemon 24742 23491 0 22:36 pts/3 00:00:00 grep condor
$
echo "*********************************************"
my_ip_address=`/sbin/ifconfig eth0 | grep "inet addr" | \
cut -d: -f2 | cut -d ' ' -f1`
echo "my local hostname is " `hostname`
echo "my local IP address is " $my_ip_address
echo "globus-hostname is " `globus-hostname`
echo "condor host is " `condor_config_val condor_host`
echo "condor network interface is " `condor_config_val network_interface`
echo "*********************************************"

(Optional) You may also want to update your local /etc/hosts file so that local name lookups for your local hostname resolve correctly. Otherwise, Condor will not be able to send job notification emails.
$ condor_version
$CondorVersion: 6.6.6 Jul 26 2004 $
$CondorPlatform: I386-LINUX_RH9 $
$
$ condor_status
Name OpSys Arch State Activity LoadAv Mem ActvtyTime
ldas-grid.n LINUX INTEL Unclaimed Idle 0.000 501 0+00:05:04
Machines Owner Claimed Unclaimed Matched Preempting
INTEL/LINUX 1 0 0 1 0 0
Total 1 0 0 1 0 0
$
$ condor_q
-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.171:35056> : ldas-grid.ligo-la.caltech.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
$

If you're logged in to the server and want to see just your jobs, you can specify your userid as follows:
$ condor_q mfreemon
-- Submitter: mfreemon@ligo : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
28098.0 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.1 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.2 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.3 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.4 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.5 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.6 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.7 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.8 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
28098.9 mfreemon 2/24 15:50 0+00:00:00 I 0 0.0 hostname
29105.0 mfreemon 2/25 15:44 0+00:00:00 I 0 0.0 condor_simple.sh E
11 jobs; 11 idle, 0 running, 0 held
$

See the Condor manual for complete documentation on the condor_q command.
Start by creating a directory in your home directory called lab7 and cd into it:
$ cd
$ mkdir lab7
$ cd lab7

This new lab7 directory should be used to contain any files we create during the remainder of this lab exercise.
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
exit 42
executable=lab7.sh
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
universe=globus
arguments=Example.$(Cluster).$(Process) 100
output=z.lab7.output.$(Process)
error=z.lab7.error.$(Process)
log=z.lab7.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue

Looking at the submit file, note several tags. The executable tag tells Condor the name of the program to run; in this case it is the shell script we just created. The arguments tag lists the arguments that will be passed to the running executable. Our shell script takes two arguments. The first is a string built from two predefined macros: $(Cluster) and $(Process), which Condor expands to the cluster number assigned to this submission and the process number (starting at 0) of each job within the cluster. The second argument is the value passed to the sleep command, telling the program how long to sleep before continuing; in this case, it is set to 100 seconds.
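Before submitting through Condor-G, it can help to run the script by hand with the same arguments Condor would pass. This is only a local dry run (the script is recreated inline here so the example stands alone, and the sleep is shortened to 1 second):

```shell
# Recreate lab7.sh and run it as Condor would for cluster 13, process 0.
cat > lab7.sh <<'EOF'
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
exit 42
EOF
sh lab7.sh Example.13.0 1
echo "exit status: $?"    # the script exits 42 by design
```

Note how the two command-line arguments show up in the output, and that the "standard error" line goes to stderr, not stdout.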
$ condor_submit lab7.submit Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 13. $
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:36350> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
13.0 mfreemon 3/9 22:21 0+00:00:00 I 0 0.0 lab7.sh Example.13
1 jobs; 1 idle, 0 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:36350> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
13.0 mfreemon 3/9 22:21 0+00:00:36 R 0 0.0 lab7.sh Example.13
1 jobs; 0 idle, 1 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:36350> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
$
output=z.lab7.output.$(Process)
error=z.lab7.error.$(Process)
log=z.lab7.log

The output file will contain the standard output of the executable. The error file will contain any output that the program directs to stderr. The log file is Condor's log of the job. Look at each file in turn.
$ cat z.lab7.error.0
This is sent to standard error
$ cat z.lab7.log
000 (015.000.000) 12/15 10:38:06 Job submitted from host: <141.142.96.174:33149>
...
017 (015.000.000) 12/15 10:38:19 Job submitted to Globus
    RM-Contact: ligo-server.ncsa.uiuc.edu/jobmanager-condor
    JM-Contact: https://ligo-server.ncsa.uiuc.edu:38307/24309/1103128689/
    Can-Restart-JM: 1
...
001 (015.000.000) 12/15 10:38:19 Job executing on host: ligo-server.ncsa.uiuc.edu
...
005 (015.000.000) 12/15 10:40:11 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
$ cat z.lab7.output.0
I'm process id 25399 on node3
Wed Mar 9 23:18:32 CST 2005
Running as binary /data2/mfreemon/.globus/.gass_cache/local/md5/25/c7a5f6954e29c32437d6f95efdd3bd/md5/aa/fba59e77460a1ee1668459ceeb5fb0/data Example.13.0 100
My name (argument 1) is Example.13.0
My sleep duration (argument 2) is 100
Sleep of 100 seconds finished. Exiting
$
Copy the following information into the file:
executable=lab7.sh
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
universe=globus
arguments=Example.$(Cluster).$(Process) 5
output=z.lots.output.$(Process)
error=z.lots.error.$(Process)
log=z.lots.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
Queue 15

Looking at this file, you should see that it is almost exactly the same as the previous submit file. The only changes are the names of the files that will be written and the last tag, Queue. The Queue tag tells Condor how many instances of the executable to run; in this case, 15 instances of lab7.sh will be queued at once. One thing to keep in mind when telling Condor to run multiple instances of an executable is what will happen to the output. In the submit file above, we have appended the $(Process) number to the end of each output and error file name, so Condor creates 15 different files, each unique because of that number. Had we not done this, Condor would have used the same file for all 15 jobs.
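The effect of the $(Process) macro on file names can be sketched with a small local loop. This is a simulation only, not something Condor runs; the cluster number 29 matches the submit transcript:

```shell
# Local simulation of Condor's macro expansion: each job in the cluster
# gets its own process number, so the output files never collide.
cluster=29
for process in 0 1 2; do            # first 3 of the 15 queued jobs
  outfile="z.lots.output.${process}"
  args="Example.${cluster}.${process} 5"
  echo "job ${cluster}.${process}: arguments=${args} output=${outfile}"
done
```

Without the $(Process) suffix, all 15 jobs would share a single output file name.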
$ condor_submit condor_lots.submit
Submitting job(s)...............
Logging submit event(s)...............
15 job(s) submitted to cluster 29.
$
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:35056> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.0 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.1 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.2 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.3 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.4 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.5 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.6 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.7 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.8 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.9 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.10 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.11 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.12 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.13 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
9.14 mfreemon 3/7 00:02 0+00:00:00 I 0 0.0 lab7.sh E
16 jobs; 16 idle, 0 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:35056> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
9.2 mfreemon 3/7 00:02 0+00:00:12 R 0 0.0 lab7.sh E
9.9 mfreemon 3/7 00:02 0+00:00:10 C 0 0.0 lab7.sh E
2 jobs; 1 idle, 1 running, 0 held
$
$ ls -la
drwxrwxr-x 2 mfreemon mfreemon 4096 Mar 7 00:02 .
drwx------ 21 mfreemon mfreemon 4096 Mar 6 23:41 ..
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.0
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.1
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:02 lots.error.10
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.11
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.12
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.13
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.14
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.2
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.3
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.4
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.5
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.6
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.7
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:02 lots.error.8
-rw-rw-r-- 1 mfreemon mfreemon 31 Mar 7 00:03 lots.error.9
-rw-rw-r-- 1 mfreemon mfreemon 12585 Mar 7 00:04 lots.log
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.0
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.1
-rw-rw-r-- 1 mfreemon mfreemon 326 Mar 7 00:02 lots.output.10
-rw-rw-r-- 1 mfreemon mfreemon 326 Mar 7 00:03 lots.output.11
-rw-rw-r-- 1 mfreemon mfreemon 325 Mar 7 00:03 lots.output.12
-rw-rw-r-- 1 mfreemon mfreemon 325 Mar 7 00:03 lots.output.13
-rw-rw-r-- 1 mfreemon mfreemon 326 Mar 7 00:03 lots.output.14
-rw-rw-r-- 1 mfreemon mfreemon 323 Mar 7 00:03 lots.output.2
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.3
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.4
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.5
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.6
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:03 lots.output.7
-rw-rw-r-- 1 mfreemon mfreemon 324 Mar 7 00:02 lots.output.8
-rw-rw-r-- 1 mfreemon mfreemon 323 Mar 7 00:03 lots.output.9
$
Copy lab7.submit to, or create a new file called, lab7a.submit, and edit it as follows:
executable=lab7.sh
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
universe=globus
arguments=Example.$(Cluster).$(Process) 10
output=z.lab7a.output.$(Process)
error=z.lab7a.error.$(Process)
log=z.lab7a.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
Initialdir = run_1
queue
Initialdir = run_2
queue

The Initialdir tag is "Used to give jobs a directory with respect to file input and output. Also provides a directory (on the machine from which the job is submitted) for the user log, when a full path is not specified."
For this and other tags used by condor_submit, look in the condor manual.
http://www.cs.wisc.edu/condor/manual/v6.6/condor_submit.html
$ mkdir run_1
$ mkdir run_2
$
$ condor_submit lab7a.submit
Submitting job(s)..
Logging submit event(s)..
2 job(s) submitted to cluster 28.
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:36350> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
15.0 mfreemon 3/9 22:52 0+00:00:00 I 0 0.0 lab7.sh Example.15
15.1 mfreemon 3/9 22:52 0+00:00:00 I 0 0.0 lab7.sh Example.15
2 jobs; 2 idle, 0 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:36350> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
15.0 mfreemon 3/9 22:52 0+00:01:14 R 0 0.0 lab7.sh Example.15
15.1 mfreemon 3/9 22:52 0+00:01:24 R 0 0.0 lab7.sh Example.15
2 jobs; 0 idle, 2 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.171:36350> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
$

Look at the contents of both directories run_1 and run_2.
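What Initialdir does can be simulated locally. This is a sketch only, not actual Condor behavior; the file names follow the submit file above. Job 0's files land in run_1 and job 1's in run_2, because each queue statement uses the most recently set Initialdir:

```shell
# Simulate where the two jobs' output files land: each queue statement
# inherits the Initialdir line that precedes it in the submit file.
mkdir -p run_1 run_2
echo "simulated output of job 0" > run_1/z.lab7a.output.0
echo "simulated output of job 1" > run_2/z.lab7a.output.1
ls run_1 run_2
```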
For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.
executable=lab7.sh
globusscheduler = ldas-grid.ligo-la.caltech.edu/jobmanager-condor
universe=globus
arguments=Example.$(Cluster).$(Process) 5
output=z.hold.output.$(Process)
error=z.hold.error.$(Process)
log=z.hold.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue
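The hold is triggered by removing write permission from the first output file before the job finishes; the real exercise combines this with condor_submit in the next step. Locally, the permission change looks like this:

```shell
# Create the output file and strip its write bits; when the jobmanager
# tries to copy results back into it, the transfer fails and Condor
# places the job on hold.
touch z.hold.output.0
chmod -w z.hold.output.0
ls -l z.hold.output.0     # mode now begins with -r--
```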
$ condor_submit condor_hold.submit ; chmod -w z.hold.output.0
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 18.
$
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
18.0 btest 12/15 14:16 0+00:00:00 I 0 0.0 condor_hold.sh Exa
1 jobs; 1 idle, 0 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
18.0 btest 12/15 14:16 0+00:00:16 R 0 0.0 condor_hold.sh Exa
1 jobs; 0 idle, 1 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
18.0 btest 12/15 14:16 0+00:00:53 H 0 0.0 condor_hold.sh Exa
1 jobs; 0 idle, 0 running, 1 held
$
condor_q -held information.
$ condor_q -held
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
ID OWNER HELD_SINCE HOLD_REASON
18.0 btest 12/15 14:17:20 Globus error 129: the standard output/error size is different
1 jobs; 0 idle, 0 running, 1 held
$

Log file information.
$ cat z.hold.log
000 (018.000.000) 12/15 14:16:14 Job submitted from host: <141.142.96.174:33149>
...
017 (018.000.000) 12/15 14:16:27 Job submitted to Globus
    RM-Contact: ldas-grid.ligo-la.caltech.edu/jobmanager-condor
    JM-Contact: https://ldas-grid.ligo-la.caltech.edu:40046/13167/1110434858/
    Can-Restart-JM: 1
...
001 (018.000.000) 12/15 14:16:27 Job executing on host: ldas-grid.ligo-la.caltech.edu
...
012 (018.000.000) 12/15 14:17:20 Job was held.
    Globus error 129: the standard output/error size is different
    Code 2 Subcode 129
...
$ chmod +w z.hold.output.0
$
$ condor_release -all
All jobs released.
$
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
18.0 btest 12/15 14:16 0+00:00:53 I 0 0.0 condor_hold.sh Exa
1 jobs; 1 idle, 0 running, 0 held
$ condor_q
-- Submitter: ligo-client.ncsa.uiuc.edu : <141.142.96.174:33149> : ligo-client.ncsa.uiuc.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
$
$ cat z.hold.log
013 (018.000.000) 12/15 14:30:23 Job was released.
    via condor_release (by user mfreemon)
...
001 (018.000.000) 12/15 14:30:38 Job executing on host: ldas-grid.ligo-la.caltech.edu
...
005 (018.000.000) 12/15 14:30:43 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
    0 - Total Bytes Sent By Job
    0 - Total Bytes Received By Job
...
$ condor_off -master
Sent "Kill-Daemon" command for "master" to local master
$