Lab Exercise: Condor (Local)
$ ssh ldas-grid.ligo-la.caltech.edu
$
$ condor_version
$CondorVersion: 6.7.3 Dec 28 2004 $
$CondorPlatform: I386-LINUX_RH9 $
$
$ condor_status

Name          OpSys       Arch   State      Activity   LoadAv  Mem   ActvtyTime

vm1@ldas-grid LINUX       INTEL  Owner      Idle       0.040    700  11+06:25:28
vm2@ldas-grid LINUX       INTEL  Owner      Idle       0.000   1200  11+06:25:29
vm1@node10.li LINUX       INTEL  Claimed    Busy       1.000    700   0+00:01:51
[..snip..]
vm2@node8.lig LINUX       INTEL  Claimed    Busy       1.010   1200   0+00:13:00
vm1@node9.lig LINUX       INTEL  Claimed    Busy       1.000    700   0+00:11:42
vm2@node9.lig LINUX       INTEL  Claimed    Busy       1.000   1200   0+00:24:28

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX      138     2     136         0       0          0

               Total      138     2     136         0       0          0
$
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
61122.0   dbrown          3/7  16:38   0+00:00:00 I  0   16.3 lalapps_inspiral -
61123.0   dbrown          3/7  16:38   0+00:00:00 I  0   16.3 lalapps_inspiral -
61124.0   dbrown          3/7  16:38   0+00:00:00 I  0   16.3 lalapps_inspiral -
[..snip..]
61140.0   kipp            3/7  16:45   0+00:06:35 R  0    2.4 condor_dagman -f -
61141.0   kipp            3/7  16:45   0+00:06:28 R  0    0.0 dagdbUpdator -j 13
61143.0   kipp            3/7  16:45   0+00:06:07 R  0   18.0 lalapps_power --wi

988 jobs; 820 idle, 168 running, 0 held
$

If you're logged into the server and want to see just your jobs, you can specify your userid as follows:
$ condor_q mfreemon

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
28098.0   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.1   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.2   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.3   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.4   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.5   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.6   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.7   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.8   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
28098.9   mfreemon        2/24 15:50   0+00:00:00 I  0    0.0 hostname
29105.0   mfreemon        2/25 15:44   0+00:00:00 I  0    0.0 condor_simple.sh E

11 jobs; 11 idle, 0 running, 0 held
$

Complete documentation on the condor_q command can be found in the Condor manual.
Start by creating a directory called lab6 in your home directory on the server and cd into it:

$ cd
$ mkdir lab6
$ cd lab6

This new lab6 directory will contain all of the files we create during the remainder of this lab exercise.
Create a file: lab6.sh
Copy the following into the file.

#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
exit 42
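Although not part of the original exercise, it can help to dry-run the script locally before handing it to Condor. The sketch below recreates lab6.sh with the same contents as above and runs it with a 1-second sleep; the argument Example.0.0 stands in for the Example.$(Cluster).$(Process) string that Condor would normally supply. Note the deliberate exit status of 42.

```shell
# Recreate lab6.sh locally (same contents as shown above).
cat > lab6.sh <<'EOF'
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
exit 42
EOF
chmod +x lab6.sh

# Dry run with a placeholder name and a 1-second sleep.
# The script exits 42 by design, so we capture and report the status.
sh lab6.sh Example.0.0 1
echo "exit status: $?"
```

Seeing the expected lines on stdout, the error line on stderr, and an exit status of 42 confirms the script behaves before any Condor machinery is involved.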
Create a file: lab6.submit
Copy the following into the file.

executable=lab6.sh
universe=vanilla
arguments=Example.$(Cluster).$(Process) 5
output=z.lab6.output.$(Process)
error=z.lab6.error.$(Process)
log=z.lab6.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue 10

Looking at the submit file, you should note several tags. The executable tag tells Condor the name of the program to run; in this case it is the shell script we just created. The arguments tag lists the arguments that will be passed to the running executable. Looking at our shell script, we see it takes two arguments. The first is a string built from two predefined macros: $(Cluster) and $(Process) are values that Condor fills in at submit time, referring to the cluster number assigned to this submission and to each job's index within that cluster. The second argument is the value passed to the sleep command; it tells the program how long to sleep before continuing, in this case 5 seconds.
The queue tag tells Condor how many instances of the executable to run; in this case, 10 instances of lab6.sh will be queued to run simultaneously. One thing to keep in mind when telling Condor to run multiple instances of an executable is what will happen to the output. In the submit file above we appended the process number to the end of each output and error file name, so Condor creates 10 different files, each unique because of that number. If we had not done this, Condor would have used the same file for all 10 jobs.
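The effect of the $(Process) substitution can be illustrated with a plain shell stand-in (this runs locally, outside Condor; the z.demo.* names are made up for the demonstration): ten "jobs", ten distinct output files.

```shell
# Simulate 10 jobs, each writing to its own process-numbered file,
# mimicking what output=z.lab6.output.$(Process) achieves under Condor.
for p in 0 1 2 3 4 5 6 7 8 9; do
    echo "output of job $p" > "z.demo.output.$p"
done

# One file per simulated process: z.demo.output.0 .. z.demo.output.9
ls z.demo.output.*
```

Without the per-process suffix, every iteration would have overwritten the same file, which is exactly the pitfall described above.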
$ condor_submit lab6.submit
Submitting job(s)...............
Logging submit event(s)...............
10 job(s) submitted to cluster 29.
$
$ condor_q mfreemon

-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.1   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.2   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.3   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.4   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.5   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.6   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.7   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.8   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E
   9.9   mfreemon        3/7  00:02   0+00:00:00 I  0    0.0 lab6.sh E

194 jobs; 40 idle, 154 running, 0 held
$
Recall the output-related lines from the submit file:

output=z.lab6.output.$(Process)
error=z.lab6.error.$(Process)
log=z.lab6.log

The output file will contain the standard output of the executable. The error file will contain any output the program directs to stderr. The log file is Condor's log of the job. Look at each file in turn.
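Once all jobs have finished, a quick way to confirm how many completed is to count termination events in the user log. The sketch below fabricates a minimal two-event log (sample.log, invented for this illustration) so the command can be shown in isolation; against the real file you would grep z.lab6.log instead.

```shell
# Build a tiny stand-in for a Condor user log (fabricated sample data;
# event 000 = submitted, event 005 = terminated).
cat > sample.log <<'EOF'
000 (029.000.000) 03/07 16:56:01 Job submitted from host: <10.13.0.12:32772>
005 (029.000.000) 03/07 16:57:12 Job terminated.
005 (029.001.000) 03/07 16:57:14 Job terminated.
EOF

# Count "Job terminated" events -- one per completed job.
grep -c 'Job terminated' sample.log
```

For the lab's real run, `grep -c 'Job terminated' z.lab6.log` should report 10 once all instances of lab6.sh are done.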
$ ls -la
total 98
drwxrwxr-x    2 mfreemon mfreemon      776 Mar  7 16:56 .
drwx------    7 mfreemon mfreemon      464 Mar  7 16:51 ..
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.0
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.1
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.2
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.3
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.4
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.5
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.6
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.7
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.8
-rw-rw-r--    1 mfreemon mfreemon       31 Mar  7 16:57 z.lab6.error.9
-rw-rw-r--    1 mfreemon mfreemon      150 Mar  7 16:57 z.lab6.log
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.0
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.1
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.2
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.3
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.4
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.5
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.6
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.7
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.8
-rw-rw-r--    1 mfreemon mfreemon      256 Mar  7 16:57 z.lab6.output.9
-rw-rw-r--    1 mfreemon mfreemon      273 Mar  7 16:53 lab6.sh
-rw-rw-r--    1 mfreemon mfreemon      241 Mar  7 16:55 lab6.submit
$ cat z.lab6.error.0
This is sent to standard error
$ cat z.lab6.log
000 (015.000.000) 12/15 10:38:06 Job submitted from host: <141.142.96.174:33149>
...
017 (015.000.000) 12/15 10:38:19 Job submitted to Globus
    RM-Contact: ldas-grid.ligo-la.caltech.edu/jobmanager-condor
    JM-Contact: https://ligo-server.ncsa.uiuc.edu:38307/24309/1103128689/
    Can-Restart-JM: 1
...
001 (015.000.000) 12/15 10:38:19 Job executing on host: ldas-grid.ligo-la.caltech.edu
...
005 (015.000.000) 12/15 10:40:11 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    0  -  Run Bytes Sent By Job
    0  -  Run Bytes Received By Job
    0  -  Total Bytes Sent By Job
    0  -  Total Bytes Received By Job
...
$ cat z.lab6.output.0
I'm process id 18036 on node54
Mon Mar  7 17:05:22 CST 2005
Running as binary /usr1/condor/execute/dir_18034/condor_exec.exe Example.61245.0 5
My name (argument 1) is Example.61245.0
My sleep duration (argument 2) is 5
Sleep of 5 seconds finished. Exiting
$
Create a file: condor_req.submit
Copy the following into the file.
executable=lab6.sh
Requirements = Memory >= 32 && OpSys == "LINUX" && Arch == "INTEL"
universe=vanilla
arguments=Example.$(Cluster).$(Process) 5
output=z.req.output.$(Process)
error=z.req.error.$(Process)
log=z.req.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue

Requirements for the job are defined by the requirements tag. In this case we have told Condor that we need a minimum of 32 megabytes of memory, that the operating system must be Linux, and that the processor must be Intel-based. You can find a full listing of the attributes that can be used in requirements expressions in the Condor manual.
http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html
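As a further, purely hypothetical variation on the requirements line above, a job that also needed roughly 100 MB of scratch disk could tighten the expression like this. (Keep the units in mind: Condor's Memory attribute is expressed in megabytes, but Disk is expressed in kilobytes.)

```
Requirements = Memory >= 32 && Disk >= 102400 && OpSys == "LINUX" && Arch == "INTEL"
```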
$ condor_submit condor_req.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 30.
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  30.0   btest           12/17 09:44   0+00:00:03 R  0    0.0 lab6.sh E

1 jobs; 0 idle, 1 running, 0 held
$
Comprehensive documentation on submitting jobs can be found at http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html
Next we will deliberately submit a job whose requirements cannot be satisfied, so we can practice debugging.

Create a file: condor_bad.submit
Copy the following into the file:
executable=condor_bad.sh
universe=vanilla
arguments=Example.$(Cluster).$(Process) 10
output=z.results.output.$(Process)
error=z.results.error.$(Process)
log=z.results.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
requirements=Memory>2000
queue
Create a file: condor_bad.sh
Input the following into the file.
#!/bin/sh
echo $1
$ condor_submit condor_bad.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 31.
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   btest           12/20 08:02   0+00:00:00 I  0    0.0 condor_bad.sh Exam

1 jobs; 1 idle, 0 running, 0 held
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   btest           12/20 08:02   0+00:00:00 I  0    0.0 condor_bad.sh Exam

1 jobs; 1 idle, 0 running, 0 held
$ condor_q -analyze 31
-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
---
031.000: Run analysis summary. Of 2 machines,
2 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match, but are serving users with a better priority in the pool
0 match, but reject the job for unknown reasons
0 match, but will not currently preempt their existing job
0 are available to run your job
No successful match recorded.
Last failed match: Mon Dec 20 08:02:07 2004
Reason for last match failure: no match found
WARNING: Be advised:
No resources matched request's constraints
Check the Requirements expression below:

Requirements = (Memory > 2000) && (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && (HasFileTransfer)
$

According to the information Condor has returned, we can see that the job failed to match because of its requirements. In this case we know which part of the requirements expression caused the failure, but that will not always be true. While the output of condor_q -analyze is useful, it is not all that we could hope for when debugging.
Perform the following commands:
$ cd
$ wget http://www.cs.wisc.edu/~adesmet/condor_analyze.gz
$ gunzip condor_analyze.gz
$ chmod a+x condor_analyze
$ cd lab6
$
$ ~/condor_analyze 31

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
031.000:  Run analysis summary.  Of 2 machines,
      2 are rejected by your job's requirements
No successful match recorded.
Last failed match: Mon Dec 20 08:02:07 2004
Reason for last match failure: no match found
WARNING:  Be advised:
   No machines matched job's requirements
   The Requirements expression for your job is:

   ( target.Memory > 2000 ) && ( target.Arch == "INTEL" ) &&
   ( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
   ( target.HasFileTransfer )

       Condition                      Machines Matched   Suggestion
       ---------                      ----------------   ----------
   1   ( target.Memory > 2000 )       0                  MODIFY TO 500
   2   ( target.Arch == "INTEL" )     2
   3   ( target.OpSys == "LINUX" )    2
   4   ( target.Disk >= 1 )           2
   5   ( target.HasFileTransfer )     2

1 jobs; 1 idle, 0 running, 0 held
$

The utility informs us that the problem is with the memory requirement. We are also given a suggested value to change the requirement to.
$ condor_q -long
-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 31
QDate = 1103551327
CompletionDate = 0
Owner = "btest"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
RootDir = "/"
Iwd = "/home/btest/condor_lab"
JobUniverse = 5
Cmd = "/home/btest/condor_lab/condor_bad.sh"
MinHosts = 1
MaxHosts = 1
CurrentHosts = 0
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobStatus = 1
EnteredCurrentStatus = 1103551327
JobPrio = 0
User = "btest@ldas-grid.ligo-la.caltech.edu"
NiceUser = FALSE
Env = ""
JobNotification = 0
UserLog = "/home/btest/condor_lab/results.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "results.output.0"
Err = "results.error.0"
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "YES"
WhenToTransferOutput = "ON_EXIT"
TransferFiles = "ONEXIT"
ImageSize = 1
ExecutableSize = 1
DiskUsage = 1
Requirements = (Memory > 2000) && (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && (HasFileTransfer)
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "Example.31.0 10"
ProcId = 0
WantMatchDiagnostics = TRUE
LastRejMatchReason = "no match found"
LastRejMatchTime = 1103552121
ServerTime = 1103552380
$ condor_rm 31
Cluster 31 has been marked for removal.
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
$
Now edit condor_bad.submit, removing the requirements line so that the file reads:

executable=condor_bad.sh
universe=vanilla
arguments=Example.$(Cluster).$(Process) 10
output=z.results.output.$(Process)
error=z.results.error.$(Process)
log=z.results.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue
$ condor_submit condor_bad.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 32.
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  32.0   btest           12/20 08:22   0+00:00:03 R  0    0.0 condor_bad.sh Exam

1 jobs; 0 idle, 1 running, 0 held
$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
$