Lab Exercise: Condor (Local)

Purpose:

During this lab you will become familiar with using Condor to run jobs locally while logged in via SSH to the head node of a cluster.

  1. SSH into the Server
  2. Displaying Condor Information
  3. Submitting Local Condor Jobs
  4. Single Job Submission with Requirements
  5. Diagnosing & Restarting Non-Running Jobs

 

SSH into the Server

  1. All of the work during this lab exercise will be done on the login node of the LLO cluster.
$ ssh ldas-grid.ligo-la.caltech.edu

$

 

Displaying Condor Information

  1. The condor_version command is a good starting point.
$ condor_version
$CondorVersion: 6.7.3 Dec 28 2004 $
$CondorPlatform: I386-LINUX_RH9 $
$
  2. The condor_status command will show the status of the nodes in the Condor pool.
$ condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@ldas-grid LINUX       INTEL  Owner      Idle       0.040   700  11+06:25:28
vm2@ldas-grid LINUX       INTEL  Owner      Idle       0.000  1200  11+06:25:29
vm1@node10.li LINUX       INTEL  Claimed    Busy       1.000   700  0+00:01:51
[..snip..]
vm2@node8.lig LINUX       INTEL  Claimed    Busy       1.010  1200  0+00:13:00
vm1@node9.lig LINUX       INTEL  Claimed    Busy       1.000   700  0+00:11:42
vm2@node9.lig LINUX       INTEL  Claimed    Busy       1.000  1200  0+00:24:28

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX      138     2     136         0       0          0

               Total      138     2     136         0       0          0
$ 
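
condor_status accepts options to filter this listing. For example (assuming these options are available in your version of Condor), -avail lists only machines that are currently willing to run a job, and -claimed lists only machines that are already running one:

$ condor_status -avail
$ condor_status -claimed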
  3. The condor_q command will display the job queue.
$ condor_q


-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
61122.0   dbrown          3/7 16:38    0+00:00:00 I  0   16.3 lalapps_inspiral -
61123.0   dbrown          3/7 16:38    0+00:00:00 I  0   16.3 lalapps_inspiral -
61124.0   dbrown          3/7 16:38    0+00:00:00 I  0   16.3 lalapps_inspiral -
[..snip..]
61140.0   kipp            3/7 16:45    0+00:06:35 R  0    2.4 condor_dagman -f -
61141.0   kipp            3/7 16:45    0+00:06:28 R  0    0.0 dagdbUpdator -j 13
61143.0   kipp            3/7 16:45    0+00:06:07 R  0   18.0 lalapps_power --wi

988 jobs; 820 idle, 168 running, 0 held
$ 

If you're logged into the server and want to see just your jobs, you can specify your userid as follows:

$ condor_q mfreemon


-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
28098.0   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.1   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.2   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.3   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.4   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.5   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.6   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.7   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.8   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
28098.9   mfreemon       2/24 15:50    0+00:00:00 I  0    0.0 hostname 
29105.0   mfreemon       2/25 15:44    0+00:00:00 I  0    0.0 condor_simple.sh E

11 jobs; 11 idle, 0 running, 0 held
$ 
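
Once some of your jobs are running, the -run option (if your version of condor_q supports it) shows which machine each running job was matched to:

$ condor_q -run mfreemon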

Complete documentation on the condor_q command can be found in the Condor manual.


Submitting Local Condor Jobs

  1. Two files will need to be created in your home directory on the server.  The first is the program that will be submitted to Condor for execution.  The second is the Condor submit file, which tells Condor how the executable is to be run.

Start by creating a directory in your home directory on the server called lab6 and cd into it:

$ cd

$ mkdir lab6

$ cd lab6

This new lab6 directory should be used to contain any files we create during the remainder of this lab exercise.

  2. Create a file called lab6.sh and copy the following code into the file:
#! /bin/sh

echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"  
exit 42
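
Before handing the script to Condor, it is worth a quick local test. Make the script executable and run it by hand with sample arguments (any name string and a short sleep duration will do):

$ chmod +x lab6.sh
$ ./lab6.sh Test 2

The script should print its diagnostic lines, sleep for 2 seconds, and exit with status 42 (check with echo $?).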
  3. Create a file called lab6.submit and copy the following into it:
executable=lab6.sh
universe=vanilla
arguments=Example.$(Cluster).$(Process) 5   
output=z.lab6.output.$(Process)
error=z.lab6.error.$(Process)
log=z.lab6.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue 10

Looking at the submit file you should note several tags.  The executable tag tells Condor the name of the program to run; in this case it is the shell script we just created.  There is also a tag called arguments, whose value is passed to the running executable.  Looking at our shell script we see it takes two arguments.  The first is a string built from two predefined macros: $(Cluster) is the cluster number Condor assigns to the whole submission, and $(Process) is the index of each individual job within that cluster (0 through 9 here).  The second argument is the value used by the sleep command; it tells the program how long to sleep before continuing, in this case 5 seconds.

The queue tag tells Condor how many instances of the executable to run; in this case 10 instances of lab6.sh are queued and will run in parallel as machines become available.  One thing to keep in mind when running multiple instances of an executable is what happens to the output.  In the submit file above we have appended the process number to the end of the output and error file names, so Condor creates 10 different files, each unique because of that number.  If we had not done this, Condor would have used the same file for all 10 jobs.  The expansion is illustrated below.
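
As a concrete illustration, if the submission is assigned cluster number 29, the macros expand as follows for the first and last jobs in the cluster (the cluster number will differ on your system):

job 29.0:  arguments = Example.29.0 5    output = z.lab6.output.0    error = z.lab6.error.0
job 29.9:  arguments = Example.29.9 5    output = z.lab6.output.9    error = z.lab6.error.9

All 10 jobs append their Condor events to the single log file z.lab6.log.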

  4. Now submit the job to Condor.  This is done using the condor_submit command.
$ condor_submit lab6.submit
Submitting job(s)...............
Logging submit event(s)...............
10 job(s) submitted to cluster 29.
$
  5. Once the job has been submitted we can look at its status with condor_q, which shows what jobs are in the queue and what state they are in.  By running condor_q several times you can follow the progress of the submitted jobs: first they are entered into the queue (state I, idle), then they begin to run (state R), and finally they complete and are removed from the queue.  You should see output similar to what is shown below; a convenient way to keep watching the queue is shown after the transcript.
$ condor_q mfreemon


-- Submitter: ldas-grid.ligo-la.caltech.edu : <10.13.0.12:32772> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   9.0   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.1   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.2   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.3   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.4   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.5   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.6   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.7   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.8   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E
   9.9   mfreemon         3/7 00:02   0+00:00:00 I  0   0.0  lab6.sh E

194 jobs; 40 idle, 154 running, 0 held
$
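
Rather than rerunning condor_q by hand, you can use the standard watch utility (assuming it is installed on the login node) to refresh the display every 5 seconds; press Ctrl-C to quit:

$ watch -n 5 condor_q mfreemon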
  6. Looking back at our submit file you will note that several output files were defined:
output=z.lab6.output.$(Process)  
error=z.lab6.error.$(Process)
log=z.lab6.log

The output file will contain the standard output of the executable.  The error file will contain anything the program writes to stderr.  The log file is Condor's own log of the job's lifecycle.  Look at each file in turn.

$ ls -la
total 98
drwxrwxr-x   2 mfreemon mfreemon  776 Mar  7 16:56 .
drwx------   7 mfreemon mfreemon  464 Mar  7 16:51 ..
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.0
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.1
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.2
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.3
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.4
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.5
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.6
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.7
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.8
-rw-rw-r--   1 mfreemon mfreemon   31 Mar  7 16:57 z.lab6.error.9
-rw-rw-r--   1 mfreemon mfreemon  150 Mar  7 16:57 z.lab6.log
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.0
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.1
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.2
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.3
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.4
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.5
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.6
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.7
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.8
-rw-rw-r--   1 mfreemon mfreemon  256 Mar  7 16:57 z.lab6.output.9
-rw-rw-r--   1 mfreemon mfreemon  273 Mar  7 16:53 lab6.sh
-rw-rw-r--   1 mfreemon mfreemon  241 Mar  7 16:55 lab6.submit

$ cat z.lab6.error.0
This is sent to standard error

$ cat z.lab6.log
000 (015.000.000) 12/15 10:38:06 Job submitted from host: <141.142.96.174:33149>
...
017 (015.000.000) 12/15 10:38:19 Job submitted to Globus
    RM-Contact: ldas-grid.ligo-la.caltech.edu/jobmanager-condor
    JM-Contact: https://ligo-server.ncsa.uiuc.edu:38307/24309/1103128689/
    Can-Restart-JM: 1
...
001 (015.000.000) 12/15 10:38:19 Job executing on host: ldas-grid.ligo-la.caltech.edu
...
005 (015.000.000) 12/15 10:40:11 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
        0 - Run Bytes Sent By Job
        0 - Run Bytes Received By Job
        0 - Total Bytes Sent By Job
        0 - Total Bytes Received By Job
...

$ cat z.lab6.output.0
I'm process id 18036 on node54
Mon Mar 7 17:05:22 CST 2005
Running as binary /usr1/condor/execute/dir_18034/condor_exec.exe Example.61245.0 5
My name (argument 1) is Example.61245.0
My sleep duration (argument 2) is 5
Sleep of 5 seconds finished. Exiting
$
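
Because lab6.sh exits with status 42, the termination event in your own log should read "Normal termination (return value 42)".  If your installation provides the condor_wait utility, you can also use it to block until every job recorded in the log has finished:

$ condor_wait z.lab6.log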


Single Job Submission with Requirements

  1. Condor also allows us to define requirements that must be met before a job will run.  These requirements tell Condor what type of machine the job needs to run on.

Create a file: condor_req.submit

Copy the following into the file.

executable=lab6.sh
Requirements = Memory >= 32 && OpSys == "LINUX" && Arch == "INTEL"
universe=vanilla
arguments=Example.$(Cluster).$(Process) 5
output=z.req.output.$(Process)
error=z.req.error.$(Process)
log=z.req.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue

Requirements for the job are defined by the requirements tag.  In this case we have told Condor that the machine must have at least 32 megabytes of memory, that the operating system must be Linux, and that the processor must be Intel-based.  You can find a full listing of the attributes that can be used in requirements expressions in the Condor manual:

http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html
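
Before submitting, you can check how many machines in the pool satisfy a requirements expression by handing the same expression to condor_status with its -constraint option, for example:

$ condor_status -constraint 'Memory >= 32 && OpSys == "LINUX" && Arch == "INTEL"'

Every machine listed in the output is a potential match for the job.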

  2. Submit the job and watch it run.  Then verify the output.
$ condor_submit condor_req.submit

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 30.

$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
  30.0   btest          12/17 09:44   0+00:00:03 R  0   0.0 lab6.sh E

1 jobs; 0 idle, 1 running, 0 held
$

Comprehensive documentation on submitting jobs can be found at http://www.cs.wisc.edu/condor/manual/v6.6/2_5Submitting_Job.html


Diagnosing & Restarting Non-Running Jobs

  1. Create a file called: condor_bad.submit

Copy the following into the file:

executable=condor_bad.sh
universe=vanilla
arguments=Example.$(Cluster).$(Process) 10  
output=z.results.output.$(Process)
error=z.results.error.$(Process)
log=z.results.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
requirements=Memory>2000
queue
  2. Create a file called: condor_bad.sh

Input the following into the file.

#!/bin/sh  
echo $1
  3. You will notice a requirements tag that specifies that the memory must be greater than 2000 megabytes.  No machine in the pool has that much memory, so this requirement should prevent the job from ever running.  Submit the job and watch it (not) run; a way to confirm the mismatch is shown after the transcript.
$ condor_submit condor_bad.submit

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 31.

$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   btest          12/20 08:02   0+00:00:00 I  0   0.0  condor_bad.sh Exam

1 jobs; 1 idle, 0 running, 0 held

$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   btest          12/20 08:02   0+00:00:00 I  0   0.0  condor_bad.sh Exam

1 jobs; 1 idle, 0 running, 0 held
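
To confirm the mismatch, ask condor_status for the machines that advertise more than 2000 megabytes of memory; in this pool the listing should come back empty:

$ condor_status -constraint 'Memory > 2000'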

  4. Once you see the job sitting idle, use condor_q with the -analyze option to look at what is going on.  The -analyze option reports why the job in question is not being matched.  Use the cluster number that was assigned to your job.
$ condor_q -analyze 31

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
031.000: Run analysis summary. Of 2 machines,
    2 are rejected by your job's requirements
    0 reject your job because of their own requirements
    0 match, but are serving users with a better priority in the pool
    0 match, but reject the job for unknown reasons
    0 match, but will not currently preempt their existing job
    0 are available to run your job
      No successful match recorded.
      Last failed match: Mon Dec 20 08:02:07 2004
      Reason for last match failure: no match found

WARNING: Be advised:
  No resources matched request's constraints
  Check the Requirements expression below:

  Requirements = (Memory > 2000) && (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && (HasFileTransfer)

$

According to the information Condor has returned, we can see that the job is not matching because of its own requirements.  In this case we happen to know which part of the requirements expression is at fault, but that will not always be true.  While the output of condor_q -analyze is useful, it is not all we could hope for when debugging.

  5. Condor provides another tool to help with this: a utility called condor_analyze.  To use it we first need to download and install it (it is expected to become part of the standard Condor installation).

Perform the following commands:

$ cd

$ wget http://www.cs.wisc.edu/~adesmet/condor_analyze.gz

$ gunzip condor_analyze.gz

$ chmod a+x condor_analyze

$ cd lab6

$
  6. You should now be ready to use the condor_analyze utility.  Run it against your cluster number and look at the output.
$ ~/condor_analyze 31


-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
---
031.000: Run analysis summary. Of 2 machines,
      2 are rejected by your job's requirements
        No successful match recorded.
        Last failed match: Mon Dec 20 08:02:07 2004
        Reason for last match failure: no match found


WARNING: Be advised:
   No machines matched job's requirements


The Requirements expression for your job is:

( target.Memory > 2000 ) && ( target.Arch == "INTEL" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( target.HasFileTransfer )

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( target.Memory > 2000 )          0                   MODIFY TO 500
2   ( target.Arch == "INTEL" )        2
3   ( target.OpSys == "LINUX" )       2
4   ( target.Disk >= 1 )              2
5   ( target.HasFileTransfer )        2

1 jobs; 1 idle, 0 running, 0 held
$

The utility pinpoints the problem: the memory condition matches 0 machines.  It even suggests a value that would allow the requirement to match.
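
Following that suggestion, an alternative to removing the requirement entirely (which is what we do below) would be to edit the requirements line in condor_bad.submit so the memory clause can actually match the pool, for example:

requirements=Memory>=500

and then resubmit with condor_submit.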

  7. Another way to look at the details of a job in the queue is to use condor_q with the -long option.
$ condor_q -long

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
MyType = "Job"
TargetType = "Machine"
ClusterId = 31
QDate = 1103551327
CompletionDate = 0
Owner = "btest"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
LastSuspensionTime = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.6.6 Jul 26 2004 $"
CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
RootDir = "/"
Iwd = "/home/btest/condor_lab"
JobUniverse = 5
Cmd = "/home/btest/condor_lab/condor_bad.sh"
MinHosts = 1
MaxHosts = 1
CurrentHosts = 0
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobStatus = 1
EnteredCurrentStatus = 1103551327
JobPrio = 0
User = "btest@ldas-grid.ligo-la.caltech.edu"
NiceUser = FALSE
Env = ""
JobNotification = 0
UserLog = "/home/btest/condor_lab/results.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "results.output.0"
Err = "results.error.0"
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "YES"
WhenToTransferOutput = "ON_EXIT"
TransferFiles = "ONEXIT"
ImageSize = 1
ExecutableSize = 1
DiskUsage = 1
Requirements = (Memory > 2000) && (Arch == "INTEL") && (OpSys == "LINUX") && (Disk >= DiskUsage) && (HasFileTransfer)
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "Example.31.0 10"
ProcId = 0
WantMatchDiagnostics = TRUE
LastRejMatchReason = "no match found"
LastRejMatchTime = 1103552121
ServerTime = 1103552380
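
The -long listing is verbose.  If your version of condor_q supports the -format option, you can print just the attributes you care about; for example, the ID and command of each queued job:

$ condor_q -format "%d." ClusterId -format "%d " ProcId -format "%s\n" Cmd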
  8. Now that we know what is wrong with the job, remove it from the queue using the condor_rm command.  This command kills the job and removes it from the queue.  You will need to specify a job or cluster number when issuing this command.
$ condor_rm 31

Cluster 31 has been marked for removal.

$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
$
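
Note that a bare cluster number removes every job in that cluster.  To remove a single job, include the process number:

$ condor_rm 31.0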
  9. Edit the file condor_bad.submit and remove the memory requirements tag.  The file should now look like this:
executable=condor_bad.sh
universe=vanilla
arguments=Example.$(Cluster).$(Process) 10
output=z.results.output.$(Process)
error=z.results.error.$(Process)
log=z.results.log
notification=never
should_transfer_files=YES
when_to_transfer_output = ON_EXIT
queue
  10. Resubmit the job and watch it run to completion.  Then verify the output, as shown after the final transcript.
$ condor_submit condor_bad.submit

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 32.

$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  32.0   btest          12/20 08:22   0+00:00:03 R  0   0.0  condor_bad.sh Exam

1 jobs; 0 idle, 1 running, 0 held

$ condor_q

-- Submitter: ldas-grid.ligo-la.caltech.edu : <141.142.96.174:33149> : ldas-grid.ligo-la.caltech.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
$
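
Since condor_bad.sh simply echoes its first argument, each output file should contain the expanded argument string (the cluster number will match your own submission):

$ cat z.results.output.0
Example.32.0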