Exercise 1.2: Investigating Job Attributes¶
The objective of this exercise is to your awareness of job "class ad attributes", especially ones that may help you look for issues with your jobs in the OSPool.
Recall that a job class ad contains attributes and their values that describe what HTCondor knows about the job. OSPool jobs contain extra attributes that are specific to that pool. Thus, an OSPool job class ad may have well over 150 attributes.
Some OSPool job attributes are especially helpful when you are scaling up jobs and want to see if jobs are running as expected or are maybe doing surprising things that are worth extra attention.
Preparing exercise files¶
Because this exercise focuses on OSPool job attributes, please use your OSPool account on ap40.uw.osg-htc.org
.
-
Create a shell script for testing called
simple.sh
:#!/bin/bash SLEEPTIME=$1 hostname pwd whoami for i in {1..5} do echo "performing iteration $i" sleep $SLEEPTIME done
-
Create an HTCondor submit file that queues three jobs:
universe = vanilla log = logs/$(Cluster)_$(Process).log error = logs/$(Cluster)_$(Process).err output = $(Cluster)_$(Process).out executable = simple.sh should_transfer_files = YES when_to_transfer_output = ON_EXIT request_cpus = 1 request_memory = 1GB request_disk = 1GB # set arguments, queue a normal job arguments = 600 queue 1 # queue a job that will go on hold transfer_input_files = test.txt queue 1 # queue a job that will never start request_memory = 40TB queue 1
Exploring OSPool job class ad attributes¶
For this exercise, you will submit the three jobs defined in the submit file above, then examine their job class ad attributes.
Here are some attributes that may be interesting:
CpusProvisioned
is the number of CPUs given to your job for the current or most recent runResidentSetSize_RAW
is the maximum amount of memory that HTCondor has noticed your job using (in KB)DiskUsage_RAW
is the maximum amount of disk that HTCondor has noticed your job using (in MB)NumJobStarts
is the number of times HTCondor has started your job;1
is typical for a running job, and higher counts may indicate issues running the jobLastRemoteHost
identifies the name for the slot where your job is running or most recently ranMachineAttrGLIDEIN_ResourceName*N*
is a set of numbered attributes that identify the most recent sites where your job ran; N is0
for the most recent (or current) run,1
for the previous run, and so on up to9
ExitCode
exists only if your job exited (completed) at least once; a value of0
typically means successHoldReasonCode
exists only if your job went on hold; if so, it is a number corresponding to the main hold reason (see here for details)NumHoldsByReason
is a list of all of the main reasons your job has gone on hold so far with counts of each hold type
Let’s explore these attributes on real jobs.
-
Submit the jobs (above) and note the cluster ID
-
When one job from the cluster is running, view all of its job class ad attributes:
$ condor_q -l <JobId>
where
<JobId>
is your job's ID, and-l
stands for-long
This command lists all of the job’s class ad attributes. Details of some of the attributes are in the HTcondor Manual. Others are defined (and not well documented) only for the OSPool. Can you find any of the attributes listed above?
-
Next, use
condor_q -af <AttributeName>
to examine one attribute at a time for several jobs:$ condor_q -af NumJobStarts <ClusterID>
where
<ClusterID>
is the HTCondor cluster ID noted above, and-af
stands for-autoformat
.What does the output tell you?
-
Finally, display several attributes at once for the jobs:
$ condor_q -af:j NumJobStarts DiskUsage_RAW LastRemoteHost HoldReasonCode <ClusterID>
Why do some values appear as
undefined
?