Skip to content

Monitor and Review Jobs With condor_q and condor_history"

Objectives

This guide discusses how to monitor jobs in the queue with condor_q and to review jobs that have recently left the queue with condor_history.

Monitor Queued Jobs with condor_q

Default condor_q

The default behavior of condor_q is to list all of a user's jobs currently in HTCondor's queue grouped into batches. A batch consists of all jobs submitted using a single submit file. For example:

$ condor_q

-- Schedd: ap40.uw.osg-htc.org : <192.170.227.146:9618?... @ 03/04/22 12:31:45
OWNER     BATCH_NAME  SUBMITTED    DONE   RUN    IDLE  TOTAL JOB_IDS
alice ID: 21562536   3/4  12:31      _      _      5      5 21562536.0-4

Total for query: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended 
Total for alice: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended 
Total for all users: 4112 jobs; 0 completed, 0 removed, 76 idle, 904 running, 3132 held, 0 suspended

Constraints for condor_q

condor_q can be used to list individual jobs associated with a username<U>, cluster ID <C>, or job ID <J> as indicated by <U/C/J>.

Additionally, the flag -nobatch can be used to list individual jobs instead of batches of jobs using the format condor_q <U/C/J> -nobatch.

$ condor_q alice -nobatch

-- Schedd: ap40.uw.osg-htc.org : <192.170.227.146:9618?... @ 03/04/22 12:52:22
 ID          OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
21562638.0   alice            3/4  12:52   0+00:00:00 I  0    0.0 soilModel.py parameter1.csv
21562638.1   alice            3/4  12:52   0+00:00:00 I  0    0.0 soilModel.py parameter2.csv
21562638.2   alice            3/4  12:52   0+00:00:00 I  0    0.0 soilModel.py parameter3.csv
21562638.3   alice            3/4  12:52   0+00:00:00 I  0    0.0 soilModel.py parameter4.csv
21562638.4   alice            3/4  12:52   0+00:00:00 I  0    0.0 soilModel.py parameter5.csv

21562639.0   alice            3/4  12:52   0+00:00:00 I  0    0.0 wordcount.py Alice_in_Wonderland.tx
21562639.1   alice            3/4  12:52   0+00:00:00 I  0    0.0 wordcount.py Dracula.txt
21562639.2   alice            3/4  12:52   0+00:00:00 I  0    0.0 wordcount.py Huckleberry_Finn.txt
21562639.3   alice            3/4  12:52   0+00:00:00 I  0    0.0 wordcount.py Pride_and_Prejudice.tx
21562639.4   alice            3/4  12:52   0+00:00:00 I  0    0.0 wordcount.py Ulysses.txt

View All Job Attributes

Information about HTCondor jobs are saved as "job attributes". Job attributes can be viewed using the -l flag, a shorthand for -long. The output of condor_q <U/C/J> -l can be used to learn more about a job and to diagnose errors.

Examples of job attributes listed when using condor_q <U/C/J> -l are as follows:

Attribute Description
MemoryUsage Maximum memory that a job used in MB
DiskUsage Maximum disk space that a job used in KB
BatchName Job batch label
MATCH_EXP_JOBGLIDEIN_ResourceName Location of site at which a job is running
RemoteHost Location of ite and slot number where a job is running
ExitCode Exit code of a job upon its completion
HoldReason Human-readable message as to why a job was held. It can be used to determine if a job should be released or not.
HoldReasonCode Integer value that represents why a job was put on hold
JobNotification Integer indicating when the user should be emailed regarding a change of status for their job
RemotePool Name of the pool in which a job is running
NumRestarts Number of restarts carried out by a job

Many additional attributes are provided by HTCondor to learn about your jobs, including attributes dedicated to workflows that utilize DAGman and containers.

For more information about these and other attributes, please see the HTCondor Manual.

Constraints for Job Attributes

To display only the output of specified attributes, it is possible to use the "auto format" flag denoted as -af with condor_q <U/C/J>. An example use case is to view the owner and location of the site where a given job, such as job ID 15244592.127, is running by using:

$ condor_q 15244592.127 -af Owner MATCH_EXP_JOBGLIDEIN_ResourceName 

alice BNL-ATLAS

In the above example, the Owner is the user alice and the job is running on resources owned by the Brookhaven National Laboratory as indicated by BNL_ATLAS.

View Specific Job Attributes Across More Than One Job

It is possible to sort and filter the output for one or more job attributes across a batch of jobs. When investigating more than one job, it is advantageous to limit the print out to a certain number of jobs to avoid flooding your screen. To limit the output to a specified number of jobs, use -limit N and replace N with the number of jobs you would like to view. For example, to view the site location where 100 jobs belonging to batch 12245532 ran, you can use:

$ condor_q 12245532 -limit 100 -af MATCH_EXP_JOBGLIDEIN_ResourceName | sort | uniq -c

      9 Crane
      4 LSU-DB-CE1
      4 ND-CAML_gpu
     71 Rice-RAPID-Backfill
      2 SDSC-PRP-CE1
      6 TCNJ-ELSA
      1 Tufts-Cluster
      3 WSU-GRID

In this example, 71 jobs ran at Rice University (Rice-RAPID-Backfill) while only one job ran at Tufts University (Tufts-Cluster). If you would like to know which abbreviations correspond to which compute resource provider in the OSPool, contact a Research Computing Facilitator.

View Jobs that are Held

To isolate and print out held jobs, use condor_q <U/C/J> -held. The this command will print jobs currently in the "Held" state and will not print jobs that are in the "Run", "Done", or "Idle" states.

Using the job ads and constraints described above, it is possible to print out the reasons why a subset of a user's jobs are being held.

$ condor_q alice -held -af HoldReason | sort | uniq -c
      4 Error from [email protected]: SHADOW at 192.170.227.166 failed to send file(s) to <192.41.230.81:44309>: error reading from /home/alice/InputData.txt: (errno 2) No such file or directory; STARTER failed to receive file(s) from <192.170.227.166:9618>
      1 Job in status 2 put on hold by SYSTEM_PERIODIC_HOLD due to memory usage 10572684.

In the output above, four jobs were place on hold due to a "missing file or directory" in the path of /home/alice/InputData.txt that was specified in the transfer_input_files line of the submit file. Because HTCondor could not locate this input (possibly due to an incorrect file path), the job was placed on hold. Additionally, one job was placed on hold due to exceeding the requested memory specified in the submit file.

An in-depth guide on troubleshooting issues with held jobs on the OSPool is available on our website.

View Machine Matches for a Job

The -analyze and -better-analyze options can be used to view the number of machines that match to a job. These flags are often used to diagnose many problems, including understanding why a job has not started running.

A portion of the output from these options shows the number of machines in the pool and how many of these are able to run your job:

21607747.000:  Run analysis summary ignoring user priority.  Of 2189 machines,
   1605 are rejected by your job's requirements
     53 reject your job because of their own requirements
      1 match and are already running your jobs
      0 match but are serving other users
    530 are able to run your job

Additional output of these options include the requirements line of the job's submit file, last successful match date, hold reason messages, and other useful information.

The -analyze and -better-analyze options deliver similar output, however, -better-analyze is a newer feature that provides additional information including the number of slots matched by your job given the different requirements specified in the submit file.

Additional information on using -analyze and -better-analyze for troubleshooting will be available in our troubleshooting guide in the near future.

Review Job History with condor_history

Default condor_history

Somewhat similar to condor_q, which shows jobs currently in the queue, condor_history is used to show information about jobs that have recently left the queue.

By default, condor_history will show every user's job that HTCondor still has a record of in its history. Because HTCondor jobs are constantly being sent to the queue on OSG-managed Access Points, HTCondor cleans its history of jobs every few days to free up space for new jobs that have recently left the queue. Once a job is cleaned from HTCondor's history, it is removed permanently from the queue.

Before a job is cleaned from HTCondor's queue, condor_history can be valuable for learning about recently completed jobs.

As previously stated, condor_history without any additional flags will list every user's job, which can be thousands of lines long. To exit this behavior, use control + C. In most cases, it is recommended to combine condor_history with one or more of the options below to help limit the output of this command to only the desired information.

Constrain Your condor_history Query

Like condor_q, it is possible to limit the output of your condor_history query by user <U>, cluster ID <C>, and job ID <J> as indicated by (<U/C/J>). By default, HTCondor will continue to search through its history of jobs by the option it is constrained by. Since HTCondor's history is extensive, this means your command line prompt will not be returned to you until HTCondor has finished its search and analysis of its entire history. To prevent this time-consuming behavior from occurring, we recommend using the -limit N flag with condor_history. This will tell HTCondor to limit its search to the first N items that appear matching its constraint. For example, condor_history alice -limit 20 will return the condor_history output of the user alice's 20 most recently submitted jobs.

Viewing and Constraining Job Attributes

Displaying the list of job attributes using -l and -af can also be used with condor_history.

It is important to note that some attributes are renamed when a job exits the queue and enters HTCondor's history. For example, RemoteHost is renamed to LastRemoteHost and HoldReason will become LastHoldReason.

Special Considerations

Although many options that exist for condor_q also exist for condor_history, some do not. For example, -analyze and -better-analyze cannot be used with condor_history. Additionally, -hold cannot be used with condor_history as no job in HTCondor's history can be in the held state.

More Information on Options for condor_q and condor_history

A full list of the options for condor_q and condor_history may be listed by using combining them with the –-help flag or by viewing the HTCondor manual.