HTC Exercise 1.4: Read and Interpret Log Files
Exercise Goal
In the previous exercise, we learned how to translate a simple list of computational tasks into HTCondor jobs. What if we want to learn more about our jobs?
The goal of this exercise is to learn how to understand the contents of a job's log file, which contains a history of the steps HTCondor took to run your job. The log file is also a great place to look while you are testing your jobs, as it records resource usage.
Additionally, if you suspect something has gone wrong with your job, the log (along with the error file) is a great place to start looking for signs of a problem.
Reading a Log File
In our last exercise, we collected information about the slots and submitted a relatively small batch of jobs as an initial test. What is HTCondor doing behind the scenes? In addition, how many resources are we actually using?
Why should we care about our resource usage? There are two undesirable scenarios:
- Under-requesting resources. If you under-request resources (e.g., memory or disk), your jobs go into the hold state when their usage exceeds the resources allocated to them. The jobs stop running, and you have to fix the issue by requesting more resources and resubmitting the jobs.
- Over-requesting resources. The easy solution to avoid the above scenario is to request a lot of resources, right? Unfortunately, no! Over-requesting resources means HTCondor needs to find a slot that has those resources, which can take longer than necessary if your jobs could have run on a slot with fewer resources. This is especially detrimental when you plan to submit many jobs.
For this exercise, you can examine the log file for any job that you have previously run. The example output below is based on a single job (process) within the batch of jobs we submitted.
A job log file is updated throughout the life of a job, usually at key events. Each event starts with a heading that indicates what happened and when. Here are some of the event headings from the tz_slotinfo job log (detailed output in between headings has been omitted here):
000 (12636880.000.000) 2025-05-27 17:35:28 Job submitted from host: <128...
040 (12636880.000.000) 2025-05-27 17:35:52 Started transferring input files
040 (12636880.000.000) 2025-05-27 17:35:52 Finished transferring input files
021 (12636880.000.000) 2025-05-27 17:35:54 Message from starter on slot1...
001 (12636880.000.000) 2025-05-27 17:35:54 Job executing on host: <10.11...
006 (12636880.000.000) 2025-05-27 17:35:55 Image size of job updated: 1
040 (12636880.000.000) 2025-05-27 17:35:55 Started transferring output files
040 (12636880.000.000) 2025-05-27 17:35:55 Finished transferring output files
005 (12636880.000.000) 2025-05-27 17:35:55 Job terminated.
View one of these log files and scroll through, observing what's written. There is a lot of extra information in those lines, but you can see:
- The job ID: cluster 12636880, process 0 (written 000)
- The date and local time of each event
- A brief description of the event: submission, execution, some information updates, and termination
- Each event ends with a line that contains only 3 dots: ...
Note
Because we specified a single log file for all 50 jobs in the batch, the events of all 50 jobs are written to this one log file as they happen. If you want an individual log file for each job in the batch, use $(Process) in the log line of the submit file.
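When one log file holds the events of many jobs, it can help to group the event headings by job ID. The following is a minimal sketch, not an official HTCondor tool: the function name `events_by_job` and the regular expression are assumptions made for illustration, and the pattern is based only on the heading layout shown above (event code, `cluster.process.subprocess` in parentheses, date, time, description).

```python
import re
from collections import defaultdict

# Heading layout assumed from the example above:
# 3-digit event code, (cluster.process.subprocess), date, time, description.
HEADING = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)

def events_by_job(log_text):
    """Group event headings by (cluster, process) job ID."""
    jobs = defaultdict(list)
    for line in log_text.splitlines():
        match = HEADING.match(line)
        if match:
            code, cluster, proc, _sub, timestamp, description = match.groups()
            jobs[(int(cluster), int(proc))].append((timestamp, code, description))
    return dict(jobs)

# Headings copied from the example log; the "..." lines end each event.
sample = """\
000 (12636880.000.000) 2025-05-27 17:35:28 Job submitted from host: <128...
...
001 (12636880.000.000) 2025-05-27 17:35:54 Job executing on host: <10.11...
...
005 (12636880.000.000) 2025-05-27 17:35:55 Job terminated.
...
"""

for job_id, events in events_by_job(sample).items():
    print(job_id, [code for _timestamp, code, _description in events])
```

Detail lines and the `...` terminators do not match the heading pattern, so they are skipped. For a real log, read the file contents into a string and pass it to the function.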
Some events include additional detail lines that help you quickly understand where and how your jobs are running. For example:
001 (12636880.000.000) 2025-05-27 17:35:54 Job executing on host: <10.118.5.219:33393?CCBID=128.105.82.148:9618%3faddrs%3d128.105.82.148-9618+[2607-f388-2200-87-d439-a1c8-2a11-24fc]-9618%26alias%3dospool-ccb.osg.chtc.io%26noUDP%26sock%3dcollector1#72672458%20192.170.231.11:9618%3faddrs%3d192.170.231.11-9618+[fd85-ee78-d8a6-8607--1-73ab]-9618%26alias%3dospool-ccb.osgprod.tempest.chtc.io%26noUDP%26sock%3dcollector7#31809902&PrivNet=c219.mgmt.hellbender&addrs=10.118.5.219-33393&alias=c219.mgmt.hellbender&noUDP>
SlotName: slot1_2@[email protected]
CondorScratchDir = "/local/scratch/glide_bKAhkg/execute/dir_1368654"
Cpus = 1
Disk = 1049600
GLIDEIN_ResourceName = "Missouri-Hellbender-CE1"
GPUs = 0
Memory = 1024
...
- The SlotName is the name of the execution point slot your job was assigned to by HTCondor, and the name of the execution point resource is provided in GLIDEIN_ResourceName.
- The CondorScratchDir is the name of the scratch directory that was created by HTCondor for your job to run inside.
- The Cpus, GPUs, Disk (in KiB), and Memory (in MB) values provide the maximum amount of each resource your job can use while running.
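To make these values easier to read, the `Name = value` lines of an execution event can be collected into a dictionary and the disk figure converted from KiB to GiB. This is a sketch under the assumption that each attribute appears on its own line in the form shown above; `parse_slot_attributes` is a hypothetical helper, not part of HTCondor.

```python
def parse_slot_attributes(event_lines):
    """Collect 'Name = value' lines from an execution event into a dict."""
    attrs = {}
    for line in event_lines:
        name, sep, value = line.partition(" = ")
        if sep:  # only lines that actually contain " = "
            attrs[name.strip()] = value.strip().strip('"')
    return attrs

# Attribute lines copied from the example execution event above.
event_lines = [
    'CondorScratchDir = "/local/scratch/glide_bKAhkg/execute/dir_1368654"',
    "Cpus = 1",
    "Disk = 1049600",
    'GLIDEIN_ResourceName = "Missouri-Hellbender-CE1"',
    "GPUs = 0",
    "Memory = 1024",
]

attrs = parse_slot_attributes(event_lines)
disk_gib = int(attrs["Disk"]) / 1024**2  # Disk is reported in KiB
print(
    f"{attrs['GLIDEIN_ResourceName']}: {attrs['Cpus']} CPU, "
    f"{disk_gib:.2f} GiB disk, {attrs['Memory']} MB memory"
)
```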
Another example is the periodic update:
006 (12636880.000.000) 2025-05-27 17:35:55 Image size of job updated: 1
0 - MemoryUsage of job (MB)
0 - ResidentSetSize of job (KB)
...
These updates record how much memory your job is using on the execution point. This information is helpful because it lets you tell HTCondor how much memory to request in future runs of the job.
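One way to use these updates is to find the largest MemoryUsage value reported in a log. The sketch below assumes the line layout shown above (`<number> - MemoryUsage of job (MB)`); `peak_memory_mb` is a hypothetical helper. If the log contains events from several jobs, the result is the peak across all of them, not per job.

```python
import re

# Matches lines such as "0 - MemoryUsage of job (MB)", with optional indentation.
MEMORY_LINE = re.compile(r"^\s*(\d+)\s+-\s+MemoryUsage of job \(MB\)")

def peak_memory_mb(log_text):
    """Return the largest MemoryUsage value (MB) in the log, or None if absent."""
    values = []
    for line in log_text.splitlines():
        match = MEMORY_LINE.match(line)
        if match:
            values.append(int(match.group(1)))
    return max(values) if values else None

# Event copied from the example above.
sample_update = """\
006 (12636880.000.000) 2025-05-27 17:35:55 Image size of job updated: 1
0 - MemoryUsage of job (MB)
0 - ResidentSetSize of job (KB)
...
"""

print(peak_memory_mb(sample_update))
```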
The job termination event includes a lot of very useful information:
005 (12636880.000.000) 2025-05-27 17:35:55 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
147 - Run Bytes Sent By Job
211 - Run Bytes Received By Job
147 - Total Bytes Sent By Job
211 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 0 1 1
Disk (KB) : 130 1048576 1049600
GPUs : 0
Memory (MB) : 0 1024 1024
TimeExecute (s) : 1
TimeSlotBusy (s) : 3
...
Probably the most interesting information is:
- The return value or exit code (0 here, which means the executable completed and didn't indicate any internal errors; non-zero usually means failure)
- The total number of bytes transferred each way, which could be useful if your network is slow
- The Partitionable Resources table, especially disk and memory usage, which will inform larger submissions.
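The Partitionable Resources table can also be read programmatically to compare usage against the request. The following is a rough sketch, assuming rows look like the ones above (a name, an optional unit in parentheses, a colon, then Usage, Request, and Allocated). Rows with fewer than three numbers, such as the GPUs row here, are skipped because it is not clear which column the single value belongs to. `resource_usage` is a hypothetical helper.

```python
import re

# Matches fully populated rows such as "Disk (KB) : 130 1048576 1049600".
ROW = re.compile(r"^\s*(\w+)(?: \((\w+)\))?\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s*$")

def resource_usage(event_lines):
    """Return {resource: (usage, request, allocated)} for rows with all three values."""
    table = {}
    for line in event_lines:
        match = ROW.match(line)
        if match:
            name, _unit, usage, request, allocated = match.groups()
            table[name] = (int(usage), int(request), int(allocated))
    return table

# Rows copied from the example termination event above.
termination = [
    "Partitionable Resources : Usage Request Allocated",
    "Cpus : 0 1 1",
    "Disk (KB) : 130 1048576 1049600",
    "GPUs : 0",
    "Memory (MB) : 0 1024 1024",
]

for name, (usage, request, _allocated) in resource_usage(termination).items():
    percent = 100 * usage / request if request else 0.0
    print(f"{name}: used {usage} of {request} requested ({percent:.2f}%)")
```

A very low percentage suggests the request could be reduced, while a value near 100% suggests the job is close to exceeding its request.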
There are many other kinds of events, but the ones above will occur in almost every job log.
Questions to consider
- Did we under- or over-request resources for our jobs?
- Why do you think the CPU usage shows 0 instead of 1?
- What might account for the difference between TimeExecute and TimeSlotBusy?
Discuss your answers to these questions with a neighbor or staff member.
Understanding How HTCondor Writes Files
When HTCondor writes the output, error, and log files, does it erase the previous contents of the file or does it add new lines onto the end? Let’s find out!
For this exercise, we will use the tz_slotinfo job from earlier.
- Edit the submit file so it submits 5 jobs instead of 50.
- Submit the job three separate times in a row.
- Wait for all the jobs to finish.
- Examine the output file: Did HTCondor erase the previous contents for each job, or add new lines?
- Examine the log file carefully: What happened there? Pay close attention to the times and job IDs of the events.
- How can you modify the submit file so it creates a unique .out and .err file for every condor_submit attempt?
For further clarification about how HTCondor handles these files, reach out to your neighbor or one of the other School staff.