Getting Job Information from Log Files¶

HTCondor job log files contain useful information about submitted, running, and/or completed jobs, but the format of that information may not always be useful to you. Here, we have a few examples of how to use some powerful Unix commands (grep, sort, uniq) to pull information out of these job log files. It is now time for you to try these on your own jobs!

Before starting this exercise, copy a couple of your job log files from previous exercises (for example, HTC Exercise 1.5 and/or OSG Exercise 1.1) in to a new directory for this exercise. Use these log files in place of my-job.log in the examples below.

The grep command displays lines from a file matching a given pattern, where the pattern is the first argument provided to grep. For example grep 'alice' address_book.txt would print out all lines containing the characters alice in the file named address_book.txt. While working through this exercise, consider keeping one of your job log files open in a separate window to see if you can figure out how we came up with the patterns presented in this exercise.

Job terminations¶

Lines for job termination events in the job log always start with 005 and contain the timestamp of when the job(s) ended. Use the following grep command to get a list of when jobs ended in your log files:

$ grep '^005' my-job.log

Optional challenge: What is the importance of ^ in the pattern (^005) provided above?

Recall that executables typically exit with code 0 when they exit normally, which often (but not always!) means that they exited successfully. Lines containing jobs' exit codes (i.e. return values) all contain the word termination. Use grep to get a list of jobs' exit codes:

$ grep termination my-job.log

By "piping" the output of the previous command through the sort and then uniq commands, we can get a count of each exit code:

$ grep termination my-job.log | sort | uniq -c

Here's an example of the output from the previous commands when run on a log file written to from eight jobs. Six jobs exited with exit code 0, while two exited 1:

[username@ap40]$ grep '^005' my-job.log
005 (236881.000.000) 2022-07-27 15:07:38 Job terminated.
005 (236883.000.000) 2022-07-27 15:07:42 Job terminated.
005 (236882.000.000) 2022-07-27 15:08:01 Job terminated.
005 (236880.000.000) 2022-07-27 15:08:07 Job terminated.
005 (236891.000.000) 2022-07-27 15:13:31 Job terminated.
005 (236893.000.000) 2022-07-27 15:13:32 Job terminated.
005 (236892.000.000) 2022-07-27 15:13:58 Job terminated.
005 (236890.000.000) 2022-07-27 15:13:59 Job terminated.

[username@ap40]$ grep 'termination' my-job.log
(1) Normal termination (return value 0)
(1) Normal termination (return value 1)
(1) Normal termination (return value 0)
(1) Normal termination (return value 0)
(1) Normal termination (return value 1)
(1) Normal termination (return value 0)
(1) Normal termination (return value 0)
(1) Normal termination (return value 0)

[username@ap40]$ grep 'termination' my-job.log | sort | uniq -c
6   (1) Normal termination (return value 0)
2   (1) Normal termination (return value 1)

Job resource usage¶

Jobs' resource usages (and requests and allocations) are logged in the following format:

Partitionable Resources :    Usage  Request Allocated
    Cpus                 :                 1         1
    Disk (KB)            :    10382  1048576   1468671
    Memory (MB)          :      692     1024      1024

Run the following grep command to pull out the memory information from your job logs:

$ grep 'Memory (MB) *:' my-job.log

Look back at the format in the example above. Columns after the : will first show memory usage, then memory requested, and then the memory allocated to your job.

Similarly, use the following command to get the disk information from your job logs:

$ grep 'Disk (KB) *:' my-job.log

Here's some example output from running the memory grep command on the same eight-job log file:

[username@ap40]$ grep 'Memory (MB) *:' my-job.log
Memory (MB)          :      692     1024      1024 
Memory (MB)          :      714     1024      1024 
Memory (MB)          :      703     1024      1024 
Memory (MB)          :      699     1024      1024 
Memory (MB)          :      705     1024      1024 
Memory (MB)          :      704     1024      1024 
Memory (MB)          :      711     1024      1024 
Memory (MB)          :      697     1024      1024

In this example, the memory usage for the jobs ranged from 692 to 714 MB, and they all requested (and were allocated) 1 GB of memory.

Other job information¶

See if you can come up with grep commands to gather the number of bytes sent and received by jobs (i.e. how much data was transferred to/from the access point). Here is some example output for comparison:

[username@ap40]$ grep '<PATTERN REMOVED>' my-job.log
760393  -  Total Bytes Sent By Job
760395  -  Total Bytes Sent By Job
760397  -  Total Bytes Sent By Job
760395  -  Total Bytes Sent By Job
760393  -  Total Bytes Sent By Job
760395  -  Total Bytes Sent By Job
760397  -  Total Bytes Sent By Job
760395  -  Total Bytes Sent By Job

[username@ap40]$ grep '<PATTERN REMOVED>' my-job.log
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job
19240  -  Total Bytes Received By Job

Job log files may also contain additional information about held jobs or interrupted jobs. If you feel that your jobs are bouncing from idle to running and back to idle, or that they are otherwise not making as much progress as you expect, the log files are a good place to check. Though they might eventually become impossibly large to read line-by-line once you start scaling up, using grep to pull out specific lines and using sort and uniq to reduce the output can help you make sense of the information contained in the logs.