Automatically Rerunning a Job That Used Too Much Memory
Some workloads have variable memory requirements across jobs. For example, most jobs may need only 2 GB of RAM, but a small subset, perhaps a few percent, might require 6 GB or more to complete successfully. The challenge is that you often do not know in advance which jobs will need additional memory.
HTCondor provides a built-in mechanism to handle this: the retry_request_memory feature. When enabled, HTCondor automatically retries a job that exceeds its initial memory request, increasing the requested memory on subsequent runs.
Use this feature only if a small fraction (typically less than 20%) of
your jobs require additional memory. If the majority of jobs exceed the
baseline, you should instead increase the default request_memory
value.
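One way to gauge whether this feature fits your workload is to compare past jobs' measured memory use against what they requested. As a rough sketch (assuming your completed jobs are still in your access point's history, and noting that both values are typically reported in MB), condor_history can print the two attributes side by side:
condor_history -limit 100 -af ClusterId ProcId RequestMemory MemoryUsage
If only a few of the listed jobs show MemoryUsage above RequestMemory, retry_request_memory is a good fit; if most do, raise request_memory instead.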
For example, using the submit file below, the initial job execution requests 2 GB of memory. If the job exceeds this limit, HTCondor will evict the job with the reason "memory usage exceeded request_memory". The job will return to the queue to run again, and this execution (and any further retries) will run with a memory limit of 6 GB.
executable = my-job.sh
log = job.log
output = job.out
error = job.err
request_cpus = 1
request_memory = 2 GB
request_disk = 5 GB
# provide higher memory when retrying
retry_request_memory = 6 GB
queue
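After submitting, you can check whether a job has been restarted with the larger request by querying the queue. A minimal sketch (assuming the job is still idle or running; NumJobStarts counts how many times the job has begun executing):
condor_q -af ClusterId ProcId NumJobStarts RequestMemory MemoryUsage
If a retry was triggered, RequestMemory should reflect the larger value, and the eviction itself is also recorded in the job's log file (job.log above).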
Keep in mind that OSG resources have limited availability for large-memory jobs. Jobs requesting more memory will generally wait longer in the queue than lower-memory jobs.