GPU Exercise 1.3: Running a GPU job

To make sure the job runs on a resource with an available GPU, we only need to change two lines in the submit file. First, set request_gpus = 1, which tells HTCondor that the job needs a GPU. Second, specify a GPU-enabled container image by adding +SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.3" to the submit file; the container image used in the previous section was not GPU-enabled. The updated submit file, with both changes applied, is named gpu-job.submit and contains the following:

universe = vanilla

# Job requirements - ensure we are running on a Singularity-enabled
# node and have enough resources to execute our code.
# TensorFlow also requires the AVX instruction set and a newer host kernel.
Requirements = HAS_SINGULARITY == True && HAS_AVX2 == True && OSG_HOST_KERNEL_VERSION >= 31000
request_cpus = 1
request_gpus = 1
request_memory = 1 GB
request_disk = 1 GB

# Container image to run the job in
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.3"

# Executable is the program your job will run. It's often useful
# to create a shell script to "wrap" your actual work.
Executable = job-wrapper.sh
Arguments =

# Inputs/outputs - in this case we just need our python code.
# If you leave out transfer_output_files, all generated files come back.
transfer_input_files = test.py
#transfer_output_files =

# Error and output are the error and output channels from your job
# that HTCondor returns from the remote host.
Error = $(Cluster).$(Process).error
Output = $(Cluster).$(Process).output

# The LOG file is where HTCondor places information about your
# job's status, success, and resource consumption.
Log = $(Cluster).log

# Send the job to Held state on failure.
#on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the jobs every 1 hour, up to a maximum of 5 retries.
#periodic_release =  (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 60*60)

# queue is the "start button" - it launches any jobs that have been
# specified thus far.
queue 1

Submit this job with the command condor_submit gpu-job.submit. Once the job completes, check the .output file (named by the Output line in the submit file) for a line stating that the code ran on a GPU. You should see something similar to:

2021-02-02 23:25:19.022467: I tensorflow/core/common_runtime/eager/execute.cc:611] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0

The GPU:0 part of the log statement above shows that a GPU was found and used for the computation.
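If the output file is long, scanning it by eye can be tedious. The following is a minimal Python sketch of that check, not part of the exercise files. The function names gpu_lines and output_name are hypothetical, and the regular expression is an assumption based only on the sample line above rather than on any documented TensorFlow log format.

```python
import re

# Assumed pattern: a device-placement line that names a GPU, such as
# ".../device:GPU:0". Derived from the sample log line in this exercise.
GPU_DEVICE = re.compile(r"device:GPU:\d+")


def gpu_lines(log_text):
    """Return the lines of log_text that mention a GPU device."""
    return [line for line in log_text.splitlines() if GPU_DEVICE.search(line)]


def output_name(cluster, process=0):
    """Build the file name produced by Output = $(Cluster).$(Process).output.

    With "queue 1" there is a single job, so the process number is 0.
    """
    return f"{cluster}.{process}.output"


# Example with the sample line from this exercise plus a CPU line
# that was made up for contrast.
sample = (
    "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0\n"
    "Executing op Add in device /job:localhost/replica:0/task:0/device:CPU:0\n"
)
for line in gpu_lines(sample):
    print(line)
```

To use it on a real job, read the file returned by output_name with your own cluster ID and pass its contents to gpu_lines. An empty result means no line in that file matched the assumed pattern.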

GPUs on the Open Science Pool

Curious about what GPUs make up the Open Science Pool? Run condor_status -const 'gpus >= 1' -af CUDADeviceName | sort | uniq -c to find out. Which GPU models are most common?
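The sort | uniq -c part of that pipeline simply tallies identical lines. As a rough illustration, the same tally can be sketched in Python. The helper names count_models and query_pool are hypothetical, the model names in the example are made-up sample data rather than real pool contents, and query_pool assumes condor_status is on your PATH.

```python
import subprocess
from collections import Counter


def count_models(condor_output):
    """Tally GPU model names (one per line), most common first."""
    names = [line.strip() for line in condor_output.splitlines() if line.strip()]
    return Counter(names).most_common()


def query_pool():
    """Run the condor_status query from this exercise and return its text.

    Only works on a host where HTCondor is installed; it is defined
    here but not called.
    """
    result = subprocess.run(
        ["condor_status", "-const", "gpus >= 1", "-af", "CUDADeviceName"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Made-up sample input, for illustration only.
sample = "Model A\nModel B\nModel A\n"
for name, count in count_models(sample):
    print(count, name)
```

On a submit host, count_models(query_pool()) would give the same counts as the shell pipeline, ordered by frequency instead of alphabetically.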