GPU Exercise 1.3: Running a GPU job
To ensure that our job runs on a resource with an available GPU, we only need to
update two lines in the submit file. First, set
request_gpus = 1, which tells
HTCondor that the job needs a GPU. Second, specify a GPU-enabled
container image by adding
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.3"
to the submit file. Note that the container image used in the previous section was not
GPU-enabled. The updated submit file, named gpu-job.submit, incorporates both changes
and contains the following:
```
universe = vanilla

# Job requirements - ensure we are running on a Singularity enabled
# node and have enough resources to execute our code
# Tensorflow also requires AVX instruction set and a newer host kernel
Requirements = HAS_SINGULARITY == True && HAS_AVX2 == True && OSG_HOST_KERNEL_VERSION >= 31000
request_cpus = 1
request_gpus = 1
request_memory = 1 GB
request_disk = 1 GB

# Container image to run the job in
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.3"

# Executable is the program your job will run. It's often useful
# to create a shell script to "wrap" your actual work.
Executable = job-wrapper.sh
Arguments =

# Inputs/outputs - in this case we just need our python code.
# If you leave out transfer_output_files, all generated files come back
transfer_input_files = test.py
#transfer_output_files =

# Error and output are the error and output channels from your job
# that HTCondor returns from the remote host.
Error = $(Cluster).$(Process).error
Output = $(Cluster).$(Process).output

# The LOG file is where HTCondor places information about your
# job's status, success, and resource consumption.
Log = $(Cluster).log

# Send the job to Held state on failure.
#on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the jobs every 1 hour, up to a maximum of 5 retries.
#periodic_release = (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 60*60)

# queue is the "start button" - it launches any jobs that have been
# specified thus far.
queue 1
```
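The submit file's Executable points at job-wrapper.sh, which is not shown in this
exercise. A minimal sketch of what such a wrapper might look like (the commands and
checks here are assumptions for illustration, not the exercise's actual script):

```shell
#!/bin/bash
# job-wrapper.sh - hypothetical sketch; the exercise's actual wrapper
# script may differ.

HOST=$(hostname)
echo "Job running on: ${HOST}"

# Inside the tensorflow-gpu container, nvidia-smi reports the GPU that
# HTCondor assigned to the job; skip quietly if it is not on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
fi

# Run the TensorFlow script listed in transfer_input_files, if it is
# present in the job's working directory.
if [ -f test.py ]; then
    python3 test.py
fi
```

A wrapper like this keeps the submit file simple and gives you a place to record
diagnostic information (host, GPU) alongside your program's output.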
Submit this job with the command
condor_submit gpu-job.submit. Once the job is complete, check the
.output file for a line stating that the code ran on a GPU. You should see something similar to:
2021-02-02 23:25:19.022467: I tensorflow/core/common_runtime/eager/execute.cc:611] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0
The GPU:0 part of the log line above shows that a GPU was found and used for the computation.
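One way to check for that line without opening the file is to grep the job's
output file(s); the filename pattern follows the Output line in the submit file.
A sketch (the sample file created below is only for illustration; with a real job
the .output file already exists):

```shell
# For illustration only: create a sample .output file containing the
# kind of device-placement line TensorFlow logs (a real job writes
# this file itself).
echo "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0" > 12345.0.output

# Search the HTCondor output file(s) for GPU device-placement
# messages; any match means TensorFlow placed an op on a GPU.
grep "device:GPU" ./*.output
```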
GPUs on the Open Science Pool
Curious about which GPUs make up the Open Science Pool? Run
condor_status -const 'gpus >= 1' -af CUDADeviceName | sort | uniq -c to find out.
Which GPU models are most common?