Exercise 1.1: GPUs
For this exercise, we will use the `ap40.uw.osg-htc.org` access point. Log in:

```
$ ssh <USERNAME>@ap40.uw.osg-htc.org
```
Let's first explore what GPUs are available in the OSPool. Remember that the pool is dynamic - resources are being added and removed all the time - but we can at least find out what the current set of GPUs looks like. Run:

```
user@ap40 $ condor_status -const 'GPUs > 0'
```
Once you have that list, pick one of the resources and look at its ClassAd using the `-l` flag. For example:

```
user@ap40 $ condor_status -l [MACHINE]
```
Using the `-autoformat` flag, explore the different attributes of the GPUs. Then compare the `Mips` number of a GPU slot with that of a regular slot. Does the `Mips` number indicate that GPUs can be much faster than CPUs?
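For example, `-autoformat` (or `-af` for short) prints selected attributes, one machine per line. The attributes chosen here (`Machine`, `GPUs`, `Mips`) are standard HTCondor slot attributes, though the exact set available varies from slot to slot:

```
user@ap40 $ condor_status -const 'GPUs > 0' -af Machine GPUs Mips
```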
A sample GPU job
Create a file named `mytf.py` and `chmod` it to be executable. The content is a sample piece of TensorFlow code:
(The 17-line sample script is not reproduced here.)
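As a stand-in for the missing listing - this is a minimal sketch, not the exercise's actual code - a small TensorFlow script that reports the devices it can see and runs a simple computation might look like:

```python
#!/usr/bin/env python3
# Hypothetical stand-in for the exercise's sample script.
import tensorflow as tf

# Report the TensorFlow version and any GPUs TensorFlow can see.
print("TensorFlow version:", tf.__version__)
print("GPUs visible:", tf.config.list_physical_devices("GPU"))

# A small matrix multiplication; TensorFlow will place it on a GPU
# automatically if one is available.
a = tf.random.uniform((1000, 1000))
b = tf.random.uniform((1000, 1000))
c = tf.matmul(a, b)
print("Result shape:", c.shape)
```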
Then, create a submit file to run the code on a GPU, using a TensorFlow container image. The new bits of the submit file are provided below, but you will have to fill in the rest from what you have learned earlier in the User School.
```
universe = container
container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1

executable = mytf.py

request_gpus = 1
```
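Combined with the standard pieces from earlier exercises, a complete submit file might look like the sketch below. The resource requests and log/output/error filenames are placeholders, not values from the exercise - adjust them to taste:

```
universe = container
container_image = /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.2-cuda-10.1

executable = mytf.py
request_gpus = 1

# Placeholder resource requests and filenames
request_cpus = 1
request_memory = 2GB
request_disk = 2GB

log = mytf.log
output = mytf.out
error = mytf.err

queue
```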
Note that TensorFlow also requires the AVX2 CPU extensions. Remember which microarchitectures AVX2 is available in. Add a `requirements` line stating that `Microarch` has to be one of those two (the operator for *or* in ClassAd expressions is `||`).
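Assuming the two AVX2-capable values of `Microarch` are `x86_64-v3` and `x86_64-v4` (AVX2 is part of the x86-64-v3 feature level, and v4 is a superset of v3), the line might look like:

```
requirements = (Microarch == "x86_64-v3") || (Microarch == "x86_64-v4")
```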
Submit the job and watch the queue. Did the job start running as quickly as when we ran CPU jobs? Why/why not?
Examine the out/err files. Do they indicate somewhere that the job was mapped to a GPU? (Hint: search for `Created TensorFlow device`.)
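For example, assuming your submit file wrote the job's stderr to a file named `mytf.err` (a placeholder name - use whatever your submit file specifies), you could search it with:

```
user@ap40 $ grep 'Created TensorFlow device' mytf.err
```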
Keep a copy of the out/err files. Then modify the submit file to not run on a GPU, and try the job again. Did the job work? Does the err from the CPU job look anything like the GPU err?