Organizing HTC Workloads
Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author.
This exercise is similar to HTCondor exercise 2.4, in that it is about counting word frequencies in multiple files. But the focus here is on organizing the files more effectively on the Access Point, with an eye to scaling up to a larger HTC workload in the future.
Log into an OSPool Access Point
Make sure you are logged into ap40.uw.osg-htc.org.
Get Files
To get the files for this exercise:
- Type wget https://github.com/osg-htc/user-school-2023/raw/main/docs/materials/scaling/files/osgus23-day4-ex11-organizing-files.tar.gz to download the tarball.
- As you learned earlier, expand this tarball file; it will create an organizing-files directory.
- Change to that directory, or create a separate one for this exercise and copy the files in.
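The expand step can be sketched as follows, assuming the tarball was downloaded into the current directory under its original name. `unpack_exercise` is a hypothetical helper name and is not part of the exercise materials; `tar -xzf` and `tar -tzf` are the standard options for extracting and listing a gzip-compressed tarball.

```shell
# Hypothetical helper (not part of the exercise files): extract a
# gzip-compressed tarball and print the top-level names it contains.
unpack_exercise() {
    tarball="$1"
    # Stop early with a clear message if the download did not succeed.
    if [ ! -f "$tarball" ]; then
        echo "missing tarball: $tarball" >&2
        return 1
    fi
    # -x extract, -z gunzip, -f read from the named file
    tar -xzf "$tarball" || return 1
    # -t lists the archive; keep only the first path component, deduplicated
    tar -tzf "$tarball" | cut -d/ -f1 | sort -u
}

# Usage, assuming the file name from the wget command above:
#   unpack_exercise osgus23-day4-ex11-organizing-files.tar.gz
#   cd organizing-files
```

Listing the top-level names after extraction is a quick way to confirm which directory to change into.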
Our Workload
We can analyze one book by running the wordcount.py script with the name of the book we want to analyze:
$ ./wordcount.py Alice_in_Wonderland.txt
Try running the command to see what output the script produces.
Once you have done that, delete the output file it created (rm counts.Alice_in_Wonderland.txt).
We want to run this script on all the books we have copies of.
- What is the input set for this HTC workload?
- What is the output set?
Make an Organization Plan
Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point?
There will also be system and HTCondor files produced when we submit a job — how would you organize the log, standard output, and standard error files?
Try making those changes before moving on to the next section of the tutorial.
Organize Files
There are many different ways to organize files; a simple method that works for most workloads is having a directory for your input files and a directory for your output files.
-
Set up this structure on the command line by running:
$ mkdir input
$ mv *.txt input/
$ mkdir output
-
View the current directory and its subdirectories by using the ls command with the recursive (-R) flag:
$ ls -R
README.md  books.submit  input  output  wordcount.py

./input:
Alice_in_Wonderland.txt  Huckleberry_Finn.txt
Dracula.txt  Pride_and_Prejudice.txt

./output:
-
Next, create directories for the HTCondor log files and for the standard output and standard error files (the latter two in one directory):
$ mkdir logs
$ mkdir errout
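For a workload you expect to set up more than once, the same layout can be created in a single re-runnable step. This is a minimal sketch, assuming the book files are the only `.txt` files in the current directory; `organize_workload` is a hypothetical helper name.

```shell
# Hypothetical helper: create the directory layout used in this exercise.
# Safe to re-run: mkdir -p does not fail if a directory already exists.
organize_workload() {
    mkdir -p input output logs errout
    # Move book files into input/ if any are still in this directory
    # (assumes every .txt file here is a book).
    for f in *.txt; do
        [ -e "$f" ] || continue   # the glob matched nothing
        mv "$f" input/
    done
}

# Usage, from the exercise directory:
#   organize_workload
#   ls -R
```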
Submit One Job
Now we want to submit a test job that uses this organizing scheme, using just one item in our input set. In this example, we will use the Alice_in_Wonderland.txt file from our input directory.
-
Fill in the incomplete lines of the submit file, as shown below:
executable = wordcount.py
arguments = Alice_in_Wonderland.txt
transfer_input_files = input/Alice_in_Wonderland.txt
transfer_output_files = counts.Alice_in_Wonderland.txt
transfer_output_remaps = "counts.Alice_in_Wonderland.txt=output/counts.Alice_in_Wonderland.txt"
To tell HTCondor the location of the input file, we need to include the input directory. Also, this submit file uses the transfer_output_remaps feature that you learned about; it will move the output file to the output directory by renaming or remapping it.
-
Next, edit the submit file lines that tell the log, output, and error files where to go:
output = errout/job.$(ClusterID).$(ProcID).out
error = errout/job.$(ClusterID).$(ProcID).err
log = logs/job.$(ClusterID).$(ProcID).log
-
Submit your job and monitor its progress.
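Before submitting, it can help to confirm that every directory the submit file refers to actually exists. This sketch assumes HTCondor will not create the `input`, `output`, `logs`, or `errout` directories for you, in which case a missing one can cause the submission to fail or the job to go on hold. `check_dirs` is a hypothetical helper name; `books.submit` is the submit file shipped in the exercise directory.

```shell
# Hypothetical pre-submit check: report each missing directory on stderr
# and return non-zero if at least one is absent.
check_dirs() {
    missing=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            echo "missing directory: $d" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage on the Access Point, from the exercise directory:
#   check_dirs input output logs errout && condor_submit books.submit
#   condor_q    # monitor the job
```

Chaining the check with `&&` means `condor_submit` only runs when all of the directories are present.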
Submit Multiple Jobs
Now, you are ready to submit the whole workload.
-
Create a file with the list of input files (the input set); here, this is the list of the book files to analyze. Do this by using the shell ls command and redirecting its output to a file:
$ ls input > booklist.txt
$ cat booklist.txt
-
Modify the submit file to reference the file of inputs and replace the fixed value (Alice_in_Wonderland.txt) with a variable ($(book)):
executable = wordcount.py
arguments = $(book)
transfer_input_files = input/$(book)
transfer_output_files = counts.$(book)
transfer_output_remaps = "counts.$(book)=output/counts.$(book)"

queue book from booklist.txt
-
Submit the jobs.
-
When the jobs are complete, look at the full set of input and (now) output files to see how they are organized.
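As the workload grows, checking outputs by eye stops being practical. One way to compare the output set against the input set is sketched below, assuming each job writes `counts.<book>` into `output/` as configured by `transfer_output_remaps`. `missing_outputs` and `retry.txt` are hypothetical names used only for illustration.

```shell
# Hypothetical helper: print every book named in the input list
# (default: booklist.txt) that has no matching file in output/.
missing_outputs() {
    list="${1:-booklist.txt}"
    while IFS= read -r book; do
        [ -n "$book" ] || continue            # skip blank lines
        if [ ! -f "output/counts.$book" ]; then
            echo "$book"
        fi
    done < "$list"
}

# Usage, after the jobs have finished:
#   missing_outputs               # prints nothing if every book has a result
#   missing_outputs > retry.txt   # or save the books that still need a run
```

Because the helper prints names in the same format as booklist.txt, its output could serve as the input list for a follow-up submission.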