HTC Exercise 2.3: Submit with “queue from”¶
Exercise Goals¶
While using queue *N*
, $(Cluster)
, and $(Process)
can be useful for some jobs, it's not great for all jobs. What if you want to submit the same task for differently named files? What if you want to iterate over non-numerical parameters?
In the next two exercises, you will explore more ways to use a single
submit file to submit many jobs. The goal of this exercise is to submit many
jobs from a single submit file by using the queue ... from
syntax to read
variable values from a file.
Background¶
When submitting many jobs from a single submit file, always consider:
- What makes each job unique? In other words, there is one job per _____?
- How should you tell HTCondor to distinguish each job?
For queue *N*
, jobs are distinguished by the built-in $(Process)
variable. But what if you need something more customized or something that's not a number? HTCondor can do that! HTCondor can distinguish and submit multiple jobs using custom variables.
Example: Counting Words in Files¶
Imagine you have a collection of books, and you want to analyze how word usage varies from book to book or author to author. You could create separate submit files for each book and submit all of the files manually, but you'd have to edit the file each time.
Below is an example submit file for this task. If we created a submit file for each text, we'd have to change the last five lines above the queue
statement for each text we want to analyze.
executable = freq.py
request_memory = 1GB
request_disk = 20MB
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = Alice_in_Wonderland.txt
arguments = Alice_in_Wonderland.txt
output = Alice_in_Wonderland.txt
error = Alice_in_Wonderland.err
log = Alice_in_Wonderland.log
queue
This would be overly verbose and tedious! Let's see a better way to automate this.
Queue Jobs From a List of Values¶
Suppose we want to modify our word-frequency analysis from a previous exercise so that it outputs only the top N most common words of a document. However, we want to experiment with different values of N.
For this analysis, we will have a new version of the word-frequency counting script. First, we need a new version of the word counting program so that it accepts an extra number as a command line argument and prints only that many of the most common words. Here is the new code (it's not necessary to understand this code):
#!/usr/bin/env python3
import os
import sys
import operator
if len(sys.argv) != 3:
print(f'Usage: {os.path.basename(sys.argv[0])} DATA NUM_WORDS')
sys.exit(1)
input_filename = sys.argv[1]
num_words = int(sys.argv[2])
words = {}
with open(input_filename, 'r') as my_file:
for line in my_file:
line_words = line.split()
for word in line_words:
if word in words:
words[word] += 1
else:
words[word] = 1
sorted_words = sorted(words.items(), key=operator.itemgetter(1))
for word in sorted_words[-num_words:]:
print(f'{word[0]} {word[1]:8d}')
To submit this program with a collection of two variable values for each run, one for the number of top words and one for the filename:
- Save the script as
wordcount-top-n.py
. -
Download and unpack some books from Project Gutenberg:
[username@ap40]$ pelican object get osdf://ospool/uc-shared/public/school/2025/books.tar.gz . [username@ap40]$ tar -xzvf books.tar.gz
-
You'll notice that the books are in a directory called
books/
. For simplicity's sake, let's move the files to the current directory and remove the now-emptybooks/
directory.[username@ap40]$ mv books/*.txt . [username@ap40]$ rmdir books/
-
Create a new submit file (or base it off a previous one!) named
wordcount-top.sub
, including memory and disk requests of 20 MB. - All of the jobs will use the same
executable
andlog
statements. -
Update other statements to work with two variables,
book
andn
:output = $(book)_top_$(n).out error = $(book)_top_$(n).err transfer_input_files = $(book) arguments = $(book) $(n) queue book, n from books_n.txt
Note especially the changes to the
queue
statement; it now tells HTCondor to read a separate text file of pairs of values, which will be assigned tobook
andn
respectively. -
Create the separate text file of job variable values and save it as
books_n.txt
:Alice_in_Wonderland.txt, 10 Alice_in_Wonderland.txt, 25 Alice_in_Wonderland.txt, 50 Pride_and_Prejudice.txt, 10 Pride_and_Prejudice.txt, 25 Pride_and_Prejudice.txt, 50 Dracula.txt, 10 Dracula.txt, 25 Dracula.txt, 50 Huckleberry_Finn.txt, 10 Huckleberry_Finn.txt, 25 Huckleberry_Finn.txt, 50
Note that we used 3 different values for n for each book.
-
Submit the file.
- Do a quick sanity check: How many jobs were submitted? How many log, output, and error files were created?
Extra Challenge¶
You may have noticed that the output of these jobs has a messy naming convention. Because our macros resolve to the filenames, including their extension (e.g., Alice_in_Wonderland.txt.txt
), the output filenames contain with multiple extensions (e.g., Alice_in_Wonderland.txt.txt.err
). Although the extra extension is acceptable, it makes the filenames harder to read and possibly organize. Change your submit file and variable file for this exercise so that the output filenames do not include the .txt
extension.