Organizing HTC Workloads¶
Exercise Goal¶
When working with large datasets or running a task many times over (e.g. across subsets, parameters, or files), organizing your workload becomes essential. High-throughput computing (HTC) allows you to run thousands of independent jobs efficiently, but success depends on how well you prepare your files and manage your project structure.
This exercise introduces core principles of HTC workflow organization through a practical example. While the specific task—read mapping—comes from bioinformatics, the underlying strategies apply to any domain that handles data at scale.
[INSERT MINIMAP WORKFLOW - YOU ARE HERE]
You'll learn how to:
- Plan and organize your workload for high-throughput execution
- Structure input, output, and support files for efficient job management
- Track outputs, logs, and errors across many jobs in a clean, reproducible way
In this exercise, you’ll use a typical bioinformatics read mapping workflow to explore how to sustainably scale your workloads on the OSPool. You’ll focus on job-level organization, building a multi-job submit file, tracking job progress with logs, and troubleshooting job failures.
Log into an OSPool Access Point¶
Make sure you are logged into ap40.uw.osg-htc.org
.
Get Files¶
To get the files for this exercise:
- Type
wget https://github.com/osg-htc/school-2025/raw/main/docs/materials/scaling/files/osgus25-day4-ex11-organizing-files.tar.gz
to download the tarball. - As you learned earlier, expand this tarball file; it will create a
organizing-files
directory. - Change to that directory, or create a separate one for this exercise and copy the files in.
Our Workload¶
Imagine you're working with a large set of DNA sequencing data from an organism. The sequencing file contains millions of snippets of DNA, which we call reads. Your goal is to figure out where these reads came from by matching them to a known reference genome using a read mapping tool.
One commonly used tool for this task is minimap2
, which quickly finds where each DNA read best matches the reference. For this exercise, we'll use minimap2
to map our reads to a reference genome.
To do that, we might run a command like:
$ minimap2 -ax map-ont [reference_genome.fasta] [sequencing_reads.fastq] > output.sam
Here’s what these files are:
- reference_genome.fasta
: a file containing the complete reference genome we’re mapping to.
- sequencing_reads.fastq
: a file containing millions of DNA fragments, called reads, from a sequencing machine.
- output.sam
: a SAM file, which is a standard text format, used to store the results of mapping—where each read aligns in the genome.
We want to run this command to map all our reads against the reference genome. FASTQ files contain 4 lines per read, you can run the following command to calculate the number of reads in your FASTQ file:
$ expr $(wc -l < reads.fastq) / 4
$ 49382043
Read mapping using algorithms, like minimap2
, do not scale up well by simply adding additional CPUs to the problem. These mappers typically plateau their speed around 2-4 CPUs. This problem, however, can be solved by employing a "divide and conquer" approach. With this approach, we can subdivide our input reads.fastq file into smaller subsets which can be submitting to HTCondor as a set of independent parallel-running jobs.
We expect to be working with the following files in each of our jobs:
- Inputs
- A subset of reads.fastq -
reads_subset_a.fastq
- The reference genome -
reference_genome.fasta
- A minimap2 container image -
minimap2.sif
- The executable -
run_minimap2.sh
- A subset of reads.fastq -
- Outputs
- A SAM-formatted output file -
reads_subset_a.sam
[what is SAM-formatted?] - [Does the above section clarify or need further workup?]
- A SAM-formatted output file -
- System Generated Files
- A set of log, standard error, and standard out files
Make an Organization Plan¶
Based on what you know about the script, inputs, and outputs, how would you organize this HTC workload in directories (folders) on the Access Point?
There will also be system and HTCondor files produced when we submit a job — how would you organize the log, standard output, and standard error files?
When to Use the Open Science Data Federation (OSDF)
Make sure to consider which files will be re-used often (common files across all jobs) versus which files will be used only once. Files often re-used, can be placed in your /ospool/ap40/data/<user.name>/
directory to take advantage of the caching benefits when using the osdf://
transfer plugin.
[LARGE FILES ONLY]
Organize Files¶
There are many different ways to organize files; a simple method that works for most workloads is having a directory for your input files and a directory for your output files.
For our exercise, we will use the following data organizational structure:
├── /home/<user.name>/
│ ├── scaling-up
│ │ ├── inputs
│ │ ├── outputs
│ │ ├── logs
│ │ │ ├── log
│ │ │ ├── error
│ │ │ ├── output
├── /ospool/ap40/data/<user.name>/
│ ├── scaling-up
│ │ ├── inputs
│ │ ├── software
-
Set up this structure on the command line by running:
$ mkdir -p inputs $ mkdir -p outputs $ mkdir -p logs $ mkdir -p logs/log $ mkdir -p logs/error $ mkdir -p logs/output $ cd /ospool/ap40/data/$USER/ $ mkdir -p scaling-up/inputs $ mkdir -p scaling-up/software
FOR STAFF REVIEWERS: DO WE PREFER?
Do we prefer below:
$ mkdir -p inputs outputs logs/{log,error,output} $ cd /ospool/ap40/data/$USER/ $ mkdir -p scaling-up/{inputs,software}
-
Move the
reads.fastq
file to yourinputs
directory (the one under/home
) using themv
command. -
Move the
reference_genome.fasta
file to your/ospool/ap40/data/<user.name>/scaling-up/inputs
directory using themv
command. -
Move the
minimap2.sif
container image file to your/ospool/ap40/data/<user.name>/scaling-up/software
directory using themv
command.
Stop and Consider: Why are we moving our files to these directories?
When preparing your jobs for high-throughput computing, think about how each file will be used. Will the file be reused across many jobs? Or is it specific to a single job? Will it need to be transferred repeatedly—or just once?
With that in mind:
- Why might it make sense to place reads.fastq
in your /home
directory?
- Why are we storing reference_genome.fasta
and minimap2.sif
in /ospool/ap40/data/
instead?
Solution
Files in /home
are transferred to each job, directly from the AP to the EP. These files do not get cached in the OSDF, which makes sense for files that change across jobs—such as subsets of reads.fastq
.
But reference_genome.fasta
and minimap2.sif
are the same in every job. By storing them in /ospool/ap40/data/
and using osdf://
to access them, we avoid repeatedly transferring them. Instead, they're cached near where jobs run, which speeds things up and reduces network load.
Wrangling the Data¶
To get ready for our mapping step, we need to prepare our read files. This includes two crucial steps, splitting our reads.
-
Navigate to your
~/scaling-up/inputs/
directory.cd ~/scaling-up/inputs/
-
Split the FASTQ file into subsets. We can divide our
reads.fastq
into subsets of500,000
reads per subset. Since each FASTQ read consist of four lines in the FASTQ file, we can splitreads.fastq
every 2,000,000 lines.split -l 2000000 reads.fastq reads_fastq_chunk_ rm reads.fastq
This will generate 99 read subset files with the prefix
eads_fastq_chunk_
.Subsetting Data - Your Milage May Vary
When subsetting your data, you should exercise your own judgement. The ideal job profile on the OSPool typically looks something like:
- Runtime: Between 10mins and <10hrs
- Memory (RAM): 1-5 GB
- Inputs: <1 GB
- Outputs: <1 GB
Jobs with larger profiles may still run, but will likely face significant increases in
idle
state while waiting for a machine to match. -
Delete (or move) the
reads.fastq
file from the input directory.
In the next exercise, we will generate a list of jobs for HTCondor using the the subset files in this directory. To avoid accidentally including the reads.fastq
file in our list of jobs, please delete it or move it out of this directory.
Organization in HTC is Critical!
One of the most important steps in scaling up our workflows to run on HTC systems, such as the OSPool, is maintaining a clear organizational structure. This includes deleting files we will not be using anymore. For the rest of the exercise, we will not be using the reads.fastq
file after splitting it. Not deleting this file can cause downstream issues. Do not skip this step.
Review Your Progress¶
Now that we've set up our directory structure and pre-processed our data, we can focus on preparing our submission scripts and getting our jobs running!
Checking your progress
Before you move on, take this time to comb through your directory structure and compare it to the structure below.
View the current directory and its subdirectories by using the ls
command with the recursive (-R
) flag. Your overall structure should look something like this:
├── /home/<user.name>/
│ ├── scaling-up
│ │ ├── inputs
│ │ │ ├── reads_fastq_chunk_a
│ │ │ ├── reads_fastq_chunk_b
│ │ │ ├── reads_fastq_chunk_c
│ │ ├── outputs
│ │ ├── logs
│ │ │ ├── log
│ │ │ ├── error
│ │ │ ├── output
├── /ospool/ap40/data/<user.name>/
│ ├── scaling-up
│ │ ├── inputs
│ │ │ ├── reference_genome.fasta
│ │ ├── software
│ │ │ ├── minimap2.sif