Data Exercise 1.1: Understanding Data Requirements¶
Exercise Goal¶
The goal of this exercise is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools that deliver large data to jobs. In this exercise, we will attempt to understand the input and output of the bioinformatics application BLAST.
Setup¶
For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in:

$ ssh <USERNAME>@ap40.uw.osg-htc.org

Create a directory for this exercise named blast-data and change into it.
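For example:

user@ap40 $ mkdir blast-data
user@ap40 $ cd blast-data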
Copy the Input Files¶
To run BLAST, we need the executable, an input file, and a reference database. For this example, we'll use the "pdbaa" database, which contains sequences for the protein structures in the Protein Data Bank. For our input file, we'll use an abbreviated FASTA file with mouse genome information.
- Copy the BLAST executables:

  user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/ncbi-blast-2.12.0+-x64-linux.tar.gz
  user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz
- Download these files to your current directory:

  user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/pdbaa.tar.gz
  user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2024/mouse.fa
- Untar the pdbaa database:

  user@ap40 $ tar -xzvf pdbaa.tar.gz
Understanding BLAST¶
Remember that blastx is executed with a command like the following:

user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db <DATABASE ROOTNAME> -query <INPUT FILE> -out <RESULTS FILE>

In the above, the <INPUT FILE> is the name of a file containing a number of genetic sequences (e.g. mouse.fa), and the database that these are compared against is made up of several files that begin with the same <DATABASE ROOTNAME> (e.g. pdbaa/pdbaa). The output from this analysis will be written to the <RESULTS FILE> that is also indicated in the command.
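For example, with the files prepared above, a concrete invocation might look like the following (the results filename results.txt is just an illustrative choice, not a name defined by the exercise):

user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt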
Calculating Data Needs¶
Using the files that you prepared in blast-data, we will calculate how much disk space is needed if we were to run a hypothetical BLAST job with a wrapper script (sketched below), where the job:

- Transfers all of its input files (including the executable) as tarballs
- Untars the input file tarballs on the execute host
- Runs blastx using the untarred input files
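To make this concrete, here is a minimal sketch of what such a wrapper script might look like, assuming the tarball and input file names downloaded above; the script name wrapper.sh and the results filename results.txt are illustrative choices, not files provided by the exercise:

#!/bin/bash
# wrapper.sh (illustrative sketch, not part of the exercise files)
# Untar the transferred inputs on the execute host
tar -xzf ncbi-blast-2.12.0+-x64-linux.tar.gz
tar -xzf pdbaa.tar.gz
# Run blastx against the untarred database
./ncbi-blast-2.12.0+/bin/blastx -db pdbaa/pdbaa -query mouse.fa -out results.txt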
Here are some commands that will be useful for calculating your job's storage needs:

- List the size of a specific file:

  user@ap40 $ ls -lh <FILE NAME>

- List the sizes of all files in the current directory:

  user@ap40 $ ls -lh

- Sum the size of all files in a specific directory:

  user@ap40 $ du -sh <DIRECTORY>
Input requirements¶
Total up the amount of data in all of the files necessary to run the blastx wrapper job, including the executable itself. Write down this number.

Also take note of how much total data is in the pdbaa directory.
Compressed Files
Remember, blastx reads the uncompressed pdbaa files.
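One way to total up the transferred input files, and to check the size of the untarred pdbaa directory, assuming the filenames used above:

user@ap40 $ du -ch ncbi-blast-2.12.0+-x64-linux.tar.gz pdbaa.tar.gz mouse.fa | tail -n 1
user@ap40 $ du -sh pdbaa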
Output requirements¶
The output that we care about from blastx is saved in the file whose name is given after the -out argument to blastx.

Also remember that HTCondor creates the error, output, and log files, which you'll need to add up, too. Are there any other files? Total all of these together, as well.
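After a test run, one way to total these would be something like the following, where results.txt is the file named by -out and blast.log, blast.out, and blast.err are illustrative names for the HTCondor log, output, and error files (your submit file may use different names):

user@ap40 $ du -ch results.txt blast.log blast.out blast.err | tail -n 1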
Up next!¶
Next, you will create an HTCondor submit file to transfer the BLAST input files in order to run BLAST on a worker node. Next Exercise