Data Exercise 1.1: Understanding Data Requirements
The goal of this exercise is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools for delivering large data to jobs. In this exercise we will examine the input and output of the bioinformatics application BLAST.
Log in to
Create a directory for this exercise named blast-data and change into it
Copy the Input Files
To run BLAST, we need the executable, an input file, and a reference database. For this example, we'll use the "pdbaa" database, which contains sequences for the protein structures in the Protein Data Bank. For our input file, we'll use an abbreviated FASTA file with mouse genome information.
Copy the BLAST executables:
Download these files to your current directory:
$ tar -xzvf pdbaa.tar.gz
blastx is executed in a command like the following:
$ ./blastx -db <DATABASE ROOTNAME> -query <INPUT FILE> -out <RESULTS FILE>
In the above, <INPUT FILE> is the name of a file containing a number of genetic sequences, and the database that these are compared against is made up of several files that begin with the same <DATABASE ROOTNAME>. The output from this analysis will be printed to the <RESULTS FILE> that is also indicated in the command.
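Because the database is a set of files sharing one rootname rather than a single file, it can help to list those files before estimating sizes. The following is a minimal sketch only: the function name list_db_files is hypothetical, and it assumes the rootname is given as a path prefix, in the same form it would be passed to -db.

```shell
#!/bin/bash
# Hypothetical helper (not part of BLAST): print every regular file whose
# path begins with the given database rootname.
list_db_files() {
    local rootname="$1"
    local f
    for f in "$rootname"*; do
        # Skips the unexpanded pattern when nothing matches, and skips directories.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Usage: list_db_files <DATABASE ROOTNAME>
```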
Calculating Data Needs
Using the files that you prepared in blast-data, we will calculate how much disk space would be needed if we were to run a hypothetical BLAST job with a wrapper script, where the job:
- Transfers all of its input files (including the executable) as tarballs
- Untars the input tarballs on the execute host
- Runs blastx using the untarred input files
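The steps above can be sketched as a small wrapper function. This is an illustration under stated assumptions, not the script used in the exercise: every argument name is a placeholder, and the executable tarball is assumed to unpack blastx into the current directory.

```shell
#!/bin/bash
# Sketch of the hypothetical wrapper job. All names here are placeholders.
run_blast_job() {
    local exe_tarball="$1"    # tarball assumed to unpack ./blastx
    local db_tarball="$2"     # tarball holding the database files
    local db_rootname="$3"    # value for -db once the database is untarred
    local query_file="$4"     # value for -query
    local results_file="$5"   # value for -out

    # Untar the input tarballs on the execute host; stop if either fails.
    tar -xzf "$exe_tarball" || return 1
    tar -xzf "$db_tarball" || return 1

    # Run blastx using the untarred input files.
    ./blastx -db "$db_rootname" -query "$query_file" -out "$results_file"
}
```

Note that while such a job runs, its working directory holds both the tarballs and their untarred contents, so both count toward the disk space estimate.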
Here are some commands that will be useful for calculating your job's storage needs:
List the size of a specific file:
$ ls -lh <FILE NAME>
List the sizes of all files in the current directory:
$ ls -lh
Sum the size of all files in a specific directory:
$ du -sh <DIRECTORY>
Total up the amount of data in all of the files necessary to run the
blastx wrapper job, including the executable itself.
Write down this number.
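Human-readable sizes from ls -lh and du -sh are awkward to add by hand. As one possible shortcut, the sketch below sums exact byte counts for a list of files using wc -c. The helper name total_bytes is hypothetical and not part of the exercise.

```shell
#!/bin/bash
# Hypothetical helper: print the combined size, in bytes, of the given files.
total_bytes() {
    local total=0
    local size f
    for f in "$@"; do
        # wc -c reads the file from stdin and prints only its byte count.
        size=$(wc -c < "$f") || return 1
        total=$((total + size))
    done
    echo "$total"
}

# Usage: total_bytes <FILE NAME> <FILE NAME> ...
```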
Also take note of how much total data is in the
blastx reads the un-compressed
The output that we care about from blastx is saved in the file whose name is indicated after the -out argument to blastx.
Remember that HTCondor also creates the error, output, and log files, which you'll need to add to the total.
Are there any other files?
Total all of these together, as well.
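The final figure is a sum over several categories: the input tarballs, their untarred contents, the blastx results file, and the HTCondor error, output, and log files. A minimal sketch of that arithmetic follows, assuming you have each size in bytes. The function name is hypothetical, and rounding up to whole MiB is a choice made here so that the estimate never falls below the true total.

```shell
#!/bin/bash
# Hypothetical helper: add any number of byte counts and print the total
# in whole MiB, rounded up.
bytes_to_mib_ceil() {
    local total=0
    local n
    for n in "$@"; do
        total=$((total + n))
    done
    # Integer ceiling division by 1 MiB (1048576 bytes).
    echo $(( (total + 1048575) / 1048576 ))
}

# Usage: bytes_to_mib_ceil <TARBALL BYTES> <UNTARRED BYTES> <OUTPUT BYTES> ...
```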
Next, you will create an HTCondor submit file to transfer the BLAST input files in order to run BLAST on a worker node.