Data Exercise 1.1: Understanding Data Requirements

Exercise Goal

The goal of this exercise is to learn to think critically about an application's data needs, especially before submitting a large batch of jobs or using tools that deliver large amounts of data to jobs. We will examine the input and output of the bioinformatics application BLAST.

Setup

For this exercise, we will use the ap40.uw.osg-htc.org access point. Log in:

$ ssh <USERNAME>@ap40.uw.osg-htc.org

Create a directory for this exercise named blast-data and change into it.
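
One way to do this is sketched below. The `-p` flag is an optional convenience, not something the exercise requires: it keeps `mkdir` from failing if the directory already exists.

```shell
# Create the exercise directory (no error if it already exists) and enter it
mkdir -p blast-data
cd blast-data
```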

Copy the Input Files

To run BLAST, we need the executable, an input file, and a reference database. For this example, we'll use the "pdbaa" database, which contains the sequences of protein structures from the Protein Data Bank. For our input file, we'll use an abbreviated FASTA file with mouse genome information.

  1. Copy the BLAST executables:

    user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/ncbi-blast-2.12.0+-x64-linux.tar.gz
    user@ap40 $ tar -xzvf ncbi-blast-2.12.0+-x64-linux.tar.gz
    
  2. Download the pdbaa database tarball and the mouse.fa input file to your current directory:

    user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/pdbaa.tar.gz
    user@ap40 $ wget http://proxy.chtc.wisc.edu/SQUID/osg-school-2023/mouse.fa
    
  3. Untar the pdbaa database:

    user@ap40 $ tar -xzvf pdbaa.tar.gz
    

Understanding BLAST

Remember that blastx is run with a command like the following:

user@ap40 $ ./ncbi-blast-2.12.0+/bin/blastx -db <DATABASE ROOTNAME> -query <INPUT FILE> -out <RESULTS FILE>

In the above, <INPUT FILE> is the name of a file containing a number of genetic sequences (e.g. mouse.fa), and the database that these are compared against is made up of several files that begin with the same <DATABASE ROOTNAME> (e.g. pdbaa/pdbaa). The output from this analysis is written to the <RESULTS FILE> named in the command.

Calculating Data Needs

Using the files that you prepared in blast-data, we will calculate how much disk space is needed if we were to run a hypothetical BLAST job with a wrapper script, where the job:

  • Transfers all of its input files (including the executable) as tarballs
  • Untars the input tarballs on the execute host
  • Runs blastx using the untarred input files

Here are some commands that will be useful for calculating your job's storage needs:

  • List the size of a specific file:

    user@ap40 $ ls -lh <FILE NAME>
    
  • List the sizes of all files in the current directory:

    user@ap40 $ ls -lh
    
  • Sum the sizes of all files in a specific directory:

    user@ap40 $ du -sh <DIRECTORY>
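
The `-h` flag in the commands above rounds sizes to human-readable units, which makes adding several files by hand imprecise. As a minimal sketch, a small helper can total exact byte counts for a list of files. The function name `total_bytes` is hypothetical and not part of the exercise materials.

```shell
# Print the combined size, in bytes, of the files named as arguments.
# (total_bytes is a hypothetical helper, not part of the exercise materials.)
total_bytes() {
    total=0
    for f in "$@"; do
        # wc -c reports the byte count of the file read from standard input
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Example, run from the blast-data directory:
#   total_bytes ncbi-blast-2.12.0+-x64-linux.tar.gz pdbaa.tar.gz mouse.fa
```

The result is in bytes, so divide by 1024 twice to get an approximate number of megabytes.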
    

Input requirements

Total up the amount of data in all of the files necessary to run the blastx wrapper job, including the executable itself. Write down this number. Also take note of how much total data is in the pdbaa directory.

Compressed Files

Remember, blastx reads the uncompressed pdbaa files.
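
To compare the compressed and uncompressed footprint of the database, the sketch below totals the bytes of every regular file under one or more paths, each of which may be a file or a directory. The function name `tree_bytes` is hypothetical. Because it reads every file it can be slow on large directories, and `du -sh` remains the quicker estimate.

```shell
# Combined size, in bytes, of every regular file under the given paths.
# Each path may be a single file or a directory.
# (tree_bytes is a hypothetical helper, not part of the exercise materials.)
tree_bytes() {
    # cat streams every file that find locates; wc -c counts the bytes;
    # tr strips the padding some wc implementations add around the number
    find "$@" -type f -exec cat {} + | wc -c | tr -d ' '
}

# Example, run from the blast-data directory after untarring pdbaa.tar.gz:
#   compressed=$(tree_bytes pdbaa.tar.gz)
#   unpacked=$(tree_bytes pdbaa)
#   echo "tarball: $compressed bytes, unpacked: $unpacked bytes"
```

Unlike `du`, which reports disk blocks in use, this counts file contents only, so the two numbers may differ slightly.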

Output requirements

The output that we care about from blastx is saved in the file named by the -out argument to blastx. Remember that HTCondor also creates the error, output, and log files, which you'll need to add up, too. Are there any other files? Total all of these together as well.

Up next!

Next you will create an HTCondor submit file to transfer the BLAST input files in order to run BLAST on a worker node.