Data Exercise 2.1: Using a Web Proxy for Large Shared Input¶
Continuing the series of exercises that BLAST mouse genetic sequences, the objective of this exercise is to use a web proxy to stage the large database file, which will be downloaded into each of many jobs that use the split input files from the last exercise (Exercise 1.3).
Setup¶
- Make sure you are logged into login05.osgconnect.net.
- Make sure you are in the same directory as the previous exercise (Exercise 1.3), named blast-split.
Place the Large File on the Proxy¶
First, you'll need to put the pdbaa_files.tar.gz file onto the Stash web directory. Use the following command:
user@login05 $ cp pdbaa_files.tar.gz /public/<USERNAME>
Replace <USERNAME> with your username.
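As a quick, optional sanity check (the exact size and permissions you see will differ), you can confirm that the file landed in your public directory:
user@login05 $ ls -l /public/<USERNAME>/pdbaa_files.tar.gz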
Test a download of the file¶
Once the file is placed in your /public directory, it can be downloaded from a corresponding URL such as http://stash.osgconnect.net/public/<USERNAME>/pdbaa_files.tar.gz, where <USERNAME> is your username on login05.osgconnect.net.
Using the above convention (and from any other directory on login05.osgconnect.net), you can test the download of your pdbaa_files.tar.gz file with a command like the following:
user@login05 $ wget http://stash.osgconnect.net/public/<USERNAME>/pdbaa_files.tar.gz
Again, replace <USERNAME> with your own username.
You may realize that you've been using wget to download files from a web proxy for many of the previous exercises at the school!
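If you'd like to verify that the downloaded copy is intact (an optional check, not part of the exercise itself), you can compare checksums of the file in /public and the copy you just downloaded; the two sums should match:
user@login05 $ md5sum /public/<USERNAME>/pdbaa_files.tar.gz pdbaa_files.tar.gz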
Run a New Test Job¶
Now, you'll repeat the last exercise (with a single input query file), but have the job download the pdbaa_files.tar.gz file from the web proxy instead of having HTCondor transfer the file from the submit server.
Modify the submit file and wrapper script¶
In the wrapper script, we have to add some special lines so that we can pull the data file from the HTTP proxy and see where it came from. Normally, we would let HTCondor do the HTTP transfer, but doing it ourselves lets us watch the download and see which cache served it.
In blast_wrapper.sh, we will have to add commands to pull the data file:
#!/bin/bash
# Set the http_proxy environment variable, which wget uses to find the proxy
export http_proxy=$OSG_SQUID_LOCATION
# Download pdbaa_files.tar.gz to the worker node;
# the -S argument prints the response headers, so we can see
# whether the request was a cache HIT or MISS
wget -S http://stash.osgconnect.net/public/<USERNAME>/pdbaa_files.tar.gz
tar xvzf pdbaa_files.tar.gz
./blastx -db pdbaa -query "$1" -out "$1".result
rm pdbaa*
Be sure to replace <USERNAME> with your own username.
The new lines will download pdbaa_files.tar.gz from the HTTP proxy, using the closest cache (wget reads the http_proxy environment variable to find the address of the nearest cache).
Also notice that the final line of the wrapper script has been modified to delete the extracted pdbaa data files as well as the pdbaa_files.tar.gz file itself, so that they will not be transferred back to the submit server when the job finishes.
In your submit file, you will need to remove the pdbaa_files.tar.gz file from transfer_input_files, because we are now transferring the tarball via the wget command in our wrapper script.
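For reference, the relevant submit file lines might now look something like the sketch below (your other submit file lines, and the exact file names, may differ):
executable = blast_wrapper.sh
transfer_input_files = blastx, $(inputfile)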
Submit the test job¶
You may wish to first remove the log, result, output, and error files from the previous tests, which will be overwritten when the new test job completes.
user@login05 $ rm *.err *.out *.result *.log
Submit a single test job! (If your submit file uses a queue ... matching statement, a simple way to submit a single job is to temporarily change it to queue inputfile matching mouse_rna.fa.1.)
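To submit, point condor_submit at your submit file; the name blast_split.sub below matches the submit file used later in this exercise, but use whatever yours is called:
user@login05 $ condor_submit blast_split.sub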
When the job starts, the wrapper will download the pdbaa_files.tar.gz file from the web proxy. If the job runs for longer than two minutes, you can assume that it will complete successfully, and then continue with the rest of the exercise.
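While you wait, you can check on the job's status with condor_q:
user@login05 $ condor_q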
After the job completes, examine the error file generated by the submission. At the top of the file, you will find something like:
--2021-07-23 10:35:51-- http://stash.osgconnect.net/public/dweitzel/pdbaa_files.tar.gz
Resolving iitgrid.iit.edu (iitgrid.iit.edu)... 216.47.155.220
Connecting to iitgrid.iit.edu (iitgrid.iit.edu)|216.47.155.220|:3128... connected.
Proxy request sent, awaiting response...
HTTP/1.1 200 OK
Server: nginx/1.16.1
Content-Type: application/octet-stream
Content-Length: 22105180
Last-Modified: Fri, 23 Jul 2021 14:27:49 GMT
ETag: "60fad1e5-1514c5c"
Accept-Ranges: bytes
Age: 0
Date: Fri, 23 Jul 2021 15:35:51 GMT
X-Cache: HIT from iitgrid.iit.edu
Via: 1.1 iitgrid.iit.edu (squid/frontier-squid-4.10-2.1)
Connection: close
Length: 22105180 (21M) [application/octet-stream]
Saving to: 'pdbaa_files.tar.gz.1'
...
Notice the X-Cache line: it says the request was a cache HIT from the proxy iitgrid.iit.edu. Yay! You successfully used a proxy to cache data near your worker node! Note that the name of the cache may be different for you, and it may be a MISS the first time you request the data from the cache.
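A quick way to pull out just the cache result from the output (a convenience, assuming your error files end in .err as in the earlier rm command):
user@login05 $ grep 'X-Cache' *.err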
Get HTCondor to do the wget for you!¶
The transfer_input_files command in the submit file can also take an HTTP address. Instead of using wget in your blast_wrapper.sh file, remove it and add the HTTP address to the transfer_input_files line in your blast_split.sub:
executable = blast_wrapper.sh
transfer_input_files = blastx, $(inputfile), http://stash.osgconnect.net/public/<USERNAME>/pdbaa_files.tar.gz
output = $(inputfile).out
...
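With HTCondor handling the download, the wrapper script no longer needs the proxy setup or the wget step. A sketch of the trimmed-down blast_wrapper.sh (assuming the same file names as before) might look like:
#!/bin/bash
# HTCondor has already transferred pdbaa_files.tar.gz to the worker node
tar xvzf pdbaa_files.tar.gz
./blastx -db pdbaa -query "$1" -out "$1".result
# Delete the database files so they are not transferred back
rm pdbaa*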
Run all 100 Jobs!¶
If all of the previous tests have gone okay, you can prepare to run all 100 jobs that will use the split input files. To make sure you're not going to generate too much data, use the size of the files from the previous test to calculate how much total data you're going to add to the blast-split directory for 100 jobs.
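For example, you can check the sizes of the test job's output files (the file names below assume the naming from the previous exercise; your sizes will differ) and multiply by 100 to estimate the total:
user@login05 $ ls -lh mouse_rna.fa.1.result mouse_rna.fa.1.out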
Make sure you have removed the local pdbaa_files.tar.gz from transfer_input_files in the split submit file. Also, don't forget to remove the log, error, and output files from the previous job.
Submit all 100 jobs!
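If you changed your queue statement earlier to run a single test job, remember to change it back so that it matches all of the split input files; depending on how Exercise 1.3 named them, the pattern might look something like:
queue inputfile matching mouse_rna.fa.*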
They may take a while to all complete, but it will still be faster than the many hours it would have taken to blast the single, large mouse_rna.fa file without splitting it up.
In the meantime, as long as the first several jobs run for longer than two minutes, you can move on to the next exercise.