Thursday Exercise 3.3: Using Stash for unique large input
In this exercise, we will run a multimedia program that converts and manipulates video files. In particular, we want to convert large .mov files to smaller (10-100s of MB) mp4 files. Just like the Blast database in the previous exercise, these video files are too large to send to jobs using HTCondor's default file transfer mechanism, so we'll be using the Stash tool to send our data to jobs. This exercise should take 25-30 minutes.
Data
We'll start by moving our source movie files into Stash, so that they'll be available to our jobs when they run out on OSG.
- Log into training.osgconnect.net and move into the ~/stash/public directory.
- The video files are currently stored on the squid proxy from the first exercise this afternoon. To place them in Stash, download them using wget:
  user@training $ wget http://proxy.chtc.wisc.edu/osgschool18/videos.tar.gz
- Once downloaded, untar the tar.gz file. It should contain three .mov files. (This may take a while, since everyone else is likely doing the same thing.)
- How big are the three files? Which is the smallest? (Find out with ls -lh.)
- We're going to need a list of these files later. For now, let's save that list to a file in this directory by running ls and redirecting the output to a file:
  user@training $ ls *.MOV *.mov > movie_list.txt
- Once you've examined the three mov files and created the list of files, remove the original tar.gz file.
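If you would rather not compare the sizes by eye, the following is a minimal sketch of a helper that prints the smallest of the files it is given. The function name smallest_file is hypothetical and not part of the exercise materials; it relies only on standard tools (wc and bash built-ins).

```shell
#!/bin/bash
# Hypothetical helper: print the name of the smallest regular file
# among the arguments. Arguments that are not files (for example an
# unmatched glob such as *.MOV) are skipped.
smallest_file() {
    local smallest="" smallest_size=0 f size
    for f in "$@"; do
        [ -f "$f" ] || continue
        # wc -c reports the size in bytes; $(( )) strips any padding.
        size=$(( $(wc -c < "$f") ))
        if [ -z "$smallest" ] || [ "$size" -lt "$smallest_size" ]; then
            smallest="$f"
            smallest_size="$size"
        fi
    done
    if [ -n "$smallest" ]; then
        printf '%s\n' "$smallest"
    else
        return 1
    fi
}
```

From ~/stash/public you could call it as smallest_file *.mov *.MOV and use the printed name later in your script.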
Software
We'll be using a multi-purpose media tool called ffmpeg to convert video formats. The basic command to convert a file looks like this:
user@training $ ./ffmpeg -i input.mov output.mp4
In order to resize our files, we're going to manually set the video bitrate and resize the frames, so that the resulting file is smaller.
user@training $ ./ffmpeg -i input.mov -b:v 400k -s 640x360 output.mp4
To get the ffmpeg program, do the following:
- On training.osgconnect.net, create a directory for this exercise, ~/thur-data-ffmpeg, and move into it.
- We'll be downloading the ffmpeg pre-built static binary, originally from this page: http://johnvansickle.com/ffmpeg/.
  user@training $ wget http://proxy.chtc.wisc.edu/osgschool18/ffmpeg-release-64bit-static.tar.xz
- Once the binary is downloaded, un-tar it, and then copy the main ffmpeg program into your current directory:
  user@training $ tar -xf ffmpeg-release-64bit-static.tar.xz
  user@training $ cp ffmpeg-4.0.1-64bit-static/ffmpeg ./
Script
We want to write a script that uses ffmpeg to convert a .mov file to a smaller format. Our script will need to copy that movie file from Stash to the job's current working directory (as in the previous exercise), run the appropriate ffmpeg command, and then remove the original movie file so that it doesn't get transferred back to the submit server. This last step is particularly important, as otherwise you will have large files transferring into the submit server and filling up your home directory space.
Create a file called run_ffmpeg.sh that does the steps described above. Use the name of the smallest .mov file in the ffmpeg command. Once you've written your script, check it against the example below:
#!/bin/bash
module load xrootd
module load stashcp
stashcp /user/username/public/test_open_terminal.mov ./
./ffmpeg -i test_open_terminal.mov -b:v 400k -s 640x360 test_open_terminal.mp4
rm test_open_terminal.mov
In your script, the username should be replaced by your training.osgconnect.net username.
Ultimately we'll want to submit several jobs (one for each .mov file), but to start with, we'll run one job to make sure that everything works.
Submit File
Create a submit file for this job, based on other submit files from the school (This file, for example.) Things to consider:
- We'll be copying the video file into the job's working directory, so make sure to request enough disk space for the input mov file and the output mp4 file. If you aren't sure how much to request, ask a helper in the room.
- Important: Don't list the name of the .mov file in transfer_input_files. Our job will be interacting with the input .mov files solely from within the script we wrote above.
- Note that we do need to transfer the ffmpeg program that we downloaded above.
  transfer_input_files = ffmpeg
- Add the same requirements as the previous exercise:
  +WantsStashCache = true
  requirements = (OSGVO_OS_STRING == "RHEL 6") && (OpSys == "LINUX") && (HAS_MODULES =?= true)
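Putting the considerations above together, here is one possible single-job submit file, written out by a small shell helper so you can compare it with your own. This is a sketch under assumptions: the helper name write_submit_file, the file name ffmpeg.sub, the output/error/log names, and the resource numbers are all illustrative placeholders, not values from the school materials. Only the transfer_input_files, +WantsStashCache, and requirements lines come from this exercise.

```shell
#!/bin/bash
# Hypothetical helper: write a single-job submit file to the given path
# (default: ffmpeg.sub in the current directory).
write_submit_file() {
    local path="${1:-ffmpeg.sub}"
    cat > "$path" <<'EOF'
# Convert one movie file with ffmpeg (file names here are assumptions)
executable = run_ffmpeg.sh

# Only the ffmpeg binary travels with the job; the movie comes from Stash.
transfer_input_files = ffmpeg

output = ffmpeg.out
error  = ffmpeg.err
log    = ffmpeg.log

# Placeholder requests: base request_disk on your "ls -lh" numbers
# (input movie plus output mp4, with some headroom).
request_cpus   = 1
request_memory = 1GB
request_disk   = 2GB

+WantsStashCache = true
requirements = (OSGVO_OS_STRING == "RHEL 6") && (OpSys == "LINUX") && (HAS_MODULES =?= true)

queue
EOF
}
```

Calling write_submit_file with no argument creates ffmpeg.sub in the current directory. Adjust the resource requests before submitting.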
Initial Job
With everything in place, submit the job. Once it finishes, we should check to make sure everything ran as expected:
- Check the directory where you submitted the job. Did the output .mp4 file return?
- Also in the directory where you submitted the job: did the original .mov file return here accidentally?
- Check file sizes. How big is the returned .mp4 file? How does that compare to the original .mov input?
If your job successfully returned the converted .mp4 file and not the .mov file to the submit server, and the .mp4 file was appropriately scaled down, then we can go ahead and convert all of the files we uploaded to Stash.
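The checks above can also be scripted. The following is a rough sketch; the function name check_job_output is hypothetical, and it only looks at file extensions in one directory, so it is a convenience rather than a full validation.

```shell
#!/bin/bash
# Hypothetical helper: report whether a directory holds returned .mp4
# files and whether any .mov file slipped back by accident.
# Returns 0 only if at least one .mp4 and no .mov files are present.
check_job_output() {
    local dir="${1:-.}" rc=0

    if [ -n "$(find "$dir" -maxdepth 1 -type f -iname '*.mp4')" ]; then
        echo "Returned .mp4 files:"
        # ls -lh shows human-readable sizes for comparison with the input.
        find "$dir" -maxdepth 1 -type f -iname '*.mp4' -exec ls -lh {} +
    else
        echo "No .mp4 file came back." >&2
        rc=1
    fi

    if [ -n "$(find "$dir" -maxdepth 1 -type f -iname '*.mov')" ]; then
        echo "A .mov file came back; check the rm step in run_ffmpeg.sh." >&2
        rc=1
    fi

    return "$rc"
}
```

Run check_job_output from the directory where you submitted the job, or pass that directory as the first argument.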
Multiple jobs
We wrote the name of the .mov file into our run_ffmpeg.sh executable script. To submit a set of jobs for all of our .mov files, what will we need to change in:
- the script?
- the submit file?
Once you've thought about it, check your reasoning against the instructions below.
Add an argument to your script
- Look at your run_ffmpeg.sh script. What values will change for every job?
- The input file will change with every job, and don't forget that the output file will too! Let's make them both into arguments.
To add arguments to a bash script, we use the notation $1 for the first argument (our input file) and $2 for the second argument (our output file name). The final script should look like this:
#!/bin/bash
module load xrootd
module load stashcp
stashcp /user/username/public/$1 ./
./ffmpeg -i $1 -b:v 400k -s 640x360 $2
rm $1
Note that we use the input file name multiple times in our script, so we'll have to use $1 multiple times as well.
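The script above assumes that every step succeeds. As an optional variation, here is a sketch of a more defensive version that stops if the Stash copy fails and removes the input file even when the conversion fails. The function name convert_movie and the error message are illustrative assumptions. The commands themselves are the same ones used in the exercise, with the arguments quoted.

```shell
#!/bin/bash
# Sketch of a more defensive run_ffmpeg.sh (hypothetical variant).
# Usage: ./run_ffmpeg.sh INPUT.mov OUTPUT.mp4

convert_movie() {
    local input="$1" output="$2" rc=0

    module load xrootd
    module load stashcp

    # Copy the input from Stash; give up early if the copy fails.
    # Replace "username" with your training.osgconnect.net username.
    if ! stashcp "/user/username/public/$input" ./; then
        echo "stashcp failed for $input" >&2
        return 1
    fi

    # Convert, remembering whether ffmpeg succeeded.
    ./ffmpeg -i "$input" -b:v 400k -s 640x360 "$output" || rc=$?

    # Always remove the large input so it is not transferred back.
    rm -f "$input"
    return "$rc"
}

# Run the conversion only when both file names were supplied.
if [ "$#" -ge 2 ]; then
    convert_movie "$1" "$2"
fi
```

Because the function returns ffmpeg's exit status, a failed conversion shows up as a non-zero exit code for the job instead of passing silently.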
Modify your submit file
- We now need to tell each job what arguments to use. We will do this by adding an arguments line to our submit file. Because we'll only have the input file name, the "output" file name will be the input file name with the mp4 extension appended. That should look like this:
  arguments = $(mov) $(mov).mp4
- To set these arguments, we will use the queue ... from syntax that we learned on Monday. To do so, we need a list of our input files. We created one earlier (movie_list.txt in ~/stash/public); the queue statement below looks for it in the directory you submit from, so copy it there if it isn't already present.
- In our submit file, we can then change our queue statement to:
  queue mov from movie_list.txt
Once you've made these changes, try submitting all the jobs!
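With arguments = $(mov) $(mov).mp4, an input named ducks.MOV produces an output named ducks.MOV.mp4. If you would prefer to drop the original extension, one option is to compute the output name inside the script with bash parameter expansion. The sketch below shows the idea; the function name mp4_name is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: derive an .mp4 file name from an input file name
# by removing the last extension and appending ".mp4".
mp4_name() {
    local input="$1"
    # ${input%.*} removes the shortest trailing match of ".*",
    # which is the final extension (.mov, .MOV, ...).
    printf '%s.mp4\n' "${input%.*}"
}
```

Inside run_ffmpeg.sh you could then set output=$(mp4_name "$1") and pass only $(mov) on the arguments line. This is a design choice rather than a requirement; the two-argument version above works as written.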
Bonus
If you wanted to set a different output file name, bitrate and/or size for each original movie, how could you modify:
- movie_list.txt
- Your submit file
- run_ffmpeg.sh
to do so?
Hint
Here are the changes you can make to the various files:
- movie_list.txt
  ducks.MOV ducks.mp4 500k 1280x720
  teaching.MOV teaching.mp4 400k 320x180
  test_open_terminal.mov terminal.mp4 600k 640x360
- Submit file
  arguments = $(mov) $(mp4) $(bitrate) $(size)
  queue mov,mp4,bitrate,size from movie_list.txt
- run_ffmpeg.sh
  #!/bin/bash
  module load stashcp
  module load xrootd
  stashcp /user/username/public/$1 ./
  ./ffmpeg -i $1 -b:v $3 -s $4 $2
  rm $1
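Since each row of the extended movie_list.txt now has to supply four values, a quick sanity check before submitting can catch a missing column. The following is a sketch using awk; the function name check_movie_list is hypothetical, and it only counts whitespace-separated fields without validating the values themselves.

```shell
#!/bin/bash
# Hypothetical helper: confirm every non-blank line of the list has
# exactly four whitespace-separated fields
# (input name, output name, bitrate, size).
check_movie_list() {
    local list="${1:-movie_list.txt}"
    awk '
        NF == 0 { next }   # ignore blank lines
        NF != 4 {
            printf "line %d has %d fields, expected 4: %s\n", NR, NF, $0
            bad = 1
        }
        END { exit bad }   # non-zero exit if any row was malformed
    ' "$list"
}
```

Running check_movie_list in your submit directory prints any malformed rows and returns a non-zero status if it finds one.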