Transfer Larger Job Files and Containers Using OSDF
For input and output files larger than 1GB, the default HTCondor file transfer mechanism runs the risk of over-taxing the Access Point and its network capacity. This is exactly why the OSDF (Open Science Data Federation) exists for researchers with larger per-job data! The OSDF is a network of data origins and caches for data distribution.
If you have an account on an OSG Access Point, you have access to an OSDF data origin, specifically a directory that can be used to stage input and output data for jobs, accessible via the OSDF. This guide describes general tips for using the OSDF, where to stage your files, and how to access files from jobs.
Important Considerations and Best Practices
- Use OSDF locations for larger files and containers: We recommend using the OSDF for files larger than 1GB (input or output) and for all container files.
- OSDF files are cached across the Open Science Pool, so any changes or modifications that you make may not be propagated. This means that if you add a new version of a file to the OSDF directory, it must first be given a unique name (or directory path) to distinguish it from previous versions of that file. Adding a date or version number to directories or file names is strongly encouraged to keep versions distinct (see the sketch after this list). This is especially important when using the OSDF for software and containers.
- Never submit jobs from OSDF locations; always submit jobs from within the `/home` directory. All `log`, `error`, and `output` files, and any other files smaller than the sizes above, should only ever exist within the user's `/home` directory.
- Files placed within a public OSDF directory are publicly accessible, discoverable, and readable by anyone via the web. At the moment, most default OSDF locations are not public.
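For example, here is a minimal sketch of the versioning recommendation above, applied to a container. The directory layout, file names, and version/date scheme are hypothetical, and it assumes your jobs reference the container with HTCondor's `container_image` option:

    # Hypothetical layout: each container build gets a dated, versioned file name,
    # so cached copies of older versions are never mistaken for the new one:
    #   /ospool/apXX/data/<username>/containers/my-software-v1-2024-03-15.sif
    #   /ospool/apXX/data/<username>/containers/my-software-v2-2024-05-01.sif

    # In the submit file, point at the specific version this batch of jobs should use:
    container_image = osdf:///ospool/apXX/data/<username>/containers/my-software-v2-2024-05-01.sif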
Where to Put Your Files
Data origins and local mount points vary between the different access points. See the table below for the "Local Path" to use, based on your access point.
Access Point | OSDF Origin (Local Path)
---|---
ap40.uw.osg-htc.org | Accessible to user only: `/ospool/ap40/data/<username>`
ap20.uc.osg-htc.org | Accessible to user only: `/ospool/ap20/data/<username>`
ap21.uc.osg-htc.org | Accessible to user only: `/ospool/ap21/data/<username>`
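As a quick illustration of how a local path from the table maps onto the URL used in a submit file (using ap40 and a hypothetical file name as the example):

    # A file staged on the Access Point at:
    #   /ospool/ap40/data/<username>/my_data.tar.gz
    # is referenced in the submit file as:
    transfer_input_files = osdf:///ospool/ap40/data/<username>/my_data.tar.gz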
Transfer Files To/From Jobs Using the OSDF
Use an `osdf://` URL to Transfer Large Input Files and Containers
Jobs will transfer data from the OSDF directory when files are indicated with an appropriate `osdf://` URL (or the older `stash://`) in the `transfer_input_files` line of the submit file. Make sure to customize the base URL based on your Access Point, as described in the table above.
Some examples:
- Transferring one file from `/ospool/apXX/data/`:

      transfer_input_files = osdf:///ospool/apXX/data/<username>/InFile.txt

- When using multiple files from `/ospool/apXX/data/`, it can be useful to use HTCondor submit file variables to make your list of files more readable:

      # Define a variable (example: OSDF_LOCATION) equal to the
      # path you would like files transferred from, and call this
      # variable using $(variable)
      OSDF_LOCATION = osdf:///ospool/apXX/data/<username>
      transfer_input_files = $(OSDF_LOCATION)/InputFile.txt, $(OSDF_LOCATION)/database.sql

- Transferring a folder from `/ospool/apXX/data/`:

      transfer_input_files = osdf:///ospool/apXX/data/<username>/<folder>?recursive

  Please note that when transferring a folder using the OSDF, `?recursive` needs to be added after the folder name.
Use `transfer_output_remaps` and an `osdf://` URL for Large Output Files
To move output files into an OSDF directory, users should use the `transfer_output_remaps` option within their job's submit file, which will transfer the user's specified file to the specified location in the data origin.

By using `transfer_output_remaps`, it is possible to specify what path to save a file to and what name to save it under. Using this approach, it is possible to save files back to specific locations in your OSDF directory (as well as your `/home` directory, if desired).
The general syntax for `transfer_output_remaps` is:

    transfer_output_remaps = "Output1.txt = path/to/save/file/under/output.txt; Output2.txt = path/to/save/file/under/RenamedOutput.txt"
When saving large output files back to `/ospool/apXX/data/`, the path provided will look like:

    transfer_output_remaps = "Output.txt = osdf:///ospool/apXX/data/<username>/Output.txt"
Some examples:
- Transferring one output file (`OutFile.txt`) back to `/ospool/apXX/data/`:

      transfer_output_remaps = "OutFile.txt = osdf:///ospool/apXX/data/<username>/OutFile.txt"

- When transferring multiple files back to `/ospool/apXX/data/`, it can be useful to use HTCondor submit file variables to make your list of files more readable. Also note the semi-colon separator in the list of output files:

      # Define a variable (example: OSDF_LOCATION) equal to the
      # path you would like files transferred to, and call this
      # variable using $(variable)
      OSDF_LOCATION = osdf:///ospool/apXX/data/<username>
      transfer_output_remaps = "file1.txt = $(OSDF_LOCATION)/file1.txt; file2.txt = $(OSDF_LOCATION)/file2.txt; file3.txt = $(OSDF_LOCATION)/file3.txt"
Phase out of `stash:///` and the `stashcp` command
Historically, output files could be transferred from a job to an OSDF location using the `stashcp` command within the job's executable. However, this mechanism is no longer encouraged for OSPool users. Instead, jobs should use `transfer_output_remaps` (an HTCondor feature) to transfer output files to your assigned OSDF origin. By using `transfer_output_remaps`, HTCondor will manage the output data transfer for your jobs. Data transferred via HTCondor is more likely to be transferred successfully, and transfer errors are more likely to be reported to the user.
`osdf://` is the new format for these kinds of transfers, and is equivalent to the old `stash://` format (which will continue to be supported for the short term).
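For reference, a sketch of the two equivalent URL forms side by side (the file name is a hypothetical placeholder); the `osdf://` form is preferred going forward:

    # Preferred, current form:
    transfer_input_files = osdf:///ospool/apXX/data/<username>/InFile.txt

    # Older, equivalent form (still supported for the short term):
    # transfer_input_files = stash:///ospool/apXX/data/<username>/InFile.txt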