US Map featuring the locations of current OSDF architectural components.

Open Science Data Federation

The Open Science Data Federation (OSDF) is an OSG service for sharing files staged in autonomous “origins”, giving efficient access to those files from anywhere in the world through a global namespace and a network of caches. The OSDF may be used either standalone, with data downloaded via HTTPS, or with HTCondor managing data transfer for compute jobs running on one of the many resource pools supported by OSG.

For the sake of concreteness, the OSG documentation focuses on implementation and usage of the OSDF from within OSG resource pools, and provides examples of appropriate HTTP addresses for accessing data in OSG-supported data origins from other locations.

To learn how to use the OSDF as a standalone infrastructure, please reach out to the OSG Team through [email protected].

How is the Open Science Data Federation integrated with other OSG Services?

The Open Science Data Federation (OSDF) enables users and institutions to make datasets available to compute jobs running in distributed high-throughput computing (dHTC) environments such as the Open Science Pool (OSPool). Compute jobs submitted from an HTCondor access point (e.g. an OSG-Operated Access Point) can access data stored in data origins, with HTCondor managing data transfer via the OSDF’s global namespace and data caches.

By providing a distributed data access layer through these data caches, the OSDF lets jobs running in the OSPool (or any other resource pool) reduce wide-area network consumption, load on the data origins, and data access latency.
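The effect of a cache on origin load can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the OSDF implementation: the `ToyCache` class, the LRU eviction policy, and the `/demo/...` paths are all hypothetical and chosen only to show how repeated reads can be served without returning to the origin.

```python
from collections import OrderedDict


class ToyCache:
    """A tiny LRU cache standing in for a regional data cache (illustrative only)."""

    def __init__(self, capacity, origin):
        self.capacity = capacity      # maximum number of objects kept in the cache
        self.origin = origin          # dict of path -> bytes, playing the data origin
        self.store = OrderedDict()    # cached objects, least recently used first
        self.origin_fetches = 0       # requests that had to go back to the origin

    def get(self, path):
        if path in self.store:
            self.store.move_to_end(path)    # mark as most recently used
            return self.store[path]
        # Cache miss: fetch from the origin and keep a copy.
        self.origin_fetches += 1
        data = self.origin[path]
        self.store[path] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used object
        return data


def run_jobs(cache, n_jobs, paths):
    """Simulate n_jobs jobs that each read every path once, through the cache."""
    for _ in range(n_jobs):
        for p in paths:
            cache.get(p)
    return cache.origin_fetches


# Hypothetical origin contents: two small objects read by every job.
origin = {"/demo/reference.fa": b"ACGT" * 4, "/demo/index.bin": b"\x00" * 8}
cache = ToyCache(capacity=10, origin=origin)
fetches = run_jobs(cache, n_jobs=100, paths=list(origin))
print(f"200 reads served with {fetches} origin fetches")
```

In this toy setup each object is fetched from the origin once and every later read is a cache hit. If the cache is too small for the working set, the benefit shrinks, which is why the size of a dataset relative to cache capacity matters.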

Example OSDF Use Cases

  • A researcher wants to share a dataset with their community such that others may process it.
  • A researcher produces data on the OSPool that they need to store for future processing or sharing with the community.
  • A researcher has a GB- to TB-scale dataset that they want to analyze. Their workflow processes the same data many times, so it benefits greatly from the caching within the OSDF.
  • A researcher has curated a dataset they produced and is hosting it on their data origin within the OSDF. They now want to free up disk space at their origin by transferring responsibility for this dataset to somebody else’s data origin, without any change in how the community accesses the data.

To learn more details about these or other use cases, please reach out to our team of Research Computing Facilitators through [email protected].
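The last use case, moving a dataset between origins without changing how it is accessed, depends on readers addressing data by its path in the global namespace rather than by the host that stores it. The sketch below illustrates that idea only. The `ToyNamespace` class, the longest-prefix lookup, and the `.example` host names are assumptions made for illustration and do not describe how the OSDF actually resolves paths.

```python
class ToyNamespace:
    """Maps namespace prefixes to origin hosts (hypothetical, for illustration)."""

    def __init__(self):
        self.prefixes = {}  # namespace prefix -> origin host name

    def register(self, prefix, origin_host):
        """Declare (or re-declare) which origin serves a namespace prefix."""
        self.prefixes[prefix.rstrip("/")] = origin_host

    def resolve(self, path):
        """Return the origin serving `path`, using the longest matching prefix."""
        best = None
        for prefix in self.prefixes:
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best):
                    best = prefix
        if best is None:
            raise KeyError(f"no origin registered for {path}")
        return self.prefixes[best]


ns = ToyNamespace()
ns.register("/lab-a/survey", "origin.lab-a.example")
path = "/lab-a/survey/2023/run1.h5"
before = ns.resolve(path)

# Responsibility for the dataset moves to another origin; the path is unchanged.
ns.register("/lab-a/survey", "origin.archive.example")
after = ns.resolve(path)
print(path, before, "->", after)
```

The path that the community uses stays the same before and after the move. Only the mapping from prefix to origin changes.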

Who can use the OSDF?

Any US-based academic, government, or non-profit institution may operate a data origin to export their users’ data. Researchers using the OSPool from an OSG-Operated Access Point may also use an OSG-operated origin.

Who can access data in the OSDF?

Each origin can be configured to make data public or private, and can control the rules for sharing. For example, OSG-Operated Access Points allow users to make their data accessible to all (public).

Cached and transferred portions of the data may additionally be visible to the administrators of the cache services and of the execution points where a user’s jobs run. Non-public data is encrypted when sent over the network, but not on disk.

Who manages the OSDF?

Origins in the OSDF are generally managed by the projects or institutions that own the underlying storage. Additionally, the PATh project (which funds many core OSG technologies and services) offers PATh-hosted origins that are otherwise owned by, and configured specifically for, the relevant organizations, as described here.

The caches are largely managed by OSG staff, who remotely operate the services. Some caches are dedicated to a research community or access point; a number of caches are specific to the LIGO experiment. The PATh project also operates the OSDF data origin associated with the OSG-Operated Access Points for US-associated research projects accessing the OSPool. The cache hardware is distributed throughout the US, including points of presence in the Internet2 and ESnet networks and university facilities such as UW-Madison, Chicago, Syracuse, UCSD, and Nebraska.

What hardware is necessary for an OSDF Data Origin?

There are many ways to architect filesystems, including those relevant to an OSDF data origin. Possible solutions include commercial and open source options.

To provide the community some guidance, we are offering to host your suggested solutions here.