Open Science Data Federation
The Open Science Data Federation (OSDF) connects disparate dataset repositories into a single, nationwide data distribution network. Through the OSDF, providers can make their datasets available to a wide variety of compute users, from browsers to Jupyter notebooks to high-throughput computing environments.
The OSDF is part of the OSG Fabric of Services, running software developed by the Pelican Platform.
There are many ways to participate in the OSDF. Read on for three different ways to engage.
Share
The OSDF may be for you if…
- You are part of a collaborative project that works with shared datasets
- You have generated data as part of a project and want to share it
Want to make your dataset available via the OSDF?
CICI PIs: see the additional details for your projects in "Dear CICI PIs".
Contribute
The OSDF can be a platform for sharing data from your institution or contributing infrastructure to a national project. These are some ways that institutions and communities can contribute to the OSDF:
- Provide unused storage space for other groups or projects to use via the OSDF
- Host infrastructure to make the OSDF more robust, like a local cache
Want to contribute to the OSDF infrastructure?
Use
The OSDF may be for you if…
- You are using the OSPool to analyze or produce data
- You want to analyze data that has been shared on the OSDF
Want to use or process data hosted on the OSDF?
FAQ
Any US-based academic, government, or non-profit institution may connect their object store to the OSDF.
Researchers using the OSPool from an OSG-operated Access Point automatically get an allocation on a local filesystem connected to the OSDF.
What about a researcher or community that would like to connect to the OSDF but doesn’t have their own storage infrastructure?
- CC* Storage projects have committed to having their storage managed by OSG; projects can request space from the OSG for their use via the support desk.
- Researchers can request an OSN allocation from ACCESS and ask OSG to connect their bucket to the OSDF.
The “origin” service connects the backend object store (a POSIX filesystem, S3-compatible endpoint, or HTTP endpoint) with the national infrastructure. The origin service needs access to the storage and incoming connectivity from the external infrastructure.
Most origin backends are currently mounted shared filesystems, and S3 endpoints like those found on AWS or OSN are increasingly common. To ease operations, the OSG Consortium offers a “hosted origin service” where central experts install and operate the origin as a container. The container is most often deployed on on-premises hardware inside a Science DMZ, either as part of the National Research Platform or on an institutional Kubernetes cluster.
If the repository runs its own origin, this can be done on “bare metal” with native packages or as a container operated by the institution.
The hardware needed for the origin varies widely with expected usage; it is typically deployed on server-class hardware. Planning the network connectivity to the object store and out to the national infrastructure (including firewalls along the path) is key. The OSG team is experienced in consulting with universities and helping them design the integration. To give the community some guidance, we host our suggested solutions.
The Open Science Data Federation (OSDF) enables users and institutions to make datasets available to compute jobs running in distributed high-throughput computing (dHTC) environments such as the Open Science Pool (OSPool). Compute jobs submitted from an HTCondor access point (e.g. an OSG-Operated Access Point) can access data stored in data origins, with HTCondor managing data transfer via the OSDF’s global namespace and data caches.
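As a rough illustration of how a job might reference OSDF data, the sketch below generates an HTCondor submit description whose input is named by an `osdf:///` URL, leaving the actual transfer to HTCondor. The object path, script name, and resource requests are hypothetical placeholders. The snippet only builds text, and it assumes the access point is configured for OSDF transfers.

```python
def build_submit_description(osdf_input: str, executable: str = "analyze.sh") -> str:
    """Return HTCondor submit-file text that reads one input object from the OSDF.

    `osdf_input` and `executable` are illustrative; substitute real values.
    """
    if not osdf_input.startswith("osdf:///"):
        raise ValueError("expected an osdf:/// URL, got: " + osdf_input)

    # The transferred object appears in the job's scratch directory
    # under its base name, so that is what the script receives.
    input_name = osdf_input.rsplit("/", 1)[-1]

    lines = [
        f"executable = {executable}",
        f"arguments = {input_name}",
        # HTCondor fetches this URL on the job's behalf before it starts.
        f"transfer_input_files = {osdf_input}",
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "request_cpus = 1",
        "request_memory = 1GB",
        "request_disk = 1GB",
        "queue 1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical object path, used only to show the output format.
    print(build_submit_description("osdf:///ospool/example/data/sample.dat"))
```

Writing the printed text to a file and passing it to `condor_submit` would submit one job. The job script itself never needs to know where the data physically lives.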
Because these data caches provide a distributed data access layer, jobs running in the OSPool (or any other resource pool) can reduce wide-area network consumption, load on the data origins, and data access latency.
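The effect on origin load can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes an idealized model in which each cache fetches an object from the origin once and serves every later request locally. The job count, object size, and cache count are made-up numbers, not measurements of the OSDF.

```python
def origin_egress_gb(jobs: int, object_gb: float, caches: int = 0) -> float:
    """Estimate data served by the origin when `jobs` jobs read the same object.

    Idealized model (an assumption): with no caches every job reads from the
    origin; with caches, each cache in use pulls the object exactly once.
    """
    if caches <= 0:
        return jobs * object_gb
    # At most one origin fetch per cache, and never more fetches than jobs.
    return min(jobs, caches) * object_gb


# Hypothetical workload: 10,000 jobs each reading the same 2 GB object.
direct = origin_egress_gb(jobs=10_000, object_gb=2.0)
cached = origin_egress_gb(jobs=10_000, object_gb=2.0, caches=20)

print(f"origin egress without caches: {direct:.0f} GB")
print(f"origin egress with 20 caches: {cached:.0f} GB")
```

Real deployments will fall somewhere between these two extremes, since caches evict objects and not every job reads identical data.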
The OSDF is not limited to dHTC environments: it can be accessed via a browser (like S3, OSDF’s underlying protocol is HTTPS) or directly via a Python client.
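Because the underlying protocol is HTTPS, an object can in principle be fetched with nothing more than a standard HTTP client. The sketch below maps an `osdf:///` object name to an HTTPS URL and reads the first bytes of the object. The director hostname and the object path are assumptions for illustration and are not taken from this page; check the current OSDF documentation for the right endpoint. The sketch also assumes the object is public.

```python
from urllib.parse import quote, urlsplit
from urllib.request import urlopen

# Assumption: hostname of the federation's director service.
ASSUMED_DIRECTOR = "osdf-director.osg-htc.org"


def osdf_to_https(osdf_url: str, director: str = ASSUMED_DIRECTOR) -> str:
    """Translate osdf:///namespace/path into an HTTPS URL on `director`."""
    parts = urlsplit(osdf_url)
    if parts.scheme != "osdf":
        raise ValueError("expected an osdf:// URL, got: " + osdf_url)
    # Accept both osdf:///a/b and osdf://a/b by rejoining netloc and path.
    object_path = "/" + (parts.netloc + parts.path).lstrip("/")
    # Percent-encode characters such as spaces; '/' is kept as a separator.
    return f"https://{director}{quote(object_path)}"


def fetch_head(osdf_url: str, nbytes: int = 1024) -> bytes:
    """Read up to `nbytes` from a public object (performs a network request)."""
    # urlopen follows HTTP redirects, which matters if the director
    # redirects the client to a nearby cache (assumed behaviour).
    with urlopen(osdf_to_https(osdf_url), timeout=30) as response:
        return response.read(nbytes)


if __name__ == "__main__":
    # Hypothetical object path; only the URL translation is shown here.
    print(osdf_to_https("osdf:///ospool/example/data/my file.txt"))
```

The same HTTPS URL can be pasted into a browser. A dedicated client library would normally add conveniences such as cache selection and token handling on top of this.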
The OSDF can be used in a variety of scenarios, including:
- A repository wants to stream its datasets, at scale, without scaling egress.
- A researcher wants to share a dataset with their community so others can use it in computational workflows.
- A researcher produces data on the OSPool that they need to store for future processing or sharing with the community.
- A team wants to make their datasets available to their community without opening their storage directly to the Internet.
To learn more details about these or other use cases, please reach out to our team of Research Computing Facilitators through [email protected].
Each origin is configured to enforce the object store’s access policies. Objects can be made public or private, and the repository controls the rules for sharing. For example, origins at Access Points can provide users with a public directory and a directory that is only accessible to a user’s jobs.
The content distribution network enforces the origin’s access policies by requiring a signed access token for non-public objects.
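For a client, presenting a token over HTTPS usually means attaching it to the request. The sketch below assumes the token is sent as a standard HTTP bearer token in the `Authorization` header. The header choice, the cache hostname, and the token value are illustrative assumptions; how a token is obtained and which scopes it needs are outside this sketch.

```python
from urllib.request import Request


def authorized_request(https_url: str, token: str) -> Request:
    """Build a GET request for a non-public object, carrying a bearer token."""
    if not https_url.startswith("https://"):
        # Tokens are credentials; refuse to send them over plain HTTP.
        raise ValueError("tokens should only be sent over HTTPS")
    return Request(https_url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    # Hypothetical URL and placeholder token, for illustration only.
    request = authorized_request(
        "https://cache.example.org/ospool/example/private/results.dat",
        token="PLACEHOLDER_TOKEN",
    )
    print(request.get_method(), request.full_url)
    print("Authorization header set:", request.has_header("Authorization"))
```

Passing the returned object to `urllib.request.urlopen` would perform the download. A request without a valid token for a non-public object would be expected to fail with an authorization error.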
Objects cached in the content distribution network are visible to the administrators of the cache services and of the execution points where a user’s jobs run. Non-public data is encrypted when sent over the network, but not on disk. The OSDF is appropriate for non-public data from “open science” communities but not highly regulated or sensitive data (such as PII or HIPAA data).
OSDF Contributors
The OSDF is part of the OSG Fabric of Services run by the OSG Consortium.
The effort to operate the OSDF central services and hosted origins is provided by the PATh project. Institutions may operate their own origins on behalf of local repositories.
The caches in the distribution network are primarily managed by PATh staff but consist of hardware contributed by external projects and institutions.