AlphaFold3 Alignment Library
A growing OSDF-hosted library of reusable AlphaFold3 multiple sequence alignments for structure prediction workflows.
Running AlphaFold3 at scale can be challenging because jobs often spend a long time preparing inputs before structure prediction begins, including searching large reference databases to build alignments. This can lead to long runtimes, heavy disk I/O use, and scheduling challenges on shared computing systems, especially when the same protein chains are analyzed repeatedly across many projects. The AlphaFold3 Alignment Library helps reduce this repeated work by allowing compatible OSPool workflows to reuse precomputed alignments when they are already available.
What is the AlphaFold3 Alignment Library?
The AlphaFold3 Alignment Library is a shared index and OSDF namespace for precomputed multiple sequence alignments (MSAs) that can be reused by cache-aware AlphaFold3 workflows running on the OSPool. The initial focus is protein-chain unpaired MSAs in A3M format, which are the alignment artifacts most useful for skipping repeated CPU-heavy sequence search work before GPU inference.
Each record links a normalized protein sequence hash to an alignment file, its OSDF/Pelican location, checksum, source, size, sequence count, and provenance metadata about how the alignment was generated. This lets workflows decide whether a cached alignment is acceptable for a given run.
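As a rough illustration of what one such record might look like in code, here is a minimal sketch. The field names, the example URI, and the provenance keys are assumptions made for illustration; they are not the library's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class AlignmentRecord:
    """Hypothetical shape of one registry record (field names are assumed)."""
    sequence_hash: str   # hash of the normalized protein sequence (the cache key)
    osdf_uri: str        # OSDF/Pelican location of the A3M file
    checksum: str        # checksum of the A3M file
    source: str          # who or what produced the alignment
    size_bytes: int      # size of the alignment file
    num_sequences: int   # number of sequences in the alignment
    provenance: dict = field(default_factory=dict)  # how the alignment was generated


# Placeholder values only; none of these refer to a real record.
example = AlignmentRecord(
    sequence_hash="ab" * 32,
    osdf_uri="osdf:///example-namespace/alignments/example.a3m",
    checksum="cd" * 32,
    source="example-source",
    size_bytes=10_240,
    num_sequences=512,
    provenance={"db_snapshot": "example-snapshot"},
)
```

Keeping provenance as an open-ended mapping is one way to let different sources attach different generation details without changing the core fields.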
The library is not a replacement for AlphaFold3 itself. It is a reuse layer for the data-pipeline stage: when a precomputed alignment already exists for a protein chain, a workflow can fetch that alignment from OSDF instead of recomputing it de novo from the full reference databases.
Rollout note: This library is currently being grown and tested as cache-aware AlphaFold3 workflows are introduced on the OSPool. Records, coverage, and contribution pathways will expand over time. The OSPool is looking for early adopters to test the library and provide feedback on its design and utility. If you are interested in using or contributing to the library, please contact [email protected].
Why cache alignments?
AlphaFold3 workflows often spend substantial CPU time in the data pipeline searching large sequence databases to build MSAs. Across a community of users, the same protein chains, protein families, organisms, and benchmark inputs may be searched repeatedly. The alignment library reduces this duplicated work by letting workflows reuse precomputed alignments when the sequence, source, and provenance are appropriate for the user's workflow.
- Faster cache-aware runs — jobs that find an acceptable cached MSA can skip part or all of the repeated sequence-search stage for that chain.
- Lower CPU and storage pressure — repeated database searches are replaced with OSDF/Pelican reads of smaller alignment artifacts.
- Reproducibility and provenance — cached records include source and checksum metadata so users can distinguish generated, imported, and community-contributed alignments.
- Community reuse — every validated contribution can make later structure-prediction workflows faster for researchers.
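The checksum metadata mentioned above can be used to confirm that a fetched alignment file matches its registry record. The sketch below assumes the checksum is a hex-encoded SHA-256 digest, which this page does not specify; the function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_alignment(path, expected_checksum):
    """True when the file's digest matches the record's checksum (assumed SHA-256)."""
    return file_sha256(path) == expected_checksum.lower()
```

A job that gets `False` here would reasonably discard the file and fall back to generating the alignment itself.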
How jobs use the library
A cache-aware AlphaFold3 wrapper can check the library once per protein chain before running the data pipeline. The workflow normalizes the protein sequence, computes its sequence hash, queries the registry or static manifest, and then fetches an acceptable A3M file from OSDF when a matching record is available. If no acceptable record is found, the job can continue with de novo MSA generation using the AlphaFold3 databases.
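The lookup steps described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the normalization rule (strip whitespace, upper-case), the use of SHA-256, and the manifest layout (a dict from sequence hash to a list of record dicts) are all assumptions, and the OSDF transfer itself is left out.

```python
import hashlib


def normalize_sequence(seq):
    """Assumed normalization: remove all whitespace and upper-case the residues."""
    return "".join(seq.split()).upper()


def sequence_hash(seq):
    """Assumed cache key: SHA-256 of the normalized sequence."""
    return hashlib.sha256(normalize_sequence(seq).encode("ascii")).hexdigest()


def find_cached_alignment(seq, manifest):
    """Return the first record for this sequence, or None if there is none."""
    records = manifest.get(sequence_hash(seq), [])
    return records[0] if records else None


def plan_msa_step(seq, manifest):
    """Decide whether to fetch a cached A3M or compute the MSA de novo."""
    record = find_cached_alignment(seq, manifest)
    if record is None:
        return ("compute", None)
    return ("fetch", record["osdf_uri"])
```

With this layout, `plan_msa_step` would be called once per protein chain, and only chains that return `"compute"` would need the full database search.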
Users and workflows should opt in to the sources they trust. For example, a workflow may choose to accept only alignments generated by a specific AlphaFold3 wrapper, a particular database snapshot, a specific collaboration, or a curated imported source. This prevents the cache from becoming a blind substitution mechanism and keeps provenance visible.
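One way to express this kind of opt-in is an explicit allow-list that every candidate record must pass. The sketch below is hypothetical: the `TrustPolicy` class and the `source` and `db_snapshot` record keys are illustrative names, not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustPolicy:
    """Hypothetical allow-list of sources and database snapshots."""
    allowed_sources: frozenset
    allowed_db_snapshots: frozenset = frozenset()  # empty means any snapshot

    def accepts(self, record):
        if record.get("source") not in self.allowed_sources:
            return False
        if self.allowed_db_snapshots and (
            record.get("db_snapshot") not in self.allowed_db_snapshots
        ):
            return False
        return True


def select_record(records, policy):
    """Return the first record the policy accepts, or None."""
    for record in records:
        if policy.accepts(record):
            return record
    return None
```

Because nothing is accepted unless it is listed, a record from an unknown source is skipped and the job falls back to de novo generation.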
Need an account to run AlphaFold3 workloads? Sign up on the OSG Portal
Current scope
| Item | Current scope |
|---|---|
| Primary artifact | Protein-chain unpaired MSA files in A3M format. |
| Cache key | Normalized protein sequence hash, with one or more alignment records allowed per sequence when sources or provenance differ. |
| Distribution | Alignment files are stored in an OSDF namespace and retrieved through Pelican-aware clients or normal OSDF access patterns. |
| Registry metadata | Sequence identifier, source, OSDF URI, checksum, size, number of aligned sequences, timestamps, and generation/provenance fields. |
| Sources | OSPool-generated AlphaFold3 data-pipeline outputs, curated imported alignment sets, and validated institutional or community contributions. |
| Organism information | Derived from source metadata when available. Organism counts describe annotated source organisms, not a guarantee that each sequence is organism-unique. |
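Some of the registry metadata in the table can be derived directly from an alignment file. A3M is a FASTA-like format in which each entry begins with a `>` header line, so counting headers gives the number of sequences. The sketch below is an illustration only; whether the registry's count includes the query sequence is not stated here, and this version counts every entry.

```python
def summarize_a3m(a3m_text):
    """Derive simple metadata from A3M text (counts every '>' entry, query included)."""
    headers = [line for line in a3m_text.splitlines() if line.startswith(">")]
    return {
        "num_sequences": len(headers),
        "size_bytes": len(a3m_text.encode("utf-8")),
    }


# Tiny made-up alignment; lowercase letters mark insertions relative to the query.
sample = ">query\nMKTAYIAK\n>hit1\nMKTAYIaAK\n>hit2\nMK-AYIAK\n"
summary = summarize_a3m(sample)
```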
How can I contribute alignments?
Researchers, institutions, and collaborations can contribute precomputed alignments when they are willing to share the alignment artifacts and enough metadata for reuse. Contributing to the library is as easy as running AlphaFold3 on your own sequences with the OSG-provided workflow. Alignments are cached automatically when you run the workflow on the OSPool: your generated alignments are added to the library's registry and made available for reuse by you and other users.
Already have a set of precomputed alignments you'd like to share? Connect with a Facilitator at [email protected].
Contributed records are deduplicated by sequence and validated before being advertised for broad reuse. Multiple records may exist for the same sequence when they come from different trusted sources or database snapshots.
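The deduplication rule above can be illustrated with a small sketch: records are treated as duplicates only when sequence, source, and database snapshot all match, so different sources or snapshots for the same sequence are kept. The record keys used here are assumptions, and the validation step is not shown.

```python
def deduplicate(records):
    """Keep the first record for each (sequence, source, snapshot) combination."""
    seen = {}
    for record in records:
        key = (
            record["sequence_hash"],
            record["source"],
            record.get("db_snapshot"),
        )
        seen.setdefault(key, record)  # later exact duplicates are ignored
    return list(seen.values())
```

Since Python dicts preserve insertion order, the surviving records come back in the order they were first contributed.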
To contribute alignments, discuss institutional cache hosting, or share a collaboration-scale alignment set, contact [email protected].