ProteinTensor Introduction

ProteinTensor is an AI-native biomolecular storage format designed to eliminate the preprocessing bottleneck in modern structural biology machine learning pipelines.

The Problem

Every time a researcher trains AlphaFold, Boltz, RoseTTAFold, OpenFold, or any structure-prediction model, the same work happens before a single GPU operation runs:

for each protein in dataset:
    parse mmCIF / PDB file          (30 – 350 ms per structure)
    extract sequence tokens
    build atom coordinate arrays
    construct backbone geometry
    compute covalent bond graph
    load or regenerate MSA           (2 – 480 min with JackHMMER)
    run ESM2 / ESM3 inference        (seconds per protein on GPU)
    compute distance matrices
    ...

    -> finally: model.forward(features)

For a 100,000-structure training run this preprocessing costs thousands of CPU-hours per epoch - most of it producing identical results every time. The mmCIF file has not changed. The sequence has not changed. The physics has not changed. Yet every run recomputes everything from scratch.

ProteinTensor solves this by converting the PDB entry once into a .ptt file - a Zarr-backed, LZ4-compressed, memory-mappable store that holds every tensor a model needs - and then loading those tensors directly at training time with zero parsing.

once:   mmCIF  ->  ProteinTensor (.ptt)
always: .ptt   ->  model.forward()

Who This Is For

Structural biology researchers running AlphaFold 3, Boltz, or Chai-1 who spend hours waiting for MSA generation before every new experiment.

ML engineers at pharma and biotech companies iterating over large structure databases (PDB, AlphaFold Database, ESMAtlas) where I/O throughput is a training bottleneck measured in wall-clock days.

Academic labs with limited GPU budgets who cannot afford to waste compute cycles on re-parsing text files when those GPU-hours should go toward model training.

Software engineers building structural biology pipelines who want a single, well-defined intermediate format that works with PyTorch, JAX, and NumPy without writing custom loaders for every model.

ProteinTensor is to structural biology what Parquet is to analytics, what safetensors is to model weights, and what ONNX is to model exchange - a common, open, high- performance format that turns a recurring computational tax into a one-time cost.

Benchmark: Traditional Pipeline vs ProteinTensor

All timings are median over 30 rounds on a Windows workstation (RTX 5080, Python 3.11.9); mmCIF parsing and .ptt reads are CPU-bound, so these reflect CPU performance. Proteins span the full range from a 76-residue domain to a 3,525-residue CRISPR enzyme. Run python boltz_benchmark.py to reproduce.

Per-structure load times

Structure	Method	Res	MSA seqs	mmCIF parse	ptt: full	ptt: backbone	ptt: bonds	ptt: MSA	ptt: dist mx
1UBQ - Ubiquitin	X-ray	76	512	7.4 ms	3.2 ms	1.3 ms	0.8 ms	1.8 ms	0.8 ms
6LU7 - SARS-CoV-2 Mpro	X-ray	312	1,024	28.7 ms	3.3 ms	1.3 ms	0.8 ms	5.2 ms	1.9 ms
4HHB - Hemoglobin	X-ray	574	2,048	54.1 ms	3.3 ms	1.3 ms	0.8 ms	11.5 ms	3.6 ms
6M0J - ACE2 + RBD	Cryo-EM	791	2,048	73.2 ms	3.3 ms	1.4 ms	0.8 ms	15.3 ms	6.9 ms
6VXX - Spike trimer	Cryo-EM	2,916	8,192	283.9 ms	3.7 ms	1.4 ms	1.0 ms	213.7 ms	74.7 ms
6OHW - Cas12a	Cryo-EM	3,525	8,192	346.5 ms	3.7 ms	1.3 ms	1.0 ms	243.9 ms	107.3 ms

Column definitions

ptt: full - read() - all atoms, backbone, bonds, metadata
ptt: backbone - read_backbone() - N/CA/C/O coordinates + sequence only
ptt: bonds - read_bonds() - covalent graph only
ptt: MSA - read_msa() - MSA tokens + profile (loaded from .ptt cache)
ptt: dist mx - read_pair_feature("distance_matrix") - Ca-Ca distance matrix

Speedup vs mmCIF baseline

Structure	Res	full	backbone	bonds	MSA	dist mx
1UBQ - Ubiquitin	76	2x	6x	10x	4x	9x
6LU7 - SARS-CoV-2 Mpro	312	9x	21x	38x	5x	15x
4HHB - Hemoglobin	574	17x	40x	70x	5x	15x
6M0J - ACE2 + RBD	791	22x	54x	92x	5x	11x
6VXX - Spike trimer	2,916	76x	201x	285x	1x*	4x
6OHW - Cas12a	3,525	95x	257x	343x	1x*	3x

*MSA speedup shown as 1x vs mmCIF parse because both are in the same time range for large proteins - the real MSA comparison is vs JackHMMER generation (see below).

Feature assembly: time to prepare all tensors for model.forward()

Traditional = mmCIF parse + A3M MSA parse + distance-matrix compute. ProteinTensor = read the structure, MSA, distance matrix, and ESM2 embedding from a single pre-cached .ptt. Reproduce with python benchmarks/assembly_benchmark.py (MSA depth and embedding shape are realistic; numeric content is synthetic, so timing reflects tensor dimensions, not values).

Structure	Res	MSA depth	Traditional	ProteinTensor	Speedup
1UBQ - Ubiquitin	76	512	14.1 ms	7.1 ms	2.0x
6LU7 - SARS-CoV-2 Mpro	312	1,024	48.7 ms	13.6 ms	3.6x
4HHB - Hemoglobin	574	2,048	118.0 ms	22.7 ms	5.2x
6M0J - ACE2 + RBD	791	2,048	196.4 ms	38.3 ms	5.1x
6VXX - Spike trimer	2,916	8,192	1,395 ms	309 ms	4.5x
6OHW - Cas12a	3,525	8,192	1,462 ms	381 ms	3.8x

Average speedup across all six structures: 4x for full feature assembly (measured on a Windows CPU box - see benchmarks/ASSEMBLY_RESULTS.md).

On an earlier 34x figure: prior versions reported ~34x here. That number was measured against ProteinTensor's original scalar A3M parser, which dominated the traditional side (~11 s to parse an 8,192-deep MSA). Vectorizing that parser in v0.2.0 cut the traditional baseline ~8x, so the fair feature-assembly speedup is now ~4x. The .ptt read side was unchanged - only the baseline got faster.

Drug target benchmark

Same methodology across six high-value drug targets spanning KRAS oncology, HIV antivirals, PD-L1 immunotherapy, p53, cardiovascular (PCSK9), and a full IgG1 antibody. Numbers are consistent with the structural biology benchmark above.

Target	Res	mmCIF parse	ptt: full	ptt: backbone	ptt: bonds	ptt: MSA	ptt: dist mx
6OIM - KRAS G12C + Sotorasib	167	17.1 ms	3.4 ms	1.3 ms	0.8 ms	3.0 ms	1.3 ms
3HTB - HIV-1 protease	163	16.5 ms	3.3 ms	1.4 ms	0.8 ms	2.8 ms	1.3 ms
5WT9 - PD-L1 checkpoint	533	54.8 ms	3.8 ms	1.4 ms	0.8 ms	11.9 ms	3.8 ms
1TUP - p53 tumor suppressor	585	57.4 ms	3.4 ms	1.4 ms	0.8 ms	13.0 ms	4.0 ms
2P4E - PCSK9	586	55.4 ms	3.4 ms	1.4 ms	0.8 ms	12.8 ms	4.1 ms
1IGT - IgG1 antibody	1,316	127.3 ms	3.5 ms	1.4 ms	0.8 ms	47.1 ms	17.9 ms

Target	Res	full	backbone	bonds	MSA	dist mx
6OIM - KRAS G12C + Sotorasib	167	5x	13x	22x	6x	13x
3HTB - HIV-1 protease	163	5x	12x	21x	6x	13x
5WT9 - PD-L1 checkpoint	533	15x	40x	69x	5x	14x
1TUP - p53 tumor suppressor	585	17x	42x	71x	4x	14x
2P4E - PCSK9	586	16x	41x	70x	4x	14x
1IGT - IgG1 antibody	1,316	37x	92x	156x	3x	7x

DataLoader batch throughput

Measured using ProteinDataset + ProteinDataset.collate(), loading structures into padded batches ready for model.forward(). Single process, no prefetch workers.

Batch size	ms / batch	Structures / sec
1	0.01 ms	97,088
4	0.03 ms	116,279
8	0.42 ms	19,242
16	0.97 ms	16,412
32	2.1 ms	15,033

Scale projection: 100,000 structures, one training epoch

These are projections, extrapolated from the measured per-structure timings above - not end-to-end measurements at 100k scale.

Operation	Traditional pipeline	ProteinTensor	Speedup
Structure load (parse mmCIF each epoch)	3.8 hours	6 min	37x
Backbone-only load (template search)	3.8 hours	2 min	95x
Full feature assembly (seq + MSA + pairs + emb)	16 hours	3.9 hours	4x
MSA generation (JackHMMER, 32-core CPU, once)	4,000 hours	2.7 hours	1,477x

MSA generation assumes 2.4 min/protein on a 32-core server (PDB90 database, standard AlphaFold settings). ProteinTensor generates MSAs once and loads from the .ptt cache on every subsequent run. The 4,000-hour figure is the real cost AlphaFold2 and Boltz users pay to build training datasets from scratch.

Measured vs projected - read this. The 1,477x above is MSA generation (building the alignment once with JackHMMER) and is a literature-based projection, not something benchmarked here. What is measured on hardware is the recurring per-epoch MSA load - reading a cached MSA from .ptt vs re-parsing A3M text each epoch (against a vectorized A3M parser baseline): 3.4x-5.9x, growing with MSA depth. See benchmarks/MSA_RESULTS.md. These are different quantities; do not read the 1,477x as a measured load speedup.

Disk tradeoff

A full-featured .ptt (8,192-sequence MSA + distance matrix + ESM2-650M embedding at float16) averages 23x larger than the source mmCIF across the six benchmark structures. The tradeoff is deliberate: pay disk space once to avoid paying GPU-hours and CPU-hours on every training run. A structure-only .ptt with no cached features is smaller than the source mmCIF.

Install

pip install -e ".[dev]"           # core + dev tools
pip install -e ".[cloud]"         # adds fsspec, s3fs, gcsfs for remote reads
pip install -e ".[dev,cloud]"     # everything

Requires Python >= 3.9, gemmi, zarr, numpy, click, rich.

Quick Start

Convert a structure

proteintensor convert 1abc.cif 1abc.ptt
proteintensor info 1abc.ptt

Convert a sequence (no structure required)

For sequence-driven predictors like AlphaFold and Boltz, the primary input is a sequence, not a structure. ProteinTensor can build a sequence-only .ptt (no coordinates) directly from a raw string or a FASTA file:

proteintensor convert-seq MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDG ubq.ptt
proteintensor convert-seq complex.fasta complex.ptt   # multi-record FASTA -> multi-chain

import proteintensor as pt

data = pt.from_sequence("MQIFVKTLTGK...", pdb_id="UBQ", chain_id="A")
data.has_structure        # False - sequence-only entry
data.sequence_tokens      # (N_res,)  int32

pt.write(data, "ubq.ptt")

# FASTA: a single record -> one chain; multiple records -> multi-chain complex
data = pt.from_fasta("complex.fasta")

Batch-convert a directory

Convert an entire directory of structures in parallel, with progress reporting. Files that fail to parse are skipped and listed in the summary; already-converted outputs are skipped by default.

proteintensor convert-dir ./pdb_files/ ./ptt_files/            # auto worker count
proteintensor convert-dir ./pdb_files/ ./ptt_files/ --workers 16 --recursive
proteintensor convert-dir ./pdb_files/ ./ptt_files/ --overwrite  # rebuild existing

Benchmark against mmCIF

proteintensor benchmark 1abc.cif --rounds 20

Cache an MSA (after running JackHMMER / ColabFold)

import proteintensor as pt

msa = pt.from_a3m("1abc_uniref90.a3m",
                  tool="jackhammer", tool_version="3.3.2",
                  database="uniref90", database_date="2024-01")
pt.add_msa("1abc.ptt", msa, source="uniref90")

Cache ESM2 embeddings (after GPU inference)

pt.add_embedding("1abc.ptt", esm_representations,
                 model="esm2_t33_650M_UR50D", layer=-1, dtype="float16",
                 sequence_hash=pt.embedding_sequence_hash(data.sequence_tokens))

Run Boltz2 directly from a .ptt file

from proteintensor import BoltzAdapter

adapter = BoltzAdapter("1abc.ptt")
predictions = adapter.predict(
    "boltz_output/",
    model="boltz2",
    diffusion_samples=5,
    recycling_steps=3,
    accelerator="gpu",
)
# -> boltz_output/predictions/1abc/1abc_model_0.cif  (predicted structure)
# -> boltz_output/predictions/1abc/pae_*.npz          (PAE matrix)
# -> boltz_output/predictions/1abc/plddt_*.npz        (per-residue confidence)

Python API

import proteintensor as pt

# ------ Structure ------
data = pt.read("1abc.ptt")
data.atom_positions.shape      # (N_atoms, 3)   float32
data.sequence_tokens.shape     # (N_res,)        int32
data.backbone_positions.shape  # (N_res, 4, 3)  float32  N/CA/C/O
data.bond_edge_index.shape     # (2, N_edges)   int32   bidirectional

# Backbone only (fastest structural load)
bb = pt.read_backbone("1abc.ptt")
bb.positions.shape  # (N_res, 4, 3)

# Bond graph only
bonds = pt.read_bonds("1abc.ptt")

# ------ MSA ------
msa = pt.read_msa("1abc.ptt", source="uniref90")
msa.tokens.shape    # (N_seq, N_res)  int32
msa.profile.shape   # (N_res, 23)    float32

# ------ Pair features ------
pt.compute_and_store_distances("1abc.ptt")       # Ca-Ca distance matrix
pt.compute_and_store_contacts("1abc.ptt", threshold=8.0)

dist = pt.read_pair_feature("1abc.ptt", "distance_matrix")
dist.data.shape     # (N_res, N_res, 1)  float32

# Store arbitrary pair tensors (template features, MSA covariance, …)
pt.add_pair_feature("1abc.ptt", my_array, name="template_pair",
                    symmetric=False, dtype="float16")

# ------ PLM embeddings ------
emb = pt.read_embedding("1abc.ptt", "esm2_t33_650M_UR50D")
emb.data.shape      # (N_res, 1280)  float32  (upcast from float16 on load)

# ------ Ligands / small molecules ------
# Capture drugs, cofactors, and ions from a structure (opt-in)
data = pt.from_mmcif("6oim.cif", include_ligands=True)
[l.name for l in data.ligands]        # ['MG', 'GDP', 'MOV']  (MOV = sotorasib)

ligs = pt.read_ligands("6oim.ptt")
ligs[0].elements                      # (N_atoms,)  S2  element symbols
ligs[0].positions                     # (N_atoms, 3)  float32
pt.list_ligands("6oim.ptt")           # ['MG', 'GDP', 'MOV']

# Build a ligand from SMILES (needs `pip install "proteintensor[ligands]"`)
aspirin = pt.from_smiles("CC(=O)Oc1ccccc1C(=O)O", name="AIN")
pt.add_ligand("target.ptt", aspirin)  # attach to an existing .ptt

# ------ Lazy / zero-copy access ------
positions = pt.mmap_positions("1abc.ptt")       # zarr.Array - no full load
backbone  = pt.mmap_backbone("1abc.ptt")        # [N_res, 4, 3]
msa_lazy  = pt.mmap_msa_tokens("1abc.ptt", "uniref90")  # [N_seq, N_res]
emb_lazy  = pt.mmap_embedding("1abc.ptt", "esm2_t33_650M_UR50D")

# Slice without loading the full tensor
ca_window = backbone[100:164, 1, :]             # 64 Ca positions
top_100   = msa_lazy[:100, :]                   # first 100 MSA sequences

# ------ PyTorch ------
import torch
data   = pt.read("1abc.ptt")
coords = torch.from_numpy(data.atom_positions)   # (N_atoms, 3)
tokens = torch.from_numpy(data.sequence_tokens)  # (N_res,)

# ------ JAX ------
import jax.numpy as jnp
data   = pt.read("1abc.ptt")
coords = jnp.array(data.atom_positions)

# ------ Cloud streaming ------
# Read a single structure directly from S3 (no local download)
data = pt.read("s3://my-bucket/proteins/1abc.ptt")
bb   = pt.read_backbone("s3://my-bucket/proteins/1abc.ptt")
arr  = pt.mmap_positions("s3://my-bucket/proteins/1abc.ptt")  # lazy remote array

# Open a dataset stored in cloud
ds = pt.ProteinDataset("s3://my-bucket/training.ptt")

# Prepare a local .ptt for fast remote reads before uploading (one-time)
pt.consolidate("1abc.ptt")                  # writes .zmetadata
# aws s3 cp -r 1abc.ptt s3://my-bucket/proteins/1abc.ptt

# Pass storage_options for credentials or custom endpoints
data = pt.read(
    "s3://my-bucket/proteins/1abc.ptt",
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"},
)

# ------ Multi-structure dataset ------
# Structure .ptt files and sequence-only .ptt files can be mixed in one dataset.
pt.create_dataset("training.ptt")
for ptt_file in Path("ptt_files").glob("*.ptt"):
    pt.add_to_dataset("training.ptt", ptt_file)

ds = pt.ProteinDataset("training.ptt")
len(ds)               # number of structures
ds[0]                 # ProteinTensorData by index
ds["1ABC"]            # ProteinTensorData by PDB ID (case-insensitive)

# PyTorch DataLoader - collate pads variable-length sequences
from torch.utils.data import DataLoader
loader = DataLoader(ds, batch_size=8, collate_fn=pt.ProteinDataset.collate)
for batch in loader:
    coords  = torch.from_numpy(batch["atom_positions"])   # (B, max_atoms, 3)
    pad     = torch.from_numpy(batch["padding_mask"])     # (B, max_res)  True=real
    has_str = torch.from_numpy(batch["has_structure"])    # (B,)  False = sequence-only

Sequence-only entries contribute zero atoms to the batch (n_atoms == 0, has_structure == False), so sequence-driven and structure-based samples can be loaded together in one DataLoader.

.ptt file layout

structure.ptt/                      Zarr directory store (v0.7)
├── .zattrs                         format, version, pdb_id, resolution, ...
├── sequence/
│   ├── tokens             [N_res]           int32    AA vocab indices (0-20)
│   ├── residue_index      [N_res]           int32    PDB sequence numbers
│   └── chain_id           [N_res]           S1       chain labels
├── atoms/
│   ├── positions          [N_atoms, 3]      float32  Angstrom coordinates
│   ├── mask               [N_atoms]         bool
│   └── b_factors          [N_atoms]         float32  B-factor / pLDDT
├── structure/
│   ├── residue_atom_start [N_res]           int32    first atom index per residue
│   └── residue_atom_count [N_res]           int32    atom count per residue
├── backbone/
│   ├── positions          [N_res, 4, 3]     float32  N/CA/C/O coords
│   └── mask               [N_res, 4]        bool     False = missing atom
├── bonds/
│   ├── edge_index         [2, N_edges]      int32    bidirectional (src, dst)
│   └── edge_type          [N_edges]         uint8    1=SINGLE 2=DOUBLE 4=AROMATIC
│                                                      5=PEPTIDE 6=DISULFIDE
├── msa/
│   └── <source>/                            one sub-group per database source
│       ├── .zattrs                          tool, version, database, date, seq SHA-256
│       ├── tokens         [N_seq, N_res]    int32    0-20=AA 21=GAP 22=MASK
│       ├── deletion_matrix [N_seq, N_res]   float32  insertions before each column
│       ├── profile        [N_res, 23]       float32  per-position residue frequencies
│       └── deletion_mean  [N_res]           float32
├── pairs/
│   └── <name>/                              one sub-group per named feature
│       ├── .zattrs                          channels, symmetric, dtype, description
│       └── data           [N_res, N_res, C] any dtype, chunked 128x128xC
├── embeddings/
│   └── <model>/                             one sub-group per PLM model
│       ├── .zattrs                          model, layer, dim, dtype, seq SHA-256
│       └── data           [N_res, D]        float32 or float16, chunked 256xD
└── ligands/
    └── <index>/                            one sub-group per non-polymer ligand
        ├── .zattrs                          name (CCD), chain_id, res_num, smiles
        ├── elements       [N_atoms]         S2       element symbols
        ├── positions      [N_atoms, 3]      float32  Angstrom coordinates
        └── b_factors      [N_atoms]         float32

Multi-structure dataset layout

dataset.ptt/                        Zarr directory store
├── .zattrs                         format="proteintensor-dataset", version, num_structures
└── structures/
    ├── 000000/                     zero-padded integer key
    │   └── (same layout as single .ptt above)
    ├── 000001/
    │   └── ...
    └── ...

Each sub-group under structures/ is identical to a standalone .ptt root, so all single-structure reader helpers work on sliced groups.

Supported models

Model	Adapter	Status
Boltz 2	`BoltzAdapter`	Verified - end-to-end prediction on RTX 5080
Boltz 1	`BoltzAdapter(model="boltz1")`	Supported
OpenFold	-	Planned
RoseTTAFold-All-Atom	-	Planned
Chai-1	-	Planned

Run tests

pytest tests/ -v

150 tests across structure roundtrip, backbone/bonds/MSA/pairs/embeddings/ligands, sequence conversion, A3M parsing, Boltz adapter, multi-structure dataset, and cloud streaming (memory:// fsspec - no real cloud account required).

Roadmap

Backbone-only dense layout [N_res, 4, 3] for faster backbone access
Bond graph storage (edge_index) - SINGLE / DOUBLE / AROMATIC / PEPTIDE / DISULFIDE
MSA feature caching - A3M parser, provenance tracking, multi-source per file
Pair representation block [N, N, C] - distance matrix, contact map, generic named tensors
Pre-embedded ESM2 / ESM3 features - float16 storage, provenance hash, lazy mmap access
Model adapters: Boltz2 - end-to-end prediction from .ptt verified on RTX 5080
Multi-structure dataset container - one Zarr store, N structures, PyTorch DataLoader compatible
Cloud streaming - S3 / GCS via fsspec, training directly from object storage

Model coverage

OpenFold adapter
RoseTTAFold-All-Atom adapter
Chai-1 adapter

Data pipeline

Batch convert CLI - convert entire PDB directories in parallel with progress reporting
Sequence-identity dataset splitting - MMseqs2-based cluster splits to prevent data leakage between train / val / test

Format extensions

Ligand / small-molecule support - CCD-based extraction from structures, SMILES input via RDKit, element/coordinate storage (bond graphs and binding-site annotations still to come)
MD trajectory storage - time axis [N_frames, N_atoms, 3] for conformational ensembles and AlphaFold 3 diffusion trajectories

Performance

Parallel DataLoader workers - thread-safe multi-worker prefetching verified under PyTorch DDP
Format version migration CLI - upgrade .ptt files in-place across version bumps

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.github/workflows		.github/workflows
assets		assets
benchmarks		benchmarks
proteintensor		proteintensor
tests		tests
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
boltz_benchmark.py		boltz_benchmark.py
paper.bib		paper.bib
paper.md		paper.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ProteinTensor Introduction

The Problem

Who This Is For

Benchmark: Traditional Pipeline vs ProteinTensor

Per-structure load times

Speedup vs mmCIF baseline

Feature assembly: time to prepare all tensors for model.forward()

Drug target benchmark

DataLoader batch throughput

Scale projection: 100,000 structures, one training epoch

Disk tradeoff

Install

Quick Start

Convert a structure

Convert a sequence (no structure required)

Batch-convert a directory

Benchmark against mmCIF

Cache an MSA (after running JackHMMER / ColabFold)

Cache ESM2 embeddings (after GPU inference)

Run Boltz2 directly from a .ptt file

Python API

.ptt file layout

Multi-structure dataset layout

Supported models

Run tests

Roadmap

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ProteinTensor Introduction

The Problem

Who This Is For

Benchmark: Traditional Pipeline vs ProteinTensor

Per-structure load times

Speedup vs mmCIF baseline

Feature assembly: time to prepare all tensors for model.forward()

Drug target benchmark

DataLoader batch throughput

Scale projection: 100,000 structures, one training epoch

Disk tradeoff

Install

Quick Start

Convert a structure

Convert a sequence (no structure required)

Batch-convert a directory

Benchmark against mmCIF

Cache an MSA (after running JackHMMER / ColabFold)

Cache ESM2 embeddings (after GPU inference)

Run Boltz2 directly from a .ptt file

Python API

.ptt file layout

Multi-structure dataset layout

Supported models

Run tests

Roadmap

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages