Structure Prediction
Predict protein structure given a sequence
The structure prediction task is simple: given a representation of a biomolecule such as an amino acid or nucleic acid sequence, predict its 3D structure. The input is a protein sequence, the output is a PDB or CIF file. As structures unlock valuable information about a molecule’s function or behavior, it is interwoven in molecular discovery workflows, from target preparation to binder design to assessing developability.
An overview of our top recommendations for structure prediction tools is below.
Tool | Inputs | Outputs | Capabilities |
---|---|---|---|
AlphaFold2 | Protein | Average pLDDT, PAE matrix, ipTM/pTM | 5 samples |
Boltz | Protein, nucleic acid, small molecule (ccd, smiles) | Confidence, PDE/PAE matrix, ipTM/pTM | Multiple samples |
Chai | Protein, nucleic acid, small molecule (smiles) | Aggregate score, Average pLDDT, PAE/PDE matrix, pTM/ipTM | Multiple samples |
Protenix | Protein, nucleic acid, small molecule (ccd) | pLDDT, pTM/ipTM, PAE/PDE matrix | Multiple samples |
Recommend for specific cases
- ImmuneBuilder - for fast antibody/nanobody/TCR structure prediction (very fast)
- HighFold - for cyclic peptides, uses AlphaFold2 modified with a cyclic offset
- ESMFold - uses ESM2 language model, very fast
More experimental tools
- AFSample2 - uses AlphaFold jackhammer MSAs instead of MMseqs2 MSAs, so takes several hours. Generates arbitrary number of conformations
- UniFold - retrainable AlphaFold, also uses jackhammer MSAs
- CombFold - combine large complexes with many repeating domains from smaller structure predictions of subunits
- AlphaFlow - generates more diverse conformations
- RosettaFoldNA - Protein/nucleic acid complexes
- RosettaFold-All-Atom - Protein/small molecule complexes - allows for uploading a sdf for ligand
Multiple Sequence Alignment
All of our structure prediction tools require a Multiple Sequence Alignment (MSA) as input. This queries a large database of sequences to find similar sequences to your input to inform the structure prediction. The alignment produces a .a3m file, which stores the result sequences aligned with the input sequence. We then use this file as input to your structure prediction.
We use the ColabFold MMseqs2 method to query uniref30, bfd, and the colabfold environmental db to generate the MSA. We host our own servers for MSA, with the databases loaded into memory, for a great improvement in calculation speed.
Input limits
All Alphafold3 models have a hard input limit of 2048 residues due to limitations of the models. Alphafold2 has a limit of ~5000 residues due to compute restraints.
Alphafold2
AlphaFold2 was a major leap in structure prediction when it came out, shifting the field from physics-based modeling to deep learning. It uses attention-based neural networks and evolutionary data, reaching near-experimental accuracy and outperforming everything that came before it.
The training cutoff for AlphaFold2 is April 30, 2018. This can be helpful to keep in mind when evaluating or benchmarking results.
Overview of outputs
- pTM (predicted Template Modeling score): Evaluates the confidence in the overall arrangement of the predicted structure, ranging from 0 to 1. Scores above 0.5 suggest a good prediction; scores below 0.5 indicate poor confidence.
- ipTM (interface predicted TM-score): Assesses the confidence in the predicted relative positions of interacting subunits within a protein complex. Values above 0.8 indicate high-quality interface predictions, while scores below 0.6 suggest potential inaccuracies.
- pAE (predicted Aligned Error) matrix: A 2D plot estimating the expected positional error between residue pairs in Ångströms. Lower pAE values denote higher confidence in the relative positioning of residues, whereas higher values indicate uncertainty. This matrix is particularly useful for evaluating the confidence in spatial relationships between different domains or subunits.
Helpful guide on interpreting outputs: https://www.ebi.ac.uk/training/online/courses/alphafold/
AlphaFold2 has 5 sets of weights, and the best model is listed as Rank 1. Because of this, AlphaFold2 can only produce up to 5 models. By default, AlphaFold2 will run for 3 recycles. You can increase this up to 20, but by default the model will still stop earlier if there’s a small difference between subsequent recycles. You can override this by turning on the “No Stop” parameter.
We offer ColabFold’s default amber setting for relaxation. For further energy minimization, we recommend our OpenMM molecular dynamics tool.
We recommend Alphafold2 for pulldown screens for protein-protein interaction. Tamarind can run this in high throughput for up to hundreds of thousands of structures. Alphafold2 metrics have been found to be a better indicator of binding than Alphafold3, as its confidence metrics are better calibrated.
Alphafold3
While Alphafold3 is not commercially available, various reproductions of it have come out, which match its accuracy. We offer Chai-1, Boltz, and Protenix. These models are all-atom models, meaning they can incorporate small molecules and nucleic acids into the structure as well.
Alphafold3 models are all generative models, so they can sample an unlimited number of output structures (up to thousands). If you are looking for a particular pose, we recommend generating many samples and taking the highest confidence / clustering afterwards.
The confidence metric produced from these models can be a good indicator of the model’s confidence in the predicted structure. The other metrics are less clearly useful.
We are members of the OpenFold consortium, which is working on OpenFold, a more complete AlphaFold3 reproduction. We’ll have it available when it’s released in the coming weeks!
Prediction speed
You can expect most AlphaFold3 structure prediction jobs to complete within ~10 minutes on Tamarind. AlphaFold2 is generally slower, taking ~30 minutes to several hours for larger jobs. Prediction time increases linearly with the number of recycles (AlphaFold2) or samples (AlphaFold3) selected.
Antibody structure prediction
Chai and Boltz both provide the option to specify restraints, including contact (between 2 residues) and pocket (between a residue and a chain) restraints. Depending on what epitope and paratope information you have, we recommend constraining your structure prediction. We’ve found using either with a large number of samples to produce good results.
Restraints with Boltz/Chai
- Contact restraints - constrain a specific residue on Chain 1 with a specific residue on Chain 2 - only between 2 protein chains
- Pocket restraints - constrain a specific residue on Chain 1 with a different entire chain or a ligand - between 2 protein chains or a protein and a ligand
- Covalent restraints - create a covalent bond between a specific atom on Chain 1 with a specific atom on Chain 2 - between 2 proteins chains or a protein and a ligand
If you need a lot of structure predictions of your antibody/nanobody without the antigen, ImmuneBuilder or ABodyBuilder3 may be useful. They are very fast, and produce reasonable structures, though aren’t able to include the target in the prediction.
We offer TCRModel2 for Class1, Class2, and unbound TCR prediction.
Peptide structure prediction
We’ve found that Boltz-1x and Chai-1 perform the best for predicting poses of peptide-protein complexes, our recommendation would be try both approaches and see what looks most accurate to your data.
For cyclic peptides, we recommend Highfold, which is a modified version of ColabFold with a cyclic offset implemented, meaning it accounts for the circular rather than linear nature of the peptide.
Linkers
Linkers are usually modeled as part of the monomer. For example, you can include a linker between two complexes like this: AAAAAAGGGGSGGGGSGGGGSGGGGSAAAAAAAA and predict it as one structure using any of our structure prediction tools.