We are actively developing our finetuning pipelines. If any of the use cases below sound interesting, email info@tamarind.bio and we can work with you to determine the best approach for your data. Custom models you train appear under Finetuning → My Models and can be selected as the input model when running inference.

Language Model Finetuning for Property Prediction

The Finetune Protein Language Model tool adapts a pretrained protein language model to predict a numerical (regression) or categorical (classification) property from a single-chain protein sequence. Upload a CSV with a sequence column and a property column, pick a base model, and we save the resulting adapter to your account. Once training finishes, find the model under Finetuning → My Models and select it to run inference on new sequences. Common use cases: binding affinity (Kd, EC50, IC50), enzyme activity, fluorescence, expression yield, thermostability (Tm), solubility, and other sequence-driven measurements.

Choosing a base model

We support five sizes of ESM2, plus ProtT5, ProstT5, and the Ankh family. Larger models capture more sequence context but are slower and need more memory. The default is ESM2 650M, which works well for most datasets.

| Base model | When to pick it |
| --- | --- |
| facebook/esm2_t6_8M_UR50D | Quick smoke tests, very small datasets (fewer than 50 rows) |
| facebook/esm2_t12_35M_UR50D | Tight compute budget, fast iteration |
| facebook/esm2_t30_150M_UR50D | Balanced choice for medium datasets |
| facebook/esm2_t33_650M_UR50D (default) | Most regression and classification tasks |
| facebook/esm2_t36_3B_UR50D | Largest datasets, hardest tasks (slow) |
| Rostlab/prot_t5_xl_uniref50 | T5-style backbone alternative to ESM2 |
| Rostlab/ProstT5 | Sequence + structure-aware backbone |
| ElnaggarLab/ankh-base / ankh-large | Ankh family alternatives |

If you don’t have a strong preference, start with the default. Try a larger backbone only if validation performance plateaus.

Preparing your data

A few rules of thumb that have made a measurable difference in customer runs:
  • Normalize regression targets to roughly the [-1, 1] range. Raw values like Kd in nM or EC50 in M can cause training loss to blow up. Centering and scaling (e.g. z-score, then clip outliers) gives stable training.
  • Aim for at least 100 rows of training data. Below ~50, results are unreliable; below 10 we print a warning. Single-protein mutational scans (one wild-type with many close variants) are the hardest case because sequences are nearly identical and signal is subtle.
  • Deduplicate exact sequence repeats before upload. Duplicates inflate validation metrics without improving the model.
  • For classification, encode your label column as discrete strings or integers. We map unique values to class indices automatically.
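
The regression rules above can be sketched in a few lines of standard-library Python. This is only an illustration, not the preprocessing our pipeline runs: the column names `sequence` and `property`, the choice to keep the first measurement of a duplicated sequence, the whitespace/uppercase cleanup, and the 3-standard-deviation clip are all assumptions you should adapt to your data.

```python
import csv
import io
import statistics


def prepare_rows(rows, clip=3.0):
    """Deduplicate exact sequences, then z-score, clip, and rescale targets.

    rows: iterable of (sequence, value) pairs.
    Returns a list of (sequence, scaled_value) pairs with values in [-1, 1].
    """
    # Keep the first measurement for each exact sequence (an assumption;
    # averaging replicate measurements is another reasonable choice).
    seen = {}
    for seq, value in rows:
        seq = seq.strip().upper()
        if seq not in seen:
            seen[seq] = float(value)

    values = list(seen.values())
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid dividing by zero

    prepared = []
    for seq, value in seen.items():
        z = (value - mean) / std
        z = max(-clip, min(clip, z))      # clip outliers at +/- `clip` std devs
        prepared.append((seq, z / clip))  # rescale the clipped z-score into [-1, 1]
    return prepared


def to_csv(prepared):
    """Serialize prepared rows as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["sequence", "property"])  # hypothetical column names
    writer.writerows(prepared)
    return buf.getvalue()
```

Keep the mean and standard deviation you used if you need to map predictions back to the original units later.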

Settings

The defaults work for most users. Adjust the following only if needed:

| Setting | Default | When to change |
| --- | --- | --- |
| epochs | 20 | Decrease for very large datasets where each epoch is expensive; the early-stopping callback (below) will halt training automatically once validation plateaus |
| learningRate | 3e-4 | Tuned for LoRA. If fullModelTraining is on, drop to 1e-5 to 5e-5; the LoRA default will diverge |
| batchSize | 1 | Larger batches speed training if memory allows |
| gradientAccumulation | 8 | Effective batch size = batchSize * gradientAccumulation |
| dropout | 0.2 | Increase for small/noisy datasets |
| fullModelTraining | false | Train all weights instead of a LoRA adapter. More expressive but slower and needs lower learning rate |
| loraRank | 4 | LoRA adapter rank. Bump to 8 or 16 for datasets with >1000 rows where 4 underfits |
| loraAlpha | 1 | LoRA scaling. Set roughly equal to or 2x loraRank (e.g. rank=8 → alpha=8 or 16) |
| earlyStoppingPatience | 5 | Stop training if validation metric doesn’t improve for N epochs. Higher (10+) for noisy small datasets, or set equal to epochs to disable |

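
Several of these settings interact, so a quick sanity check before submitting a job can help. The sketch below uses the setting names and defaults listed above, but the `check_settings` helper and the plain-dict format are hypothetical and are not part of the tool's interface.

```python
# Defaults as documented in the settings list above.
DEFAULTS = {
    "epochs": 20,
    "learningRate": 3e-4,
    "batchSize": 1,
    "gradientAccumulation": 8,
    "dropout": 0.2,
    "fullModelTraining": False,
    "loraRank": 4,
    "loraAlpha": 1,
    "earlyStoppingPatience": 5,
}


def check_settings(overrides):
    """Return (effective_batch_size, warnings) for a dict of overrides.

    Hypothetical helper for illustration only.
    """
    s = {**DEFAULTS, **overrides}
    warnings = []

    # Effective batch size = batchSize * gradientAccumulation.
    effective_batch = s["batchSize"] * s["gradientAccumulation"]

    # The LoRA learning rate is too high when training all weights.
    if s["fullModelTraining"] and s["learningRate"] > 5e-5:
        warnings.append(
            "fullModelTraining is on: lower learningRate to 1e-5 to 5e-5"
        )

    # Raising the rank without revisiting alpha is easy to overlook.
    if "loraRank" in overrides and "loraAlpha" not in overrides:
        r = s["loraRank"]
        warnings.append(
            f"loraRank changed: consider loraAlpha of about {r} to {2 * r}"
        )

    # Patience equal to (or above) epochs disables early stopping.
    if s["earlyStoppingPatience"] >= s["epochs"]:
        warnings.append("early stopping is effectively disabled")

    return effective_batch, warnings
```

For example, `check_settings({"batchSize": 4, "fullModelTraining": True})` reports an effective batch size of 32 and one warning about the learning rate.
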
We always save the best-performing checkpoint by validation Spearman (regression) or accuracy (classification), not the final epoch. Training logs show which epoch’s weights were retained.
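
To make the interaction between earlyStoppingPatience and best-checkpoint selection concrete, here is a minimal sketch of patience-based early stopping on a higher-is-better validation metric. It is an illustration under simple assumptions (any strict improvement counts, ties do not) and may differ in detail from the callback our pipeline runs; `select_best_epoch` is a hypothetical name.

```python
def select_best_epoch(val_metrics, patience=5):
    """Simulate early stopping on a metric where higher is better.

    val_metrics: validation metric per epoch, in training order.
    Returns (best_epoch, stopped_epoch), both 1-indexed. best_epoch is the
    epoch whose weights would be kept; stopped_epoch is the last epoch run.
    """
    best_epoch, best_value = 0, float("-inf")
    stopped_epoch = len(val_metrics)
    for epoch, value in enumerate(val_metrics, start=1):
        if value > best_value:
            # Strict improvement: remember this epoch as the best checkpoint.
            best_epoch, best_value = epoch, value
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here.
            stopped_epoch = epoch
            break
    return best_epoch, stopped_epoch


history = [0.10, 0.25, 0.31, 0.30, 0.29, 0.31, 0.28, 0.27, 0.26, 0.20]
print(select_best_epoch(history, patience=5))  # (3, 8)
```

In this example the metric peaks at epoch 3, training halts at epoch 8 after five epochs without improvement, and the epoch 3 weights are the ones retained.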

Interpreting the output

Each finetune job produces a model.pth and a training_history.png plot of train/val loss and the validation metric over epochs. A flat or noisy validation curve usually means you need more data, a lower learning rate, or longer training. A train loss that keeps dropping while validation flatlines means the model is overfitting; try increasing dropout, decreasing loraRank, or shortening training.

Antibody Property Prediction

For antibody datasets with paired heavy and light chains, we offer a finetuning workflow that supports several antibody-pretrained backbones for regression and classification. This works well for specificity, developability proxies, and other antibody-specific properties where antibody-pretrained embeddings give a head start over a generic protein language model. Email info@tamarind.bio to get this enabled for your account.

Point Mutations

State-of-the-art models, including ours, struggle to capture small effects from single point mutations directly. If you have a small set of measured point mutations and want to identify the most informative next batch to test, we recommend ALDE (Active Learning-assisted Directed Evolution). For a fixed budget of experimental rounds, ALDE selects which mutations to measure next given the data you’ve collected so far.

Structure Prediction Finetuning

We offer Boltz finetuning on custom CIF structures, which can then be used for inference on new sequences. This is currently available to select partners. Email info@tamarind.bio to discuss your use case.

Large Binding Affinity Datasets (30,000+ data points)

For very large binding affinity datasets (tens of thousands of measurements against a common target), we recommend AlphaBind, which is purpose-built for this regime. Email info@tamarind.bio to learn more.

Tips for getting good results

  • Start with the default settings and the default base model. Only tune one thing at a time when iterating.
  • Watch the validation curve. If the validation metric never rises above ~0.1, the model isn’t learning the signal. Check for label noise, a badly scaled target, or a task that is fundamentally too hard to predict from sequence alone.
  • Mutational-scan datasets are the hardest case. When all sequences differ by 1–2 residues and labels are subtle, even a well-trained model can produce nearly constant predictions. Larger backbones, higher loraRank, and fullModelTraining help, but there’s a fundamental capacity ceiling for very narrow datasets.
  • Validate with held-out sequences, not random row splits. Especially for antibody and mutational data, random splits leak information across train and validation.
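
One way to build a held-out split is to assign whole groups of related sequences to either train or validation. The sketch below is a minimal standard-library version, assuming you can supply a group key per row (for example a sequence-cluster ID or an antibody lineage); the `grouped_split` helper and the idea of a group key are illustrative assumptions, not a description of how the tool splits data internally.

```python
import hashlib


def grouped_split(rows, group_of, val_fraction=0.2):
    """Split rows so each group lands entirely in train or in validation.

    rows: list of records (any type).
    group_of: function mapping a record to its group key.
    val_fraction: approximate share of groups sent to validation.
    """
    train, val = [], []
    for row in rows:
        key = str(group_of(row)).encode("utf-8")
        # Stable hash -> number in [0, 1). The same group key always maps
        # to the same side, so related sequences never straddle the split.
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
        (val if bucket < val_fraction else train).append(row)
    return train, val


# Example: records are (sequence, cluster_id, value) tuples.
rows = [(f"SEQ{i}", f"cluster{i % 7}", float(i)) for i in range(70)]
train, val = grouped_split(rows, group_of=lambda r: r[1])
```

Because assignment is per group, the realized validation share is only approximate, and with very few groups one side can end up empty; check the split sizes before training.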