Language Model Finetuning for Property Prediction
The Finetune Protein Language Model tool adapts a pretrained protein language model to predict a numerical (regression) or categorical (classification) property from a single-chain protein sequence. Upload a CSV with a sequence column and a property column, pick a base model, and we save the resulting adapter to your account. Once training finishes, find the model under Finetuning → My Models and select it to run inference on new sequences. Common use cases: binding affinity (Kd, EC50, IC50), enzyme activity, fluorescence, expression yield, thermostability (Tm), solubility, and other sequence-driven measurements.
Choosing a base model
We support five sizes of ESM2, plus ProtT5, ProstT5, and the Ankh family. Larger models capture more sequence context but are slower and need more memory. The default is ESM2 650M, which works well for most datasets.

| Base model | When to pick it |
|---|---|
| facebook/esm2_t6_8M_UR50D | Quick smoke tests, very small datasets (fewer than 50 rows) |
| facebook/esm2_t12_35M_UR50D | Tight compute budget, fast iteration |
| facebook/esm2_t30_150M_UR50D | Balanced choice for medium datasets |
| facebook/esm2_t33_650M_UR50D (default) | Most regression and classification tasks |
| facebook/esm2_t36_3B_UR50D | Largest datasets, hardest tasks (slow) |
| Rostlab/prot_t5_xl_uniref50 | T5-style backbone alternative to ESM2 |
| Rostlab/ProstT5 | Sequence + structure-aware backbone |
| ElnaggarLab/ankh-base / ankh-large | Ankh family alternatives |
Preparing your data
A few rules of thumb that have made a measurable difference in customer runs:
- Normalize regression targets to roughly the [-1, 1] range. Raw values like Kd in nM or EC50 in M can cause training loss to blow up. Centering and scaling (e.g. z-score, then clip outliers) gives stable training.
- Aim for at least 100 rows of training data. Below ~50, results are unreliable; below 10 we print a warning. Single-protein mutational scans (one wild-type with many close variants) are the hardest case because sequences are nearly identical and signal is subtle.
- Deduplicate exact sequence repeats before upload. Duplicates inflate validation metrics without improving the model.
- For classification, encode your label column as discrete strings or integers. We map unique values to class indices automatically.
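The deduplication and target-scaling advice above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the tool's own preprocessing: the column names `sequence` and `property`, the helper name `prepare_regression_rows`, and the choice to clip at 3 standard deviations and then divide by 3 are all assumptions made for the example.

```python
import csv
import io
import statistics


def prepare_regression_rows(rows, seq_col="sequence", target_col="property", clip=3.0):
    """Deduplicate exact sequences, then z-score, clip, and rescale the targets.

    Hypothetical helper for illustration; column names and clip value are assumptions.
    """
    # Keep only the first occurrence of each exact sequence.
    seen = set()
    unique = []
    for row in rows:
        seq = row[seq_col].strip().upper()
        if seq not in seen:
            seen.add(seq)
            unique.append({seq_col: seq, target_col: float(row[target_col])})

    values = [r[target_col] for r in unique]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # fall back to 1.0 if all targets are equal

    for r in unique:
        z = (r[target_col] - mean) / std
        z = max(-clip, min(clip, z))  # clip outliers at +/- `clip` standard deviations
        r[target_col] = z / clip      # rescale so every value lies in [-1, 1]
    return unique, mean, std


# Small in-memory CSV standing in for an uploaded file.
raw = io.StringIO(
    "sequence,property\n"
    "MKTAYIAK,1.0\n"
    "MKTAYIAK,1.0\n"
    "MKTAYIAR,3.0\n"
    "MKTAYIAE,5.0\n"
)
clean_rows, target_mean, target_std = prepare_regression_rows(csv.DictReader(raw))
```

Keep `target_mean`, `target_std`, and the clip value so predictions can be mapped back to the original units. If duplicate sequences carry different measurements, averaging them may be preferable to keeping the first row.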
Settings
The defaults work for most users. Adjust the following only if needed:

| Setting | Default | When to change |
|---|---|---|
| epochs | 20 | Decrease for very large datasets where each epoch is expensive; the early-stopping callback (below) will halt training automatically once validation plateaus |
| learningRate | 3e-4 | Tuned for LoRA. If fullModelTraining is on, drop to 1e-5 to 5e-5; the LoRA default will diverge |
| batchSize | 1 | Larger batches speed training if memory allows |
| gradientAccumulation | 8 | Effective batch size = batchSize * gradientAccumulation |
| dropout | 0.2 | Increase for small/noisy datasets |
| fullModelTraining | false | Train all weights instead of a LoRA adapter. More expressive but slower and needs a lower learning rate |
| loraRank | 4 | LoRA adapter rank. Bump to 8 or 16 for datasets with >1000 rows where 4 underfits |
| loraAlpha | 1 | LoRA scaling. Set roughly equal to or 2x loraRank (e.g. rank=8 → alpha=8 or 16) |
| earlyStoppingPatience | 5 | Stop training if the validation metric doesn’t improve for N epochs. Higher (10+) for noisy small datasets, or set equal to epochs to disable |
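Several of these settings interact, so it can help to check a configuration before submitting it. The sketch below is illustrative only: `check_settings` is a hypothetical helper, not part of the tool, and the `loraScale` value assumes the standard LoRA convention of scaling the adapter update by alpha / rank, which we have not confirmed for this tool.

```python
def check_settings(s):
    """Return derived values and warnings for a settings dict (hypothetical helper)."""
    warnings = []

    # Effective batch size as defined in the table above.
    effective_batch = s["batchSize"] * s["gradientAccumulation"]

    # Standard LoRA scaling factor alpha / r (assumed, not confirmed for this tool).
    lora_scale = s["loraAlpha"] / s["loraRank"]

    if s["fullModelTraining"] and s["learningRate"] > 5e-5:
        warnings.append("fullModelTraining is on: lower learningRate to 1e-5..5e-5")
    if s["earlyStoppingPatience"] >= s["epochs"]:
        warnings.append("earlyStoppingPatience >= epochs: early stopping is disabled")

    return {
        "effectiveBatchSize": effective_batch,
        "loraScale": lora_scale,
        "warnings": warnings,
    }


# Defaults copied from the settings table.
defaults = {
    "epochs": 20,
    "learningRate": 3e-4,
    "batchSize": 1,
    "gradientAccumulation": 8,
    "dropout": 0.2,
    "fullModelTraining": False,
    "loraRank": 4,
    "loraAlpha": 1,
    "earlyStoppingPatience": 5,
}

default_report = check_settings(defaults)
full_report = check_settings({**defaults, "fullModelTraining": True})
```

With the defaults, the effective batch size is 8 and no warnings are raised. Turning on `fullModelTraining` without lowering `learningRate` triggers the learning-rate warning.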
Interpreting the output
Each finetune job produces a model.pth and a training_history.png plot of train/val loss and the validation metric over epochs. A flat or noisy validation curve usually means that more data, a lower learning rate, or longer training is needed. A train loss that drops while validation flatlines means the model is overfitting; try increasing dropout, decreasing loraRank, or shortening training.
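When reading the validation curve, it can be useful to work out where a given earlyStoppingPatience would have halted the run. The sketch below shows a generic patience rule; it assumes that any strict improvement resets the counter, which may differ from the tool's exact callback (for example, if it uses a minimum-improvement threshold). The function name and the example metric values are made up for illustration.

```python
def early_stop_epoch(val_metric, patience, higher_is_better=True):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_metric, start=1):
        if best is None:
            improved = True
        elif higher_is_better:
            improved = value > best
        else:
            improved = value < best

        if improved:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Illustrative validation metric per epoch (not real results).
history = [0.10, 0.25, 0.31, 0.30, 0.29, 0.31, 0.30, 0.28, 0.27, 0.26]

stop_at = early_stop_epoch(history, patience=5)
never_stops = early_stop_epoch(history, patience=len(history))
```

Here the best value is reached at epoch 3 and the next five epochs fail to beat it, so a patience of 5 stops the run at epoch 8. Setting patience equal to the number of epochs means the rule never fires.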
Antibody Property Prediction
For antibody datasets with paired heavy and light chains, we offer a finetuning workflow that supports several antibody-pretrained backbones for regression and classification. This works well for specificity, developability proxies, and other antibody-specific properties where antibody-pretrained embeddings give a head start over a generic protein language model. Email info@tamarind.bio to get this enabled for your account.
Point Mutations
State-of-the-art models, including ours, struggle to capture small effects from single point mutations directly. If you have a small set of measured point mutations and want to identify the most informative next batch to test, we recommend ALDE (Active Learning-assisted Directed Evolution). For a fixed budget of experimental rounds, ALDE selects which mutations to measure next given the data you’ve collected so far.
Structure Prediction Finetuning
We offer Boltz finetuning on custom CIF structures, which can then be used for inference on new sequences. This is currently available to select partners. Email info@tamarind.bio to discuss your use case.
Large Binding Affinity Datasets (30,000+ data points)
For very large binding affinity datasets (tens of thousands of measurements against a common target), we recommend AlphaBind, which is purpose-built for this regime. Email info@tamarind.bio to learn more.
Tips for getting good results
- Start with the default settings and the default base model. Only tune one thing at a time when iterating.
- Watch the validation curve. If the validation metric never rises above ~0.1, the model isn’t learning the signal. Check for label noise, target scale, or whether the task is fundamentally too hard for sequence alone.
- Mutational-scan datasets are the hardest case. When all sequences differ by 1–2 residues and labels are subtle, even a well-trained model can produce nearly constant predictions. Larger backbones, higher loraRank, and fullModelTraining help, but there’s a fundamental capacity ceiling for very narrow datasets.
- Validate with held-out sequences, not random row splits. Especially for antibody and mutational data, random splits leak information across train and validation.
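One way to build a held-out split for your own evaluation is to assign whole groups of related sequences to either train or validation, so that near-identical variants never straddle the boundary. The sketch below is a minimal illustration: the `parent_id` column (for example a clonotype or wild-type identifier) and the `split_by_group` helper are hypothetical, and how you define a group depends on your data.

```python
import hashlib


def split_by_group(rows, group_key, val_fraction=0.2):
    """Assign whole groups to train or validation using a stable hash of the group id."""
    train, val = [], []
    for row in rows:
        digest = hashlib.sha256(str(row[group_key]).encode("utf-8")).digest()
        # Map the first 8 bytes of the hash to a number in [0, 1).
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        # Every row with the same group id gets the same bucket, hence the same side.
        (val if bucket < val_fraction else train).append(row)
    return train, val


# Illustrative rows: 20 hypothetical parents with 3 variants each.
rows = [
    {"sequence": f"SEQ{p}_{v}", "property": 0.0, "parent_id": f"parent_{p}"}
    for p in range(20)
    for v in range(3)
]

train_rows, val_rows = split_by_group(rows, group_key="parent_id", val_fraction=0.2)
```

Because the assignment depends only on the group id, the split is reproducible across runs and stays stable as new rows are added. With very few groups the validation side can end up empty or much larger than `val_fraction`, so check the resulting sizes.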