SentenceTransformer wrapper for Rostlab/prot_t5_xl_uniref50

This repository repackages Rostlab/prot_t5_xl_uniref50 as a Sentence-Transformers model: a Transformer module followed by mean pooling, which turns a protein sequence of any length into a fixed-size embedding.
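
To make the pooling step concrete, here is a minimal sketch of masked mean pooling in plain Python. It assumes the pooling layer averages token vectors over the positions flagged by the attention mask, which is the usual meaning of mean pooling. The function name mean_pool and the toy numbers are illustrative only and are not part of this repository.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions.

    token_embeddings: list of token vectors (one list of floats per token)
    attention_mask:   list of 0/1 flags, 1 for real tokens, 0 for padding
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    # Guard against an all-padding input to avoid division by zero.
    count = max(count, 1)
    return [t / count for t in total]


# Three token positions, the last one is padding and is ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

Because the average runs over the sequence dimension, the output length depends only on the model's hidden size, not on the number of residues.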

Preprocessing (IMPORTANT)

ProtT5 expects each input sequence to be prepared as follows:

  • Replace rare/ambiguous amino acids U, Z, O, B with X
  • Insert whitespace between all amino acids

Example: PRTEINO -> "P R T E I N X" (the O is replaced with X, then residues are space-separated)
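
Forgetting this step is easy to miss, because the model still returns embeddings for unprepared input. The sketch below is one possible sanity check that a string already follows both rules. The helper name is_prott5_ready is hypothetical and not part of this repository or of Sentence-Transformers.

```python
import re

# Single uppercase letters separated by exactly one space.
_PREPARED = re.compile(r"[A-Z](?: [A-Z])*")


def is_prott5_ready(text):
    """Return True if `text` is space-separated and free of U, Z, O, B."""
    if not _PREPARED.fullmatch(text):
        return False
    return not any(ch in "UZOB" for ch in text)


print(is_prott5_ready("S E Q W E N C E"))  # True
print(is_prott5_ready("SEQWENCE"))         # False: no spaces
print(is_prott5_ready("M K U"))            # False: U not mapped to X
```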

Usage

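import re

from sentence_transformers import SentenceTransformer

def prott5_prepare_sequences(seqs):
    """Map U, Z, O, B to X and put a space between residues."""
    out = []
    for s in seqs:
        # Uppercase first so lowercase input is handled the same way.
        s = re.sub(r"[UZOB]", "X", s.upper())
        out.append(" ".join(s))
    return out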

model = SentenceTransformer("wrice/prot_t5_xl_uniref50-st")
seqs = ["PRTEINO", "SEQWENCE"]
seqs = prott5_prepare_sequences(seqs)

emb = model.encode(seqs, normalize_embeddings=False, batch_size=4, show_progress_bar=True)
print(emb.shape)
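
A common next step is comparing two sequences by the cosine similarity of their embeddings. The following is a minimal, dependency-free sketch. The function name cosine_similarity and the toy vectors are illustrative and stand in for rows of emb.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define similarity with a zero vector as 0.0 to avoid dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embedding rows.
a = [1.0, 0.0, 1.0]
b = [2.0, 0.0, 2.0]
c = [0.0, 1.0, 0.0]
print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0.0: orthogonal
```

With the real output, cosine_similarity(emb[0], emb[1]) should work the same way. If normalize_embeddings=True is passed to encode, the vectors have unit length and the plain dot product gives the same value.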

Model size: 1B params (Safetensors, tensor type F32)