BiomedBERT Hash Nano Embeddings
This is a BiomedBERT Hash Nano model fine-tuned using sentence-transformers. It maps sentences & paragraphs to a 128-dimensional dense vector space and can be used for tasks like clustering or semantic search.
The training dataset was generated using a random sample of PubMed title-abstract pairs along with similar title pairs. The training workflow was a two-step distillation process, as follows.
- Distill embeddings from the larger pubmedbert-base-embeddings model using the model distillation script from Sentence Transformers.
- Build a distilled dataset of teacher scores using the biomedbert-base-reranker cross-encoder for a separate random sample of title-abstract pairs, then further fine-tune the model on that dataset using KLDivLoss (a sketch of this step follows below).
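The exact training script isn't reproduced here, but a minimal sketch of the KL-divergence distillation step, written in plain PyTorch with placeholder score tensors, looks roughly like this:

```python
import torch
import torch.nn.functional as F

# Placeholder scores: in practice these would be similarity scores from the
# student bi-encoder and the biomedbert-base-reranker teacher for the same
# batch of distilled title-abstract pairs.
student_scores = torch.randn(32, 8, requires_grad=True)  # 32 queries x 8 candidates
teacher_scores = torch.randn(32, 8)

# KL divergence between the student and teacher score distributions.
# The student scores are log-softmaxed and the teacher scores softmaxed,
# matching the inputs expected by torch.nn.functional.kl_div.
loss = F.kl_div(
    F.log_softmax(student_scores, dim=-1),
    F.softmax(teacher_scores, dim=-1),
    reduction="batchmean",
)

# In training, this loss is backpropagated through the student encoder
loss.backward()
```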
Usage (txtai)
This model can be used to build embeddings databases with txtai for semantic search and/or as a knowledge source for retrieval-augmented generation (RAG).
```python
import txtai

embeddings = txtai.Embeddings(
    path="neuml/biomedbert-hash-nano-embeddings",
    content=True,
    vectors={"trust_remote_code": True}
)
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```
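`documents()` above is a placeholder for your own data. A minimal sketch, assuming a small in-memory corpus, is any iterable of `(id, text, tags)` tuples (plain strings or dicts also work):

```python
def documents():
    # Hypothetical corpus; replace with your own data source
    data = [
        "Metformin is a first-line treatment for type 2 diabetes",
        "Randomized trial of aspirin for primary prevention of cardiovascular events"
    ]

    # Yield (id, text, tags) tuples
    for uid, text in enumerate(data):
        yield (uid, text, None)
```

With `content=True`, the original text is stored alongside the vectors, so `embeddings.search` returns the matching text and score rather than just the id.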
Usage (Sentence-Transformers)
Alternatively, the model can be loaded with sentence-transformers.
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer("neuml/biomedbert-hash-nano-embeddings", trust_remote_code=True)
embeddings = model.encode(sentences)
print(embeddings)
```
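As a quick follow-up (not part of the original usage above), cosine similarities between the encoded sentences can be computed with the `util` module:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("neuml/biomedbert-hash-nano-embeddings", trust_remote_code=True)
embeddings = model.encode(["This is an example sentence", "Each sentence is converted"])

# Pairwise cosine similarity matrix between the example sentences
print(util.cos_sim(embeddings, embeddings))
```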
Usage (Hugging Face Transformers)
The model can also be used directly with Transformers.
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Mean Pooling - Take attention mask into account for correct averaging
def meanpooling(output, mask):
    embeddings = output[0]  # First element of model_output contains all token embeddings
    mask = mask.unsqueeze(-1).expand(embeddings.size()).float()
    return torch.sum(embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("neuml/biomedbert-hash-nano-embeddings", trust_remote_code=True)
model = AutoModel.from_pretrained("neuml/biomedbert-hash-nano-embeddings", trust_remote_code=True)

# Tokenize sentences
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    output = model(**inputs)

# Perform pooling. In this case, mean pooling.
embeddings = meanpooling(output, inputs['attention_mask'])

print("Sentence embeddings:")
print(embeddings)
```
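Continuing from the snippet above, if cosine similarity is the downstream metric, the pooled embeddings can optionally be L2-normalized first (an illustrative addition, not a required step):

```python
import torch.nn.functional as F

# Optional: L2-normalize so dot products equal cosine similarities
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity between the two example sentences
print(embeddings[0] @ embeddings[1])
```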
Evaluation Results
The performance of this model is compared to previously released models trained on medical literature. The most commonly used general-purpose small embeddings model is also included for comparison.
The following datasets were used to evaluate model performance.
- PubMed QA
  - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
- PubMed Subset
  - Split: test, Pair: (title, text)
- PubMed Summary
  - Subset: pubmed, Split: validation, Pair: (article, abstract)
Evaluation results are shown below. The Pearson correlation coefficient is used as the evaluation metric.
| Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 90.40 | 95.92 | 94.07 | 93.46 |
| bioclinical-modernbert-base-embeddings | 92.49 | 97.10 | 97.04 | 95.54 |
| biomedbert-base-colbert | 94.59 | 97.18 | 96.21 | 95.99 |
| biomedbert-base-reranker | 97.66 | 99.76 | 98.81 | 98.74 |
| biomedbert-hash-nano-colbert | 90.45 | 96.81 | 92.00 | 93.09 |
| biomedbert-hash-nano-embeddings | 90.39 | 96.29 | 95.32 | 94.00 |
| pubmedbert-base-embeddings | 93.27 | 97.00 | 96.58 | 95.62 |
| pubmedbert-base-embeddings-8M | 90.05 | 94.29 | 94.15 | 92.83 |
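As a rough illustration of the metric (not necessarily the exact evaluation harness used to produce the numbers above), a Pearson score for one dataset can be computed by correlating cosine similarities of encoded pairs against reference relevance labels; the pairs and labels below are hypothetical:

```python
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

# Hypothetical (query, passage, label) evaluation triples
pairs = [
    ("Does aspirin reduce cardiovascular risk?", "Aspirin lowers the risk of major cardiovascular events", 1.0),
    ("Does aspirin reduce cardiovascular risk?", "Metformin is used to treat type 2 diabetes", 0.0),
    ("Is metformin first-line therapy for diabetes?", "Metformin is a first-line treatment for type 2 diabetes", 1.0),
    ("Is metformin first-line therapy for diabetes?", "Aspirin lowers the risk of major cardiovascular events", 0.0)
]

model = SentenceTransformer("neuml/biomedbert-hash-nano-embeddings", trust_remote_code=True)

queries = model.encode([x[0] for x in pairs])
passages = model.encode([x[1] for x in pairs])
labels = [x[2] for x in pairs]

# Cosine similarity of each (query, passage) pair vs. its reference label
scores = util.cos_sim(queries, passages).diagonal().tolist()
print(pearsonr(scores, labels))
```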
At only 970K parameters, this model packs quite a punch. It's competitive with larger models trained on medical literature, retaining 98% of the performance of pubmedbert-base-embeddings at 0.88% of its size. It also outperforms all-MiniLM-L6-v2, a commonly used small general-purpose model, while being 23x smaller. And it performs much better than the 8M-parameter static embeddings model, although it is slower, given that that model uses static (lookup-only) embeddings.
This is a great model to use for smaller datasets and on limited compute or edge devices. And since it only produces 128-dimensional vectors, stored vectors require less space.
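At float32 precision, for example, a 128-dimensional vector takes 128 × 4 = 512 bytes, versus 1,536 bytes for a 384-dimensional model such as all-MiniLM-L6-v2.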
Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertHashModel'})
  (1): Pooling({'word_embedding_dimension': 128, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```