This model was trained on cnmoro/LexicalTriplets to produce lexical embeddings (not semantic!).

It can be used to compute lexical (surface-form) similarity between words or phrases.

Concept (see the illustrative sketch below):

- "Some text" will be similar to "Sm txt"
- "King" will not be similar to "Queen" or "Royalty"
- "Dog" will not be similar to "Animal"
- "Doge" will be similar to "Dog"

```python
import re
import unicodedata

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "cnmoro/LexicalEmbed-Base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)  # custom architecture on the Hub
model.eval()

def preprocess(text):
    # Strip accents (NFD-decompose, then drop combining marks),
    # lowercase, replace punctuation with spaces, and collapse whitespace.
    text = unicodedata.normalize('NFD', text)
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    text = re.sub(r'[^\w\s]+', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()
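
# Illustrative example (not in the original card):
# preprocess("Héllo, Wörld!") -> "hello world"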

texts = ["hello world", "hel wor"]
texts = [ preprocess(s) for s in texts ]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**inputs)

cosine_sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine Similarity: {cosine_sim.item()}") # 0.8966174125671387
Model size: 16.6M parameters (F32, safetensors).