HuPER Recognizer (ARPAbet phone recognition)

A CTC phone recognizer fine-tuned from WavLM-Large that maps 16 kHz speech audio to an ARPAbet phone sequence. See the HuPER paper for details: arXiv:2602.01634.

Quickstart

pip install -U transformers torchaudio

import torch
import torchaudio
from transformers import Wav2Vec2Processor, WavLMForCTC

repo_id = "huper29/huper_recognizer"
processor = Wav2Vec2Processor.from_pretrained(repo_id)
model = WavLMForCTC.from_pretrained(repo_id)
model.eval()

# Load audio, downmix to mono, and resample to the model's expected 16 kHz rate
waveform, sr = torchaudio.load("sample.wav")
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
if sr != 16000:
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = torch.argmax(logits, dim=-1)[0].tolist()
blank_id = processor.tokenizer.pad_token_id

# Greedy CTC decoding: collapse repeated frames, then drop blanks and special tokens
phone_tokens = []
prev = None
for token_id in pred_ids:
    if token_id != blank_id and token_id != prev:
        token = model.config.id2label.get(token_id, processor.tokenizer.convert_ids_to_tokens(token_id))
        if token not in {"<PAD>", "<UNK>", "<BOS>", "<EOS>", "|"}:
            phone_tokens.append(token)
    prev = token_id

print(" ".join(phone_tokens))
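Beyond a flat phone string, the frame-level argmax can also yield rough phone timestamps, since WavLM's convolutional feature extractor emits one frame per ~20 ms of 16 kHz audio (stride 320 samples). The sketch below is illustrative, not part of the model's API: `pred_ids`, `blank_id`, and `id2label` stand in for the values computed in the quickstart, and the toy vocabulary at the bottom is made up for demonstration.

```python
# Sketch: convert greedy CTC frame predictions into rough phone timestamps,
# assuming one prediction per ~20 ms frame (WavLM's stride at 16 kHz).

FRAME_SEC = 0.02  # approximate frame stride of the WavLM feature extractor

def frames_to_segments(pred_ids, blank_id, id2label, frame_sec=FRAME_SEC):
    """Group runs of identical non-blank frames into (phone, start_s, end_s)."""
    segments = []
    prev_id = None
    start = 0
    # Append a blank sentinel so the final run is flushed
    for i, token_id in enumerate(list(pred_ids) + [blank_id]):
        if token_id != prev_id:
            if prev_id is not None and prev_id != blank_id:
                segments.append((id2label[prev_id], start * frame_sec, i * frame_sec))
            prev_id = token_id
            start = i
    return segments

# Toy example with a 3-phone vocabulary (0 = CTC blank)
toy_labels = {1: "K", 2: "AE", 3: "T"}
print(frames_to_segments([0, 1, 1, 0, 2, 2, 2, 3, 0], 0, toy_labels))
```

The timestamps are coarse (CTC alignments tend to spike near phone onsets rather than span the full phone), but they are often good enough for rough segmentation or visualization.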
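To score the recognizer against a reference transcription, the usual metric is phone error rate (PER): edit distance over phone tokens divided by the reference length. A minimal self-contained sketch (the `ref`/`hyp` sequences below are illustrative; in practice `hyp` would be the `phone_tokens` list produced above):

```python
# Sketch: phone error rate (PER) via token-level Levenshtein distance.

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag = d[0]          # distance(ref[:i-1], hyp[:j-1]) for the j loop
        d[0] = i
        for j, h in enumerate(hyp, 1):
            diag, d[j] = d[j], min(
                d[j] + 1,            # deletion
                d[j - 1] + 1,        # insertion
                diag + (r != h),     # substitution (free if tokens match)
            )
    return d[-1]

def phone_error_rate(ref, hyp):
    """PER = (substitutions + insertions + deletions) / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

ref = "K AE T".split()   # reference phones for "cat"
hyp = "K EH T".split()   # hypothesized phones with one substitution
print(phone_error_rate(ref, hyp))  # one edit over three reference phones
```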

Citation

@article{guo2026huper,
  title   = {HuPER: A Human-Inspired Framework for Phonetic Perception},
  author  = {Guo, Chenxu and Lian, Jiachen and Liu, Yisi and Huang, Baihe and Narayanan, Shriyaa and Cho, Cheol Jun and Anumanchipalli, Gopala},
  journal = {arXiv preprint arXiv:2602.01634},
  year    = {2026}
}