Automatic Speech Recognition · Transformers · Safetensors · Japanese · hubert · k2ssl

japanese-hubert-base-k2-rs35kh-bpe

This model is a HuBERT Base model fine-tuned on ReazonSpeech v2.0, a large-scale Japanese ASR corpus, using the k2 framework.

Usage

You can use this model with the transformers library:

import librosa
import numpy as np
import torch
from transformers import AutoProcessor, HubertForCTC

# Load the model in bfloat16 with FlashAttention-2 and move it to the GPU
model = HubertForCTC.from_pretrained(
    "reazon-research/japanese-hubert-base-k2-rs35kh-bpe",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
processor = AutoProcessor.from_pretrained("reazon-research/japanese-hubert-base-k2-rs35kh-bpe")

# Load the audio and resample it to 16 kHz
audio, _ = librosa.load(audio_filepath, sr=16_000)
# Padding the audio with 0.5 s of silence on each side before inference is recommended
audio = np.pad(audio, pad_width=int(0.5 * 16_000))
input_values = processor(
    audio,
    return_tensors="pt",
    sampling_rate=16_000,
).input_values.to("cuda").to(torch.bfloat16)

# Greedy CTC decoding
with torch.inference_mode():
    logits = model(input_values).logits.cpu()
predicted_ids = torch.argmax(logits, dim=-1)[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True).removeprefix("▁")
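
If you have several clips to transcribe, the same model and processor can be reused in a batched variant. The following is a minimal sketch, not part of the official example; it assumes all clips are 16 kHz numpy arrays and relies on the processor's padding and batch_decode.

def transcribe_batch(audio_list):
    # Pad each clip with 0.5 s of silence, then let the processor pad the batch
    padded = [np.pad(a, pad_width=int(0.5 * 16_000)) for a in audio_list]
    inputs = processor(
        padded,
        return_tensors="pt",
        sampling_rate=16_000,
        padding=True,
    )
    input_values = inputs.input_values.to("cuda").to(torch.bfloat16)
    with torch.inference_mode():
        logits = model(input_values).logits.cpu()
    predicted_ids = torch.argmax(logits, dim=-1)
    return [
        t.removeprefix("▁")
        for t in processor.batch_decode(predicted_ids, skip_special_tokens=True)
    ]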

Test Results

We report the character error rate (CER) of our models alongside related wav2vec2 models.

| Model | #Parameters⬇ | AVERAGE⬇ | JSUT-BASIC5000⬇ | Common Voice⬇ | TEDxJP-10K⬇ |
|---|---|---|---|---|---|
| reazon-research/japanese-wav2vec2-large-rs35kh | 319M | 16.25% | 11.00% | 18.23% | 19.53% |
| reazon-research/japanese-wav2vec2-base-rs35kh | 96.7M | 20.40% | 13.22% | 23.76% | 24.23% |
| reazon-research/japanese-hubert-base-k2-rs35kh-bpe | 98.4M | 11.07% | 9.76% | 11.36% | 12.10% |
| reazon-research/japanese-hubert-base-k2-rs35kh | 98.4M | 11.23% | 9.94% | 11.59% | 12.18% |
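
The model card does not state which scoring tool produced these numbers; as an illustration only, CER can be computed with the jiwer package. The reference and hypothesis strings below are made up.

import jiwer

reference = "今日はいい天気です"   # ground-truth transcript
hypothesis = "今日はい天気です"    # model output
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")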

We also report the CER for long-form speech.

| Model | #Parameters⬇ | JSUT-BOOK⬇ |
|---|---|---|
| reazon-research/japanese-wav2vec2-large-rs35kh | 319M | 30.98% |
| reazon-research/japanese-wav2vec2-base-rs35kh | 96.7M | 82.84% |
| reazon-research/japanese-hubert-base-k2-rs35kh-bpe | 98.4M | 84.55% |
| + Silero VAD | | 19.34% |
| reazon-research/japanese-hubert-base-k2-rs35kh | 98.4M | 27.05% |
| + Silero VAD | | 19.59% |
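
The "+ Silero VAD" rows segment the audio with a voice activity detector before transcription. A rough sketch of such a pipeline, reusing model, processor, and the imports from the Usage section, might look like the following; the exact VAD settings behind the reported numbers are not given here, and long_audio_filepath is a placeholder.

# Split long-form audio into speech segments with Silero VAD, then transcribe each
vad_model, vad_utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = vad_utils[0]

audio, _ = librosa.load(long_audio_filepath, sr=16_000)
segments = get_speech_timestamps(torch.from_numpy(audio), vad_model, sampling_rate=16_000)

texts = []
for seg in segments:
    chunk = np.pad(audio[seg["start"]:seg["end"]], pad_width=int(0.5 * 16_000))
    input_values = processor(
        chunk, return_tensors="pt", sampling_rate=16_000
    ).input_values.to("cuda").to(torch.bfloat16)
    with torch.inference_mode():
        logits = model(input_values).logits.cpu()
    predicted_ids = torch.argmax(logits, dim=-1)[0]
    texts.append(processor.decode(predicted_ids, skip_special_tokens=True).removeprefix("▁"))
transcription = "".join(texts)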

Citation

@misc{japanese-hubert-base-k2-rs35kh-bpe,
  title={japanese-hubert-base-k2-rs35kh-bpe},
  author={Sasaki, Yuta},
  url={https://huggingface.co/reazon-research/japanese-hubert-base-k2-rs35kh-bpe},
  year={2025}
}

@article{yang2024k2ssl,
  title={k2SSL: A faster and better framework for self-supervised speech representation learning},
  author={Yang, Yifan and Zhuo, Jianheng and Jin, Zengrui and Ma, Ziyang and Yang, Xiaoyu and Yao, Zengwei and Guo, Liyong and Kang, Wei and Kuang, Fangjun and Lin, Long and others},
  journal={arXiv preprint arXiv:2411.17100},
  year={2024}
}

License

Apache License 2.0
