CosyVoice3-2512 ONNX Models (flow & hift)
This repository provides ONNX-format models for selected modules of CosyVoice3, including:
- flow_fp32.onnx (full precision, flow module)
- flow_fp16.onnx (half precision, flow module)
- hift.onnx (full precision, hift module)
- flow_hift_fp32.onnx (combined flow_fp32 and hift model)
- flow_hift_fp16.onnx (combined flow_fp16 and hift model)
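The five files differ only by module and precision, so a small lookup keeps the choice explicit. The following is a minimal sketch. It assumes the files sit at the root of the Lourdle/Fun-CosyVoice3-0.5B-2512_ONNX Hugging Face repository and that huggingface_hub is installed. The names onnx_filename, fetch_model, FILES and REPO_ID are hypothetical helpers, not part of this repository.

```python
# Hypothetical helpers; the file names come from the list above.
FILES = {
    ("flow", "fp32"): "flow_fp32.onnx",
    ("flow", "fp16"): "flow_fp16.onnx",
    ("hift", "fp32"): "hift.onnx",  # hift is only provided in full precision
    ("flow_hift", "fp32"): "flow_hift_fp32.onnx",
    ("flow_hift", "fp16"): "flow_hift_fp16.onnx",
}

# Assumption: the ONNX files are stored at the root of this repository.
REPO_ID = "Lourdle/Fun-CosyVoice3-0.5B-2512_ONNX"


def onnx_filename(module: str, precision: str) -> str:
    """Return the file name for a module ("flow", "hift", "flow_hift") and precision."""
    try:
        return FILES[(module, precision)]
    except KeyError:
        raise ValueError(f"no {precision} export for module {module!r}") from None


def fetch_model(module: str, precision: str) -> str:
    """Download the file (or reuse the local cache) and return its local path."""
    from huggingface_hub import hf_hub_download  # optional dependency, imported lazily
    return hf_hub_download(repo_id=REPO_ID, filename=onnx_filename(module, precision))
```

Asking for a half-precision hift model raises ValueError instead of silently falling back to hift.onnx.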
For usage instructions, please refer to the GitHub repository.
Other modules of CosyVoice3 can be obtained from the official CosyVoice3 repository.
I have open-sourced ONNX versions of CosyVoice2 and CosyVoice3, including the modified modules and the conversion scripts needed for ONNX export. To learn how the conversion is performed, see CosyVoiceForOnnx.
Model Inputs and Outputs
flow_fp32.onnx / flow_fp16.onnx
- Inputs:
  - token (int64)
  - prompt_token (int32)
  - prompt_feat (float32 / float16)
  - embedding (float32 / float16)
  - For flow_fp32.onnx, prompt_feat and embedding must be float32; for flow_fp16.onnx, they must be float16.
- Output:
  - tts_mel (float32)
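The flow models mix integer and floating-point inputs, so it helps to cast every input explicitly before calling ONNX Runtime. The following is a minimal sketch that assumes the following:

- The ONNX graph's input and output names are exactly the names listed above.
- numpy and onnxruntime are installed.

The names make_flow_feed and run_flow are hypothetical helpers. The shapes in the test data are placeholders, not documented values.

```python
import numpy as np


def make_flow_feed(token, prompt_token, prompt_feat, embedding, fp16: bool) -> dict:
    """Build inputs for flow_fp32.onnx (fp16=False) or flow_fp16.onnx (fp16=True).

    Assumption: the graph input names match the documented names.
    """
    float_dtype = np.float16 if fp16 else np.float32
    return {
        "token": np.asarray(token, dtype=np.int64),  # int64 in the standalone flow model
        "prompt_token": np.asarray(prompt_token, dtype=np.int32),
        "prompt_feat": np.asarray(prompt_feat, dtype=float_dtype),
        "embedding": np.asarray(embedding, dtype=float_dtype),
    }


def run_flow(model_path: str, feed: dict) -> np.ndarray:
    """Run a flow model and return tts_mel (documented as float32 for both variants)."""
    import onnxruntime as ort  # imported lazily so make_flow_feed works without it
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (tts_mel,) = session.run(["tts_mel"], feed)  # assumption: output is named tts_mel
    return tts_mel
```

A typical call is make_flow_feed(tokens, prompt_tokens, prompt_mel, spk_emb, fp16=True) followed by run_flow("flow_fp16.onnx", feed). The token, feature and embedding arrays would come from the other CosyVoice3 modules.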
hift.onnx
- Input:
  - speech_feat (float32)
- Output:
  - generated_speech (float32)
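hift.onnx turns acoustic features into a waveform, so the two standalone models can be chained by passing the flow output to hift. The following is a minimal sketch. It assumes tts_mel can be fed to hift.onnx as speech_feat without reshaping; check the declared shapes with session.get_inputs() before relying on this. The name vocode is a hypothetical helper. It accepts any object that exposes ONNX Runtime's run(output_names, input_feed) method.

```python
import numpy as np


def vocode(hift_session, mel: np.ndarray) -> np.ndarray:
    """Convert flow output (tts_mel) to audio samples with a hift session.

    hift_session: an onnxruntime.InferenceSession for hift.onnx, or any object
    with the same run(output_names, input_feed) signature.
    Assumption: tts_mel and speech_feat share the same layout.
    """
    # hift.onnx is full precision only, so always hand it float32.
    speech_feat = np.ascontiguousarray(mel, dtype=np.float32)
    (generated_speech,) = hift_session.run(["generated_speech"], {"speech_feat": speech_feat})
    return generated_speech
```

With ONNX Runtime, this becomes audio = vocode(ort.InferenceSession("hift.onnx"), mel). Alternatively, flow_hift_fp32.onnx and flow_hift_fp16.onnx perform both steps in a single call.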
flow_hift_fp32.onnx / flow_hift_fp16.onnx
- Inputs:
  - token (int32)
  - prompt_token (int32)
  - prompt_feat (float32 / float16)
  - embedding (float32 / float16)
  - speed (float32, scalar, controls speech rate)
  - For flow_hift_fp32.onnx, prompt_feat and embedding must be float32; for flow_hift_fp16.onnx, they must be float16.
- Output:
  - generated_speech (float32)
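In the combined models, token is int32 and speed is a scalar that stays float32 in both variants. The following is a minimal sketch that assumes the following:

- The graph input names are the documented ones.
- embedding uses the same precision as prompt_feat. If session.get_inputs() reports otherwise, adjust the cast.
- speed=1.0 means normal speed.

The names make_flow_hift_feed and synthesize are hypothetical helpers.

```python
import numpy as np


def make_flow_hift_feed(token, prompt_token, prompt_feat, embedding,
                        speed: float = 1.0, fp16: bool = False) -> dict:
    """Build inputs for flow_hift_fp32.onnx (fp16=False) or flow_hift_fp16.onnx (fp16=True).

    Assumptions: graph input names match the documented names, and embedding
    follows the model precision like prompt_feat.
    """
    float_dtype = np.float16 if fp16 else np.float32
    return {
        "token": np.asarray(token, dtype=np.int32),  # int32 here, unlike the standalone flow model
        "prompt_token": np.asarray(prompt_token, dtype=np.int32),
        "prompt_feat": np.asarray(prompt_feat, dtype=float_dtype),
        "embedding": np.asarray(embedding, dtype=float_dtype),
        "speed": np.array(speed, dtype=np.float32),  # 0-d array, i.e. a scalar
    }


def synthesize(model_path: str, feed: dict) -> np.ndarray:
    """Run a combined model and return generated_speech (float32)."""
    import onnxruntime as ort  # imported lazily so the feed helper works without it
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (generated_speech,) = session.run(["generated_speech"], feed)
    return generated_speech
```

A scalar input is passed as a 0-d array rather than a Python float, because ONNX Runtime expects array inputs.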
Notes
- All outputs are float32.
- Input precision must strictly match the model requirements.
- In the combined models, the token input is int32 (not int64 as in the standalone flow models). The speed input is a float32 scalar that controls the speech rate.
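ONNX Runtime generally rejects an input whose dtype differs from the graph declaration rather than casting it. A pre-flight check against the session metadata can therefore give a clearer error message. The following is a minimal sketch based on InferenceSession.get_inputs(), whose items expose .name and .type (for example "tensor(float16)"). The names check_feed and ORT_TO_NUMPY are hypothetical.

```python
import numpy as np

# ONNX Runtime reports input types as strings such as "tensor(float)".
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
    "tensor(int32)": np.int32,
}


def check_feed(session, feed: dict) -> None:
    """Raise if feed is missing an input or uses a dtype the model does not declare.

    session: an onnxruntime.InferenceSession, or any object whose get_inputs()
    returns items with .name and .type attributes.
    """
    for node in session.get_inputs():
        if node.name not in feed:
            raise KeyError(f"missing input {node.name!r}")
        expected = ORT_TO_NUMPY.get(node.type)  # None for types not listed above
        actual = np.asarray(feed[node.name]).dtype
        if expected is not None and actual != expected:
            raise TypeError(f"{node.name}: model expects {node.type}, got {actual}")
```

Calling check_feed(session, feed) just before session.run(...) catches mistakes such as passing an int64 token to a combined model.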
Acknowledgments
The original models are from the official CosyVoice3. This repository only provides ONNX format conversion and adaptation.
Model tree for Lourdle/Fun-CosyVoice3-0.5B-2512_ONNX
- Base model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512