kNN-TTS: kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
kNN-TTS Technical Details
kNN-TTS uses self-supervised learning (SSL) features and kNN retrieval to achieve robust zero-shot multi-speaker TTS.
Key Components
- Feature Extraction: We extract discrete representations from the target speaker's speech using a pre-trained SSL encoder; specifically, we use the 6th layer of WavLM Large.
- Text-to-SSL: We train a lightweight TTS model to predict the same representations from text. For simplicity, we train it on a single-speaker dataset.
- Retrieval Mechanism: We use kNN to find, for each unit in the generated features, its closest matches in the target voice's unit database (a sketch follows this list).
- Voice Morphing: By linearly interpolating between the source features and the retrieved target speaker features, we can morph the two voices. The interpolation parameter λ controls the balance between source and target characteristics.
- Vocoder: We use a pre-trained vocoder to synthesize a waveform from the converted features.
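To make the retrieval and morphing steps concrete, below is a minimal sketch that assumes frame-level SSL features (e.g., WavLM Large layer-6 outputs, 1024-dimensional) are already available as NumPy arrays. The function names (`knn_convert`, `morph`), the cosine-similarity metric, and the optional similarity-weighted averaging of neighbors are illustrative assumptions, not the actual kNN-TTS implementation.

```python
import numpy as np

def knn_convert(source_feats, target_db, k=4, weighted=True):
    """Replace each source frame with the (optionally similarity-weighted)
    average of its k nearest target-speaker frames under cosine similarity.

    source_feats: (num_source_frames, dim) features predicted by the Text-to-SSL model.
    target_db:    (num_target_frames, dim) features extracted from the target speaker.
    """
    # Unit-normalize so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_db / np.linalg.norm(target_db, axis=1, keepdims=True)
    sims = src @ tgt.T                        # (num_source_frames, num_target_frames)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches per frame

    converted = np.empty_like(source_feats)
    for i, idx in enumerate(topk):
        neighbors = target_db[idx]                    # (k, dim)
        if weighted:
            w = np.maximum(sims[i, idx], 1e-8)        # weight neighbors by similarity
            converted[i] = (w / w.sum()) @ neighbors
        else:
            converted[i] = neighbors.mean(axis=0)     # plain average of the k matches
    return converted

def morph(source_feats, converted_feats, lam):
    """Voice morphing: lam=0 keeps the source voice, lam=1 fully adopts the target voice."""
    return (1.0 - lam) * source_feats + lam * converted_feats

# Toy usage with random stand-ins for real features.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 1024)).astype(np.float32)   # output of the Text-to-SSL model
target = rng.normal(size=(5000, 1024)).astype(np.float32)  # target speaker's unit database
morphed = morph(source, knn_convert(source, target, k=4), lam=0.8)
# `morphed` would then be passed to the pre-trained vocoder to produce the waveform.
```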
Performance
Our simple and efficient model achieves results comparable to state-of-the-art models while being trained on 100 to 1,000× less transcribed data. This framework is therefore particularly well-suited for low-resource domains.
For more details, please refer to our paper (https://arxiv.org/abs/2408.10771).
About the Project
This demo showcases kNN-TTS, a lightweight zero-shot text-to-speech synthesis model.
Authors
- Karl El Hajal
- Ajinkya Kulkarni
- Enno Hermann
- Mathew Magimai.-Doss
Citation
If you use kNN-TTS in your research, please cite our paper:
@misc{hajal2025knntts,
      title={kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech},
      author={Karl El Hajal and Ajinkya Kulkarni and Enno Hermann and Mathew Magimai.-Doss},
      year={2025},
      eprint={2408.10771},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2408.10771},
}
Acknowledgments
The target voices featured in this demo were sourced from the following datasets:
License
This project is licensed under the MIT License.