kNN-TTS: kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
kNN-TTS Technical Details
kNN-TTS uses self-supervised learning (SSL) features and kNN retrieval to achieve robust zero-shot multi-speaker TTS.
Key Components
- Feature Extraction: We extract discrete representations from target speaker speech using a pre-trained SSL encoder. We use the 6th layer of WavLM Large.
- Text-to-SSL: We train a lightweight TTS model to predict the same representations from text. For simplicity, we train on a single-speaker dataset.
- Retrieval Mechanism: For each unit in the generated features, we use kNN to find its closest matches in the target voice unit database.
- Voice Morphing: By linearly interpolating the source and selected target speaker features, we can morph the two voices. The interpolation parameter λ controls the balance between source and target characteristics.
- Vocoder: We use a pre-trained vocoder to convert the resulting features into a waveform.
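The retrieval and morphing steps can be illustrated with a small, dependency-free sketch. It is not the project's implementation: the function names, the use of cosine similarity, the optional similarity weighting, and the convention that λ = 0 keeps the source features while λ = 1 uses only the retrieved target features are all assumptions made for illustration. Real SSL features would be high-dimensional tensors rather than short Python lists.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_convert(source_frames, target_bank, k=4, lam=1.0, weighted=False):
    """Hypothetical helper: replace each source frame with a blend of its
    k nearest target frames, then interpolate with the source frame.

    source_frames: frames predicted by the text-to-SSL model (list of vectors)
    target_bank:   frames extracted from the target speaker (list of vectors)
    lam:           assumed convention, 0 keeps the source, 1 is fully target
    weighted:      if True, average neighbors weighted by their similarity
    """
    converted = []
    for frame in source_frames:
        sims = [cosine_similarity(frame, t) for t in target_bank]
        # Indices of the k most similar target frames.
        top = sorted(range(len(target_bank)), key=lambda i: sims[i], reverse=True)[:k]
        weights = [max(sims[i], 0.0) for i in top] if weighted else [1.0] * len(top)
        total = sum(weights)
        if total == 0.0:
            # Fall back to a plain average if every weight is zero.
            weights, total = [1.0] * len(top), float(len(top))
        # (Weighted) average of the selected target frames, per dimension.
        selected = [
            sum(w * target_bank[i][d] for w, i in zip(weights, top)) / total
            for d in range(len(frame))
        ]
        # Linear interpolation between source and selected target features.
        converted.append([(1.0 - lam) * s + lam * t for s, t in zip(frame, selected)])
    return converted


if __name__ == "__main__":
    source = [[1.0, 0.0]]
    bank = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
    print(knn_convert(source, bank, k=1, lam=0.5))  # [[1.5, 0.0]]
```

In this toy example the nearest target frame to `[1.0, 0.0]` is `[2.0, 0.0]`, so λ = 0.5 lands halfway between the two. The converted frames would then be passed to the vocoder.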
Performance
Our simple and efficient model achieves results comparable to state-of-the-art models while being trained on 100 to 1000× less transcribed data. This framework is therefore particularly well-suited for low-resource domains.
For more details, please refer to our paper (https://arxiv.org/abs/2408.10771).
About the Project
This demo showcases kNN-TTS, a lightweight zero-shot text-to-speech synthesis model.
Authors
- Karl El Hajal
- Ajinkya Kulkarni
- Enno Hermann
- Mathew Magimai.-Doss
Citation
If you use kNN-TTS in your research, please cite our paper:
@inproceedings{hajal-etal-2025-knn,
title = "k{NN} Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech",
author = "Hajal, Karl El and
Kulkarni, Ajinkya and
Hermann, Enno and
Magimai Doss, Mathew",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-short.65/",
pages = "778--786",
ISBN = "979-8-89176-190-2"
}
Acknowledgments
The target voices featured in this demo were sourced from the following datasets:
License
This project is licensed under the MIT License.