kNN-TTS: kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
Demo Interface
Enter input text, select a target voice, and adjust the generation settings: voice morphing (λ), top-k retrieval, and whether to weight neighbors by similarity distance. The demo then outputs the generated audio. Preset examples cover different combinations of text, target voice, voice morphing (λ), top-k retrieval, and weighted averaging.
kNN-TTS Technical Details
kNN-TTS uses self-supervised learning (SSL) features and kNN retrieval to achieve robust zero-shot multi-speaker TTS.
Key Components
- Feature Extraction: We extract discrete representations from target speaker speech using a pre-trained SSL encoder (the 6th layer of WavLM Large).
- Text-to-SSL: We train a lightweight TTS model to predict the same representations from text. For simplicity, we train on a single-speaker dataset.
- Retrieval Mechanism: For each unit in the generated features, we use kNN to find its closest matches in the target voice unit database.
- Voice Morphing: Linearly interpolating the source and selected target speaker features morphs the two voices; the interpolation parameter λ controls the balance between source and target characteristics.
- Vocoder: A pre-trained vocoder converts the resulting features back to a waveform.
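The retrieval and morphing steps above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the SSL features are NumPy arrays of shape (frames, dims), uses cosine distance, and the function name and softmax-based distance weighting are illustrative choices.

```python
import numpy as np

def knn_convert(source_feats, target_db, k=4, lam=1.0, weighted=True):
    """Replace each source frame with a (weighted) average of its k nearest
    target-speaker frames, then interpolate with the source via lambda."""
    # Cosine distances between every source frame and every target frame.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_db / np.linalg.norm(target_db, axis=1, keepdims=True)
    dists = 1.0 - src @ tgt.T                       # (T_src, T_tgt)

    # Indices of the k closest target frames for each source frame.
    idx = np.argsort(dists, axis=1)[:, :k]          # (T_src, k)
    neighbors = target_db[idx]                      # (T_src, k, D)

    if weighted:
        # Weight neighbors by similarity: softmax over negative distances.
        d = np.take_along_axis(dists, idx, axis=1)  # (T_src, k)
        w = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)
        matched = (w[:, :, None] * neighbors).sum(axis=1)
    else:
        matched = neighbors.mean(axis=1)

    # Voice morphing: lam = 0 keeps the source voice, lam = 1 is full conversion.
    return (1.0 - lam) * source_feats + lam * matched
```

With λ = 0 the source features pass through unchanged, while λ = 1 fully replaces each frame with its retrieved target-speaker match; intermediate values blend the two voices.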
 
Performance
Our simple and efficient model achieves results comparable to state-of-the-art models while being trained on 100× to 1000× less transcribed data, making the framework particularly well-suited for low-resource domains.
For more details, please refer to our paper (https://arxiv.org/abs/2408.10771).
About the Project
This demo showcases kNN-TTS, a lightweight zero-shot text-to-speech synthesis model.
Authors
- Karl El Hajal
- Ajinkya Kulkarni
- Enno Hermann
- Mathew Magimai.-Doss
 
Citation
If you use kNN-TTS in your research, please cite our paper:
```bibtex
@inproceedings{hajal-etal-2025-knn,
    title = "k{NN} Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech",
    author = "Hajal, Karl El  and
      Kulkarni, Ajinkya  and
      Hermann, Enno  and
      Magimai Doss, Mathew",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-short.65/",
    pages = "778--786",
    ISBN = "979-8-89176-190-2"
}
```
Acknowledgments
The target voices featured in this demo were sourced from the following datasets:
License
This project is licensed under the MIT License.