kNN-TTS: kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech
kNN-TTS Technical Details
kNN-TTS uses self-supervised learning (SSL) features and kNN retrieval to achieve robust zero-shot multi-speaker TTS.
Key Components
- Feature Extraction: We extract discrete representations from the target speaker's speech using a pre-trained SSL encoder; specifically, we use the 6th layer of WavLM Large.
- Text-to-SSL: We train a lightweight TTS model to predict the same representations from text. For simplicity, we train it on a single-speaker dataset.
- Retrieval Mechanism: We use kNN to find, for each unit in the generated features, its closest matches in the target voice's unit database (a sketch follows this list).
- Voice Morphing: By linearly interpolating between the source features and the retrieved target speaker features, we can morph the two voices. The interpolation parameter λ controls the balance between source and target characteristics.
- Vocoder: We use a pre-trained vocoder to synthesize a waveform from the converted features.
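To make the retrieval and morphing steps concrete, below is a minimal sketch that assumes frame-level SSL features (e.g., WavLM Large layer-6 outputs, 1024-dimensional) are already available as NumPy arrays. The function names (`knn_convert`, `morph`), the cosine-similarity metric, and the optional similarity-weighted averaging of neighbors are illustrative assumptions, not the actual kNN-TTS implementation.

```python
import numpy as np

def knn_convert(source_feats, target_db, k=4, weighted=True):
    """Replace each source frame with the (optionally similarity-weighted)
    average of its k nearest target-speaker frames under cosine similarity.

    source_feats: (num_source_frames, dim) features predicted by the Text-to-SSL model.
    target_db:    (num_target_frames, dim) features extracted from the target speaker.
    """
    # Unit-normalize so that a dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    tgt = target_db / np.linalg.norm(target_db, axis=1, keepdims=True)
    sims = src @ tgt.T                        # (num_source_frames, num_target_frames)
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches per frame

    converted = np.empty_like(source_feats)
    for i, idx in enumerate(topk):
        neighbors = target_db[idx]                    # (k, dim)
        if weighted:
            w = np.maximum(sims[i, idx], 1e-8)        # weight neighbors by similarity
            converted[i] = (w / w.sum()) @ neighbors
        else:
            converted[i] = neighbors.mean(axis=0)     # plain average of the k matches
    return converted

def morph(source_feats, converted_feats, lam):
    """Voice morphing: lam=0 keeps the source voice, lam=1 fully adopts the target voice."""
    return (1.0 - lam) * source_feats + lam * converted_feats

# Toy usage with random stand-ins for real features.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 1024)).astype(np.float32)   # output of the Text-to-SSL model
target = rng.normal(size=(5000, 1024)).astype(np.float32)  # target speaker's unit database
morphed = morph(source, knn_convert(source, target, k=4), lam=0.8)
# `morphed` would then be passed to the pre-trained vocoder to produce the waveform.
```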
Performance
Our simple and efficient model achieves results comparable to state-of-the-art models while being trained on 100 to 1,000× less transcribed data. This framework is therefore particularly well-suited for low-resource domains.
For more details, please refer to our paper (https://arxiv.org/abs/2408.10771).
About the Project
This demo showcases kNN-TTS, a lightweight zero-shot text-to-speech synthesis model.
Authors
- Karl El Hajal
- Ajinkya Kulkarni
- Enno Hermann
- Mathew Magimai.-Doss
Citation
If you use kNN-TTS in your research, please cite our paper:
@misc{hajal2025knntts,
      title={kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech},
      author={Karl El Hajal and Ajinkya Kulkarni and Enno Hermann and Mathew Magimai.-Doss},
      year={2025},
      eprint={2408.10771},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2408.10771},
}
Acknowledgments
The target voices featured in this demo were sourced from the following datasets:
License
This project is licensed under the MIT License.