Evaluation
Overview
Participants will apply their system’s inference to the open test set during the development phase of the challenge, and to the withheld (blind) test set during the test phase, and will submit the processed stimuli for evaluation. To evaluate the performance of the neural speech codec submissions, objective metrics will be used in the development phase (provided as a guide only), while crowdsourced listening tests will be used in the test phase (and used for entry ranking). Objective metrics will not be reported on the blind test set. Please refer to the Official Rules: https://lrac.short.gy/rules for further details.
The crowdsourced listening test battery focuses on assessing transparency, intelligibility, and noise/reverb robustness, and includes:
- Quality of clean speech (MUSHRA-style evaluation, one system at a time)
- Track 1: degradation in real-world conditions (DMOS)
- Track 2: quality of enhanced speech (MOS)
- Speech intelligibility (DRT)
Test Materials
Figure 1: Test Materials
Test materials will include curated test sets focused on quality assessment under clean, noisy, and reverberant conditions. Following the conclusion of the challenge, these test sets will be released publicly for the benefit of the research community. For the crowdsourced assessment of intelligibility, the previously reported approach [SIT.2024] and the publicly released stimuli will be used.
Objective Evaluation
We evaluated multiple neural audio codecs on 100 clean audio files from a test set. For each codec, we conducted a MUSHRA listening test in which a single codec (along with a clean reference and an Opus 6 kbps anchor) was evaluated at a time. The results of the tests for all codecs were then combined, and Pearson correlations were calculated between the subjective scores and the objective metrics. The metrics with the highest correlations are shown in the table below. We use the Versa toolkit to compute the metrics, and the metric names in the table are taken from the Versa documentation.
| Metric | Pearson Correlation |
|---|---|
| **scoreq_ref** | 0.867 |
| nomad | 0.826 |
| **utmos** | 0.823 |
| scoreq_nr | 0.809 |
| **sheet_ssqa** | 0.797 |
| **audiobox_aesthetics_CE** | 0.796 |
| audiobox_aesthetics_PQ | 0.795 |
| audiobox_aesthetics_CU | 0.777 |
Please note that only the metrics in bold will be used for the objective evaluation during the development phase. Nomad and scoreq_nr are not included, as they are highly correlated with scoreq_ref; we exclude PQ and CU from the audiobox_aesthetics metrics for the same reason. In addition to the metrics listed above, PESQ is also included due to its relevance in assessing speech quality. No aggregate scores are computed for the objective evaluation, and by default the leaderboard entries are ranked in alphabetical order.
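As a rough illustration of how such correlations can be computed, the sketch below pairs per-file MUSHRA means with a single objective metric and reports the Pearson coefficient. The CSV file names and column names are assumptions for illustration only; they are not part of the challenge tooling.

```python
# Hedged sketch: correlate per-file subjective MUSHRA means with one objective metric.
# The CSV layout ("file", "mushra_mean", "scoreq_ref") is hypothetical.
import csv
from scipy.stats import pearsonr

def load_scores(path, key_col, val_col):
    """Read a CSV into a dict mapping file name -> float score."""
    with open(path, newline="") as f:
        return {row[key_col]: float(row[val_col]) for row in csv.DictReader(f)}

subjective = load_scores("mushra_results.csv", "file", "mushra_mean")  # hypothetical file
objective = load_scores("metric_scores.csv", "file", "scoreq_ref")     # hypothetical file

# Align the two score sets on common files before correlating.
common = sorted(set(subjective) & set(objective))
r, p = pearsonr([subjective[f] for f in common], [objective[f] for f in common])
print(f"Pearson r = {r:.3f} (p = {p:.3g}, n = {len(common)})")
```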
During the development phase, the [open test set](https://github.com/cisco/multilingual-speech-testing/tree/main/LRAC-2025-test-data/open-test-set) will be used in combination with objective metrics to enable participants to benchmark their work in progress against other teams and analyze the strengths and weaknesses of their approach. The open test sets for the two challenge tracks consist of the following materials:
| Track | Data Type | Number of Files | Description |
|---|---|---|---|
| 1 | Clean | 600 | Clean speech files from various sources |
| 1 | Noisy | 200 | Speech in mild noise |
| 1 | Reverb | 200 | Speech in mild ambient noise and mild-to-moderate reverberation |
| 2 | Clean | 300 | Clean speech files from various sources |
| 2 | Noisy | 500 | Speech in mild, moderate, and strong noise |
| 2 | Reverb | 200 | Reverberated speech; augmented using real-world RIRs |
Figure 2: SNR distributions across open test subsets containing noise.
We aimed to tailor the open test sets to be approximately representative of the core test cases participants will face in the blind test set evaluation. However, we would like to emphasize the following limitations (this list is not exhaustive):
- While the objective metrics may provide useful insights, they may under- or overestimate the quality of generative systems in particular.
- The metrics may not handle the presence of noise/reverb appropriately for Track 1. We provide results for the unprocessed files on the leaderboard for comparison, but caution participants against regarding these scores as conclusive.
- No test cases are provided for intelligibility tests and simultaneous talkers. Participants may test simultaneous talker cases on their own files or simulate this case from the clean speech files. However, challenge participants must refrain from using the public word lists of the Diagnostic Rhyme Test as well as any previously published audio materials associated with this test to test and/or optimize their model.
- The meta files in the open test set subfolders contain SNR values for the data where noise and/or reverb augmentations were performed. Note that the SNR level corresponds to the crude overall ratio obtained from the RMS dB levels of the speech and noise segments, respectively (see the sketch after this list). As the RIRs used were real-world RIRs, internal estimates for DRR and RT60 were used to find the most appropriate test cases, but these are not provided for further analysis.
- The file durations, gender balance, SNR and reverb levels, as well as the noise types and their distributions, were chosen to be as representative as possible of the blind test set. However, as real-world data will be used for the blind set, differences due to estimation inaccuracies as well as source file availability are inevitable and should be anticipated.
- The numbers of files per subset were determined based on the approximate weighting in the final evaluation as well as the variety encountered in the respective scenario.
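The following minimal sketch illustrates the kind of crude RMS-based SNR computation referred to above, assuming the speech and noise signals are available separately before mixing. The function names and synthetic signals are illustrative only and do not reproduce the exact procedure used to create the meta files.

```python
# Hedged sketch: a crude SNR estimate from overall RMS dB levels of speech and noise,
# assuming the clean speech and the noise signals are available separately.
import numpy as np

def rms_db(x, eps=1e-12):
    """Overall RMS level of a signal in dB."""
    return 20.0 * np.log10(np.sqrt(np.mean(np.square(x))) + eps)

def crude_snr_db(speech, noise):
    """SNR as the difference of the two RMS dB levels (no perceptual weighting)."""
    return rms_db(speech) - rms_db(noise)

# Example with synthetic signals (stand-ins for real speech and noise segments).
rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16000)
noise = 0.02 * rng.standard_normal(16000)
print(f"Crude SNR: {crude_snr_db(speech, noise):.1f} dB")
```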
Subjective Evaluation
Figure 3: Neural Speech Codec Test Battery
The subjective tests for both Challenge tracks are listed above, along with the corresponding test methods and weighting factors used to compute the overall submission rankings.
In Track 1, equal importance is given to the 1 kbps and 6 kbps conditions, as the input utterances are primarily clean or contain only mild reverberation and noise. An exception to this is the Diagnostic Rhyme Test (DRT), which assesses speech intelligibility. For Track 1, the DRT is conducted only at 1 kbps, based on the expectation that most systems will be fully intelligible at 6 kbps under clean conditions.
In Track 2, higher weights are assigned to the 6 kbps scores. This reflects the greater challenge of achieving good quality and intelligibility under high reverberation and low signal-to-noise ratio (SNR) conditions—particularly at very low bitrates. For the DRT in Track 2, testing is again limited to the clean condition at 1 kbps, as intelligibility is expected to plateau at 6 kbps. Given the anticipated poor intelligibility of 1 kbps systems in noisy conditions, the DRT will be assessed only at 6 kbps for noisy conditions.
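As a rough illustration of how weighted aggregation of per-test scores could work, the sketch below combines normalized test scores using per-condition weights. The test names, weights, and scores here are placeholders, not the actual factors from Figure 3 or any submission's results.

```python
# Hedged sketch: combine per-test subjective scores into one overall score using
# hypothetical weights. The actual weighting factors are those shown in Figure 3.
from typing import Dict

def overall_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of normalized test scores (assumed to lie in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(weights[t] * scores[t] for t in weights) / total_weight

# Placeholder weights and scores for a Track 1 style submission (illustrative only).
weights = {"mushra_clean_1k": 1.0, "mushra_clean_6k": 1.0,
           "dmos_noisy_1k": 1.0, "dmos_noisy_6k": 1.0, "drt_clean_1k": 1.0}
scores = {"mushra_clean_1k": 0.62, "mushra_clean_6k": 0.81,
          "dmos_noisy_1k": 0.58, "dmos_noisy_6k": 0.74, "drt_clean_1k": 0.90}
print(f"Overall: {overall_score(scores, weights):.3f}")
```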
Finally, the evaluation prioritizes performance in single-speaker scenarios over multi-speaker (overlapping) scenarios, as single-speaker speech is more prevalent in practical use cases. Moreover, reliably assessing quality in overlapping speech remains a complex and less mature area of evaluation. All evaluations will be performed exclusively on speech in the English language for this year’s challenge edition.
References
- [SIT.2024]
- L. Lechler and K. Wojcicki, “Crowdsourced Multilingual Speech Intelligibility Testing,” In Proc. ICASSP 2024, Seoul, South Korea, 2024, pp. 1441-1445.
- Manuscript: https://arxiv.org/pdf/2403.14817
- Public release of data and test software: https://github.com/cisco/multilingual-speech-testing/tree/main/speech-intelligibility-DRT
- [MUSHRA.2025]
- L. Lechler, C. Moradi and I. Balic, “Crowdsourcing MUSHRA Tests in the Age of Generative Speech Technologies: A Comparative Analysis of Subjective and Objective Testing Methods,” Interspeech 2025, Rotterdam, Netherlands (accepted, forthcoming).
- Manuscript: https://arxiv.org/pdf/2506.00950
- Public release of test data and software: in preparation.