Datasets

Table of Contents

  1. Training Datasets
    1. Speech Datasets
    2. Noise Datasets
    3. Room Impulse Response Datasets

Training Datasets

Training datasets will include the publicly available speech, noise, and room impulse response datasets outlined in the tables below. Prospective participants are encouraged to propose additional training datasets up to one week before the challenge start.

Data preparation scripts are available in the data generation repository: https://lrac.short.gy/data-gen-repo.

Speech Datasets

Training Speech Datasets Distribution

Figure 1: List of speech datasets permitted for model training in the 2025 LRAC Challenge.

By default, the data preparation script filters files based on our estimates of SNR, reverberation metrics, and bandwidth. We also estimated speaker gender, speaker ID (where not already available), and per-speaker durations in order to obtain an approximately balanced and maximally diverse training set. The filtering additionally excludes files reserved for testing (the open test set).
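The snippet below is a minimal sketch of this kind of metadata-driven filtering with pandas. The file names, column names, thresholds, and per-speaker duration cap are illustrative assumptions, not the values used by the actual data preparation script.

```python
import pandas as pd

SNR_MIN_DB = 20.0            # assumed minimum estimated SNR
C50_MIN_DB = 15.0            # assumed minimum clarity (reverberation metric)
BANDWIDTH_MIN_HZ = 7000.0    # assumed minimum estimated bandwidth
MAX_HOURS_PER_SPEAKER = 1.0  # assumed per-speaker duration cap

meta = pd.read_csv("speech_metadata.csv")                     # hypothetical metadata table
test_files = set(open("open_test_files.txt").read().split())  # hypothetical open-test list

# Quality filtering on estimated SNR, reverberation, and bandwidth,
# excluding files reserved for the open test set.
curated = meta[
    (meta["snr_db"] >= SNR_MIN_DB)
    & (meta["c50_db"] >= C50_MIN_DB)
    & (meta["bandwidth_hz"] >= BANDWIDTH_MIN_HZ)
    & ~meta["path"].isin(test_files)
]

# Cap cumulative per-speaker duration so no speaker dominates,
# keeping the set approximately balanced and diverse.
curated = curated.sort_values("duration_s")
cum_hours = curated.groupby("speaker_id")["duration_s"].cumsum() / 3600.0
curated = curated[cum_hours <= MAX_HOURS_PER_SPEAKER]
```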

Training Speech Number of Files per Language Distribution

Figure 2: Number of files per language for Non-Curated and Curated Training Data.

Training Speech Gender Distribution

Figure 3: Gender distribution for Non-Curated and Curated Training Data.

Training Speech Number of Hours per Dataset Distribution

Figure 4: Number of hours per dataset for Non-Curated and Curated Training Data.


Noise Datasets

Training Noise Datasets Distribution

Figure 5: List of noise datasets permitted for model training in the 2025 LRAC Challenge.

From these datasets, we classify all files using an ontology rooted in AudioSet, but trimmed down to emphasise broad noise categories and human vocal sounds. For classification, we use CLAP. In our own research using and modifying CLAP, we find that on ESC-50, a dataset with more classes than our ontology but minimal class overlap, CLAP achieves an accuracy of 94.8%. On an internal dataset with a similar number of classes but likely significantly more class overlap, CLAP achieves an accuracy of 62.3%. We expect the accuracy on these noise datasets to fall between these two numbers, likely closer to the former, as the classes we defined for this task are more distinct than those in the internal dataset.
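As an illustration, the snippet below runs zero-shot audio classification with a public CLAP checkpoint via the Hugging Face pipeline. The checkpoint, the input file, and the candidate labels are placeholders, not the exact model variant or full ontology we used.

```python
from transformers import pipeline

# Zero-shot audio classification with a publicly available CLAP checkpoint.
classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",  # illustrative checkpoint choice
)

# A small illustrative subset of 'CLAP-friendly' class names.
candidate_labels = [
    "machines",
    "musical instruments",
    "speech",
    "laughter",
    "a cappella singing",
    "silence",
]

# The pipeline returns labels sorted by probability, highest first.
scores = classifier("example_noise.wav", candidate_labels=candidate_labels)
best = scores[0]
print(best["label"], best["score"])
```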

We adjusted our class names to be ‘CLAP-friendly’. Figure 6 shows the resulting distribution.

Training Noise Full Class Distribution

Figure 6: Full distribution of noises in the training noise dataset.

From that distribution, we derive a subset. We first remove the classes ['music with instruments and vocals singing', 'speech', 'a cappella singing', 'laughter', 'silence'], i.e. the classes that could be considered human vocal sounds, plus silence. In addition, we keep only the 7500 most probable files according to CLAP from 'musical instruments' and 'machines', in an effort to smooth out the distribution. Finally, we set aside a portion of the data for validation, as sketched below. The resulting final distribution of noises for the training set is shown in Figure 7.
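The following is a minimal sketch of this subset derivation, assuming a table with one row per file and 'label' / 'clap_prob' columns holding the CLAP outputs. The file name, column names, and the validation fraction are assumptions.

```python
import pandas as pd

df = pd.read_csv("noise_classifications.csv")  # hypothetical CLAP classification results

# Drop human vocal sounds and silence.
excluded = {
    "music with instruments and vocals singing",
    "speech",
    "a cappella singing",
    "laughter",
    "silence",
}
df = df[~df["label"].isin(excluded)]

# Cap the two over-represented classes at their 7500 most probable files.
capped = []
for label, group in df.groupby("label"):
    if label in {"musical instruments", "machines"}:
        group = group.nlargest(7500, "clap_prob")
    capped.append(group)
df = pd.concat(capped)

# Hold out a per-class fraction for validation (the fraction is assumed).
val = df.groupby("label").sample(frac=0.05, random_state=0)
train = df.drop(val.index)
```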

Training Noise Final Class Distribution

Figure 7: Final distribution of noises in the training noise dataset.


Room Impulse Response Datasets

Training Room Impulse Response Datasets Distribution

Figure 8: List of room impulse response (RIR) datasets permitted for model training in the 2025 LRAC Challenge.
