Baseline System
Status
The baseline system is under development. Further details will be shared shortly.
Motivation
- Lower the barrier to entry and participation
- Give potential participants a simple starting point for research
Emphasis
- Use of well-adopted open-source components
- Simplicity of the baseline system itself
Approach
- Leverage the End-to-End Speech Processing Toolkit (ESPnet)
  - Widely adopted in the community
  - Support for data augmentations: noise and reverb
  - Trainer and multi-GPU support
  - Learning rate schedulers
  - Support for neural codecs
  - Reduces the development effort and the amount of code that must be open-sourced
- Separate baselines for Track 1 and Track 2
Baseline System Tasks
- Generate training data (offline or online, TBD): download datasets, apply the challenge-provided curation lists, and run augmentations
- Run model training
- Run objective-metric evaluation with the VERSA toolkit
- Provide a script that validates the challenge rules: compute and latency
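As an illustration of the noise augmentation step, the sketch below mixes a noise signal into speech at a target signal-to-noise ratio. It is a minimal standard-library example, not ESPnet's augmentation API; the function names `rms` and `mix_at_snr`, the 16 kHz sample rate, and the 10 dB target are assumptions chosen for illustration only.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add it.

    Hypothetical helper for illustration; it assumes a non-silent noise signal.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # Gain that brings the noise RMS to speech_rms / 10^(snr_db / 20).
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


# Toy signals: one second of a 220 Hz tone and Gaussian noise at an assumed 16 kHz.
random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [random.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=10.0)
achieved_snr_db = 20 * math.log10(rms(speech) / rms(scaled_noise))
```

Reverb augmentation would follow the same pattern, with the speech convolved with a room impulse response before the noise is added.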
Model Components
- Encoder
  - Strided convolutions + residual units (causal)
- Quantizer
  - Hard residual vector quantization (VQ) with straight-through estimator (STE) gradients, or a finite scalar quantization (FSQ) based method
  - Support for multi-bitrate training (1 and 6 kbps)
- Decoder
  - Transposed convolutions + residual units (causal)
- Discriminator
  - Borrow the design from EnCodec, with fewer channels
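To make the quantizer items more concrete, here is a toy sketch of residual quantization and the corresponding bitrate arithmetic. It covers only the forward pass: each stage encodes the residual left by the previous one, and using fewer stages gives a lower bitrate. Scalars stand in for latent vectors, and gradient handling (STE) is not shown. The codebook values, the 50 frames/s rate, the 1024-entry codebook size, and the idea of reaching 1 and 6 kbps by truncating stages are all assumptions for illustration; the baseline's actual configuration has not been published.

```python
import math


def nearest(codebook, x):
    """Index of the codebook entry closest to the scalar x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))


def residual_quantize(x, codebooks, num_stages):
    """Quantize x with the first `num_stages` codebooks.

    Each stage quantizes what the previous stages failed to represent.
    """
    residual, reconstruction, indices = x, 0.0, []
    for codebook in codebooks[:num_stages]:
        idx = nearest(codebook, residual)
        indices.append(idx)
        reconstruction += codebook[idx]
        residual -= codebook[idx]
    return reconstruction, indices


def bitrate_bps(frame_rate_hz, num_stages, codebook_size):
    """Bits per second when each frame emits one index per stage."""
    return frame_rate_hz * num_stages * math.log2(codebook_size)


# Toy codebooks: four entries per stage, each stage four times finer than the last.
base = [-0.75, -0.25, 0.25, 0.75]
codebooks = [[c * 0.25 ** k for c in base] for k in range(3)]

x = 0.6
errors = [abs(x - residual_quantize(x, codebooks, k)[0]) for k in (1, 2, 3)]

# Assumed configuration: 50 frames/s and 1024-entry codebooks (10 bits per index).
low_rate = bitrate_bps(50, 2, 1024)    # 2 stages
high_rate = bitrate_bps(50, 12, 1024)  # 12 stages
```

Under these assumed numbers, 2 stages give 1000 bit/s and 12 stages give 6000 bit/s, and the reconstruction error in `errors` shrinks as stages are added.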