Keynote Talks

Jean-Marc Valin (Senior Staff Research Scientist, Google)

Keynote
From Paper to Production: the Engineering and Standardization Challenges of Neural Speech Coding

Abstract

It has been eight years since the emergence of the first neural speech codecs. Since then, proposed codecs have shown steadily improving quality, but there still does not appear to be an upcoming standard based on neural speech coding. The most obvious roadblock to mainstream neural codecs is computational resources. In that respect, we have seen a complexity reduction of three orders of magnitude, from hundreds of GFLOPS down to hundreds of MFLOPS.

While the complexity problem is slowly being addressed, other significant obstacles remain. I argue that a standard neural codec would also require some level of interpretability so that its behaviour and robustness can be reasonably understood. Furthermore, we need standards that can evolve while preserving long-term interoperability, which is difficult given the rapid pace of codec evolution. In this keynote, I will propose a set of criteria for an upcoming neural codec standard and suggest practical paths toward achieving them.

Bio

Jean-Marc Valin is a Senior Staff Research Scientist at Google and a long-time contributor to the Xiph.Org Foundation. He received his B.A.Sc., M.S., and Ph.D. in Electrical Engineering from the University of Sherbrooke, Canada. He is a lead architect of the Opus and Speex audio codecs and also contributed to the AV1 video codec. His research focuses on speech and audio coding, neural vocoders (LPCNet, FARGAN), and deep-learning-based speech enhancement. He was previously at Amazon Web Services and Mozilla.

Nicola Pia (Senior Scientist, Fraunhofer IIS)

Keynote
Design Choices and Effective Evaluation of Modern Speech and Audio Codecs Based on Neural Networks

Abstract

Modern advances in machine learning have revolutionized the field of speech and audio coding at low bit rates. While the rapid pace of innovation produces promising new end-to-end codecs nearly every month, these technologies need to be integrated into a complete audio processing pipeline to realize their potential for real-world communications applications.

In this keynote, I will provide a comprehensive overview of the architectural choices and inherent challenges involved in developing modern neural speech and audio codecs. We will explore how the structure of the codec and the interplay between its modules dictate performance, and why robust quality control across a variety of signal types is of paramount importance. Finally, I will present recent advancements in subjective listening methodologies specifically designed for efficient evaluation of neural speech codecs.

Bio

Nicola Pia is a Senior Scientist at Fraunhofer IIS in Erlangen, Germany. He studied Mathematics at the Università di Cagliari and spent the final year of his master's degree at the Université de Strasbourg. He later pursued a PhD in Mathematics, conducting his research between the Università di Cagliari and the Ludwig-Maximilians-Universität (LMU) in Munich. After receiving his doctorate in 2019, he won a DAAD grant and completed a brief post-doctoral fellowship at LMU.

Since joining Fraunhofer in 2019, his work has focused on speech and audio coding, with a particular emphasis on communication with mobile devices. He has authored many papers and is an inventor on several patents on this subject. He also teaches a course on deep generative models for signal processing at the Friedrich-Alexander-Universität Erlangen-Nürnberg.

Cullen Jennings (Fellow and CTO for Collaboration AI, Cisco Systems)

Keynote
In Search of Better Audio Codecs: Requirements and Constraints for Modern Audio Codecs on the Internet

Abstract

We may have seemingly infinite bandwidth, but work around audio codecs is increasingly complex. From the early days of G.711 to today’s ML algorithms, this talk examines the history of audio codecs and presents thoughts about the challenges ahead.

We're faced with new problems, such as integrating with wearable and handheld satellite devices, and working with spatial audio, multispeaker audio, and low-latency encoding. Machine learning audio codecs raise dilemmas around training data and bias. And how do we weigh tradeoffs between robustness, generalization, complexity, fidelity, latency, and interoperability? When it comes to evaluating different codecs, what critical factors should we consider? Meanwhile, standards organizations are influencing the state of the art, driving rapid development in both interoperable and proprietary codecs.

Bio

Cullen Jennings is a Fellow and CTO for Audio/Video Collaboration and AI at Cisco, where he drives the vision and strategy for collaboration technology, including AI research and development. Since the '90s, he has worked on VoIP systems that now host billions of minutes of audioconferencing daily, driven internet voice communications standards, and spearheaded the development of WebRTC. His industry leadership has shaped numerous open-source and standards organizations, and he holds 100+ patents.

Cullen joined Cisco in 2000 through the acquisition of Vovida Networks, where he was VP of Engineering. He is also co-founder of Jasomi Networks and Point Grey Research, and holds a Ph.D. in computer science from the University of British Columbia.