ECE Seminar Lecture Series

Toward Explainable Speech Foundation Models

Shinji Watanabe, PhD, Associate Professor at Carnegie Mellon University

Friday, April 26, 2024
Noon–1 p.m.

601 Computer Studies Building

Speech foundation models are an active research area with the potential to consolidate various speech-processing tasks within a single model. A notable trend in this domain is the scaling up of data volume, model size, and the range of tasks. This scaling trajectory has significantly changed our research landscape, particularly regarding resource allocation. Notably, it has led to a division of research roles: large tech companies primarily focus on building foundation models, while smaller entities, including academic institutions and smaller companies, concentrate on refining and analyzing these models. While this division has streamlined research efforts, there is growing concern about the potential loss of explainability in these foundation models. This is primarily due to the limited transparency of the model-building process, which is often dictated by company policies. To address this concern, our group has started developing large-scale speech foundation models. This talk introduces Open Whisper-style Speech Models (OWSM), a series of speech foundation models developed at Carnegie Mellon University that reproduce OpenAI Whisper-style training using publicly available data and our open-source toolkit ESPnet. Crucially, our models exhibit several explainable behaviors thanks to the transparency inherent in our model-building process. In addition to showcasing the OWSM models, we discuss related research efforts encompassing software development, data collection, cleaning, and model evaluation. Throughout this presentation, we would like to discuss how to address the research challenges posed by this shifting landscape within our speech and audio community.

Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA, from 2012 to 2017. Before joining Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 400 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from the IEEE ASRU in 2019. He is a Senior Area Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He is or has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), the IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and the Machine Learning for Signal Processing Technical Committee (MLSP). He is an IEEE and ISCA Fellow.