Department of Electrical and Computer Engineering Ph.D. Public Defense
Robust Techniques for Generating Talking Faces from Speech
Emre Eskimez
Supervised by Professor Wendi Heinzelman and Professor Zhiyao Duan
Friday, July 26, 2019
1 p.m.
Computer Studies Building, Room 601
Speech is a fundamental modality in human-to-human communication. It carries complex messages that written language cannot convey effectively, such as emotion and intonation, which can change the meaning of a message. Because of its importance in human communication, speech processing has attracted much attention from researchers seeking to establish human-to-machine communication. Personal assistants such as Alexa, Cortana, and Siri, which can be interfaced with using speech, are now mature enough to be part of our daily lives. With the deep learning revolution, speech processing has advanced significantly in automatic speech recognition, speech synthesis, speech style transfer, speaker identification/verification, and speech emotion recognition.
Although speech contains rich information about the message being transmitted and the state of the speaker, it does not carry all of the information needed for spoken communication. Facial cues play an important role in establishing a connection between a speaker and a listener. It has been shown that estimating emotions from speech alone is a hard task for untrained humans; therefore, most people rely on a speaker’s facial expressions to discern the speaker’s affective state, which is important for comprehending the message the speaker is trying to convey. Another benefit of facial cues during speech communication is that seeing the speaker’s lips improves speech comprehension, especially in environments with background noise. This is most evident in cocktail-party scenarios, where people tend to communicate better when they are facing each other but may have trouble communicating over the phone.
This thesis describes my work in the fields of speech enhancement (SE), speech animation (SA), and automatic speech emotion recognition (ASER). For SE, I have proposed long short-term memory (LSTM) based and convolutional neural network (CNN) based architectures to suppress non-stationary noise in utterances. My proposed models have been evaluated in terms of speech quality and speech intelligibility. These models have also been used as pre-processing modules for a commercial automatic speaker verification system, where they have been shown to improve performance in terms of equal error rate (EER).
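To make the LSTM-based SE approach concrete, the following is a minimal sketch, assuming a time-frequency masking formulation in which the network takes noisy log-magnitude spectrogram frames and predicts a per-bin mask. The class name, layer sizes, and the masking setup are illustrative assumptions, not the exact architecture from the thesis.

```python
# Minimal sketch of an LSTM-based speech enhancement model (PyTorch).
# Assumption: time-frequency masking, where the network predicts a mask
# that is applied to the noisy magnitude spectrogram. Sizes are illustrative.
import torch
import torch.nn as nn

class SpeechEnhancerLSTM(nn.Module):
    def __init__(self, n_freq_bins=257, hidden_size=256, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_freq_bins, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.mask = nn.Sequential(
            nn.Linear(hidden_size, n_freq_bins),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, noisy_logmag):
        # noisy_logmag: (batch, frames, n_freq_bins)
        h, _ = self.lstm(noisy_logmag)
        return self.mask(h)  # per-bin mask for each frame

# Usage: multiply the predicted mask with the noisy magnitudes, then
# reconstruct the waveform with the noisy phase (e.g., via iSTFT).
model = SpeechEnhancerLSTM()
noisy_logmag = torch.randn(4, 100, 257)              # batch of spectrogram excerpts
enhanced_mag = model(noisy_logmag) * noisy_logmag.exp()
```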
I have also proposed a speech super-resolution (SSR) system that employs a generative adversarial network (GAN). The generator network is fully convolutional with 1D kernels, enabling real-time inference on edge devices. Objective and subjective studies showed that the proposed network outperforms DNN baselines.
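Below is a minimal sketch of a fully convolutional generator with 1D kernels for SSR. The layer count, channel widths, residual connection, and the use of a pre-upsampled raw waveform as input are assumptions for illustration only; a GAN discriminator and adversarial loss would be added during training.

```python
# Minimal sketch of a 1D fully convolutional SSR generator (PyTorch).
# Assumptions: the low-resolution waveform is already resampled to the
# target rate, and the network predicts the missing high-band detail
# as a residual. Architecture details are illustrative.
import torch
import torch.nn as nn

class SSRGenerator(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, padding=7),
            nn.PReLU(),
            nn.Conv1d(channels, channels, kernel_size=15, padding=7),
            nn.PReLU(),
            nn.Conv1d(channels, 1, kernel_size=15, padding=7),
        )

    def forward(self, lowres_wave):
        # lowres_wave: (batch, 1, samples)
        return lowres_wave + self.net(lowres_wave)  # residual refinement

generator = SSRGenerator()
x = torch.randn(2, 1, 16000)   # one second of audio at 16 kHz
y = generator(x)               # same length, bandwidth-extended
```

Because the generator has no recurrent state and uses only local 1D convolutions, it can process audio in short chunks, which is what makes real-time inference on edge devices plausible.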
For speech animation (SA), I have proposed an LSTM network that predicts face landmarks from the first- and second-order temporal differences of the log-mel spectrogram. Objective and subjective evaluations verified that the generated landmarks are on par with the ground-truth ones. The generated landmarks can be used by existing systems to fit textures or 2D and 3D models, yielding realistic talking faces that improve speech comprehension. I extended this work to include noise-resilient training: the new architecture accepts raw waveforms and processes them through 1D convolutional layers that output the PCA coefficients of 3D face landmarks. Objective and subjective results showed that the proposed network outperforms both my previous work and a DNN-based baseline. In another work, I proposed an end-to-end image-based talking face generation system that works with arbitrarily long speech inputs and utilizes attention mechanisms.
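The following is a minimal sketch of the raw-waveform-to-landmark idea: a stack of 1D convolutions frames the waveform and regresses per-frame PCA coefficients, which are then projected back through a precomputed PCA basis to recover 3D landmark positions. Strides, channel widths, the number of PCA components, and the `pca_basis`/`pca_mean` names are hypothetical.

```python
# Minimal sketch: 1D CNN mapping a raw speech waveform to per-frame PCA
# coefficients of 3D face landmarks (PyTorch). Layer sizes and the PCA
# dimensionality are illustrative assumptions.
import torch
import torch.nn as nn

class WaveToLandmarkPCA(nn.Module):
    def __init__(self, n_pca=20, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=160, stride=80),  # frame the waveform
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, n_pca, kernel_size=3, padding=1),
        )

    def forward(self, wave):
        # wave: (batch, 1, samples) -> (batch, frames, n_pca)
        return self.encoder(wave).transpose(1, 2)

model = WaveToLandmarkPCA()
coeffs = model(torch.randn(1, 1, 16000))       # per-frame PCA coefficients
# landmarks = coeffs @ pca_basis + pca_mean    # e.g., (frames, 68 * 3), given a fitted basis
```

Regressing low-dimensional PCA coefficients rather than raw landmark coordinates constrains the output to plausible face shapes, which also helps when the input speech is noisy.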
For automatic speech emotion recognition (ASER), I have compared human and machine performance in large-scale experiments and concluded that machines can discern emotions from speech better than untrained humans. I have also proposed a web-based automatic speech emotion classification framework in which users can upload their files and analyze the affective content of the utterances. The framework adapts to each user’s choices over time as the user corrects incorrect labels, enabling large-scale emotional analysis in a semi-automatic fashion. In addition, I have proposed a transfer learning framework in which autoencoders are trained on 100 hours of neutral speech to boost ASER performance. I have systematically analyzed four different autoencoders: the denoising autoencoder, the variational autoencoder, the adversarial autoencoder, and adversarial variational Bayes. This method is beneficial in scenarios where there is not enough annotated data to train deep neural networks (DNNs).
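As a minimal sketch of the transfer learning idea, the example below pretrains a denoising autoencoder on unlabeled (neutral) speech features and then reuses its encoder as the front end of an emotion classifier trained on a smaller labeled set. Feature dimensions, the corruption scheme, and the number of emotion classes are illustrative assumptions.

```python
# Minimal sketch of autoencoder-based transfer learning for ASER (PyTorch).
# Stage 1: pretrain a denoising autoencoder on unlabeled neutral speech.
# Stage 2: reuse its encoder inside an emotion classifier. Sizes are assumed.
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, n_features=40, n_hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x):
        corrupted = x + 0.1 * torch.randn_like(x)   # input corruption
        return self.decoder(self.encoder(corrupted))

# Stage 1: pretraining with a reconstruction loss on unlabeled features.
dae = DenoisingAutoencoder()
recon_loss = nn.MSELoss()
neutral_batch = torch.randn(32, 40)                 # unlabeled neutral-speech features
loss = recon_loss(dae(neutral_batch), neutral_batch)

# Stage 2: reuse the pretrained encoder and fine-tune on labeled emotion data.
classifier = nn.Sequential(dae.encoder, nn.Linear(128, 4))  # e.g., 4 emotion classes
logits = classifier(torch.randn(8, 40))                     # train with cross-entropy
```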
Pulling all of this work together provides a framework for generating, from noisy and emotional speech, a realistic talking face capable of expressing emotions. This framework would be beneficial for applications in telecommunications, human-machine interaction/interfaces, augmented/virtual reality, telepresence, video games, dubbing, and animated movies.