Video Synthesis of a Talking Head

Date

2023-07

Publisher

University of Leeds

Abstract

The ability to synthesise a talking-head video from speech audio has many potential applications, such as video conferencing, video animation production, and virtual assistants. Although there has been considerable prior work on this task, the quality of generated videos is typically limited in overall realism and resolution. In this thesis, we propose a novel approach for synthesising a high-resolution talking-head video (1024×1024 in our experiments) from speech audio and a single identity image. The approach is built on top of a pre-trained StyleGAN image generator: we model trajectories in the generator's latent space conditioned on speech utterances. To train this model, we use a dataset of talking-head videos, which are mapped into the latent space of the image generator using an image encoder that is also pre-trained. We train a recurrent neural network to map speech utterances to sequences of displacements in the latent space of the image generator. These displacements are applied to the latent code obtained by back-projecting a single identity frame chosen from a target video in the training dataset. The thesis begins by reporting an experimental evaluation of existing GAN inversion methods that map video frames into the latent space of a pre-trained StyleGAN image generator. We apply one such inversion method to train an unconditional video generator that requires only an identity image and a random seed for the dynamical process that generates a trajectory through the latent space of the image generator. We evaluate our method for talking-head synthesis from speech audio with standard measures and show that it significantly outperforms recent state-of-the-art methods on commonly used audio-visual talking-head datasets (GRID and TCD-TIMIT). We perform the evaluation with two versions of StyleGAN: one trained on video frames depicting talking heads and the other on faces with static expressions (i.e., not talking). The results are of higher quality when using the StyleGAN pre-trained on talking heads; however, the range of possible identities is narrower because the talking-head dataset contains far fewer identities. Videos from our experiments can be found at https://mohammedalghamdi.github.io/phd-thesis-website/
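The pipeline sketched in the abstract (invert an identity frame into the latent space, predict per-frame latent displacements from speech with a recurrent network, decode each displaced code into a video frame) can be illustrated in a few lines. The following is a minimal PyTorch sketch, not the thesis implementation: the class name, layer sizes, and the assumption of W+ codes of shape 18×512 (standard for a 1024×1024 StyleGAN) with mel-spectrogram audio features are all illustrative. Random tensors stand in for the pre-trained generator and encoder.

import torch
import torch.nn as nn

class AudioToLatentRNN(nn.Module):
    """Maps a sequence of speech features to per-frame displacements
    in the W+ latent space of a pre-trained StyleGAN generator."""
    def __init__(self, audio_dim=80, hidden_dim=512, n_layers=18, w_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        # One displacement vector per StyleGAN layer (W+ is n_layers x w_dim).
        self.head = nn.Linear(hidden_dim, n_layers * w_dim)
        self.n_layers, self.w_dim = n_layers, w_dim

    def forward(self, audio_feats):
        # audio_feats: (batch, frames, audio_dim), e.g. mel-spectrogram frames
        h, _ = self.rnn(audio_feats)
        dw = self.head(h)  # (batch, frames, n_layers * w_dim)
        return dw.view(dw.shape[0], dw.shape[1], self.n_layers, self.w_dim)

# Toy usage with random tensors in place of the pre-trained components.
model = AudioToLatentRNN()
audio = torch.randn(1, 25, 80)        # ~1 second of speech features (assumed 25 fps)
w_identity = torch.randn(1, 18, 512)  # W+ code of the back-projected identity frame
displacements = model(audio)          # (1, 25, 18, 512)
w_trajectory = w_identity.unsqueeze(1) + displacements
# Each w_trajectory[:, t] would then be decoded by the pre-trained StyleGAN
# generator into one 1024x1024 video frame.

Keeping the identity code fixed and predicting only displacements is what lets a single image supply the identity while the speech signal drives the motion through the latent space.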

Keywords

talking-head, video synthesis, artificial intelligence, generative models
