Audio-to-Video Synthesis using Generative Adversarial Network

Date

2024-01-23

Publisher

University of New South Wales

Abstract

Video generation is often perceived as stringing together several image generators. However, in addition to visual quality, video generators must also account for motion smoothness and synchronicity with audio and text. Audio plays a crucial role in guiding visual content, as even slight discrepancies between audio and motion are noticeable to the human eye. Audio can therefore serve as a self-supervised signal for learning motion and for building correlations between audio and motion. While there have been promising attempts at audio-to-video generation, the resulting models typically rely on supervised signals such as keypoints, and annotating keypoints is time-consuming and labour-intensive. This thesis therefore focuses on audio-based, pixel-level video generation without keypoints. Its primary goal is to build models that generate temporally and spatially coherent video from audio inputs. The thesis proposes multiple audio-to-video generator frameworks. The first model, PhonicsGAN, uses GRU units to encode audio and generate pixel-based videos. Each subsequent framework addresses a particular challenge while still pursuing the main objective. To improve the spatial quality of the generated videos, a model is proposed that adapts the image-fusion concept to video generation, incorporating a multiscale fusion module that combines images with video frames. While the spatial quality of the video is important, the temporal aspect of the video frames must also be considered. To address this, a shuffling technique is proposed that presents each dataset sample under varied permutations to improve the model's temporal learning. Finally, a new model is proposed that learns motion trajectories from sparse motion frames, utilising AdaIN to adjust the motion in the content frame towards the target frame and thereby enhance the learning of video motion.
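The AdaIN operation mentioned above can be illustrated with a minimal sketch: the content features are re-normalised to carry the style (target-frame) features' mean and standard deviation. This is an illustrative, pure-Python version assuming per-channel features flattened into a list; the function name and interface are hypothetical, not the thesis's implementation.

```python
import math

def adain(content, style, eps=1e-5):
    # Hypothetical sketch of Adaptive Instance Normalisation:
    # shift/scale the content features so their statistics match
    # those of the style (target) features.
    def stats(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        return mean, math.sqrt(var + eps)  # eps avoids division by zero

    c_mean, c_std = stats(content)
    s_mean, s_std = stats(style)
    return [s_std * (x - c_mean) / c_std + s_mean for x in content]
```

After the transform, the output's mean and standard deviation match the style features', while the relative structure of the content features is preserved.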
All the proposed models are compared with state-of-the-art models to demonstrate their ability to generate high-quality videos from audio inputs. This thesis contributes to the field of video generation in several ways: firstly, by providing an extensive survey of GAN-based video generation techniques; secondly, by proposing and evaluating four pixel-based frameworks for audio-to-video generation, each addressing an important challenge in the field; and lastly, by collecting and publishing a new audio-visual dataset that the research community can use for further investigation in this area.
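The shuffling technique described in the abstract can be sketched as follows: the same video sample is presented under several frame permutations, so that a temporal model must learn which orderings are plausible. This is a minimal illustrative sketch assuming frames are list elements; the function name and the (frames, is_ordered) output format are assumptions, not the thesis's actual pipeline.

```python
import random

def shuffled_variants(frames, num_variants, seed=0):
    # Hypothetical sketch: yield several permuted copies of one video
    # sample, each paired with a flag saying whether the frames are
    # still in their original temporal order.
    rng = random.Random(seed)
    variants = []
    for _ in range(num_variants):
        perm = list(range(len(frames)))
        rng.shuffle(perm)
        permuted = [frames[i] for i in perm]
        variants.append((permuted, perm == sorted(perm)))
    return variants
```

Training on such varied permutations of each sample gives the model an explicit signal about temporal ordering without any keypoint annotation.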

Keywords

artificial intelligence, deep learning, computer vision, content creation, generative models, GAN, video generation, GAN-based video generation techniques, pixel-level video generation, temporal learning, audio-to-video generation

Copyright owned by the Saudi Digital Library (SDL) © 2024