SACM - Australia
Permanent URI for this collection: https://drepo.sdl.edu.sa/handle/20.500.14154/9648
Search Results (2 items)
Item (Restricted)
Audio-to-Video Synthesis using Generative Adversarial Network
(University of New South Wales, 2024-01-23) Aldausari, Nuha; Mohammadi, Gelareh; Sowmya, Arcot; Marcus, Nadine

Video generation is often perceived as stringing together several image generators. However, in addition to visual quality, video generators must also account for motion smoothness and synchronicity with audio and text. Audio plays a crucial role in guiding visual content, as even slight discrepancies between audio and motion are noticeable to human eyes. Audio can therefore serve as a self-supervised signal for learning motion and for building correlations between audio and motion. While there have been promising attempts to build audio-to-video generation models, these models typically rely on supervised signals such as keypoints, and annotating keypoints as supervised signals takes time and effort. This thesis therefore focuses on audio-based pixel-level video generation without keypoints. Its primary goal is to build models that generate temporally and spatially coherent video from audio inputs. The thesis proposes multiple audio-to-video generator frameworks. The first proposed model, PhonicsGAN, uses GRU units to generate pixel-based videos from audio. Each subsequent framework addresses a particular challenge while still pursuing the main objective. To improve the spatial quality of the generated videos, a model is proposed that adapts the image-fusion concept to video generation, incorporating a multiscale fusion module that combines images with video frames. Since the temporal aspect of the video frames matters alongside spatial quality, a shuffling technique is proposed that presents each dataset sample in varied permutations to improve the model's temporal learning. A further model learns motion trajectories from sparse motion frames, using AdaIN to adjust the motion in the content frame toward the target frame and thereby enhance the learning of video motion. All the proposed models are compared with state-of-the-art models to demonstrate their ability to generate high-quality videos from audio inputs. The thesis contributes to the field of video generation in several ways: first, by providing an extensive survey of GAN-based video generation techniques; second, by proposing and evaluating four pixel-based frameworks for improved audio-to-video generation, each addressing an important challenge in the field; and last, by collecting and publishing a new audio-visual dataset that the research community can use for further investigation in this area.

Item (Restricted)
Automatic Generation of a Coherent Story from a Set of Images
(Saudi Digital Library, 2023-12) Aljawy, Zainy; Mian, Ajmal; Hassan, Ghulam Mubashar

This dissertation explores vision-and-language (V&L) algorithms. While V&L models succeed at image and video captioning, the dynamic Visual Storytelling Task (VST) remains challenging: it demands coherent stories from a set of images, requiring grammatical accuracy, flow, and style. The dissertation addresses these challenges. Chapter 2 presents a framework built on an advanced language model. Chapters 3 and 4 introduce novel techniques that integrate rich visual representations to enhance the generated stories. Chapter 5 introduces a new storytelling dataset with a comprehensive analysis. Chapter 6 proposes a state-of-the-art Transformer-based model for generating coherent and informative story descriptions from image sets.
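
The first abstract above mentions AdaIN (adaptive instance normalisation) for adjusting the motion in a content frame toward a target frame. The thesis's own implementation is not reproduced in this listing; as a minimal illustrative sketch of the standard AdaIN operation it references, content features are stripped of their per-channel statistics and re-normalised with the target's (the tensor names and shapes below are hypothetical):

```python
import torch

def adain(content, style, eps=1e-5):
    # Per-channel mean/std over the spatial dimensions of (N, C, H, W) features.
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    # Remove the content's statistics, then impose the style/target's.
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical usage: shift a content frame's features toward
# the statistics of a target motion frame's features.
content_feat = torch.randn(1, 64, 32, 32)
target_feat = torch.randn(1, 64, 32, 32)
adapted = adain(content_feat, target_feat)
```

In this reading, the target frame plays the role of AdaIN's "style" input, so the content frame's features inherit the target's motion statistics while keeping their own spatial structure.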
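The same abstract also describes a shuffling technique that presents each dataset sample in varied permutations to improve temporal learning. One plausible, purely hypothetical reading (not the thesis's actual pipeline) is a data step that emits both the ordered clip and frame-shuffled copies, for instance so a temporal discriminator can learn to tell them apart:

```python
import torch

def shuffled_variants(frames, n_perms=4):
    """Yield (sequence, is_ordered) pairs: the original clip plus
    randomly permuted copies of its frame order."""
    t = frames.shape[0]
    yield frames, torch.tensor(1.0)  # correctly ordered clip
    for _ in range(n_perms):
        perm = torch.randperm(t)     # random frame permutation
        yield frames[perm], torch.tensor(0.0)

# Hypothetical usage with a dummy clip of 8 RGB frames at 64x64.
clip = torch.randn(8, 3, 64, 64)
for seq, label in shuffled_variants(clip, n_perms=2):
    print(seq.shape, label.item())
```

Under this assumption, each dataset sample yields several permutations per epoch, giving the model many orderings of the same content from which to learn what temporally coherent motion looks like.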