3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image
Date
2025
Publisher
Saudi Digital Library
Abstract
In this research, we propose a novel method for generating an audio-visual scene in 3D
virtual space using a single panoramic RGB-D input. Our investigation begins with the
reconstruction of a 3D model from RGB panoramic data alone, developing a semantic
geometry model by combining estimated monocular depth with material information
for spatial sound rendering. Building upon the preliminary results, we extend our
approach to construct a comprehensive virtual reality (VR) environment using 360°
RGB-D input. The proposed method enables the creation of an immersive VR space
by generating a complete 3D voxelized model that incorporates scene semantics from
a single panoramic input.
Our methodology employs a deep 3D convolutional neural network integrated with
transfer learning for RGB semantic features, coupled with a re-weighting strategy in the
3D weighted cross-entropy loss function. The proposed re-weighting method uniquely
combines two class re-balancing techniques (re-sampling and class-sensitive learning)
while smoothing the weights through an unsupervised clustering algorithm. This
approach addresses critical challenges in semantic scene completion (SSC), including
inherent class imbalances in indoor 3D spatial representations. Furthermore, we quantify
the performance uncertainty in our results to ensure an unbiased assessment across
trials, contributing to more reliable benchmarking in the SSC field. We design a hybrid
architecture featuring a dual-head model that simultaneously processes RGB and depth
data. Depth information is encoded using a Flipped Truncated Signed Distance
Function (F-TSDF), capturing essential geometric shape characteristics. RGB features are
projected from 2D to 3D space using depth maps. We explored various RGB semantics
fusion strategies, including early, middle, and late fusion methods. Based on
performance evaluations using K-fold cross-validation, we selected the late fusion approach.
This method involves downsampling features using planar convolutions to align with
3D resolution, followed by fusing RGB semantic features with geometric information
through element-wise addition. The hybrid encoder-decoder architecture incorporates
an Identity Transformation within a full pre-activation Residual Module (ITRM),
enabling effective management of diverse signals within the F-TSDF representation.
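The role of class re-weighting in a weighted cross-entropy loss can be illustrated with a minimal NumPy sketch. This is not the re-weighting scheme proposed in this work, which combines re-sampling with class-sensitive learning and smooths the weights by clustering. The sketch assumes plain inverse-frequency weights damped by a hypothetical exponent `power`, and all function names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(labels, num_classes, power=0.5):
    """Per-class weights proportional to (1 / class frequency) ** power.

    `power` < 1 damps extreme weights for very rare classes (an assumed
    smoothing choice, not the clustering-based smoothing of the abstract).
    Weights are normalised so that their mean is 1.
    """
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    counts = np.maximum(counts, 1.0)  # avoid division by zero for absent classes
    weights = (counts.sum() / counts) ** power
    return weights / weights.mean()

def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of the per-voxel cross-entropy.

    logits: (N, C) array of raw scores for N voxels and C classes.
    labels: (N,) array of integer class indices.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(labels.size), labels]
    w = class_weights[labels]
    return float((w * nll).sum() / w.sum())

# Toy example: class 0 (e.g. empty space) dominates class 1.
labels = np.array([0] * 90 + [1] * 10)
weights = inverse_frequency_weights(labels, num_classes=2)
loss = weighted_cross_entropy(np.zeros((100, 2)), labels, weights)
```

In this toy case the rare class receives the larger weight, so errors on its voxels contribute more to the loss than errors on the dominant class.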
The inference methodology of the proposed SSC model is extended to accommodate
360° RGB-D input through cubic projection and 3D rotation, enabling VR space
design with comprehensive spatial coverage. We propose a streamlined computer
vision-based approach capable of reconstructing a 3D SSC model from a single panoramic
input, facilitating plausible sound environment simulation. Additionally, our proposed
method contributes to reducing the complexity of estimating room impulse responses
(RIRs), which typically require extensive equipment and multiple recordings in real
space. We implement the audio-visual VR reconstructions in the Unity 3D gaming
platform combined with the Steam Audio plug-in for spatial sound rendering. Acoustic
properties are evaluated by measuring parameters such as early decay time (EDT) and
reverberation time (RT60). Comparative analysis indicates that our approach achieves
better VR space reconstruction, producing more realistic scene representations and more
immersive acoustic characteristics than existing methods reported in the literature.
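As an illustration of how EDT and RT60 are commonly derived from a room impulse response, the following NumPy sketch uses Schroeder backward integration followed by a linear fit of the decay curve. It is a generic sketch, not the measurement procedure used in this work or in Steam Audio. It assumes the RIR is a 1-D array sampled at `fs` Hz, and it uses the conventional evaluation ranges (0 to -10 dB for EDT, -5 to -35 dB for RT60 estimated as T30). Function names are illustrative.

```python
import numpy as np

def schroeder_decay_db(rir):
    """Energy decay curve in dB, obtained by backward integration of rir**2."""
    energy = np.asarray(rir, dtype=np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]  # remaining energy from each sample onward
    edc = edc / edc[0]                   # normalise so the curve starts at 0 dB
    return 10.0 * np.log10(np.maximum(edc, 1e-30))

def decay_time(decay_db, fs, start_db, end_db):
    """Fit a line to the decay curve between start_db and end_db and
    extrapolate it to a 60 dB drop. Returns the time in seconds."""
    idx = np.where((decay_db <= start_db) & (decay_db >= end_db))[0]
    slope, _ = np.polyfit(idx / fs, decay_db[idx], 1)  # slope in dB per second
    return -60.0 / slope

def edt(rir, fs):
    """Early decay time: fit over 0 to -10 dB, scaled to 60 dB."""
    return decay_time(schroeder_decay_db(rir), fs, 0.0, -10.0)

def rt60_t30(rir, fs):
    """RT60 estimated from the -5 to -35 dB range (T30)."""
    return decay_time(schroeder_decay_db(rir), fs, -5.0, -35.0)

# Synthetic check: an ideal exponential decay whose energy falls 60 dB in 0.5 s.
fs = 16000
t = np.arange(2 * fs) / fs
rir = np.exp(-3.0 * np.log(10.0) * t / 0.5)
edt_value = edt(rir, fs)
rt60_value = rt60_t30(rir, fs)
```

For an ideal exponential decay both estimates agree with the nominal reverberation time. For measured or simulated RIRs they generally differ, because EDT is dominated by the early part of the decay.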
The proposed method contributes to the design of enhanced VR environments by
integrating both audio and visual signals into a unified framework. Our results support
the development of datasets that combine audio and 3D SSC models, encouraging the
application of AI in VR spaces. This advancement has the potential to drive progress
in VR applications across various domains, such as gaming, education, and tourism.
Keywords
3D Reconstruction, Semantic Scene Completion, Virtual Reality, Room Acoustic Modelling