3D Audio-Visual Indoor Scene Reconstruction and Completion for Virtual Reality from a Single Image

Date

2025

Publisher

Saudi Digital Library

Abstract

In this research, we propose a novel method for generating an audio-visual scene in 3D virtual space from a single panoramic RGB-D input. Our investigation begins with the reconstruction of a 3D model from RGB panoramic data alone, developing a semantic geometry model by combining estimated monocular depth with material information for spatial sound rendering. Building on these preliminary results, we extend our approach to construct a comprehensive virtual reality (VR) environment using 360° RGB-D input. The proposed method enables the creation of an immersive VR space by generating a complete 3D voxelized model that incorporates scene semantics from a single panoramic input.

Our methodology employs a deep 3D convolutional neural network integrated with transfer learning for RGB semantic features, coupled with a re-weighting strategy in the 3D weighted cross-entropy loss function. The proposed re-weighting method combines two class re-balancing techniques (re-sampling and class-sensitive learning) while smoothing the weights through an unsupervised clustering algorithm. This approach addresses critical challenges in semantic scene completion (SSC), including the inherent class imbalance in indoor 3D spatial representations. Furthermore, we quantify the performance uncertainty in our results to ensure an unbiased assessment across trials, contributing to more reliable benchmarking in the SSC field. We design a hybrid architecture featuring a dual-head model that simultaneously processes RGB and depth data. Depth information is encoded using a Flipped Truncated Signed Distance Function (F-TSDF), capturing essential geometric shape characteristics. RGB features are projected from 2D to 3D space using depth maps. We explored various RGB semantics fusion strategies, including early, middle, and late fusion. Based on performance evaluations using K-fold cross-validation, we selected the late fusion approach. This method downsamples features using planar convolutions to align with the 3D resolution, then fuses RGB semantic features with geometric information through element-wise addition. The hybrid encoder-decoder architecture incorporates an Identity Transformation within a full pre-activation Residual Module (ITRM), enabling effective management of diverse signals within the F-TSDF representation.

The inference methodology of the proposed SSC model is extended to accommodate 360° RGB-D input through cubic projection and 3D rotation, enabling VR space design with comprehensive spatial coverage. We propose a streamlined computer vision-based approach capable of reconstructing a 3D SSC model from a single panoramic input, facilitating plausible sound environment simulation. Additionally, our proposed method reduces the complexity of estimating room impulse responses (RIRs), which typically requires extensive equipment and multiple recordings in the real space. We implement the audio-visual VR reconstructions in the Unity 3D gaming platform combined with the Steam Audio plug-in for spatial sound rendering. Acoustic properties are evaluated by measuring parameters such as early decay time (EDT) and reverberation time (RT60). Comparative analysis indicates that our approach achieves better VR space reconstruction, producing more realistic scene representations and more immersive acoustic characteristics than existing methods reported in the literature.

The proposed method contributes to the design of enhanced VR environments by integrating both audio and visual signals into a unified framework. Our results support the development of datasets that combine audio and 3D SSC models, encouraging the application of AI in VR spaces. This advancement has the potential to drive progress in VR applications across various domains, such as gaming, education, and tourism.
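
The abstract names EDT and RT60 as the acoustic evaluation parameters but does not say how they are computed. As an illustration only, the sketch below estimates both from a room impulse response using Schroeder backward integration, a widely used technique; it is not taken from the thesis. The function names, the fit ranges (0 to -10 dB for EDT, -5 to -35 dB for RT60), and the synthetic exponential test signal are all assumptions made for this example.

```python
import numpy as np


def schroeder_decay_db(rir):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.asarray(rir, dtype=float) ** 2
    # Remaining energy from each sample to the end of the response.
    edc = np.cumsum(energy[::-1])[::-1]
    edc = edc / edc[0]
    # Floor at -120 dB to avoid log of zero in silent tails.
    return 10.0 * np.log10(np.maximum(edc, 1e-12))


def decay_time(edc_db, fs, start_db, end_db):
    """Fit a line to the decay curve between start_db and end_db,
    then extrapolate the slope to a 60 dB drop (result in seconds)."""
    t = np.arange(len(edc_db)) / fs
    mask = (edc_db <= start_db) & (edc_db >= end_db)
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)  # slope in dB/s
    return -60.0 / slope


def edt(rir, fs):
    """Early decay time: fit over the first 10 dB of decay (assumed range)."""
    return decay_time(schroeder_decay_db(rir), fs, 0.0, -10.0)


def rt60(rir, fs):
    """Reverberation time from a -5 dB to -35 dB fit (assumed range)."""
    return decay_time(schroeder_decay_db(rir), fs, -5.0, -35.0)


if __name__ == "__main__":
    # Hypothetical test signal: an ideal exponential decay whose energy
    # drops 60 dB in 0.5 s. A real RIR would be measured or simulated.
    fs = 16000
    t = np.arange(int(1.5 * fs)) / fs
    rir = np.exp(-3.0 * np.log(10.0) * t / 0.5)
    print(f"EDT  = {edt(rir, fs):.3f} s")
    print(f"RT60 = {rt60(rir, fs):.3f} s")
```

For an ideal exponential decay the two values coincide. In a real or simulated room they usually differ, which is one reason both are reported.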

Keywords

3D Reconstruction, Semantic Scene Completion, Virtual Reality, Room acoustic modelling

Copyright owned by the Saudi Digital Library (SDL) © 2025