Resource Efficient Distributed Inference of Deep Neural Networks
Date
2025
Authors
Mubark, W.
Publisher
Saudi Digital Library
Abstract
Deep Neural Networks (DNNs) play a central role in contemporary artificial intelligence, enabling a wide range of applications including computer vision, natural language processing, and multimodal intelligence. Despite their success, deploying these models efficiently on edge devices remains challenging due to their substantial computational requirements, energy consumption, and communication overhead. This dissertation proposes an integrated framework for resource-efficient distributed inference of DNNs in Edge AI environments. The framework is structured around three interrelated components: asynchronous split inference, resource-aware batching, and efficient tensor compression between clients and servers.
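To make the split-computing idea concrete, the following minimal sketch partitions a sequential backbone into a client-side head and a server-side tail. It assumes a PyTorch-style nn.Sequential model; the split_model helper, the toy backbone, and the split index are illustrative assumptions, not the dissertation's actual partitioning method.

```python
# Minimal split-computing sketch. The helper name, toy backbone, and
# split index are illustrative assumptions, not the dissertation's
# actual partitioning method.
import torch
import torch.nn as nn

def split_model(model: nn.Sequential, split_at: int):
    """Partition a sequential model at layer index `split_at` into a
    client-side head and a server-side tail."""
    return model[:split_at], model[split_at:]

# Toy stand-in for a ViT/ResNet-style backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
head, tail = split_model(backbone, split_at=4)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)   # one client-side input
    activation = head(x)              # computed on the edge client
    logits = tail(activation)         # computed on the backend server
```

The activation crossing the split point is the only payload exchanged over the network, which is why the compression component described below targets exactly this tensor.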
The first contribution introduces an asynchronous split inference approach in which model execution is divided across edge clients and backend servers while overlapping communication with computation. This approach, referred to as ASAP (Asynchronous Split Inference for Accelerated DNN Execution), reduces idle periods during inference and improves system throughput. The second contribution investigates adaptive input and slice batching techniques designed to enhance hardware utilization and reduce inference latency across heterogeneous client devices. These batching strategies dynamically respond to variations in device capabilities and network conditions, enabling efficient resource sharing in distributed settings. The third contribution presents a tensor compression strategy that minimizes communication overhead by compactly encoding intermediate activations exchanged between clients and servers, achieving bandwidth reduction without compromising inference accuracy.
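The overlap-and-compress idea behind the first and third contributions can likewise be sketched in a few lines. The thread-and-queue pipeline and the uint8 affine quantization below are illustrative assumptions standing in for the ASAP pipeline and the tensor compression scheme; they show how a client can keep computing head activations while earlier, compactly encoded activations are still in flight to the server.

```python
# Sketch of overlapping communication with computation while sending
# quantized activations. The pipeline structure and uint8 quantization
# are illustrative assumptions, not the ASAP implementation.
import threading, queue
import torch
import torch.nn as nn

def quantize(t: torch.Tensor):
    """Affine-quantize float32 to uint8 (~4x smaller payload)."""
    lo, hi = t.min(), t.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    return ((t - lo) / scale).round().to(torch.uint8), scale, lo

def dequantize(q, scale, lo):
    return q.to(torch.float32) * scale + lo

head = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
tail = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
link = queue.Queue(maxsize=4)   # stands in for the client->server channel

def client(inputs):
    with torch.no_grad():
        for x in inputs:
            link.put(quantize(head(x)))  # encode and enqueue; the server
                                         # drains the queue concurrently
    link.put(None)                       # end-of-stream marker

def server():
    with torch.no_grad():
        while (msg := link.get()) is not None:
            _ = tail(dequantize(*msg))   # finish inference server-side

t = threading.Thread(target=server)
t.start()
client([torch.randn(1, 3, 224, 224) for _ in range(8)])
t.join()
```

A fuller version of this sketch would also let the server drain several queued activations at once and stack them into a single batch before running the tail, which is the intuition behind the resource-aware batching contribution.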
Extensive experimental evaluations are conducted using state-of-the-art vision models, including Vision Transformer (ViT), Swin Transformer, DenseNet, and ResNet, under a variety of deployment scenarios. The results show up to a 67 percent reduction in end-to-end inference latency, along with notable improvements in GPU utilization and energy efficiency. By combining asynchronous execution, adaptive batching, and compression techniques, the proposed framework supports scalable, low-latency, and adaptive inference suitable for real-world edge deployments.
Overall, this dissertation contributes to the advancement of distributed deep learning by addressing the interplay between computation and communication in heterogeneous edge–server systems. The proposed solutions establish a foundation for future research on adaptive model partitioning, compression-aware inference, and real-time multimodal processing in large-scale Edge AI platforms.
Description
PhD dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.
Keywords
Distributed Inference, Deep Neural Networks, Edge AI, Resource Efficiency, Split Computing
Citation
Mubark, W. (2025). Resource Efficient Distributed Inference of Deep Neural Networks. Doctoral dissertation, University of Missouri–Kansas City.
