Resource Efficient Distributed Inference of Deep Neural Networks
Date
2025
Authors
Mubark, W.
Publisher
Saudi Digital Library
Abstract
Deep Neural Networks (DNNs) play a central role in contemporary artificial intelligence, enabling a wide range of applications including computer vision, natural language processing, and multimodal intelligence. Despite their success, deploying these models efficiently on edge devices remains challenging due to their substantial computational requirements, energy consumption, and communication overhead. This dissertation proposes an integrated framework for resource-efficient distributed inference of DNNs in Edge AI environments. The framework is structured around three interrelated components: asynchronous split inference, resource-aware batching, and efficient tensor compression between clients and servers.
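To make the split-computing idea concrete, the following minimal sketch partitions a sequential backbone into a client-side head and a server-side tail. It assumes a PyTorch-style nn.Sequential model; the split_model helper, the toy backbone, and the split index are illustrative assumptions, not the dissertation's actual partitioning method.

```python
# Minimal split-computing sketch. The helper name, toy backbone, and
# split index are illustrative assumptions, not the dissertation's
# actual partitioning method.
import torch
import torch.nn as nn

def split_model(model: nn.Sequential, split_at: int):
    """Partition a sequential model at layer index `split_at` into a
    client-side head and a server-side tail."""
    return model[:split_at], model[split_at:]

# Toy stand-in for a ViT/ResNet-style backbone.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)
head, tail = split_model(backbone, split_at=4)

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)   # one client-side input
    activation = head(x)              # computed on the edge client
    logits = tail(activation)         # computed on the backend server
```

The activation crossing the split point is the only payload exchanged over the network, which is why the compression component described below targets exactly this tensor.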
The first contribution introduces an asynchronous split inference approach in which model execution is divided across edge clients and backend servers while overlapping communication with computation. This approach, referred to as ASAP (Asynchronous Split Inference for Accelerated DNN Execution), reduces idle periods during inference and improves system throughput. The second contribution investigates adaptive input and slice batching techniques designed to enhance hardware utilization and reduce inference latency across heterogeneous client devices. These batching strategies dynamically respond to variations in device capabilities and network conditions, enabling efficient resource sharing in distributed settings. The third contribution presents a tensor compression strategy that minimizes communication overhead by compactly encoding intermediate activations exchanged between clients and servers, achieving bandwidth reduction without compromising inference accuracy.
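The overlap-and-compress idea behind the first and third contributions can likewise be sketched in a few lines. The thread-and-queue pipeline and the uint8 affine quantization below are illustrative assumptions standing in for the ASAP pipeline and the tensor compression scheme; they show how a client can keep computing head activations while earlier, compactly encoded activations are still in flight to the server.

```python
# Sketch of overlapping communication with computation while sending
# quantized activations. The pipeline structure and uint8 quantization
# are illustrative assumptions, not the ASAP implementation.
import threading, queue
import torch
import torch.nn as nn

def quantize(t: torch.Tensor):
    """Affine-quantize float32 to uint8 (~4x smaller payload)."""
    lo, hi = t.min(), t.max()
    scale = (hi - lo).clamp(min=1e-8) / 255.0
    return ((t - lo) / scale).round().to(torch.uint8), scale, lo

def dequantize(q, scale, lo):
    return q.to(torch.float32) * scale + lo

head = nn.Sequential(nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU())
tail = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
link = queue.Queue(maxsize=4)   # stands in for the client->server channel

def client(inputs):
    with torch.no_grad():
        for x in inputs:
            link.put(quantize(head(x)))  # encode and enqueue; the server
                                         # drains the queue concurrently
    link.put(None)                       # end-of-stream marker

def server():
    with torch.no_grad():
        while (msg := link.get()) is not None:
            _ = tail(dequantize(*msg))   # finish inference server-side

t = threading.Thread(target=server)
t.start()
client([torch.randn(1, 3, 224, 224) for _ in range(8)])
t.join()
```

A fuller version of this sketch would also let the server drain several queued activations at once and stack them into a single batch before running the tail, which is the intuition behind the resource-aware batching contribution.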
Extensive experimental evaluations are conducted using state-of-the-art vision models, including Vision Transformer (ViT), Swin Transformer, DenseNet, and ResNet, under a variety of deployment scenarios. The results show up to a 67 percent reduction in end-to-end inference latency, along with notable improvements in GPU utilization and energy efficiency. By combining asynchronous execution, adaptive batching, and compression techniques, the proposed framework supports scalable, low-latency, and adaptive inference suitable for real-world edge deployments.
Overall, this dissertation contributes to the advancement of distributed deep learning by addressing the interplay between computation and communication in heterogeneous edge–server systems. The proposed solutions establish a foundation for future research on adaptive model partitioning, compression-aware inference, and real-time multimodal processing in large-scale Edge AI platforms.
Description
PhD dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science.
Keywords
Distributed Inference, Deep Neural Networks, Edge AI, Resource Efficiency, Split Computing
Citation
Mubark, W. (2025). Resource Efficient Distributed Inference of Deep Neural Networks. Doctoral dissertation, University of Missouri–Kansas City.
