CONTEXTUAL INFORMATION FOR OBJECT DETECTION

Abstract

Object detection has improved rapidly in recent decades, but because it is essential to a wide range of applications, further improvement is still needed. This thesis proposes using contextual information captured from digital scenes to improve detection performance. Contextual information, such as the co-occurrence of objects and the spatial and relative-size relationships among them, provides rich knowledge about a scene and supports its interpretation. Such relationships give machine learning models vital cues that help detection methods perform better. In this thesis, sixteen contextual object-object relationships captured from the MSCOCO 2017 training dataset are proposed. Building on the information those sixteen relationships provide, two contextual models, named the Rescoring Model and the Relabelling Model, are proposed. These models explicitly encode contextual information from scenes, resulting in improved performance for two state-of-the-art detectors (i.e., Faster RCNN and YOLO). The models provide even greater improvement when applied repeatedly, achieving higher AUC, mAP and F1 scores, with an increase of up to 19 percentage points compared with the baseline detectors. Given the improvement those contextual models achieve, another contextual model, named the Transformer-Encoder Detector Module, is proposed. In contrast to the previous models, this model implicitly encodes contextual statistics and uses an attention mechanism to provide a deeper understanding of image contents. It also achieves higher mAP, F1 and AUC scores, with an average increase of 13 percentage points compared with the Faster RCNN detector. Perturbed images, produced with two different perturbation approaches, are used to examine the impact of the proposed contextual models. Results show that the contextual models also outperform the baseline detector on these images. This is because they use both visual and contextual features, unlike the detector, which depends only on visual features.
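
To make the idea of explicit, co-occurrence-based rescoring concrete, the following is a minimal sketch only. It is not the thesis's Rescoring Model, whose details are not given in this abstract. The function names, the blending weight, and the toy label lists are all assumptions introduced for illustration. The sketch estimates how often one class appears given that another is present, then blends each detection score with the average support it receives from the other detections in the same image.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probs(image_labels):
    """Estimate P(b present | a present) from per-image label lists.

    Returns a dict keyed by (a, b). Hypothetical helper, not from the thesis.
    """
    single = Counter()
    pair = Counter()
    for labels in image_labels:
        present = sorted(set(labels))  # count each class once per image
        single.update(present)
        for a, b in combinations(present, 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b): n / single[a] for (a, b), n in pair.items()}


def rescore(detections, probs, weight=0.5):
    """Blend each score with the mean P(label | other) over other detections.

    detections: list of (label, score). `weight` is an assumed blending factor.
    """
    rescored = []
    for i, (label, score) in enumerate(detections):
        others = [l for j, (l, _) in enumerate(detections)
                  if j != i and l != label]
        if not others:
            rescored.append((label, score))  # no context available
            continue
        support = sum(probs.get((o, label), 0.0) for o in others) / len(others)
        rescored.append((label, (1 - weight) * score + weight * support))
    return rescored


# Toy data (assumed): class labels per training image.
training = [["person", "horse"], ["person", "horse"], ["person", "car"], ["car"]]
probs = cooccurrence_probs(training)

# Two equally scored, uncertain detections next to a confident "person".
detections = [("person", 0.9), ("horse", 0.4), ("car", 0.4)]
rescored = rescore(detections, probs)
```

In this toy setting, "horse" co-occurs with "person" more often than "car" does, so its score ends up above that of "car" even though the detector scored them equally. A real system would also need the spatial and relative-size relationships mentioned above, which this sketch leaves out.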

Copyright owned by the Saudi Digital Library (SDL) © 2025