CONTEXTUAL INFORMATION FOR OBJECT DETECTION
Abstract
Object detection has improved rapidly in recent decades, but because it is essential to a
wide range of applications, further improvement is still needed.
This thesis proposes using contextual information captured from digital scenes as a
tool for improving detection performance. Contextual information, such as
the co-occurrence of objects and their spatial relationships and relative sizes, provides
rich knowledge about scenes and supports their interpretation. Determining such relationships
among objects is seen to provide machine learning models with vital cues that help detection
methods reach better performance.
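As a rough illustration of the kinds of cues named above, the following sketch computes a co-occurrence count, a relative size ratio, and a coarse spatial relation from bounding-box annotations. The annotation format, the function names, and the left/right rule are assumptions made for this example only; they are not the relationship definitions used in the thesis.

```python
from collections import Counter
from itertools import combinations

# Assumed (hypothetical) annotation format: one list per image, where each
# entry is (label, (x, y, width, height)) with the box given in pixels.


def cooccurrence_counts(images):
    """Count how many images contain each unordered pair of distinct labels."""
    counts = Counter()
    for objects in images:
        labels = sorted({label for label, _ in objects})
        for a, b in combinations(labels, 2):
            counts[(a, b)] += 1
    return counts


def relative_size(box_a, box_b):
    """Return the area of box_a divided by the area of box_b."""
    _, _, wa, ha = box_a
    _, _, wb, hb = box_b
    return (wa * ha) / (wb * hb)


def horizontal_relation(box_a, box_b):
    """Return 'left', 'right', or 'aligned' for box_a relative to box_b,
    comparing the horizontal positions of the two box centres."""
    centre_a = box_a[0] + box_a[2] / 2
    centre_b = box_b[0] + box_b[2] / 2
    if centre_a < centre_b:
        return "left"
    if centre_a > centre_b:
        return "right"
    return "aligned"


# Toy data, invented for the example.
images = [
    [("person", (0, 0, 10, 20)), ("dog", (20, 10, 5, 5))],
    [("person", (5, 5, 10, 20)), ("car", (30, 0, 40, 20))],
    [("dog", (0, 0, 5, 5)), ("person", (10, 0, 10, 20))],
]
counts = cooccurrence_counts(images)
```

Pair keys are stored in alphabetical order, so the person/dog count is read as `counts[("dog", "person")]`.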
In this thesis, sixteen contextual object-object relationships captured from the MSCOCO 2017
training dataset are proposed. Building on the insight that these
sixteen relationships provide, two contextual models, named the Rescoring Model and the Relabelling
Model, are proposed. These models explicitly encode contextual information from
scenes, resulting in improved performance for two state-of-the-art detectors
(i.e., Faster RCNN and YOLO). The models provide even greater improvement
when applied repeatedly, achieving higher AUC, mAP and F1 scores, with an
increase of up to 19 percentage points compared with the baseline detectors.
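To make the idea of explicit, context-based rescoring concrete, the sketch below blends each detection's confidence with the average co-occurrence probability of its label given the other confident detections in the same image. This is only an illustrative stand-in: the blending rule, the `alpha` and `min_conf` parameters, and the probability table are assumptions for this example, not the Rescoring Model described in the thesis.

```python
def rescore(detections, cooccur_prob, alpha=0.3, min_conf=0.5):
    """Blend detector scores with a simple contextual support term.

    detections   : list of (label, score) pairs for one image.
    cooccur_prob : dict mapping (label, other_label) to an assumed probability
                   that `label` appears in an image containing `other_label`.
    alpha        : weight of the contextual term (hypothetical parameter).
    min_conf     : only detections at or above this score act as context.
    """
    rescored = []
    for i, (label, score) in enumerate(detections):
        support = [
            cooccur_prob.get((label, other), 0.0)
            for j, (other, other_score) in enumerate(detections)
            if j != i and other_score >= min_conf
        ]
        if support:
            context = sum(support) / len(support)
            score = (1 - alpha) * score + alpha * context
        # With no confident neighbours the original score is kept unchanged.
        rescored.append((label, score))
    return rescored


# Toy values, invented for the example.
detections = [("person", 0.9), ("dog", 0.4), ("toaster", 0.45)]
cooccur_prob = {("dog", "person"): 0.8, ("toaster", "person"): 0.1}
rescored = rescore(detections, cooccur_prob)
```

In this toy case the uncertain "dog" detection gains confidence because it plausibly accompanies the confident "person", while the "toaster" loses confidence. Because the output has the same shape as the input, the function can be applied again to its own result, which is one simple way to read the repeated processing mentioned above.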
Motivated by the improvement these contextual models achieve, another contextual model,
named the Transformer-Encoder Detector Module, is proposed. In contrast to the previous
models, this model implicitly encodes contextual statistics and uses an attention mechanism
to provide a deeper understanding of image content. It also achieves higher mAP, F1
and AUC scores, with an average increase of 13 percentage points compared with the Faster RCNN detector.
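The attention mechanism mentioned above can be sketched, in its simplest form, as single-head scaled dot-product self-attention over one feature vector per detected object. The version below omits the learned query, key, and value projections and everything else a real Transformer encoder contains, so it should be read as a minimal illustration of how each object's representation can absorb information from the others, not as the proposed module.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features):
    """Scaled dot-product self-attention without learned projections.

    features : list of equal-length float vectors, one per detected object
               (an assumed input format for this example).
    Returns (outputs, weights), where weights[i][j] is how strongly
    object i attends to object j.
    """
    d = len(features[0])
    outputs, weights = [], []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in features
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of all feature vectors.
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, features)) for i in range(d)]
        )
    return outputs, weights


# Toy feature vectors, invented for the example.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(features)
```

Each row of `weights` sums to one, so every output vector is a convex combination of the input features.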
Perturbed images, produced using two different perturbation approaches, are used
to examine the impact of the proposed contextual models. Results show that the contextual
models also outperform the baseline detector on these images. This is attributed to
their use of both visual and contextual features, whereas the detector depends only on
visual features.