Recognizing Human-Object Interactions in Videos
Date
2023-05-11
Publisher
Saudi Digital Library
Abstract
Understanding human actions that involve interacting with objects is important for a wide range of real-world applications, such as security surveillance and healthcare. This thesis presents three approaches to the problem of human-object interaction (HOI) recognition in videos.
Firstly, we propose a hierarchical framework for analyzing human-object interactions in a video sequence. The framework comprises Long Short-Term Memory (LSTM) networks that capture human motion and temporal object information independently. These representations are then combined through a bilinear layer and fed into a global deep LSTM that learns high-level information about HOIs. To concentrate on the key components of the human and object temporal information, the approach incorporates an attention mechanism into the LSTMs.
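To make the structure concrete, the following is a minimal PyTorch sketch of this hierarchical design, assuming pre-extracted per-frame human and object features; the module names, feature dimensions, and attention form are illustrative assumptions rather than the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn


class HierarchicalHOISketch(nn.Module):
    """Hypothetical sketch: two attended LSTM streams, bilinear fusion, global LSTM."""

    def __init__(self, feat_dim=2048, hidden=512, num_classes=157):
        super().__init__()
        # Separate LSTMs capture human motion and temporal object information.
        self.human_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.object_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Simple additive attention over each stream's hidden states (assumed form).
        self.human_attn = nn.Linear(hidden, 1)
        self.object_attn = nn.Linear(hidden, 1)
        # Bilinear layer combines the attended human and object representations.
        self.bilinear = nn.Bilinear(hidden, hidden, hidden)
        # Global LSTM learns high-level HOI information from the fused sequence.
        self.global_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def _attend(self, states, attn):
        # states: (batch, time, hidden); weight each time step by its attention score.
        weights = torch.softmax(attn(states), dim=1)
        return states * weights

    def forward(self, human_feats, object_feats):
        h_states, _ = self.human_lstm(human_feats)      # (B, T, hidden)
        o_states, _ = self.object_lstm(object_feats)    # (B, T, hidden)
        fused = self.bilinear(self._attend(h_states, self.human_attn),
                              self._attend(o_states, self.object_attn))
        _, (h_n, _) = self.global_lstm(fused)            # final hidden state
        return self.classifier(h_n[-1])                  # video-level HOI logits
```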
Secondly, we aim to achieve a holistic understanding of HOIs by exploiting both their local and global contexts through knowledge distillation. The local context graphs learn the relationships between humans and objects at the frame level by capturing their co-occurrence at a specific time step. The global relation graph, on the other hand, is constructed at the video level and identifies long-term relations between humans and objects throughout a video sequence. We investigate how knowledge from each context graph can be distilled into its counterpart to improve HOI recognition.
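As a rough illustration of the distillation step, the sketch below transfers soft predictions between a local (frame-level) branch and a global (video-level) branch with a temperature-scaled KL divergence; the function names, temperature, and loss weighting are assumptions for illustration, not the exact objective used in the thesis.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target KL divergence commonly used for knowledge distillation.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2


def mutual_distillation(local_logits, global_logits, labels, alpha=0.5):
    # Supervised loss for both branches plus distillation in both directions,
    # so each context graph can learn from its counterpart.
    ce = F.cross_entropy(local_logits, labels) + F.cross_entropy(global_logits, labels)
    kd = (distillation_loss(local_logits, global_logits.detach())
          + distillation_loss(global_logits, local_logits.detach()))
    return (1 - alpha) * ce + alpha * kd
```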
Lastly, we propose the Spatio-Temporal Interaction Transformer-based (STIT) network to reason about spatio-temporal changes of humans and objects. Specifically, the spatial transformers learn the local context of humans and objects within individual frames. The temporal transformer then learns higher-level relations between the spatial context representations at different time steps, capturing long-term dependencies across frames. We further investigate multiple hierarchy designs for learning human interactions.
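The following is a minimal sketch of this spatial-then-temporal transformer idea, assuming per-frame human and object token features are already extracted; the class name, layer counts, and pooling choices are illustrative assumptions rather than the exact STIT architecture.

```python
import torch.nn as nn


class STITSketch(nn.Module):
    """Hypothetical sketch: spatial transformer per frame, temporal transformer across frames."""

    def __init__(self, dim=512, heads=8, num_classes=157):
        super().__init__()
        # Spatial transformer: relates human and object tokens within a single frame.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # Temporal transformer: relates per-frame context representations over time.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, time, entities, dim) human/object features per frame.
        b, t, n, d = tokens.shape
        spatial_out = self.spatial(tokens.reshape(b * t, n, d))   # local (frame-level) context
        frame_ctx = spatial_out.mean(dim=1).reshape(b, t, d)      # one vector per frame
        temporal_out = self.temporal(frame_ctx)                   # long-term dependencies
        return self.classifier(temporal_out.mean(dim=1))          # video-level HOI logits
```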
The effectiveness of each proposed method is evaluated on video action datasets that contain human-object interactions, namely Charades, CAD-120, and Something-Something V1.
Keywords
Human-Object Interactions (HOIs), Knowledge Distillation (KD), Global Context, Local Context, Attention, Spatio-Temporal, Long-Term Dependencies