Recognizing Human-Object Interactions in Videos

dc.contributor.advisor: Li, Frederick
dc.contributor.author: Almushyti, Muna Ibrahim M
dc.date.accessioned: 2023-09-13T08:21:56Z
dc.date.available: 2023-09-13T08:21:56Z
dc.date.issued: 2023-05-11
dc.description.abstract: Understanding human actions that involve interacting with objects is important for a wide range of real-world applications, such as security surveillance and healthcare. In this thesis, three approaches are presented for recognizing human-object interactions (HOIs) in videos. Firstly, we propose a hierarchical framework for analyzing HOIs in a video sequence. The framework comprises Long Short-Term Memory (LSTM) networks that capture human motion and temporal object information independently. These pieces of information are then combined through a bilinear layer and fed into a global deep LSTM to learn high-level information about HOIs. To concentrate on the key components of human and object temporal information, the proposed approach incorporates an attention mechanism into the LSTMs. Secondly, we aim to achieve a holistic understanding of HOIs by exploiting both their local and global contexts through knowledge distillation. Local context graphs learn the relationships between humans and objects at the frame level by capturing their co-occurrence at a specific time step. The global relation graph, in contrast, is constructed from video-level human and object interactions, identifying their long-term relations throughout a video sequence. We investigate how knowledge can be distilled from each of these context graphs to its counterpart to improve HOI recognition. Lastly, we propose the Spatio-Temporal Interaction Transformer-based (STIT) network to reason about spatio-temporal changes of humans and objects. Specifically, spatial transformers learn the local context of humans and objects at specific frame times, and a temporal transformer then learns higher-level relations between the spatial context representations at different time steps, capturing long-term dependencies across frames. We further investigate multiple hierarchy designs for learning human interactions. The effectiveness of each proposed method is evaluated on video action datasets featuring human-object interactions, including Charades, CAD-120, and Something-Something V1.
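To make the first approach concrete, below is a minimal PyTorch sketch of a pipeline of that shape: stream-level LSTMs with soft attention over the human and object feature sequences, a bilinear fusion layer, and a global LSTM. All module names, feature dimensions, and the exact attention form are illustrative assumptions, not the thesis's actual implementation.

import torch
import torch.nn as nn


class AttentiveLSTM(nn.Module):
    """LSTM over one feature stream, followed by soft attention that
    weights the per-frame hidden states (a common attention variant,
    assumed here for illustration)."""

    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hid_dim, batch_first=True)
        self.attn = nn.Linear(hid_dim, 1)

    def forward(self, x):                       # x: (B, T, in_dim)
        h, _ = self.lstm(x)                     # (B, T, hid_dim)
        w = torch.softmax(self.attn(h), dim=1)  # (B, T, 1) attention weights
        return w * h                            # attended per-step features


class HierarchicalHOI(nn.Module):
    """Two stream-level LSTMs (human motion, object), per-step bilinear
    fusion, then a global LSTM that learns high-level HOI information."""

    def __init__(self, human_dim, obj_dim, hid_dim, n_classes):
        super().__init__()
        self.human_lstm = AttentiveLSTM(human_dim, hid_dim)
        self.obj_lstm = AttentiveLSTM(obj_dim, hid_dim)
        self.fuse = nn.Bilinear(hid_dim, hid_dim, hid_dim)
        self.global_lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.cls = nn.Linear(hid_dim, n_classes)

    def forward(self, human_feats, obj_feats):
        h = self.human_lstm(human_feats)        # (B, T, H)
        o = self.obj_lstm(obj_feats)            # (B, T, H)
        fused = self.fuse(h, o)                 # bilinear fusion per step
        g, _ = self.global_lstm(fused)
        return self.cls(g[:, -1])               # logits from the last state


model = HierarchicalHOI(human_dim=512, obj_dim=512, hid_dim=256, n_classes=157)
human = torch.randn(2, 16, 512)                 # 2 clips, 16 frames of human features
obj = torch.randn(2, 16, 512)                   # matching per-frame object features
print(model(human, obj).shape)                  # torch.Size([2, 157])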
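The distillation step of the second approach can be sketched in the same spirit. The snippet below pairs a toy frame-level ("local") graph with a toy video-level ("global") graph and distills their class predictions into each other using the standard soft-target KL loss; the single-layer graph module, the fixed adjacency, and the bidirectional scheme are assumptions for illustration, not the thesis's design.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphLayer(nn.Module):
    """One round of message passing: relu(A @ X @ W) with a fixed adjacency."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, x, adj):                  # x: (B, N, D), adj: (N, N)
        return torch.relu(adj @ self.w(x))


def distill_loss(student_logits, teacher_logits, tau=2.0):
    """Soft-target KL distillation; tau smooths both distributions."""
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2


# Toy setup: N = 1 human + 3 object nodes, D-dim features, C classes.
B, N, D, C = 2, 4, 64, 157
adj = torch.ones(N, N) / N                      # fully connected, normalized

local_graph = GraphLayer(D)                     # frame-level relations
global_graph = GraphLayer(D)                    # video-level relations
head_local = nn.Linear(D, C)
head_global = nn.Linear(D, C)

frame_nodes = torch.randn(B, N, D)              # nodes at one time step
video_nodes = torch.randn(B, N, D)              # nodes pooled over the video

logits_local = head_local(local_graph(frame_nodes, adj).mean(dim=1))
logits_global = head_global(global_graph(video_nodes, adj).mean(dim=1))

# Distill in both directions so each context benefits from its counterpart.
loss = distill_loss(logits_local, logits_global.detach()) \
     + distill_loss(logits_global, logits_local.detach())
print(loss.item())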
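Finally, the STIT design as described factorizes attention into a spatial stage over per-frame human/object tokens and a temporal stage over frame summaries. A hedged sketch using stock PyTorch transformer encoders follows; layer counts, pooling choices, and dimensions are guesses, and the thesis's alternative hierarchy designs are not modeled.

import torch
import torch.nn as nn


class STIT(nn.Module):
    """Spatial encoder attends over the human/object tokens within each
    frame; a temporal encoder then attends over per-frame summaries to
    capture long-term dependencies across frames."""

    def __init__(self, dim=256, heads=4, n_classes=157):
        super().__init__()
        spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial, num_layers=2)
        self.temporal = nn.TransformerEncoder(temporal, num_layers=2)
        self.cls = nn.Linear(dim, n_classes)

    def forward(self, tokens):                  # tokens: (B, T, N, D)
        B, T, N, D = tokens.shape
        x = self.spatial(tokens.reshape(B * T, N, D))  # local context per frame
        frame = x.mean(dim=1).reshape(B, T, D)         # one summary per frame
        video = self.temporal(frame)                   # long-term relations
        return self.cls(video.mean(dim=1))             # clip-level logits


model = STIT()
tokens = torch.randn(2, 16, 5, 256)             # 2 clips, 16 frames, 5 human/object tokens
print(model(tokens).shape)                      # torch.Size([2, 157])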
dc.format.extent: 129
dc.identifier.uri: https://hdl.handle.net/20.500.14154/69154
dc.language.iso: en
dc.publisher: Saudi Digital Library
dc.subject: Human-Object Interactions (HOIs)
dc.subject: Knowledge Distillation (KD)
dc.subject: Global Context
dc.subject: Local Context
dc.subject: Attention
dc.subject: Spatio-Temporal
dc.subject: Long-Term Dependencies
dc.title: Recognizing Human-Object Interactions in Videos
dc.type: Thesis
sdl.degree.department: Computer Science
sdl.degree.discipline: Computer Vision, Human-Object Interactions
sdl.degree.grantor: Durham University
sdl.degree.name: Doctor of Philosophy
