Authors: Gaizauskas, Rob; Alsunaidi, Abdulsalam
Dates: 2024-06-23; 2023-08-05
URI: https://hdl.handle.net/20.500.14154/72320

Artificial intelligence has long sought to develop agents capable of perceiving the complex visual environment around us and communicating about it using natural language. In recent years, significant strides have been made towards this objective, particularly in the field of image content description. For instance, current artificial systems can classify images of a single object with a level of accuracy that is sometimes comparable to that of humans. Although there has been remarkable progress in recognising objects, there has been less headway in action recognition, owing to a significant limitation in the current approach. Most advances in visual recognition rely on classifying images into distinct and non-overlapping categories. While this approach may work well in many contexts, it is inadequate for understanding actions: it constrains the categorisation of an action to a single interpretation, thereby preventing an agent from proposing multiple possible interpretations. To tackle this fundamental limitation, this thesis proposes a framework that seeks to describe action-depicting images using multiple verbs, and to expand the vocabulary used to describe such images beyond the limitations of the training dataset. In particular, the framework leverages lexical embeddings as a supplementary tool to go beyond the verbs that are supplied as explicit labels for images in datasets used for supervised training of action classifiers. More specifically, these embeddings are used to represent the target labels (i.e., verbs). By exploiting richer representations of human actions, this framework has the potential to improve the capability of artificial agents to accurately recognise and describe human actions in images. In this thesis, we focus on the representation of input images and target labels.
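The core idea above — representing verb labels as points in a lexical embedding space so that a model can rank multiple verbs for one image, including verbs never supplied as training labels — can be sketched as follows. This is a minimal illustration, not the thesis's actual pipeline: the toy vectors, function names, and the "regressed" image embedding are all invented for the example (in practice the embeddings would come from pretrained lexical representations such as word2vec or GloVe).

```python
import numpy as np

# Toy verb embeddings (invented 3-d vectors for illustration only).
# "sprint" stands in for a verb absent from the training labels but
# present in the lexical embedding vocabulary.
verb_embeddings = {
    "run":    np.array([0.90, 0.10, 0.00]),
    "sprint": np.array([0.85, 0.20, 0.00]),  # unseen during training
    "jump":   np.array([0.10, 0.90, 0.20]),
    "sit":    np.array([0.00, 0.10, 0.90]),
}

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_verbs(predicted_embedding, vocabulary, top_k=2):
    """Rank every verb in the (open) vocabulary by cosine similarity
    to the embedding the image model regressed, so an image can be
    described by several verbs rather than a single class label."""
    scored = sorted(vocabulary.items(),
                    key=lambda kv: cosine(predicted_embedding, kv[1]),
                    reverse=True)
    return [verb for verb, _ in scored[:top_k]]

# Suppose the image model regressed this vector for a running person.
predicted = np.array([0.88, 0.15, 0.05])
print(predict_verbs(predicted, verb_embeddings))  # prints ['run', 'sprint']
```

Because the target is a point in embedding space rather than an index into a fixed label set, nearby verbs ("sprint") surface alongside the labelled one ("run"), which is what lets the framework's vocabulary extend beyond the training dataset.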
We examine various components for both elements, ranging from commonly used off-the-shelf options to custom-designed ones tailored to the task at hand. By carefully selecting and evaluating these components, we aim not only to improve the accuracy and effectiveness of the proposed framework but also to gain deeper insight into the potential of distributed lexical representations for action prediction in images.

Pages: 144
Language: en
Subject: Natural Language Processing
Title: Predicting Actions in Images using Distributed Lexical Representations
Type: Thesis