Representing Automatically Generated Topics

Abstract

Topic models are widely used in natural language processing (NLP). Making their output interpretable is an important area of research with a wide range of applications, such as the enhancement of exploratory search interfaces. Conventionally, topics are represented by their most probable words. However, these representations are often difficult for humans to interpret. Evaluating representations presents further challenges. Ideally, humans would judge the quality of the topics, but this is not always feasible in practice. This thesis addresses the limitations related to the output of the topic model in three ways.

First, it proposes and explores a range of alternative representations of topics by re-ranking topic words. Re-ranking adjusts the weights of the words and aims to identify the informative words in each topic. This approach is a straightforward remedy, as topics tend to contain "noisy" words. Additionally, two approaches to evaluating the topics are proposed: (1) an automatic approach based on a document retrieval task; and (2) a crowdsourcing task. Both approaches demonstrate that re-ranking words improves topic interpretability. In addition, two alternative visual forms of the topic are explored, and a simple list of words proves more useful than a word cloud.

Second, the thesis introduces a new approach to assigning short descriptive labels to topics. Labelling topics is an important task that aims to improve access to large document collections. Previous work on the automatic assignment of labels to topics has relied on a two-stage approach: (1) retrieve candidate labels from a large pool; and then (2) re-rank the candidate labels. However, these approaches can only assign candidate labels from a restricted set that may not include any suitable ones. The new approach uses a neural sequence-to-sequence model to generate labels and so does not have this limitation. In addition, two new synthetic datasets of topic-label pairs are created to train the models.

Third, this thesis conducts an empirical study of the proposed labelling approaches and performs quantitative and qualitative analyses of the generated labels. The labels are evaluated against gold labels that were rated by humans, and also against the topics themselves. The proposed approaches generate appropriate labels that are coherent and relevant to the topics.
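To make the idea of re-ranking topic words concrete, the following is a minimal sketch and not the method used in the thesis. It assumes one simple weighting, chosen here only for illustration: each word's probability in a topic is divided by the sum of its probabilities across all topics, so that words shared by many topics are pushed down the list. The function name `rerank_topic_words` and the toy topics are hypothetical.

```python
def rerank_topic_words(topic_word_probs, top_n=3):
    """Re-rank the words of each topic (illustrative assumption only).

    topic_word_probs: list of dicts mapping word -> p(word | topic).
    Returns, for each topic, its top_n words after re-ranking.
    """
    # Sum each word's probability over all topics.
    totals = {}
    for topic in topic_word_probs:
        for word, prob in topic.items():
            totals[word] = totals.get(word, 0.0) + prob

    reranked = []
    for topic in topic_word_probs:
        # Words that are probable in many topics receive a lower score.
        scores = {word: prob / totals[word] for word, prob in topic.items()}
        # Sort by descending score; break ties alphabetically.
        ordered = sorted(scores, key=lambda word: (-scores[word], word))
        reranked.append(ordered[:top_n])
    return reranked


# Toy topics: "said" and "year" are probable in both, so they are "noisy".
topics = [
    {"said": 0.30, "game": 0.25, "team": 0.20, "year": 0.25},
    {"said": 0.30, "market": 0.25, "bank": 0.20, "year": 0.25},
]

for words in rerank_topic_words(topics, top_n=2):
    print(words)
```

In this toy input, ranking by raw probability would place "said" first in both topics, whereas the re-ranked lists favour words that are specific to a single topic.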

Copyright owned by the Saudi Digital Library (SDL) © 2025