Evaluating Text Summarization with Goal-Oriented Metrics: A Case Study using Large Language Models (LLMs) and Empowered GQM
Date
2024-09
Publisher
University of Birmingham
Abstract
This study evaluates the performance of Large Language Models (LLMs) in dialogue
summarization tasks, focusing on Gemma and Flan-T5. Employing a mixed-methods
approach, we utilized the SAMSum dataset and developed an enhanced Goal-Question-Metric
(GQM) framework for comprehensive assessment. Our evaluation combined traditional
quantitative metrics (ROUGE, BLEU) with qualitative assessments performed by GPT-4,
addressing multiple dimensions of summary quality. Results revealed that Flan-T5 consistently
outperformed Gemma across both quantitative and qualitative metrics. Flan-T5 excelled in
lexical overlap measures (ROUGE-1: 53.03, BLEU: 13.91) and demonstrated superior
performance in qualitative assessments, particularly in conciseness (81.84/100) and coherence
(77.89/100). Gemma, while showing competence, lagged behind Flan-T5 in most metrics. This
study highlights the effectiveness of Flan-T5 in dialogue summarization tasks and underscores
the importance of a multi-faceted evaluation approach in assessing LLM performance. Our
findings suggest that future developments in this field should focus on enhancing both lexical
fidelity and higher-level qualities such as coherence and conciseness. This study contributes to
the growing body of research on LLM evaluation and offers insights for improving dialogue
summarization techniques.
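As an illustration of the lexical overlap measures mentioned in the abstract, the following is a minimal sketch of a ROUGE-1 F1 computation. It assumes lowercased, whitespace-split tokens and no stemming; the function name rouge1_f1 and the example sentences are hypothetical, and the study's actual scoring implementation is not stated here, so its numbers may come from a different configuration.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary.

    Assumption: tokens are lowercased and split on whitespace, with no stemming.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and model summaries, for illustration only.
reference = "Amanda baked cookies and will bring Jerry some tomorrow"
candidate = "Amanda will bring Jerry cookies"
score = rouge1_f1(reference, candidate)
print(f"ROUGE-1 F1: {100 * score:.2f}")
```

Under these assumptions the score lies between 0 and 1; multiplying by 100 puts it on the scale that figures such as 53.03 appear to use.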
Keywords
Artificial Intelligence, Large Language Models, Goal-Question-Metric, Natural Language Processing, Software Engineering