Evaluating Text Summarization with Goal-Oriented Metrics: A Case Study using Large Language Models (LLMs) and Empowered GQM

No Thumbnail Available

Date

2024-09

Journal Title

Journal ISSN

Volume Title

Publisher

University of Birmingham

Abstract

This study evaluates the performance of Large Language Models (LLMs) in dialogue summarization tasks, focusing on Gemma and Flan-T5. Employing a mixed-methods approach, we utilized the SAMSum dataset and developed an enhanced Goal-Question-Metric (GQM) framework for comprehensive assessment. Our evaluation combined traditional quantitative metrics (ROUGE, BLEU) with qualitative assessments performed by GPT-4, addressing multiple dimensions of summary quality. Results revealed that Flan-T5 consistently outperformed Gemma across both quantitative and qualitative metrics. Flan-T5 excelled in lexical overlap measures (ROUGE-1: 53.03, BLEU: 13.91) and demonstrated superior performance in qualitative assessments, particularly in conciseness (81.84/100) and coherence (77.89/100). Gemma, while showing competence, lagged behind Flan-T5 in most metrics. This study highlights the effectiveness of Flan-T5 in dialogue summarization tasks and underscores the importance of a multi-faceted evaluation approach in assessing LLM performance. Our findings suggest that future developments in this field should focus on enhancing lexical fidelity and higher-level qualities such as coherence and conciseness. This study contributes to the growing body of research on LLM evaluation and offers insights for improving dialogue summarization techniques.

Description

Keywords

Artificial Intelligent g, Large Language Models, Goal-Question-Metric, Natural language processing, Software Engineering

Citation

Endorsement

Review

Supplemented By

Referenced By

Copyright owned by the Saudi Digital Library (SDL) © 2025