Multimodal Speech Emotion Recognition Based on Audio and Text Information

Date

2024

Publisher

Saudi Digital Library

Abstract

Emotion recognition is inherently complex because it relies on multiple modalities. While humans can infer emotions despite misalignments between these modalities, automatic Speech Emotion Recognition (SER) systems lack such intuition, which limits their ability to interpret emotions accurately and reliably. This thesis addresses this challenge by developing a multimodal SER framework that integrates speech and text, resolves cross-modal discrepancies, and enhances the interpretability of emotion predictions. To achieve this aim, the research sets out to improve fusion strategies, reduce the ambiguity arising from modality mismatches, and enable more context-aware emotion modelling through advanced learning techniques. In line with these goals, the thesis presents three contributions. Firstly, it proposes a hierarchical classification framework for SER that processes audio and text independently, employing a novel late fusion method to improve recognition accuracy. This approach evaluates emotional cues across multiple levels, providing insights into the relative significance of each modality. Secondly, the thesis adapts this framework to specifically address scenarios where emotionally charged text is paired with neutral speech, a modality discrepancy that frequently contributes to misclassification. By integrating text as a supportive modality, the adapted framework improves the system's ability to recognise emotional patterns that are often obscured by neutral tones. Finally, a novel augmentation strategy using Artificial Intelligence (AI) voice cloning is introduced to address modality mismatches. This approach generates augmented samples of neutral speech paired with emotional text, enabling the model to learn from such conflicts. Supervised Contrastive Learning (SCL), incorporating the augmentation strategy, is then applied to improve the model's capacity to manage variability and inconsistencies in real-world emotional data. This research emphasises the importance of integrating speech and text modalities in SER, addressing modality discrepancies, improving the interpretability of emotion predictions, and enriching emotional representations. The experimental results demonstrate the effectiveness of the proposed models in classifying ambiguous emotional samples, managing modality mismatches, and improving contextual understanding. These findings underscore the effectiveness of hierarchical modelling, text integration, and AI-driven augmentation, advancing the performance and reliability of SER systems over existing approaches.
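For illustration only, the late-fusion idea described in the abstract can be sketched as a weighted combination of the per-class probabilities produced by independent audio and text classifiers. The label set, fusion weight, and probability values below are hypothetical and do not reproduce the thesis' hierarchical fusion method; they simply show how a neutral-sounding audio branch and an emotionally charged text branch can be reconciled at decision level.

```python
# Minimal late-fusion sketch (illustrative; not the thesis' actual method).
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # example label set

def late_fuse(p_audio: np.ndarray, p_text: np.ndarray, w_audio: float = 0.5) -> np.ndarray:
    """Weighted average of per-class probabilities from the two modalities."""
    fused = w_audio * p_audio + (1.0 - w_audio) * p_text
    return fused / fused.sum()  # renormalise to a probability distribution

# Example: neutral-sounding speech paired with emotionally charged text.
p_audio = np.array([0.10, 0.15, 0.60, 0.15])  # audio branch favours "neutral"
p_text  = np.array([0.05, 0.70, 0.15, 0.10])  # text branch favours "happy"
fused = late_fuse(p_audio, p_text)
print(EMOTIONS[int(np.argmax(fused))])  # fused decision resolves the mismatch ("happy")
```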

Keywords

Speech Emotion Recognition, Multimodal Speech Emotion Recognition Based on Audio and Text Information, AI voice cloning, human-computer interaction, Supervised Contrastive Learning, Multimodal fusion
