Toward Robust Mental Health Classification Systems Across Genres and Languages
Date
2026
Publisher
Saudi Digital Library
Abstract
Mental health conditions are a major global public health challenge, yet many individuals do not receive appropriate care because of stigma, limited access to services, and the difficulty of accurate assessment. Natural Language Processing (NLP) has shown growing promise for identifying mental health conditions through language, but existing systems often struggle to generalize across modalities, domains, conditions, and languages.
Existing approaches leave critical gaps in condition specificity, cross-genre robustness, and multilingual coverage. Prior work often studies isolated features or a single modality, leaving open how language markers behave across both writing and speech for the same condition. Condition-specific continual pretraining remains underexplored relative to generic mental health adaptation. Cross-condition transfer from clinically comorbid disorders has been proposed but rarely validated. Finally, the field remains overwhelmingly English-centric: Arabic is among the most underserved languages despite having more than 400 million speakers.
This dissertation addresses these gaps through a progression from interpretable linguistic analysis to multilingual evaluation, using schizophrenia as a core case study. We first present an integrated analysis of cohesion features, pragmatic cues, and language model-based measures across clinical speech and writing, showing that patients exhibit heightened fear, higher neuroticism, reduced specificity, and lower cohesion, with effects generally stronger in writing. We then evaluate these signals through supervised classification, finding that cohesion is the strongest standalone structured feature view in writing, while a TF-IDF lexical baseline dominates in speech. Moving to neural modeling, we show that progressive multi-stage continual training of BERT on patient-generated social media achieves an 11.7% relative F1 improvement over base BERT and outperforms MentalBERT and ClinicalBERT for schizophrenia detection. We then demonstrate that focused cross-condition transfer outperforms broad mental health pretraining, with StressRoBERTa achieving 82% F1 on the SMM4H 2022 stress detection benchmark.
To extend mental health NLP beyond English, we introduce ArMHC, a large-scale Arabic mental health corpus from X (formerly Twitter) constructed through a dialect-aware extraction pipeline with LLM-based validation, covering 18 conditions across 1,911 users. Using the ArMHC schizophrenia subset, we evaluate cross-lingual and cross-genre transfer from English clinical data to Arabic social media, finding that both language and genre mismatch contribute substantially to transfer degradation, with genre mismatch being qualitatively more destructive: cross-lingual same-genre transfer still permits partial detection, while cross-genre transfer falls below chance.
Overall, this dissertation demonstrates that robust mental health NLP benefits from combining interpretable linguistic analysis with domain-adaptive and transfer-based modeling, while expanding into low-resource multilingual settings. The findings contribute new linguistic evidence, modeling strategies, and dataset resources for building more inclusive and clinically relevant computational approaches to mental health assessment.
Keywords
Computer science, Artificial intelligence, Natural Language Processing, Mental health
