Saudi Cultural Missions Theses & Dissertations

Permanent URI for this community: https://drepo.sdl.edu.sa/handle/20.500.14154/10

  • Item (Restricted)
    A Novel Emoji-Aware Computational Framework for Body-Shaming Detection in Gulf Arabic Social Media Discourse
    (Saudi Digital Library, 2026) Albluwi, Abeer; Rizk, Dominick
    Online body-shaming has become a prevalent form of appearance-based harm in Gulf Arabic social media, particularly on TikTok and Instagram. In these environments, harmful content is rarely explicit. It is carried through indirect and culturally embedded forms of expression, including sarcastic religious expressions and emojis whose meaning is often context-dependent and inverted. Despite the scale of the problem, no prior Arabic NLP work had addressed body-shaming as a standalone classification task, and no dataset existed to support it. This dissertation introduces GABSD-E (Gulf Arabic Body-Shaming Dataset, Emoji-Enriched), the first annotated corpus built specifically for this task. The dataset contains 24,988 comments from TikTok and Instagram, labeled under a three-class taxonomy: Body-Shaming (BS), General Bullying (B), and Not Bullying (NB), with Fleiss' kappa = 0.87. Exploratory analysis confirmed that 42.9% of comments contain at least one emoji, and that the BS class has the highest emoji density relative to class size, establishing that emojis are structurally embedded in how body-shaming is communicated, not incidental to it. Building on this, the dissertation proposes a novel emoji-aware representation framework that treats emojis as culturally grounded semiotic units rather than preprocessing noise. The framework consists of three components: a semantic emoji tagging layer based on a manually constructed Gulf Arabic emoji dictionary; a contextual lexicon injection layer for culturally specific expressions; and a preservation-over-deletion preprocessing pipeline that reverses the standard Arabic NLP practice of emoji removal. Five Arabic transformer models were fine-tuned and evaluated under this framework. 
    SaudiBERT achieved the best performance, with a mean Macro-F1 of 0.9519 ± 0.0030 and accuracy of 95.27% ± 0.30 across five random seeds, substantially above all classical baselines (best classical Macro-F1: 0.6906). Ablation results confirmed that emoji removal caused the largest per-condition performance drop (−0.0589 Macro-F1), with consistent degradation observed across all classes, indicating that emoji signals provide complementary contextual cues that improve discrimination across categories. Semantic enrichment was the most critical component for the BS class (ΔF1-BS = +0.0775). An exploratory benchmark of four large language models showed that the best LLM under few-shot prompting (GPT-4o, Macro-F1 = 0.9278) still fell 0.0279 points below the fine-tuned SaudiBERT, confirming that model scale alone does not substitute for culturally grounded domain-specific modeling. The dissertation contributes GABSD-E as a reusable dataset and benchmark framework, an annotation methodology applicable to related harm categories, and a representation framework that generalizes to any Arabic NLP task where emoji pragmatics carry discriminative weight.
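The three framework components named in the abstract (semantic emoji tagging, contextual lexicon injection, and preservation-over-deletion preprocessing) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the dictionary entries, the tag names such as `[EMO_MOCKERY]`, and the function names are all hypothetical placeholders, since the actual Gulf Arabic emoji dictionary and lexicon are not reproduced in this listing.

```python
import re

# Hypothetical entries for illustration only; the dissertation's manually
# constructed Gulf Arabic emoji dictionary is not available here.
EMOJI_TAGS = {
    "\U0001F602": "[EMO_MOCKERY]",        # face with tears of joy
    "\U0001F437": "[EMO_ANIMAL_INSULT]",  # pig face
}

# Hypothetical lexicon of culturally specific expressions.
LEXICON_TAGS = {
    "ما شاء الله": "[LEX_RELIGIOUS_EXPR]",
}


def tag_emojis(text: str) -> str:
    """Keep each emoji in place and append its semantic tag after it.

    Assumption: only single-code-point emojis are handled; ZWJ sequences
    and skin-tone modifiers would need a proper emoji segmenter.
    """
    pieces = []
    for ch in text:
        pieces.append(ch)
        tag = EMOJI_TAGS.get(ch)
        if tag is not None:
            pieces.append(f" {tag} ")
    return "".join(pieces)


def inject_lexicon(text: str) -> str:
    """Append a tag after every occurrence of a known expression."""
    for phrase, tag in LEXICON_TAGS.items():
        text = text.replace(phrase, f"{phrase} {tag}")
    return text


def preprocess(text: str) -> str:
    """Preservation-over-deletion: emojis are tagged, never removed."""
    enriched = inject_lexicon(tag_emojis(text))
    # Collapse the extra whitespace introduced by tagging.
    return re.sub(r"\s+", " ", enriched).strip()


if __name__ == "__main__":
    print(preprocess("ما شاء الله \U0001F602"))
```

The enriched string, with the original emoji still present alongside its tag, would then be passed to the tokenizer of whichever transformer model is being fine-tuned. In the emoji-removal ablation the abstract describes, the emoji characters would instead be stripped before tokenization.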

Copyright owned by the Saudi Digital Library (SDL) © 2026