Auditing Machine Learning Models Beyond Predefined Groups: A Multi-Level Framework for Systematic Analysis of Performance Disparities

Date

2026

Publisher

Saudi Digital Library

Abstract

This dissertation develops and validates a multilevel framework for detecting disparities in machine learning models used to predict substance use disorder (SUD) treatment outcomes. Aim One develops the first component of the framework, using permutation-based feature importance to create subgroups from the top influential predictors. Results show that all three predictive models—logistic regression, random forest, and gradient boosting—exhibited substantial marginal performance differences across feature-defined subgroups, particularly in recall and fairness-related metrics, revealing that individual influential features can induce unequal model performance. Aim Two extends disparity detection by applying hierarchical clustering to the same influential features, uncovering latent subgroups defined by multivariate structure. Cluster-level analyses reveal heterogeneous performance patterns, including clusters with systematically lower recall, calibration inconsistencies, and elevated fairness-metric deviations. Interaction-level evaluation shows that many disparities are context-dependent and appear only within specific cluster–feature combinations. Aim Three statistically evaluates the stability and sources of these disparities across repeated model runs. Feature-level disparities are consistently significant but largely unaffected by subgroup size imbalance, while cluster-level disparities show selective detectability and, in some cases, dependence on representation imbalance. Interaction-level tests isolate the most persistent, context-stable disparities that remain significant after accounting for latent subgroup structure, demonstrating the necessity of multilayer evaluation. The dissertation concludes that a unified, multilevel disparity detection framework—spanning marginal, latent, and conditional subgroup definitions—is essential for identifying reliable and actionable performance gaps in healthcare predictive models.
This approach provides a scalable and reproducible path toward more equitable machine-learning-based evaluation of SUD treatment.
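The first component of the framework (Aim One) can be sketched in Python. This is a minimal illustration, not the dissertation's actual pipeline: the dataset is synthetic, the median split and the choice of the three top-ranked features are assumptions, and only recall is compared across the resulting feature-defined subgroups.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for SUD treatment-outcome data.
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rank features by permutation importance on held-out data.
imp = permutation_importance(model, X_test, y_test,
                             n_repeats=10, random_state=0)
top = np.argsort(imp.importances_mean)[::-1][:3]

# For each top feature, split the test set at its median value and
# compare recall between the two feature-defined subgroups.
preds = model.predict(X_test)
for f in top:
    hi = X_test[:, f] >= np.median(X_test[:, f])
    r_hi = recall_score(y_test[hi], preds[hi])
    r_lo = recall_score(y_test[~hi], preds[~hi])
    print(f"feature {f}: recall gap = {abs(r_hi - r_lo):.3f}")
```

In the dissertation the subgroup definitions and metrics are richer (calibration and fairness-related measures as well as recall); the sketch only shows the mechanics of turning influential features into marginal subgroups.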
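The latent-subgroup step (Aim Two) can likewise be sketched. Again this is an assumed, simplified setup: synthetic data, placeholder indices for the influential features, and an arbitrary choice of four Ward-linkage clusters; the point is only to show hierarchical clustering on influential features followed by per-cluster performance comparison.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; assume the influential features are already known.
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
preds = model.predict(X_test)

top_features = [0, 1, 2]  # placeholder indices for influential predictors

# Hierarchical (agglomerative) clustering on the influential features only.
Z = StandardScaler().fit_transform(X_test[:, top_features])
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(Z)

# Compare recall across the latent subgroups uncovered by clustering.
for c in range(4):
    mask = labels == c
    print(f"cluster {c}: n={mask.sum()}, "
          f"recall={recall_score(y_test[mask], preds[mask]):.3f}")
```

Interaction-level evaluation would then condition on both a cluster label and a feature-defined split, which is where the abstract notes many disparities become visible.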

Keywords

Disparity detection, Multilevel evaluation framework, Machine learning fairness, Permutation-based feature importance, Feature-defined subgroups, Latent subgroup analysis, Context-dependent performance gaps


Copyright owned by the Saudi Digital Library (SDL) © 2026