Using Aqueous Solubility to Test the Robustness and Limitations of Machine Learning Predictive Models

Date

2022

Abstract

The application of artificial intelligence to chemistry has grown in recent years, and machine learning (ML) models have proven effective in drug and materials discovery. Researchers in many domains use ML to tackle complex problems that humans cannot readily solve, and such models are having a large impact on drug discovery. In this study, several ML models for solubility prediction are developed from molecular fingerprints and physical descriptors in order to investigate how many data points are required to build robust ML models for drug discovery tasks. A random forest (RF) regressor trained on 46 principal component analysis (PCA) components was the best-performing model, with an R² score of 0.81 and a root mean square error (RMSE) of 0.91. To assess performance on subsets of the complete data, the study also compared ML models built on three clusters formed by grouping the solutes in the dataset according to structural similarity. The features that contributed most to the RF regressor were extracted and ranked by their importance in model training. To understand how much data is needed to train an ML model for solubility prediction, an RF regressor and a support vector machine were each trained on increasing numbers of training examples.
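
The two scores quoted in the abstract, R² and root mean square error, can be illustrated with a minimal sketch. It uses only the Python standard library and is not the study's code. The function names and the four solubility values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical measured and predicted solubility values (illustration only,
# not taken from the study's dataset).
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.0, -2.5, -4.0]

r2 = r2_score(y_true, y_pred)
err = rmse(y_true, y_pred)
print(f"R2 = {r2:.2f}, RMSE = {err:.3f}")
```

R² is unitless and compares the model with a baseline that always predicts the mean, whereas RMSE is expressed in the units of the predicted quantity, so the two figures are usually read together.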

Keywords

pharmacy, machine learning

Citation

Harvard

Copyright owned by the Saudi Digital Library (SDL) © 2025