Predicting One-Dimensional Protein Structures by Leveraging Pre-Trained Language Models (PLMs) and Deep Learning
Date
2025
Publisher
Saudi Digital Library
Abstract
Proteins are essential biomolecules, and their function is intrinsically linked to their structure.
Understanding this relationship is crucial for advancements in molecular biology, medicine,
and biotechnology. Despite the rapid growth in sequencing data, structural data remains
sparse due to the challenges and costs associated with experimental methods. As a result,
computational protein structure prediction has become essential in bridging the gap between
sequence data and structural understanding.
This thesis focuses on advancing one-dimensional (1D) structural annotations, specifically
secondary structure (SS) and relative solvent accessibility (RSA), by leveraging state-of-the-art
deep learning methodologies. Two novel prediction tools are introduced: Porter6 for SS
prediction and PaleAle6.0 for RSA prediction. Both models utilize pre-trained protein language
models (PLMs) and a convolutional bidirectional recurrent neural network (CBRNN)
architecture, enabling high-accuracy predictions without relying on multiple sequence
alignments. PaleAle6.0 further supports real-valued, binary, and multi-class RSA outputs,
offering enhanced flexibility and performance.
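To illustrate how the three RSA output types relate to one another, the sketch below maps a real-valued RSA in the range 0 to 1 onto a binary label and a multi-class label. It is a minimal illustration, not PaleAle6.0's implementation: the threshold values and the function names (`rsa_to_binary`, `rsa_to_class`, `discretize`) are assumptions chosen for the example, since the abstract does not state which cut-offs the tool uses.

```python
from bisect import bisect_right

# Hypothetical cut-offs for illustration only; the abstract does not specify them.
BINARY_THRESHOLD = 0.25                     # assumed buried/exposed boundary
MULTICLASS_THRESHOLDS = (0.04, 0.25, 0.50)  # assumed boundaries for four classes


def rsa_to_binary(rsa: float, threshold: float = BINARY_THRESHOLD) -> int:
    """Return 1 (exposed) if rsa is at or above the threshold, else 0 (buried)."""
    return int(rsa >= threshold)


def rsa_to_class(rsa: float, thresholds=MULTICLASS_THRESHOLDS) -> int:
    """Return the index of the interval that rsa falls into.

    With three sorted thresholds this yields classes 0..3; a value equal
    to a threshold is assigned to the higher class.
    """
    return bisect_right(thresholds, rsa)


def discretize(rsa_values):
    """Map each real-valued RSA to a (binary label, multi-class label) pair."""
    return [(rsa_to_binary(v), rsa_to_class(v)) for v in rsa_values]


if __name__ == "__main__":
    # One hypothetical RSA value per residue of a short peptide.
    for rsa, labels in zip([0.02, 0.10, 0.30, 0.90], discretize([0.02, 0.10, 0.30, 0.90])):
        print(rsa, labels)
```

A real-valued prediction keeps the most information; the binary and multi-class views are coarser summaries that can be derived from it once thresholds are fixed.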
To promote accessibility and usability, these tools are made available to the research
community through DeepPredict, a web-based platform designed for efficient and scalable
structural predictions. DeepPredict enables users to perform accurate SS and RSA predictions
with minimal computational requirements. This thesis presents a comprehensive evaluation
of PLM-based embeddings, highlights the importance of careful dataset design to avoid bias
and overfitting, and promotes realistic evaluation metrics that consider evolutionary
relationships between proteins.
With its powerful prediction capabilities and user-friendly design, DeepPredict supports a
wide range of applications in drug discovery, synthetic biology, and the understanding of
disease mechanisms, laying a strong foundation for future advancements in computational
biology.
Keywords
Deep learning; 1D protein prediction; Protein databases; Secondary structure; Intrinsic disorder; Solvent accessibility; AlphaFold; Protein language models
