TRANSMEMBRANE PROTEIN RESIDUE CONTACT PREDICTION USING NOVEL MACHINE LEARNING APPROACHES AND PRETRAINED PROTEIN LANGUAGE MODELS

Li, Liao
Almalki, Bander
2025-09-07
2025
Almalki, B. (2025). Transmembrane Protein Residue Contact Prediction Using Novel Machine Learning Approaches and Pretrained Protein Language Models (Doctoral dissertation, University of Delaware).
https://hdl.handle.net/20.500.14154/76328

Transmembrane (TM) proteins make up ~20–30% of human protein-coding genes and are essential for transport and signaling, but their structures are hard to determine experimentally. This dissertation develops novel machine/deep learning frameworks to improve contact map prediction and 3D structure modeling of α-helical TM proteins. Contributions include: (1) a transductive SVM with early stopping (TSVM-ES) improving helix–helix contact prediction; (2) AT-TSVM, transferring structural features to boost accuracy; (3) TMHC-MSA, leveraging protein language models for superior contact maps; (4) TMH-ID, a model for predicting dimer interface residues, outperforming multimer baselines; and (5) ICM-MD, combining feed-forward and GCN architectures trained on molecular dynamics–based dimers, achieving state-of-the-art performance in inter-chain contact prediction.

Transmembrane (TM) proteins represent a significant portion of all known proteins and play a crucial role in many biological processes, such as facilitating the transport of molecules, ions, and information between a cell and its external environment. It is estimated that they account for approximately 20-30% of all protein-coding genes in humans. Despite their abundance and importance, the structures of only a very small portion have been determined experimentally, because well-ordered crystals are difficult to obtain and in vitro experiments are costly. This difficulty can hinder the understanding of their function, so computational tools for determining their structure become essential.
However, compared to globular proteins, computationally determining the structure of TM proteins can be harder because few high-resolution structures are available to serve as templates and training examples for structural modeling and prediction. Residue contact prediction is one of the most successful computational approaches to reduce the huge search space for the TM protein fold and generate a high-quality 3D structure. Determining the structure of the protein can reveal invaluable information about its function. In this work, we explore the current advances in the field, investigate the effectiveness of different learning approaches, and propose and develop novel machine/deep learning techniques for generating a high-quality contact map for both inter-chain and intra-chain residues in alpha-helical TM proteins. In addition, we assess the accuracy of the proposed models and show that they can enhance contact map prediction and ultimately produce a better 3D structure.

In the first chapter, we investigate the contact map prediction task of alpha-helical TM proteins from a different angle using a transductive learning approach. The interactions between helices within the membrane greatly affect their tilt angle and relative position, and thus the overall protein structure. We utilize transductive learning, incorporating the unlabeled test data during training, to address the scarcity of labeled data that is common in amino acid residue contact prediction and to improve model accuracy. Using features derived from protein structures, we compare the predictive performance of transductive support vector machines (TSVMs) and inductive SVMs in identifying helix-helix residue contacts, with the aim of determining the specific conditions and limitations under which TSVMs excel. Then, we explore potential solutions to mitigate the performance degradation of the transductive model.
We introduce an early-stopping technique, TSVM-ES, that produces a more accurate model, outperforming the state-of-the-art TSVM by 5% when tested on a set of transmembrane protein benchmarks.

In the second chapter, we investigate the feasibility of incorporating structural features into the classifier. Most current TM protein residue contact predictors rely solely on features extracted from protein sequences to predict residue contacts. However, using these features alone leads to a low-accuracy contact map and, subsequently, to a poor 3D structure. Other models attempt to exploit features extracted from 3D protein structures to produce a more representative contact model. Nevertheless, this approach is not applicable in real-life scenarios where the structure is not available during the model testing phase, making it a chicken-and-egg dilemma. Therefore, we propose a novel approach in which the model trains on atomic features extracted from known TM protein 3D structures and transfers this knowledge to the test data, which lack atomic features, to improve TM protein residue contact prediction. Our proposed method, AT-TSVM, employs transductive support vector machines with transfer and active learning to improve contact prediction accuracy. The results indicate that our proposed model can boost the accuracy of contact prediction by an average of 5 to 6% over the inductive classifier and 3 to 4% over the transductive classifier.

In the third chapter, we utilize large protein language models to generate an accurate contact map for alpha-helical TM proteins. The majority of previous studies employ techniques that rely on statistical analysis of the sequence to infer connections between residues. A few recent techniques, which are based on natural language processing models, have been successful in achieving this goal.
Nevertheless, the majority of these techniques and models are designed for globular proteins and are not tailored for specific protein types such as transmembrane proteins. Therefore, we propose a Transmembrane Protein Helices Contacts predictor (TMHC-MSA) that utilizes features extracted by a protein language model (MSA Transformer) and incorporates neighborhood information using a feature window to enhance the quality of the produced contact map. Our proposed model outperforms the state-of-the-art method by an average of 7% in terms of L precision and surpasses the MSA Transformer by an average of 2.5% on the same metric. Furthermore, we demonstrate that the more accurate contact map produced by our model can be used to generate a more accurate 3D structure.

In the fourth chapter, we explore the dimerization of bitopic TM proteins. Most bitopic transmembrane proteins associate with each other to stabilize their structure by forming dimers. This association leads to the activation of specific downstream cellular functions. Therefore, being able to accurately identify interface residues in a given dimer is important for understanding its function, and has been a challenging pursuit of many computational methods. In this chapter, we break down the dimerization residue contact prediction into two tasks. In the first task, we propose a model that leverages structural features extracted from the field of molecular dynamics alongside other features from various domains to predict interface residues in α-helical TM dimers. The accurate prediction of interface residues has potential applications in pharmaceutical drug design. The results reveal key limitations in the ability of state-of-the-art multimer models, including AlphaFold2-Multimer and RoseTTAFold2, to precisely identify these interface residues.
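
The L precision metric reported for TMHC-MSA above can be illustrated with a minimal sketch. It assumes the common top-L definition: rank all candidate residue pairs by predicted score, keep the L highest-scoring pairs (L being the sequence length), and report the fraction that are true contacts. The function name, the `min_separation` parameter, and the toy matrices are hypothetical and are not taken from the dissertation, which may restrict candidate pairs differently (for example, to residues on different helices).

```python
from typing import List, Sequence, Tuple


def top_l_precision(
    scores: Sequence[Sequence[float]],
    contacts: Sequence[Sequence[int]],
    min_separation: int = 6,
) -> float:
    """Fraction of the L highest-scoring residue pairs that are true contacts.

    `scores` and `contacts` are square L x L matrices. `min_separation`
    (an assumed, illustrative parameter) skips pairs that are close in sequence.
    """
    length = len(scores)
    # Collect each unordered pair (i, j) once, skipping near-diagonal pairs.
    candidates: List[Tuple[float, int, int]] = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    # Highest predicted score first; ties are broken by index for determinism.
    candidates.sort(key=lambda item: (-item[0], item[1], item[2]))
    top = candidates[:length]
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if contacts[i][j])
    # Divide by the number of pairs actually selected (at most L).
    return hits / len(top)


# Toy 4-residue example with made-up values.
example_scores = [
    [0.0, 0.9, 0.1, 0.8],
    [0.9, 0.0, 0.7, 0.2],
    [0.1, 0.7, 0.0, 0.6],
    [0.8, 0.2, 0.6, 0.0],
]
example_contacts = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]

# With min_separation=1, 2 of the top 4 pairs are true contacts.
example_precision = top_l_precision(example_scores, example_contacts, min_separation=1)
```

Variants such as L/2 or L/5 precision follow by shortening the slice of top-ranked pairs.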
Therefore, we introduce TMH-ID, a novel machine learning model that integrates various sequence-based features, including large protein language model coupling scores and TM-specific motifs, in addition to structure-based features extracted from the structure predicted by PREDDIMER. In particular, our proposed model achieves the highest mean F1 score, outperforming several advanced baselines such as THOIPA, MSA Transformer, and ProteinBERT. Furthermore, TMH-ID outperforms the multimer structure predictors RoseTTAFold2, AlphaFold2-Multimer, and PREDDIMER in interface residue prediction across the Crystal subset.

In the fifth chapter, we explore the prediction of inter-chain residue contacts in α-helical TM homodimers and present ICM-MD, a novel machine learning framework for this task. We propose two models to address the scarcity of available structures necessary to develop an accurate classifier to identify these contacts. This is achieved by training the models on a large database of molecular dynamics simulated dimers (Membranome) and integrating transmembrane-specific sequence and structural features. Our models adopt a residue-pair-centric learning paradigm to address the limited availability of training data and to enhance generalizability to unseen examples. The first model employs a lightweight and interpretable feed-forward neural network architecture that is computationally efficient and scalable. The second model implements a more advanced architecture based on graph convolutional networks (GCNs), enabling the effective integration of information from neighboring residues to capture richer structural context. This is achieved through the message-passing mechanism, which facilitates the exchange of information between contacting residues, thereby enabling each residue to incorporate context from its interaction partners.
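
The message-passing idea described above can be sketched as a single graph convolution step. This is a minimal illustration, not the ICM-MD architecture: it assumes that nodes are residues, that edges link residues taken to be in contact, and that the layer uses the widely used symmetric normalization with self-loops followed by a ReLU. The function names, the toy graph, and the identity weight matrix are all hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so each residue keeps part of its own features.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Symmetric degree normalization keeps feature scales comparable.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    a_norm = [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]
    # Message passing: each residue aggregates its neighbors' features.
    aggregated = matmul(a_norm, features)
    # Linear transform (learned in practice) followed by ReLU.
    transformed = matmul(aggregated, weights)
    return [[max(0.0, value) for value in row] for row in transformed]


# Toy graph: three residues in a chain, 0 - 1 - 2.
example_adjacency = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
# A single feature that is non-zero only on residue 0.
example_features = [[1.0], [0.0], [0.0]]
example_weights = [[1.0]]  # identity transform, chosen for readability

example_output = gcn_layer(example_adjacency, example_features, example_weights)
```

After one layer, the signal placed on residue 0 reaches its direct neighbor, residue 1, but not residue 2; stacking layers extends the context each residue receives from its interaction partners.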
To the best of our knowledge, this is the first study to leverage molecular dynamics–based structural models as a surrogate ground truth for training an inter-chain contact predictor in TM proteins. The results show that the proposed simple model consistently outperforms state-of-the-art models, including DeepHomo1, DeepHomo2, Glinter, and DeepTMP, on multiple evaluation metrics. Moreover, the advanced GCN-based model surpasses all the other models, delivering consistently stable performance across all evaluation metrics.

128
en-US
Residue Contact Prediction
Large Protein Language Models
Dimerization Prediction
Inter-Helical Residue Contact Prediction
Interface Residue Prediction
Transmembrane Proteins
Thesis