Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

A Comparison of Mutual Information, Linear Models and Deep Learning Networks for Protein Secondary Structure Prediction

Author(s): Saida Saad Mohamed Mahmoud, Beatrice Portelli, Giovanni D'Agostino, Gianluca Pollastri, Giuseppe Serra* and Federico Fogolari*

Volume 18, Issue 8, 2023

Published on: 03 July, 2023

Pages: 631-646 (16 pages)

DOI: 10.2174/1574893618666230417103346


Abstract

Background: Over the last several decades, predicting protein structures from amino acid sequences has been a core task in bioinformatics. Nowadays, the most successful methods employ multiple sequence alignments and can predict the structure with excellent performance. These predictions take advantage of all the amino acids at a given position and their frequencies. However, the effect of single amino acid substitutions in a specific protein tends to be hidden by the alignment profile. For this reason, single-sequence-based predictions attract interest even after accurate multiple-alignment methods have become available: the use of single sequences ensures that the effects of substitution are not confounded by homologous sequences.

Objective: This work aims to understand how the single-sequence secondary structure prediction of a residue is influenced by the surrounding residues, and how different prediction methods use single-sequence information to predict the structure.

Methods: We compare mutual information, the coefficients of two linear models, and three deep learning networks. For the deep learning algorithms, we use the DeepLIFT analysis to assess the effect of each residue at each position in the prediction.
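The mutual-information part of this comparison can be sketched as follows. This is an illustrative toy computation, not the authors' code: it estimates the mutual information (in bits) between the amino acid at one window offset and the secondary structure class at the central position, from a hypothetical list of symbol pairs.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) between two aligned symbol streams,
    e.g. the amino acid at a fixed offset and the secondary structure
    class (H/E/C) at the central position."""
    joint = Counter(pairs)                 # empirical joint counts
    x_marg = Counter(x for x, _ in pairs)  # marginal counts of symbol x
    y_marg = Counter(y for _, y in pairs)  # marginal counts of symbol y
    n = len(pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) )
        mi += p_xy * math.log2(p_xy * n * n / (x_marg[x] * y_marg[y]))
    return mi

# Toy data: amino acid at offset -1 paired with the central structure class
pairs = [("A", "H"), ("A", "H"), ("G", "C"),
         ("V", "E"), ("A", "H"), ("G", "C")]
print(round(mutual_information(pairs), 3))  # → 1.459
```

In the paper's setting, each offset in the sequence window would yield one such score, quantifying how informative that position is, on its own, about the central residue's structure.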

Results: Mutual information and linear models quantify direct effects, whereas DeepLIFT applied on deep learning networks quantifies both direct and indirect effects.
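What "linear-model coefficients as direct effects" means can be sketched with toy data (the windows, labels, and helix task below are assumptions for illustration, not the paper's dataset): each (position, amino acid) pair in a one-hot-encoded window gets one coefficient, which is its direct, context-independent contribution to the prediction.

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_window(window):
    """Flatten a residue window into a binary feature vector,
    one slot per (window position, amino acid) pair."""
    v = np.zeros(len(window) * len(AMINO))
    for i, aa in enumerate(window):
        v[i * len(AMINO) + AMINO.index(aa)] = 1.0
    return v

# Toy data: 3-residue windows and a 0/1 label (is the central residue helical?)
windows = ["AAL", "GPG", "ALA", "PGG", "LAA", "GGP"]
labels = np.array([1, 0, 1, 0, 1, 0], dtype=float)
X = np.stack([one_hot_window(w) for w in windows])

# Least-squares fit; coef[i*20 + j] is the direct effect of amino acid
# AMINO[j] at window position i on the prediction.
coef, *_ = np.linalg.lstsq(X, labels, rcond=None)
```

By construction such coefficients capture only direct, per-position effects; DeepLIFT scores on the deep networks, by contrast, can also reflect indirect effects mediated by the surrounding context.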

Conclusion: Our analysis shows how different network architectures use the information of single protein sequences and highlights their differences with respect to linear models. In particular, the deep learning implementations take into account context and single position information differently, with the best results obtained using the BERT architecture.

Keywords: Secondary structure prediction, single sequence, mutual information, linear model, deep learning, neural network, LSTM, BERT.


© 2024 Bentham Science Publishers