Current Genomics

ISSN (Print): 1389-2029
ISSN (Online): 1875-5488

Research Article

Prediction of Deleterious Single Amino Acid Polymorphisms with a Consensus Holdout Sampler

Author(s): Óscar Álvarez-Machancoses, Eshel Faraggi, Enrique J. deAndrés-Galiana, Juan L. Fernández-Martínez and Andrzej Kloczkowski*

Volume 25, Issue 3, 2024

Published on: 14 March, 2024

Pages: 171-184 (14 pages)

DOI: 10.2174/0113892029236347240308054538

Abstract

Background: Single Amino Acid Polymorphisms (SAPs), or nonsynonymous Single Nucleotide Variants (nsSNVs), are the most common genetic variations. They result from missense mutations, in which a single base-pair substitution changes the triplet of bases (codon) at a given position so that it encodes a different amino acid. Since genetic mutations sometimes cause genetic diseases, it is important to understand and predict which variations are harmful and which are neutral (not causing changes in the phenotype). This can be posed as a classification problem.
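
The distinction between a missense substitution and a synonymous one can be illustrated with a minimal sketch. The table below is only a small excerpt of the standard genetic code, and the function name `classify_substitution` is a hypothetical helper introduced for this illustration, not part of the published method.

```python
# Excerpt of the standard genetic code (DNA codons); just enough for the example.
CODON_TABLE = {
    "GAA": "Glu", "GAG": "Glu",
    "GTA": "Val", "GTG": "Val",
    "GAT": "Asp", "GAC": "Asp",
    "TAA": "Stop", "TAG": "Stop",
}


def classify_substitution(codon, position, new_base):
    """Apply a single-base substitution to a codon and classify its effect.

    Returns (mutated_codon, new_amino_acid, kind), where kind is one of
    'synonymous', 'missense', or 'nonsense'.
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        kind = "synonymous"   # same amino acid, no SAP
    elif after == "Stop":
        kind = "nonsense"     # premature stop codon
    else:
        kind = "missense"     # different amino acid: an SAP / nsSNV
    return mutated, after, kind


# Changing the middle base of GAG from A to T swaps Glu for Val.
print(classify_substitution("GAG", 1, "T"))
# Changing the third base of GAG from G to A leaves Glu unchanged.
print(classify_substitution("GAG", 2, "A"))
```

Only the missense case yields a single amino acid polymorphism, which is the type of variant the classifier has to label as deleterious or neutral.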

Methods: Computational methods based on machine learning are gradually replacing repetitive and very expensive mutagenesis experiments. In general, the uneven quality, gaps, and inconsistencies of nsSNV datasets limit the usefulness of artificial intelligence-based methods. Consequently, more robust and accurate approaches are needed to address these problems. In the present work, we present a consensus classifier built on the holdout sampler, which appears robust and accurate and outperforms other popular methods.

Results: We generated 100 holdouts to test different structures and classification variables of several classifiers during the training phase. The best-performing holdouts were selected to build a consensus classifier, which was tested using a k-fold (1 ≤ k ≤ 5) cross-validation method. We also examined which protein properties have the greatest impact on the accurate prediction of the effects of nsSNVs.
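
The workflow described above (random holdouts, selection of the best-performing ones, consensus vote, k-fold evaluation) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the base learner is a nearest-centroid stand-in for the real classifiers, and the 25% holdout fraction, the choice to keep 11 models, and all function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Synthetic two-class data (assumption): label 1 plays the 'deleterious' role."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        data.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return data


def fit_centroids(samples):
    """Nearest-centroid model: a simple stand-in for a real base classifier."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in samples if y == label]
        if rows:  # skip a class that is absent from this split
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))


def accuracy(predict_fn, samples):
    return sum(predict_fn(x) == y for x, y in samples) / len(samples)


def fit_consensus(train, n_holdouts=100, keep=11, holdout_frac=0.25, rng=None):
    """Train one model per random holdout and keep the best-scoring ones."""
    rng = rng or random.Random(0)
    scored = []
    for _ in range(n_holdouts):
        shuffled = train[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * holdout_frac)
        held_out, fit_part = shuffled[:cut], shuffled[cut:]
        model = fit_centroids(fit_part)
        score = accuracy(lambda x: predict(model, x), held_out)
        scored.append((score, model))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [model for _, model in scored[:keep]]


def consensus_predict(models, x):
    """Majority vote over the selected models (odd count avoids ties)."""
    votes = sum(predict(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


def kfold_accuracy(data, k=5, seed=0):
    """Evaluate the consensus classifier with k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        models = fit_consensus(train, rng=rng)
        scores.append(accuracy(lambda x: consensus_predict(models, x), folds[i]))
    return scores


data = make_data(200, random.Random(42))
scores = kfold_accuracy(data, k=5)
print("fold accuracies:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

In this sketch the holdouts are drawn only from each training fold, so the test fold never influences which models enter the consensus. A real application would replace the nearest-centroid learner with the classifiers and protein features under study.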

Conclusion: Our Consensus Holdout Sampler outperforms other popular algorithms and yields highly accurate predictions with low standard deviation. The advantage of our method comes from using a tree of holdouts, in which different ML/AI-based programs are sampled in different ways.

Keywords: Polymorphisms, holdout sampler, protein mutation, deep sampling, machine learning, single amino acid variants.


© 2024 Bentham Science Publishers