Current Chinese Science

ISSN (Print): 2210-2981
ISSN (Online): 2210-2914

General Research Article Section: Bioinformatics

Distant Supervision-based Relation Extraction for Literature-Related Biomedical Knowledge Graph Construction

Author(s): Rui Hua, Xuezhong Zhou*, Zixin Shu, Dengying Yan, Kuo Yang, Xinyan Wang, Chuang Cheng and Qiang Zhu

Volume 3, Issue 6, 2023

Published on: 06 October, 2023

Page: [477 - 487] Pages: 11

DOI: 10.2174/0122102981269053230921074451

Abstract

Background: Relation extraction is a crucial component of knowledge graph construction. However, it often requires a significant amount of manual annotation, which is time-consuming and expensive. Distant supervision seeks to mitigate this challenge by mapping known triple facts onto raw text, generating a large volume of pseudo-training data at minimal cost.

Objective: The aim of this study is to explore the novelty and potential of the distant supervision-based relation extraction approach. By leveraging this method, we aim to enhance knowledge reliability and facilitate new knowledge discovery by establishing associations between the literature and knowledge from specific biomedical data or existing knowledge graphs.

Methods: This study presents a methodology to construct a biomedical knowledge graph employing distant supervision techniques. Through establishing links between knowledge entities and relevant literature sources, we methodically extract and integrate information, thereby expanding and enriching the knowledge graph. This study identified five types of biomedical entities (e.g., diseases, symptoms and genes) and four kinds of relationships. These were linked to PubMed literature and divided into training and testing datasets. To mitigate data noise, the training set underwent preprocessing, while the testing set was manually curated.
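
The labeling step behind distant supervision can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Triple` type, the `mentions` and `distant_label` helpers, the example sentences, and the relation name `disease-symptom` are all hypothetical, and plain string matching stands in for whatever entity recognition and literature linking the study actually used.

```python
import re
from typing import NamedTuple


class Triple(NamedTuple):
    """A knowledge graph fact (hypothetical structure for illustration)."""
    head: str
    relation: str
    tail: str


def mentions(entity: str, sentence: str) -> bool:
    """Case-insensitive whole-word match of an entity name in a sentence."""
    pattern = r"\b" + re.escape(entity) + r"\b"
    return re.search(pattern, sentence, re.IGNORECASE) is not None


def distant_label(triples, sentences):
    """Label a sentence with a triple's relation when both entities co-occur in it."""
    labeled = []
    for sentence in sentences:
        for t in triples:
            if mentions(t.head, sentence) and mentions(t.tail, sentence):
                labeled.append((sentence, t.head, t.tail, t.relation))
    return labeled


# Hypothetical inputs: one known triple and two literature sentences.
triples = [Triple("asthma", "disease-symptom", "wheezing")]
sentences = [
    "Wheezing is a common presentation of asthma in children.",
    "Asthma prevalence has increased over the past decade.",
]

pseudo_training_data = distant_label(triples, sentences)
for row in pseudo_training_data:
    print(row)
```

Only the first sentence contains both entities, so only it receives a label. Co-occurrence alone does not guarantee that a sentence expresses the relation, which is the source of the data noise that the preprocessing of the training set and the manual curation of the testing set are meant to address.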

Results: We associated 230,698 triples from the existing knowledge graph with relevant literature. Furthermore, we identified an additional 205,148 new triples sourced directly from these studies.

Conclusion: Our study markedly advances the field of biomedical knowledge graph enrichment, particularly in the context of Traditional Chinese Medicine (TCM). By validating a substantial number of triples through literature associations and uncovering over 200,000 new triples, we have made a significant stride in promoting the development of evidence-based medicine in TCM. The results underscore the potential of using a distant supervision-based relation extraction approach to both validate and expand knowledge bases, contributing to the broader progression of evidence-based practices in the realm of TCM.

Keywords: Relation extraction, knowledge graph, distant supervision, named entity recognition, literature, biomedical knowledge graph.

