Title: Distant Supervision-based Relation Extraction for Literature-Related Biomedical Knowledge Graph Construction
Volume: 3
Issue: 6
Author(s): Rui Hua, Xuezhong Zhou*, Zixin Shu, Dengying Yan, Kuo Yang, Xinyan Wang, Chuang Cheng and Qiang Zhu
Affiliation:
- School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
Keywords:
Relation extraction, knowledge graph, distant supervision, named entity recognition, literature, biomedical knowledge graph.
Abstract:
Background: The task of relation extraction is a crucial component in the construction
of a knowledge graph. However, it often necessitates a significant amount of manual annotation,
which can be time-consuming and expensive. Distant supervision, as a technique, seeks to mitigate
this challenge by generating a large volume of pseudo-training data at a minimal cost,
achieved by mapping triple facts onto the raw text.
Objective: The aim of this study is to explore the novelty and potential of the distant supervision-based
relation extraction approach. By leveraging this innovative method, we aim to enhance
knowledge reliability and facilitate new knowledge discovery, establishing associations between
knowledge from specific biomedical data or existing knowledge graphs and literature.
Methods: This study presents a methodology to construct a biomedical knowledge graph employing
distant supervision techniques. Through establishing links between knowledge entities and
relevant literature sources, we methodically extract and integrate information, thereby expanding
and enriching the knowledge graph. This study identified five types of biomedical entities (e.g.,
diseases, symptoms and genes) and four kinds of relationships. These were linked to PubMed literature
and divided into training and testing datasets. To mitigate data noise, the training set underwent
preprocessing, while the testing set was manually curated.
Results: In our research, we successfully associated 230,698 triples from the existing knowledge
graph with relevant literature. Furthermore, we identified 205,148 new triples directly
sourced from these studies.
Conclusion: Our study markedly advances the field of biomedical knowledge graph enrichment,
particularly in the context of Traditional Chinese Medicine (TCM). By validating a substantial
number of triples through literature associations and uncovering over 200,000 new triples, we
have made a significant stride in promoting the development of evidence-based medicine in TCM.
The results underscore the potential of using a distant supervision-based relation extraction approach
to both validate and expand knowledge bases, contributing to the broader progression of
evidence-based practices in the realm of TCM.
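
Illustrative sketch: the abstract describes distant supervision as mapping triple facts onto raw text to obtain pseudo-training data. The minimal Python example below shows that labeling idea under simplifying assumptions. It is not the authors' implementation: the entity names, relation labels, sample sentences, and the helper functions `contains_term` and `distant_label` are hypothetical, and exact string matching stands in for the named entity recognition step.

```python
import re
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A knowledge graph fact: (head entity, relation, tail entity)."""
    head: str
    relation: str
    tail: str


def contains_term(sentence: str, term: str) -> bool:
    """Case-insensitive match of `term` as a whole token sequence."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


def distant_label(
    sentences: List[str], triples: List[Triple]
) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both entities of a known triple.

    This is the core distant supervision assumption: a sentence that
    co-mentions the head and tail of a fact is treated as expressing
    that fact's relation. The assumption is noisy, which is why the
    training data needs denoising before use.
    """
    labeled = []
    for sentence in sentences:
        for triple in triples:
            if contains_term(sentence, triple.head) and contains_term(
                sentence, triple.tail
            ):
                labeled.append(
                    (sentence, triple.head, triple.relation, triple.tail)
                )
    return labeled


# Hypothetical facts and sentences, for illustration only.
TRIPLES = [
    Triple("diabetes", "has_symptom", "polyuria"),
    Triple("BRCA1", "associated_with", "breast cancer"),
]
SENTENCES = [
    "Polyuria is a common early sign of diabetes.",
    "Mutations in BRCA1 raise the risk of breast cancer.",
    "Diabetes prevalence is rising worldwide.",
]

PSEUDO_LABELS = distant_label(SENTENCES, TRIPLES)
for sentence, head, relation, tail in PSEUDO_LABELS:
    print(f"({head}, {relation}, {tail}) <- {sentence}")
```

The third sentence mentions only one entity of any triple, so it receives no label. A sentence that mentions both entities without actually asserting the relation would still be labeled, which illustrates the kind of noise that the preprocessing of the training set and the manual curation of the testing set are meant to address.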