Title: Representation Learning of Biological Concepts: A Systematic Review
Volume: 19
Issue: 1
Author(s): Yuntao Yang, Xu Zuo, Avisha Das, Hua Xu and Wenjin Zheng*
Affiliation:
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
Keywords:
Machine learning, biological concepts, representation learning, embedding, natural language processing, graph neural networks.
Abstract:
Objective: Representation learning of biological concepts involves acquiring their numerical representations from various sources of biological information, such as sequences, interactions, and literature. This study conducted a comprehensive systematic review, analyzing both quantitative and qualitative data, to provide an overview of the field.
Methods: Our systematic review involved searching for articles on the representation learning of biological concepts in the PubMed and EMBASE databases. Of the 507 articles published between 2015 and 2022, we carefully screened and selected 65 papers for inclusion. We then developed a structured workflow that involved identifying relevant biological concepts and data types, reviewing various representation learning techniques, and evaluating downstream applications to assess the quality of the learned representations.
Results: The primary focus of the reviewed studies was on developing numerical representations for gene/DNA/RNA entities. We found Word2Vec to be the most commonly used method for biological representation learning. Moreover, a growing number of studies utilize state-of-the-art large language models to learn numerical representations of biological concepts. We also observed that representations learned from a specific source were typically used for a single downstream application relevant to that source.
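As an illustration of the Word2Vec-based approach noted above, the sketch below learns embeddings for DNA k-mers using the gensim library. The toy sequences, k-mer size, and hyperparameters are hypothetical placeholders chosen for demonstration only and are not taken from any reviewed study.

    # Minimal sketch: learning k-mer embeddings from DNA sequences with gensim's Word2Vec.
    from gensim.models import Word2Vec

    def kmerize(sequence, k=3):
        # Split a DNA sequence into overlapping k-mers, treating each k-mer as a "word".
        return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

    # Toy corpus: each sequence becomes one "sentence" of k-mer tokens (hypothetical data).
    sequences = ["ATGCGTACGTT", "GGCATGCGTAA", "TTACGCATGCG"]
    corpus = [kmerize(seq) for seq in sequences]

    # Skip-gram Word2Vec over the k-mer corpus; vector_size sets the embedding dimension.
    model = Word2Vec(corpus, vector_size=32, window=5, min_count=1, sg=1, epochs=50)

    # Numerical representation (embedding) of a single k-mer.
    print(model.wv["ATG"])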
Conclusion: Existing methods for biological representation learning primarily learn representations from a single data type, with the output fed into predictive models for downstream applications. Although some studies have explored the use of multiple data types to improve the performance of learned representations, such research remains relatively scarce. In this systematic review, we summarize the data types, models, and downstream applications used in this task.