Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities

Marinka Zitnik; Francis Nguyen; Bo Wang; Jure Leskovec; Anna Goldenberg; Michael M Hoffman

doi:10.1016/j.inffus.2018.09.012

Inf Fusion. 2019 Oct:50:71-91. doi: 10.1016/j.inffus.2018.09.012. Epub 2018 Sep 21.

Authors

Marinka Zitnik¹, Francis Nguyen² ³, Bo Wang⁴, Jure Leskovec¹ ⁵, Anna Goldenberg⁶ ⁷ ⁸, Michael M Hoffman² ³ ⁷ ⁸

Affiliations

¹ Department of Computer Science, Stanford University, Stanford, CA, USA.
² Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada.
³ Princess Margaret Cancer Centre, Toronto, ON, Canada.
⁴ Hikvision Research Institute, Santa Clara, CA, USA.
⁵ Chan Zuckerberg Biohub, San Francisco, CA, USA.
⁶ Genetics & Genome Biology, SickKids Research Institute, Toronto, ON, Canada.
⁷ Department of Computer Science, University of Toronto, Toronto, ON, Canada.
⁸ Vector Institute, Toronto, ON, Canada.

Abstract

New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.

Keywords: computational biology; heterogeneous data; machine learning; personalized medicine; systems biology.

Grants and funding

U54 EB020405/EB/NIBIB NIH HHS/United States