Toward more realistic drug-target interaction predictions

Brief Bioinform. 2015 Mar;16(2):325-37. doi: 10.1093/bib/bbu010. Epub 2014 Apr 9.

Abstract

A number of supervised machine learning models have recently been introduced for the prediction of drug-target interactions based on chemical structure and genomic sequence information. Although these models could offer improved means for many network pharmacology applications, such as repositioning of drugs for new therapeutic uses, they are often constructed and evaluated under overly simplified settings that do not reflect the real-life problem in practical applications. Using quantitative drug-target bioactivity assays for kinase inhibitors, as well as a popular benchmarking data set of binary drug-target interactions for enzyme, ion channel, nuclear receptor and G protein-coupled receptor targets, we illustrate here the effects of four factors that may lead to dramatic differences in the prediction results: (i) problem formulation (standard binary classification or more realistic regression formulation), (ii) evaluation data set (drug and target families in the application use case), (iii) evaluation procedure (simple or nested cross-validation) and (iv) experimental setting (whether training and test sets share both drugs and targets, only drugs, only targets, or neither). Each of these factors should be taken into consideration to avoid reporting overoptimistic drug-target interaction prediction results. We also suggest guidelines on how to make supervised drug-target interaction prediction studies more realistic, in terms of model formulations and evaluation setups that better address the inherent complexity of the prediction task in practical applications, as well as novel benchmarking data sets that capture the continuous nature of drug-target interactions for kinase inhibitors.
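
Factor (iv), the experimental setting, can be made concrete with a small sketch. The following is not the authors' code; it is a minimal illustration, assuming a complete drug-by-target grid of pairs, of how training and test pairs might be separated under each of the four settings. The function name `split_pairs`, the setting labels (`shared_both`, `new_drugs`, `new_targets`, `new_both`) and the `test_fraction` parameter are all hypothetical names introduced here.

```python
import random
from itertools import product


def split_pairs(drugs, targets, setting, test_fraction=0.25, seed=0):
    """Split all (drug, target) pairs into (train, test) lists.

    `setting` is a hypothetical label, not terminology from the paper:
      "shared_both" : test pairs are random; drugs and targets may also
                      appear in training pairs
      "new_drugs"   : test drugs never appear in training pairs
      "new_targets" : test targets never appear in training pairs
      "new_both"    : neither test drugs nor test targets appear in training
    """
    rng = random.Random(seed)
    drugs, targets = list(drugs), list(targets)
    pairs = list(product(drugs, targets))

    if setting == "shared_both":
        rng.shuffle(pairs)
        n_test = max(1, int(len(pairs) * test_fraction))
        return pairs[n_test:], pairs[:n_test]

    if setting not in ("new_drugs", "new_targets", "new_both"):
        raise ValueError(f"unknown setting: {setting!r}")

    # Hold out whole drugs and/or whole targets, depending on the setting.
    held_drugs, held_targets = set(), set()
    if setting in ("new_drugs", "new_both"):
        k = max(1, int(len(drugs) * test_fraction))
        held_drugs = set(rng.sample(drugs, k))
    if setting in ("new_targets", "new_both"):
        k = max(1, int(len(targets) * test_fraction))
        held_targets = set(rng.sample(targets, k))

    train, test = [], []
    for d, t in pairs:
        new_d, new_t = d in held_drugs, t in held_targets
        if setting == "new_both":
            if new_d and new_t:
                test.append((d, t))
            elif not new_d and not new_t:
                train.append((d, t))
            # Pairs with exactly one held-out entity are discarded so that
            # training never sees a test drug or a test target.
        elif new_d or new_t:
            test.append((d, t))
        else:
            train.append((d, t))
    return train, test


if __name__ == "__main__":
    drugs = ["d1", "d2", "d3", "d4"]
    targets = ["t1", "t2", "t3", "t4"]
    for s in ("shared_both", "new_drugs", "new_targets", "new_both"):
        tr, te = split_pairs(drugs, targets, s)
        print(s, len(tr), len(te))
```

In this sketch the `new_both` setting drops every pair in which only one of the two entities is held out, which is why it leaves less data for both training and testing than the other settings. A model scored only under `shared_both` is being tested on drugs and targets it has already seen in other pairs, which is one way the overoptimism described above can arise.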

Keywords: drug–target interaction; kinase bioactivity assays; nested cross-validation; predictive modeling; supervised machine learning.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Computational Biology
  • Databases, Pharmaceutical / statistics & numerical data
  • Drug Discovery / statistics & numerical data*
  • Humans
  • Models, Biological
  • Models, Statistical
  • Quantitative Structure-Activity Relationship
  • Supervised Machine Learning / statistics & numerical data