Identification of compound-protein interactions through the analysis of gene ontology, KEGG enrichment for proteins and molecular fragments of compounds

Mol Genet Genomics. 2016 Dec;291(6):2065-2079. doi: 10.1007/s00438-016-1240-x. Epub 2016 Aug 16.

Abstract

Compound-protein interactions play important roles in every cell through the recognition and regulation of specific functional proteins. Correctly identifying these interactions can improve our understanding of this complicated system and provide useful input for investigating various attributes of compounds and proteins. In this study, we attempted to understand this system by extracting properties from both proteins and compounds: proteins were represented by gene ontology (GO) and KEGG pathway enrichment scores, and compounds were represented by molecular fragments. Two feature selection methods, minimum redundancy maximum relevance (mRMR) and incremental feature selection (IFS), were combined with a basic machine learning algorithm, random forest, to analyze these properties and extract the core factors that determine actual compound-protein interactions. Compound-protein interactions reported in the Binding Database were used as positive samples. To improve the reliability of the results, the analytic procedure was executed five times, each with a different set of negative samples. This yielded five optimal random forest-based prediction methods, with maximum Matthews correlation coefficients (MCCs) of approximately 77.55%, which may be useful tools for predicting compound-protein interactions. By analyzing the extracted core features, this work provides new clues for understanding the system of compound-protein interactions. Our results indicate that compound-protein interactions are related to biological processes involving immune, developmental, and hormone-associated pathways.
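
As a rough illustration of the evaluation loop described in the abstract, the sketch below shows how an MCC can be computed from confusion-matrix counts and how incremental feature selection can pick the best-scoring prefix of a ranked feature list. It is a minimal sketch and not the authors' implementation: the function names `mcc` and `incremental_feature_selection` are hypothetical, the ranked list is assumed to come from an mRMR-style ranking, and the `evaluate` callback is an assumed stand-in for training a random forest on a feature subset and returning its MCC.

```python
import math
from typing import Callable, List, Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when a marginal count is zero
    # and the coefficient is otherwise undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def incremental_feature_selection(
    ranked: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Score the top-k prefix of a ranked feature list for every k and
    return the prefix with the highest score (first one wins on ties)."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked) + 1):
        # `evaluate` is a hypothetical callback, e.g. cross-validated MCC
        # of a classifier trained on the first k ranked features.
        score = evaluate(ranked[:k])
        if score > best_score:
            best_k, best_score = k, score
    return list(ranked[:best_k]), best_score


if __name__ == "__main__":
    # Toy scores keyed by subset size, purely for demonstration.
    toy_scores = {1: 0.5, 2: 0.8, 3: 0.7}
    features = ["feat_a", "feat_b", "feat_c"]
    print(incremental_feature_selection(features, lambda f: toy_scores[len(f)]))
```

The MCC computed this way ranges from -1 to 1; a value reported as a percentage, as in the abstract, would correspond to this coefficient multiplied by 100.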

Keywords: Compound–protein interaction; Gene ontology enrichment; Incremental feature selection; KEGG enrichment; Minimum redundancy maximum relevance.

MeSH terms

  • Algorithms
  • Computational Biology / methods*
  • Databases, Genetic
  • Gene Ontology
  • Proteins / chemistry
  • Proteins / metabolism*
  • Small Molecule Libraries / pharmacology*

Substances

  • Proteins
  • Small Molecule Libraries