Prediction Enhancement of Residue Real-Value Relative Accessible Surface Area in Transmembrane Helical Proteins by Solving the Output Preference Problem of Machine Learning-Based Predictors

J Chem Inf Model. 2015 Nov 23;55(11):2464-74. doi: 10.1021/acs.jcim.5b00246. Epub 2015 Oct 20.

Abstract

α-Helical transmembrane proteins constitute 25% of the entire human proteome space and are difficult targets for high-resolution wet-lab structural studies, which calls for accurate computational predictors. We present a novel sequence-based method, MemBrain-Rasa, that predicts relative solvent-accessible surface area (rASA) from primary sequences. MemBrain-Rasa features an ensemble prediction protocol composed of a statistical machine-learning engine, which is trained in the sequential feature space, and a segment template similarity-based engine, which is constructed from solved structures and sequence alignment. We locally constructed a comprehensive database of residue rASA values from the solved protein 3D structures in the PDB. This database is searched for segment templates that are expected to be structurally similar to segments of the query sequence. The segment template-based prediction is then fused with the support vector regression outputs using knowledge rules. Our experiments show that pure machine-learning output cannot cover the entire rASA solution space and suffers from a serious prediction preference problem because of the relatively small number of membrane protein structures that can be used as training samples. The template-based engine solves this problem very well, resulting in a significant improvement in prediction performance. MemBrain-Rasa achieves a Pearson correlation coefficient of 0.733 and a mean absolute error of 13.593 on the benchmark dataset, which are 26.4% and 26.1% better, respectively, than those of existing predictors. MemBrain-Rasa represents new progress in the structure modeling of α-helical transmembrane proteins. MemBrain-Rasa is available at www.csbio.sjtu.edu.cn/bioinf/MemBrain/.
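The two figures of merit quoted above, the Pearson correlation coefficient and the mean absolute error, can be computed per residue from paired observed and predicted rASA values. The following is a minimal sketch of those standard definitions only; it is not the MemBrain-Rasa code. The function names, the toy values, and the 0-100 percentage scale are assumptions made for illustration.

```python
import math


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    # Covariance and variances are left unnormalised; the 1/n factors cancel.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_obs * var_pred)


def mean_absolute_error(observed, predicted):
    """Average absolute difference between paired values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical toy values (assumed percentage scale), not data from the paper.
observed_rasa = [0.0, 10.0, 20.0, 30.0, 40.0]
predicted_rasa = [5.0, 15.0, 25.0, 35.0, 45.0]

pcc = pearson_correlation(observed_rasa, predicted_rasa)
mae = mean_absolute_error(observed_rasa, predicted_rasa)
```

In this toy case the prediction is offset from the observation by a constant, so the correlation is perfect while the error is not zero. The two metrics measure different things, which is why the abstract reports both.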

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Amino Acid Sequence
  • Animals
  • Bacterial Proteins / chemistry
  • Databases, Protein
  • Humans
  • Machine Learning*
  • Membrane Proteins / chemistry*
  • Models, Chemical*
  • Molecular Sequence Data
  • Protein Structure, Quaternary
  • Protein Structure, Secondary
  • Sequence Alignment
  • Solubility
  • Solvents / chemistry
  • Succinate Dehydrogenase / chemistry
  • Wolinella / chemistry

Substances

  • Bacterial Proteins
  • Membrane Proteins
  • Solvents
  • Succinate Dehydrogenase