Fast algorithm for population-based protein structural model analysis

Proteomics. 2013 Jan;13(2):221-9. doi: 10.1002/pmic.201200334. Epub 2013 Jan 3.

Abstract

De novo protein structure prediction often generates a large population of candidates (models), and then selects near-native models through clustering. Existing structural model clustering methods are time consuming due to pairwise distance calculation between models. In this paper, we present a novel method for fast model clustering without losing the clustering accuracy. Instead of the commonly used pairwise root mean square deviation and TM-score values, we propose two new distance measures, Dscore1 and Dscore2, based on the comparison of the protein distance matrices for describing the difference and the similarity among models, respectively. The analysis indicates that both the correlation between Dscore1 and root mean square deviation and the correlation between Dscore2 and TM-score are high. Compared to the existing methods with calculation time quadratic to the number of models, our Dscore1-based clustering achieves a linearly time complexity while obtaining almost the same accuracy for near-native model selection. By using Dscore2 to select representatives of clusters, we can further improve the quality of the representatives with little increase in computing time. In addition, for large size (~500 k) models, we can give a fast data visualization based on the Dscore distribution in seconds to minutes. Our method has been implemented in a package named MUFOLD-CL, available at http://mufold.org/clustering.php.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Algorithms*
  • Cluster Analysis
  • Computational Biology / methods*
  • Databases, Protein
  • Models, Chemical*
  • Models, Molecular
  • Protein Conformation
  • Proteins / chemistry*

Substances

  • Proteins