PANNZER: high-throughput functional annotation of uncharacterized proteins in an error-prone environment

Bioinformatics. 2015 May 15;31(10):1544-52. doi: 10.1093/bioinformatics/btu851. Epub 2015 Jan 8.

Abstract

Motivation: The last decade has seen a remarkable growth in protein databases. This growth comes at a price: a growing number of submitted protein sequences lack functional annotation. Approximately 32% of sequences submitted to the most comprehensive protein database UniProtKB are labelled as 'Unknown protein' or alike. Also the functionally annotated parts are reported to contain 30-40% of errors. Here, we introduce a high-throughput tool for more reliable functional annotation called Protein ANNotation with Z-score (PANNZER). PANNZER predicts Gene Ontology (GO) classes and free text descriptions about protein functionality. PANNZER uses weighted k-nearest neighbour methods with statistical testing to maximize the reliability of a functional annotation.

Results: Our results in free text description line prediction show that we outperformed all competing methods with a clear margin. In GO prediction we show clear improvement to our older method that performed well in CAFA 2011 challenge.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cluster Analysis
  • Computational Biology / methods
  • Data Interpretation, Statistical
  • Data Mining*
  • Databases, Genetic
  • Databases, Protein*
  • Gene Ontology
  • Humans
  • Molecular Sequence Annotation*
  • Proteins / genetics
  • Proteins / metabolism*
  • Vocabulary, Controlled*

Substances

  • Proteins