PANNZER: high-throughput functional annotation of uncharacterized proteins in an error-prone environment

Patrik Koskinen; Petri Törönen; Jussi Nokso-Koivisto; Liisa Holm

doi:10.1093/bioinformatics/btu851

Bioinformatics. 2015 May 15;31(10):1544-52. doi: 10.1093/bioinformatics/btu851. Epub 2015 Jan 8.

Authors

Patrik Koskinen¹, Petri Törönen¹, Jussi Nokso-Koivisto¹, Liisa Holm²

Affiliations

¹ Department of Biosciences, University of Helsinki, 00014 Helsinki, Finland and Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland.
² Department of Biosciences, University of Helsinki, 00014 Helsinki, Finland and Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland.

PMID: 25653249
DOI: 10.1093/bioinformatics/btu851

Abstract

Motivation: The last decade has seen remarkable growth in protein databases. This growth comes at a price: a growing number of submitted protein sequences lack functional annotation. Approximately 32% of the sequences submitted to UniProtKB, the most comprehensive protein database, are labelled 'Unknown protein' or similar. In addition, the functionally annotated portion is reported to contain 30-40% errors. Here, we introduce Protein ANNotation with Z-score (PANNZER), a high-throughput tool for more reliable functional annotation. PANNZER predicts Gene Ontology (GO) classes and free-text descriptions of protein function. It uses weighted k-nearest neighbour methods with statistical testing to maximize the reliability of a functional annotation.
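
As an illustration of the general idea only, a weighted k-nearest neighbour vote combined with a statistical test could be sketched as follows. This is not PANNZER's actual scoring: the function names, the similarity weights, the background frequencies and the one-proportion z-test are all assumptions made for this sketch, and the GO identifiers serve only as example labels.

```python
import math
from collections import defaultdict


def weighted_knn_scores(neighbours, k):
    """Weighted vote over the k most similar neighbours.

    neighbours: list of (similarity, set_of_go_terms) pairs (hypothetical input format).
    Returns a dict mapping each GO term to its share of the total similarity weight,
    plus the number of neighbours actually used.
    """
    top = sorted(neighbours, key=lambda n: n[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total <= 0:
        raise ValueError("neighbour similarities must sum to a positive value")
    votes = defaultdict(float)
    for sim, terms in top:
        for term in terms:
            votes[term] += sim
    return {term: v / total for term, v in votes.items()}, len(top)


def z_score(observed, background, n):
    """Approximate one-proportion z-score of an observed share against a
    background frequency, treating the n neighbours as n trials (an assumption)."""
    if not 0.0 < background < 1.0:
        raise ValueError("background frequency must lie strictly between 0 and 1")
    standard_error = math.sqrt(background * (1.0 - background) / n)
    return (observed - background) / standard_error


# Hypothetical sequence-search hits: (similarity weight, GO terms of the hit).
hits = [
    (0.9, {"GO:0003824"}),
    (0.8, {"GO:0003824", "GO:0005515"}),
    (0.4, {"GO:0005515"}),
    (0.1, {"GO:0016020"}),
]
# Hypothetical background frequencies of each term in the database.
background = {"GO:0003824": 0.1, "GO:0005515": 0.5}

scores, n_used = weighted_knn_scores(hits, k=3)
z_scores = {term: z_score(scores[term], background[term], n_used) for term in scores}

for term, z in sorted(z_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}\tweighted share={scores[term]:.3f}\tz={z:.2f}")
```

In this toy setting a term that is frequent among the neighbours but rare in the background receives a high z-score, whereas a term that is common everywhere does not, which is the intuition behind pairing the neighbour vote with a statistical test.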

Results: In free-text description line prediction, PANNZER outperformed all competing methods by a clear margin. In GO prediction, it shows a clear improvement over our older method, which performed well in the CAFA 2011 challenge.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Cluster Analysis
Computational Biology / methods
Data Interpretation, Statistical
Data Mining*
Databases, Genetic
Databases, Protein*
Gene Ontology
Humans
Molecular Sequence Annotation*
Proteins / genetics
Proteins / metabolism*
Vocabulary, Controlled*

Substances

Proteins