Introduction

Protein Clusters is a database of proteins grouped into clusters by sequence similarity. Clustering is a well-known method in statistics and computer science: for a given set of entities, clusters are defined as subsets that are homogeneous and well separated. Protein clusters are defined as groups of homologous proteins. The similarity between two protein sequences is measured by the maximum alignment between the sequences, as calculated by BLAST.

Scope

The Protein Clusters dataset consists of proteins encoded by complete and draft genomes from the RefSeq collection of microorganisms (prokaryotes, viruses, fungi and protozoans); it also includes curated protein clusters from RefSeq complete genomes of plants, chloroplasts and mitochondria. Clusters for each group are created and curated separately and given a different accession prefix. The Protein Clusters dataset contains automatically generated clusters that do not distinguish between orthologs and paralogs.

Data Model

A protein cluster is represented by a list of protein identifiers (gi numbers) and the genomes that code for the proteins. Each cluster has a stable unique identifier (a letter prefix followed by digits) and a functional cluster name (title). The cluster name is calculated automatically and then reviewed manually.

Example

PCLA_5029913 glycosyl hydrolase family protein

Proteins: 17

Conserved in: Bacillales

Total genera: 2

Total organisms: 13

Putative Paralogs: 4

COG functional categories: Carbohydrate transport and metabolism; General function prediction only

CDDs: cd08996 (superfamily: cl14647), smart00640, pfam08244 (superfamily: cl07030)
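As an illustration of the data model, the example record could be held in a small Python structure. This is a sketch only: the class name ProteinCluster, its field names and the helper split_accession are hypothetical and do not reflect an NCBI schema. The accession pattern (letters, an optional underscore, then digits) is inferred from the two accessions shown in this document. The gi list is left empty because the example gives only the protein count.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


def split_accession(accession: str) -> Tuple[str, str]:
    """Split a cluster accession into (letter prefix, digits)."""
    # Assumed pattern: uppercase letters, optional underscore, digits.
    match = re.fullmatch(r"([A-Z]+)_?(\d+)", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    return match.group(1), match.group(2)


@dataclass
class ProteinCluster:
    """Hypothetical in-memory view of one cluster record."""
    accession: str                      # stable unique identifier
    title: str                          # functional cluster name
    protein_count: int = 0
    protein_gis: List[int] = field(default_factory=list)
    conserved_in: str = ""
    total_genera: int = 0
    total_organisms: int = 0
    putative_paralogs: int = 0
    cog_categories: List[str] = field(default_factory=list)
    cdds: List[str] = field(default_factory=list)


# Values copied from the example record in this section.
example = ProteinCluster(
    accession="PCLA_5029913",
    title="glycosyl hydrolase family protein",
    protein_count=17,
    conserved_in="Bacillales",
    total_genera=2,
    total_organisms=13,
    putative_paralogs=4,
    cog_categories=[
        "Carbohydrate transport and metabolism",
        "General function prediction only",
    ],
    cdds=["cd08996", "smart00640", "pfam08244"],
)
```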

Methods

NCBI Protein Clusters uses two methods of clustering: partitioning (clique) and hierarchical.

Clique approach

Proteins are compared by sequence similarity using BLAST all against all (E-value cutoff 10^-5; effective length of the search space set to 5 × 10^8). Each BLAST score is then modified by protein length × alignment length of the BLAST hit and the modified scores are sorted. Clusters (also known as cliques) consist of protein sets such that every member of the cluster hits every other protein member (reciprocal best hits by modified score). Cluster membership is such that for any given protein in the cluster (protein A), all the other members of the cluster will have a greater modified score to protein A than any protein outside of the cluster will have to protein A. There are no cutoffs used during the clustering procedure, or strict requirements for clusters of orthologous groups, or any check on phylogenetic distance. The initial set of uncurated clusters created in 2005 has been used as a starting point for curation and has been updated quarterly since that time.
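The membership rule can be checked mechanically once modified scores are available. The Python sketch below is an illustration under stated assumptions, not the NCBI implementation: the function name is_valid_clique, the dictionary layout keyed by (query, subject), and the toy scores are all hypothetical. The scores are taken as already modified, since the exact modification formula is not spelled out here, and a missing hit is treated as a score of 0.

```python
from typing import Dict, Iterable, Tuple

Score = Dict[Tuple[str, str], float]  # (query, subject) -> modified score


def is_valid_clique(members: Iterable[str], score: Score) -> bool:
    """Check the membership rule for one candidate cluster.

    For every member A, each other member must hit A, and must score
    higher against A than any protein outside the cluster does.
    """
    members = set(members)
    all_proteins = {p for pair in score for p in pair}
    outsiders = all_proteins - members
    for a in members:
        inside = [score.get((b, a), 0.0) for b in members if b != a]
        outside = [score.get((c, a), 0.0) for c in outsiders]
        if any(s <= 0.0 for s in inside):
            return False  # some member does not hit A at all
        if inside and outside and min(inside) <= max(outside):
            return False  # an outsider scores at least as well as a member
    return True


def symmetric(pairs: Dict[Tuple[str, str], float]) -> Score:
    """Expand undirected toy scores into both directions."""
    out: Score = {}
    for (a, b), s in pairs.items():
        out[(a, b)] = s
        out[(b, a)] = s
    return out


# Toy modified scores (invented for illustration only).
toy = symmetric({
    ("A", "B"): 90.0,
    ("A", "C"): 80.0,
    ("B", "C"): 85.0,
    ("A", "D"): 30.0,
    ("C", "D"): 20.0,
})
```

With these toy values, A, B and C satisfy the rule together, whereas a candidate cluster of A, B and D does not, because the outsider C scores higher against A than D does.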

Hierarchical clustering

A new approach implemented for prokaryotic genomes is based on hierarchical clustering. First, all the proteins are organized in global clusters; then links between clusters are calculated, reflecting the similarity between the clusters based on several criteria.

Clustering procedure. The similarity of proteins is determined from the aggregated BLAST hits obtained by blastp with E-value 10^-3. Two proteins are considered connected if there is an aggregated BLAST hit between them that satisfies criteria on hit length and score. Clusters are aggregated in a hierarchical manner using the complete linkage distance, with an additional requirement that the minimum distance between clusters should not exceed a threshold. Because the connections are sparse and thresholds are applied, the result is a family of trees rather than a single tree, and these trees are treated as the clusters.
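A minimal Python sketch of thresholded complete-linkage clustering on sparse connections follows. It is illustrative only and rests on several assumptions: the function name complete_linkage, the distance dictionary and the threshold value are hypothetical; unconnected pairs are treated as infinitely distant; and merging simply stops once the smallest complete-linkage distance exceeds the threshold, which is a simplification of the criteria described above.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def complete_linkage(
    items: Iterable[str],
    dist: Dict[Tuple[str, str], float],
    threshold: float,
) -> List[FrozenSet[str]]:
    """Agglomerate items by complete linkage until the threshold is hit."""

    def d(a: str, b: str) -> float:
        # Sparse distances: a missing pair means "not connected".
        if (a, b) in dist:
            return dist[(a, b)]
        return dist.get((b, a), math.inf)

    def linkage(c1: FrozenSet[str], c2: FrozenSet[str]) -> float:
        # Complete linkage: the largest pairwise distance between clusters.
        return max(d(a, b) for a in c1 for b in c2)

    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        best = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        if linkage(*best) > threshold:
            break  # no remaining pair is close enough to merge
        clusters = [c for c in clusters if c not in best]
        clusters.append(best[0] | best[1])
    return clusters


# Toy distances (invented): P1-P4 and P2-P4 have no connection.
toy_dist = {
    ("P1", "P2"): 0.1,
    ("P2", "P3"): 0.2,
    ("P1", "P3"): 0.3,
    ("P3", "P4"): 0.2,
}
result = complete_linkage(["P1", "P2", "P3", "P4"], toy_dist, threshold=0.35)
```

Because P4 is not connected to P1 or P2, the two intermediate clusters can never be merged, so the procedure ends with more than one tree.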

Related clusters. After proteins are clustered, links between clusters are calculated by several criteria based on the similarity of the proteins in these clusters, and link indexes are created. These indexes are used to show the neighborhood of a cluster in Entrez search. First, representative proteins from groups of redundant and nearly redundant proteins are selected by the program USEARCH. These representative proteins are partitioned into disjoint sets and clustered. To perform clustering in parallel, the dataset is partitioned into disjoint sets using a parallel implementation based on a disjoint-set forest with the union-by-rank heuristic, and clustering is then performed concurrently in the partitions. Linking indexes are also calculated in parallel from the aggregated BLAST hits and the assignment of proteins to clusters.
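The partitioning step relies on a standard data structure. The sketch below shows a sequential disjoint-set forest with union by rank in Python; it is not the parallel NCBI implementation. The class and function names and the toy hit list are hypothetical, and path compression is added here as a common companion optimization that the text does not mention.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class DisjointSet:
    """Disjoint-set forest with union by rank (plus path compression)."""

    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}
        self.rank: Dict[str, int] = {}

    def find(self, x: str) -> str:
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Union by rank: attach the shallower tree under the deeper one.
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1


def partition(proteins: Iterable[str],
              hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group proteins into disjoint sets connected by at least one hit."""
    ds = DisjointSet()
    for p in proteins:
        ds.find(p)  # register proteins that have no hits
    for a, b in hits:
        ds.union(a, b)
    groups = defaultdict(list)
    for p in ds.parent:
        groups[ds.find(p)].append(p)
    return sorted(sorted(g) for g in groups.values())


# Toy input (invented): P6 has no hits and forms its own partition.
parts = partition(
    ["P1", "P2", "P3", "P4", "P5", "P6"],
    [("P1", "P2"), ("P2", "P3"), ("P4", "P5")],
)
```

Each resulting partition shares no hits with any other, so the partitions could be clustered independently, which is what makes concurrent processing possible.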

Manual Curation

One of the most important aspects of Protein Clusters curation is the assignment of function, which is obtained from the literature. Curated functional annotation can be propagated to all proteins within the cluster. This process makes it possible to improve functional annotation in RefSeq genomes and to unify and standardize naming rules across organisms and annotation pipelines.

Frequently, several alternative names are used for the same protein; this variation can confuse researchers and slow scientific progress. To standardize protein names, NCBI staff (RefSeq genome curators) work closely with experts from UniProt. The recommended name from UniProtKB/Swiss-Prot is also used as the NCBI protein cluster preferred name.

Access

Protein Clusters home page: http://www.ncbi.nlm.nih.gov/proteinclusters

The Entrez system provides search and browse options, retrieval, and links between protein clusters and other NCBI databases as well as external resources.

Text search. Clusters can be searched by general text terms and by specific terms such as cluster name ([title]) or individual protein or gene name (a list of all terms can be found on the Advanced search page).

Example

Search by function:

Query: transcriptional regulator

Query: transcriptional regulator[title]

Query: transcriptional regulator[protein name]

Search by attributes:

Gammaproteobacteria[Conserved in]

Limits and Advanced search

Advanced searches can be used for complex Boolean queries. The Builder lets you look up the available search terms and combine them with the AND, OR and NOT operators. Limits give users a way to apply simple filters without building complex queries. Searches can be limited by Curation Status, Nucleotide Source or Organism Group.
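Fielded terms and Boolean operators can also be assembled programmatically before being pasted into the search box. The Python sketch below only builds query strings; the helper names term and combine are hypothetical, and the field names are the ones used in the examples on this page.

```python
def term(text: str, field: str = "") -> str:
    """Return a search term, optionally restricted to a field."""
    return f"{text}[{field}]" if field else text


def combine(op: str, *terms: str) -> str:
    """Join terms with one Boolean operator, wrapped in parentheses."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {op!r}")
    if len(terms) < 2:
        raise ValueError("need at least two terms to combine")
    return "(" + f" {op} ".join(terms) + ")"


# Clusters titled "transcriptional regulator" conserved in Gammaproteobacteria.
query = combine(
    "AND",
    term("transcriptional regulator", "title"),
    term("Gammaproteobacteria", "Conserved in"),
)
```

Wrapping each combination in parentheses keeps nested expressions unambiguous when several operators are mixed.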

Example

A search for “rna helicase” returns more than 500 clusters from different organisms. Limiting the search to ‘Curated’ and ‘Viruses’ results in a single cluster, RNA helicase NPH-II, conserved in Poxviridae: http://www.ncbi.nlm.nih.gov/proteinclusters/PHA2653

The poxvirus RNA helicase NPH-II belongs to a family of ubiquitous ATP-dependent helicases that are required for RNA metabolism in bacteria, eukaryotes and many viruses. The NPH-II family helicases found in hepatitis C virus and various poxviruses share similar sequences, structures and mechanisms of action that are essential for viral replication.

Browse

The Entrez system also provides a browsing option. Clusters can be browsed by function and filtered by size and organism group. The Browse table allows users to sort by any column by clicking its header. Follow the Browse link from the home page or go directly to

http://www.ncbi.nlm.nih.gov/proteinclusters/browse

Download

Data snapshots (with a date stamp), organized by major taxonomic groups, are available for download from the FTP directory ftp://ftp.ncbi.nih.gov/genomes/Bacteria/CLUSTERS/


Last updated: 2013-07-09T13:52:52-04:00