Prokaryotic RefSeq Genomes

Which genomes are included in RefSeq?

The RefSeq archaeal and bacterial genome assemblies are annotated and maintained copies of complete and whole-genome shotgun assemblies submitted to INSDC (GenBank, ENA, and DDBJ) that meet sequence and annotation quality criteria. A genome assembly may be excluded from RefSeq for reasons related to sequence or annotation quality. Most notably, assemblies generated from environmental samples are excluded because of concerns about the accuracy of the organism assignment and possible cross-contamination.

The RefSeq archaeal and bacterial genome assemblies can be searched and downloaded from Entrez Assembly. They are also available as a BLAST database for sequence homology searches.

With the exception of selected reference genomes, RefSeq genomes are annotated using NCBI’s prokaryotic genome annotation pipeline to provide consistency across the dataset. Each genome is annotated with gene and protein features that are unique to that genome. Gene features are provided with unique locus_tags on all genomes, but NCBI GeneID cross-references are annotated only for reference genomes and a subset of representative genomes. Therefore, links between RefSeq genomes, or annotated proteins, and the Gene resource are available only for a subset of RefSeq prokaryotic genomes (and corresponding annotated proteins). Protein coding regions (CDS features) include cross-references to RefSeq non-redundant protein accessions (with WP_ prefix). A given non-redundant protein accession may be annotated on more than one genome.

For each defined species with assemblies included in RefSeq, one assembly is designated as 'reference' or 'representative'. The collection of 'representative' and 'reference' genomes is a compact, normalized, and taxonomically diverse view of the RefSeq collection that can be used for the taxonomic identification and characterization of novel sequences.

Reference genomes

Reference genomes are genome assemblies that are annotated and updated by the assembly submitters and chosen by the RefSeq curatorial staff based on their quality and importance to the community as anchors for the analysis of other genomes in their taxonomic group. Some reference genomes are selected based on a long history of collaboration and wide recognition as a community standard, such as the reference genome of Escherichia coli str. K-12 substr. MG1655. Other reference genomes are selected based on medical importance, sequence and annotation quality, and the availability of experimental support. Gene annotation on these genomes is reviewed and may be modified by RefSeq, but largely reflects the work of the submitters.

Reference genomes are annotated with YP_ or NP_ protein accessions, which in turn cross-reference the non-redundant protein records. Reference genomes are also annotated with GeneID cross-references to the NCBI Gene resource. You can browse the list of reference genomes in the NCBI Genome resource. Sequences and annotation for reference assemblies can be downloaded from Entrez Assembly.

Representative genomes

For species without a reference genome, one assembly is selected as representative among the live RefSeq assemblies (i.e. not superseded by newer assemblies or suppressed due to quality or taxonomic misassignment concerns). No representative is selected for undefined species (such as ‘Vibrio sp.’). Representative genomes are chosen based on the criteria below, in order of importance. Criteria that are lower on the list are only used if assemblies are judged equal based on higher-ranked criteria:

  1. Manual selection – a few representatives are selected based on community input, biological features or other a priori knowledge about the assembly.
  2. Magnitude of deviation from average assembly length for all candidates for the species – assemblies with the lowest integral number of standard deviations from the species average assembly length are preferred. This ensures that assemblies that are significantly longer or shorter than others for the species are not chosen.
  3. Magnitude of count of pseudo CDSs – assemblies with the lowest rounded natural log of pseudo CDSs are preferred.
  4. Magnitude of count of scaffolds – assemblies with the lowest rounded log base 10 scaffold count are preferred.
  5. Availability of GeneIDs – assemblies annotated with GeneID cross-references to NCBI's Gene resource are preferred.
  6. Absolute count of pseudo CDSs
  7. Type strain status
  8. Release date (tie-breaker)
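
The ranked criteria above behave like a lexicographic sort: a lower-ranked criterion only matters when all higher-ranked ones tie. The following Python sketch illustrates that idea under several assumptions that the text does not specify: the `Assembly` class and its field names are hypothetical, the "integral number of standard deviations" is taken as a truncated value using the population standard deviation, counts of zero are treated as one before taking logarithms, type strains are preferred over non-type strains, and more recent release dates win the final tie-break. It is not the pipeline NCBI actually runs.

```python
import math
from dataclasses import dataclass
from datetime import date
from statistics import mean, pstdev


@dataclass
class Assembly:
    # Hypothetical record; field names are illustrative only.
    accession: str
    length: int
    pseudo_cds: int
    scaffolds: int
    has_gene_ids: bool
    is_type_strain: bool
    release_date: date
    manually_selected: bool = False


def rank_key(asm, avg_len, sd_len):
    """Build a tuple that sorts preferred assemblies first."""
    # Criterion 2: whole number of standard deviations from the species mean
    # (truncation and population SD are assumptions).
    deviations = int(abs(asm.length - avg_len) / sd_len) if sd_len > 0 else 0
    return (
        not asm.manually_selected,                  # 1. manual selection
        deviations,                                 # 2. length deviation
        round(math.log(max(asm.pseudo_cds, 1))),    # 3. rounded ln(pseudo CDSs)
        round(math.log10(max(asm.scaffolds, 1))),   # 4. rounded log10(scaffolds)
        not asm.has_gene_ids,                       # 5. GeneIDs preferred
        asm.pseudo_cds,                             # 6. absolute pseudo CDS count
        not asm.is_type_strain,                     # 7. type strain (assumed preferred)
        -asm.release_date.toordinal(),              # 8. newer first (assumption)
    )


def pick_representative(candidates):
    """Return the best-ranked assembly among candidates of one species."""
    lengths = [a.length for a in candidates]
    avg_len = mean(lengths)
    sd_len = pstdev(lengths)
    return min(candidates, key=lambda a: rank_key(a, avg_len, sd_len))
```

Because Python compares tuples element by element, the order of entries in the key directly encodes the order of importance of the criteria. Bucketing criteria 3 and 4 on a logarithmic scale means that small differences in pseudo CDS or scaffold counts do not outweigh the criteria that follow.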

Representative assemblies are updated several times a year to take into account assemblies newly added to RefSeq, changes in the NCBI Taxonomy, modified taxonomic assignments, and recently discovered contamination.

Representative genomes are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). Representatives for species with at least ten assemblies in RefSeq are annotated with a GeneID cross-reference to the NCBI Gene resource. You can browse the list of reference and representative genomes in the NCBI Genome resource. Sequences and annotation for the latest set of prokaryotic reference and representative genomes can be downloaded from Entrez Assembly.

Non-representative genomes

The non-reference and non-representative prokaryotic genomes make up the bulk of the RefSeq prokaryotic collection and are taxonomically diverse. They are included in RefSeq to represent the sequence variation observed in isolate- and strain-specific genomes, including genomes of medical importance. Like representative genomes, they are annotated with non-redundant RefSeq protein accessions (WP_ accession prefix) and display the protein product name that appears on the WP-accessioned record (see Protein data model below). However, non-representative genomes are not annotated with GeneID cross-references to the NCBI Gene resource. Sequences and annotation for non-representative genomes can be downloaded from Entrez Assembly.
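
The annotation conventions described for the three genome categories can be summarized in a short Python sketch. The function name, category strings, and dictionary keys are hypothetical and chosen only for illustration; they do not correspond to any NCBI tool or API. The threshold of ten assemblies is the one stated above for representative genomes.

```python
def annotation_profile(category, species_assembly_count=0):
    """Summarize protein accession prefixes and GeneID availability.

    `category` is one of "reference", "representative", or
    "non-representative" (hypothetical labels for this sketch).
    """
    if category == "reference":
        # YP_/NP_ accessions, which cross-reference WP_ records; GeneIDs present.
        return {"protein_prefixes": ("YP_", "NP_"), "gene_id": True}
    if category == "representative":
        # WP_ accessions; GeneIDs only if the species has at least ten assemblies.
        return {
            "protein_prefixes": ("WP_",),
            "gene_id": species_assembly_count >= 10,
        }
    if category == "non-representative":
        # WP_ accessions; no GeneID cross-references.
        return {"protein_prefixes": ("WP_",), "gene_id": False}
    raise ValueError(f"unknown genome category: {category!r}")
```

For example, `annotation_profile("representative", 12)` reports that GeneID cross-references are expected, while the same call with a count of 3 reports that they are not.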

Protein data model

With the exception of reference genomes, only non-redundant protein accessions (WP_ accession prefix) are annotated on new or re-annotated RefSeq prokaryotic WGS and Complete genomes. A single non-redundant protein may be annotated on many RefSeq genomes when the CDSs annotated on those genomes encode a protein that is identical in both sequence and length. For example, the coding sequence for the 50S ribosomal protein L11 that is annotated on NC_017743.1 provides a cross-link, shown below, to the non-redundant RefSeq protein WP_003156430.1. Over 1000 prokaryotic genomes are annotated with a CDS feature that encodes this identical protein, as shown in the Identical Protein report, which can be accessed by clicking the "Identical Proteins" link near the top of the protein record.

Image of CDS feature for 50S ribosomal protein L11 as annotated on NC_017743.1. The CDS cross-references non-redundant protein WP_003156430.1.
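
The non-redundant model can be pictured as grouping annotated proteins by exact sequence identity, so that one record stands for every genome that encodes that sequence. The Python sketch below is a minimal illustration under stated assumptions: the function name and the `(genome_id, protein_sequence)` input format are hypothetical, and the code does not reproduce how RefSeq assigns WP_ accessions.

```python
from collections import defaultdict


def group_identical_proteins(annotations):
    """Group genomes by the exact protein sequence their CDS encodes.

    `annotations` is an iterable of (genome_id, protein_sequence) pairs
    (a hypothetical input format). Exact string equality implies the
    proteins match in both sequence and length.
    """
    groups = defaultdict(set)
    for genome_id, sequence in annotations:
        groups[sequence.upper()].add(genome_id)
    return dict(groups)
```

Each key in the result plays the role of one non-redundant protein, and the associated set lists the genomes on which it would be annotated. A protein that differs by even a single residue, or that is a truncated or extended form of another, falls into its own group.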

Last updated: 2020-08-06T23:28:41Z