Prokaryotic RefSeq Genomes Frequently Asked Questions (FAQ)

Why has NCBI discontinued some prokaryotic Gene records?

NCBI is re-annotating all RefSeq archaeal and bacterial genomes to improve consistency across these datasets. As part of this process, NCBI is providing Gene records only for reference and certain representative genomes for a given prokaryotic species. Pre-existing Gene records for archaeal and bacterial genomes that are not in the above sets have been, or will be, discontinued. Notes are being placed on affected Gene records to provide more details.

How do I find a Gene record for the same species as a discontinued record, or for a given non-redundant protein?

In some cases, a discontinued Gene record is tracked as having been replaced by an orthologous record from a related strain. In these cases, the original record includes an information message that links to the replacement Gene entry. For entries that have not been tracked in this manner, it is still possible to find related Gene entries by using NCBI links from proteins to Gene as follows:

  • Navigate to the protein record of interest in NCBI's Protein database
  • In the "Related information" section in the right column of the page, follow the link to "Identical Proteins" or to "Related Sequences"
  • In the "Find Related data" section of the right column, select the Gene database, then click the "Find Items" button.

If this approach does not return a Gene record, you can also try to find a related Gene record by running a blastp query against the "Reference Proteins (refseq_protein)" database, then following the link to the Gene record in the 'Related information' panel of the BLAST results.

Will discontinued Gene records continue to be accessible?

Yes, discontinued records are still available in the Gene resource. Discontinued records are not updated, with the exception of the graphical display in the "Genomic regions, transcripts, and products" section of the page, which shows the current annotation for the RefSeq accession.version and coordinates. If that RefSeq genome has also been suppressed, this display will not change. If the RefSeq genome continues to be public and undergoes future annotation updates, the annotation of that sequence range may change and be automatically presented on the discontinued Gene entry. Additional updates are made at times to the informational messages that appear at the top of discontinued Gene records. We are working to add an improved message to the large set of bacterial Gene records that were suppressed in the first quarter of 2015. Additional information is available on the RefSeq bacterial re-annotation project page.

Do discontinued Gene records include information about suppressed NP_ and YP_ accessions that have been replaced with a non-redundant WP_ accession?

An information message will be added to the top of the Gene full report for the set of bacterial Gene entries that were removed in the first quarter of 2015, which corresponds to the RefSeq bacterial complete genome re-annotation project and the revised definition of scope for Gene. The provided message includes information on the new locus_tag and replacement non-redundant protein accession when available. This informational message will become available shortly after the FTP installation of comprehensive RefSeq release 70 (May 2015).

Why has NCBI removed so many bacterial protein records with the accession prefix NP_ or YP_?

NCBI has implemented a new data model for managing prokaryotic genomes to address concerns about data redundancy. This new management plan provides non-redundant RefSeq protein records, with an accession prefix 'WP_'. All RefSeq prokaryotic genomes that are annotated with a CDS which translates to the identical protein sequence are now being annotated with a non-redundant protein accession. An exception is made for RefSeq prokaryotic 'reference genomes' which continue to be annotated with 'NP_' or 'YP_' accessions which in turn cross-reference a non-redundant protein accession. At the end of 2014 and into the first quarter of 2015 we re-annotated the RefSeq bacterial complete genomes; this resulted in the removal of nearly 7 million NP_ and YP_ accessions as these genomes were updated to directly cross-reference the new non-redundant protein records (WP_ accessions). All new RefSeq prokaryotic genomes, with the exception of reference genomes, will be annotated with non-redundant WP_ accessions.
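The prefix rules described above can be sketched as a small helper. This is only an illustration: the function name and the category labels ("non-redundant", "genome-specific", "other") are hypothetical and are not NCBI terminology or tooling, and the accession strings in the usage note are made-up examples.

```python
def classify_refseq_protein(accession):
    """Classify a RefSeq protein accession by its prefix.

    WP_ accessions are non-redundant protein records; NP_ and YP_
    accessions are the genome-specific records that, per the FAQ,
    remain on reference genomes. The returned labels are hypothetical.
    """
    acc = accession.strip().upper()
    if acc.startswith("WP_"):
        return "non-redundant"
    if acc.startswith(("NP_", "YP_")):
        return "genome-specific"
    return "other"
```

For example, `classify_refseq_protein("WP_000000001.1")` returns `"non-redundant"`, while an accession beginning with `YP_` or `NP_` returns `"genome-specific"`.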

Why have the locus_tags changed on RefSeq bacterial genomes, compared to the submitted GenBank entry?

Locus_tags have changed on most prokaryotic RefSeq genomes as a result of re-annotation by NCBI using the Prokaryotic Genome Annotation Pipeline (PGAP). Before the re-annotation, the annotation available on most RefSeq bacterial genomes was identical to that available in the submitted GenBank genome, and thus it was appropriate to retain locus_tags from the submitted genome annotation. However, with the re-annotation project, in some cases CDS coordinates have changed, unsupported CDSs have been removed, or new supported CDSs have been added; therefore, it was necessary to provide new locus_tags in order to comprehensively report this data type for RefSeq bacterial genomes.

How do I find the best replacement for a discontinued locus_tag?

In many cases, the original locus_tag is still annotated on the current RefSeq bacterial genome record as an /old_locus_tag qualifier along with the new locus tag for the gene feature.

One example is the annotation of the 30S ribosomal protein S9 gene on the Pseudomonas fluorescens A506 genome (NC_017911.1). The FEATURES table shows both the discontinued locus_tag, PflA506_0831, and the current locus tag, PFLA506_RS04145.
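A minimal sketch of how the /old_locus_tag qualifier could be used to build an old-to-new mapping from a GenBank flat file is shown below. It assumes the standard flat-file layout (feature keys indented five spaces, qualifiers on following lines). The excerpt is hand-written for illustration: the two locus_tag values and the product name come from the example above, but the feature location is a placeholder, and the function name is hypothetical.

```python
import re

# Hand-written excerpt in GenBank flat-file style. The locus_tag values are
# taken from the FAQ example; the location 1..393 is only a placeholder.
SAMPLE_FEATURES = '''\
     gene            1..393
                     /locus_tag="PFLA506_RS04145"
                     /old_locus_tag="PflA506_0831"
     CDS             1..393
                     /locus_tag="PFLA506_RS04145"
                     /old_locus_tag="PflA506_0831"
                     /product="30S ribosomal protein S9"
'''

# A feature line has its key starting in column 6, followed by a location.
FEATURE_START = re.compile(r"^ {5}(\S+)\s+\S")
# Only the two qualifiers needed for the mapping are captured.
QUALIFIER = re.compile(r'^\s+/(locus_tag|old_locus_tag)="([^"]+)"')


def map_old_locus_tags(features_text):
    """Return {old_locus_tag: current locus_tag} for gene features."""
    features = []  # one (feature_key, qualifiers) pair per feature
    for line in features_text.splitlines():
        start = FEATURE_START.match(line)
        if start:
            features.append((start.group(1),
                             {"locus_tag": [], "old_locus_tag": []}))
            continue
        qual = QUALIFIER.match(line)
        if qual and features:
            features[-1][1][qual.group(1)].append(qual.group(2))

    mapping = {}
    for key, quals in features:
        # Use gene features only, so each gene is counted once.
        if key == "gene" and quals["locus_tag"]:
            for old in quals["old_locus_tag"]:
                mapping[old] = quals["locus_tag"][0]
    return mapping
```

Calling `map_old_locus_tags(SAMPLE_FEATURES)` on the excerpt yields a dictionary that maps PflA506_0831 to PFLA506_RS04145. Genes without an /old_locus_tag qualifier simply do not appear in the mapping.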

A supplemental report (release70.bacterial-reannotation-report) is provided on the FTP site with RefSeq release 70; it includes information to support mapping protein accessions as well as locus_tags. This report is initially provided in the FTP RefSeq release-catalog directory area and will be moved to the release-catalog/archive/ directory when the July 2015 RefSeq release 71 is installed.

What is the difference between a RefSeq reference and representative genome?

A detailed explanation, including how to retrieve sequences from these two kinds of organisms, is on the Prokaryotic RefSeq Genomes page. In brief, RefSeq reference genomes are high quality genome datasets that NCBI and the community have identified as being important. NCBI staff thoroughly review, correct, and augment the annotation of these genomes, often in collaboration with community members. RefSeq representative genomes are provided for clades and species that do not have a designated reference genome. They represent one or a small number of the best genomes for a clade cluster.

An important distinction is that reference genomes are annotated with YP_ or NP_ protein accessions that in turn cross-reference non-redundant protein records (WP_). All reference genomes are also annotated with GeneIDs. Representative genomes are annotated with non-redundant protein accessions (WP_), and a subset of representative genomes are annotated with GeneIDs. Currently, GeneIDs are provided for representative genomes that have at least 10 nearly identical variant genomes in the cluster.

Why are some type strain bacterial genomes not designated as a RefSeq reference (or representative) genome?

Type strain information is considered when the selection of representative genomes is made, but the genomic sequence data must still be of sufficient quality to be a reference or representative RefSeq genome (RefSeq genome annotation criteria). If you have questions about type strain genomes that you feel are of high quality but are not tracked as a RefSeq reference or representative genome, please write to info@ncbi.nlm.nih.gov with details supporting their quality so we can review the situation.

How do I find a replacement for a removed protein accession?

Removed protein accessions can still be accessed in NCBI's Protein resource when querying by protein accession or gi. A custom message has been provided with links to the replacement non-redundant protein for a subset of suppressed NP_ and YP_ protein accessions. These suppressed accessions have been replaced by, and are identical to, a non-redundant protein record; the same nucleotide accession.version + coordinates that used to cross-reference an NP_ or YP_ accession now cross-references a non-redundant WP_ accession. We are working to expand this navigation support for annotation updates that resulted in a small change to the CDS feature coordinates such that the original NP_/YP_ accession is very similar to, but not identical to, the replacement non-redundant WP_ accession.

[Image: informative message added to suppressed bacterial YP/NP accessions.]

How do I find the nucleotide coding sequence (CDS) for a non-redundant protein record?

Retrieve the record in the Protein database and then click the link at the top of the page to the Identical Protein table (or use the Display Settings menu to navigate to this view). In the table, find the organism that you want and use the link in the CDS Region in Nucleotide column to link to the region of the genomic sequence that corresponds to the coding region (CDS) for the protein in the organism of interest.

How do I find the list of genomes that include a CDS annotation that cross-references a given non-redundant protein accession?

From a given non-redundant protein accession, change the Display Settings to the Identical Protein Report. A link is provided to this display at the top of the record view, where the link to FASTA format is also found. The Identical Protein Report page shows the nucleotide sequence and CDS annotation location that the non-redundant protein is annotated on. The report also includes the organism information for the nucleotide record, and lists additional protein sequences that are identical in sequence to the RefSeq non-redundant protein.

The tabular report can be downloaded by using the Send to link on the upper right side of the page (Send to -> File).
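Once the report has been saved to a file, it can be filtered with a few lines of code. The sketch below assumes the download is tab-delimited with a header row, and the column names used as defaults ("Organism", "Nucleotide Accession", "Start", "Stop", "Strand") are assumptions; check the header of your own file and pass different names if needed. The sample rows are placeholders written for illustration, not real report content, and the function name is hypothetical.

```python
import csv
import io

# Placeholder rows shaped like a tab-delimited report. NC_017911.1 is the
# genome named elsewhere in this FAQ; coordinates and the second row are made up.
SAMPLE_REPORT = (
    "Nucleotide Accession\tStart\tStop\tStrand\tOrganism\n"
    "NC_017911.1\t1\t393\t+\tPseudomonas fluorescens A506\n"
    "NZ_XXXX01000001.1\t100\t492\t-\tExample organism (placeholder)\n"
)


def cds_locations(report_text, organism_substring,
                  organism_col="Organism",
                  nucleotide_col="Nucleotide Accession",
                  start_col="Start", stop_col="Stop", strand_col="Strand"):
    """Return (accession, start, stop, strand) for rows matching an organism.

    Column names are assumptions about the downloaded file's header.
    Matching is a case-insensitive substring test on the organism column.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    wanted = organism_substring.lower()
    hits = []
    for row in reader:
        if wanted in (row.get(organism_col) or "").lower():
            hits.append((row[nucleotide_col],
                         int(row[start_col]),
                         int(row[stop_col]),
                         row[strand_col]))
    return hits
```

With a real download, read the file contents into a string (for example with `open(path).read()`) and pass that string as `report_text`.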

What should I do if I think the name given to a non-redundant RefSeq protein in the DESCRIPTION line or the protein /product line is wrong?

Please write to the help-desk (info@ncbi.nlm.nih.gov), describe the problem and provide the evidence you have for a new protein name.

Can I add a publication to a non-redundant protein record?

At this time, publications cannot be added to non-redundant protein records as they represent the pure sequence object which is found, in many cases, on genomes from multiple strains or even species.

How can I review what genes are annotated nearby for a given non-redundant RefSeq protein?

From a non-redundant protein record, navigate to the Identical Proteins report page (using the link at the top of the record, or the Display Settings menu). Select the organism of interest from the table, then follow the link in the "CDS Region in Nucleotide" column. You can review the annotation for an expanded region in the GenBank format by changing the Region shown, or view the annotation in a graphical context by following the link to "Graphics". Once you are in the graphics view, you can zoom out to view a graphical display of the neighboring gene annotations.

How can I access species- or strain-specific protein datasets?

Species- or strain-specific protein datasets for individual RefSeq genomes can be obtained online, by FTP, and through NCBI's programming utilities. To access data online, navigate to the annotated genome record(s) in NCBI's Nucleotide database, use the right-column option to "Find related data" in the Protein database, then download the protein records using the upper-right "Send to" wizard.

To access proteins for specific species or strains by FTP, navigate to NCBI's Assembly resource, then follow the right-column link to the RefSeq FTP site, where you can download sequence and annotation data in a variety of formats. RefSeq genome sequences include a link to the Assembly resource in the DBLINK section of the record; alternatively, follow the Assembly link that is located in the right-column 'Related information' section of the Nucleotide record.

To access data using NCBI's programming utilities, provide the genomic accession(s) and use the ELink function to retrieve the linked protein data (see documentation http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/books/NBK25501/).
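As a rough sketch of the ELink approach, the code below builds a request URL for links from a genomic record to the Protein database and extracts the linked UIDs from an XML response. It assumes the public E-utilities endpoint and that the `id` parameter accepts accession.version identifiers for sequence databases; confirm both against the documentation linked above. The sample response is a hand-written, abbreviated stand-in with placeholder UIDs, and the function names are hypothetical. No network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed public E-utilities endpoint for ELink.
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

# Abbreviated, hand-written stand-in for an ELink XML response.
# The UIDs 1111 and 2222 are placeholders, not real protein records.
SAMPLE_RESPONSE = """\
<eLinkResult>
  <LinkSet>
    <DbFrom>nuccore</DbFrom>
    <LinkSetDb>
      <DbTo>protein</DbTo>
      <LinkName>nuccore_protein</LinkName>
      <Link><Id>1111</Id></Link>
      <Link><Id>2222</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>
"""


def build_elink_url(ids, dbfrom="nuccore", db="protein"):
    """Build an ELink URL linking records in `dbfrom` to records in `db`."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(ids)}
    return ELINK_URL + "?" + urlencode(params)


def linked_ids(elink_xml):
    """Return the linked UIDs found in an ELink XML response, in order."""
    root = ET.fromstring(elink_xml)
    return [el.text for el in root.iterfind(".//LinkSetDb/Link/Id")]
```

To run this against the live service, fetch the URL (for example with `urllib.request.urlopen`) and pass the decoded response text to `linked_ids`. Calling `build_elink_url` with `dbfrom="protein"` and `db="gene"` gives the programmatic counterpart of the protein-to-Gene navigation described earlier in this FAQ.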

Last updated: 2019-01-08T18:06:07Z