RefSeq Announcements for 2015

January 7, 2015: Announcing RefSeq Release 69

This full release incorporates genomic, transcript, and protein data available as of January 2, 2015, and includes 74,127,019 records, 52,276,468 proteins, 9,973,568 RNAs, and sequences from 51,661 organisms. Additional information is available in the Release Notes.

Changes since the previous release:

[1] A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq69.snp.rpt

[2] In this release, the count of genomic records in the vertebrate_other node increased by about 43%, from 2,161,339 to 3,081,632. This increase correlates with the effort to annotate a large number of bird genome assemblies in late 2014. The complete list of RefSeq genomes annotated by NCBI's Eukaryotic Genome Annotation Pipeline is available here: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/genome/annotation_euk/all/

Other nodes saw modest increases in record counts.

[3] The International Nucleotide Sequence Database Consortium (INSDC) has introduced a new feature key 'regulatory' and a set of feature classes. http://www.insdc.org/files/feature_table.html#7.2

It is RefSeq's policy to adhere to INSDC formats. NCBI has made the changes necessary to start using the new feature key and consequently poly-adenylation signal features are now displayed using the new format.

For example: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/nuccore/NM_001303444.1

 regulatory      1191..1196
                 /gene="HEXIM2"
                 /gene_synonym="L3"
                 /regulatory_class="polyA_signal_sequence"
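
As a rough illustration of how the new feature key appears to a parser, the sketch below reads the excerpt above into a key, a location, and a dictionary of qualifiers. It assumes a single feature with a simple start..end location and double-quoted qualifier values; the function name parse_feature is hypothetical and this is not an NCBI-provided parser.

```python
import re

# Excerpt copied from the example record above (flat-file feature table).
FEATURE_TEXT = '''\
 regulatory      1191..1196
                 /gene="HEXIM2"
                 /gene_synonym="L3"
                 /regulatory_class="polyA_signal_sequence"
'''

def parse_feature(text):
    """Parse one feature block into (key, location, qualifiers).

    Assumes a simple 'start..end' location and quoted qualifier values.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    key, location = lines[0].split()
    qualifiers = {}
    for ln in lines[1:]:
        m = re.match(r'\s*/(\w+)="([^"]*)"', ln)
        if m:
            qualifiers[m.group(1)] = m.group(2)
    return key, location, qualifiers

key, location, qualifiers = parse_feature(FEATURE_TEXT)
start, end = (int(x) for x in location.split(".."))
print(key, start, end, qualifiers["regulatory_class"])
```

Code that previously looked for a dedicated poly-adenylation feature key would, under this format, check for the 'regulatory' key and then inspect the /regulatory_class qualifier.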

March 2015: Delayed installation of RefSeq Release 70

March 10, 2015: We are delaying the installation of RefSeq release 70 in order to do some additional quality assessment and prepare supplemental reports specific to this release. We anticipate releasing the data in approximately a week and apologize for the inconvenience.

March 12, 2015: We have identified some data concerns in the initial extraction for RefSeq release 70. We are currently working to resolve these issues and will then re-process data for the FTP release. We hope to install the release near the end of the month but cannot provide a firm date at this time.

March 31, 2015: The next RefSeq release (release number 70) will be provided in early May. In the meantime, new and updated RefSeq records continue to be provided in the RefSeq daily update directory, weekly updates of transcript and protein accessions are provided for a small number of more highly accessed vertebrates, and organism-specific genome plus annotation data (a snapshot in time) is available from the genomes FTP area. Example links to these areas include:

  1. ftp://ftp.ncbi.nlm.nih.gov/refseq/daily/ (RefSeq daily updates)

  2. ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot (weekly transcript and protein updates; also available for rat, mouse, cow, pig, and Xenopus tropicalis)

  3. ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.28_GRCh38.p2/ (updated annotation for the human reference genome)

  4. ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Mus_musculus/all_assembly_versions/GCF_000001635.23_GRCm38.p3 (updated annotation for the mouse reference genome)

May 7, 2015: Announcing RefSeq Release 70

This full release incorporates genomic, transcript, and protein data available as of April 30, 2015, and includes 74,720,563 records, 50,351,119 proteins, 11,310,700 RNAs, and sequences from 54,118 organisms. Additional information is available in the Release Notes.

This comprehensive RefSeq release includes a number of changes to the bacterial RefSeq genomes, including completion of the bacterial re-annotation project and transition to the new non-redundant RefSeq protein data model. A similar update for archaeal genomes will occur later this year. To facilitate your transition to using this data, we are providing extensive additional online documentation, additional report files in the RefSeq FTP release-catalog directory (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/), custom informational messages on suppressed protein records (now replaced with non-redundant protein accessions), and custom messages on discontinued Gene records (the scope definition for Gene changed). The bacterial data model change offers advantages in annotation consistency, reduced protein redundancy, and improved management of protein names. Since the January RefSeq release, we have started a major initiative to improve bacterial RefSeq protein names.

Changes since the previous release:

[1] A list of updated organisms and dbSNP annotation summary is available here: ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq70.snp.rpt

[2] Eukaryotic genome updates

This release includes updated annotation for the human reference genome (GRCh38.p2), the mouse reference genome (GRCm38) and the Caenorhabditis elegans reference genome corresponding to WormBase release WS245.

[3] Prokaryotic RefSeq data

This release reflects a large update of complete bacterial RefSeq genomes, proteins, and Genes.

NCBI decided to re-annotate all RefSeq prokaryotic genomes using NCBI’s genome annotation pipeline in order to make genome annotation comparable across genomes and species, instead of representing submitted annotation that was produced using different methods reflecting different states of technology development over time. Previously, it was possible that the same gene, in the same species, with an identical sequence for the gene’s genomic region, might be annotated with a different protein simply because it was annotated using different methods. Because of the re-annotation, the same gene in the same species with the same sequence will now be annotated with exactly the same protein in RefSeq. If you’d like to learn more about the re-annotation project and what NCBI is doing to help you transition to using this new data, please see the RefSeq Re-annotation Project page at: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/refseq/about/prokaryotes/reannotation/.

Previously, each annotated CDS was tracked with a distinct RefSeq protein accession number. However, because identical protein sequences have been found on multiple re-annotated RefSeq genomes, and because bacterial genomes are being sequenced extensively (often the same strain but different isolates), the RefSeq prokaryotic protein dataset was rapidly becoming very redundant. Therefore, rather than flood the protein database with thousands of completely identical proteins, NCBI has adopted the use of non-redundant (WP_) proteins for RefSeq prokaryotic genomes that are annotated using the NCBI pipeline. If the identical protein sequence (exactly the same protein sequence and length) appears on more than one RefSeq genome, NCBI simply re-uses the existing WP accession number instead of creating a new accession for each new occurrence and genome. For conserved proteins, the same WP accession may appear on thousands of genomes. This is a first step toward dealing with a world in which genomes are sequenced just for assays, rather than to discover novel proteins. We appreciate that this is a new and major change for RefSeq prokaryotic genomes, and that there are some issues still to be worked out to use these data smoothly, but we felt we needed to start making this change as the volume of disease-outbreak and other isolate sequencing continues to increase rapidly.
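
The accession re-use described above can be pictured with a small toy model. This is only a sketch of the idea, not NCBI's implementation: the function name, the genome labels, the sequences, and the accession numbering are all made up for illustration, and real WP_ accessions are assigned by NCBI.

```python
def assign_accessions(annotated_proteins):
    """Assign one accession per distinct protein sequence.

    annotated_proteins: iterable of (genome_label, protein_sequence) pairs.
    Returns a list of (genome_label, accession) pairs.
    """
    accession_by_sequence = {}
    assignments = []
    for genome, sequence in annotated_proteins:
        if sequence not in accession_by_sequence:
            # Hypothetical numbering scheme, used here only as a placeholder.
            number = len(accession_by_sequence) + 1
            accession_by_sequence[sequence] = f"WP_{number:09d}"
        # An identical sequence on another genome re-uses the same accession.
        assignments.append((genome, accession_by_sequence[sequence]))
    return assignments

toy = [
    ("genome_A", "MKTAYIAK"),
    ("genome_B", "MKTAYIAK"),  # identical to the protein on genome_A
    ("genome_B", "MSLNE"),
]
result = assign_accessions(toy)
for genome, accession in result:
    print(genome, accession)
```

In this toy input, three annotated proteins collapse to two accessions, which is the kind of reduction the non-redundant model is meant to deliver at scale.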

Advantages of comprehensive re-annotation and non-redundant proteins:

  • More consistent annotation across RefSeq bacterial genomes.
  • Significant reduction in protein redundancy. This is most notable for heavily sequenced species. For more information please see: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/refseq/about/prokaryotes/reannotation/#reducedredundancy
  • Significant improvement in protein name management.

This release: The long-term plan to re-annotate all RefSeq bacterial genomes using NCBI's prokaryotic genome annotation pipeline is now nearly complete, and the results are included in this release. We anticipate that the remaining very small number of re-annotated bacterial genomes will be released by the end of summer 2015. We also plan to re-annotate the archaeal genomes. As RefSeq bacterial genomes were re-annotated, the proteins were replaced with non-redundant RefSeq proteins (having the WP_ accession prefix). This data type was first announced in June 2013: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/news/06-11-2013-wp-refseqs/. Thus >7 million YP/NP protein accessions were removed since January, resulting in a decrease in the total number of protein accessions and a significant reduction in protein redundancy for the prokaryotic dataset. Removed accessions are reported here: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/release-catalog/release70.removed-records.gz

A data mapping report is available in the release-catalog directory (release70.bacterial-reannotation-report.txt.gz).

Protein records

In all bacterial genomes, except reference genomes and a small number which have yet to be re-annotated, protein accessions NP/YP have been replaced with non-redundant protein accession numbers (WP_).

  • > 7 million bacterial YP_ and NP_ RefSeq proteins were suppressed as complete bacterial genomes were re-annotated to conform to the new data model
  • Nearly 1 million non-redundant protein records were updated in March and April 2015 to improve the protein name. These updates affected CDS “/product=” annotation details for all (>31,000) of the RefSeq bacterial genomes and included typographical corrections, name format standardization, and improved functional information.
  • We have initiated a long-term project to validate and improve protein names for non-redundant protein records. In March and April we validated names for approximately 2 million records using multiple support lines from Swiss-Prot, HMM analysis, domain architecture analysis, and NCBI scientific staff curation.

Nucleotide records

  • >6,400 new or re-annotated RefSeq bacterial genomes were released since January 2, 2015.
  • All new complete or draft RefSeq prokaryote genomes now use the accession format rule NZ_<original_INSDC_accession>. Complete genomes that were already accessioned using the ‘NC_’ prefix will continue to use that accession number. Thus, the accession prefix is no longer an indicator of a complete bacterial genome. Information about genome completeness is provided in the record DEFINITION line, the Assembly resource, and FTP reports provided by Assembly and Genome resources.
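
A minimal sketch of the accession rule above, for scripts that need to recover the underlying INSDC accession from a RefSeq genomic accession. The helper name and the "NZ_CP000000" input are made up for illustration, and how sequence versions relate between the two accessions is not covered here.

```python
def original_insdc_accession(refseq_accession):
    """Return the INSDC accession embedded in an NZ_ accession, else None."""
    prefix = "NZ_"
    if refseq_accession.startswith(prefix):
        return refseq_accession[len(prefix):]
    # Accessions such as NC_ ones were not built with the NZ_ rule.
    return None

print(original_insdc_accession("NZ_CP000000"))  # made-up accession
print(original_insdc_accession("NC_002695"))
```

Because the prefix no longer signals completeness, a check like this should not be used to decide whether a genome is complete; the DEFINITION line or the Assembly reports are the places to look.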

Quality control

  • Over 450 RefSeq bacterial genomes that do not meet updated quality criteria were suppressed; some of these may be reinstated in the future after further improvements are made to NCBI’s prokaryotic genome annotation pipeline.
  • A supplemental file in the release-catalog directory (release70.addedQA-suppressedAssemblies.txt) reports details for a subset of bacterial genomes that were suppressed in March 2015 following an expansion of QA metrics and subsequent to curatorial review. This report illustrates some of the reasons for suppression.

locus_tag format

Re-annotated RefSeq genome records have new locus_tags in the format of <original locus tag prefix>_RS<digits>. The original locus tag is provided in the “old_locus_tag” qualifier. A bacterial genomes mapping report available in the release-catalog directory (release70.bacterial-reannotation-report.txt.gz) includes information about old and new locus_tags.
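
The locus_tag pattern above can be recognized with a regular expression, as in the sketch below. The tags "ABC_RS01234" and "ABC_0123" are made-up examples, and the expression assumes only what the format statement says: a prefix, the literal "_RS", then digits.

```python
import re

# <original locus tag prefix>_RS<digits>
LOCUS_TAG_RE = re.compile(r"^(?P<prefix>.+)_RS(?P<digits>\d+)$")

def split_locus_tag(tag):
    """Return (prefix, digits) for a re-annotation style locus_tag, else None."""
    m = LOCUS_TAG_RE.match(tag)
    if m is None:
        return None
    return m.group("prefix"), m.group("digits")

print(split_locus_tag("ABC_RS01234"))  # new-style tag (made up)
print(split_locus_tag("ABC_0123"))     # does not follow the _RS pattern
```

The pattern only distinguishes the two tag styles. To map a specific old locus_tag to its replacement, use the mapping report in the release-catalog directory.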

Available Reports and Documentation

a) Supplemental data mapping file: An FTP file in the release-catalog directory (release70.bacterial-reannotation-report.txt.gz) has been prepared for re-annotated genomes that have recently transitioned to using the new non-redundant proteins. This file reports the old protein accession and gi, the annotated CDS coordinates, and the old locus_tag and NCBI GeneID values, and maps them to the current non-redundant protein accession and gi, the new locus_tag and NCBI GeneID (if available), and the current CDS annotation coordinates. It also indicates whether the original protein identically matches the replacement non-redundant protein, is similar to it, or was dropped from the annotation.

b) Supplemental report of suppressed assemblies: An FTP file in the release-catalog directory (release70.addedQA-SuppressedAssemblies.txt) reports details for a subset of bacterial genomes that were suppressed in March 2015 following an expansion of QA metrics and subsequent to curatorial review. This report illustrates some of the reasons for suppression.

c) NCBI has created online documentation to explain these changes in detail:

  - Re-annotation project: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/refseq/about/prokaryotes/reannotation/
  - RefSeq Prokaryotic Genome Policy: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/refseq/about/prokaryotes/
  - RefSeq non-redundant proteins: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/refseq/about/nonredundantproteins/
  - Prokaryotic annotation pipeline: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/genome/annotation_prok/process/
  - Prokaryotic RefSeq FAQ: http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/refseq/about/prokaryotes/faq/

Impact to NCBI Gene

Together with this re-annotation effort, the scope of bacterial genomes included in Gene has been changed to include only genomes designated as a 'reference genome,' or 'representative genome' where there is a cluster of related assemblies to indicate that the chosen representative assembly will be stable. Individual gene features on each assembly are identified with a locus_tag that can be used as a unique identifier for the gene in publications, even if the assembly is out of scope for Gene.

Ongoing work

  • Organism classification and QA: work continues to identify misclassified genomes and those with contamination. Depending on the specific details of identified issues, additional RefSeq bacterial genomes may be suppressed or updated.
  • Re-annotation of complete genomes: A small number of bacterial genomes have not yet been re-annotated and will be in the near future. We also plan to re-annotate the archaeal RefSeq genomes in 2015.
  • Protein names: we are continuing to work on providing improved names for the non-redundant (WP_ accessioned) bacterial protein dataset. We are leveraging multiple sources of information including curated UniProtKB/Swiss-Prot records, HMMs, Domain and domain architecture, publications and manual curation.
  • Partial proteins: we are re-examining the prokaryotic genome annotation pipeline logic with regard to providing a non-redundant protein record for partial coding sequences.

Using this data

Please refer to the RefSeq bacterial genomes FAQ for information that will facilitate access to these data.

a) Strain-specific protein datasets for individual RefSeq genomes can be obtained online, by FTP, and through NCBI's programming utilities. To access data online, navigate to the annotated genome record(s) in NCBI's Nucleotide database, use the right-column option to "Find related data" in the Protein database, then download the protein records using the upper-right ‘Send to’ wizard. To access proteins for specific species or strains by FTP, navigate to NCBI's Assembly record then follow the right-column link to the RefSeq FTP site. RefSeq genomes include a link to the Assembly resource in the DBLINK section of the record or in the right-column Related information section of the Nucleotide record. To access data using NCBI programming utilities, provide the genomic accessions and use the eLink function to access the linked protein data (see documentation http://0-www.ncbi.nlm.nih.gov.brum.beds.ac.uk/books/NBK25501/).

b) A graphical display of an annotated gene or protein can be accessed from the Nucleotide resource. From a RefSeq genome record of interest, such as NC_002695.1, follow the link to ‘Graphics’, and search for the locus_tag or protein name of interest.

c) Conversely, if starting from an individual non-redundant protein record, information about the annotated genomic location and genome taxonomy is available by following the (page top) link to the Identical Protein report. When a non-redundant protein record has been annotated on multiple RefSeq genomes, this report page lists the set of genomes that contain that identical protein, the genomic coordinates of the annotated CDS, and the specific organism information of the annotated genomic record. Thus this report page can be used to identify the taxonomic range in which that identical protein has been found. The protein report can be downloaded in tabular format using the "Send to" link, and can be accessed using NCBI's programming utilities.
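
For the programming-utilities route in item (a), the sketch below builds an eLink request URL that asks for Protein records linked to a genomic record. The endpoint and the dbfrom, db, and id parameters are taken from the E-utilities documentation. Whether an accession.version string is accepted as the id value, rather than a numeric identifier, is an assumption to verify against that documentation. The helper name is hypothetical, and no network request is made here.

```python
from urllib.parse import urlencode

EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def elink_protein_url(genomic_id):
    """Build an eLink URL from a Nucleotide record to its linked proteins."""
    params = {
        "dbfrom": "nuccore",   # source database: Nucleotide
        "db": "protein",       # target database: Protein
        "id": genomic_id,      # assumed to accept an accession.version
    }
    return EUTILS_ELINK + "?" + urlencode(params)

# NC_002695.1 is the example genome record mentioned in item (b).
url = elink_protein_url("NC_002695.1")
print(url)
```

The response lists identifiers of linked protein records, which can then be retrieved with a separate fetch request. Please follow NCBI's usage guidelines for request rates when scripting against these services.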

Measurable reduction in protein redundancy

Here are some measures for four species that illustrate the significant reduction in protein record redundancy resulting from the use of non-redundant RefSeq proteins (WP_ accessions).

Counts:
Species                   Genomes  Total Proteins  Total Unique WPs  Total Singleton WPs
------------------------- -------  --------------  ----------------  -------------------
Staphylococcus aureus        4194      11,764,898           222,588              138,284
Escherichia coli             2685      13,637,370         1,033,617              649,100
Mycobacterium tuberculosis   1790       7,245,836           139,800              101,255
Salmonella enterica           918       4,099,013           294,106              194,982

Percents:
Species                   Genomes  Percent Reduction (WPs)  Percent Singleton WPs
------------------------- -------  -----------------------  ---------------------
Staphylococcus aureus        4194          98%                    62%
Escherichia coli             2685          94%                    63%
Mycobacterium tuberculosis   1790          98%                    72%
Salmonella enterica           918          93%                    66%

Singletons Per Genome:
Species                   Average Protein Count  Singleton WPs per Genome  Percent Singleton Per Genome
------------------------- ---------------------  ------------------------  ----------------------------
Staphylococcus aureus             2814                       33                        1.17%
Escherichia coli                  5088                      241                        4.74%
Mycobacterium tuberculosis        4046                       56                        1.38%
Salmonella enterica               4485                      212                        4.72%

Definitions

  • "Total Proteins" counts the number of times non-redundant protein accessions are annotated on the set of genomes for the species.
  • "Total Unique WPs" counts the distinct number of non-redundant proteins used across all genomes. This is the truly non-redundant set of proteins for the species.
  • "Total Singleton WPs" counts the number of non-redundant proteins used only once in the set of genomes for the species.
  • "Percent Reduction" measures the compression in protein identifier space gained by using non-redundant protein accessions (WP_ prefix).
  • "Percent Singleton WPs" measures the percent of all non-redundant proteins for that species that are used only once in that species.
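
The definitions above can be turned into arithmetic on the Counts table. The formulas in this sketch are an interpretation, not a published NCBI method: percent reduction is taken as 1 minus unique WPs divided by total proteins, and percent singleton as singleton WPs divided by unique WPs. Under these assumed formulas, three of the four species round to the published percent-reduction values. Escherichia coli comes out near 92% rather than the 94% shown in the table, so the published figures may have been derived slightly differently or from a different snapshot.

```python
# Counts copied from the "Counts" table above:
# species -> (genomes, total proteins, unique WPs, singleton WPs)
COUNTS = {
    "Staphylococcus aureus": (4194, 11_764_898, 222_588, 138_284),
    "Escherichia coli": (2685, 13_637_370, 1_033_617, 649_100),
    "Mycobacterium tuberculosis": (1790, 7_245_836, 139_800, 101_255),
    "Salmonella enterica": (918, 4_099_013, 294_106, 194_982),
}

def redundancy_metrics(genomes, total, unique, singletons):
    """Compute the table metrics under the assumed formulas."""
    return {
        "percent_reduction": 100 * (1 - unique / total),
        "percent_singleton": 100 * singletons / unique,
        "singletons_per_genome": singletons / genomes,
    }

metrics = {name: redundancy_metrics(*row) for name, row in COUNTS.items()}
for name, m in metrics.items():
    print(
        f"{name}: {m['percent_reduction']:.1f}% reduction, "
        f"{m['percent_singleton']:.1f}% singleton, "
        f"{m['singletons_per_genome']:.1f} singletons per genome"
    )
```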

July 13, 2015: Announcing RefSeq Release 71

This full release incorporates genomic, transcript, and protein data available as of July 6, 2015, and includes 77,730,891 records, 52,494,032 proteins, 11,803,354 RNAs, and sequences from 55,267 organisms.

Changes since the previous release:

[1] A list of updated organisms and dbSNP annotation summary is available here:

<ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq71.snp.rpt>

[2] Caenorhabditis elegans:

The Caenorhabditis elegans annotation was updated to correct an identified problem with missing gene symbols and incorrectly labeled non-coding RNAs.

[3] Prokaryotic genomes:

We plan to comprehensively re-annotate bacterial and archaeal genomes for RefSeq release 72 (September 2015). This re-annotation is being carried out to reflect improvements in a) management of partial, very short, and fragmented genes and proteins; and b) protein name management. This re-annotation will also increase consistency of some textual information that is applied to RefSeq records.

Note that re-annotation will not be done for the small set of bacterial reference genomes for which annotation changes are manually maintained.

August 27, 2015: Announcing RefSeq Release 72

This full release incorporates genomic, transcript, and protein data available as of July 6, 2015, and includes 77,730,891 records, 52,494,032 proteins, 11,803,354 RNAs, and sequences from 55,267 organisms. Additional information is available in the Release Notes.

Changes since the previous release:

[1] A list of updated organisms and dbSNP annotation summary is available here:

<ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq72.snp.rpt>

[2] Caenorhabditis elegans:

The Caenorhabditis elegans annotation was updated to correct an identified problem with missing gene symbols and incorrectly labeled non-coding RNAs.

[3] Prokaryotic genomes:

We plan to comprehensively re-annotate bacterial and archaeal genomes for RefSeq release 71 (September 2015). This re-annotation is being carried out to reflect improvements in a) management of partial, very short, and fragmented genes and proteins; and b) protein name management. This re-annotation will also increase consistency of some textual information that is applied to RefSeq records.

Note that re-annotation will not be done for the small set of bacterial reference genomes for which annotation changes are manually maintained.

November 2, 2015: Announcing RefSeq Release 73

This full release incorporates genomic, transcript, and protein data available as of November 2, 2015, and includes 83,881,439 records, 54,766,170 proteins, 12,998,293 RNAs, and sequences from 55,966 organisms. Additional information is available in the Release Notes.

Changes since the previous release:

[1] A list of updated organisms and dbSNP annotation summary is available here:

<ftp://ftp.ncbi.nih.gov/snp/release-notes/RefSeq/refseq73.snp.rpt>

[2] Future change: GI sequence identifiers to be removed from some file formats

As of 06/15/2016, the integer sequence identifiers known as "GIs" will no longer be included in the GenBank, GenPept, and FASTA formats supported by NCBI for the display of sequence records.

Please refer to the FTP release notes for additional details.

Last updated: 2017-12-01T21:32:03Z