About RefSeq

The Reference Sequence (RefSeq) collection provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins. RefSeq sequences form a foundation for medical, functional, and diversity studies. They provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis (especially RefSeqGene records), expression studies, and comparative analyses.

RefSeq genomes are copies of selected assembled genomes available in GenBank. RefSeq transcript and protein records are generated by several processes including:

Computation
- Eukaryotic Genome Annotation Pipeline
- Prokaryotic Genome Annotation Pipeline
Manual curation
Propagation from annotated genomes that are submitted to members of the International Nucleotide Sequence Database Collaboration (INSDC)

Scope

NCBI provides RefSeqs for taxonomically diverse organisms including archaea, bacteria, eukaryotes, and viruses. Reference sequences are provided for genomes, transcripts, and proteins. RefSeq also includes some targeted loci projects: RefSeqGene, fungal ITS, and rRNA loci. New or updated records are added to the collection as data become publicly available.

RefSeq Growth Statistics

Data Access and Availability

RefSeq is accessible via BLAST, Entrez, and the NCBI FTP site (RefSeq releases and RefSeq Genomes). Information is also available in NCBI's Assembly, Genomes, and Gene resources, and for some organisms additional information is available in NCBI's genome browser, Genome Data Viewer. Special properties have been defined to facilitate Entrez-based retrieval. See also: Entrez Query Hints.
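As a rough illustration of Entrez-based retrieval, the sketch below only builds request URLs for the NCBI E-utilities `efetch` and `esearch` endpoints; it does not contact the network. The helper names and the example accession are hypothetical choices for this illustration, and the use of `srcdb_refseq[prop]` as the RefSeq-restricting property is an assumption to confirm against Entrez Query Hints.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL that requests one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def esearch_refseq_url(term: str, db: str = "nuccore") -> str:
    """Build an esearch URL whose query is limited to RefSeq records.

    Assumption: srcdb_refseq[prop] is the property used to restrict
    results to RefSeq; check Entrez Query Hints before relying on it.
    """
    query = f"({term}) AND srcdb_refseq[prop]"
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode({'db': db, 'term': query})}"


if __name__ == "__main__":
    # Illustrative accession only.
    print(efetch_fasta_url("NM_000546"))
    print(esearch_refseq_url("BRCA1[gene]"))
```

Passing `db="protein"` targets protein records instead of nucleotide records. Fetching the resulting URL is left to the caller, who should also follow NCBI's usage guidelines for request rates.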

Distinguishing Features

The main features of the RefSeq collection include:

non-redundancy
explicitly linked nucleotide and protein sequences
updates to reflect current knowledge of sequence data and biology
data validation and format consistency
distinct accession series (all accessions include an underscore '_' character)
ongoing curation by NCBI staff and collaborators, with reviewed records indicated
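Because every RefSeq accession contains an underscore, a simple pattern check can separate RefSeq-style accessions from other identifiers. The sketch below is a heuristic, not an official validator: the regular expression assumes a two-letter prefix, an underscore, an alphanumeric body, and an optional version suffix, and the function name is hypothetical.

```python
import re

# Assumed general shape: two uppercase letters, an underscore, an
# alphanumeric body, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def looks_like_refseq(accession: str) -> bool:
    """Return True if the string has the RefSeq-style underscore format."""
    return REFSEQ_PATTERN.match(accession.strip()) is not None


if __name__ == "__main__":
    # Illustrative identifiers; the last one lacks the underscore.
    for acc in ("NM_000546.6", "NZ_CP011286.1", "U49845.1"):
        print(acc, looks_like_refseq(acc))
```

A match only says that the string is formatted like a RefSeq accession; it does not confirm that the record exists.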

RefSeq Production Processes and Policy

RefSeq records are derived from publicly available sequence data; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record. NCBI Reference Sequences are provided through the separate processes described below.

This page provides a brief overview of the RefSeq production processes. Also see:

- NCBI Handbook, RefSeq chapter
- NCBI Handbook, Genome Annotation chapter
- RefSeq Prokaryotic Genomes
- Eukaryotic genome annotation policy

Collaboration

For some organisms, the annotated RefSeq records are provided by collaborating groups. Depending on the organism, collaborations may be established at the whole-genome level, or smaller collaborations may be established for gene families.

Whole-genome collaborations include records for Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. When such a collaboration is established, the primary sequence-level review is carried out by the collaborating group. Processing of annotated genome data submitted by collaborations is semi-automated: data is provided by a collaborating group and validated at NCBI to detect obvious errors (e.g., the annotated CDS location is not capable of encoding the provided protein) and to apply the annotation in a more uniform way. NCBI processing may integrate additional information such as nomenclature or other descriptive data. NCBI staff do not carry out additional manual curation of these records. NCBI may update the records to correct a general format problem, but otherwise these records are updated only when the collaborating group provides an update. If errors are reported, NCBI staff relay that information to the collaborating group.
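The kind of check mentioned above, whether an annotated CDS can encode the provided protein, can be sketched as follows. This is a minimal illustration and not NCBI's validation code: it assumes the standard genetic code and an already spliced, forward-strand CDS, and it ignores exceptions such as selenoproteins or non-AUG starts. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a CDS up to (not including) the first stop codon."""
    cds = cds.upper().replace("U", "T")
    residues = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an ambiguous codon
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def cds_encodes_protein(cds: str, protein: str) -> bool:
    """Return True if the CDS translates exactly to the given protein."""
    if len(cds) % 3 != 0:
        return False  # an incomplete final codon cannot be a clean CDS
    return translate(cds) == protein.upper().rstrip("*")


if __name__ == "__main__":
    print(cds_encodes_protein("ATGGCCAAATAA", "MAK"))
    print(cds_encodes_protein("ATGGCCAAATAA", "MAR"))
```

An internal stop codon truncates the translation, so the comparison fails for any protein that extends past it.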

RefSeq records supplied by collaboration indicate the submitting group on the record, as a direct submission Reference citation, in the COMMENT block, or both. The RefSeq status (e.g., REVIEWED) is either indicated by the collaborating group or inferred from the supplied annotation.

Genome Assembly & Annotation Pipeline

NCBI provides annotation for some assembled genomic sequence data, including human, mouse, rat, honey bee, chicken, chimpanzee, and others. This pipeline is automated, and data is refreshed periodically. The model RefSeq records produced by this pipeline have distinguishing accession prefixes (XM_, XR_, XP_), are derived from the genomic sequence, have varying levels of transcript or protein homology support, and are not subject to further manual curation.

Definitions:

Model RefSeq : RNA and protein products that are generated by the eukaryotic genome annotation pipeline. These records use accession prefixes XM_, XR_, and XP_.
Known RefSeq : RNA and protein products that are mainly derived from GenBank cDNA and EST data and are supported by the RefSeq eukaryotic curation group. These records use accession prefixes NM_, NR_, and NP_.
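The two definitions above can be turned into a small lookup on the accession prefix. This is a sketch using only the six prefixes listed here; the function name is hypothetical, and any other prefix is reported as "other" rather than guessed.

```python
# Prefixes taken from the definitions above.
MODEL_PREFIXES = {"XM_", "XR_", "XP_"}
KNOWN_PREFIXES = {"NM_", "NR_", "NP_"}


def classify_refseq(accession: str) -> str:
    """Classify an accession as 'model', 'known', or 'other' by prefix."""
    prefix = accession.strip()[:3].upper()
    if prefix in MODEL_PREFIXES:
        return "model"
    if prefix in KNOWN_PREFIXES:
        return "known"
    return "other"


if __name__ == "__main__":
    # Illustrative accession strings only.
    for acc in ("XM_011541469.2", "NP_000537.3", "NG_007400.1"):
        print(acc, classify_refseq(acc))
```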

NCBI curation of eukaryotic transcript and protein sequences:

RefSeq transcript and protein records for a subset of organisms, primarily mammals, are curated by NCBI staff. Curation is an ongoing process and some records have not yet been reviewed; the curation status is indicated on the RefSeq record in the COMMENT block. Some records representing genomic regions (accession prefix NG_) are provided specifically to support more comprehensive genome-level annotation. The curated RefSeq records are created via a process that includes automated computational methods, collaboration, and manual data review by NCBI staff. This process is further described in the NCBI Handbook, RefSeq chapter.

A combined approach uses both collaborator-supplied sequence information and automated BLAST analysis to provide an initial RefSeq record. Records are subject to validation to correct annotation errors and provide annotation in a more consistent format. Descriptive information, including official nomenclature and additional citations, is applied to the records. These initial records have a PROVISIONAL, PREDICTED, or INFERRED status.

Additional manual curation is applied to this set of RefSeq records to provide the optimal sequence record, and to fix sequence errors including mis-association with a locus (as might occur for closely related gene families), chimeric sequences, vector or linker contamination, or apparent sequencing errors. Both the nucleotide and protein sequence record may change due to this process. Sequence level review is carried out primarily by NCBI staff but some records are provided via collaboration. These records have a VALIDATED status. Additional annotation, a summary description, and other functional information may be applied, as available, during the sequence review process. These records have a REVIEWED status.
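The status progression described in the two paragraphs above can be summarized in a short sketch. It is a simplified reading of this page, not NCBI's implementation: it assumes a record keeps its initial status until sequence-level review, becomes VALIDATED after that review, and becomes REVIEWED once additional annotation has also been applied. The function and parameter names are hypothetical.

```python
from enum import Enum


class Status(Enum):
    INFERRED = "INFERRED"
    PREDICTED = "PREDICTED"
    PROVISIONAL = "PROVISIONAL"
    VALIDATED = "VALIDATED"
    REVIEWED = "REVIEWED"


# Statuses an automatically generated record may start with.
INITIAL_STATUSES = {Status.INFERRED, Status.PREDICTED, Status.PROVISIONAL}


def status_after_curation(
    initial: Status, sequence_reviewed: bool, annotation_added: bool
) -> Status:
    """Return the status implied by the curation steps completed so far."""
    if initial not in INITIAL_STATUSES:
        raise ValueError(f"{initial.value} is not an initial status")
    if not sequence_reviewed:
        return initial  # no manual sequence review yet
    # Assumption: annotation review only counts after sequence review.
    return Status.REVIEWED if annotation_added else Status.VALIDATED


if __name__ == "__main__":
    print(status_after_curation(Status.PROVISIONAL, True, False).value)
    print(status_after_curation(Status.PROVISIONAL, True, True).value)
```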

The process flow includes the following steps:

Initial Automatic Processing:
- Automatic processing and FTP downloads from collaborating groups provide an initial definition of the gene and sequence associations
- Validation and QA evaluations check for data conflicts and data completeness
- If the data pass the QA phase, a RefSeq record is provided automatically. The initial RefSeq record will have a status of INFERRED, PREDICTED, or PROVISIONAL and may include enhanced feature annotation including:
  - Publications
  - Names, Symbols, Aliases
  - GeneID number
  - Cross-references to other databases
  - Map information
Curation Processing (QA failures and other genes):
- Gather available data
- Review gene-to-sequence associations: data conflicts are resolved through NCBI staff review together with collaborating databases; this review process is critical for accurately representing closely related genes.
- Curation may provide further enhancements to the RefSeq transcript and/or protein records including:
  - Sequence information
    - remove vector, linker contamination
    - extend UTR
    - represent the optimal sequence by correcting sequencing errors or choosing which polymorphic variant to represent, as identified in published reports, via in-house sequence analysis, or per personal communication
    - represent splice variant records when there is sufficient unambiguous data available
  - Annotation information:
    - Add publications
    - Add a summary description about the gene and protein function
    - Add a description of transcript variants
    - Add feature annotations such as mature protein products, poly-adenylation signals and sites
    - Ensure correct representation of atypical biology such as selenoproteins, ribosomal slippage, or non-AUG translation initiation sites.
Multiple collaborations support this process.

Since there is a strong manual curation component in this pipeline, input from the research community is especially welcome to further improve the quality of this dataset. The RefSeq records generated by this pipeline are used as a reagent in the genome assembly & annotation pipeline (see above).
