Eukaryotic Annotated Genome Submission Guide

Introduction

This guide gives a brief overview of submitting an annotated eukaryotic genome to GenBank using the NCBI command-line program tbl2asn. Annotation is not required for genomes, however, so you can instead submit FASTA files alone, as described in the Genome Submission Guide.

Tbl2asn combines a simple five-column tab-delimited table of feature locations and qualifiers with the DNA sequence (in FASTA format) and the submitter information to generate a file for submission to GenBank. The format of this feature table allows different kinds of features (e.g., gene, mRNA, coding region, tRNA, repeat_region) and qualifiers (e.g., /product, /note) to be indicated. The validator will check for errors such as internal stops in coding regions. See the detailed annotation instructions for those details.

For prokaryotic genomes, see the guidelines for prokaryotic genome submissions.

If you have questions about creating your submission, please email us at genomes@ncbi.nlm.nih.gov before you begin.

Table of Contents

  1. Obtain a locus_tag prefix
  2. Prepare FASTA-formatted sequence
  3. Annotation
  4. Creating the submission file
  5. Submitting
  6. Updating existing genome submissions
  7. Examples

Obtain a locus_tag prefix

If you are submitting a genome with annotation, then you need to have a locus_tag prefix to use for the annotation, as described in the BioProject and BioSample sections of the genome submission guide.

FASTA-formatted sequence

Nucleotide sequences must be in FASTA format. As described in the genome submission guide, FASTA format consists of a single definition line, beginning with a '>' and followed by optional text, and subsequent lines of sequence. At minimum, all definition lines must contain an identifier for the sequence, called the SeqID. Other information about the biological source of the organism can also be encoded on the definition line of the sequence and is used to build the record.

A sample definition line is

>HTE831 [organism=Drosophila yakuba] [strain=HTE831]

Common source modifiers, e.g., [strain=HTE831], may be incorporated into the definition line. Alternatively, these modifiers can be included in the tbl2asn command line with -j.
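A definition line of this shape can be assembled programmatically when there are many sequences. The following is a minimal Python sketch, not part of tbl2asn; the function name build_defline and the rule that the SeqID contains no whitespace are assumptions made for illustration.

```python
def build_defline(seq_id, **modifiers):
    """Return a FASTA definition line of the form '>SeqID [key=value] ...'.

    Assumption: the SeqID is a single token without whitespace, so that it
    can be told apart from the modifiers that follow it.
    """
    if not seq_id or any(ch.isspace() for ch in seq_id):
        raise ValueError("SeqID must be non-empty and contain no whitespace")
    parts = [">" + seq_id]
    # Keyword arguments keep their order, so modifiers appear as given.
    for key, value in modifiers.items():
        parts.append(f"[{key}={value}]")
    return " ".join(parts)


print(build_defline("HTE831", organism="Drosophila yakuba", strain="HTE831"))
# prints: >HTE831 [organism=Drosophila yakuba] [strain=HTE831]
```

The printed line matches the sample definition line shown above.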

An example of a FASTA-formatted sequence is shown in Figure 1 of Eukaryotic Genome Submission Examples.

See the rules for genome sequences.

Annotation

Annotation is optional for GenBank eukaryotic and prokaryotic genomes. However, if you choose to submit with annotation, then the features listed below are the minimum required annotation, although many additional features can be included. It is our hope that the annotation present on any genome will evolve over time as more is known about the biology.

In reviewing eukaryotic genome annotation, NCBI strives to ensure that the annotation is consistent throughout the submission and when compared to other genome submissions. We also strive to present information that is an accurate representation of the known biology. To do this we need your help. Please pay careful attention to the annotation instructions presented here, and please review all your annotation before submitting your genome. Many genomes are annotated by automatic prediction programs, and because these programs do make mistakes, it is up to all of us to try to ensure that the information being presented is as accurate as possible.

A summary of the required annotation is presented below; please also refer to our detailed annotation instructions for our annotation expectations.

Required Annotation

  1. Genes
    • locus_tag
  2. Coding regions of known proteins
    • product (protein) names
    • protein_id
  3. mRNA features
    • transcript_id

Gene features

A gene is defined as a region of biological interest for which a name has been assigned. Gene features are always a single interval, and their location should cover the intervals of all the relevant features such as promoters and polyA binding sites. Gene names should follow the standard nomenclature rules of the particular organism. For example, mouse gene names begin with an uppercase letter, and the remaining letters are lowercase. Please refer to detailed annotation instructions for more information on genes.

locus_tag

The locus_tag is a systematic gene identifier that is assigned to each gene. The locus_tag must be unique for every gene of a genome. Each genome project (i.e., all chromosomes) should have the same unique locus_tag prefix to ensure that a locus_tag is specific for a particular genome project, which is why we require that the locus_tag prefix be registered. In addition, genes may also have functional names as assigned in the scientific literature. For example, KCS_0001 is the systematic gene identifier, while Abc5 is the functional gene name.

The locus_tag prefix should be 3-12 alphanumeric characters, and the first character may not be a digit. Additionally, locus_tag prefixes are case-sensitive. The prefix is followed by an underscore and then an alphanumeric identification number that is unique within the given genome. Other than the single underscore used to separate the prefix from the identification number, no special characters can be used in the locus_tag. Read more about locus_tags and their intended usage. Please refer to the detailed annotation instructions for how to incorporate locus_tags into your annotation table.
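The format rules above can be checked before a table is built. The following Python sketch is one possible pre-submission check, not an NCBI tool; the names LOCUS_TAG_RE, is_valid_locus_tag and find_duplicates are hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

# Prefix: 3-12 alphanumeric characters, the first of which is not a digit.
# Then exactly one underscore, then an alphanumeric identification number.
# Assumption: "alphanumeric" means ASCII letters and digits only.
LOCUS_TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,11}_[A-Za-z0-9]+")


def is_valid_locus_tag(tag):
    """Return True if the whole tag matches the prefix_number pattern."""
    return LOCUS_TAG_RE.fullmatch(tag) is not None


def find_duplicates(tags):
    """Return the sorted locus_tags that occur more than once."""
    seen, dupes = set(), set()
    for tag in tags:
        if tag in seen:
            dupes.add(tag)
        seen.add(tag)
    return sorted(dupes)


print(is_valid_locus_tag("KCS_0001"))   # True
print(is_valid_locus_tag("1KC_0001"))   # False: prefix starts with a digit
print(find_duplicates(["KCS_0001", "KCS_0002", "KCS_0001"]))  # ['KCS_0001']
```

Because prefixes are case-sensitive, the comparison in find_duplicates is deliberately exact, so KCS_0001 and kcs_0001 are treated as different tags.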

CDS (coding region) features

The CDS feature is used to define a protein coding region. All CDS features must have a product qualifier (protein name), a protein_id and a transcript_id. For the product, use a concise name, not a description or phrase. Alternatively, protein names may be denoted by the same symbol as the corresponding gene, with the appropriate capitalization for the organism. In cases where the protein is not known, use "hypothetical protein" as the product name. We recommend the use of "hypothetical protein" because this allows the locus_tag identifier to be appended to the product name in BLAST and Entrez summary lines. Our detailed annotation instructions contain instructions and examples on naming your proteins, as well as on including additional CDS qualifiers such as EC_numbers, protein functions, and descriptive and similarity notes.

protein_id

The submitter must assign an identification number to all proteins. NCBI uses this number to track proteins when sequences are updated. This number is indicated in the table by the CDS qualifier protein_id, and should have the format gnl|dbname|string, where dbname is a version of your lab name that you think will be unique (e.g., SmithUCSD), and string is the unique protein SeqID assigned by the submitter.

The protein_id is used for internal tracking in our database, so it is important that the complete protein_id (dbname + SeqID) not be duplicated by a genome center. Note that when WGS submissions are processed, the dbname in the protein_id is automatically changed to 'WGS:XXXX', where XXXX is the project's accession number prefix. Please see detailed annotation instructions.
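As an illustration of the gnl|dbname|string format, the following Python sketch builds protein_ids and rejects duplicates. The function names are hypothetical, reusing the locus_tag as the protein SeqID is only one possible choice, and the restriction that neither part contains '|' or whitespace is an assumption made so that the parts stay unambiguous.

```python
def make_protein_id(dbname, seq_id):
    """Return a protein_id of the form gnl|dbname|string."""
    for label, value in (("dbname", dbname), ("SeqID", seq_id)):
        # Assumption: '|' and whitespace are avoided inside each part.
        if not value or "|" in value or any(ch.isspace() for ch in value):
            raise ValueError(f"{label} must be non-empty, without '|' or whitespace")
    return f"gnl|{dbname}|{seq_id}"


def assert_unique(protein_ids):
    """Raise ValueError if any complete protein_id occurs more than once."""
    seen = set()
    for pid in protein_ids:
        if pid in seen:
            raise ValueError(f"duplicate protein_id: {pid}")
        seen.add(pid)


ids = [make_protein_id("SmithUCSD", tag) for tag in ("KCS_0001", "KCS_0002")]
assert_unique(ids)
print(ids[0])  # gnl|SmithUCSD|KCS_0001
```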

mRNA features

The submitter must include an mRNA feature for each translated CDS and extend the gene feature to include the entire mRNA. Additionally, the mRNA must have the same product name, protein_id and transcript_id as its corresponding CDS. Each mRNA feature can be either partial or complete. If there is no UTR information, then the mRNA's location will agree with its CDS's location, but the mRNA will be partial at its 5' and 3' ends. If the mRNA is partial, then make the gene partial.

Our detailed annotation instructions contain examples for including complete and partial mRNA features.

transcript_id

The submitter must also include a transcript_id qualifier. The transcript_id is used for internal tracking in our database, so it is important to include a transcript_id as a qualifier for both the CDS and its corresponding mRNA. Each transcript_id must be unique and different from the protein_id. Please see detailed annotation instructions.
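To show how the required gene, mRNA and CDS features fit together, the following Python sketch writes a feature table entry for a single-interval, plus-strand gene with no UTR information. It is an illustration only: the ">Feature SeqID" header line, the three leading tabs on qualifier rows and the "<" and ">" partial markers reflect the five-column table format as commonly used with tbl2asn and should be confirmed against the detailed annotation instructions; the function names, the SeqID contig1 and the "mrna." naming of the transcript_id are hypothetical.

```python
def feature_block(start, stop, key, qualifiers, partial=False):
    """Return table rows for one single-interval, plus-strand feature."""
    begin = f"<{start}" if partial else str(start)   # '<' marks a partial 5' end
    end = f">{stop}" if partial else str(stop)       # '>' marks a partial 3' end
    rows = [f"{begin}\t{end}\t{key}"]
    for name, value in qualifiers:
        # Qualifier rows leave the first three columns empty.
        rows.append(f"\t\t\t{name}\t{value}")
    return rows


def gene_without_utr(seq_id, start, stop, locus_tag, product,
                     protein_id, transcript_id):
    """Gene, mRNA and CDS for a coding region with no UTR information.

    The mRNA shares the CDS location but is partial at both ends, and the
    gene is made partial to match; the CDS itself stays complete.
    """
    shared = [("product", product),
              ("protein_id", protein_id),
              ("transcript_id", transcript_id)]
    rows = [f">Feature {seq_id}"]
    rows += feature_block(start, stop, "gene", [("locus_tag", locus_tag)], partial=True)
    rows += feature_block(start, stop, "mRNA", shared, partial=True)
    rows += feature_block(start, stop, "CDS", shared)
    return "\n".join(rows) + "\n"


print(gene_without_utr("contig1", 100, 900, "KCS_0001", "hypothetical protein",
                       "gnl|SmithUCSD|KCS_0001", "gnl|SmithUCSD|mrna.KCS_0001"),
      end="")
```

The mRNA and CDS receive the same product, protein_id and transcript_id from one shared list, which keeps the two features from drifting apart. Multi-exon and minus-strand genes need additional location rows and are not covered by this sketch.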

Creating the submission file

The submission file can be generated using tbl2asn or Genome Workbench. tbl2asn is a simple command-line program that automates parts of the submission process, which makes it especially useful for projects that have multiple sequences. The newest version is available by anonymous FTP. The main difference between the two is that Genome Workbench is a menu-driven program with a graphical interface, while tbl2asn is run from the command line. See the Genome Submission Guide for specific information on generating a genome submission.

For both programs the sequence must be in a file or files in FASTA format, and the annotation must be in a file or files in the five column tab-delimited feature table format, as described above.

tbl2asn

If you choose to use tbl2asn, then the basic instructions follow, but more detail is in the .sqn instructions of the Genome Submission Guide.

tbl2asn reads a template along with the sequence and table files, and outputs ASN.1 for submission to GenBank. tbl2asn requires that the sequence and annotation files follow specific naming conventions. The FASTA-formatted sequence file has ".fsa" as an extension, and the five-column tab-delimited table file has ".tbl" as an extension. The base name of the .tbl file must be identical to that of the .fsa file for tbl2asn to recognize it and to include the annotation in the output ".sqn" file that it generates.
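Because the pairing depends only on file names, a mismatch can be caught before tbl2asn is run. The following Python sketch, with the hypothetical function name pair_inputs, lists which base names have both files and which have only one.

```python
from pathlib import Path


def pair_inputs(directory):
    """Group .fsa and .tbl files in a directory by their shared base name."""
    folder = Path(directory)
    fsa = {path.stem for path in folder.glob("*.fsa")}
    tbl = {path.stem for path in folder.glob("*.tbl")}
    return {
        "paired": sorted(fsa & tbl),     # sequence and annotation both present
        "fsa_only": sorted(fsa - tbl),   # sequence without a feature table
        "tbl_only": sorted(tbl - fsa),   # feature table with no sequence file
    }
```

For example, a directory holding chr1.fsa, chr1.tbl and chr2.fsa would report chr1 as paired and chr2 as fsa_only. An entry under tbl_only usually points to a misspelled base name, since that table has no sequence file to be matched with.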

Start a submission using the GenBank submission template.

As described in the general instructions, run:

tbl2asn -p path_to_files -t template -M n -Z discrep

If the sequences contain Ns that represent gaps, then run the appropriate tbl2asn command line with the -l and -a arguments, as described in Gapped Format for Genome Submissions.

Additional command line arguments can be seen on the tbl2asn page.

In the directory specified by -p, the program looks for corresponding pairs of *.fsa and *.tbl files, and builds ASN.1 records named *.sqn for these pairs. The results of the validation (error checking) will be in *.val files. Note that if there are no .tbl files in the directory, then tbl2asn will still generate .sqn files from the .fsa files that are present.
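When the run is scripted, the command and the follow-up check of the validation output can be wrapped as below. This is a minimal Python sketch under stated assumptions: the flags are copied verbatim from the command in this guide and nothing else is added, the function names and the template file name template.sbt are hypothetical, and treating every non-empty .val file as one that needs review is an assumption rather than documented tbl2asn behavior.

```python
import subprocess
from pathlib import Path


def tbl2asn_command(files_dir, template):
    """Build the argument list for the command shown in this guide."""
    # Flags are copied verbatim from the guide: -p, -t, -M n, -Z discrep.
    return ["tbl2asn", "-p", str(files_dir), "-t", str(template),
            "-M", "n", "-Z", "discrep"]


def run_tbl2asn(files_dir, template):
    """Run tbl2asn; requires the program to be installed and on PATH."""
    return subprocess.run(tbl2asn_command(files_dir, template), check=True)


def val_files_with_content(files_dir):
    """Return the names of .val files that are not empty.

    Assumption: a non-empty .val file holds validator messages to review.
    """
    folder = Path(files_dir)
    return sorted(p.name for p in folder.glob("*.val") if p.stat().st_size > 0)


print(" ".join(tbl2asn_command("path_to_files", "template.sbt")))
# prints: tbl2asn -p path_to_files -t template.sbt -M n -Z discrep
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the directory path contains spaces.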

Go to the genome submission guide for specific information about generating a genome submission, and to the detailed annotation instructions to ensure that you have included the annotation correctly.

Be sure to check the output of the validation and discrepancy reports and fix any problems, as described

Once the errors have been fixed, the .sqn files can be submitted to GenBank.

Genome Workbench

If you choose to use Genome Workbench to make your submission file, then follow the directions at https://ncbiinsights.ncbi.nlm.nih.gov/2019/07/09/genome-workbench-3-0-now-with-support-for-preparing-genbank-genome-submissions/ (a video tutorial is also available at https://www.youtube.com/watch?v=BN9e4ma10kA).

Check the detailed annotation instructions to ensure that you have included the annotation correctly. Be sure to validate and fix the errors. Run the Discrepancy Report if your submission has annotation by choosing the Submitter Report option in the Reports menu, and fix any problematic annotation detected. Contact genomes@ncbi.nlm.nih.gov with questions about any errors or discrepancy report output.

GFF/GTF files

Some users have annotation in the format of a GFF or GTF file. If you have this type of annotation, you can use the command line program table2asn_GFF (table2asn) to make your submission. The instructions are at https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/genbank/genomes_gff/. Note that table2asn is the improved version of tbl2asn and will soon replace it.

Submitting

Genomes are submitted to us via the Submission Portal, as described in the genome submission guide.

Updating a genome

See the Updating Information on GenBank Genome Records page for information about updating various aspects of a genome assembly.

For example, when a complete genome or chromosome is updated, the proteins should be tracked to the update. To do this, proteins from the original submission that are present in the update must have the same identifiers that were used in the original submission, plus the accession numbers that were assigned when the submission was loaded into GenBank. These identifiers are included in the protein_id of the update in this format:

gnl|dbname|SeqID|gb|accession_number

where the dbname and SeqID are the values used in the original submission, and the accession number was assigned by GenBank.

When your genome is released, we will supply you with a table that has each protein SeqID and protein accession number, so that you can use those in future updates. If you did not receive this table and need to update your genome, contact us at genomes@ncbi.nlm.nih.gov prior to the preparation of your submission.
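Once the SeqID-to-accession table is in hand, the update identifiers can be generated mechanically. The following Python sketch assumes the table has already been read into a dictionary keyed by protein SeqID; the function names are hypothetical, the accession values are placeholders rather than real accessions, and the layout of the supplied table is not assumed here.

```python
def make_update_protein_id(dbname, seq_id, accession):
    """Return gnl|dbname|SeqID|gb|accession_number for an updated protein."""
    return f"gnl|{dbname}|{seq_id}|gb|{accession}"


def update_protein_ids(dbname, accession_by_seq_id):
    """Map each original protein SeqID to its protein_id for the update."""
    return {
        seq_id: make_update_protein_id(dbname, seq_id, accession)
        for seq_id, accession in accession_by_seq_id.items()
    }


# Placeholder accessions, for illustration only.
supplied = {"KCS_0001": "ABC12345", "KCS_0002": "ABC12346"}
for protein_id in update_protein_ids("SmithUCSD", supplied).values():
    print(protein_id)
# prints:
# gnl|SmithUCSD|KCS_0001|gb|ABC12345
# gnl|SmithUCSD|KCS_0002|gb|ABC12346
```

Proteins that are new in the update have no accession yet and would not appear in the supplied table, so they are not covered by this mapping.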


Last updated: 2021-03-11T15:51:29Z