Annotating Genomes with GFF3 or GTF files

This page describes how to create an annotated genome submission from GFF3 or GTF files, using the beta version of our process. Note that you can always use GenBank's standard 5-column feature table (see Prokaryotic Annotation Guidelines or Eukaryotic Annotation Guidelines) as input instead.

Basic format

A 9-column annotation file conforming to the GFF3 or GTF specifications can be used for genome annotation submission. The basic characteristics of the file formats are described at:

The GFF3 format is better described and allows for a richer annotation, but GTF will also work for many submissions. This documentation focuses on GFF3 formatting conventions, but GTF conventions to use for submission are similar. Several basic validators are available to verify that a GFF3 file is syntactically valid:

Note that these standalone validators will not detect all formatting and annotation issues, and that the GenBank annotation submission software tolerates some common GFF3 formatting issues. The validators are still useful for initial testing, especially if an input file isn't working as expected.

GenBank-specific requirements

An additional set of rules, specific attributes (equivalent to INSDC qualifiers), and automatic processing are utilized for submission of annotated genomes to GenBank. These additions are:

Formatting requirements

[1] The seqid in GFF3/GTF column 1 should match the seqid in the corresponding FASTA or ASN.1 file that is being annotated. For assemblies already in GenBank, seqids will be matched to their corresponding accessions if they are the same as what was used in the original submission. [The seqid is the text between the '>' and the first space in the FASTA definition line; do not include the '>' in the GFF file.]
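The seqid rule in point [1] can be checked before running any submission software. The following is a minimal sketch, not part of the GenBank tools: the function names and the sample data are made up for illustration, and it treats any whitespace (not only a space) as the end of the seqid on the FASTA definition line.

```python
def fasta_seqids(fasta_lines):
    """Collect seqids: the text between '>' and the first whitespace."""
    seqids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                seqids.add(fields[0])
    return seqids


def gff_seqids(gff_lines):
    """Collect the distinct column 1 values from feature rows."""
    seqids = set()
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # anything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


# Illustrative data only; in practice read the lines from your own files.
fasta = [">contig_1 Loa loa isolate F231", "ACGTACGT", ">contig_2", "GGCC"]
gff = [
    "##gff-version 3",
    "contig_1\tsrc\tgene\t1\t8\t.\t+\t.\tID=gene1",
    "contig_3\tsrc\tgene\t1\t4\t.\t+\t.\tID=gene2",
]
unmatched = sorted(gff_seqids(gff) - fasta_seqids(fasta))
print(unmatched)  # ['contig_3']
```

Any seqid reported here appears in the annotation file but has no matching sequence, so it would need to be renamed or removed before submission.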

[2] contig, supercontig, chromosome and similar landmark features are not required and will be ignored.

[3] multi-exon mRNA and other RNA features can be represented using any of:

[a] child exon features

[b] child five_prime_UTR, CDS, and three_prime_UTR features

[c] multiple RNA feature rows with the same ID

Furthermore, whereas the GFF3 specifications require that all rows of a multi-exon CDS feature use the same ID, some commonly used software deviates from this requirement. To allow for deviations from the specifications, for eukaryotes the GenBank software assumes that multiple CDS rows with the same Parent attribute represent parts of the same CDS feature. Multiple CDS features for the same gene need to be annotated by using a separate mRNA Parent feature for each, so there is always a 1:1 relationship of mRNA to CDS, as in the following schematic:

gene1            ================================    ID=gene1
mRNA1            ================================    ID=mRNA1;Parent=gene1
five_prime_UTR   ==                                  Parent=mRNA1
CDS1               ==....=====...........==          Parent=mRNA1 (3 rows)
three_prime_UTR                            ======    Parent=mRNA1
mRNA2            ================================    ID=mRNA2;Parent=gene1
exon             ====                                Parent=mRNA2
CDS2               ==....................==          Parent=mRNA2 (2 rows)
exon                                     ========    Parent=mRNA2
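The grouping rule illustrated by the schematic can be sketched in a few lines: CDS rows are collected by their Parent attribute, and each Parent is read as one CDS feature. This is an illustration of the rule as described here, not the GenBank implementation; the function names are hypothetical, the coordinates are made up, and only the gene, mRNA and CDS rows of the schematic are included.

```python
from collections import Counter


def parent_of(column9):
    """Return the Parent value from a GFF3 column 9 string, or ''."""
    for pair in column9.strip().split(";"):
        tag, _, value = pair.partition("=")
        if tag.strip() == "Parent":
            return value.strip()
    return ""


def cds_rows_per_parent(gff_lines):
    """Count CDS rows under each Parent; one Parent is read as one CDS feature."""
    counts = Counter()
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9 and not line.startswith("#") and cols[2] == "CDS":
            counts[parent_of(cols[8])] += 1
    return dict(counts)


# Illustrative rows mirroring the schematic; coordinates are made up.
rows = [
    "chr1\tsrc\tgene\t1\t3200\t.\t+\t.\tID=gene1",
    "chr1\tsrc\tmRNA\t1\t3200\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "chr1\tsrc\tCDS\t201\t400\t.\t+\t0\tParent=mRNA1",
    "chr1\tsrc\tCDS\t801\t1300\t.\t+\t1\tParent=mRNA1",
    "chr1\tsrc\tCDS\t2401\t2600\t.\t+\t2\tParent=mRNA1",
    "chr1\tsrc\tmRNA\t1\t3200\t.\t+\t.\tID=mRNA2;Parent=gene1",
    "chr1\tsrc\tCDS\t201\t400\t.\t+\t0\tParent=mRNA2",
    "chr1\tsrc\tCDS\t2401\t2602\t.\t+\t1\tParent=mRNA2",
]
print(cds_rows_per_parent(rows))  # {'mRNA1': 3, 'mRNA2': 2}
```

A Parent key that turns out to be a gene ID rather than an mRNA ID, or an empty key, would suggest that two alternative CDS features are sharing one parent and need separate mRNA features.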

Changes that occur during processing

[1] CDS features that don't include but are adjacent to a stop codon will be automatically extended 1-3 bp to include the stop codon. start_codon and stop_codon features are not required in either GFF3 or GTF.

[2] gene and mRNA features are useful but NOT required. If they are omitted, and only CDS features are provided, then gene and/or mRNA features will be created on-the-fly based on the corresponding CDS feature. mRNA features are auto-created for eukaryote genome annotation submissions when the appropriate argument is included in the command line, and are normally omitted for prokaryotes.

[3] The partialness markup on gene, mRNA, and CDS features is computed automatically based on the completeness of the CDS feature at either end. There is no need to specify attributes in column 9, and any attributes that are sometimes used to specify partialness, such as start_range or end_range, will be ignored.

Attributes/Annotation features

[1] Many SO feature types are recognized in column 3 and converted to their INSDC equivalents. Commonly used types are:

  • gene
  • CDS
  • mRNA
  • exon
  • five_prime_UTR
  • three_prime_UTR
  • rRNA
  • tRNA
  • ncRNA
  • tmRNA
  • transcript
  • mobile_genetic_element
  • origin_of_replication
  • promoter
  • repeat_region

Some SO types may need to be changed before processing in order to be properly recognized:

[a] all gene features should use "gene". More specific SO types like rRNA_gene, miRNA_gene, tRNA_gene, pseudogene, pseudogenic_tRNA, and others should be converted to use "gene" instead.

[b] misc_RNA is sometimes used for a generic RNA feature type, but it is not a recognized SO term. Use "transcript" instead.

Feature types that aren't recognized will be automatically dropped and reported in the log file. Feature types that are always ignored (so not reported in the log file) are:

  • intron
  • protein
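The column 3 conversions described in [a] and [b] are simple text substitutions. A minimal sketch, assuming tab-delimited 9-column rows; the function name is hypothetical and the set of gene-like types contains only the ones named above, so extend it for your own data.

```python
# Gene-like SO types named in the text above; extend this set as needed.
GENE_LIKE_TYPES = {
    "rRNA_gene", "miRNA_gene", "tRNA_gene", "pseudogene", "pseudogenic_tRNA",
}


def normalize_feature_type(gff_line):
    """Rewrite column 3: gene-like types to 'gene', misc_RNA to 'transcript'."""
    line = gff_line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 9:
        return line  # directives, comments and malformed rows pass through
    if cols[2] in GENE_LIKE_TYPES:
        cols[2] = "gene"
    elif cols[2] == "misc_RNA":
        cols[2] = "transcript"
    return "\t".join(cols)


row = "chr1\tsrc\ttRNA_gene\t100\t172\t.\t+\t.\tID=gene7"
print(normalize_feature_type(row).split("\t")[2])  # gene
```

Converting a pseudogene row to "gene" this way drops the information that it is a pseudogene, so the pseudogene=<TYPE> attribute still has to be added separately in column 9.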

[2] pseudogenes should be flagged with pseudogene=<TYPE> qualifier in column 9 on the gene feature and optionally on any child features. Further details about the TYPE values allowed for the pseudogene qualifier are available at: http://www.insdc.org/documents/pseudogene-qualifier-vocabulary .

[3] annotate with pseudo=true any genes that are 'broken' but are not thought to be pseudogenes. These are genes that do not encode the expected translation, for example because of internal stop codons. These are often caused by problems with the sequence and/or assembly.

[4] gene features require locus_tag qualifiers. GFF3 ID attributes are not used for the locus_tag qualifier, so if the ID is applicable as the locus_tag, it should be copied into that attribute with the appropriate formatting. The locus_tags can be provided either by:

[a] adding a locus_tag= attribute to column 9. This option should be used for annotation updates to keep the existing locus_tags where appropriate.

[b] using the -locus-tag-prefix option in the command line and specifying the prefix to use so that the software assigns locus_tags automatically
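For option [a], when the GFF3 ID of a gene already has the form of a locus_tag, copying it into a locus_tag attribute can be scripted. The sketch below is one possible approach under that assumption; the function name is hypothetical and ABC12 is a placeholder, not a registered prefix. IDs that do not start with the prefix followed by an underscore are left alone.

```python
def add_locus_tag(gff_line, prefix):
    """Copy ID into locus_tag on gene rows when ID already looks like prefix_xxx."""
    line = gff_line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 9 or cols[2] != "gene":
        return line
    pairs = [p for p in cols[8].split(";") if p]
    tags = dict(p.split("=", 1) for p in pairs if "=" in p)
    gene_id = tags.get("ID", "")
    # Never overwrite an existing locus_tag: updates must keep the old ones.
    if "locus_tag" not in tags and gene_id.startswith(prefix + "_"):
        pairs.append("locus_tag=" + gene_id)
        cols[8] = ";".join(pairs)
    return "\t".join(cols)


row = "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=ABC12_00010"
print(add_locus_tag(row, "ABC12").split("\t")[8])
# ID=ABC12_00010;locus_tag=ABC12_00010
```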

[5] mRNA and CDS features require transcript_id and protein_id qualifiers, respectively. Provide both of them or neither of them: either follow [a] and [b] together, or follow [c] alone.

[a] adding transcript_id= attributes to mRNA (and other RNA) features, using the format:

  • transcript_id=gnl|dbname|ID

Where dbname is either the locus_tag prefix, or WGS:XXXX (for assemblies that have already been assigned a WGS accession prefix). Further details are available in the eukaryotic annotation guidelines.

[b] adding protein_id attributes to the CDS features, using one of these formats:

  • protein_id=gnl|dbname|ID
  • protein_id=gnl|dbname|ID|gb|accession

"gb|accession" is only applicable for annotation updates, and is not required to reuse existing protein accessions if the same dbname and ID are provided. Further details are available in the eukaryotic annotation guidelines.

[c] both transcript_id and protein_id can be omitted, and they will be generated automatically based on the IDs of the mRNA/CDS and gene locus_tag prefix. These qualifiers do not appear in the flatfile view, so if the GFF3 IDs are meant to be seen in that view, then they should be copied into a 'note' attribute with the appropriate formatting. However, annotation updates should include the generated protein_ids on CDS features described in point [5b] to allow protein accessions to be preserved appropriately.
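The identifier formats in [a] and [b], and the both-or-neither rule, can be expressed as small helpers. This is a sketch for illustration only: the function names are hypothetical, and ABC12 and the local IDs are placeholders rather than registered values.

```python
def transcript_id(dbname, local_id):
    """Format a transcript_id value as gnl|dbname|ID."""
    return f"gnl|{dbname}|{local_id}"


def protein_id(dbname, local_id, accession=None):
    """Format a protein_id value; the gb|accession suffix is for updates only."""
    value = f"gnl|{dbname}|{local_id}"
    if accession:
        value += f"|gb|{accession}"
    return value


def ids_are_paired(mrna_attrs, cds_attrs):
    """True when transcript_id and protein_id are both present or both absent."""
    return ("transcript_id" in mrna_attrs) == ("protein_id" in cds_attrs)


# ABC12 and the IDs below are placeholders, not registered values.
print(transcript_id("ABC12", "mrna.ABC12_00010"))  # gnl|ABC12|mrna.ABC12_00010
print(protein_id("ABC12", "ABC12_00010"))          # gnl|ABC12|ABC12_00010
print(ids_are_paired({"ID": "mRNA1"}, {"protein_id": "gnl|ABC12|ABC12_00010"}))  # False
```

The last call returns False because the CDS carries a protein_id while its mRNA has no transcript_id, which is the mixed case that point [5] rules out.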

[6] GFF3 ID attributes are primarily used just for interpreting parent-child feature relationships.

  • They are not automatically used for the locus_tag qualifier, so if the ID is applicable as the locus_tag, it should be copied into that attribute with the appropriate formatting.
  • However, if no transcript_id or protein_id qualifiers are present, then the GFF3 ID attribute will be used as the basis of those qualifiers, as described in point [5c]. These qualifiers do not appear in the flatfile view, so if the GFF3 IDs are meant to be seen in that view, then they should be copied into a 'note' attribute with the appropriate formatting.

[7] GFF3 Name attributes are ignored.

[8] product names are specified using a product= attribute on a CDS or RNA feature.

  • Names should conform to GenBank guidelines.
  • Multiple names can be specified by providing the primary name first, and additional names as a comma-separated list.
  • Commas that are intended to be part of a name should be encoded (%2C) according to the GFF3 specifications. However, literal commas should only be included when they are part of enzymatic names. Semi-colons generally should not be included in product names.
  • If a CDS feature does not specify a product name, it will be automatically named 'hypothetical protein'.
  • If an mRNA feature does not specify a product name, it will automatically inherit the name from the CDS.
  • Product names should be provided for tRNAs, rRNAs and ncRNAs in GFF3/GTF submission files.
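The comma handling for product names can be sketched as follows: commas inside a name are percent-encoded, while the commas that separate alternative names stay literal. The helper names are hypothetical, the list of reserved characters is an assumption based on the GFF3 column 9 escaping rules, and the example names are illustrative.

```python
# Characters with reserved meaning in GFF3 column 9; '%' must be handled first
# so that the codes produced for the other characters are not re-encoded.
RESERVED = [("%", "%25"), (";", "%3B"), ("=", "%3D"), ("&", "%26"), (",", "%2C")]


def encode_value(text):
    """Percent-encode reserved characters inside a single attribute value."""
    for char, code in RESERVED:
        text = text.replace(char, code)
    return text


def product_attribute(names):
    """Build product=<primary>,<additional...> with each name encoded."""
    return "product=" + ",".join(encode_value(name) for name in names)


names = ["2,3-bisphosphoglycerate-independent phosphoglycerate mutase", "iPGM"]
print(product_attribute(names))
# product=2%2C3-bisphosphoglycerate-independent phosphoglycerate mutase,iPGM
```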

[9] Most INSDC qualifiers that can be used for submission in a conventional 5-column .tbl file will also work if provided as attributes in column 9 of a GFF3 input file. Multiple values for a qualifier should be provided as a comma-separated list. Commonly used attributes/qualifiers include:

[a] attributes described above in more detail:

  • locus_tag=<tag>_ID (gene)
  • transcript_id=gnl|dbname|ID (RNA)
  • protein_id=gnl|dbname|ID|gb|accession (CDS)
  • product=<name> (RNA, CDS)
  • pseudo=true (gene, RNA, CDS)
  • pseudogene=<TYPE> (gene, RNA, CDS)

[b] Dbxref=DB:value (all feature types). See https://www.ncbi.nlm.nih.gov/genbank/collab/db_xref/ for the current list of allowed databases

[c] ec_number=x.x.x.x (CDS features)

[d] Note= (all feature types). Converted to INSDC /note (also known as a comment)

[e] gene=Abc1 (gene). For the biological gene name (aka symbol)

[f] gene_synonym=xyz (gene). Database names can be included as synonyms, even with no gene name

[g] description= (gene). gene full name, displayed as /note in flatfile.

[h] exception=<CV string> (gene, RNA, CDS)

[i] transl_except=(pos:<base_range>%2Caa:<amino_acid>) (CDS). Used to specify the location of translation exceptions on a CDS feature where a codon at a specific location on the genome should be translated as an alternative amino acid, such as Sec.

[j] function (CDS)

[k] experiment (RNA,CDS)

[l] old_locus_tag (gene)

[m] mobile_element. This has the mandatory qualifier of mobile_element_type, e.g., mobile_element_type=SINE:Alu

[n] ncRNA_class, regulatory_class, recombination_class. These can also be represented with specific SO feature types in column 3, if they have equivalents in the INSDC class controlled vocabularies.
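As an illustration of how these attributes fit together in column 9, the sketch below formats a transl_except value as in [i] and joins multi-valued qualifiers with commas. The helper names are hypothetical, the values are illustrative, and writing the base range as start..end is an assumption that covers only the simple plus-strand case.

```python
def transl_except_value(start, end, amino_acid):
    """Format (pos:<base_range>%2Caa:<amino_acid>) with an encoded comma."""
    if start > end:
        raise ValueError("start must not exceed end")
    return f"(pos:{start}..{end}%2Caa:{amino_acid})"


def join_attributes(pairs):
    """Join (tag, value) pairs with ';'; list values become comma-separated."""
    parts = []
    for tag, value in pairs:
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        parts.append(f"{tag}={value}")
    return ";".join(parts)


# Illustrative values only; values are assumed to be percent-encoded already.
column9 = join_attributes([
    ("Parent", "mRNA1"),
    ("ec_number", ["1.1.1.1", "1.1.1.2"]),
    ("transl_except", transl_except_value(1234, 1236, "Sec")),
])
print(column9)
# Parent=mRNA1;ec_number=1.1.1.1,1.1.1.2;transl_except=(pos:1234..1236%2Caa:Sec)
```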

Annotation crossing gaps

A CDS can only cross a gap of unknown size in introns, not in the actual coding region. If the gap of unknown size is within an exon, then you could split the CDS into two partial CDS features (and mRNAs in eukaryotes) that abut the gap, with a single gene over the whole locus. Alternatively, one of the partial CDS/mRNA features may be deleted if it is very short and there is little or no supporting evidence for it. If you have a single gene and two partial CDS/mRNA features, you should:

(1) add a note to each CDS referencing the other half of the gene

(2) add a note to the gene and CDS features stating, "gap found within coding sequence."

A CDS exon can cross a gap of estimated size; however, a CDS (or mRNA) should not cross a gap such that over 50% of the translation is X (i.e., is in the gap). This situation will generate an error. Again, the CDS/mRNA should either be partial up to the gap or split into two partial CDS/mRNA features on either side of the gap, depending upon your confidence in the translation on each side of the gap.

In addition, no feature should begin or end inside a gap. Instead, the feature should abut the gap and be partial. For more information about splitting CDS features, see either the eukaryotic annotation guidelines or the prokaryotic annotation guidelines.
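The rule that no feature should begin or end inside a gap can be checked against the sequence itself. A minimal sketch, assuming that gaps are runs of at least gaps_min Ns (mirroring a -gaps-min value of 10) and that coordinates are 1-based and inclusive as in GFF3; the function names and the sequence are made up for illustration.

```python
import re


def gap_intervals(sequence, gaps_min=10):
    """Return 1-based inclusive (start, end) for each run of >= gaps_min Ns."""
    pattern = re.compile(r"[Nn]{%d,}" % gaps_min)
    # A 0-based exclusive match end equals the 1-based inclusive end.
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def endpoints_in_gaps(start, end, gaps):
    """Return the feature endpoints that fall inside any gap."""
    return [
        pos for pos in (start, end)
        if any(gap_start <= pos <= gap_end for gap_start, gap_end in gaps)
    ]


# Made-up sequence: 20 bases, a 12 N gap, then 20 more bases.
seq = "ACGT" * 5 + "N" * 12 + "ACGT" * 5
gaps = gap_intervals(seq)
print(gaps)                             # [(21, 32)]
print(endpoints_in_gaps(5, 25, gaps))   # [25]
print(endpoints_in_gaps(5, 20, gaps))   # []
```

The second feature ends at position 20 and so abuts the gap, which is the layout recommended above; the first ends inside the gap and would need to be trimmed and made partial.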

Run table2asn to annotate the sequences

This is the same concept as making .sqn files with .tbl files as the input, except:

[1] Get the new tool, table2asn, by anonymous FTP at ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools/converters/by_program/table2asn_GFF/

[2] Include these arguments in the command line:

Argument                   When to include
-J -c w                    always
-euk                       when the organism is a eukaryote
-locus-tag-prefix <text>   if the locus_tags are not in the gff file. The value of 'text' is the registered locus_tag prefix.
-gaps-min <integer>        minimum number of Ns in a row that represents a gap
-gaps-unknown <integer>    exact number of Ns in a row that represents a gap of completely unknown length
-l                         the evidence for the linkage of the sequences on either side of the gaps. Most commonly, "paired-ends" or "align-genus"
-help                      print usage, description and arguments of the program

[a] For the genes to be properly included, the locus_tags must be present in the .gff file (in column 9 of every gene) OR added by including "-locus-tag-prefix XXXX" in the command line (where XXXX is the registered locus_tag prefix of this genome).

[b] If the organism is a prokaryote, then include the genetic code, [gcode=11], in the source information on the command line.

[c] Prokaryote example:

table2asn -M n -J -c w -t template.sbt -gaps-min 10 -l paired-ends -locus-tag-prefix XXXX -j "[organism=Escherichia coli] [strain=abcd] [gcode=11]" -i fasta_file -f gff_file -o output_file.sqn -Z

In this example:

  • the source information and genetic code are in the command line, not the fasta file
    • the organism and strain are provided; the rest of the information will be pulled from the corresponding BioSample
    • always include the genetic code, [gcode=11], to ensure that the alternative start codons are recognized
  • all runs of 10 or more Ns represent gaps of estimated size that are connected by paired-ends linkage evidence
  • the locus_tag prefix is in the command line because the locus_tags are not in the .gff file
  • -Z does not need or accept an output file name in versions of table2asn posted after July 1, 2019. If your copy of the program still requires "-Z output_file.dr", download the current version.

[d] Eukaryote example

table2asn -M n -J -c w -euk -t template.sbt -gaps-min 10 -l paired-ends -j "[organism=Loa loa] [isolate=F231]" -i fasta_file -f gff_file -o output_file.sqn -Z

In this example:

  • the source information is in the command line, not the fasta file
    • the organism and isolate are provided; the rest of the information will be pulled from the corresponding BioSample
  • all runs of 10 or more Ns represent gaps of estimated size that are connected by paired-ends linkage evidence
  • the locus_tags are included in the GFF file (if not, then include "-locus-tag-prefix XXXX" in the command line, where XXXX is the registered locus_tag prefix for this genome, as seen in the prokaryote example)
  • -Z does not need or accept an output file name in versions of table2asn posted after July 1, 2019. If your copy of the program still requires "-Z output_file.dr", download the current version.

[e] Use the appropriate arguments for your situation. (FYI, "table2asn -help" will print out all the arguments)
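When many genomes are processed, the command line can be assembled programmatically. The sketch below is a hypothetical wrapper, not an NCBI tool: it only arranges the arguments that appear in the two examples above, in the same order, and prints the resulting command. The function name, parameter names and defaults are assumptions; shlex.join requires Python 3.8 or later.

```python
import shlex


def table2asn_command(fasta, gff, template, output, source,
                      eukaryote=False, locus_tag_prefix=None,
                      gaps_min=10, linkage="paired-ends"):
    """Return the argument list for a run like the examples above."""
    args = ["table2asn", "-M", "n", "-J", "-c", "w"]
    if eukaryote:
        args.append("-euk")
    args += ["-t", template, "-gaps-min", str(gaps_min), "-l", linkage]
    if locus_tag_prefix:
        # Only needed when the locus_tags are not in the GFF file.
        args += ["-locus-tag-prefix", locus_tag_prefix]
    args += ["-j", source, "-i", fasta, "-f", gff, "-o", output, "-Z"]
    return args


cmd = table2asn_command(
    "fasta_file", "gff_file", "template.sbt", "output_file.sqn",
    "[organism=Loa loa] [isolate=F231]", eukaryote=True,
)
print(shlex.join(cmd))
# table2asn -M n -J -c w -euk -t template.sbt -gaps-min 10 -l paired-ends -j '[organism=Loa loa] [isolate=F231]' -i fasta_file -f gff_file -o output_file.sqn -Z
```

Keeping the arguments as a list means the same value can be passed to subprocess.run without any shell quoting; the printed form is only for inspection.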

[3] Check the output of the validation and discrepancy report and fix problems

  • Check the .stats file for the number, severity and type of errors that are present in the .val files. All Errors and Rejects need to be fixed. The presence of errors will slow processing. See the genome validation errors for guidance. Contact genomes@ncbi.nlm.nih.gov with any questions about the validation output. During processing there may be some questions about other aspects of the submission.
  • Check the .dr file for the results of the discrepancy report. Categories prefaced with FATAL are always unacceptable and must be fixed. Some of the categories are informational. Reports that are not flagged as fatal should be examined to determine if they represent annotation artifacts that need to be corrected or if they are acceptable due to the biology of the genome. See the discrepancy report examples and explanations for guidance. Write to genomes@ncbi.nlm.nih.gov and send the discrep file with questions about this report.
  • Some common errors include
    • FATAL: MISSING_GENES. This usually occurs because locus_tags were not included. Be sure that locus_tags are present either by including them in the GFF file in column 9 of each gene OR by including "-locus-tag-prefix XXXX" (where XXXX is the registered locus_tag prefix for the genome) in the command line.
    • FATAL: BACTERIAL_PARTIAL_NONEXTENDABLE_PROBLEMS. If this is a eukaryotic genome, you can ignore this error. If this is a prokaryotic genome, then every CDS must begin and end with valid start and stop codons, respectively, or be partial and either extend to the end of the sequence or abut a gap within the scaffold sequence. However, you should annotate with pseudo=true any genes that are 'broken' but are not thought to be pseudogenes. These are genes that do not encode the expected translation, for example because of internal stop codons or missing start or stop codons, and are often caused by problems with the sequence and/or assembly.
  • Make any necessary fixes to the input .fsa and/or .gff files and run table2asn again.
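A quick way to see which fatal categories remain after each run is to scan the discrepancy report for them. The sketch below assumes that fatal entries appear as lines beginning with "FATAL:" followed by the category name, as in the error names quoted above; the real layout of the .dr file may differ, and the function name and sample lines are made up for illustration.

```python
def fatal_categories(report_lines):
    """Return the distinct category names on lines that start with 'FATAL:'."""
    found = []
    for line in report_lines:
        text = line.strip()
        if not text.startswith("FATAL:"):
            continue
        # Assumed layout: "FATAL: CATEGORY_NAME: free-text description"
        parts = text.split(":")
        category = parts[1].strip() if len(parts) > 1 else ""
        if category and category not in found:
            found.append(category)
    return found


# Made-up lines in the assumed layout, not real table2asn output.
report = [
    "FATAL: MISSING_GENES: 12 features have no genes",
    "OVERLAPPING_CDS: 4 coding regions overlap",
    "FATAL: BACTERIAL_PARTIAL_NONEXTENDABLE_PROBLEMS: 2 features have partial ends",
    "FATAL: MISSING_GENES: 12 features have no genes",
]
print(fatal_categories(report))
# ['MISSING_GENES', 'BACTERIAL_PARTIAL_NONEXTENDABLE_PROBLEMS']
```

An empty result from a check like this is not proof that the file is ready; the non-fatal categories still need to be reviewed as described above.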

[4] Submit the error-free .sqn files via the Submission Portal, per the usual instructions.

Last updated: 2019-09-06T15:28:32Z