What is a Third Party Annotation (TPA) Sequence?

TPA: A database designed to capture experimental or inferential results that support submitter-provided annotation for, or assembly of, sequence data that the submitter did not directly determine but derived from GenBank primary data.

There are several types of TPA records:

  • TPA:experimental: Annotation of sequence data is supported by peer-reviewed wet-lab experimental evidence.
  • TPA:inferential: Annotation of sequence data by inference (where the source molecule or its product(s) have not been the subject of direct experimentation).
  • TPA:assembly: Assembly or reassembly of sequence data for which the generation, whether it is purely computational or informed by experimentation, has been subject to peer review. Feature annotation is not required to be part of the peer review for this TPA type. (Examples of such assemblies include complete viruses, mitochondria, or named biosynthetic gene clusters.)

TPA database records differ from GenBank and RefSeq records:

  • GenBank: An archival database of primary nucleotide sequences that were directly sequenced by the submitter.
  • RefSeq: A curated, non-redundant database that includes genomic DNA, transcript (RNA), and protein products, for major organisms. The sequence data are derived from GenBank primary data, and the annotation is computational, from published literature, or from domain experts.

A TPA sequence is derived or assembled from primary sequence data currently found in the DDBJ/EMBL/GenBank International Nucleotide Sequence Database. It can be genomic or mRNA sequence and can be assembled or derived from primary genomic and/or mRNA sequences. TPA sequences are submitted to DDBJ/EMBL/GenBank as part of the process of publishing biological experiments that include the assembly and/or annotation of existing, primary nucleotide sequences.

Examples of TPA sequences are:

  • mRNA assembled from overlapping EST sequences or derived and annotated from unannotated TSA sequences.
  • mRNA derived from an unannotated section of genomic sequence by comparison with another known mRNA from a different organism.
  • mRNA assembled from overlapping EST sequences, other partial mRNAs, and/or genomic sequences.
  • previously unannotated genomic sequence now described with the exons, introns, and coding region (CDS) intervals of a gene.
  • new assemblies such as complete viruses, complete organelles, or complete biosynthetic gene clusters (BGCs).

Note: It is required that all new annotations be experimentally determined to exist, directly or indirectly. Bioinformatic or computational work alone is not sufficient as supporting evidence of new annotation.

What is a primary sequence?

'Primary' sequences used to assemble a TPA sequence are those that have been experimentally determined and are now publicly available in the GenBank/EMBL/DDBJ databases. These include: SRA data (reported with their SRR numbers), Whole Genome Shotgun (WGS) contig sequences (not master records), Transcriptome Shotgun Assembly (TSA) sequences, ESTs, and Trace Archive database sequences. They may not be from a proprietary database. Each primary sequence used to assemble a TPA sequence must be identified by a GenBank accession number in the TPA sequence submission.

Reference sequences may not be cited as data used to build TPA sequences because RefSeqs are not primary data. For example, sequences with Accession Numbers such as NT_112233 or NW_123456 represent contig sequences; the sequences used to assemble these contigs, which can be found at the bottom of contig records, should be cited in a TPA sequence submission. Sequences with Accession Numbers such as XM_345678 or NM_123456 are RefSeqs representing mRNAs that are not experimentally determined and therefore cannot be cited as primary data.
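The distinction above can be checked mechanically before a submission is prepared. The following is a minimal sketch, not an NCBI tool: it assumes that RefSeq accessions are recognizable by a two-letter prefix followed by an underscore (as in NT_, NW_, XM_, and NM_ above) and that SRA run accessions begin with SRR, ERR, or DRR. The function names are hypothetical, and an accession classified as "other" is not thereby proven to be primary data; it still has to be checked against the rules above.

```python
import re

# Assumption: RefSeq accessions are two capital letters, an underscore, then
# letters/digits, optionally followed by a ".version" suffix (e.g. NM_123456.2).
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")
# Assumption: SRA run accessions are SRR, ERR, or DRR followed by digits.
SRA_RUN_PATTERN = re.compile(r"^[SED]RR\d+$")


def classify_accession(accession: str) -> str:
    """Return 'refseq', 'sra_run', or 'other' for an accession string."""
    acc = accession.strip().upper()
    if REFSEQ_PATTERN.match(acc):
        return "refseq"
    if SRA_RUN_PATTERN.match(acc):
        return "sra_run"
    return "other"


def split_refseq(accessions):
    """Separate RefSeq accessions (not citable as primary data) from the rest."""
    rejected = [a for a in accessions if classify_accession(a) == "refseq"]
    remaining = [a for a in accessions if classify_accession(a) != "refseq"]
    return remaining, rejected


if __name__ == "__main__":
    # SRR0000001 is a made-up run accession used only for illustration.
    remaining, rejected = split_refseq(["NM_123456", "NT_112233", "SRR0000001"])
    print("Check further:", remaining)
    print("RefSeq, cannot be cited as primary:", rejected)
```

A prefix test like this only catches the RefSeq case described above; it says nothing about whether a remaining accession is public or belongs to the correct organism.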

How Do TPA Sequence Records Differ from Other GenBank/EMBL/DDBJ Records?

The display of a TPA sequence is similar to that of other GenBank/INSDC records, but includes the following:

  • The label 'TPA_exp:', 'TPA_inf:', or 'TPA_assembly:' at the beginning of each Definition Line.
  • A PRIMARY field providing the base pair spans of the primary sequences that contribute to the TPA sequence, or a DBLINK citation of the SRR sequence records used.

Other Features and References are similar to those displayed in other GenBank/EMBL/DDBJ records.
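Because the Definition Line label identifies the TPA type, a record's type can be read off its definition line. The sketch below assumes the label appears as a literal prefix of the definition text, as described above; the function name and the sample definition lines are hypothetical.

```python
from typing import Optional

# Mapping from the Definition Line label to the TPA record type.
TPA_LABELS = {
    "TPA_exp:": "TPA:experimental",
    "TPA_inf:": "TPA:inferential",
    "TPA_assembly:": "TPA:assembly",
}


def tpa_type_from_definition(definition: str) -> Optional[str]:
    """Return the TPA type named by the definition line's label, or None."""
    text = definition.lstrip()
    for label, tpa_type in TPA_LABELS.items():
        if text.startswith(label):
            return tpa_type
    return None  # no TPA label: not displayed as a TPA record


if __name__ == "__main__":
    # Made-up definition lines, for illustration only.
    print(tpa_type_from_definition("TPA_exp: Mus musculus example mRNA, complete cds"))
    print(tpa_type_from_definition("Mus musculus example mRNA, complete cds"))
```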

An example of a TPA:experimental record is BK000016.

An example of a TPA:inferential record is BK000554.

An example of a TPA:assembly record is BK010317.

TPA sequence records are shared by all three Collaboration databases and can be found using typical search methods in the GenBank Nucleotide and Protein databases (e.g., by submitter name, gene/protein name, or accession number).
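Retrieval by accession number can also be scripted. The following is a rough sketch that assumes the NCBI E-utilities `efetch` endpoint and its `db`, `id`, `rettype`, and `retmode` parameters; the helper names are hypothetical, and the fetch function needs network access, so it is defined but not called here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public E-utilities efetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Build a URL requesting a Nucleotide record as a GenBank flat file."""
    params = {
        "db": "nuccore",    # the Nucleotide database
        "id": accession,    # e.g. a TPA accession such as BK000016
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_record(accession: str) -> str:
    """Download the record text (requires network access)."""
    with urlopen(build_efetch_url(accession)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(build_efetch_url("BK000016"))
```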

How to Submit TPA Sequence Data

Sequences can be submitted to the TPA database through BankIt:

  • BankIt

    • On the 'Submission Category' input page, choose 'Third Party Annotation'.
    • In the appropriate boxes, provide a brief explanation of the evidence you have to support the new annotation/assembly you are presenting and provide the GenBank accession number(s) of all primary sequence(s) used to assemble or derive your TPA sequence (for SRA data, provide the SRR accession number(s)).
    • Continue with the standard submission process. Be sure to add all new annotation for your TPA sequence on the 'Features' input page.
    • The submission will be labeled as a TPA sequence and processed accordingly after it is successfully submitted.
  • General Information

    • The entire submitted sequence must be covered by cited primary sequence data.
    • There is no limit on the number of overlapping/adjoining primary sequences that can be cited for a TPA submission.
    • If sections of a sequence submitted to TPA have been newly determined by the submitter, those sequences (if they are more than 50 nt) must first be submitted to GenBank, processed, and released to the public before they can be cited as primary sequences.
    • TPA sequences must cite the same organism as is reported in their corresponding primary sequence data records (for example, only Mus musculus primary sequences may be used to build or derive a Mus musculus TPA sequence submission).
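    The coverage requirement above lends itself to a quick self-check before submission. The sketch below is illustrative and not part of BankIt: it assumes each cited primary sequence has been mapped to a 1-based, inclusive span on the TPA sequence, and it reports any stretch that no span covers. The function name and the example spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based, inclusive (start, end) on the TPA sequence


def uncovered_regions(seq_length: int, primary_spans: List[Span]) -> List[Span]:
    """Return the regions of the TPA sequence not covered by any primary span."""
    gaps: List[Span] = []
    next_uncovered = 1  # first position not yet known to be covered
    for start, end in sorted(primary_spans):
        # Clip each span to the sequence bounds and skip empty ones.
        start, end = max(start, 1), min(end, seq_length)
        if start > end:
            continue
        if start > next_uncovered:
            gaps.append((next_uncovered, start - 1))
        next_uncovered = max(next_uncovered, end + 1)
    if next_uncovered <= seq_length:
        gaps.append((next_uncovered, seq_length))
    return gaps


if __name__ == "__main__":
    # Made-up spans on a 1000 nt sequence, with one uncovered stretch.
    for start, end in uncovered_regions(1000, [(1, 400), (451, 1000)]):
        print(f"No primary data cited for {start}..{end} ({end - start + 1} nt)")
```

    Overlapping and adjoining spans are both handled, which matches the rule that any number of overlapping or adjoining primary sequences may be cited.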

    When are TPA sequences released?

    • TPA sequences are held confidential until their accession numbers or sequence data assembly and/or annotation appear in a peer-reviewed publication in a biological journal.
    • No sequence accepted for the TPA database will be released to the public until the submitter notifies GenBank of its publication or we determine independently that such information was published.
    • When reporting that a TPA sequence's corresponding paper has been published, provide the DOI, the PubMed ID, or the URL of the journal's online paper.

    What should NOT be submitted to TPA?

    • Synthetic constructs such as cloning vectors that use well characterized, publicly available genes, promoters, or terminators; these should be submitted as synthetic sequences for GenBank.
    • Microsatellites and related types of repeat regions.
    • New sequence that updates or changes existing sequence data from another submitter; these should be submitted as new sequences for GenBank.
    • Annotation that has arisen from an automated tool, such as GeneMark, tRNAscan-SE, or ORFfinder, where no further evidence, experimental or otherwise, is presented for the annotation.
    • Annotation from in vivo, in vitro, or in silico experimentation that will not be submitted for publication in a peer-reviewed journal.

Last updated: 2020-10-15T17:23:47Z