Common Discrepancy Reports

Introduction

The Discrepancy Report evaluates one or more ASN.1 files, looking for suspicious annotation or annotation discrepancies that NCBI staff have noticed commonly occur in genome submissions, both complete and incomplete (WGS). Problems that this function was written to find include inconsistent locus_tag prefixes, missing gene features, and suspect product names.

This page shows common reports generated by specially configured GenomeWorkbench, table2asn_gff, or the newest version of the command-line program asndisc. See more information about the Discrepancy Report and those tools.

If you have questions about the Discrepancy Report, please contact us by email at genomes@ncbi.nlm.nih.gov prior to sending us your submission.

Common Reports

10_PERCENTN

Explanation : The sequence specified has >10% N’s.

Suggestion : A sequence with many gaps is expected to trigger this warning. The warning mainly notifies the user of a low-quality sequence that should be checked. Gap features may be needed if not already present.

Examples:

  10_PERCENTN: 1 sequence has > 10% Ns

   Contig9.33 (length 4226, 715 other)
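As a rough pre-check before running the report, the fraction of N's in each sequence can be computed directly. The following is a minimal Python sketch, not the logic used by asndisc; the function names are hypothetical, and treating the cutoff as a strict "greater than 10%" comparison is an assumption based on the explanation above.

```python
def n_fraction(seq: str) -> float:
    """Return the fraction of bases in seq that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def exceeds_n_threshold(seq: str, threshold: float = 0.10) -> bool:
    """True when strictly more than `threshold` of the sequence is N (assumed cutoff)."""
    return n_fraction(seq) > threshold
```

A sequence flagged by such a check is a candidate for review and, where the N's form runs, for gap features.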

ALL_SEQS_CIRCULAR

Explanation : This is a FATAL error reported when the technique is WGS, the location is genomic, and all sequences are circular.

Suggestion : If these are contigs of a draft genome, please remove the circular indication. If the sequences are the chromosome and plasmids of a genome where there is one sequence per chromosome, please change the locations to match the sequence.

Examples:

ALL_SEQS_CIRCULAR: FATAL! ALL (10) sequences are circular

ALL_SEQS_SHORTER_THAN_20kb

Explanation : The sequences in the genome are all <20 kb, which is unexpected for a genome assembly. It may indicate that some sequences are missing, or that this is a transcriptome assembly or some other sort of submission rather than a genome assembly.

Suggestion : Check for missing sequences. If you are updating a genome by adding plasmids, please include the chromosome too as the submission portal will expect a chromosome. If this is a transcriptome assembly, then submit it to the TSA submission portal.

Examples:

ALL_SEQS_SHORTER_THAN_20kb: No sequences longer than 20,000 nt found.

BACTERIAL_JOINED_FEATURES_NO_EXCEPTION

Explanation : Bacteria generally do not have exons and introns, so there should not be joined features that lack exceptions. Coding regions that are translated using ribosomal slippage should have the exception ‘ribosomal slippage’. This message can be ignored for features that cross the sequence origin, where a join is necessary.

Suggestion : Ignore if these cross the sequence origin. Add an exception if these are translated by ribosomal slippage. Do not use for any other cases (like annotation across a stop codon generated by sequence error).

Examples:

This is an example of one that is ok.

BACTERIAL_JOINED_FEATURES_NO_EXCEPTIONS: 2 coding regions with joined locations have no exceptions

    2 coding regions over the origin of circular DNA

This is an example that is not correct.

BACTERIAL_JOINED_FEATURES_NO_EXCEPTIONS: FATAL! 1 coding regions with joined locations have no exceptions

FATAL! 1 coding region not over the origin of circular DNA

BACTERIAL_PARTIAL_NONEXTENDABLE_EXCEPTION

Explanation : All bacterial features that are internal to a sequence must be complete unless directly abutting a gap. If the feature is close to the end of a contig, the feature should be extended to the end. The software can detect when the feature can be extended a few bases to the end of the sequence.

Suggestion : If an internally partial coding region lacks a valid start or stop, please make it a nonfunctional gene without a translation. If features are close to the end of a partial sequence or contig but too far away from the end for the software to fix, please extend them to the end of the sequence. If the feature is internal and there are gaps, change the span so that it abuts the gap, even if that is not the end of a codon.

Examples:

BACTERIAL_PARTIAL_NONEXTENDABLE_EXCEPTION_PROBLEMS: FATAL! 1 feature has partial ends that do not abut 
the end of the sequence or a gap and cannot be extend by 3 or fewer nucleotides to do so.

BACTERIA_SHOULD_NOT_HAVE_MRNA

Explanation : Since bacteria have many polycistronic mRNAs and we annotate each gene and coding region separately, annotating an individual mRNA for each gene is incorrect.

Suggestion : Remove the incorrect mRNA features. Rarely, the mRNAs are annotated for the complete polycistronic transcript.

Examples:

BACTERIA_SHOULD_NOT_HAVE_MRNA: 1 bacterial sequence has mRNA features

BAD_GENE_NAME

Explanation : A gene symbol contains suspect phrases or characters. The gene symbol is longer than expected or has unusual characters.

Suggestion : Check the gene symbols. Do not use protein names for gene symbols. If in doubt, remove the gene symbol.

Example:

BAD_GENE_NAME: 1 gene contains suspect phrase or characters

BAD_BACTERIAL_GENE_NAME

Explanation : There are gene symbols that do not meet the correct format for bacterial genes.

Suggestion : Correct the gene symbols if possible (generally three lowercase letters followed by capital letters as necessary).

Examples:

BAD_BACTERIAL_GENE_NAME: 1 bacterial gene does not start with a lowercase letter
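The general convention described in the suggestion can be approximated with a regular expression. This is only a sketch under a strict reading of that convention (exactly three lowercase letters, then zero or more capital letters); the pattern and the function name are assumptions, and the actual test may accept other forms.

```python
import re

# Assumed pattern: three lowercase letters followed by optional capital letters.
GENE_SYMBOL = re.compile(r"[a-z]{3}[A-Z]*")


def looks_like_bacterial_gene_symbol(symbol: str) -> bool:
    """True when the whole symbol fits the assumed three-lowercase-plus-capitals form."""
    return GENE_SYMBOL.fullmatch(symbol) is not None
```

Symbols that fail such a check are worth inspecting by hand rather than rejecting automatically, since the convention is stated as a general rule.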

CHECK_AUTH_NAME

Explanation : One or more authors are missing their first or last name.

Suggestion : Correct the author names so that the family name is last, and the given name is first.

Examples:

CHECK_AUTH_NAME: 2 pubs missing author’s first or last name

DUP_GENES_OPPOSITE_STRANDS

Explanation : There is a pair of genes with the same span but on different strands.

Suggestion : Remove one of the genes as this is an annotation error.

Examples:

DUP_GENES_OPPOSITE_STRANDS: 2 genes match other genes in the same location, but on the opposite strand

 gene            93158..93895
                 /gene="abcD"
 gene            complement(93158..93895)
                 /gene="cmk"
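To locate such pairs in your own annotation, genes can be grouped by span and then checked for entries on both strands. The sketch below assumes the genes are already available as simple tuples (a hypothetical layout, not a format produced by any NCBI tool) and is not the implementation used by the report.

```python
from collections import defaultdict


def opposite_strand_duplicates(genes):
    """genes: iterable of (name, start, end, strand) with strand '+' or '-'.

    Returns a list of (span, names_on_plus, names_on_minus) for every span
    that carries at least one gene on each strand.
    """
    by_span = defaultdict(lambda: {"+": [], "-": []})
    for name, start, end, strand in genes:
        by_span[(start, end)][strand].append(name)
    return [
        (span, strands["+"], strands["-"])
        for span, strands in by_span.items()
        if strands["+"] and strands["-"]
    ]
```

Applied to the example above, the span 93158..93895 would be returned with abcD on the plus strand and cmk on the minus strand.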

EUKARYOTE_SHOULD_HAVE_MRNA

Explanation : All eukaryotic CDS features should be accompanied by mRNA features. This genome is lacking all mRNA features.

Suggestion : Add appropriate mRNA features for all the CDS features. Note that you should have transcript_IDs and protein_IDs on both the mRNA and the accompanying CDS feature. See https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/genbank/eukaryotic_genome_submission_annotation/#protein_id for information about .tbl files and about the transcript_id and protein_id qualifiers that the mRNA and CDS features require.

Example:

EUKARYOTE_SHOULD_HAVE_MRNA: FATAL! No mRNA present

EXON_INTRON_CONFLICT

Explanation : The spans of the exons and adjacent introns do not directly abut one another.

Suggestion : The exon and intron spans must be edited to be directly adjacent to one another. However, rethink the use of these features. Introns and exons can be implied by the CDS and/or mRNA spans on a record, making them redundant and sources of inconsistencies. These features could be removed unless necessary for clarification.

Example:

EXON_INTRON_CONFLICT.asn:exon   1   lcl|ex1:1-10

EXON_INTRON_CONFLICT.asn:intron [intron]    lcl|ex1:12-20

EXON_INTRON_CONFLICT.asn:exon   2   lcl|ex1:22-40

EXON_INTRON_CONFLICT.asn:exon   1   lcl|ex2:1-15

EXON_INTRON_CONFLICT.asn:intron [intron]    lcl|ex2:10-25

EXON_INTRON_CONFLICT.asn:exon   2   lcl|ex2:20-30
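The abutment requirement can be stated precisely: with 1-based inclusive coordinates, each feature must start exactly one base after the previous one ends. The following sketch checks that condition for exon and intron features on a single sequence; the tuple layout and function name are assumptions for illustration, and strand handling is omitted.

```python
def non_abutting_pairs(features):
    """features: list of (kind, start, end), 1-based inclusive coordinates.

    Returns the adjacent pairs (after sorting by start) whose spans leave a
    gap or overlap, i.e. where next.start != previous.end + 1.
    """
    ordered = sorted(features, key=lambda f: f[1])
    return [
        (left, right)
        for left, right in zip(ordered, ordered[1:])
        if right[1] != left[2] + 1
    ]
```

For the ex1 spans in the example (1-10, 12-20, 22-40) both adjacent pairs leave a one-base gap; for ex2 (1-15, 10-25, 20-30) both pairs overlap. Spans such as 1-10, 11-20, 21-40 would pass.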

FIND_BADLEN_TRNAS

Explanation : tRNA is longer than expected. This is usually an annotation error. If the genome is archaeal, it is likely there is an unannotated intron.

Suggestion : Check for annotation errors. Annotate with a joined span if an archaeal intron can be identified.

Example:

FIND_BADLEN_TRNAS: 1 tRNA is too long – over 150 nucleotides

GAPS

Explanation : One of the sequences in the genome contains gaps.

Suggestion : Can be ignored if this is expected; otherwise check your sequence and annotation.

Example:

GAPS: 1 sequence contains gaps

GENE_PRODUCT_CONFLICT

Explanation : There are coding regions that have the same gene name as other coding regions, but the product name is different.

Suggestion : Check the pairs to see if the gene symbols and/or products are correct. Since there is no unified system for gene symbol naming, it is possible that the conflict is expected. The submitter must decide whether to ignore the warning or not.

Example:

GENE_PRODUCT_CONFLICT: 2 coding regions have the same gene name as another coding region but a different product

2 coding regions have the same gene name (lptF) as another coding region but a different product
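To review the affected pairs, coding regions can be grouped by gene symbol and the symbols with more than one distinct product listed. This is a minimal sketch that assumes the CDS features have been extracted as (gene, product) pairs; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def gene_product_conflicts(cds_features):
    """cds_features: iterable of (gene_symbol, product_name).

    Returns {gene_symbol: sorted list of products} for every symbol that is
    attached to more than one distinct product name.
    """
    products = defaultdict(set)
    for gene, product in cds_features:
        products[gene].add(product)
    return {gene: sorted(names) for gene, names in products.items() if len(names) > 1}
```

Each entry in the result still needs a manual decision, since a shared symbol with different products is sometimes legitimate.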

INCONSISTENT_DBLINK

Explanation : All parts of a genome should have the same BioProject and BioSample pair. This test will tell you when that is not true.

Suggestion : Check to see which pieces have the incorrect BioProject or BioSample.

Example: mismatch of the BioSamples

INCONSISTENT_DBLINK: DBLink Report (all present, inconsistent)

BioSample (all present, inconsistent)
    2 DBLink objects have field BioSample value ‘SAMN01’
    2 DBLink objects have field BioSample value ‘SAMN02’
BioProject (all present, all same)
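The summary in this example amounts to counting the distinct values of one field across all sequences. The sketch below reproduces that kind of tally for a list of values; it is an illustration only, the function name is hypothetical, and the label "some missing" is an assumed wording that does not come from the report.

```python
from collections import Counter


def summarize_field(values):
    """values: one entry per sequence; None marks a missing value.

    Returns (presence, consistency, counts), where counts maps each value to
    the number of sequences that carry it.
    """
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # "some missing" is an assumed label, not text taken from the report.
    presence = "all present" if len(present) == len(values) else "some missing"
    consistency = "all same" if len(counts) <= 1 else "inconsistent"
    return presence, consistency, counts
```

The counts make it easy to see which value is in the minority and therefore which pieces most likely carry the wrong BioProject or BioSample.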

INCONSISTENT_STRUCTURED_COMMENTS

Explanation : All parts of a genome should have the same assembly structured comment information. One exception is that plasmids of a complete genome can have a different coverage value.

Suggestion : Most of the time this indicates a problem and the wrong pieces should be identified and corrected.

Example: Mismatch of the Assembly Method of the genome assembly structured comment

INCONSISTENT_STRUCTURED_COMMENTS: Structured Comment Report (all present, inconsistent)
Structured comment field Assembly Method
(all present, inconsistent)
Structured comment field Expected Final Version
(all present, all same)

LONG_NO_ANNOTATION

Explanation : This test indicates that at least one sequence is greater than 5000 nt in length and has no annotation.

Suggestion : The test is informational only. If you did not intend to provide annotation, it can be ignored. However, this is valuable in cases where annotation was inadvertently dropped from a sequence.

Example:

1 bioseq is longer than 5000nt and has no features.

LOW_QUALITY_REGION

Explanation : The sequence contains a region where there are a large number of nucleotides that are not A,C,G,T. This can be correct if it is expected and the bases are not N's. If there are runs of N’s, gap features should be added and there will be other errors in the discrepancy report for the N’s.

Suggestion : Check to see if these are non A,C,G,T,N IUPAC bases. If the bases in the region are N’s, add gaps.

Example:

LOW_QUALITY_REGION: 1 sequence contains low quality region

MISC_FEATURE_WITH_PRODUCT_QUAL

Explanation : The record has misc_feature features that have a product qualifier. /product is only permitted on CDS and RNA features, since those are the features where something is made, not simply described.

Suggestion : Check these to see if misc_feature is appropriate. If so, move the information in /product to /note. If not, use the appropriate CDS or RNA feature.

Example:

MISC_FEATURE_WITH_PRODUCT_QUAL: 15 features have a product qualifier

misc_feature    7760..7894
                 /product="Truncated periplasmic divalent cation tolerance protein"

MRNA_SHOULD_HAVE_PROTEIN_TRANSCRIPT_IDS

Explanation : Eukaryotic mRNAs and the corresponding CDS features should have matching transcript_IDs and protein_IDs so that the pairing of each mRNA and CDS is exact. In this case there are no protein_ids or transcript_ids on the mRNA features. The same error message is given if only one of the two IDs is missing (e.g., no transcript_IDs).

Suggestion : Add the protein_ids and transcript_ids to the mRNA features. See https://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/genbank/eukaryotic_genome_submission_annotation/#protein_id

Example:

MRNA_SHOULD_HAVE_PROTEIN_TRANSCRIPT_IDS: no protein_id and transcript_id present

MULTIPLE_CDS_ON_MRNA

Explanation : All CDS features on eukaryotic genomes must have their own mRNA, even if the mRNA has an identical span (in the case of alternate start sites).

Suggestion : If the multiple CDS features are correct for the gene, then add an mRNA, along with transcript_ids, for the second mRNA/CDS pair. Remove any incorrect features. If the genome has correctly added reciprocal transcript_ids and protein_ids, this should not be a problem.

Example:

MULTIPLE_CDS_ON_MRNA.asn:ex2 (length 247)

MULTIPLE_CDS_ON_MRNA.asn:ex3 (length 247)

MULTIPLE_CDS_ON_MRNA.asn:ex5 (length 247)

NO_LOCUS_TAGS

Explanation : All CDS and RNA features of a genome must have a locus_tag qualifier. This error indicates that none are present.

Suggestion : Add locus_tag qualifiers (and genes, if necessary) to all CDS and RNA features.

Example:

NO_LOCUS_TAGS:  FATAL! None of the 1871 genes has locus tag

PROTEIN_NAMES

Explanation : Message indicates that all proteins have the same name. The name that the proteins have will be given in the message.

Suggestion : For eukaryotic genomes, this could be okay as many of the genomes are not well characterized. Since there is a lot more data for bacteria, we do expect that most bacterial proteins will have a functional name.

Example:

PROTEIN_NAMES: All proteins have the same name ‘hypothetical protein’

REQUIRED_STRAIN

Explanation : Genomes of certain classes of organisms (bacteria, fungi) require strain qualifiers. Endosymbionts and metagenomic assemblies should have isolate instead.

Suggestion : Add a strain name for those organisms that require them. For those that require strain, isolate is not appropriate.

Example:

REQUIRED_STRAIN: 7 biosources are missing required strain value

RRNA_NAME_CONFLICTS

Explanation : rRNA product names should be the standard names for the molecule.

Suggestion : Check for the correct rRNA naming policy for your organism, e.g., 16S ribosomal RNA. The list of expected names is:

4.5S ribosomal RNA
5S ribosomal RNA
5.8S ribosomal RNA
12S ribosomal RNA
15S ribosomal RNA
16S ribosomal RNA
18S ribosomal RNA
21S ribosomal RNA
23S ribosomal RNA
25S ribosomal RNA
26S ribosomal RNA
28S ribosomal RNA
large subunit ribosomal RNA
small subunit ribosomal RNA

Example:

FATAL! 4 rRNA product names are not standard. Correct the names to the standard format, eg 16S ribosomal RNA.
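Product names can be compared against the list of expected names before submission. The sketch below does an exact, case-sensitive comparison against the names listed above; the exact-match rule and the function name are assumptions, and the real test may be more lenient or more detailed.

```python
# Expected rRNA product names, copied from the list above.
STANDARD_RRNA_NAMES = {
    "4.5S ribosomal RNA", "5S ribosomal RNA", "5.8S ribosomal RNA",
    "12S ribosomal RNA", "15S ribosomal RNA", "16S ribosomal RNA",
    "18S ribosomal RNA", "21S ribosomal RNA", "23S ribosomal RNA",
    "25S ribosomal RNA", "26S ribosomal RNA", "28S ribosomal RNA",
    "large subunit ribosomal RNA", "small subunit ribosomal RNA",
}


def nonstandard_rrna_names(products):
    """Return the product names that are not in the expected list (exact match assumed)."""
    return [name for name in products if name not in STANDARD_RRNA_NAMES]
```

A name such as "16S rRNA" would be reported by this check and should be corrected to "16S ribosomal RNA".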

SEQ_SHORTER_THAN_200bp

Explanation : We expect contigs of draft genomes to be at least 200 bp in length. If you think that you need to keep a short contig, please contact genome staff to explain the situation.

Suggestion : Remove those sequences that are less than 200 nt in length.

Example:

SEQ_SHORTER_THAN_200bp: 2 contigs are shorter than 200 nt
   Contig69641.1 (length 100)
   Contig72501.1 (length 187)
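Short contigs can be listed from the FASTA file before the report is run. The following is a minimal sketch with a small hand-written FASTA reader; it assumes plain FASTA text with the sequence ID as the first word of each definition line, the function names are hypothetical, and it is not how the discrepancy test itself is implemented.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (seq_id, sequence) tuples."""
    records, seq_id, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, "".join(chunks)))
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif line and seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, "".join(chunks)))
    return records


def shorter_than(records, min_length=200):
    """Return (seq_id, length) for every record shorter than min_length."""
    return [(seq_id, len(seq)) for seq_id, seq in records if len(seq) < min_length]
```

Calling shorter_than with min_length=50 gives the corresponding list for the SEQ_SHORTER_THAN_50bp test.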

SEQ_SHORTER_THAN_50bp

Explanation : There are sequences that are smaller than 50 nt in length. These should be removed from your draft genome. These sequences will also get the SEQ_SHORTER_THAN_200bp warning.

Suggestion : Remove these sequences from your draft genome.

Example:

SEQ_SHORTER_THAN_50bp: 3 sequences are shorter than 50 nt
   Contig9.22 (length 46)
   Contig9.23 (length 12)
   Contig9.24 (length 6)

SHORT_LNCRNA

Explanation : We expect these features to be >200 bases in length. lncRNAs are long non-coding RNAs; such molecules are generally defined as having a length greater than 200 bp and not fitting into any other ncRNA class.

Suggestion : Check to see if the ncRNA should be defined by a different class. See the following list of classes: http://www.insdc.org/rna_vocab.html

Example:

SHORT_LNCRNA: 1 lncRNA feature is suspiciously short

SHORT_RRNA

Explanation : The rRNA is not partial at either end and is shorter than expected. This could be because the location is not correct or this is just a bit of rRNA-like sequence.

Suggestion : There are several possibilities, depending upon what the problem is. Adjust the rRNA to the right location, or make the end partial if it is at the end of a sequence or abuts a gap, or mark the gene as "pseudo", as appropriate. These are the current conditions that will trigger this test:

18S, 26S, 25S, 16S, small subunit and large subunit rRNAs that are less than 1000 nt and not partial at either end.
23S rRNA that is less than 2000 nt and not partial at either end.
28S rRNA that is less than 3300 nt and not partial at either end.
5.8S rRNA that is less than 130 nt and not partial at either end.
5S rRNA that is less than 90 nt and not partial at either end.

Example:

FATAL: SHORT_RRNA: 1 rRNA feature is too short
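The conditions listed above can be written as a lookup of minimum lengths. This sketch keys the thresholds by full product name, which is an assumption, as are the function and parameter names; it reads "not partial at either end" as meaning that the feature is flagged only when both ends are complete.

```python
# Minimum expected lengths (nt), taken from the conditions listed above.
MIN_RRNA_LENGTH = {
    "18S ribosomal RNA": 1000,
    "26S ribosomal RNA": 1000,
    "25S ribosomal RNA": 1000,
    "16S ribosomal RNA": 1000,
    "small subunit ribosomal RNA": 1000,
    "large subunit ribosomal RNA": 1000,
    "23S ribosomal RNA": 2000,
    "28S ribosomal RNA": 3300,
    "5.8S ribosomal RNA": 130,
    "5S ribosomal RNA": 90,
}


def is_short_rrna(product, length, partial5=False, partial3=False):
    """True when a complete (non-partial) rRNA is below its minimum length.

    Products without a listed threshold, and features that are partial at
    either end, are never flagged in this sketch.
    """
    minimum = MIN_RRNA_LENGTH.get(product)
    if minimum is None or partial5 or partial3:
        return False
    return length < minimum
```

A flagged feature still needs one of the fixes described in the suggestion: a corrected location, a partial end, or a pseudo gene.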

SHOW_TRANSL_EXCEPT

Explanation : There are coding regions in the genome that have a transl_except qualifier to make a valid translation. The common example is a selenocysteine-containing protein.

Suggestion : Correct if it is a valid transl_except and not one that is an attempt to annotate across a stop codon due to sequencing error.

Example:

SHOW_TRANSL_EXCEPT: 3 coding regions have a translation exception
      /transl_except=(pos:complement(333705..333707),aa:Sec)
      /product="formate dehydrogenase-N subunit alpha"

SOURCE_QUALS

Explanation : Tests whether all the source qualifiers match on all sequences in the particular editing session. The source qualifiers should match for all pieces of the same genome. Qualifiers such as chromosome and plasmid_name are ignored in this test. There will be a separate test for each qualifier. This is flagged FATAL when not all match.

Suggestion : Fix the piece of the genome that does not match.

Examples:

This is example text where all match ‘SOURCE_QUALS: country (all present, all same)’

This is example text where there is a FATAL because not all match:

SOURCE_QUALS: FATAL! Strain (all present, some duplicates)
3 sources have strain = A
3 sources have strain = B
2 sources have strain = C

TITLE_ENDS_WITH_SEQUENCE

Explanation : There are apparent nucleotide sequence characters in the definition line of the record.

Suggestion : This happens because there is no carriage return after the seqID of the sequence. Go back to the fasta file and make sure the sequence begins on the second line.

Example:

TITLE_ENDS_WITH_SEQUENCE:
2 deflines appear to end with sequence characters

DEFINITION  7_quiver GTCTTGTAGTTGATGGCCATATTTACCTGCATAGACTTGATTGACTT
        TTTTAGGCACACCTTTGATATAG.

DEFINITION  TCTTGTAGTTGATGGCCATATTTACCTGCATAGACTTGATTGACTTTTTTAGGCACACC
        TTTGATATAG.
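Definition lines that have absorbed sequence can be spotted with a simple pattern match. The sketch below flags a defline that ends in a long run of nucleotide letters; the run length of 20 and the restriction to A, C, G, T and N are assumptions chosen for illustration, not the criteria used by the report.

```python
import re

# Assumed heuristic: 20 or more nucleotide letters at the very end of the
# defline, optionally followed by the period that closes a DEFINITION line.
TRAILING_SEQUENCE = re.compile(r"[ACGTNacgtn]{20,}\.?$")


def defline_ends_with_sequence(defline: str) -> bool:
    """True when the defline appears to end with sequence characters."""
    return TRAILING_SEQUENCE.search(defline.strip()) is not None
```

When a defline is flagged, check the FASTA file for a missing line break after the sequence ID, as described in the suggestion.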

UNPUB_PUB_WITHOUT_TITLE

Explanation : Test is FATAL where there is an unpublished pub and no title.

Suggestion : Add the title of your prospective publication. It does not have to be the final title.

Example: This is the text you will see when there is no title

UNPUB_PUB_WITHOUT_TITLE: FATAL! Unpublished pubs have no title

UNUSUAL_NT

Explanation : A base other than the most common IUPAC bases (A,C,G,T,N) is present in the sequence.

Suggestion : Can be correct if any of the ambiguous bases other than N are used. Can be wrong if text words were included at the beginning of your FASTA sequence. Check your sequence to see if it is correct.

Example:

UNUSUAL_NT: 1 sequence contains nucleotides that are not ATCG or N
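To see which characters triggered the message, the bases outside A, C, G, T and N can be tallied. This is a minimal sketch with a hypothetical function name; it treats the sequence as case-insensitive and counts every other character, whether it is a valid IUPAC ambiguity code or stray text.

```python
from collections import Counter


def unusual_bases(seq: str) -> Counter:
    """Count every character in seq that is not A, C, G, T or N (case-insensitive)."""
    return Counter(base for base in seq.upper() if base not in "ACGTN")
```

Counts of letters such as R or Y point to intended ambiguity codes, whereas letters that are not IUPAC nucleotide codes at all suggest that text was included in the sequence.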
