NCBI Nicotiana attenuata Annotation Release 100

The RefSeq genome records for Nicotiana attenuata were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Nicotiana attenuata Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Nov 28 2016
Date of submission of annotation to the public databases: Dec 6 2016
Software version: 7.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
NIATTr2	GCF_001879085.1	Max Planck Institute for Chemical Ecology	11-15-2016	Reference	12 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and length of annotated features are provided below for each assembly.

Feature counts

Feature	NIATTr2
Genes and pseudogenes	39,977
protein-coding	34,094
non-coding	3,886
pseudogenes	1,997
genes with variants	6,946
mRNAs	44,491
fully-supported	34,196
with > 5% ab initio	9,704
partial	1,014
with filled gap(s)	211
known RefSeq (NM_)	0
model RefSeq (XM_)	44,491
Other RNAs	7,156
fully-supported	6,069
with > 5% ab initio	0
partial	2
with filled gap(s)	2
known RefSeq (NR_)	0
model RefSeq (XR_)	6,069
CDSs	44,491
fully-supported	34,196
with > 5% ab initio	9,773
partial	982
with major correction(s)	174
known RefSeq (NP_)	0
model RefSeq (XP_)	44,491
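
As a quick consistency check on the counts above, the three gene categories can be summed and compared with the reported total. The sketch below hard-codes values copied from the table. The variable names are illustrative and are not part of any NCBI tool.

```python
# Counts copied from the "Feature counts" table for NIATTr2.
gene_counts = {
    "protein-coding": 34_094,
    "non-coding": 3_886,
    "pseudogenes": 1_997,
}
reported_total = 39_977  # "Genes and pseudogenes" row
mrna_count = 44_491      # "mRNAs" row

# Sum the categories and compare with the reported total.
computed_total = sum(gene_counts.values())
totals_match = computed_total == reported_total

# Average number of annotated mRNAs per protein-coding gene.
mrnas_per_coding_gene = mrna_count / gene_counts["protein-coding"]

print(f"computed total: {computed_total:,} (matches report: {totals_match})")
print(f"mRNAs per protein-coding gene: {mrnas_per_coding_gene:.2f}")
```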

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	37,980	4,398	2,878	71	121,267
All transcripts	51,647	1,707	1,450	59	21,231
mRNA	44,491	1,743	1,501	141	16,862
misc_RNA	1,610	2,283	1,969	175	9,288
tRNA	1,087	74	73	71	88
lncRNA	4,459	1,535	942	59	21,231
Single-exon transcripts	6,471	1,070	863	144	6,856
coding transcripts (NM_/XM_)	6,471	1,070	863	144	6,856
CDSs	44,491	1,288	1,059	111	16,296
Exons	195,645	329	177	1	15,662
in coding transcripts (NM_/XM_)	182,704	321	174	1	8,287
in non-coding transcripts (NR_/XR_)	17,960	371	183	2	15,662
Introns	153,966	804	280	30	117,496
in coding transcripts (NM_/XM_)	145,253	778	272	30	117,496
in non-coding transcripts (NR_/XR_)	13,439	1,078	395	45	57,875

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.37	1	1	50
Number of exons per transcript	5.68	4	1	79

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 34,094 coding genes, 27,475 had a protein with an alignment covering 50% or more of the query, and 12,029 had a protein with an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
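
The coverage definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming a single alignment described by 1-based inclusive start and end coordinates on each sequence. The function name and the example coordinates are hypothetical and do not come from the pipeline. The last two lines convert the gene counts quoted above into percentages of all coding genes.

```python
def coverage_percent(aln_start: int, aln_end: int, seq_length: int) -> float:
    """Percentage of a sequence that is included in an alignment.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if seq_length <= 0:
        raise ValueError("seq_length must be positive")
    if not 1 <= aln_start <= aln_end <= seq_length:
        raise ValueError("alignment coordinates must lie within the sequence")
    aligned_length = aln_end - aln_start + 1
    return 100.0 * aligned_length / seq_length


# Hypothetical alignment: residues 11-390 of a 400-residue annotated protein
# (query) aligned to residues 1-380 of a 760-residue target protein.
query_coverage = coverage_percent(11, 390, 400)
target_coverage = coverage_percent(1, 380, 760)

# Share of coding genes at each query-coverage threshold, from the text above.
coding_genes = 34_094
share_at_50 = 100.0 * 27_475 / coding_genes
share_at_95 = 100.0 * 12_029 / coding_genes

print(f"query coverage: {query_coverage:.1f}%, target coverage: {target_coverage:.1f}%")
print(f"genes with >=50% query coverage: {share_at_50:.1f}%")
print(f"genes with >=95% query coverage: {share_at_95:.1f}%")
```

In this hypothetical case the query coverage is high while the target coverage is much lower, which is why the two measures are reported separately.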

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
NIATTr2	GCF_001879085.1	1.33%	47.93%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	265	222 (83.77%)	218 (82.26%)	99.83%	99.26%
Same-species EST	354	305 (86.16%)	273 (77.12%)	99.47%	95.54%
Nicotiana known RefSeq (NM_/NR_)	1,456	1,450 (99.59%)	1,203 (82.62%)	95.88%	98.13%
Nicotiana Genbank	6,400	5,389 (84.20%)	4,000 (62.50%)	95.23%	97.39%
Nicotiana TSA	35,724	33,521 (93.83%)	19,851 (55.57%)	96.96%	95.52%
Nicotiana EST	414,259	294,163 (71.01%)	250,353 (60.43%)	94.41%	97.57%
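
The percentages in this table appear to be expressed relative to the number of sequences retrieved from Entrez, for both the aligned and the passed-to-Gnomon columns. The sketch below recomputes them from the raw counts under that assumption. The data structure and function name are illustrative only.

```python
# (retrieved, aligned by Splign, passed to Gnomon), copied from the table above.
transcript_sources = {
    "Same-species Genbank": (265, 222, 218),
    "Same-species EST": (354, 305, 273),
    "Nicotiana known RefSeq (NM_/NR_)": (1_456, 1_450, 1_203),
    "Nicotiana Genbank": (6_400, 5_389, 4_000),
    "Nicotiana TSA": (35_724, 33_521, 19_851),
    "Nicotiana EST": (414_259, 294_163, 250_353),
}


def percent_of_retrieved(count: int, retrieved: int) -> float:
    """Express a count as a percentage of the retrieved sequences, to 2 decimals."""
    return round(100.0 * count / retrieved, 2)


# Recompute both percentage columns for every source.
recomputed = {
    source: (
        percent_of_retrieved(aligned, retrieved),
        percent_of_retrieved(passed, retrieved),
    )
    for source, (retrieved, aligned, passed) in transcript_sources.items()
}

for source, (pct_aligned, pct_passed) in recomputed.items():
    print(f"{source}: {pct_aligned:.2f}% aligned, {pct_passed:.2f}% passed to Gnomon")
```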

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	3,311,787,556	93%	16%	166,932
SAMN03074605	leaf, mock infection (Nicotiana attenuata, SAMN03074605)	55,483,098	96%	21%	125,331
SAMN03074606	leaf, Alternaria alternata infection (Nicotiana attenuata, SAMN03074606)	53,048,674	96%	22%	127,393
SAMN04260032	leaf, Water (Nicotiana attenuata, SAMN04260032)	48,614,778	92%	10%	108,426
SAMN04260033	leaf, Water (Nicotiana attenuata, SAMN04260033)	93,840,122	94%	10%	125,004
SAMN04260034	leaf, Water (Nicotiana attenuata, SAMN04260034)	65,400,996	93%	10%	118,303
SAMN04260035	leaf, FAC (Nicotiana attenuata, SAMN04260035)	75,122,426	94%	10%	119,508
SAMN04260036	leaf, FAC (Nicotiana attenuata, SAMN04260036)	68,494,928	94%	10%	118,533
SAMN04260037	leaf, FAC (Nicotiana attenuata, SAMN04260037)	69,435,026	94%	10%	117,319
SAMN04260038	leaf, MSOS (Nicotiana attenuata, SAMN04260038)	61,700,284	94%	10%	116,206
SAMN04260039	leaf, MSOS (Nicotiana attenuata, SAMN04260039)	80,047,100	95%	10%	120,816
SAMN04260040	leaf, MSOS (Nicotiana attenuata, SAMN04260040)	78,961,396	94%	10%	122,268
SAMN04260041	leaf, SLOS (Nicotiana attenuata, SAMN04260041)	70,315,466	95%	10%	118,387
SAMN04260042	leaf, SLOS (Nicotiana attenuata, SAMN04260042)	73,731,276	92%	9%	118,494
SAMN04260043	leaf, SLOS (Nicotiana attenuata, SAMN04260043)	84,944,476	95%	10%	122,032
SAMN04625422	leaves, untreated (Nicotiana attenuata, SAMN04625422)	61,531,550	96%	25%	125,685
SAMN04914638	root, untreated 2h (Nicotiana attenuata, SAMN04914638)	43,111,992	95%	24%	127,166
SAMN04914639	root, untreated 2h (Nicotiana attenuata, SAMN04914639)	50,783,152	95%	24%	130,391
SAMN04914640	root, untreated 2h (Nicotiana attenuata, SAMN04914640)	46,625,040	95%	25%	129,815
SAMN04914641	root, untreated 6h (Nicotiana attenuata, SAMN04914641)	52,432,742	95%	25%	130,070
SAMN04914642	root, untreated 6h (Nicotiana attenuata, SAMN04914642)	53,323,914	95%	25%	130,030
SAMN04914644	root, untreated 6h (Nicotiana attenuata, SAMN04914644)	47,700,070	95%	25%	127,544
SAMN04914645	root, smoke treated 2h (Nicotiana attenuata, SAMN04914645)	50,061,622	95%	24%	130,529
SAMN04914646	root, smoke treated 2h (Nicotiana attenuata, SAMN04914646)	51,145,664	95%	24%	130,159
SAMN04914647	root, smoke treated 2h (Nicotiana attenuata, SAMN04914647)	48,289,344	95%	25%	129,596
SAMN04914648	root, smoke treated 6h (Nicotiana attenuata, SAMN04914648)	50,820,376	95%	24%	127,251
SAMN04914649	root, smoke treated 6h (Nicotiana attenuata, SAMN04914649)	52,330,558	94%	24%	129,232
SAMN04914650	root, smoke treated 6h (Nicotiana attenuata, SAMN04914650)	50,964,328	95%	24%	128,551
SAMN05181490	leaves, treated (Nicotiana attenuata, SAMN05181490)	71,082,822	95%	22%	121,221
SAMN05181491	leaves, untreated (Nicotiana attenuata, SAMN05181491)	76,405,140	97%	25%	122,180
SAMN05182674	leaves, treated and untreated (Nicotiana attenuata, SAMN05182674)	37,692,054	93%	18%	126,272
SAMN05182679	roots, treated (Nicotiana attenuata, SAMN05182679)	327,772,944	91%	8%	142,392
SAMN05182680	leaves, treated (Nicotiana attenuata, SAMN05182680)	328,071,888	87%	7%	134,289
SAMN05182681	seeds, smoked (Nicotiana attenuata, SAMN05182681)	51,423,280	90%	7%	90,256
SAMN05182682	seeds, watered (Nicotiana attenuata, SAMN05182682)	75,944,970	91%	8%	106,028
SAMN05182683	seeds, dry (Nicotiana attenuata, SAMN05182683)	63,463,542	90%	7%	103,254
SAMN05182684	stems, treated (Nicotiana attenuata, SAMN05182684)	72,473,514	92%	8%	126,767
SAMN05182685	corollas, early (Nicotiana attenuata, SAMN05182685)	44,064,054	95%	25%	132,636
SAMN05182686	stigmata (Nicotiana attenuata, SAMN05182686)	39,281,658	95%	24%	133,442
SAMN05182687	pollen tubes (Nicotiana attenuata, SAMN05182687)	41,692,244	94%	17%	68,808
SAMN05182688	style, unpollinated (Nicotiana attenuata, SAMN05182688)	45,428,336	93%	21%	128,209
SAMN05182689	style, outcrossed (Nicotiana attenuata, SAMN05182689)	39,071,214	94%	22%	126,882
SAMN05182690	style, selfed (Nicotiana attenuata, SAMN05182690)	50,505,064	94%	22%	131,241
SAMN05182691	nectaries (Nicotiana attenuata, SAMN05182691)	55,777,620	95%	27%	131,857
SAMN05182692	anthers (Nicotiana attenuata, SAMN05182692)	41,480,422	94%	23%	127,459
SAMN05182693	ovaries (Nicotiana attenuata, SAMN05182693)	39,608,326	94%	26%	129,394
SAMN05182694	pedicels (Nicotiana attenuata, SAMN05182694)	41,520,690	94%	24%	132,725
SAMN05182695	flowers (Nicotiana attenuata, SAMN05182695)	43,791,980	94%	23%	135,483
SAMN05182696	flower buds (Nicotiana attenuata, SAMN05182696)	45,864,746	94%	25%	138,575
SAMN05182697	corollas, late (Nicotiana attenuata, SAMN05182697)	41,110,650	94%	23%	130,989

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
SRR1581603	SRX706834	SRP047328	SAMN03074605	55,483,098	96%	21%
SRR1581604	SRX706835	SRP047328	SAMN03074606	53,048,674	96%	22%
SRR2913026	SRX1427413	SRP066048	SAMN04260032	48,614,778	92%	10%
SRR2913027	SRX1427450	SRP066048	SAMN04260033	93,840,122	94%	10%
SRR2913028	SRX1427451	SRP066048	SAMN04260034	65,400,996	93%	10%
SRR2913029	SRX1427452	SRP066048	SAMN04260035	75,122,426	94%	10%
SRR2913030	SRX1427498	SRP066048	SAMN04260036	68,494,928	94%	10%
SRR2913031	SRX1427499	SRP066048	SAMN04260037	69,435,026	94%	10%
SRR2913033	SRX1427500	SRP066048	SAMN04260038	61,700,284	94%	10%
SRR2913034	SRX1427501	SRP066048	SAMN04260039	80,047,100	95%	10%
SRR2913035	SRX1427502	SRP066048	SAMN04260040	78,961,396	94%	10%
SRR2913043	SRX1427503	SRP066048	SAMN04260041	70,315,466	95%	10%
SRR2913044	SRX1427504	SRP066048	SAMN04260042	73,731,276	92%	9%
SRR2913045	SRX1427511	SRP066048	SAMN04260043	84,944,476	95%	10%
SRR3473348	SRX1740640	SRP074311	SAMN04914638	43,111,992	95%	24%
SRR3473349	SRX1740641	SRP074311	SAMN04914639	50,783,152	95%	24%
SRR3473352	SRX1740644	SRP074311	SAMN04914640	46,625,040	95%	25%
SRR3473353	SRX1740645	SRP074311	SAMN04914641	52,432,742	95%	25%
SRR3473354	SRX1740646	SRP074311	SAMN04914642	53,323,914	95%	25%
SRR3473355	SRX1740647	SRP074311	SAMN04914644	47,700,070	95%	25%
SRR3473356	SRX1740648	SRP074311	SAMN04914645	50,061,622	95%	24%
SRR3473357	SRX1740649	SRP074311	SAMN04914646	51,145,664	95%	24%
SRR3473358	SRX1740650	SRP074311	SAMN04914647	48,289,344	95%	25%
SRR3473359	SRX1740651	SRP074311	SAMN04914648	50,820,376	95%	24%
SRR3473350	SRX1740642	SRP074311	SAMN04914649	52,330,558	94%	24%
SRR3473351	SRX1740643	SRP074311	SAMN04914650	50,964,328	95%	24%
SRR3596289	SRX1804540	SRP075847	SAMN05181490	71,082,822	95%	22%
SRR3596343	SRX1804553	SRP075847	SAMN05181491	76,405,140	97%	25%
SRR3596354	SRX1804554	SRP075848	SAMN04625422	61,531,550	96%	25%
SRR3597008	SRX1804765	SRP075848	SAMN05182674	37,692,054	93%	18%
SRR3597502	SRX1804895	SRP075848	SAMN05182679	327,772,944	91%	8%
SRR3597503	SRX1804896	SRP075848	SAMN05182680	328,071,888	87%	7%
SRR3597504	SRX1804897	SRP075848	SAMN05182681	51,423,280	90%	7%
SRR3597513	SRX1804898	SRP075848	SAMN05182682	75,944,970	91%	8%
SRR3597514	SRX1804899	SRP075848	SAMN05182683	63,463,542	90%	7%
SRR3597515	SRX1804900	SRP075848	SAMN05182684	72,473,514	92%	8%
SRR3597516	SRX1804901	SRP075848	SAMN05182685	44,064,054	95%	25%
SRR3597517	SRX1804902	SRP075848	SAMN05182686	39,281,658	95%	24%
SRR3597518	SRX1804903	SRP075848	SAMN05182687	41,692,244	94%	17%
SRR3597519	SRX1804904	SRP075848	SAMN05182688	45,428,336	93%	21%
SRR3597520	SRX1804905	SRP075848	SAMN05182689	39,071,214	94%	22%
SRR3597521	SRX1804906	SRP075848	SAMN05182690	50,505,064	94%	22%
SRR3597522	SRX1804907	SRP075848	SAMN05182691	55,777,620	95%	27%
SRR3597523	SRX1804908	SRP075848	SAMN05182692	41,480,422	94%	23%
SRR3597524	SRX1804909	SRP075848	SAMN05182693	39,608,326	94%	26%
SRR3597525	SRX1804910	SRP075848	SAMN05182694	41,520,690	94%	24%
SRR3597526	SRX1804911	SRP075848	SAMN05182695	43,791,980	94%	23%
SRR3597527	SRX1804912	SRP075848	SAMN05182696	45,864,746	94%	25%
SRR3597528	SRX1804913	SRP075848	SAMN05182697	41,110,650	94%	23%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Arabidopsis thaliana known RefSeq (NP_)	48,113	42,132 (87.57%)	42,132 (87.57%)	66.89%	70.66%
Solanaceae GenBank	12,401	12,082 (97.43%)	12,082 (97.43%)	74.77%	85.74%
Solanaceae known RefSeq (NP_)	5,147	5,083 (98.76%)	5,083 (98.76%)	75.85%	86.42%
Same-species GenBank	220	215 (97.73%)	215 (97.73%)	79.31%	88.56%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
