NCBI Quercus lobata Annotation Release 100

The RefSeq genome records for Quercus lobata were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Quercus lobata Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Oct 1 2019
Date of submission of annotation to the public databases: Oct 7 2019
Software version: 8.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ValleyOak3.0	GCF_001633185.1	University of Maryland Center for Environmental Science	01-11-2019	Reference	12 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ValleyOak3.0
Genes and pseudogenes	45,924
  protein-coding	36,705
  non-coding	5,018
  transcribed pseudogenes	405
  non-transcribed pseudogenes	3,796
  genes with variants	9,343
  immunoglobulin/T-cell receptor gene segments	0
  other	0
mRNAs	53,228
  fully-supported	42,743
  with > 5% ab initio	9,244
  partial	166
  with filled gap(s)	0
  known RefSeq (NM_)	0
  model RefSeq (XM_)	53,228
non-coding RNAs	10,529
  fully-supported	9,010
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	9,922
pseudo transcripts	405
  fully-supported	287
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	405
CDSs	53,228
  fully-supported	42,743
  with > 5% ab initio	9,394
  partial	166
  with major correction(s)	191
  known RefSeq (NP_)	0
  model RefSeq (XP_)	53,228
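
The four gene categories in the table partition the "Genes and pseudogenes" total, which allows a quick consistency check. A minimal Python sketch, with the counts copied from the table above (the variable names are illustrative and not part of the pipeline):

```python
# Gene category counts copied from the Feature counts table (ValleyOak3.0).
gene_categories = {
    "protein-coding": 36_705,
    "non-coding": 5_018,
    "transcribed pseudogenes": 405,
    "non-transcribed pseudogenes": 3_796,
}
reported_total = 45_924  # "Genes and pseudogenes" row

# The four categories should add up to the reported total.
total = sum(gene_categories.values())

# Genes excluding pseudogenes, the population used in the Feature lengths table.
genes_only = gene_categories["protein-coding"] + gene_categories["non-coding"]

print(total == reported_total)
print(genes_only)
```

The second value corresponds to the "Genes" row of the Feature lengths table, which excludes pseudogenes.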

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	41,723	4,829	3,072	62	159,726
All transcripts	63,757	1,910	1,570	62	24,112
  mRNA	53,228	2,008	1,651	93	24,112
  misc_RNA	3,164	2,291	1,950	174	13,108
  tRNA	607	74	73	71	87
  lncRNA	5,847	1,269	914	83	9,504
  snoRNA	485	105	107	62	218
  snRNA	154	145	152	98	201
  rRNA	272	376	119	116	3,398
Single-exon transcripts	7,629	1,190	963	174	9,218
  coding transcripts (NM_/XM_)	7,629	1,190	963	174	9,218
CDSs	53,228	1,467	1,194	93	23,319
Exons	223,132	341	177	1	9,218
  in coding transcripts (NM_/XM_)	203,739	343	177	1	9,218
  in non-coding transcripts (NR_/XR_)	26,399	298	156	2	8,994
Introns	176,858	912	319	30	153,874
  in coding transcripts (NM_/XM_)	162,527	893	306	30	153,874
  in non-coding transcripts (NR_/XR_)	20,814	1,033	445	30	89,968

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.54	1	1	49
Number of exons per transcript	6.07	4	1	79
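
Summary rows of this kind can be derived from a transcript-to-gene mapping. The sketch below is only an illustration of that calculation: the function name, the input structure and the toy IDs are hypothetical, and the pipeline's own implementation is not described in this report.

```python
from collections import Counter
from statistics import mean, median


def transcripts_per_gene_stats(transcript_to_gene):
    """Summarize how many transcripts each gene has.

    transcript_to_gene maps a transcript ID to its parent gene ID.
    """
    # Count how many transcripts point at each gene.
    per_gene = list(Counter(transcript_to_gene.values()).values())
    return {
        "mean": round(mean(per_gene), 2),
        "median": median(per_gene),
        "min": min(per_gene),
        "max": max(per_gene),
    }


# Hypothetical toy input: three genes with 1, 2 and 3 transcripts.
toy = {
    "tx1": "geneA",
    "tx2": "geneB", "tx3": "geneB",
    "tx4": "geneC", "tx5": "geneC", "tx6": "geneC",
}
stats = transcripts_per_gene_stats(toy)
print(stats)
```

The same approach applies to exons per transcript, with an exon-to-transcript mapping as input.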

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the Arabidopsis thaliana known RefSeq proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 36,705 coding genes, 31,123 genes had a protein with an alignment covering 50% or more of the query and 13,130 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
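
The two coverage definitions reduce to the same ratio applied to different sequences. A minimal Python sketch follows; the function name and the example alignment coordinates are hypothetical, while the gene counts are taken from the paragraph above.

```python
def coverage_percent(aligned_length, sequence_length):
    """Percentage of a sequence that is included in an alignment."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return 100.0 * aligned_length / sequence_length


# Hypothetical alignment: residues 11-310 of a 400-aa annotated protein (query)
# align to residues 1-290 of a 300-aa Arabidopsis thaliana protein (target).
query_cov = coverage_percent(310 - 11 + 1, 400)   # 300 aligned residues of 400
target_cov = coverage_percent(290 - 1 + 1, 300)   # 290 aligned residues of 300

# Share of coding genes reaching each query coverage threshold,
# using the counts reported in this section.
coding_genes = 36_705
share_cov50 = round(100.0 * 31_123 / coding_genes, 2)
share_cov95 = round(100.0 * 13_130 / coding_genes, 2)

print(query_cov, round(target_cov, 2), share_cov50, share_cov95)
```

In this example the alignment would count toward the 50% query coverage threshold but not toward the 95% one.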

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: Arabidopsis thaliana known RefSeq proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ValleyOak3.0	GCF_001633185.1	3.71%	36.77%
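
A masked percentage of this kind can be estimated from a soft-masked sequence. The sketch below assumes the common convention that masked bases are written in lowercase; this report does not state how its figures were computed, and the function name and toy sequences are hypothetical.

```python
def masked_percent(sequences):
    """Percentage of bases that are soft-masked (lowercase) across sequences."""
    total = 0
    masked = 0
    for seq in sequences:
        total += len(seq)
        # Lowercase letters mark masked positions under the assumed convention.
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence data")
    return 100.0 * masked / total


# Hypothetical toy scaffolds: 8 of 20 bases are lowercase.
toy_scaffolds = ["ACGTacgtAC", "ggggCCCCTT"]
pct = masked_percent(toy_scaffolds)
print(pct)
```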

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Fagales Genbank	2,801	2,202 (78.61%)	1,022 (36.49%)	91.75%	93.25%
Fagales TSA	440,201	235,825 (53.57%)	69,028 (15.68%)	97.17%	97.10%
Fagales EST	292,856	171,990 (58.73%)	154,696 (52.82%)	96.91%	98.63%
Arabidopsis thaliana known RefSeq (NM_/NR_)	53,827	8,304 (15.43%)	99 (0.18%)	89.86%	78.13%
Arabidopsis thaliana Genbank	194,458	13,820 (7.11%)	403 (0.21%)	89.11%	83.83%
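
Both percentage columns in this table are relative to the number of sequences retrieved from Entrez, not to the number aligned. A short Python sketch that recomputes them for the three Fagales rows (the helper name is illustrative; the counts are copied from the table):

```python
def percent_of_retrieved(count, retrieved):
    """Express a count as a percentage of the sequences retrieved from Entrez."""
    return round(100.0 * count / retrieved, 2)


# (retrieved, aligned by Splign, passed to Gnomon), copied from the table above.
rows = {
    "Fagales Genbank": (2_801, 2_202, 1_022),
    "Fagales TSA": (440_201, 235_825, 69_028),
    "Fagales EST": (292_856, 171_990, 154_696),
}

summary = {}
for source, (retrieved, aligned, passed) in rows.items():
    summary[source] = (
        percent_of_retrieved(aligned, retrieved),
        percent_of_retrieved(passed, retrieved),
    )

for source, (pct_aligned, pct_passed) in summary.items():
    print(source, pct_aligned, pct_passed)
```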

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	Aggregate of all aligned samples	1,518,869,379	89%	24%	239,720
SAMN03261510	expanding leaf bud (Quercus lobata, not collected, SAMN03261510)	44,531,084	88%	25%	154,663
SAMN03261511	expanding leaf bud (Quercus lobata, not collected, SAMN03261511)	32,055,916	89%	25%	155,276
SAMN03261515	expanding leaf bud (Quercus lobata, not collected, SAMN03261515)	37,177,984	87%	25%	148,952
SAMN03261516	expanding leaf bud (Quercus lobata, not collected, SAMN03261516)	26,793,830	89%	25%	146,500
SAMN03567913	expanding bud (leaf) (Quercus lobata, adult, SAMN03567913)	37,228,434	89%	26%	150,890
SAMN03567915	unopened and opening buds (Quercus lobata, adult, SAMN03567915)	38,894,908	86%	25%	157,551
SAMN03567916	expanding bud (leaf) (Quercus lobata, adult, SAMN03567916)	34,220,852	89%	25%	152,193
SAMN03567917	expanding bud (leaf) (Quercus lobata, adult, SAMN03567917)	36,303,738	89%	25%	151,156
SAMN03567918	expanding bud (leaf) (Quercus lobata, adult, SAMN03567918)	23,618,334	88%	21%	145,168
SAMN03567919	unopened buds (Quercus lobata, adult, SAMN03567919)	36,791,248	80%	23%	153,769
SAMN03567920	expanding bud (male flower) (Quercus lobata, adult, SAMN03567920)	28,072,596	88%	22%	147,721
SAMN03567921	expanding bud (male flower) (Quercus lobata, adult, SAMN03567921)	28,613,106	89%	25%	147,774
SAMN03567922	expanding bud (leaf) (Quercus lobata, adult, SAMN03567922)	27,581,926	88%	25%	141,797
SAMN03567923	expanding bud (leaf, male flower) (Quercus lobata, adult, SAMN03567923)	39,072,278	89%	26%	151,471
SAMN03567924	expanding bud (leaf) (Quercus lobata, adult, SAMN03567924)	41,915,272	87%	26%	156,992
SAMN03567925	expanding bud (Quercus lobata, adult, SAMN03567925)	36,999,600	89%	26%	152,924
SAMN03567926	expanding bud (leaf) (Quercus lobata, adult, SAMN03567926)	34,291,686	88%	25%	153,242
SAMN03567927	expanding bud (leaf, male flower) (Quercus lobata, adult, SAMN03567927)	34,919,638	89%	24%	157,505
SAMN03567928	expanding bud (leaf) (Quercus lobata, adult, SAMN03567928)	41,114,780	88%	24%	154,261
SAMN03567929	full-size young leaf (Quercus lobata, adult, SAMN03567929)	28,431,778	89%	24%	142,181
SAMN03567931	expanding bud (leaf) (Quercus lobata, adult, SAMN03567931)	36,636,540	87%	23%	153,428
SAMN03567932	expanding bud (leaf) (Quercus lobata, adult, SAMN03567932)	38,720,504	89%	26%	151,555
SAMN06133239	leaf (Quercus lobata, 11 months, SAMN06133239)	15,169,380	85%	10%	110,892
SAMN06133240	leaf (Quercus lobata, 11 months, SAMN06133240)	31,384,725	88%	10%	122,106
SAMN06133242	leaf (Quercus lobata, 11 months, SAMN06133242)	15,153,206	89%	10%	100,236
SAMN06133243	leaf (Quercus lobata, 11 months, SAMN06133243)	12,647,669	87%	9%	106,890
SAMN06133244	leaf (Quercus lobata, 11 months, SAMN06133244)	10,436,359	76%	9%	85,933
SAMN06133247	leaf (Quercus lobata, 11 months, SAMN06133247)	17,262,764	89%	10%	104,305
SAMN12861979	Leaf, YORK.09.37 (Quercus lobata, SAMN12861979)	55,367,298	91%	24%	149,719
SAMN12861980	Leaf, YORK.09.10 (Quercus lobata, SAMN12861980)	52,675,750	90%	25%	149,636
SAMN12861981	Leaf, YORK.09.08 (Quercus lobata, SAMN12861981)	53,622,068	90%	24%	154,612
SAMN12861982	Leaf, YORK.09.03 (Quercus lobata, SAMN12861982)	59,972,244	90%	24%	156,154
SAMN12861983	Leaf, LAY.15.34 (Quercus lobata, SAMN12861983)	55,581,908	91%	26%	151,977
SAMN12861984	Leaf, LAY.15.30 (Quercus lobata, SAMN12861984)	48,744,026	91%	27%	149,407
SAMN12861985	Leaf, LAY.15.09 (Quercus lobata, SAMN12861985)	51,286,722	91%	25%	148,968
SAMN12861986	Leaf, LAY.15.03 (Quercus lobata, SAMN12861986)	50,605,746	91%	25%	150,346
SAMN12862047	Leaf, HV.11.40 (Quercus lobata, SAMN12862047)	55,992,198	89%	25%	155,819
SAMN12862048	Leaf, HV.11.34 (Quercus lobata, SAMN12862048)	57,270,094	90%	24%	155,241
SAMN12862049	Leaf, HV.11.33 (Quercus lobata, SAMN12862049)	54,047,306	91%	26%	154,289
SAMN12862050	Leaf, HV.11.05 (Quercus lobata, SAMN12862050)	57,663,884	91%	27%	148,983
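
The per-sample rows are tab-separated and can be parsed into typed fields for further analysis. A minimal Python sketch, using two rows copied from the table above (the function and field names are hypothetical):

```python
def parse_sample_row(line):
    """Split one tab-separated row of the per-sample table into typed fields."""
    sample_id, track, reads, pct_aligned, pct_introns, introns = (
        line.rstrip("\n").split("\t")
    )
    return {
        "sample": sample_id,
        "track": track,
        "reads": int(reads.replace(",", "")),            # drop thousands separators
        "pct_aligned": int(pct_aligned.rstrip("%")),
        "pct_with_introns": int(pct_introns.rstrip("%")),
        "introns": int(introns.replace(",", "")),
    }


rows = [
    "SAMN06133243\tleaf (Quercus lobata, 11 months, SAMN06133243)\t12,647,669\t87%\t9%\t106,890",
    "SAMN06133244\tleaf (Quercus lobata, 11 months, SAMN06133244)\t10,436,359\t76%\t9%\t85,933",
]
parsed = [parse_sample_row(row) for row in rows]
total_reads = sum(p["reads"] for p in parsed)
print(total_reads)
```

Summing the read counts of all 40 sample rows reproduces the aggregate of 1,518,869,379 reads. The intron counts are not additive: the aggregate row reports 239,720 introns, presumably because the same intron is detected in many samples.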

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR2062053	SRX1057298	SRP059423	SAMN03261510	44,531,084	88%	25%
SRR2062127	SRX1058022	SRP059423	SAMN03261511	32,055,916	89%	25%
SRR2062078	SRX1057305	SRP059423	SAMN03261515	37,177,984	87%	25%
SRR2062140	SRX1058025	SRP059423	SAMN03261516	26,793,830	89%	25%
SRR2062131	SRX1058023	SRP059423	SAMN03567913	37,228,434	89%	26%
SRR2062073	SRX1057304	SRP059423	SAMN03567915	38,894,908	86%	25%
SRR2062145	SRX1058026	SRP059423	SAMN03567916	34,220,852	89%	25%
SRR2062111	SRX1057313	SRP059423	SAMN03567917	36,303,738	89%	25%
SRR2062089	SRX1057308	SRP059423	SAMN03567918	23,618,334	88%	21%
SRR2062070	SRX1057303	SRP059423	SAMN03567919	36,791,248	80%	23%
SRR2062106	SRX1057312	SRP059423	SAMN03567920	28,072,596	88%	22%
SRR2062119	SRX1057316	SRP059423	SAMN03567921	28,613,106	89%	25%
SRR2062083	SRX1057306	SRP059423	SAMN03567922	27,581,926	88%	25%
SRR2062149	SRX1058036	SRP059423	SAMN03567923	39,072,278	89%	26%
SRR2062061	SRX1057300	SRP059423	SAMN03567924	41,915,272	87%	26%
SRR2062116	SRX1057315	SRP059423	SAMN03567925	36,999,600	89%	26%
SRR2062102	SRX1057311	SRP059423	SAMN03567926	34,291,686	88%	25%
SRR2062147	SRX1058035	SRP059423	SAMN03567927	34,919,638	89%	24%
SRR2062067	SRX1057301	SRP059423	SAMN03567928	41,114,780	88%	24%
SRR2062136	SRX1058024	SRP059423	SAMN03567929	28,431,778	89%	24%
SRR2062095	SRX1057309	SRP059423	SAMN03567931	36,636,540	87%	23%
SRR2062113	SRX1057314	SRP059423	SAMN03567932	38,720,504	89%	26%
SRR5100919	SRX2417406	SRP095023	SAMN06133239	15,169,380	85%	10%
SRR5100914	SRX2417401	SRP095023	SAMN06133240	31,384,725	88%	10%
SRR5100928	SRX2417415	SRP095023	SAMN06133242	15,153,206	89%	10%
SRR5100916	SRX2417403	SRP095023	SAMN06133243	12,647,669	87%	9%
SRR5100923	SRX2417410	SRP095023	SAMN06133244	10,436,359	76%	9%
SRR5100920	SRX2417407	SRP095023	SAMN06133247	17,262,764	89%	10%
SRR10196240	SRX6916382	SRP223525	SAMN12861979	55,367,298	91%	24%
SRR10196239	SRX6916381	SRP223525	SAMN12861980	52,675,750	90%	25%
SRR10196238	SRX6916380	SRP223525	SAMN12861981	53,622,068	90%	24%
SRR10196237	SRX6916379	SRP223525	SAMN12861982	59,972,244	90%	24%
SRR10196236	SRX6916378	SRP223525	SAMN12861983	55,581,908	91%	26%
SRR10196235	SRX6916377	SRP223525	SAMN12861984	48,744,026	91%	27%
SRR10196234	SRX6916376	SRP223525	SAMN12861985	51,286,722	91%	25%
SRR10196233	SRX6916375	SRP223525	SAMN12861986	50,605,746	91%	25%
SRR10196232	SRX6916374	SRP223525	SAMN12862047	55,992,198	89%	25%
SRR10196231	SRX6916373	SRP223525	SAMN12862048	57,270,094	90%	24%
SRR10196230	SRX6916372	SRP223525	SAMN12862049	54,047,306	91%	26%
SRR10196229	SRX6916371	SRP223525	SAMN12862050	57,663,884	91%	27%

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Prunus mume high-quality model RefSeq (XP_)	12,164	11,695 (96.14%)	11,695 (96.14%)	69.58%	82.84%
Cucumis sativus high-quality model RefSeq (XP_)	11,600	11,303 (97.44%)	11,303 (97.44%)	69.06%	80.23%
Arabidopsis thaliana GenBank	53,535	49,467 (92.40%)	49,467 (92.40%)	68.63%	77.21%
Arabidopsis thaliana known RefSeq (NP_)	48,147	42,378 (88.02%)	42,378 (88.02%)	66.67%	73.02%
Juglans regia high-quality model RefSeq (XP_)	18,512	17,873 (96.55%)	17,873 (96.55%)	71.09%	84.07%
Fragaria vesca high-quality model RefSeq (XP_)	13,116	12,632 (96.31%)	12,632 (96.31%)	68.63%	81.04%
Populus euphratica high-quality model RefSeq (XP_)	18,422	17,836 (96.82%)	17,836 (96.82%)	69.62%	81.67%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
