NCBI Trachemys scripta elegans Annotation Release 100

The RefSeq genome records for Trachemys scripta elegans were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Trachemys scripta elegans Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: May 27 2020
Date of submission of annotation to the public databases: May 31 2020
Software version: 8.4

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
CAS_Tse_1.0	GCF_013100865.1	California Academy of Sciences	05-15-2020	Reference	26 assembled chromosomes; unplaced scaffolds
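
The annotation products for the assembly listed above are distributed on the FTP site. Below is a minimal sketch of how the directory for an assembly could be derived from its accession and name. The host name and the three-digit directory layout are assumptions based on the usual NCBI genomes layout and are not stated in this report; refseq_ftp_directory is a hypothetical helper, not part of any NCBI tool.

```python
def refseq_ftp_directory(accession: str, assembly_name: str) -> str:
    """Build the presumed FTP directory URL for an assembly accession.

    Assumption: assemblies live under genomes/all/<prefix>/<3 digits>/<3 digits>/<3 digits>/.
    """
    prefix, rest = accession.split("_", 1)   # e.g. "GCF", "013100865.1"
    digits = rest.split(".", 1)[0]           # e.g. "013100865"
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError(f"unexpected accession format: {accession!r}")
    # Split the nine digits into three groups of three.
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    # Assumption: spaces in assembly names are replaced by underscores.
    name = assembly_name.replace(" ", "_")
    path = "/".join([prefix, *triplets])
    return f"https://ftp.ncbi.nlm.nih.gov/genomes/all/{path}/{accession}_{name}/"


url = refseq_ftp_directory("GCF_013100865.1", "CAS_Tse_1.0")
print(url)
```

The result should be checked against the FTP site before it is used in a script, since the layout is assumed rather than taken from this report.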

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	CAS_Tse_1.0
Genes and pseudogenes	22,456
  protein-coding	18,662
  non-coding	3,092
  transcribed pseudogenes	0
  non-transcribed pseudogenes	579
  genes with variants	8,911
  immunoglobulin/T-cell receptor gene segments	123
  other	0
mRNAs	42,063
  fully-supported	40,101
  with > 5% ab initio	817
  partial	1,121
  with filled gap(s)	987
  known RefSeq (NM_)	0
  model RefSeq (XM_)	42,063
non-coding RNAs	5,562
  fully-supported	4,807
  with > 5% ab initio	0
  partial	13
  with filled gap(s)	12
  known RefSeq (NR_)	0
  model RefSeq (XR_)	5,196
pseudo transcripts	0
  fully-supported	0
  with > 5% ab initio	0
  partial	0
  with filled gap(s)	0
  known RefSeq (NR_)	0
  model RefSeq (XR_)	0
CDSs	42,199
  fully-supported	40,101
  with > 5% ab initio	976
  partial	1,046
  with major correction(s)	681
  known RefSeq (NP_)	0
  model RefSeq (XP_)	42,076
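
As a quick sanity check on the feature counts, the gene-level sub-categories appear to add up to the "Genes and pseudogenes" total. The sketch below hard-codes the counts from this report. Treating these categories as mutually exclusive is an assumption drawn from the arithmetic, not something the report states; "genes with variants" is left out because it overlaps the other categories.

```python
# Counts copied from the Feature counts table for CAS_Tse_1.0.
gene_categories = {
    "protein-coding": 18_662,
    "non-coding": 3_092,
    "transcribed pseudogenes": 0,
    "non-transcribed pseudogenes": 579,
    "immunoglobulin/T-cell receptor gene segments": 123,
    "other": 0,
}
reported_total = 22_456  # "Genes and pseudogenes"

# Sum of the sub-categories, compared with the reported total.
computed_total = sum(gene_categories.values())
print(computed_total, computed_total == reported_total)

# Share of each category in the reported total.
for name, count in gene_categories.items():
    print(f"{name}: {100 * count / reported_total:.2f}%")

# The mRNA and non-coding RNA counts should likewise sum to the
# "All transcripts" count given in the Feature lengths table (47,625).
all_transcripts = 42_063 + 5_562
print(all_transcripts)
```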

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	21,754	50,618	19,799	37	1,855,422
All transcripts	47,625	3,507	2,882	37	100,643
  mRNA	42,063	3,736	3,084	174	100,643
  misc_RNA	1,401	3,150	2,636	167	19,287
  tRNA	364	74	73	66	84
  lncRNA	3,406	1,570	1,096	93	9,886
  snoRNA	254	113	100	55	312
  snRNA	114	133	127	37	190
  guide_RNA	14	185	143	86	410
  rRNA	9	727	119	119	3,206
Single-exon transcripts	1,309	1,462	957	174	14,816
  coding transcripts (NM_/XM_)	1,309	1,462	957	174	14,816
CDSs	42,076	2,091	1,494	96	99,069
Exons	229,088	320	137	1	18,115
  in coding transcripts (NM_/XM_)	217,738	313	137	1	18,115
  in non-coding transcripts (NR_/XR_)	20,584	330	137	2	11,442
Introns	206,217	6,391	1,723	30	1,151,776
  in coding transcripts (NM_/XM_)	197,970	6,307	1,701	30	1,151,776
  in non-coding transcripts (NR_/XR_)	17,186	7,148	2,085	30	425,786

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	2.21	1	1	50
Number of exons per transcript	12.34	9	1	286

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 18,649 coding genes, 18,046 genes had a protein with an alignment covering 50% or more of the query and 12,114 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
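
The two coverage definitions can be illustrated with a short sketch. It assumes a single ungapped alignment span given in 1-based inclusive coordinates, which is a simplification; the pipeline's exact handling of gaps or multiple alignment segments is not described in this report. The function name and the example protein lengths are hypothetical. The last lines restate the gene counts from this section as percentages.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence covered by an alignment span (1-based, inclusive)."""
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment span must lie within the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical example: a 400-residue annotated protein (query) aligned over
# residues 21-400 to residues 1-380 of a 500-residue curated protein (target).
query_coverage = coverage_percent(21, 400, 400)
target_coverage = coverage_percent(1, 380, 500)
print(query_coverage, target_coverage)

# Gene counts reported in this section, expressed as percentages.
coding_genes = 18_649
genes_cov50 = 18_046   # query coverage of 50% or more
genes_cov95 = 12_114   # query coverage of 95% or more
print(f"{100 * genes_cov50 / coding_genes:.2f}% of coding genes at 50% or more")
print(f"{100 * genes_cov95 / coding_genes:.2f}% of coding genes at 95% or more")
```

In this example the query is almost fully covered while the target is not, the pattern expected when an annotated protein is shorter than its best curated match.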

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
CAS_Tse_1.0	GCF_013100865.1	14.48%	30.24%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	188	175 (93.09%)	120 (63.83%)	99.16%	97.83%

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	2,577,867,968	68%	24%	247,642
SAMN05437666	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN05437666)	29,445,004	76%	23%	154,107
SAMN05437667	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN05437667)	21,481,498	65%	19%	116,507
SAMN05437668	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN05437668)	30,678,864	68%	23%	144,544
SAMN05437669	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN05437669)	27,731,346	72%	22%	141,935
SAMN05437670	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN05437670)	34,500,292	75%	23%	152,997
SAMN05437671	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN05437671)	29,827,362	74%	22%	146,909
SAMN05579945	NA	Juvenile, Spinal cord T3 sham-lesion epicenter +- 2mm (Trachemys scripta elegans, 1 year, SAMN05579945)	43,152,330	61%	14%	147,432
SAMN05580037	NA	Juvenile, Spinal cord T3 sham-lesion epicenter +- 2mm (Trachemys scripta elegans, 1 year, SAMN05580037)	27,743,364	67%	7%	120,828
SAMN05580097	NA	Juvenile, Spinal cord T3 lesion epicenter +- 2mm (Trachemys scripta elegans, 1 year, SAMN05580097)	38,985,982	62%	17%	149,578
SAMN05580101	NA	Juvenile, Spinal cord T3 lesion epicenter +- 2mm (Trachemys scripta elegans, 1 year, SAMN05580101)	9,946,296	68%	8%	95,487
SAMN05580167	NA	Juvenile, Spinal cord T3 lesion epicenter +- 2mm (Trachemys scripta elegans, 1 year, SAMN05580167)	17,446,882	67%	8%	113,405
SAMN05580238	NA	Juvenile, Pool of Brain, spinal cord and liver (Trachemys scripta elegans, 1 year, SAMN05580238)	35,567,558	57%	10%	127,623
SAMN07416878	27671871	whole embryo (Trachemys scripta elegans, male, SAMN07416878)	91,111,140	72%	20%	197,983
SAMN07416880	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN07416880)	48,252,574	73%	13%	136,978
SAMN07416882	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN07416882)	40,987,734	78%	18%	163,557
SAMN07416884	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN07416884)	55,397,496	59%	16%	134,166
SAMN07416886	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN07416886)	53,142,262	75%	20%	160,807
SAMN07416891	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN07416891)	45,389,340	67%	18%	140,726
SAMN07416892	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN07416892)	37,986,560	72%	19%	155,584
SAMN07416894	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN07416894)	80,335,122	75%	21%	179,477
SAMN07416896	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN07416896)	48,964,320	73%	21%	165,460
SAMN07416898	27671871	whole embryo (Trachemys scripta elegans, female, SAMN07416898)	43,374,782	72%	21%	179,650
SAMN07416899	27671871	embryonic gonad (Trachemys scripta elegans, female, SAMN07416899)	49,915,768	55%	16%	128,735
SAMN07416901	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN07416901)	37,466,138	76%	20%	157,751
SAMN07416902	27671871	embryonic gonad (Trachemys scripta elegans, male, SAMN07416902)	42,954,434	77%	18%	156,683
SAMN09691329	NA	liver (Trachemys scripta elegans, two years, SAMN09691329)	48,616,356	68%	46%	163,750
SAMN09691330	NA	liver (Trachemys scripta elegans, two years, SAMN09691330)	57,912,616	68%	44%	168,664
SAMN09691331	NA	liver (Trachemys scripta elegans, two years, SAMN09691331)	50,681,398	67%	45%	165,805
SAMN11081407	NA	liver (Trachemys scripta elegans, female, SAMN11081407)	157,384,476	72%	20%	193,469
SAMN11081408	NA	liver (Trachemys scripta elegans, female, SAMN11081408)	177,029,988	69%	21%	185,745
SAMN11081409	NA	liver (Trachemys scripta elegans, female, SAMN11081409)	160,902,862	68%	20%	182,719
SAMN11081410	NA	liver (Trachemys scripta elegans, female, SAMN11081410)	161,001,556	66%	19%	180,572
SAMN11081411	NA	liver (Trachemys scripta elegans, female, SAMN11081411)	148,150,044	66%	20%	180,449
SAMN11081412	NA	liver (Trachemys scripta elegans, female, SAMN11081412)	150,765,402	66%	20%	179,972
SAMN14377601	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377601)	47,526,602	66%	44%	149,007
SAMN14377602	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377602)	47,839,876	60%	36%	135,000
SAMN14377603	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377603)	45,064,018	65%	43%	149,320
SAMN14377604	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377604)	49,574,362	63%	42%	152,085
SAMN14377605	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377605)	50,612,010	62%	40%	157,379
SAMN14377606	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377606)	44,721,466	66%	43%	150,027
SAMN14377607	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377607)	58,253,826	64%	44%	141,833
SAMN14377608	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377608)	49,010,824	53%	36%	149,836
SAMN14377609	NA	liver (Trachemys scripta elegans, 1 year, SAMN14377609)	51,035,838	62%	41%	146,471

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
SRR3948583	SRX1973580	SRP079664	SAMN05437666	29,445,004	76%	23%
SRR3948584	SRX1973581	SRP079664	SAMN05437667	21,481,498	65%	19%
SRR3948586	SRX1973582	SRP079664	SAMN05437668	30,678,864	68%	23%
SRR3948587	SRX1973583	SRP079664	SAMN05437669	27,731,346	72%	22%
SRR3948588	SRX1973584	SRP079664	SAMN05437670	34,500,292	75%	23%
SRR3948589	SRX1973585	SRP079664	SAMN05437671	29,827,362	74%	22%
SRR5871182	SRX3038760	SRP079664	SAMN07416878	91,111,140	72%	20%
SRR5871181	SRX3038761	SRP079664	SAMN07416880	48,252,574	73%	13%
SRR5871184	SRX3038758	SRP079664	SAMN07416882	40,987,734	78%	18%
SRR5871183	SRX3038759	SRP079664	SAMN07416884	55,397,496	59%	16%
SRR5871180	SRX3038762	SRP079664	SAMN07416886	53,142,262	75%	20%
SRR5871187	SRX3038755	SRP079664	SAMN07416891	45,389,340	67%	18%
SRR5871186	SRX3038756	SRP079664	SAMN07416892	37,986,560	72%	19%
SRR5871188	SRX3038754	SRP079664	SAMN07416894	80,335,122	75%	21%
SRR5871185	SRX3038757	SRP079664	SAMN07416896	48,964,320	73%	21%
SRR5871178	SRX3038764	SRP079664	SAMN07416898	43,374,782	72%	21%
SRR5871177	SRX3038765	SRP079664	SAMN07416899	49,915,768	55%	16%
SRR5871176	SRX3038766	SRP079664	SAMN07416901	37,466,138	76%	20%
SRR5871179	SRX3038763	SRP079664	SAMN07416902	42,954,434	77%	18%
SRR4046640	SRX2037551	SRP082501	SAMN05579945	43,152,330	61%	14%
SRR4046642	SRX2037553	SRP082501	SAMN05580037	27,743,364	67%	7%
SRR4046641	SRX2037552	SRP082501	SAMN05580097	38,985,982	62%	17%
SRR4046643	SRX2037554	SRP082501	SAMN05580101	9,946,296	68%	8%
SRR4046644	SRX2037555	SRP082501	SAMN05580167	17,446,882	67%	8%
SRR4046638	SRX2037550	SRP082501	SAMN05580238	35,567,558	57%	10%
SRR7540571	SRX4408099	SRP154424	SAMN09691329	48,616,356	68%	46%
SRR7540570	SRX4408098	SRP154424	SAMN09691330	57,912,616	68%	44%
SRR7540569	SRX4408097	SRP154424	SAMN09691331	50,681,398	67%	45%
SRR8695401	SRX5491254	SRP187832	SAMN11081407	157,384,476	72%	20%
SRR8695402	SRX5491253	SRP187832	SAMN11081408	177,029,988	69%	21%
SRR8695399	SRX5491256	SRP187832	SAMN11081409	160,902,862	68%	20%
SRR8695400	SRX5491255	SRP187832	SAMN11081410	161,001,556	66%	19%
SRR8695397	SRX5491258	SRP187832	SAMN11081411	148,150,044	66%	20%
SRR8695398	SRX5491257	SRP187832	SAMN11081412	150,765,402	66%	20%
SRR11306940	SRX7912055	SRP252802	SAMN14377601	47,526,602	66%	44%
SRR11306939	SRX7912056	SRP252802	SAMN14377602	47,839,876	60%	36%
SRR11306938	SRX7912057	SRP252802	SAMN14377603	45,064,018	65%	43%
SRR11306937	SRX7912058	SRP252802	SAMN14377604	49,574,362	63%	42%
SRR11306936	SRX7912059	SRP252802	SAMN14377605	50,612,010	62%	40%
SRR11306935	SRX7912060	SRP252802	SAMN14377606	44,721,466	66%	43%
SRR11306934	SRX7912061	SRP252802	SAMN14377607	58,253,826	64%	44%
SRR11306933	SRX7912062	SRP252802	SAMN14377608	49,010,824	53%	36%
SRR11306932	SRX7912063	SRP252802	SAMN14377609	51,035,838	62%	41%
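
The by-run table is tab-separated, so it can be summarized with the Python standard library. The sketch below sums read counts per SRA project for a small excerpt of the rows above; the function name reads_per_project is hypothetical, and the excerpt stands in for the full table.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the by-run table (tab-separated), copied from this report.
RUN_TABLE = (
    "Run\tExperiment\tProject\tSample\tNumber of reads\t"
    "Percent aligned reads\tPercent of aligned reads with introns\n"
    "SRR7540571\tSRX4408099\tSRP154424\tSAMN09691329\t48,616,356\t68%\t46%\n"
    "SRR7540570\tSRX4408098\tSRP154424\tSAMN09691330\t57,912,616\t68%\t44%\n"
    "SRR7540569\tSRX4408097\tSRP154424\tSAMN09691331\t50,681,398\t67%\t45%\n"
    "SRR4046643\tSRX2037554\tSRP082501\tSAMN05580101\t9,946,296\t68%\t8%\n"
    "SRR4046644\tSRX2037555\tSRP082501\tSAMN05580167\t17,446,882\t67%\t8%\n"
)


def reads_per_project(tsv_text: str) -> dict:
    """Sum the 'Number of reads' column for each value of 'Project'."""
    totals = defaultdict(int)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Read counts use thousands separators, which int() does not accept.
        totals[row["Project"]] += int(row["Number of reads"].replace(",", ""))
    return dict(totals)


totals = reads_per_project(RUN_TABLE)
for project, reads in sorted(totals.items()):
    print(f"{project}\t{reads:,}")
```

Applied to all rows, the same sum gives the aggregate read count reported on the "All" line of the by-sample table (2,577,867,968).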

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Pelodiscus sinensis high-quality model RefSeq (XP_)	10,355	9,865 (95.27%)	9,865 (95.27%)	74.91%	85.44%
Same-species GenBank	187	163 (87.17%)	163 (87.17%)	75.52%	81.27%
Xenopus GenBank	31,816	8,613 (27.07%)	8,613 (27.07%)	68.42%	75.28%
Xenopus known RefSeq (NP_)	19,656	18,271 (92.95%)	18,271 (92.95%)	69.73%	78.96%
Sauropsida GenBank	29,217	16,904 (57.86%)	16,904 (57.86%)	68.66%	75.49%
Sauropsida known RefSeq (NP_)	8,135	7,592 (93.33%)	7,592 (93.33%)	73.19%	82.03%
Chrysemys picta high-quality model RefSeq (XP_)	14,390	13,793 (95.85%)	13,793 (95.85%)	78.57%	86.73%
Homo sapiens GenBank	144,554	74,889 (51.81%)	74,889 (51.81%)	63.55%	77.83%
Homo sapiens known RefSeq (NP_)	57,498	39,927 (69.44%)	39,927 (69.44%)	69.93%	77.19%
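
The "Number (%)" columns in the alignment tables pair a count with its percentage of the sequences retrieved from Entrez. The sketch below parses such cells and recomputes the percentage for three rows copied from the table above; the helper name parse_count_percent is hypothetical. The same approach applies to the Splign columns of the transcript alignment table.

```python
import re

# Matches cells such as "13,793 (95.85%)".
CELL_PATTERN = re.compile(r"^([\d,]+) \(([\d.]+)%\)$")


def parse_count_percent(cell: str) -> tuple:
    """Split a 'count (percent%)' cell into an int count and a float percentage."""
    match = CELL_PATTERN.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return int(match.group(1).replace(",", "")), float(match.group(2))


# (source, sequences retrieved from Entrez, aligned by ProSplign), from this report.
rows = [
    ("Chrysemys picta high-quality model RefSeq (XP_)", "14,390", "13,793 (95.85%)"),
    ("Xenopus GenBank", "31,816", "8,613 (27.07%)"),
    ("Homo sapiens GenBank", "144,554", "74,889 (51.81%)"),
]

for source, retrieved_text, aligned_cell in rows:
    retrieved = int(retrieved_text.replace(",", ""))
    aligned, reported = parse_count_percent(aligned_cell)
    recomputed = 100 * aligned / retrieved
    print(f"{source}: reported {reported:.2f}%, recomputed {recomputed:.2f}%")
```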

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
