NCBI Lates calcarifer Annotation Release 100

The RefSeq genome records for Lates calcarifer were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Lates calcarifer Annotation Release 100.

Annotation release ID: 100
Date of Entrez queries for transcripts and proteins: Oct 6 2016
Date of submission of annotation to the public databases: Oct 14 2016
Software version: 7.2

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
ASM164080v1	GCF_001640805.1	Temasek Life Sciences Laboratory	05-09-2016	Reference	1 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	ASM164080v1
Genes and pseudogenes	30,767
protein-coding	25,532
non-coding	4,507
pseudogenes	728
genes with variants	9,369
mRNAs	45,210
fully-supported	43,618
with > 5% ab initio	486
partial	773
with filled gap(s)	245
known RefSeq (NM_)	0
model RefSeq (XM_)	45,210
Other RNAs	6,401
fully-supported	4,978
with > 5% ab initio	0
partial	8
with filled gap(s)	8
known RefSeq (NR_)	0
model RefSeq (XR_)	4,978
CDSs	45,414
fully-supported	43,618
with > 5% ab initio	601
partial	766
with major correction(s)	3,656
known RefSeq (NP_)	0
model RefSeq (XP_)	45,210

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	30,039	14,730	6,544	70	729,023
All transcripts	51,611	3,310	2,730	70	89,380
mRNA	45,210	3,612	3,012	183	89,380
misc_RNA	954	3,114	2,632	101	19,933
tRNA	1,423	74	73	70	84
lncRNA	4,024	1,107	722	80	12,745
Single-exon transcripts	981	1,894	1,501	375	13,567
coding transcripts (NM_/XM_ )	981	1,894	1,501	375	13,567
CDSs	45,210	1,988	1,473	96	87,621
Exons	308,376	306	137	1	21,364
in coding transcripts (NM_/XM_ )	294,840	305	138	1	21,364
in non-coding transcripts (NR_/XR_ )	19,912	284	123	2	9,443
Introns	276,911	1,534	353	30	714,103
in coding transcripts (NM_/XM_ )	267,302	1,463	345	30	714,103
in non-coding transcripts (NR_/XR_ )	15,844	2,712	557	30	219,452
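
The counts in the two tables above can be cross-checked with simple arithmetic. The following is a minimal sketch with values copied from the tables; the dictionary names are illustrative, and reading the "Genes" row as "genes excluding pseudogenes" is an inference from the numbers, not something the report states.

```python
# Counts copied from the Feature counts and Feature lengths tables (ASM164080v1).
gene_counts = {"protein-coding": 25_532, "non-coding": 4_507, "pseudogenes": 728}
transcript_counts = {"mRNA": 45_210, "misc_RNA": 954, "tRNA": 1_423, "lncRNA": 4_024}

# "Genes and pseudogenes" in the Feature counts table.
genes_and_pseudogenes = sum(gene_counts.values())

# Assumption: the "Genes" row of the Feature lengths table excludes pseudogenes.
genes_only = genes_and_pseudogenes - gene_counts["pseudogenes"]

# "All transcripts" in the Feature lengths table.
all_transcripts = sum(transcript_counts.values())

print(genes_and_pseudogenes)  # 30767
print(genes_only)             # 30039
print(all_transcripts)        # 51611
```

The three printed values match the reported "Genes and pseudogenes", "Genes" and "All transcripts" counts.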

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.77	1	1	50
Number of exons per transcript	11.92	9	1	229

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 25,329 coding genes, 23,381 genes had a protein with an alignment covering 50% or more of the query and 10,834 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
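
The two coverage definitions can be illustrated with a short sketch. It assumes a single contiguous alignment described by 1-based inclusive coordinates on each sequence; the function name and the example lengths are hypothetical, and the pipeline may combine several alignment segments in a way not shown here.

```python
def coverage_percent(aligned_start: int, aligned_end: int, sequence_length: int) -> float:
    """Percentage of a sequence included in an alignment.

    Coordinates are 1-based and inclusive (an assumption of this sketch).
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    if not 1 <= aligned_start <= aligned_end <= sequence_length:
        raise ValueError("alignment coordinates fall outside the sequence")
    aligned_length = aligned_end - aligned_start + 1
    return 100.0 * aligned_length / sequence_length


# Hypothetical example: a 400-residue annotated protein (query) aligned over
# residues 1..380 to residues 21..400 of a 500-residue curated protein (target).
query_coverage = coverage_percent(1, 380, 400)
target_coverage = coverage_percent(21, 400, 500)

print(query_coverage)   # 95.0
print(target_coverage)  # 76.0
```

In this example the gene would count toward the 95% query coverage threshold even though only 76% of the target is covered.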

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
ASM164080v1	GCF_001640805.1	3.10%	23.22%
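
As an illustration of what a masked percentage measures, the sketch below computes the share of a sequence covered by a set of masked intervals. It is not how WindowMasker or RepeatMasker report their results; the function name, the half-open coordinate convention and the example intervals are assumptions made for this sketch.

```python
def percent_masked(intervals, sequence_length):
    """Percentage of a sequence covered by masked intervals.

    intervals: iterable of (start, end) pairs, 0-based and half-open,
    possibly overlapping. Overlaps are merged so no base is counted twice.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval before starting a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / sequence_length


# Hypothetical 100-base sequence with three masked intervals, two overlapping.
print(percent_masked([(0, 10), (5, 20), (50, 60)], 100))  # 30.0
```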

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species Genbank	124	123 (99.19%)	118 (95.16%)	99.18%	99.52%
Same-species EST	22,315	19,345 (86.69%)	18,088 (81.06%)	99.24%	99.07%
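
The percentages in the table above are taken relative to the number of sequences retrieved from Entrez, which can be confirmed from the counts. A minimal sketch, with the function name chosen for illustration and the counts copied from the table:

```python
def alignment_percentages(retrieved: int, aligned: int, passed: int) -> tuple:
    """Return (% aligned, % passed to Gnomon), both relative to the
    number of sequences retrieved, rounded to two decimals."""
    return (
        round(100.0 * aligned / retrieved, 2),
        round(100.0 * passed / retrieved, 2),
    )


# (retrieved, aligned by Splign, passed to Gnomon), copied from the table.
sources = {
    "Same-species Genbank": (124, 123, 118),
    "Same-species EST": (22_315, 19_345, 18_088),
}

for name, counts in sources.items():
    print(name, alignment_percentages(*counts))
# Same-species Genbank (99.19, 95.16)
# Same-species EST (86.69, 81.06)
```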

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Track name	Number of reads	Percent aligned reads	Percent spliced reads	Number of introns
All	Aggregate of all aligned samples	2,433,639,453	81%	22%	364,113
SAMD00010991	8 RNA-seq library from Asian seabass (Lates calcarifer) (Lates calcarifer, SAMD00010991)	513,512,098	90%	23%	286,576
SAMD00016657	Int6(SW;Feed) (Lates calcarifer, SAMD00016657)	132,249	33%	19%	28,845
SAMD00016658	Int3(PBS) (Lates calcarifer, SAMD00016658)	128,005	37%	21%	25,135
SAMD00016659	Int5(FW;Fasting) (Lates calcarifer, SAMD00016659)	130,721	35%	20%	26,573
SAMD00016660	Int1(LPS) (Lates calcarifer, SAMD00016660)	256,448	39%	21%	43,552
SAMD00016661	Int4(FW;Feed) (Lates calcarifer, SAMD00016661)	238,098	35%	20%	38,881
SAMD00016662	Int2(Vibrio) (Lates calcarifer, SAMD00016662)	118,560	40%	23%	25,100
SAMN02401948	All organs with library normalization (Lates calcarifer, SAMN02401948)	118,761,886	93%	21%	244,971
SAMN02401949	All organs without library normalization (Lates calcarifer, SAMN02401949)	126,025,462	91%	20%	268,866
SAMN02401950	Gut with library normalization (Lates calcarifer, SAMN02401950)	106,620,560	93%	26%	225,918
SAMN02401951	Gut without library normalization (Lates calcarifer, SAMN02401951)	136,076,760	90%	25%	258,422
SAMN02437147	Asian seabass brain (Lates calcarifer, SAMN02437147)	50,264,212	83%	15%	218,625
SAMN02437148	Asian seabass transiting gonad (Lates calcarifer, SAMN02437148)	62,018,454	78%	23%	244,240
SAMN02437149	Asian seabass testis (Lates calcarifer, SAMN02437149)	64,956,070	81%	17%	230,891
SAMN02437150	Asian seabass ovary (Lates calcarifer, SAMN02437150)	67,681,062	90%	26%	201,432
SAMN02437151	Asian seabass spleen (Lates calcarifer, SAMN02437151)	82,334,754	85%	22%	184,198
SAMN02437152	Asian seabass head kidney (Lates calcarifer, SAMN02437152)	69,610,410	81%	19%	151,063
SAMN02437153	Asian seabass intestine (various feeds) (Lates calcarifer, SAMN02437153)	83,416,734	77%	17%	140,829
SAMN02437154	Asian seabass liver (various feeds) (Lates calcarifer, SAMN02437154)	51,007,320	74%	14%	76,252
SAMN02437155	Asian seabass brain (various feeds) (Lates calcarifer, SAMN02437155)	73,156,840	86%	8%	174,352
SAMN02437156	Asian seabass intestine (probiotics) (Lates calcarifer, SAMN02437156)	61,443,772	75%	19%	111,928
SAMN03650330	fibroblast (Lates calcarifer, not determined, SAMN03650330)	414,930,358	92%	31%	264,031
SAMN03862127	liver (Lates calcarifer, male, SAMN03862127)	28,338,300	34%	14%	104,341
SAMN03862128	liver (Lates calcarifer, male, SAMN03862128)	27,607,560	38%	15%	114,972
SAMN03862129	liver (Lates calcarifer, male, SAMN03862129)	26,428,764	38%	14%	108,572
SAMN03862130	tissue (Lates calcarifer, male, SAMN03862130)	25,396,886	38%	15%	90,449
SAMN03862131	liver (Lates calcarifer, male, SAMN03862131)	25,756,166	36%	14%	114,025
SAMN03862132	liver (Lates calcarifer, male, SAMN03862132)	25,617,428	37%	15%	100,435
SAMN03862133	liver (Lates calcarifer, male, SAMN03862133)	28,591,612	39%	15%	132,363
SAMN03862134	liver (Lates calcarifer, male, SAMN03862134)	28,442,230	40%	16%	130,456
SAMN03862135	liver (Lates calcarifer, male, SAMN03862135)	31,638,690	39%	14%	159,085
SAMN03862136	liver (Lates calcarifer, male, SAMN03862136)	26,032,594	39%	15%	135,832
SAMN03890969	liver (Lates calcarifer, male, SAMN03890969)	24,743,654	39%	16%	129,687
SAMN03890970	liver (Lates calcarifer, male, SAMN03890970)	25,908,496	34%	14%	124,623
SAMN03890971	liver (Lates calcarifer, male, SAMN03890971)	26,316,240	41%	16%	122,957

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent spliced reads
DRR002190	DRX001618	DRP000610	SAMD00016657	132,249	33%	19%
DRR002187	DRX001615	DRP000610	SAMD00016658	128,005	37%	21%
DRR002189	DRX001617	DRP000610	SAMD00016659	130,721	35%	20%
DRR002185	DRX001613	DRP000610	SAMD00016660	256,448	39%	21%
DRR002188	DRX001616	DRP000610	SAMD00016661	238,098	35%	20%
DRR002186	DRX001614	DRP000610	SAMD00016662	118,560	40%	23%
DRR014186	DRX012717	DRP001234	SAMD00010991	513,512,098	90%	23%
SRR1032078	SRX378875	SRP033113	SAMN02401948	118,761,886	93%	21%
SRR1032087	SRX378876	SRP033113	SAMN02401949	126,025,462	91%	20%
SRR1032088	SRX378877	SRP033113	SAMN02401950	106,620,560	93%	26%
SRR1032089	SRX378879	SRP033113	SAMN02401951	136,076,760	90%	25%
SRR1791593	SRX867227	SRP053272	SAMN02437147	50,264,212	83%	15%
SRR1791594	SRX867250	SRP053272	SAMN02437148	62,018,454	78%	23%
SRR1791598	SRX867251	SRP053272	SAMN02437149	64,956,070	81%	17%
SRR1791597	SRX867252	SRP053272	SAMN02437150	67,681,062	90%	26%
SRR1791601	SRX867253	SRP053272	SAMN02437151	82,334,754	85%	22%
SRR1795764	SRX867254	SRP053272	SAMN02437152	69,610,410	81%	19%
SRR1795765	SRX867255	SRP053272	SAMN02437153	83,416,734	77%	17%
SRR1795766	SRX867256	SRP053272	SAMN02437154	51,007,320	74%	14%
SRR1795767	SRX867257	SRP053272	SAMN02437155	73,156,840	86%	8%
SRR1795768	SRX867258	SRP053272	SAMN02437156	61,443,772	75%	19%
SRR2015334	SRX1022612	SRP058160	SAMN03650330	414,930,358	92%	31%
SRR2179915	SRX1162664	SRP061524	SAMN03862127	28,338,300	34%	14%
SRR2179916	SRX1162665	SRP061524	SAMN03862128	27,607,560	38%	15%
SRR2179914	SRX1117093	SRP061524	SAMN03862129	26,428,764	38%	14%
SRR2179917	SRX1162666	SRP061524	SAMN03862130	25,396,886	38%	15%
SRR2179918	SRX1162667	SRP061524	SAMN03862131	25,756,166	36%	14%
SRR2179919	SRX1162668	SRP061524	SAMN03862132	25,617,428	37%	15%
SRR2179920	SRX1162669	SRP061524	SAMN03862133	28,591,612	39%	15%
SRR2179921	SRX1162670	SRP061524	SAMN03862134	28,442,230	40%	16%
SRR2179922	SRX1162671	SRP061524	SAMN03862135	31,638,690	39%	14%
SRR2179923	SRX1162673	SRP061524	SAMN03862136	26,032,594	39%	15%
SRR2179935	SRX1162677	SRP061524	SAMN03890969	24,743,654	39%	16%
SRR2179936	SRX1162678	SRP061524	SAMN03890970	25,908,496	34%	14%
SRR2179937	SRX1162679	SRP061524	SAMN03890971	26,316,240	41%	16%
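
Because the by-run table is tab-separated, it can be summarized with standard tools. The sketch below totals the number of reads per project for a short excerpt of the table; the function name is illustrative, and only the seven DRR rows are included to keep the example small.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the by-run table (tab-separated), copied from the report.
BY_RUN_EXCERPT = (
    "Run\tExperiment\tProject\tSample\tNumber of reads\t"
    "Percent aligned reads\tPercent spliced reads\n"
    "DRR002190\tDRX001618\tDRP000610\tSAMD00016657\t132,249\t33%\t19%\n"
    "DRR002187\tDRX001615\tDRP000610\tSAMD00016658\t128,005\t37%\t21%\n"
    "DRR002189\tDRX001617\tDRP000610\tSAMD00016659\t130,721\t35%\t20%\n"
    "DRR002185\tDRX001613\tDRP000610\tSAMD00016660\t256,448\t39%\t21%\n"
    "DRR002188\tDRX001616\tDRP000610\tSAMD00016661\t238,098\t35%\t20%\n"
    "DRR002186\tDRX001614\tDRP000610\tSAMD00016662\t118,560\t40%\t23%\n"
    "DRR014186\tDRX012717\tDRP001234\tSAMD00010991\t513,512,098\t90%\t23%\n"
)


def reads_per_project(table_text: str) -> dict:
    """Sum the 'Number of reads' column for each value of 'Project'."""
    totals = defaultdict(int)
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        # Read counts use thousands separators, so strip the commas first.
        totals[row["Project"]] += int(row["Number of reads"].replace(",", ""))
    return dict(totals)


print(reads_per_project(BY_RUN_EXCERPT))
# {'DRP000610': 1004081, 'DRP001234': 513512098}
```

Applied to all 36 runs, the same summation gives 2,433,639,453 reads, which matches the "All" row of the by-sample table.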

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Cynoglossus semilaevis high-quality model RefSeq (XP_)	13,609	13,531 (99.43%)	13,531 (99.43%)	72.02%	80.55%
Poecilia formosa high-quality model RefSeq (XP_)	18,503	18,364 (99.25%)	18,364 (99.25%)	70.87%	79.57%
Actinopterygii GenBank	76,040	73,364 (96.48%)	73,364 (96.48%)	70.00%	80.07%
Actinopterygii known RefSeq (NP_)	24,660	23,709 (96.14%)	23,709 (96.14%)	68.88%	77.95%
Danio rerio high-quality model RefSeq (XP_)	7,662	7,403 (96.62%)	7,403 (96.62%)	66.46%	73.01%
Astyanax mexicanus high-quality model RefSeq (XP_)	13,209	12,942 (97.98%)	12,942 (97.98%)	68.07%	75.93%
Homo sapiens known RefSeq (NP_)	44,572	37,808 (84.82%)	37,808 (84.82%)	66.63%	67.88%

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
