NCBI Maylandia zebra Annotation Release 104

The RefSeq genome records for Maylandia zebra were annotated by the NCBI Eukaryotic Genome Annotation Pipeline, an automated pipeline that annotates genes, transcripts and proteins on draft and finished genome assemblies. This report presents statistics on the annotation products, the input data used in the pipeline and intermediate alignment results.

The annotation products are available in the sequence databases and on the FTP site.

This report provides:

Annotation Release information: The name of the release, important dates, the software version
Assemblies: A brief description of the annotated assembly(ies)
Gene and feature statistics: The counts and characteristics of the annotated features
Alignment of the annotated proteins to a set of high-quality proteins: The number of annotated proteins with hits to a set of high-quality proteins
Masking of genomic sequence: How much of the genome was masked
Transcript and protein alignments: The number and type of evidence retrieved from public databases and used for gene prediction
Similarity of current and previous assembly: The similarity of the current and previous assembly
Comparison of the current and previous annotations: What proportion of the genes changed in this annotation

For more information on the annotation process, please visit the NCBI Eukaryotic Genome Annotation Pipeline page.

Annotation Release information

This annotation should be referred to as NCBI Maylandia zebra Annotation Release 104.

Annotation release ID: 104
Date of Entrez queries for transcripts and proteins: Apr 19 2018
Date of submission of annotation to the public databases: Apr 24 2018
Software version: 8.0

Assemblies

The following assemblies were included in this annotation run:

Assembly name	Assembly accession	Submitter	Assembly date	Reference/Alternate	Assembly content
M_zebra_UMD2a	GCF_000238955.4	Broad Institute	04-10-2018	Reference	23 assembled chromosomes; unplaced scaffolds

Gene and feature statistics

Counts and lengths of annotated features are provided below for each assembly.

Feature counts

Feature	M_zebra_UMD2a
Genes and pseudogenes	32,471
protein-coding	25,898
non-coding	5,149
transcribed pseudogenes	1
non-transcribed pseudogenes	1,238
genes with variants	9,566
immunoglobulin/T-cell receptor gene segments	185
other	0
mRNAs	46,160
fully-supported	43,159
with > 5% ab initio	1,575
partial	655
with filled gap(s)	246
known RefSeq (NM_)	12
model RefSeq (XM_)	46,148
non-coding RNAs	6,209
fully-supported	4,047
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	4,851
pseudo transcripts	1
fully-supported	1
with > 5% ab initio	0
partial	0
with filled gap(s)	0
known RefSeq (NR_)	0
model RefSeq (XR_)	1
CDSs	46,358
fully-supported	43,159
with > 5% ab initio	1,752
partial	654
with major correction(s)	478
known RefSeq (NP_)	12
model RefSeq (XP_)	46,161
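
The sub-rows of the table can be checked against their parent rows. The Python sketch below is a minimal illustration using counts copied from the table. The variable and function names are hypothetical, and treating "genes with variants" as an attribute rather than a separate gene category is an assumption.

```python
# Counts copied from the Feature counts table for M_zebra_UMD2a.
# Assumption: "genes with variants" is an attribute that cuts across
# categories, so it is left out of the sum.
gene_counts = {
    "protein-coding": 25_898,
    "non-coding": 5_149,
    "transcribed pseudogenes": 1,
    "non-transcribed pseudogenes": 1_238,
    "immunoglobulin/T-cell receptor gene segments": 185,
    "other": 0,
}
genes_and_pseudogenes = 32_471

mrna_counts = {"known RefSeq (NM_)": 12, "model RefSeq (XM_)": 46_148}
mrnas_total = 46_160


def check_total(parts, reported):
    """Return True when the sub-row counts add up to the reported total."""
    return sum(parts.values()) == reported


print(check_total(gene_counts, genes_and_pseudogenes))
print(check_total(mrna_counts, mrnas_total))
```

The same check on the CDS rows does not balance against the 46,358 reported in the CDSs row: 12 + 46,161 = 46,173, which is the CDS count given in the Feature lengths table. The report does not state the reason for the difference.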

Detailed reports

The counts below do not include pseudogenes.

Feature lengths

Feature	Count	Mean length (bp)	Median length (bp)	Min length (bp)	Max length (bp)
Genes	31,047	16,347	6,815	56	1,249,466
All transcripts	52,369	3,248	2,541	56	92,247
mRNA	46,160	3,585	2,842	201	92,247
misc_RNA	652	2,710	2,300	110	13,599
tRNA	1,356	74	73	67	84
lncRNA	3,403	733	508	77	10,853
snoRNA	180	118	119	65	345
snRNA	221	155	164	56	200
guide_RNA	7	225	272	130	362
rRNA	390	441	119	116	3,928
Single-exon transcripts	971	1,691	1,398	281	8,817
coding transcripts (NM_/XM_)	969	1,688	1,392	281	8,817
non-coding transcripts (NR_/XR_)	2	3,259	3,634	2,884	3,634
CDSs	46,173	2,177	1,530	96	90,936
Exons	303,046	279	135	1	21,574
in coding transcripts (NM_/XM_)	291,771	280	136	1	21,574
in non-coding transcripts (NR_/XR_)	15,733	224	116	3	10,920
Introns	274,103	1,863	381	30	1,095,440
in coding transcripts (NM_/XM_)	266,321	1,872	382	30	1,095,440
in non-coding transcripts (NR_/XR_)	12,131	1,650	366	31	207,653

Transcripts per gene, exons per transcript

	Mean	Median	Min	Max
Number of transcripts per gene	1.72	1	1	37
Number of exons per transcript	12.63	9	1	238

Alignment of the annotated proteins to a set of high-quality proteins

The final set of annotated proteins was searched with BLASTP against the UniProtKB/Swiss-Prot curated proteins, using the annotated proteins as the query and the high-quality proteins as the target. Out of 25,885 coding genes, 23,491 genes had a protein with an alignment covering 50% or more of the query and 11,042 had an alignment covering 95% or more of the query.

Definition of query and target coverage. The query coverage is the percentage of the annotated protein length that is included in the alignment. The target coverage is the percentage of the target length that is included in the alignment.
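
The two definitions can be sketched in Python as follows. This is a minimal illustration. It assumes each alignment is summarised by 1-based inclusive start and end coordinates on the query and on the target. The `Alignment` class and its field names are hypothetical, not part of any NCBI tool, and a gapped BLASTP alignment made of several segments would need the aligned spans to be merged first.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical summary of one alignment; coordinates are 1-based, inclusive.
    query_start: int
    query_end: int
    query_length: int
    target_start: int
    target_end: int
    target_length: int


def query_coverage(aln: Alignment) -> float:
    """Percentage of the annotated (query) protein included in the alignment."""
    aligned = aln.query_end - aln.query_start + 1
    return 100.0 * aligned / aln.query_length


def target_coverage(aln: Alignment) -> float:
    """Percentage of the target protein included in the alignment."""
    aligned = aln.target_end - aln.target_start + 1
    return 100.0 * aligned / aln.target_length


# Example: residues 1-250 of a 500-residue query aligned to
# residues 11-260 of a 1000-residue target.
example = Alignment(1, 250, 500, 11, 260, 1000)
print(query_coverage(example), target_coverage(example))
```

With these definitions, a short annotated protein that aligns in full to part of a long Swiss-Prot entry has high query coverage and low target coverage.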

Below is a cumulative graph displaying the number of genes with alignments above a given query or target coverage threshold. For comparison, corresponding statistics for other organisms annotated by the NCBI eukaryotic annotation pipeline were added to the graph.

Query: annotated proteins
Target: UniProtKB/Swiss-Prot curated proteins

Masking of genomic sequence

Transcript and protein alignments are performed on the repeat-masked genome. Below are the percentages of genomic sequence masked by WindowMasker and RepeatMasker for each assembly. RepeatMasker results are only used for organisms for which a comprehensive repeat library is available.

For this annotation run, transcripts and proteins were aligned to the genome masked with WindowMasker only.

Assembly name	Assembly accession	% Masked with RepeatMasker	% Masked with WindowMasker
M_zebra_UMD2a	GCF_000238955.4	6.22%	30.95%

Transcript and protein alignments

The annotation pipeline relies heavily on alignments of experimental evidence for gene prediction. Below are the sets of transcripts and proteins that were retrieved from Entrez, aligned to the genome by Splign or ProSplign and passed to Gnomon, NCBI's gene prediction software.

Depending on the other evidence available, long 454 reads (with average length above 250 nt) may be aligned as traditional evidence and reported in the Transcript alignments section or aligned with RNA-Seq reads and reported in the RNA-Seq alignments section.

Transcript alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by Splign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species known RefSeq (NM_/NR_)	12	12 (100.00%)	12 (100.00%)	99.16%	99.83%
Same-species GenBank	203	202 (99.51%)	197 (97.04%)	99.49%	99.20%
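
The percentages in both count columns appear to use the number of sequences retrieved from Entrez as the denominator: 197 of 203 is 97.04%, whereas 197 of the 202 aligned sequences would be 97.52%. A minimal Python sketch of that arithmetic, with a hypothetical helper name:

```python
def percent_of_retrieved(count: int, retrieved: int) -> str:
    """Format a count the way the table does: 'N (P%)', with P relative to retrieved."""
    return f"{count} ({100 * count / retrieved:.2f}%)"


# Same-species GenBank row: 203 retrieved, 202 aligned, 197 passed to Gnomon.
retrieved, aligned, passed = 203, 202, 197
print(percent_of_retrieved(aligned, retrieved))
print(percent_of_retrieved(passed, retrieved))
```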

RNA-Seq alignments

The following RNA-Seq reads from the Sequence Read Archive were also used for gene prediction:

Alignment statistics, by sample (SAME, SAMN, SAMD, DRS)

Sample Id	Publication	Track name	Number of reads	Percent aligned reads	Percent of aligned reads with introns	Number of introns
All	NA	Aggregate of all aligned samples	716,542,638	81%	28%	325,379
SAMEA4513632	NA	liver (Maylandia zebra, male, SAMEA4513632)	23,246,884	70%	17%	114,318
SAMEA4513633	NA	liver (Maylandia zebra, male, SAMEA4513633)	26,048,546	76%	19%	136,458
SAMEA4513634	NA	liver (Maylandia zebra, male, SAMEA4513634)	24,574,878	76%	16%	137,327
SAMEA4513677	NA	muscle (Maylandia zebra, male, SAMEA4513677)	20,851,898	76%	18%	123,965
SAMEA4513678	NA	muscle (Maylandia zebra, male, SAMEA4513678)	23,757,956	74%	18%	123,036
SAMEA4513679	NA	muscle (Maylandia zebra, male, SAMEA4513679)	23,893,410	70%	15%	134,884
SAMN00760990	25186727	skin (Maylandia zebra, female, SAMN00760990)	46,901,284	83%	31%	222,074
SAMN00760991	25186727	skin (Maylandia zebra, male, SAMN00760991)	70,138,352	75%	37%	179,413
SAMN00760992	25186727	liver (Maylandia zebra, male, SAMN00760992)	53,334,042	89%	38%	175,980
SAMN00760993	25186727	embryo (Maylandia zebra, SAMN00760993)	51,045,994	88%	31%	261,368
SAMN00760994	25186727	kidney (Maylandia zebra, female, SAMN00760994)	51,648,568	82%	28%	239,626
SAMN00760995	25186727	testis (Maylandia zebra, male, SAMN00760995)	47,610,476	88%	33%	233,934
SAMN00760996	25186727	ovary (Maylandia zebra, female, SAMN00760996)	52,199,552	91%	30%	231,742
SAMN00760997	25186727	brain (Maylandia zebra, male, SAMN00760997)	55,473,956	84%	22%	236,043
SAMN00760998	25186727	eye (retina) (Maylandia zebra, male, SAMN00760998)	49,414,820	80%	23%	227,026
SAMN00760999	25186727	heart (Maylandia zebra, female, SAMN00760999)	48,854,772	79%	28%	204,862
SAMN00761736	25186727	pooled samples (Maylandia zebra, SAMN00761736)	47,547,250	82%	24%	185,433

Alignment statistics, by run (ERR, SRR, DRR)

Run	Experiment	Project	Sample	Number of reads	Percent aligned reads	Percent of aligned reads with introns
ERR1940482	ERX2001015	ERP016820	SAMEA4513632	5,793,672	70%	17%
ERR1940572	ERX2001105	ERP016820	SAMEA4513632	5,820,904	70%	16%
ERR1952897	ERX2018493	ERP016820	SAMEA4513632	5,829,532	71%	16%
ERR1952987	ERX2018583	ERP016820	SAMEA4513632	5,802,776	71%	17%
ERR1940483	ERX2001016	ERP016820	SAMEA4513633	6,495,064	75%	19%
ERR1940573	ERX2001106	ERP016820	SAMEA4513633	6,527,408	75%	19%
ERR1952898	ERX2018494	ERP016820	SAMEA4513633	6,533,014	76%	19%
ERR1952988	ERX2018584	ERP016820	SAMEA4513633	6,493,060	76%	19%
ERR1940484	ERX2001017	ERP016820	SAMEA4513634	6,128,012	75%	16%
ERR1940574	ERX2001107	ERP016820	SAMEA4513634	6,153,708	75%	16%
ERR1952899	ERX2018495	ERP016820	SAMEA4513634	6,162,930	76%	16%
ERR1952989	ERX2018585	ERP016820	SAMEA4513634	6,130,228	76%	16%
ERR1940527	ERX2001060	ERP016820	SAMEA4513677	5,177,284	76%	18%
ERR1940617	ERX2001150	ERP016820	SAMEA4513677	5,196,402	76%	18%
ERR1952942	ERX2018538	ERP016820	SAMEA4513677	5,245,760	76%	18%
ERR1953032	ERX2018628	ERP016820	SAMEA4513677	5,232,452	76%	18%
ERR1940528	ERX2001061	ERP016820	SAMEA4513678	5,911,586	74%	18%
ERR1940618	ERX2001151	ERP016820	SAMEA4513678	5,925,084	74%	18%
ERR1952943	ERX2018539	ERP016820	SAMEA4513678	5,976,780	74%	18%
ERR1953033	ERX2018629	ERP016820	SAMEA4513678	5,944,506	74%	18%
ERR1940529	ERX2001062	ERP016820	SAMEA4513679	5,973,854	70%	15%
ERR1940619	ERX2001152	ERP016820	SAMEA4513679	6,007,712	70%	15%
ERR1952944	ERX2018540	ERP016820	SAMEA4513679	5,969,934	70%	15%
ERR1953034	ERX2018630	ERP016820	SAMEA4513679	5,941,910	70%	15%
SRR385832	SRX109697	SRP009483	SAMN00760990	16,266,412	83%	31%
SRR385841	SRX109697	SRP009483	SAMN00760990	14,890,938	83%	31%
SRR385847	SRX109697	SRP009483	SAMN00760990	15,743,934	83%	31%
SRR385833	SRX109698	SRP009483	SAMN00760991	23,513,674	76%	37%
SRR385843	SRX109698	SRP009483	SAMN00760991	24,289,036	75%	37%
SRR385855	SRX109698	SRP009483	SAMN00760991	22,335,642	75%	37%
SRR385834	SRX109699	SRP009483	SAMN00760992	16,907,172	89%	38%
SRR385838	SRX109699	SRP009483	SAMN00760992	18,504,442	89%	38%
SRR385839	SRX109699	SRP009483	SAMN00760992	17,922,428	89%	38%
SRR385835	SRX109700	SRP009483	SAMN00760993	16,178,628	88%	31%
SRR385836	SRX109700	SRP009483	SAMN00760993	17,701,940	88%	31%
SRR385857	SRX109700	SRP009483	SAMN00760993	17,165,426	88%	31%
SRR385837	SRX109701	SRP009483	SAMN00760994	17,898,840	82%	28%
SRR385844	SRX109701	SRP009483	SAMN00760994	17,346,378	82%	28%
SRR385848	SRX109701	SRP009483	SAMN00760994	16,403,350	82%	28%
SRR385840	SRX109702	SRP009483	SAMN00760995	15,993,176	88%	33%
SRR385842	SRX109702	SRP009483	SAMN00760995	16,497,022	88%	33%
SRR385856	SRX109702	SRP009483	SAMN00760995	15,120,278	88%	33%
SRR385845	SRX109703	SRP009483	SAMN00760996	17,536,162	91%	30%
SRR385851	SRX109703	SRP009483	SAMN00760996	16,573,960	91%	30%
SRR385860	SRX109703	SRP009483	SAMN00760996	18,089,430	91%	30%
SRR385846	SRX109704	SRP009483	SAMN00760997	18,636,978	84%	22%
SRR385854	SRX109704	SRP009483	SAMN00760997	17,611,452	84%	22%
SRR385858	SRX109704	SRP009483	SAMN00760997	19,225,526	84%	22%
SRR385849	SRX109705	SRP009483	SAMN00760998	16,581,018	80%	23%
SRR385853	SRX109705	SRP009483	SAMN00760998	17,131,850	80%	23%
SRR385859	SRX109705	SRP009483	SAMN00760998	15,701,952	80%	23%
SRR385850	SRX109706	SRP009483	SAMN00760999	15,480,110	79%	28%
SRR385852	SRX109706	SRP009483	SAMN00760999	16,942,526	79%	28%
SRR385861	SRX109706	SRP009483	SAMN00760999	16,432,136	79%	28%
SRR387536	SRX110238	SRP009483	SAMN00761736	16,483,754	82%	24%
SRR387537	SRX110238	SRP009483	SAMN00761736	15,115,208	82%	24%
SRR387540	SRX110238	SRP009483	SAMN00761736	15,948,288	82%	24%
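
The per-sample read counts appear to be the sums of the per-run counts in the table above. The Python sketch below reproduces two of the sample totals from run rows copied from that table. The function name is illustrative only.

```python
from collections import defaultdict

# (run, sample, number of reads), copied from the by-run table.
runs = [
    ("ERR1940482", "SAMEA4513632", 5_793_672),
    ("ERR1940572", "SAMEA4513632", 5_820_904),
    ("ERR1952897", "SAMEA4513632", 5_829_532),
    ("ERR1952987", "SAMEA4513632", 5_802_776),
    ("SRR385832", "SAMN00760990", 16_266_412),
    ("SRR385841", "SAMN00760990", 14_890_938),
    ("SRR385847", "SAMN00760990", 15_743_934),
]


def reads_per_sample(run_rows):
    """Sum the read counts of all runs that belong to the same sample."""
    totals = defaultdict(int)
    for _run, sample, n_reads in run_rows:
        totals[sample] += n_reads
    return dict(totals)


totals = reads_per_sample(runs)
print(totals)
```

The two totals equal the values listed for SAMEA4513632 (23,246,884) and SAMN00760990 (46,901,284) in the by-sample table.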

Protein alignments

Source	Number of sequences retrieved from Entrez	Number (%) of sequences aligned by ProSplign	Number (%) of sequences passed to Gnomon	Average % identity	Average % coverage
Same-species GenBank	54	53 (98.15%)	53 (98.15%)	71.70%	80.54%
Same-species known RefSeq (NP_)	12	12 (100.00%)	12 (100.00%)	75.30%	78.77%
Actinopterygii GenBank	79,248	75,954 (95.84%)	75,954 (95.84%)	68.80%	80.49%
Actinopterygii known RefSeq (NP_)	24,836	23,819 (95.91%)	23,819 (95.91%)	67.94%	78.09%
Homo sapiens known RefSeq (NP_)	50,089	42,928 (85.70%)	42,928 (85.70%)	65.62%	67.68%

Assembly-assembly alignments of current to previous assembly

When the assembly changes between two rounds of annotation, genes in the current and the previous annotation are mapped to each other using the genomic alignments of the current assembly to the previous assembly so that gene identifiers can be preserved. The success of the remapping depends largely on how well the two assembly versions align to each other.

Below are the percent coverage of one assembly by the other and the average percent identity of the alignments. The 'First pass' alignments are reciprocal best hits, while the 'Total' alignments also include 'Second pass' or non-reciprocal best alignments. For more information about the assembly-assembly alignment process, please visit the NCBI Genome Remapping Service page.

	First Pass	Total
M_zebra_UMD2a (Current) Coverage	80.81%	88.14%
M_zebra_UMD1 (Previous) Coverage	96.19%	97.43%
Percent Identity	99.55%	99.37%
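
The distinction between reciprocal and non-reciprocal best hits can be illustrated with a small Python sketch. It is a toy model only: the region names and scores are invented, and the actual NCBI assembly-assembly alignment procedure is not described in this report and may differ.

```python
def split_best_hits(scores):
    """Split best hits into reciprocal and non-reciprocal pairs.

    scores maps (current_region, previous_region) -> alignment score.
    All names and scores here are hypothetical.
    """
    best_for_current = {}
    best_for_previous = {}
    for (cur, prev), score in scores.items():
        if cur not in best_for_current or score > scores[(cur, best_for_current[cur])]:
            best_for_current[cur] = prev
        if prev not in best_for_previous or score > scores[(best_for_previous[prev], prev)]:
            best_for_previous[prev] = cur

    # Reciprocal best hits: each region is the other's best partner.
    reciprocal = {
        (cur, prev)
        for cur, prev in best_for_current.items()
        if best_for_previous[prev] == cur
    }
    # Non-reciprocal best hits: best for one side only.
    one_sided = {(cur, prev) for cur, prev in best_for_current.items()}
    one_sided |= {(cur, prev) for prev, cur in best_for_previous.items()}
    return reciprocal, one_sided - reciprocal


toy_scores = {
    ("cur1", "prev1"): 100,
    ("cur2", "prev1"): 80,
    ("cur2", "prev2"): 60,
    ("cur3", "prev2"): 90,
}
first_pass, second_pass = split_best_hits(toy_scores)
print(sorted(first_pass), sorted(second_pass))
```

In this toy input, cur2 aligns best to prev1, but prev1 aligns best to cur1, so the pair (cur2, prev1) is a best hit in one direction only.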

Comparison of the current and previous annotations

The annotation produced for this release (104) was compared to the annotation in the previous release (103) for each assembly annotated in both releases. Scores for current and previous gene and transcript features were calculated based on overlap in exon sequence and matches in exon boundaries. Pairs of current and previous features were categorized based on these scores, whether they are reciprocal best matches, and changes in attributes (gene biotype, completeness, etc.). If the assembly was updated between the two releases, alignments between the current and the previous assembly were used to match the current and previous gene and transcript features in mapped regions.

The table below summarizes the changes in the gene set for each assembly as a percent of the number of genes in the current annotation release, and provides links to the details of the comparison in tabular format and in a Genome Workbench project.

	M_zebra_UMD2a (Current) to M_zebra_UMD1 (Previous)
Identical	41%
Minor changes	30%
Major changes	7%
New	20%
Deprecated	6%
Other	1%
Download the report	tabular, Genome Workbench

References

RefSeq: Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, Murphy MR, O'Leary NA, Pujar S, Rajput B, Rangwala SH, Riddick LD, Shkeda A, Sun H, Tamez P, Tully RE, Wallin C, Webb D, Weber J, Wu W, Dicuccio M, Kitts P, Maglott DR, Murphy TD, Ostell JM. Nucleic Acids Research 2014, 42(Database issue):D756-63
RepeatMasker: Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
WindowMasker: Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006, 22(2):134-41
Splign: Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3:20
