associated to the article
GOLD: a cooperative approach to choose the best mRNA to genome
alignment and what we learn about human genes and the genome.
by Jean Thierry-Mieg et al. August 11, 2004
Updated versions of this document will be available on
the GOLD web site: www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD
1. The selected cDNAs and genome, and the various contributing alignment programs.
a. The test dataset: cDNA accessions and genome
b. The cDNA projects: analysis of the properties of the GOLD alignments, per cDNA project: quality of the alignment, structural anomalies, submission defects and splicing anomalies (includes 4 figures and associated tables of lists).
c. Analysis of suspected frameshift errors, in the cDNA or in the genome (with lists).
d. Representation of GOLD in the NCBI RefSeq project (lists of genes missing from RefSeq).
e. The seven contributing alignment programs (and alignments from UCSC, NCBI and AceView).
f. The .ali alignment format
2. The GOLD selection procedure and results
a. The GOLD cost function: lists of poly-A and vector clipping sites, and alternate alignments.
b. An open question: the shortest exons
c. The GOLD selection program
d. The current GOLD alignments
3. Quality of the GOLD alignments and inferences on the quality of the genome.
a. Classification of the mRNAs by the quality of their GOLD alignment, with lists of cDNAs from masked polymorphic areas.
b. Map of all GOLD alignments on the genome (highly polymorphic loci highlighted).
c. Map of partial alignments.
d. Lists and map of suspected frameshifts in the genome.
4. Multiple alignments identify duplicated genes or genome assembly problems.
a. Lists of exact and nearly exact duplicate alignments (excellent, good or partial).
b. Genome map of the exact duplicate alignments
c. Figure 4: Classification of the duplicate alignments, in % alignments per chromosome: intra versus inter-chromosomal duplicates, and presence of introns in 0/1/2 members of a pair, pseudo-duplicates in unfinished genome areas.
d. Supporting tables for Figure 4: Table of properties of excellent quality duplicate alignments, per chromosome; intron exon structure and relative organization of the excellent duplicate alignments, per chromosome.
e. Functional analysis of the duplicate genes
5.
a. List of cDNAs with rearranged alignment.
b. Genome map of the rearrangements.
6. Intron properties: Splicing in GOLD alignments: The frequency of structural polymorphisms and intronless genes is unexpectedly high.
a. Statistics on the presence or absence of introns in the GOLD alignments, and on the types of intron boundaries.
b. Size histograms for standard introns, non-standard introns and other micro-rearrangements (e.g. variable repeat number polymorphisms), restricted or not to that in the last exon.
c. Genome map of variable repeat number polymorphisms and anomalous introns that do not correspond to suspected internal deletions in the cDNA.
7. Comparing the quality of the sequence of the chromosomes
a. Tables and lists of cDNA per quality; includes cDNAs from masked areas; tables and lists of structural defects per chromosome.
b. Human Genome Project Coordinators by Chromosome.
8. A snapshot of the compared performances, strengths and weaknesses of public alignment programs displayed at UCSC, NCBI 34.3 and AceView.
This supplementary online material contains one section per section of the article. Each section contains tables in which the number of mRNA alignments with a given property is linked to an html document listing the corresponding annotated mRNAs. These tables report, for each accession, the main features of an alignment compared to the GOLD: the length, position and number of errors of the match, the number of exons, and the exact map position on the genome.
There may be some redundancy in the lists, because they are aimed at groups with different interests. The rearranged alignments, for example, appear by themselves in paragraph 5, sorted by rearrangement type: mosaic, inversion, transposition, variable repeat number, genome deletion. They appear again in paragraph 1, sorted by cDNA project, for the use of that interest group. They also appear in paragraph 7, sorted by chromosome, because they may be useful to genome sequencing centers for rechecking some areas.
The raw data, the GOLD alignments, the program and the acedb database schema are provided in the first paragraph. The entire primary and associated material is downloadable by ftp from the GOLD web site as a single tar.gz file (about 50 Mbytes, plus 800 Mbytes for the genome), which expands into an autonomous hyperlinked collection of html pages.
1: The selected cDNAs and genome, and the contributing alignment programs
1a. The test dataset: cDNA and genome
We welcome submissions of alignments to the benchmark, in the format explained below.
The test data sets are available online here and on the web site www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD.
§ The cDNA data set (cDNA.fasta.gz) contains the DNA sequences of 74,106 cDNAs, as they were in GenBank on
§ We also provide just the list of 74,106 mRNA GenBank accessions used in this project.
We have used the
genome from NCBI build 34, July 2003, which also supports NCBI annotations
release 34_3 (Feb 2004 to August 2004). It consists of the 24 chromosomes and
the unmapped contigs. The downloadable fasta files are:
The chromosome
fragments exported here were reconstructed using the NCBI agp file, and are
identical to the main chromosome fragments available from UCSC, except that we
have not remapped the floating NT contigs into the random contigs used by UCSC.
The genome is exported here as a pair of large fasta.gz files (each expands to a little under 2 Gb), which should be downloaded by ftp.
A
review of the various cDNA projects can be found in Imanishi et al., 2004
[10]. The 4 diagrams below, and the
support tables linking to the lists of accessions, show some properties of the
Gold alignments (as discussed in the article). A FLJ filtered protein coding set, including 22,167 of the 29,923
accessions is also available [Isogai and Nishikawa, personal communication].
Figure 1.1: Quality of the Gold alignments, by project.
Figure
1.1: Quality as defined in Figure 1 of the article, for all clones in the 4
collections, except those from hypervariable regions, which are masked; in total:
2057 KIAA (D or AB accessions), 32,723 MGC (BC) of which 14,867 belong to the
reference set, 9,228 DKFZ (AL or BX) and 29,711 FLJ (AK) accessions, of which
21,959 belong to the FLJ filtered set.
Project | #cDNA | Excellent | Good | Partial | Dubious | Bad | Masked | Unaligned |
KIAA | 0 | 0 | ||||||
MGC.filtered | ||||||||
MGC | ||||||||
DKFZ | ||||||||
FLJ | ||||||||
FLJ.filtered | ||||||||
other | 0 | 0 | 0 | 0 | ||||
Total | 74,106 | 65,966 | 6,120 | 887 | 427 | 205 | 379 | 122 |
Figure 1.2: Structural anomalies in the genome or the cDNA, per 100 kilobases of mRNA aligned in excellent, good or partial GOLD alignments, per project.
Figure 1.2: Partial or rearranged Gold alignments. From left to right: suspected genome deletion, transposition, inversion, mosaic, partial alignment and suspected genomic contamination, as defined in paragraphs 5 and 6 of the article. Only excellent, good and partial alignments outside of the highly polymorphic regions are considered here.
Project | cDNA | Structural problem in cDNA* | Suspected genome deletion | Transposition | Inversion | Mosaic | Suspected genomic contamination |
KIAA | 2,051 | 103 | |||||
MGC.filtered | 14,813 | 326 | 0 | ||||
MGC | 32,423 | 1,108 | |||||
DKFZ | 9,164 | 543 | |||||
FLJ | 29,330 | 2,970 | |||||
other | 5 | 3 | 0 | 0 | 0 | 0 | |
Total | 72,973 | 4,727 | 265 | 294 | 121 | 1,195 | 589 |
*Putative structural problems in cDNA include transposition, inversion, mosaic, suspected genomic contamination and suspected internal deletion (partial) of the insert. The other anomalies (suspected genome deletion, partial alignment, or variable repeat number polymorphisms) are not counted here, because they have a good chance of reflecting structural problems in the genome. However, recall that all types of structural defects may be due to anomalies in either the cDNA or the genome.
Figure
1.3: Defects
in the sequence submission to GenBank/DDBJ/EMBL
Figure 1.3: Defects in the sequence submission to the public databases: improper clipping of vector or other putative 5’ exogenous sequences (4,927 in total), accessions clearly submitted from 3’ to 5’, on the wrong strand (114), and identical sequences, submitted in duplicate under different clone names (787). Only cDNAs with excellent, good or partial alignments outside of the hypervariable regions are considered.
We flagged only
114 submissions on the wrong strand, because this project does not use protein
annotation or clustering. But we believe, based on AceView complete gene analysis,
that this defect is more frequent (0.8%). Another problem is the presence in GenBank
of exactly identical (redundant) sequences submitted as different accessions.
Some 39 were submitted twice by MGC, with the same sequence under the same cDNA
name. These were clearly bugs in the submissions and were filtered out from our
initial selection. But an additional 787 exactly
redundant sequences, from clones with different names but from the same
project, most of them from MGC, were pointed out to us only recently, by Guy
Slater, from EBI. These were kept in the analysis and show up as “duplicate
clones” in Figure 1.3.
Project | Out of #cDNA | Unclipped vector | Submitted on wrong strand | Duplicate clone |
KIAA | 2,051 | |||
MGC.filtered | 14,813 | |||
MGC | 32,423 | |||
DKFZ | 9,164 | |||
FLJ | 29,330 | |||
other | 5 | 0 | 0 | |
Total | 72,973 | 4,927 | 114 | 787 |
Figure 1.4: Number of splicing anomalies, per 100 kb
aligned.
Figure 1.4: Copy number polymorphisms of type “variable repeat number” (micro-deletion in the genome) and all non-standard introns, split among those recognized as suspected partial deletion of the insert by AceView, and the other non-standard introns (as explained in paragraph 6 of the article). Only cDNAs with excellent, good or partial alignments outside of the hypervariable regions are considered.
Project | Out of #cDNA | Variable repeat number polymorphism | Suspected internal deletion | Other non-standard intron* | Aligns with no intron+ |
KIAA | 2,051 | | | | 86 |
MGC.filtered | 14,813 | | | | 675 |
MGC | 32,423 | | | | 4,742 |
DKFZ | 9,164 | | | | 2,498 |
FLJ | 29,330 | | | | 8,827 |
FLJ.filtered | 21,700 | | | | 3,970 |
other | 5 | 0 | 0 | 0 | 1 |
Total | 72,973 | 896 | 2,634 | 4,149 | 16,154 |
*Not
flagged as suspected internal deletion of the insert by AceView. +In
the rare cases where there are exactly duplicate alignments (repeated genes)
and one is intronless, it counts in this column (by convention).
1c. Identifying putative frameshift errors in the cDNA or in the
genome sequences:
The frequency of suspected frameshifts per library was monitored independently, by measuring the difference in length between the ORF read on the cDNA clone and on its genomic GOLD footprint: as hoped for, the MGC reference collection has, on average, identical lengths in cDNA and Gold; FLJ comes next, with an average increase of 6 residues in Gold; the non-reference MGC increases by 9 and DKFZ by 13. Interestingly, KIAA in fact loses 2 residues on average in Gold, because these clones show more frameshifts or mutations in the genome than the reverse.
We selected accessions (from cDNAs with excellent quality alignment, with no rearrangement and not belonging to a repeated gene) for which the ORF length is at least 100 amino acids shorter than on the genomic Gold footprint. The corresponding list of 1,536 suspected frameshifts in the cDNAs is provided here. The complementary situation, in which the ORF length on the cDNA is at least 100 amino acids longer than on the genomic Gold, points to 171 suspected frameshifts in the genome. The corresponding genome map is here.
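The selection rule above can be sketched in a few lines of Python. This is only an illustration of the 100 amino acid threshold described in the text, not the program we used; the function name and the returned labels are hypothetical, and the prior filters (excellent quality, no rearrangement, not in a repeated gene) are assumed to have been applied already.

```python
def classify_frameshift(orf_cdna_aa: int, orf_genome_aa: int, threshold: int = 100) -> str:
    """Compare the ORF length read on the cDNA and on its genomic Gold footprint.

    Both lengths are in amino acids. The default threshold of 100 comes from
    the text; the labels returned here are illustrative only.
    """
    if orf_genome_aa - orf_cdna_aa >= threshold:
        # The ORF is much shorter on the cDNA: suspect the cDNA sequence.
        return "suspected frameshift in cDNA"
    if orf_cdna_aa - orf_genome_aa >= threshold:
        # The ORF is much shorter on the genome: suspect the genome sequence.
        return "suspected frameshift in genome"
    return "no call"


if __name__ == "__main__":
    print(classify_frameshift(300, 450))
    print(classify_frameshift(450, 300))
    print(classify_frameshift(300, 350))
```

Length differences below the threshold are left uncalled, since a shorter ORF may also come from mismatches that create or remove a Stop codon without any frameshift.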
A comparison of the levels of the two types of frameshifts, in the cDNA and the genome, is shown in the table below, per library. Frameshifts appear to be about 10 times more frequent in the cDNAs than in the genome, except when selective criteria are applied, through informatics, as in MGC reference, or through molecular biology, as in KIAA.
Table 1.2: Suspected frameshifts evidenced in cDNA and
genome, per library
cDNA project | #cDNA considered | #cDNA with ORF length on genome>cDNA | %suspected frameshift in cDNA | #cDNA with ORF length on cDNA>genome | %suspected frameshift in genome |
FLJ | 24,991 | 625 | 2.50 | 60 | 0.24 |
DKFZ | 7,998 | 300 | 3.75 | 32 | 0.40 |
MGC other | 15,407 | 522 | 3.39 | 22 | 0.14 |
MGC reference | 13,955 | 77 | 0.55 | 34 | 0.24 |
KIAA | 1,936 | 12 | 0.62 | 23 | 1.19 |
Total | 64,289 | | 2.38 | | 0.26 |
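The percentages of Table 1.2, and the cDNA to genome ratio discussed above, can be recomputed from the counts. The short Python sketch below copies the per-project counts from the table; the variable and function names are ours, chosen for illustration.

```python
# Counts copied from Table 1.2:
# project -> (#cDNA considered, #ORF longer on genome, #ORF longer on cDNA)
TABLE_1_2 = {
    "FLJ": (24991, 625, 60),
    "DKFZ": (7998, 300, 32),
    "MGC other": (15407, 522, 22),
    "MGC reference": (13955, 77, 34),
    "KIAA": (1936, 12, 23),
}


def percent(count: int, total: int) -> float:
    """Percentage of `count` in `total`, rounded to two decimals."""
    return round(100.0 * count / total, 2)


if __name__ == "__main__":
    for project, (n, in_cdna, in_genome) in TABLE_1_2.items():
        # Ratio of suspected cDNA frameshifts to suspected genome frameshifts.
        ratio = round(in_cdna / in_genome, 1)
        print(project, percent(in_cdna, n), percent(in_genome, n), ratio)
```

The ratio is close to 10 for FLJ and DKFZ, and drops sharply for MGC reference and KIAA, the two collections where selective criteria were applied.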
1d. Representation of the GOLD mRNAs in RefSeq
As of August 2, 2004, the NCBI RefSeq reference sequence
collection does not yet report 49% of the genes identified by the cDNAs from
the four large scale projects. A number of reasons why this may be so come to
mind: to create a new NM accession associated to a validated LocusLink gene,
RefSeq prefers coding genes, and in principle imposes that the CDS should be
complete. It favors genes with introns, avoiding a chance of validating
possible genomic contaminations or unspliced variants, and it is biased for
long CDS. The ideal cDNA would align on the genome at high quality, with no
rearrangement and over its entire length. There would be no evidence for a
frameshift or a suspect Stop codon in either the cDNA or the genome, and the
protein sequence read on the cDNA should closely match the protein sequence
read on the genome footprint. Ideally, the predicted encoded protein would be
conserved in evolution so that putative orthologs could be found. In addition,
a “good” RefSeq should map unambiguously in a single gene rather than in repeated
genes.
We
therefore selected, using related conditions on all mRNAs currently aligning in
a gene with no RefSeq, as of August 2, 2004, a set of 4,373 genes with no RefSeq, in which 4,928 cDNAs fulfill all of the
following criteria:
• They align as “excellent” Gold on the July 2003/build 34 genome, i.e. over 99.98% of their length, with 99.88% of bases identical to the genome on average (and >99% length and >99.6% identity). The alignments have no gap and are not rearranged. They map unambiguously in a single gene each, and this gene has no NM yet in August 2004. They constitute a non-redundant set: when more than one clone is selected in a gene, these clones represent non-mergeable alternative variants.
• They encode a predicted CDS of more than 100 amino acids (the CDS goes from Met to Stop, Met being coded by any of the recognized possible codons: ATG, TTG, CTG or GTG; all the selected cDNAs are complete on the C-terminal side). In addition, there is no frameshift in the genome or the cDNA, since the length of the CDS on the cDNA submitted to GenBank equals that on the Gold genomic footprint.
These accessions can then be split (Table 1d below) according to the other desirable features for RefSeq: presence or absence of standard introns; completeness of the CDS on the N-terminal side, which we define by the presence of an upstream Stop, 5’ and in frame with the main ORF (in AceView); conservation of the protein, which we define by the presence of at least one hit, using BlastP with an expect value of 10^-3, to a protein from another species. The presence of a significant Pfam hit is also monitored.
Completion on NH2 side | Standard introns | CDS average length (bp) | Have putative orthologs | Have Pfam motifs | Complete CDS with orthologs or Pfam (average #aa) |
Yes: 4138
mRNA complete CDS (from 3,719
genes*) |
Yes: 2352
mRNAs, (1934 genes*) |
829
bp (300
to 6000 bp) |
Yes |
Yes:
|
|
No: |
|
||||
No
(Homo sapiens specific) |
Yes: |
|
|||
No: |
|
||||
No:
1786
mRNAs, (1785 genes*) |
441
bp (300
to 2400 bp) |
Yes |
Yes: |
|
|
No: |
|
||||
No |
Yes: |
|
|||
No: |
|
||||
No
proof: 790
mRNA partial CDS (from 680 genes*) |
Yes: No: |
Min
CDS: 1343 (300 to ?) |
Yes
|
Yes |
|
No |
|
||||
No
|
Yes |
|
|||
No |
|
*None
has a RefSeq in August 04.
One correlation we found was that most of the genes excluded from RefSeq are novel in human, although most have putative orthologs in the protein databases. They are expressed at a relatively low level, hence are not found in all libraries. One restriction we thought might apply is that of protein length, because it is well known that RefSeq has a bias in favor of longer than average proteins. But many of the missing genes encode large proteins. An example list of 791 accessions, encoding complete CDS proteins of 652 aa on average (and at least 300 amino acids), and mapping at excellent quality with no anomaly in 587 genes that do not yet have a RefSeq mRNA, is available here.
Similarly,
another 703 accessions, from 557 other
genes with no NM as of August 2, 2004, contain an ORF of more than 900 bp
(average ORF size is 1944 bases, 646 aa), but their products are not proven to
be 5’ complete. In addition, there are thousands of highly-coding alternatively
spliced variants not yet represented in an NM.
1e. Seven alignment projects participated in the endeavor so far:
You are welcome to download the alignments from UCSC, NCBI and AceView, in the format defined below. These alignment files were constructed by reformatting the sources as explained in the README file in the same directory. Each table has around 300,000 lines and is named according to the method. The other alignments remain private.
To generate GOLD, we picked from:
- AceView alignments from April 4th, 2004, for the entire set of 74,106 accessions;
- NCBI alignments, version 34.3, March 10th, 2004, for the entire set of 74,106 accessions, available from ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/hs_esttrn.md.gz;
- UCSC alignments from February 15th, 2004, for a subset of 73,602 accessions, available from ftp://hgdownload.cse.ucsc.edu/GoldenPath/hg16/database/all_mrna.txt.gz. We had to remove 504 accessions which mapped on random contigs, because we did not succeed in reliably decoding the UCSC non-standard coordinates for those;
- JBIRC/AIST alignments from February 5th, 2004, for a subset of 69,521 accessions, kindly contributed by Yasuyuki Fujii, Tadashi Imanishi and Takashi Gojobori (pers. comm.);
- NEDO alignments from November 21, 2003, for a subset of 54,901 accessions, contributed by Kimura, Nishikawa, Isogai and Sugano (unpublished);
- Exonerate alignments from April 8th, 2004, for a subset of 59,235 sequences, kindly contributed by Guy Slater (Slater and Birney, pers. comm.);
- NCBI Splign alignments from February 13th, 2004, for the entire set, kindly provided by Kapustin and Lipman (pers. comm.).
1f. Alignment format definition:
This
format is used in the alignments available above and for the GOLD alignments. If you plan to contribute your alignments, please
use the following .ali format:
Alignments
are given as tab delimited tables; optionally lines may start with # and
contain a comment, otherwise each line corresponds to the alignment of an exon
on the genome; the successive exons of a single alignment follow each other.
The table has 9 columns:
Note that the orientation of the cDNA and the strand encoding the gene on the
genome can be deduced unambiguously from the relative values of x1, x2 on the
cDNA and of a1, a2 on the genome. If the cDNA was submitted to GenBank in the
direct orientation, from 5’ to 3’, and there is no rearrangement, each line for
this cDNA will have x1 < x2. If on the contrary the cDNA was submitted to
GenBank in the reverse orientation, each line in our table will have x1 >
x2. If the gene is encoded on the direct (Watson) strand of the genome, each
line will have a1 < a2, and the a’s will increase with exon number. Vice
versa, if the gene is encoded on the reverse (Crick) strand, each line will
have a1 > a2 and the a’s will decrease from one exon to the next.
If
a cDNA has several independent alignments on the genome, for example if the
gene is duplicated or the corresponding genome section inadvertently duplicated
in the current human genome build, or if the cDNA or genomic clone is mosaic,
the exon number starts at 1 for each independent alignment. If a cDNA is
rearranged relative to the genome, the x coordinates will not be a strictly
increasing or decreasing function of the a coordinates.
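The rules above can be written as a short Python sketch. It works on the four coordinates x1, x2, a1, a2 of each exon line only; how these are extracted from the nine columns of the .ali table is not shown, and the function names and the "mixed" label are hypothetical. Single-base exons (x1 equal to x2) are not handled.

```python
from typing import List, Tuple

# One exon line reduced to its four coordinates: (x1, x2, a1, a2).
Exon = Tuple[int, int, int, int]


def cdna_orientation(exons: List[Exon]) -> str:
    """Orientation in which the cDNA was submitted, from x1 versus x2."""
    if all(x1 < x2 for x1, x2, _, _ in exons):
        return "direct"
    if all(x1 > x2 for x1, x2, _, _ in exons):
        return "reverse"
    return "mixed"


def genome_strand(exons: List[Exon]) -> str:
    """Strand encoding the gene on the genome, from a1 versus a2."""
    if all(a1 < a2 for _, _, a1, a2 in exons):
        return "Watson"
    if all(a1 > a2 for _, _, a1, a2 in exons):
        return "Crick"
    return "mixed"


def is_rearranged(exons: List[Exon]) -> bool:
    """True if x is not a strictly monotonic function of a."""
    # Order the exons along the genome, then look at their cDNA positions.
    by_genome = sorted(exons, key=lambda e: min(e[2], e[3]))
    xs = [min(x1, x2) for x1, x2, _, _ in by_genome]
    increasing = all(p < q for p, q in zip(xs, xs[1:]))
    decreasing = all(p > q for p, q in zip(xs, xs[1:]))
    return not (increasing or decreasing)


if __name__ == "__main__":
    crick_gene = [(1, 100, 5000, 4901), (101, 200, 3000, 2901)]
    print(cdna_orientation(crick_gene), genome_strand(crick_gene), is_rearranged(crick_gene))
```

Each independent alignment of a cDNA (where the exon number restarts at 1) should be passed to these functions separately.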
2: The
GOLD selection procedure: quality evaluation and choice of the best alignment
2a. The GOLD cost function:
1. Scoring the mRNA to genome match: To win the GOLD, the game is effectively to align the largest part of the cDNA while minimizing the errors. Some cases are really equivalent. For example, if the 3’ end of a cDNA does not match the genome perfectly, one may either trim the end of the alignment and keep a shorter but perfectly matching last exon, or keep a longer last exon with a few mismatches. But often one solution makes better biological sense. Because a random alignment is expected to yield about 1 in 4 correct bases, the error cost has to be above 1. At 2 points per error, alignments extending beyond a reasonable match (less than 2 matching bases after an error on average) are disfavored relative to incomplete alignments.
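The trade-off can be illustrated numerically. The sketch below assumes a score of one point per aligned base minus the error cost per error; the exact formula is not spelled out in this section, so this form is an assumption made for the example, as are the function names. Only the cost of 2 points per error comes from the text.

```python
ERROR_COST = 2  # points per error, as stated in the text


def alignment_score(aligned_bases: int, errors: int, error_cost: float = ERROR_COST) -> float:
    """Assumed score: one point per aligned base, minus error_cost per error."""
    return aligned_bases - error_cost * errors


def extension_gain(matches_after_error: int, error_cost: float = ERROR_COST) -> float:
    """Score change when extending over one error followed by some matches."""
    return alignment_score(1 + matches_after_error, 1, error_cost)


if __name__ == "__main__":
    # A random extension over 4 bases has about 1 correct base and 3 errors.
    print(alignment_score(4, 3, error_cost=1))  # cost of 1: random extension still pays
    print(alignment_score(4, 3, error_cost=2))  # cost of 2: random extension is penalized
    # One error followed by a single matching base brings no gain at 2 points per error.
    print(extension_gain(1), extension_gain(3))
```

Under this assumed form, a cost of 1 would let random extensions gain points, which is why the cost has to be above 1, and at 2 points per error an extension only pays when at least 2 matching bases follow each error.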
2. Intron feet and splice site consensus: The ribonucleoprotein complex in charge of
performing the splicing is equipped with various small RNA guides and proteins
ensuring proper recognition and removal of introns with conserved sequences at
their feet: gt-ag, and more rarely gc-ag or at-ac. The cost function slightly
favors these.
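As an illustration, the feet of an intron can be read directly from the genome sequence. The Python sketch below assumes 0-based, half-open intron coordinates on the direct strand; genes on the reverse strand would need the reverse complement first. The function names are hypothetical.

```python
# Intron feet favored by the cost function, as listed in the text.
STANDARD_FEET = {"gt-ag", "gc-ag", "at-ac"}


def intron_feet(genome: str, start: int, end: int) -> str:
    """Return the first two and last two bases of the intron genome[start:end]."""
    intron = genome[start:end].lower()
    return intron[:2] + "-" + intron[-2:]


def is_standard_intron(genome: str, start: int, end: int) -> bool:
    """True if the intron feet are gt-ag, gc-ag or at-ac."""
    return intron_feet(genome, start, end) in STANDARD_FEET


if __name__ == "__main__":
    # Toy sequence: exon ATCAT, an 8 base intron, then exon GTGAG.
    toy = "ATCAT" + "gtccccag" + "GTGAG"
    print(intron_feet(toy, 5, 13), is_standard_intron(toy, 5, 13))
```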
3. Poly A addition site and vector sequence recognition:
· As expected, more than 60% of the mRNAs submitted to GenBank contain a poly-A tail, naturally added at the 3’ end of mature transcripts. Yet there are A-rich regions in the genome downstream of any gene, and most programs get fooled and occasionally create pure A exons. The average number of A in the GenBank accessions is 29, the largest poly-A tails are submitted by the DKFZ project (Wiemann et al, 2001; AL512733 has 245 A at the 3’ end). We established a list of poly-A addition sites by looking for pure or almost pure A stretches beyond the ends of alignments. We then analyzed the base composition of the last exon of all alignments winning the Gold (temporarily) thanks to an extra 3’ exon, and identified more poly-A tails. During that process, we occasionally recognized and clipped exit vector sequences. This analysis was not always trivial and we validated the ambiguous cases by hand. The current list tentatively flags poly-A addition sites in 45,865 cDNAs. This table gives the coordinate of the last base of the accession to be aligned (i.e. the base just before the poly-A).
By
convention, we chose to place
the poly-A addition site right before the first A of the poly-A stretch, as
also done by UCSC, but not by AceView. As explained in the article, this convention
does not influence the score, yet one weak argument in favor of this choice is
cDNA AK024744, from the RPC155 gene encoding RNA polymerase III, where 12 of
the 21 terminal A match the genome in direct extension of the last exon. So the
poly-A could start at any of these 12 A. A poly-A signal variant, AACAAA, found
in 2.2% of all human transcripts, occurs however 16-21 bases before the first
A, just in the range to suggest that the first A belongs to the poly-A tail
rather than to the premessenger. Almost all other cases, where a stretch of at
least 12A matches the genome at the 3’ end, with or without standard introns,
lack a poly-A signal. Clearly the oligo-dT method allows binding and priming in
such A-rich sites, leading to internally primed (3’ incomplete) cDNAs and
possibly also, in rare instances, to genomic contaminations.
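The convention of placing the poly-A addition site right before the first A of the terminal stretch can be sketched as follows. This simplified Python example only recognizes pure A runs at the very end of the accession, whereas our list also includes almost pure stretches and hand-validated cases; the minimal tail length of 12 is an arbitrary parameter chosen for the example, and the function name is hypothetical.

```python
from typing import Optional


def polya_clip_position(seq: str, min_tail: int = 12) -> Optional[int]:
    """Return the 1-based coordinate of the last base to align, or None.

    The returned base is the one just before the terminal run of A, so any A
    that could belong either to the transcript or to the tail is assigned to
    the tail. Only pure terminal A runs of at least min_tail bases count.
    """
    s = seq.upper()
    tail = len(s) - len(s.rstrip("A"))
    if tail < min_tail:
        return None
    return len(s) - tail


if __name__ == "__main__":
    print(polya_clip_position("ACGTC" + "A" * 20))
    print(polya_clip_position("ACGTAAA"))
```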
• Vector recognition: Similarly, first exons composed mostly of vector sequence should be penalized. We noticed that cDNAs for which the first few bases fail to align often start with identical sequences, usually found in vector polylinkers. We therefore developed a tool to identify common sequences found at the same position in multiple clones from the same library, and tentatively identified altogether 5,031 suspected vector sequences or other sequences to clip on the 5' or the 3’ side of the insert. The coordinates in the accession of the first and last base to be aligned are given in this file [accession a b seq]: the vector or other foreign sequence that we propose to clip is base 1 to (a-1) on the 5' side. The explicit sequence that we would clip is given in the fourth column.
Other 5’ additions we monitor in this list are unaligned poly-T stretches,
which probably correspond to inversions with breakpoint in the polyA, short
non-matching poly-G stretches, and poly-A stretches that sometimes occur in
cap-selected libraries at the 5’ end of the mRNA (Ota et al., 2004): 295 such clones are currently identified in the
collection, the longest has 59 A on the 5’ end (AK074948). The significance of the 5' poly A is unknown.
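The idea behind the vector search, and the use of the [accession a b seq] coordinates, can be sketched in Python. This is not the tool we developed: the prefix length `k`, the `min_clones` threshold and the function names are hypothetical choices for the example, and the input is assumed to be the sequences of clones from a single library.

```python
from collections import Counter
from typing import Dict, List


def common_prefixes(seqs: List[str], k: int = 8, min_clones: int = 3) -> Dict[str, int]:
    """Count the k-base 5' prefixes shared by at least min_clones sequences."""
    counts = Counter(s[:k].upper() for s in seqs if len(s) >= k)
    return {prefix: n for prefix, n in counts.items() if n >= min_clones}


def clip_insert(seq: str, a: int, b: int) -> str:
    """Keep bases a to b (1-based, inclusive), i.e. clip bases 1 to a-1 on the 5' side."""
    return seq[a - 1:b]


if __name__ == "__main__":
    library = ["GGCACGAGGACTT", "GGCACGAGGTTGA", "GGCACGAGGCCAT", "TTTTTTTTTTTTT"]
    print(common_prefixes(library, k=9, min_clones=3))
    print(clip_insert("NNNACGTNN", 4, 7))
```

In practice the candidate prefixes would only be searched among the 5' bases that fail to align to the genome, so that genuine shared 5' exons are not mistaken for vector.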
As shown in this diagram, the MGC consortium [8] contributes most to this problem, with 10.9% of accessions apparently deserving a 5’ clip (8.9% in the MGC reference collection), and this despite the fact that a large number of MGC accessions were resubmitted and fixed just before we imported the GenBank data (possibly thanks to the list of 3,938 BC that we sent to this group in Sept 2002). We also propose to clip 6.9% of accessions from DKFZ, 2.6% from FLJ and 0.5% from KIAA.
4. Alternate alignments: These 3,141 accessions have “alternate” alignments, i.e. a different exon structure, but they lie in the same genomic area and have the same score as the GOLD.
2b. An open question: the short exons
We do not
penalize short exons. However, the smallest experimentally confirmed exons in worm are nine
bases long, and both are alternative, indicating they are difficult to splice
out (gene 2G876 and clp-1; Kohara, Y., Shin-i, T.,
Thierry-Mieg, J., Thierry-Mieg, D., Suzuki, Y., Sugano, S., A complementary view of the C. elegans genome; unpublished;
sequences are in the public databases). There are a number of independent, well-documented examples of 6-base-long exons in vertebrates, for example in the gene encoding Troponin T. Yet to our knowledge, below 6 bp, there is no hard evidence in human on how small an exon can be.
A recent article by a bioinformatics team at TIGR claims
to discover a number of micro-exons shorter than 6 bases (Volfovsky N, Haas BJ,
Salzberg SL: 2003. Computational discovery of internal micro-exons. Genome Res 13:1216-21). Unfortunately,
our examination of the five cases they give for human mRNAs leads us to invalidate
3, and to offer a more likely hypothesis for the two others.
Indeed, in the three cases of X15949, AF144241 and BC014605, alignments to the genome are complete
and fully error-free without invoking micro-exons. The actual exon 2 of X15949 is
93 bases long, not 5 bp; intron 1 is standard gt-ag but 45,308 bp long. Exon 3
of AF144241 is 176 bp long not 5 bp, but the second intron is non standard gt-gt
and 3700 bp long (ATCAT[gt-gt]GTGAGCCACC). It is interpreted by AceView as having
undergone partial deletion of the insert. Finally, the case of BC014605 is one
where the 2,084 bp intron is gc-ag and not gt-ag, and the 5 bp proposed to be
an independent exon belong to exon 1. The problem seems to be a bug in the
alignment program used by the authors: it refuses
to splice a non gt-ag intron or even long gt-ag introns. As expected
statistically, it then finds a few examples of solutions with very small exons
bounded by two [gt-ag] introns. None of the seven programs we used in Gold had
such dramatic defects.
The two remaining cases, U43586 and AB053301, could align with a 5 bp exon, yet it seems more likely that the underlying reference genome sequence has errors (or represents a rare polymorphism). Indeed, in both cases, there are locally more than 30 clones with the exact same sequence; there is no alternative form (mRNA or EST; see AceView). So the 5 bp exon would be constitutive in these highly expressed genes. This is contrary to the multiple observations indicating that very short exons are always alternative. In addition, this area of the gene ALS2CR4 is highly conserved in rat, mouse, and even drosophila, and the 5 bases of the proposed micro-exon of AB053301 are found right in front of the homolog of the next exon, exon 3, as TCAAGatgatattc… The conclusions of this article do not seem well supported, so the problem of the existence and size of micro-exons remains open.
Below is the histogram showing the exon size in the GOLD set. Notice the strong bias toward exon lengths that are multiples of three bases. This effect may be slightly less prominent for 3 bp, possibly making exons in such a short range questionable. But notice that it is also unexpectedly low for 24 bp, weakening the argument.
•
Micro-exons: A list of 88 GOLD alignments (of excellent quality
and not rearranged) containing a very short exon, less than 6 bp long, bounded
by two standard [gt-ag] introns, is given here in the hope that some researcher
may test, for example by mutagenizing the underlying genome in the region of
the short exon and testing the transcripts generated in an in vitro or in vivo transcription
system, whether or not the mutation is later found in the transcript, thereby fixing
the biological limit to how small an exon might be. Note that the NCBI program is the only one to routinely
find and propose such short “good looking” exons (in build 34.3, February to
July 2004); this is technically a remarkable achievement.
2c. The GOLD selection
program and the dedicated
database schema are available
· The coordinates on the genome of the excellent or good quality Gold alignments, which align over more than 99% of their length with less than 4 (for the excellent) or 15 (for the good) differences per kb from the genome.
· The coordinates on the genome of the other Gold alignments of lesser quality (not so interesting for the general user).
· The sequences of the GOLD mRNAs, as derived from the genome, in fasta format, limited to excellent, good and partial Gold alignments.
· The sequences, in amino acids, of the GOLD longest open reading frames (ORF), also in fasta format, deduced from the GOLD mRNA genome footprint above. Note that we export the ORFs, not the coding sequence (CDS), because the ORF is not disputable, while the CDS implies we have a hypothesis on the completeness of the transcript and the initiator codon.
· A description of the properties of the mRNA alignments, in .ace format, i.e. presented as a human readable, well organized document. This document can also be read in an acedb database for higher level querying, using the schema provided.
3: The quality of the GOLD alignments reflects quality and completion of the genome and cDNA sequences.
3a. Quality alignment of
the GOLD, or lack of alignment, for all 74,106 cDNAs without exception:
Quality | Gold | mRNA | %mRNA | mRNA with | mRNA with |
Excellent | 66,419 | 66,018 | 89.09% | 1,634 | 362 |
Good | 6,229 | | 8.34% | | |
Partial | 914 | | 1.23% | | |
Dubious | 541 | | 0.72% | | |
Bad | 425 | | 0.57% | | |
Unaligned | 0 | | 0.06% | 0 | 0 |
Total | 74,528 | 74,106 | 100.00% | 2,871 | 424 |
1 mRNAs aligning with exact same score in n
different positions in the genome count n. Rearranged alignments (e.g. from
mosaic or rearranged cDNA or genome) count once at their combined rationalized
score. If a cDNA aligns in multiple places in the genome at the exact best
score, all these alignments become Gold. This happens for genes that are
exactly duplicated, and explains why there are more Gold alignments than mRNAs.
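The counting rule can be sketched as follows: all candidate alignments of a cDNA that reach the exact best score are kept. This Python example is a simplification of the selection program, since it takes the scores as given and ignores rearranged alignments; the function name and the toy locations are hypothetical.

```python
from typing import List, Tuple


def select_gold(alignments: List[Tuple[str, float]]) -> List[str]:
    """Return the locations of all alignments tied at the best score.

    Each alignment is a (location, score) pair for one cDNA.
    """
    if not alignments:
        return []
    best = max(score for _, score in alignments)
    return [location for location, score in alignments if score == best]


if __name__ == "__main__":
    candidates = [("chr1:100", 950), ("chr5:200", 950), ("chr9:300", 900)]
    # Two locations tie at the best score, so this cDNA yields two Gold alignments.
    print(select_gold(candidates))
```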
For
convenience, the categories are directly linked as html tables. We also provide
a complete excel-compatible tab delimited document.
The complete list of rearranged accessions and those
with duplicate alignments can be found below.
A more detailed partition of the mRNAs aligning over a given percentage of their
length with a given rate of “errors” per kb is given below (we added the 78
unclassifiable mRNAs, which have no score, to the bad category).
%lengthAligned %error
per kb |
>=99% |
>=60% |
<60% |
0% |
Total |
% mRNA |
<=4
|
66,018 |
0 |
66,762 |
90.09% |
||
4
to 15 |
0 |
6,443 |
8.69% |
|||
15
to 30 |
0 |
449 |
0.61% |
|||
>
30 |
|
408 |
0.55% |
|||
Unaligned |
|
|
|
44 |
0.06% |
|
Total |
72,742 |
1,080 |
240 |
44 |
74,106 |
100.00% |
%
mRNA |
98.16% |
1.46% |
0.32% |
0.06% |
100.00% |
|
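The two-way partition used in the table above can be sketched as a pair of binning functions. The thresholds are those of the row and column headers; how exact boundary values (for example exactly 4 or exactly 15 errors per kb) are assigned is an assumption of this sketch, and the function names are hypothetical.

```python
def length_bin(pct_aligned):
    """Column of the table: percentage of the mRNA length that aligns."""
    if pct_aligned >= 99:
        return ">=99%"
    if pct_aligned >= 60:
        return ">=60%"
    if pct_aligned > 0:
        return "<60%"
    return "0%"

def error_bin(errors_per_kb):
    """Row of the table: differences from the genome per kb aligned."""
    if errors_per_kb <= 4:
        return "<=4"
    if errors_per_kb <= 15:   # boundary handling is an assumption
        return "4 to 15"
    if errors_per_kb <= 30:
        return "15 to 30"
    return ">30"

def table_cell(pct_aligned, errors_per_kb):
    """Return the (row, column) labels for one mRNA."""
    if pct_aligned <= 0:
        return ("Unaligned", "0%")
    return (error_bin(errors_per_kb), length_bin(pct_aligned))
```

For instance, `table_cell(75, 10)` falls in the "4 to 15" row and the ">=60%" column.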
3b. Map of all Gold mRNA on the genome:
The
highly polymorphic loci (masked mRNA) are highlighted (red superscript). The lists
of alignments per chromosome, sorted by quality, and the lists of cDNAs from
masked areas of the genome, are here.
The
seemingly gene-poor regions may correspond to gaps in the genome sequence (July
2003), or be truly poor in genes, like parts of chromosome 13 or some
centromeric regions.
3c. Partial alignments usually highlight problems in
the genome assembly or in the cDNA submission.
Figure 3c: Genome map
of the partial alignments, where 60 to 99% of the length of the mRNA aligns at
good quality, i.e. with less than 15 differences per kb.
Clustering
of partials is a strong indication of a genome problem.
3d. Identifying suspected frameshifts in the genome:
We
have not done a systematic search for coherent disagreements between the genome
and all mRNAs and ESTs for this project. Yet here is a list of 171 clones where the ORF in the
cDNA is longer than that on GOLD by at least 100 amino acids. A good proportion
of these identify frameshift mutations
in the genome that the sequencing centers might be happy to fix. Their
genome map is shown below: notice the absence of suspected frameshifts on
chromosome 21, and the large number on 17. Clustering of multiple clones
(superimposed glyphs) indicates a confirmed genome frameshift. About half of the
others are confirmed in the complete EST/mRNA AceView database.
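The 100 amino acid criterion above can be sketched as follows. The sketch assumes that an ORF is the longest stop-free stretch in any of the three forward frames of a sequence given on the transcript strand; this definition, the function names and the synthetic test sequences are assumptions made for illustration, not the procedure actually used for the list.

```python
STOPS = {"TAA", "TAG", "TGA"}

def longest_orf_codons(seq):
    """Length, in codons, of the longest stop-free run over the 3 forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                run = 0          # a stop codon closes the current ORF
            else:
                run += 1
                best = max(best, run)
    return best

def suspect_frameshift(cdna_seq, genome_derived_seq, min_gain=100):
    """True if the cDNA ORF exceeds the genome-derived ORF by >= min_gain codons."""
    gain = longest_orf_codons(cdna_seq) - longest_orf_codons(genome_derived_seq)
    return gain >= min_gain

# Synthetic example: this repeat unit is open in frame 0 only,
# the other two frames contain TAA or TGA in every unit.
cdna = "CTAACTGAC" * 120
# One extra base in the genome-derived copy shifts the frame halfway through.
genome_copy = cdna[:540] + "G" + cdna[540:]
```

Here `suspect_frameshift(cdna, genome_copy)` flags the pair, while comparing a sequence with itself does not.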
Figure 3d:
Suspected frameshifts in the genome sequence:
4: Multiple alignments identify duplicated genes (or incomplete genome assembly), and point to
chromosome-specific control mechanisms.
4a. Lists of exact or nearly exact gene duplications: Out of 71,421 mRNAs aligning at excellent, good or partial quality, and
not split, we identified and list below the clones with multiple alignments in
different areas of the genome, either with the exact same Gold score (from exact
duplicate genes) or nearly exact (within 5 points of the Gold score, i.e. on
average 99.97% sequence identity).
|
Exact duplicate alignments |
Exact or nearly exact duplicate alignments |
|||||||
Quality |
mRNA |
Gold alignments |
mRNA with 2 copies |
mRNA with >2 copies |
mRNA |
Exact or nearly exact duplicates |
mRNA with 2 copies |
mRNA with >2 copies |
1 copy in a random contig1 |
Excellent |
763 |
342 |
20 |
1,379 |
544 |
83 |
249 |
||
Good |
96 |
45 |
2 |
237 |
91 |
16 |
25 |
||
Partial |
10 |
5 |
0 |
24 |
10 |
1 |
5 |
||
Total |
414 |
869 |
392 |
22 |
1,640 |
645 |
100 |
279 |
1 About the 249 excellent
duplications in random contigs (build 34/July 2003): we noticed that all
cDNAs in that group consistently align on a given chromosome and one of its
associated “random” contigs, except for 5 accessions mapping in finished chromosome 5
but random chromosome 8, suggesting that
a mosaic genomic clone is involved (NT079526, build 34).
4b. Genome-wide map of the exact duplicate alignments.
This map complements the
paper’s Figure 2, which shows the combination of exact and
almost exact duplicates, while the map below shows only the exact duplicates. Chromosomes
3, 11, 14, 18, 19, 20, 21 do not contain any exact duplicates and are
omitted from the diagram. Notice the dominance of the intrachromosomal
duplications (in blue), with 243 duplicates with introns in both copies and 147
with no intron in either. There are no other types of intrachromosomal repeats.
Notice also the huge reduction in interchromosomal (red) duplicates, compared
to Figure 2, especially in those with introns in one copy and not in the other:
these retroposon-like pseudogenes usually have started diverging and are no
longer exact. Here we see only 5 exact pseudogenes of yes-no type (intron in
one, not the other; 7 cDNAs), from two genes with introns (6 cDNAs), both
ribosomal proteins. We also see 5 sites (18 cDNAs) for intronless
interchromosomal repeats. Finally, the 78 intron-intron exact interchromosomal
duplicates are all in the pseudoautosomal region: the four telomeric copies
from 1/15 and 14/22 already differ by one or two bases.
Red bar : interchromosomal
duplicate alignment
Blue bar: intrachromosomal duplicate alignment
Icon on top of vertical
bar: v if the copy mapped here has intron, no sign if it is intronless.
Bottom icon: ^ if the
copy elsewhere has intron, none if intronless, o if there are copies with introns as well as intronless.
4c Figure 4: Classification
of the duplicate alignments, per chromosome.
The figures in
this paragraph display graphically the classification of the
exact or almost exact duplicate genes (within 5 points of the GOLD score, on
average 99.97% similarity), per
chromosome, on genome build 34/July 2003.
The 1,377 repeats diagramed here are
generated by 625 mRNA aligned at excellent quality and not split (list available in 4a; there are altogether 66,004 alignments).
Results are shown in percent of all excellent non-split alignments on each
chromosome. In cases where more than 2 equivalent alignments are generated, a
given alignment may contribute to more than one class, for example inter and
intra-chromosomal, yet its total contribution is normalized to 1. The corresponding genome map, restricted to
duplicates in finished areas of the genome, is provided in the article (Figure
2). The data tables that allowed building these diagrams are available below.
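The normalization described above can be sketched as follows. The sketch assumes that an alignment with several equivalent partners splits its unit weight equally among them, which would account for fractional entries such as 9.67 in the data tables; the exact weighting is not spelled out here, so the equal split and the class labels are assumptions.

```python
from collections import Counter
from fractions import Fraction

def class_contributions(partner_classes):
    """Split one alignment's unit weight evenly over its duplicate partners.

    `partner_classes` lists the class of each partner alignment, e.g.
    "intra" or "inter" (hypothetical labels). The result maps each class
    to its share, and the shares always sum to 1.
    """
    counts = Counter(partner_classes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: Fraction(n, total) for cls, n in counts.items()}

# An alignment with two partners on its own chromosome and one elsewhere.
contrib = class_contributions(["intra", "intra", "inter"])
```

In this example the alignment contributes 2/3 to the intrachromosomal class and 1/3 to the interchromosomal class.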
a) Repetitions for which at least one member lies on a “random” contig, usually in an unfinished chromosome. These 470 repeated alignments do not have biological meaning; they just reflect lack of completion of the chromosome sequence.
b)
Duplicates for which both members map on the same autosome, with indication of the presence or absence of
standard introns. Alignments in this category include 327 where both copies
have standard introns, 2 where only one copy has introns and 232 where neither
copy has introns. See Figure 1d for sex chromosomes, which are highly enriched
in repetitions.
c)
Duplicates where both members map on different autosomes. Notice the much lower number of alignments and the
majority of retroposon-like pseudogenes: there are only 4 alignments (2 pairs)
where both copies have standard introns, 65 where only one copy has introns and
25 where neither copy has introns.
d) Same
for the X and Y chromosomes. Notice the much larger scale: X has 114 repeats of X on X (94 with
introns in both and 20 with no introns in either) and Y has 37 repeats of Y on
Y (35 with introns in both and 2 with no introns in either). For
inter-chromosomal repeats, X and Y share 45 repeats in two blocks of homology,
2.4 Mb in total (39 with introns in both copies and 6 with no introns in
either). X has an additional 10 repeats shared with autosomes; many of those
could be pseudogenes (5 have introns in only one copy and 5 have no intron in
either). For reference, 3.05% of the excellent GOLD alignments lie on the X and
Y chromosomes rather than on the autosomes.
Exact and almost
exact duplicate alignments, within 5 points from the Gold, are counted. Only
excellent alignments, not rearranged, are considered. Each alignment is
normalized to 1.
Chromosome | cDNA | Duplication | Duplication on the same chromosome | On another chromosome | Duplication involving isolated contig
1 | 6,585 | 219 | 42 | 12 | 165
2 | 4,731 | 112 | 56 | 4 | 52
3 | 4,013 | 17 | 0 | 2 | 15
4 | 2,672 | 6 | 0 | 6 | 0
5 | 3,077 | 108 | 92 | 1 | 15
6 | 3,222 | 23 | 0 | 3 | 20
7 | 3,287 | 81 | 72 | 9 | 0
8 | 2,335 | 57 | 31 | 3 | 23
9 | 2,771 | 110 | 27 | 2 | 81
10 | 2,848 | 53 | 38 | 5 | 10
11 | 3,648 | 7 | 0 | 7 | 0
12 | 3,466 | 9 | 0 | 6 | 3
13 | 1,304 | 11 | 5 | 6 | 0
14 | 2,149 | 2 | 0 | 2 | 0
15 | 2,320 | 76 | 55 | 10 | 10
16 | 2,955 | 114 | 110 | 4 | 0
17 | 3,736 | 65 | 22 | 1 | 41
19 | 3,793 | 10 | 4 | 4 | 2
20 | 1,834 | 2 | 2 | 0 | 0
22 | 1,518 | 13 | 9 | 4 | 0
X | 1,906 | 194 | 114 | 55 | 25
Y | 104 | 82 | 37 | 45 | 0
Un | 63 | 8 | 0 | 0 | 8
Total | 66,018 | 1,379 | 717 | 191 | 470
4d.2. Intron exon structure and
relative organization of the excellent duplicate alignments
Same set as
above, same rule for normalizing.
Note: Our
analysis of this dataset indicates that chromosome 8 has 31 tandem repeats in 8 genes
and no other repeat. Chromosomes 5 (30 genes), 15 (31 genes) and 17 (16 genes)
show similar but milder biases toward tandems. On the other hand, some
chromosomes, such as X, tend to have many more palindromic duplications than
tandem ones. Other chromosomes with a similar but milder bias are 2 (19 genes), 7
(30 genes), 9 (13 genes) and Y (15 genes). We do not know what the determinant
for this difference is, nor do we understand why chromosomes 3, 11, 14, 18,
19, 20, 21 simply do not contain any exact duplicates, but it is clearly a
problem of the utmost importance.
Chromosome | Same chrom both intron | Same chrom one intron | Same chrom no intron | Other chrom both intron | Other chrom one intron | Other chrom no intron | Tandem | Inward palindrome | Outward palindrome | Further than 1 Mb
1 | 19.00 | 0.00 | 23.00 | 1.00 | 10.00 | 1.00 | 18.00 | 7.00 | 13.00 | 4.00
2 | 35.00 | 0.00 | 21.00 | 0.00 | 3.00 | 1.00 | 9.67 | 26.67 | 12.00 | 7.67
3 | 0.00 | 0.00 | 0.00 | 0.00 | 1.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00
4 | 0.00 | 0.00 | 0.00 | 0.00 | 5.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00
5 | 57.00 | 0.00 | 35.00 | 0.00 | 1.00 | 0.00 | 60.73 | 6.00 | 7.47 | 17.80
6 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00
7 | 39.00 | 0.00 | 33.00 | 0.00 | 3.50 | 5.50 | 9.00 | 6.00 | 29.00 | 28.00
8 | 17.00 | 0.00 | 14.00 | 0.00 | 2.00 | 1.00 | 31.00 | 0.00 | 0.00 | 0.00
9 | 12.00 | 0.00 | 15.00 | 0.00 | 1.00 | 1.00 | 2.00 | 7.00 | 2.00 | 16.00
10 | 16.00 | 0.00 | 22.00 | 0.00 | 2.50 | 2.50 | 10.00 | 3.00 | 9.00 | 16.00
11 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
12 | 0.00 | 0.00 | 0.00 | 0.00 | 5.50 | 0.50 | 0.00 | 0.00 | 0.00 | 0.00
13 | 5.00 | 0.00 | 0.00 | 0.00 | 5.50 | 0.50 | 5.00 | 0.00 | 0.00 | 0.00
14 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
15 | 36.00 | 0.00 | 19.33 | 1.00 | 3.33 | 6.33 | 35.33 | 8.00 | 6.00 | 6.00
16 | 72.00 | 0.00 | 38.00 | 0.00 | 3.00 | 1.00 | 51.67 | 6.00 | 26.67 | 25.67
17 | 17.00 | 2.00 | 3.67 | 0.00 | 0.00 | 1.00 | 15.00 | 4.00 | 3.67 | 0.00
19 | 4.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00 | 0.00 | 0.00 | 4.00 | 0.00
20 | 0.00 | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 2.00 | 0.00
22 | 2.00 | 0.00 | 7.00 | 1.00 | 1.00 | 2.00 | 0.00 | 3.00 | 2.00 | 4.00
X | 94.00 | 0.00 | 20.00 | 39.00 | 4.67 | 11.33 | 19.00 | 62.50 | 32.50 | 0.00
Y | 35.00 | 0.00 | 2.00 | 39.00 | 0.00 | 6.00 | 3.00 | 12.00 | 11.33 | 10.67
Un | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Total | 460.00 | 2.00 | 255.00 | 82.00 | 66.00 | 43.67 | 269.40 | 151.17 | 160.63 | 135.80
4e. Functional analysis of a sample of the duplicated genes on
the July 2003 genome:
Selected
genes (with a LocusID and a
meaningful name) that have multiple excellent GOLD alignments, with exact or
almost exact copies duplicated outside of the pseudo-autosomal region are shown
in the table below. Each gene is linked to its AceView representations (updated
automatically). Some of these genes, but not all, are recognized as having
duplicate genes in the standard gene/RefSeq/LocusLink annotation at NCBI, and
this is evidenced by multiple LocusID in the second column. The others are not
annotated as duplicates at NCBI (July 2003 to 2004); this problem will probably
be fixed in the next releases. (Caution: the LocusID identifications result
from AceView analysis, not directly from GOLD, since GOLD only deals with
alignments and does not provide clustering or protein analysis. For this
reason, and because the public version of AceView was missing one quarter of
the repeats at the time of this work, this list is not exhaustive).
Functional
analysis of the duplicated genes: The
duplicated gene title or description, and its involvement in molecular and
biological processes (using GO classification), as annotated by AceView, are
displayed in the table.
Notice the unexpectedly high number of genes whose
products bind DNA or RNA, compared to the description of segmental duplications
by Bailey et al, 2002. Transcription factors, helicases, splicing factors, or
chromatin structure proteins are as preeminent in the list as those involved in
development and signal transduction. So, the picture we describe for repeated
genes matches more closely that described by the Bailey team for mouse or rat
segmental duplications, both by their relative location and by their function
(Bailey et al, 2004).
A more complete list, including genes duplicated in
the pseudo-autosomal region (e.g. ALTE,
ASMT, ASMTL, CD99, ELK4, IL3RA, IL9R, P2RY8, PR48, SLC25A6, SPRY3, SYBL1 and TXNRD1) and genes with a
LocusID but no official name and for which the protein has not been studied, is
given in another, more complete table.
That table also includes annotation of the GO cellular compartment: duplicated
gene products can be located in any part of the cell, with no apparent bias.
Phenotype:
Only 5 diseases have so far been
described in these duplicate genes, a roughly threefold lower frequency than in all
genes combined: CLN3, SMN1, SSX1, SSX2 and STRC, i.e. 5 duplicated loci with an
OMIM phenotype out of 199 duplicated genes, 161 of which have a LocusID (3.1% of the 161), versus 1,543
genes with an OMIM phenotype out of 19,413 genes with a LocusID in LocusLink, July
2003 (7.9%). This is expected, since a loss-of-function mutation, e.g. a hypomorphic or null mutation in
the sense of Muller (Muller HJ:
Further studies on the nature and causes
of gene mutations. Proc. Int. Cong. Genet. 1932, 6:213-255.), will, except for the very few haplo-insufficient loci,
lead only to dosage-type effects, that is to mild or undetectable phenotypic
effects. But neomorphic or antimorphic mutations are expected to lead to strong
effects. This type of gain-of-function mutation is much less frequent. In
practice, it is usually associated with dominant or semi-dominant effects,
although not always. It may, for example, be associated with rearrangements, such
as translocations that would bring the promoter of one gene upstream and in
control of the transcribed region of the other. The various translocations
involving the duplicated SSX genes on chromosome X and the gene SYT on chromosome
18, and leading to synovial sarcomas, are examples of the neomorphic type.
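The frequencies quoted at the start of the Phenotype paragraph can be recomputed directly from the counts given there. The sketch below only redoes that arithmetic; the variable names are ours, and no statistical test is implied.

```python
# Counts as quoted in the text (LocusLink, July 2003).
dup_with_phenotype = 5      # duplicated loci with an OMIM phenotype
dup_with_locusid = 161      # duplicated genes that have a LocusID
all_with_phenotype = 1543   # all genes with an OMIM phenotype
all_with_locusid = 19413    # all genes with a LocusID

dup_rate = dup_with_phenotype / dup_with_locusid
all_rate = all_with_phenotype / all_with_locusid
fold = all_rate / dup_rate  # how much rarer phenotypes are among duplicates

summary = (round(100 * dup_rate, 1), round(100 * all_rate, 1), round(fold, 1))
```

The ratio of the two rates comes out at about 2.6, which is the basis for describing the frequency as roughly threefold lower.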
Gene |
LocusID |
Title |
GO molecular processes |
GO Biological process |
Phenotype |
64682;285069 |
gene ANAPC1 encoding anaphase-promoting
complex 1 (meiotic checkpoint regulator). |
Possibly chaperone |
Cell cycle control |
|
|
57730 |
gene
ank.12/ ank.13 encoding KIAA1641 protein. |
ankyrin |
|
|
|
255326 |
gene ArfGap.8 (and glamu) encoding
hypothetical protein LOC255326. |
DNA binding |
Possibly nucleotide excision repair |
|
|
10156;246721 |
gene |
DNA binding activity; GTPase activator
activity |
intracellular signaling cascade;
transcription regulation |
|
|
1066;51716 |
gene CES1 and CESR, encoding
carboxylesterase 1 (monocyte/macrophage serine esterase 1). |
serine esterase activity; hydrolase
activity |
embryogenesis and morphogenesis;
metabolism; response to toxin; xenobiotic metabolism |
|
|
1201 |
gene CLN3 encoding ceroid-lipofuscinosis,
neuronal 3, juvenile (Batten, Spielmeyer-Vogt disease). |
chaperone activity |
protein folding |
[OMIM 204200] Ceroid, lipofuscinosis,
neuronal, 3, juvenile |
|
64506; 338963;6218 |
gene CPEB1 encoding cytoplasmic
polyadenylation element binding protein 1. |
nucleic acid binding activity |
protein biosynthesis |
|
|
9598;158511 |
gene CSAGE encoding taxol resistance
associated gene 3. |
|
|
|
|
1617;57055; 57054;57135 |
gene DAZ.1 encoding deleted in
azoospermia ; DAZ3 encoding deleted in azoospermia 3.. |
RNA binding activity;nucleic acid
binding activity |
fertilization (sensu Animalia);
spermatogenesis |
|
|
51428 |
gene DDX41 encoding DEAD
(Asp-Glu-Ala-Asp) box polypeptide 41. |
RNA binding activity; ATP dependent
helicase activity |
apoptosis; development; RNA processing |
|
|
1667;1668 |
gene DEFA1 encoding defensin, alpha 1,
myeloid-related sequence. |
antifungal peptide activity |
defense response; xenobiotic metabolism |
|
|
63947;203429 |
gene DMRTC1 encoding DMRT-like family
C1. |
|
|
|
|
8663 |
gene EIF3S8 encoding eukaryotic
translation initiation factor 3, subunit 8, 110kDa. |
translation initiation factor activity |
regulation of translational initiation |
|
|
9260 |
gene ENIGMA
encoding enigma (LIM domain protein). |
|
intracellular signaling cascade;
receptor mediated endocytosis |
|
|
FKBP6
/ FKBP6.1/
FKBP6.2
/ FKBP6.3
|
8468; 54441 |
gene FKBP6 or .1,.2,.3, encoding FK506
binding protein 6, 36kDa. |
FK506
binding activity; zinc ion binding |
protein folding |
|
96626;339692 |
gene Gaa1.0 encoding pinch-2. |
|
|
|
|
2543;2574;2575; 2576;2577;2578; 2579;26748;
26749;89801 |
gene GAGE1
encoding G antigen 1. |
|
cellular defense response |
|
|
9503 |
gene GAGED2 or GAGED2.1, encoding G
antigen, family D, 2. |
|
|
|
|
9502 |
gene GAGED3 or GAGED3.1, encoding G
antigen, family D, 3. |
|
|
|
|
23015 |
gene GOLGIN-67 encoding golgin-67. |
|
|
|
|
2953 |
gene GSTT2 encoding glutathione
S-transferase theta 2. |
|
|
|
|
GTF2H2/ GTF2H2.1 |
2966 |
gene GTF2H2.1 encoding general
transcription factor IIH, polypeptide 2, 44kDa. |
|
DNA repair; regulation of
transcription, DNA-dependent |
|
2969;2970 |
gene GTF2I encoding general
transcription factor II, i. |
protein binding activity; transcription
factor activity |
regulation of transcription initiation
from Pol II promoter, DNA-dependent; signal transduction |
|
|
54441;84163 |
gene GTF2IRD2 encoding DKFZp434A0131
protein./ gene GTF2I.4 encoding transcription factor GTF2IRD2. |
|
|
|
|
84500 |
gene HEAT.2 or HEAT.4, encoding
KIAA1833 protein. |
|
|
|
|
374676 |
gene heyu, similar to golgi
autoantigen, golgin subfamily a, 2; SY11 protein. |
|
|
|
|
8370 |
gene HIST2H4 encoding histone 2, H4. |
DNA binding |
|
|
|
55410;86614; 159119;83869 |
gene HSFY encoding chromosome Y open
reading frame 14. |
transcription factor activity |
regulation of transcription,
DNA-dependent |
|
|
54732 |
gene
HSGP25L2G encoding gp25L2 protein. |
protein carrier activity |
intracellular protein transport |
|
|
79930 |
gene IRS.4 encoding Dok-like protein. |
insulin receptor binding |
|
|
|
81893 |
gene LAT1-3TM encoding LAT1-3TM
protein. |
transcription regulator activity |
|
|
|
3963 |
genes LGALS7 encoding lectin,
galactoside-binding, soluble, 7 (galectin 7). |
lectin;sugar binding activity |
heterophilic cell adhesion; cell growth
and/or maintenance |
|
|
51402 |
gene LW-1 encoding LW-1. |
transcription factor activity |
regulation of transcription,
DNA-dependent |
|
|
81557 |
gene MAGE.2 encoding melanoma antigen,
family D, 4. or MAGED4 melanoma antigen, family D, 4.. |
|
|
|
|
9598;266740; 4101;4111 |
gene MAGEA2 encoding taxol resistance
associated gene 3 |
|
drug resistance |
|
|
266740; 4101;4105 |
gene MAGEA2b encoding melanoma antigen,
family A, 2, copy b. |
|
|
|
|
4108 |
gene MAGEA9 encoding melanoma antigen,
family A, 9. |
|
|
|
|
222967;285927 |
gene MORN.6 encoding hypothetical
protein LOC222967. |
|
|
|
|
4360 |
gene MRC1 encoding mannose receptor, C
type 1. |
sugar binding activity; mannose binding
activity; calcium ion binding activity; receptor activity |
heterophilic cell adhesion;
pinocytosis; receptor mediated endocytosis |
|
|
51326;10230 |
gene NBR2.1
encoding ARF protein./gene arf.11 encoding ARF protein. |
|
|
|
|
51402 |
gene novar encoding LW-1. |
transcription factor activity |
regulation of transcription, DNA-dependent |
|
|
56001 |
gene NXF2 or NXF2.1 encoding nuclear
RNA export factor 2. |
protein transporter activity; RNA
binding activity |
protein-nucleus import; mRNA
processing; mRNA-nucleus export; |
|
|
84528 |
gene PEPP-2 encoding PEPP subfamily gene
2. |
transcription factor activity |
regulation of transcription,
DNA-dependent |
|
|
112597;5342 |
gene PLGL.1 encoding hypothetical
protein MGC4677. |
plasmin activity |
|
|
|
283820;23420 |
gene PM5 (or Cna_B.0) encoding hypothetical
protein LOC283820. |
|
|
|
|
84220 |
gene RANBP2L1 encoding RAN binding
protein 2-like 1. |
RAN protein binding activity |
|
|
|
5940; 159163 |
gene RBMY1A1.2/3 encoding RNA binding
motif protein, Y-linked, family 1, member A1. |
RNA binding activity |
RNA processing; spermatogenesis |
|
|
6218 |
gene RPS17 encoding ribosomal protein
S17. |
RNA binding activity |
protein biosynthesis; cytosolic small
ribosomal subunit (sensu Eukarya) |
|
|
283971;348174 |
gene SCP.0 or 7, encoding hypothetical
protein MGC34761. |
sugar binding |
|
|
|
8293;6606;6607 |
gene SERF1A.1 encoding small EDRK-rich
factor 1A (telomeric). |
|
|
|
|
6435 |
gene SFTPA1 encoding surfactant,
pulmonary-associated protein A1. |
sugar binding |
|
|
|
6606; 6607 |
gene SMN1 encoding survival of motor neuron
1, telomeric and SMN2 survival of motor neuron 2,centromeric. |
|
RNA processing; pre-mRNA
splicing; snRNP assembly |
[OMIM 158590] Spinal muscular atrophy,
4 |
|
23585 |
gene SMP1 or UPF0220.0, encoding small
membrane protein 1. |
|
|
|
|
6756;6759 |
gene SSX1 encoding synovial sarcoma, X
breakpoint 1. |
nucleic acid binding activity;
transcription co-repressor activity |
regulation of transcription,
DNA-dependent; cell growth and/or maintenance |
[OMIM 300326] Sarcoma, synovial |
|
6757 |
gene SSX2 encoding synovial sarcoma, X
breakpoint 2. |
nucleic acid binding |
regulation of transcription,
DNA-dependent |
[OMIM 300192] Sarcoma, synovial |
|
6759 |
gene SSX4 encoding synovial sarcoma, X
breakpoint 4. |
nucleic acid binding |
regulation of transcription,
DNA-dependent |
|
|
24150 |
gene steeju
encoding TP53TG3 protein. |
|
|
|
|
161497 |
gene STRC encoding stereocilin. |
|
|
[OMIM 603720] Deafness, autosomal
recessive 16 |
|
79008;6818 |
gene SULT1A3 encoding hypothetical
protein MGC5178. |
aryl sulfotransferase activity;
transferase activity |
steroid metabolism; synaptic
transmission |
|
|
339284;84218 |
gene TBC1D3 encoding hypothetical
protein FLJ11822. |
membrane alanyl aminopeptidase activity |
proteolysis and peptidolysis |
|
|
84321 |
gene THOC3 encoding THO complex 3. |
RNA binding activity |
mRNA-nucleus export; nuclear mRNA
splicing, via spliceosome; transport |
|
|
55911 |
gene titori encoding apolipoprotein B48
receptor. |
|
|
|
|
84673 |
gene TTTY8 encoding testis-specific
transcript, Y-linked 8. |
|
|
|
|
9084 |
gene VCY encoding variable charge,
Y-linked. |
|
|
|
|
81554 |
gene WBSCR16 encoding Williams-Beuren
syndrome chromosome region 16. |
|
|
|
|
26095 |
gene Y_phosphatase.7 or kloraby,
encoding DKFZP566K0524 protein. |
protein tyrosine phosphatase activity |
protein amino acid dephosphorylation |
|
5:
Background
on rearrangements can be found in the book:
Sturtevant
AH and Beadle GW: An introduction to
genetics. 1940 ed: Saunders,
a.
Partition of the rearranged Gold alignments, of excellent, good or partial quality, excluding repetitions, by
type of rearrangement:
Type |
mRNA |
Parts |
%mRNA |
Mosaic |
2,444 |
1.65% |
|
Inversion |
254 |
0.17% |
|
Transposition |
608 |
0.40% |
|
Suspected
genome deletion |
283 |
0.39% |
|
Variable
tandem repeat number |
894 |
1.23% |
|
Pseudo-split
or mRNA bridging two correctly ordered and oriented contigs |
72 |
0.05% |
|
Total rearrangements |
4,438 |
3.63% |
|
Total
alignments at this quality |
72,693 |
76,371 |
100.00% |
5b. Map of the rearranged alignments on the
finished part of the chromosomes
The
map positions of the major fragment from 248 transpositions (red, above the
line), 116 inversions (red, under the line), 209 suspected genome deletions
(blue, above the line) and 1241 mosaics (light blue, under the line) are shown.
Some chromosomes clearly house a higher density of rearrangements than
others: 21 or X, for example, look good, while 19, 22 or 17 should probably be rechecked
in the suspect areas. The control is given here.
To shorten the figure, the rearrangements mapping on the “random” contigs are
not shown here.
6: Intron
properties. Splicing in Gold alignments: the frequency of structural
polymorphisms and intronless genes is unexpectedly high.
Out
of the 66,419 excellent GOLD alignments, including exact duplicates (but not
the almost exact duplicates)
Alignment has | # alignments | % alignments
At least one standard intron | 50,830 | 76.5%
No intron and no alignment gap | 14,343 | 21.6%
Only non-standard introns or gaps | 1,246 | 1.9%
At least one non-standard intron | 5,068 | 7.6%
Statistics
on the types of intron boundaries, on the same sample of excellent alignments.
There
are on average 7.8 introns per alignment that is not intronless. The partition
by type of intron boundary is as follows:
Intron boundary | Number of introns | % of all introns
Standard gt-ag | 395,953 | 97.41%
Standard gc-ag | 4,427 | 1.09%
Standard U12 at-ac | 552 | 0.14%
Any other non-standard type# | 5,675 | 1.40%
Total | 406,501 | 100.00%
#
Interpretations for the high level of such anomalous pseudo-introns are
discussed in the article.
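The partition by boundary type can be sketched as a lookup on the first two and last two bases of each intron. The sketch assumes that intron sequences are given on the strand of the transcript; the function name and the example sequences are invented for illustration.

```python
from collections import Counter

# Donor/acceptor dinucleotide pairs treated as standard in the table above.
STANDARD = {
    ("GT", "AG"): "gt-ag",
    ("GC", "AG"): "gc-ag",
    ("AT", "AC"): "U12 at-ac",
}

def boundary_type(intron_seq):
    """Classify an intron by its first two and last two bases."""
    s = intron_seq.upper()
    if len(s) < 4:
        return "non-standard"   # too short to carry both boundaries
    return STANDARD.get((s[:2], s[-2:]), "non-standard")

# Invented intron sequences, one per class.
introns = ["gtaagtccttagcag", "GCAAGTTTTCAG", "ATATCCTTTTAC", "CTAAAAAC"]
tally = Counter(boundary_type(seq) for seq in introns)
```

Applied to a full set of introns, the resulting tally would correspond to the "Number of introns" column.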
6b. Size histograms for standard introns, non-standard
introns and variable repeat number polymorphisms, restricted or not to that in
the last exon.
An
analysis of standard and non-standard introns sizes and micro-rearrangements:
comparison between those lying in last exons and all others
a)
Same histogram as Figure 3 (/b below), but showing only the last intron and
micro-rearrangements below 300 bp, out of a total of 52,054 last introns: 8574
standard introns (pink), 1594 non-standard (blue), 356 variable repeat number
polymorphisms plus 29 suspected genome deletions (green). Notice the prevalence
of micro-rearrangements, i.e. non standard introns or variable repeat numbers
in the last exon, compared to the histogram for all exons (presented again
right below). Notice, in the last exon, a small peak of very short standard
introns, as expected, because the cost function slightly favors gt-ag or gc-ag
or at-ac, especially in non-coding areas.
The marked dip from -6 to +6 is also a direct consequence of the cost
function, which effectively penalizes such short exons/introns. b) same data as
figure 3, but collapsed. Data on the same set of 52,054 excellent alignments as
in a: 72,797 standard introns (pink), 2693 non standard introns (blue), 516
micro-rearrangements (green).
a)
b)
For comparison, same diagram but for all introns and micro-rearrangements, not
just the last one.
6c: Genome map of variable repeat number
polymorphisms and anomalous introns that do not correspond to suspected
internal deletions in the cDNA.
894
variable repeat number polymorphisms are shown in red above the line, and 2417
anomalous introns (not gt-ag, gc-ag or at-ac), less than 100 bp long and not
yet clearly recognized as suspected internal deletions in the cDNAs, hence
possibly corresponding to polymorphisms in the genome sequence (or more rarely
to actual use of non-standard introns), are represented in light blue under the
line. The map is limited to the finished chromosomes and does not include the
random contigs.
Correlation
between the two types might indicate regions of the genome that are
more fluid (or regions where the genome assembly is of lesser quality). Note
that there also appears to be some correlation with the map of the other rearrangements.
7: Comparing the quality of the sequence of the
chromosomes
Chromosome |
cDNA |
Excellent |
Good |
Partial |
Dubious |
Bad |
Masked |
1
|
7,328 |
6,585 |
561 |
0 |
|||
2* |
5,227 |
4,731 |
410 |
||||
3
|
4,432 |
4,013 |
338 |
0 |
|||
4
|
2,958 |
2,672 |
222 |
0 |
|||
5
|
3,413 |
3,077 |
274 |
0 |
|||
6* |
3,574 |
3,222 |
290 |
||||
7
|
3,729 |
3,287 |
365 |
0 |
|||
8
|
2,642 |
2,335 |
257 |
0 |
|||
9
|
3,135 |
2,771 |
262 |
0 |
|||
10
|
3,157 |
2,848 |
255 |
0 |
|||
11
|
3,981 |
3,648 |
261 |
0 |
|||
12
|
3,844 |
3,466 |
297 |
0 |
|||
13
|
1,444 |
1,304 |
122 |
0 |
|||
14* |
2,355 |
2,149 |
184 |
||||
15
|
2,584 |
2,320 |
216 |
0 |
|||
16
|
3,318 |
2,955 |
288 |
0 |
|||
17
|
4,187 |
3,736 |
350 |
0 |
|||
18
|
1,105 |
965 |
123 |
0 |
|||
19
|
4,307 |
3,793 |
425 |
0 |
|||
20
|
2,045 |
1,834 |
162 |
0 |
|||
21
|
775 |
664 |
92 |
0 |
|||
22* |
1,771 |
1,518 |
202 |
||||
X
|
2,064 |
1,906 |
120 |
0 |
|||
Y
|
135 |
104 |
21 |
0 |
0 |
||
Un
|
95 |
63 |
23 |
0 |
|||
Masked* |
379 |
52 |
60 |
||||
Total |
73,984 |
66,018 |
6,180 |
Only
excellent, good and partial alignments are considered, hyper variable regions
are excluded (*).
Chromosome |
#
Gold alignments |
Suspected
genome deletion |
Transpo-sition |
Variable
repeat number |
Inversion |
Mosaic |
Non-standard
intron |
1
|
7,235 |
||||||
2* |
5,198 |
||||||
3
|
4,401 |
||||||
4
|
2,927 |
||||||
5
|
3,389 |
||||||
6* |
3,549 |
||||||
7
|
3,695 |
||||||
8
|
2,624 |
||||||
9
|
3,090 |
||||||
10
|
3,137 |
||||||
11
|
3,964 |
||||||
12
|
3,816 |
||||||
13
|
1,435 |
0 |
|||||
14* |
2,347 |
||||||
15
|
2,571 |
||||||
16
|
3,287 |
||||||
17
|
4,143 |
||||||
18
|
1,098 |
||||||
19
|
4,267 |
||||||
20
|
2,021 |
||||||
21
|
763 |
0 |
0 |
||||
22* |
1,753 |
||||||
X
|
2,045 |
||||||
Y
|
128 |
0 |
0 |
0 |
|||
Un
|
90 |
0 |
0 |
0 |
0 |
||
Masked* |
134 |
0 |
|||||
Total |
73,107 |
7b) Human Genome Project Coordinators by Chromosome (From the Washington University,
For
reference…
Chr |
Coordinating Center |
Contact |
1 |
||
2 |
||
3 |
||
4 |
||
5 |
||
6 |
||
7 |
||
8 |
||
9 |
||
10 |
||
11 |
||
12 |
||
13 |
||
14 |
||
15 |
||
16 |
||
17 |
||
18 |
||
19 |
||
20 |
||
21 |
||
22 |
||
X |
||
Y |
8: A snapshot of the
compared performance, strengths and weaknesses of public alignment
programs: UCSC, NCBI, AceView
Users
of the main public sites rely on the results displayed, so we analyze here the
properties of three public programs, NCBI 34.3, UCSC and AceView. Blat (the UCSC aligner) has been
excellent and very stable for years now, but NCBI and AceView are still
evolving, so this analysis should be considered a snapshot in time, for public
data from March 10 to August 21, 2004. Updates on the comparison will certainly
converge within a year or two.
Figure
5a shows the overall performance of the programs. Although the choice of the
GOLD may not always be optimal, these cases are now too rare to influence
notably the results presented in this section. Let us recall that the score is
of the same order of magnitude as the number of bases that exactly match the
genome. The overall success rate in Figure 5a is evaluated by counting the
percentage of alignments whose score reaches the Gold score, is off by 1
to 5 points, 6 to 50, 51 to 500 or more than 500 points. In addition, some
mRNAs may remain unaligned.
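The success classes used for Figure 5a can be sketched as a simple binning of the score deficit relative to the Gold score. The sketch assumes integer scores and represents an unaligned mRNA by `None`; the function name and the class labels are ours.

```python
def deficit_class(program_score, gold_score):
    """Bin a program's alignment by how far its score falls below the Gold score."""
    if program_score is None:
        return "unaligned"
    deficit = gold_score - program_score
    if deficit <= 0:
        return "reaches Gold"
    if deficit <= 5:
        return "off by 1 to 5"
    if deficit <= 50:
        return "off by 6 to 50"
    if deficit <= 500:
        return "off by 51 to 500"
    return "off by more than 500"
```

The overall success rate of a program is then the percentage of its alignments falling in each class.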
We
then try to understand in which way the programs differ from the GOLD. To
usefully pinpoint defects, we limited this analysis to the 64,938 GOLD with
excellent and non-split alignments, aligning on average over 99.98% of their
length with 1.4 base differences from the genome per kb. There are two broad
types of difficulties: the proper determination of the area of the genome where
the cDNA originated and the precise definition of the exons and introns. To
quantify the precision of the local alignment (Figure 4b), we simply compare
any alignment that overlaps the corresponding Gold alignment on the genome, and
count the number of cases where a program
o misses an exon. We distinguish the first exon, most often missed by programs,
any internal exon, and the last exon, missed in rare instances. AceView
currently has the best record there, with altogether 1.18% of the alignments
missing exons; UCSC has 1.47%, NCBI 1.56%.
o produces too many exons. We again distinguish by position: first extra exons
are often spurious vector matches; others, such as most of the AceView lot, are
validated cooperative first short exons. Internal extra exons are rarer. Last
extra exons often consist of pure poly-A. UCSC holds the record with only 0.16%
of alignments with too many exons; AceView has 0.57% and NCBI 4.00%.
o produces inaccurate exons with suboptimal boundaries. AceView has 1.18%, NCBI
2.35% and UCSC 3.04% (its weakest point).
If we now look at the larger-scale picture and ask how often a program aligns
the mRNA to its best position on the genome, and only there, different measures
can be made:
o How many accessions are assigned to a wrong map position? We call “wrong map”
a position not overlapping the GOLD and where the alignment is at least 50
points below the Gold score. We find that mapping is highly reliable: fewer than
0.02% of the accessions are mapped in wrong positions by the programs discussed.
The only caveat is redundancy: if an accession may align in several places, some
programs do not always correctly discard the lower quality alignments, thus
keeping, in addition to the good map, a number of “wrong map” positions. NCBI
34.3 has no such cases, AceView has 3 and UCSC has 54 instances (0.08%).
o A related problem is how many of the exact duplications go undetected. Of
course, a program with high redundancy is expected to miss fewer of the true
repeats, and indeed UCSC is the best, missing only 6% of the duplications.
AceView misses 22% and NCBI 87%. This factor is certainly tunable, but it
clearly seems important to show the exactly duplicated alignments in the 1% of
cases where they occur.
o Finally, as we showed above, more than 3% of the cDNAs align best as split
alignments, with different pieces landing in more than one place or strand on
the genome. Most programs apart from UCSC and NEDO do not propose split
alignments in an efficient manner. We hope to have convinced the reader of the
interest of such split alignments in the last phase of quality assessment of the
genome assembly, and that more of us will build this feature into our programs.
Every data contributor privately receives extensive lists of problems that
should help them debug their program. Example lists are available in
supplementary material 9.
Comparing the results of any program to the Gold allows us to classify the
defects of this program and to provide lists of suboptimal alignments to the
contributors. It also allows us to identify the main difficulties and to
collect lists of examples for each type of problem.
We only consider mRNAs with excellent non-split Gold alignment; exact
duplications are included.
Method | Gold | AceView | UCSC | NCBI 34.3 |
Gold | 65287 | 60705 | 59547 | 55336 |
loses 1 to 5 | 0 | 3115 | 4076 | 3233 |
loses 6 to 50 | 0 | | | |
loses 51 to 500 | 0 | | | |
loses >500 | 0 | | | |
Unaligned | 0 | 0 | | |
No score | 0 | 0 | 0 | |
Total | 65287 | 65535 | 65589 | 64931 |
mRNA tested total | 64886 | 64886 | 64548 | 64886 |
Defect | AceView | UCSC | NCBI 34.3 |
Total comparable alignments | 64,652 | 64,284 | 63,494 |
missed first exon | | | |
missed central exon | | | |
missed last exon | | | |
inaccurate exon | | | |
extra first exon | | | |
extra central exon | | | |
extra last exon | | | |
missed micro-insertion/deletion | | | |
extra micro insertion/deletion | 0 | | |
The line "Total" gives the number of accessions that were mapped to the same
place on the genome as the GOLD, and that could thus be compared to the GOLD in
terms of exon and intron structure. An exon is "missed" if it is part of the
GOLD but the proposed alignment has no overlapping exon. An exon is "extra" if
it belongs to an alignment but does not overlap any exon in the GOLD. The first
exon is the 5'-most, the last the 3'-most; a central exon is any exon in between
the first and last. An inaccurate exon is an internal exon, bounded by two
introns and overlapping a Gold exon, that has at least one intron boundary not
optimally chosen according to our analysis. Sometimes, the Gold solution is very
close... The number of accessions with each defect is shown. The same accession
may appear in more than one line of the table.
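The definitions of "missed", "extra" and "inaccurate" exons can be illustrated with a short Python sketch. It is a simplified reading of the text, not the comparison program itself: the names `overlaps`, `position` and `compare_exons` are hypothetical, exons are assumed to be (start, end) pairs with start <= end listed from 5' to 3', and an internal exon is called inaccurate when no Gold exon has exactly the same coordinates, which ignores the finer question of which Gold boundaries are intron boundaries.

```python
def overlaps(a, b):
    """True if two (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def position(index, count):
    """Label an exon by rank: the 5'-most is first, the 3'-most is last."""
    if index == 0:
        return "first"
    if index == count - 1:
        return "last"
    return "central"

def compare_exons(gold, candidate):
    """List the defects of a candidate alignment relative to the Gold exons."""
    defects = []
    for i, g in enumerate(gold):
        # missed: a Gold exon that no candidate exon overlaps
        if not any(overlaps(g, c) for c in candidate):
            defects.append("missed %s exon" % position(i, len(gold)))
    for j, c in enumerate(candidate):
        if not any(overlaps(c, g) for g in gold):
            # extra: a candidate exon that overlaps no Gold exon
            defects.append("extra %s exon" % position(j, len(candidate)))
        elif 0 < j < len(candidate) - 1 and c not in gold:
            # internal exon overlapping the Gold but with different boundaries
            # (simplification: exact coordinate identity is required)
            defects.append("inaccurate exon")
    return defects
```

An accession would then be counted once on every table line whose defect appears in the returned list, which is why the same accession may contribute to several lines.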
Defect | AceView | UCSC | NCBI 34.3 |
Unaligned | 0 | | |
Wrong map | | | |
Redundant | 0 | | |
mRNA tested | 64,886 | 64,548 | 64,886 |
This table lists all accessions for which finding the place in the genome where
the cDNA aligns was problematic. If an accession was tested but failed to
produce any alignment, the accession is "Unaligned". An accession that was
aligned only in a position where the quality of the match is far lower than that
of the GOLD (50 points below) is labeled "Wrong map". An accession for which two
or more independent alignments of the same part of the clone were provided, but
one was far worse than the GOLD (50 points below the Gold score, hence its map
position is not correct), is called "Redundant": the inferior alignment should
have been discarded.
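The three mapping categories can be expressed as a small decision procedure. The Python sketch below is an illustration under explicit assumptions, not the procedure actually used: the `Aln` record and the function names are hypothetical, strand is ignored, all alignments of an accession are assumed to cover the same part of the clone, and "far less good" is taken as a loss of at least 50 points at a locus that does not overlap the GOLD.

```python
from typing import NamedTuple

class Aln(NamedTuple):
    chrom: str
    start: int
    end: int
    score: int

def same_locus(a, b):
    """True if two alignments overlap on the same chromosome (strand ignored)."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def is_wrong(gold, aln, margin=50):
    """A 'wrong map' position: away from the GOLD and at least `margin` points below it."""
    return not same_locus(gold, aln) and gold.score - aln.score >= margin

def classify_mapping(gold, alns, margin=50):
    """Return 'Unaligned', 'Wrong map', 'Redundant' or 'OK' for one accession."""
    if not alns:
        return "Unaligned"
    wrong = [a for a in alns if is_wrong(gold, a, margin)]
    if wrong and len(wrong) == len(alns):
        return "Wrong map"     # aligned only in wrong positions
    if wrong:
        return "Redundant"     # a good map plus alignments that should be discarded
    return "OK"
```

Note that a non-overlapping alignment scoring within 50 points of the Gold is not flagged here; under these assumptions it is treated as a possible duplicate rather than a mapping error.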
Defect | AceView | UCSC | NCBI 34.3 |
mRNA with missed duplication | | | |
mRNA with missed split | | | |
#mRNA with duplication tested | 362 | 235 | 362 |
#mRNA which are split tested | 1,132 | 1,122 | 1,132 |
10 Discussion:
Lists of challenging problems and questions that require experimental input
• Long genes, extending over 800 kb of genome sequence: A list of 72 excellent
GOLD alignments spread over more than 800 kb of genome is given. The longest
alignment in the set is that of AB020675, 2.298 Mb long, on chromosome 7. The
solution was found only by the NEDO group, congratulations!
• Long introns, above 400 kb: A list of 61 excellent GOLD with challenging
standard introns above 300 kb is provided. The longest intron, 1136 kb, is found
by NEDO in accession AK097990, but there is a gene in the proposed intron,
leaving open the possibility of a genome misassembly or of a missed duplication.
Similar doubts arise for other examples from this list. One set of examples that
seems to lie on a correctly assembled part of the genome, and would thus be
biologically real rather than mere test cases for programmers, includes
BC030832, with an intron 866 kb long. This one appears less dubious because
another variant, BC038514, has an alternative exon in the same area, with an
intron 846 kb long.
• Micro-introns, between 6 bp and 12 bp, with standard boundaries: A list of 21
accessions with this feature is provided. These gt-ag or gc-ag introns are
probably not the result of spliceosome-dependent splicing, but rather examples
of length polymorphisms. Yet they provide an excellent training set.