GOLD: a cooperative approach to choose
the best mRNA-to-genome alignment and what we learn about human genes and the
genome.
Jean
Thierry-Mieg1, Danielle Thierry-Mieg1, Michel Potdevin2,
Mark Sienkiewicz1, Yuji Kohara3, Sumio Sugano4,5,
W. James Kent6, Kouichi Kimura7, Tetsuo Nishikawa7,8,
Takao Isogai8, Takashi Gojobori5, Nobuo Nomura5.
1NCBI, NLM, NIH, Bethesda, MD20894, USA; 2LPM,
CNRS, Montpellier, France ; 3National Institute of Genetics,
Mishima, Japan; 4The University of Tokyo, Japan; 5METI/Japan
Biological Information Research Center, Tokyo, Japan; 6University of
California, Santa Cruz, USA; 7Hitachi Ltd., Central Research
Laboratory, Kokubunji, Tokyo, Japan; 8 Reverse Proteomics Research
Institute, Kisarazu, Chiba, Japan.
Corresponding
author: Danielle Thierry-Mieg, NCBI, NLM, NIH, building 38A room 3S306, 8600
Rockville Pike, Bethesda MD20894,
Takao Isogai Reverse Proteomics Research Institute, 2-6-7, Kazusa-Kamatari, Kisarazu,
Takashi Gojobori CIB-DDBJ, National Institute of Genetics, Mishima, Japan and Japan Biological
Information Research Center, AIST Tokyo Waterfront, 2-41-6, Aomi, Koto-ku,
Tokyo 135-0064, Japan; Tel: 81 355 318550, 81 559 816847; Fax: 81 355 318551; tgojobor@jbirc.aist.go.jp .
W. James Kent, Center for
Biomolecular Science and Engineering,
Kouichi Kimura, Hitachi Ltd., Central Research Laboratory, 1-280 Higashi-Koigakubo, Kokubunji,
Tokyo 185-8601, Japan; kokimura@crl.hitachi.co.jp
Yuji Kohara, National Institute of Genetics,
Tetsuo Nishikawa Reverse Proteomics Research Institute, 2-6-7, Kazusa-Kamatari, Kisarazu,
Nobuo Nomura METI/ Japan Biological Information Research Center,
AIST Tokyo Waterfront, 2-41-6, Aomi, Koto-ku, Tokyo 135-0064, Japan; Tel: 81
335 998142, Fax: 81 335 998141; nnomura@jbirc.aist.go.jp
Michel Potdevin, Laboratoire de Physique Mathématique, CNRS, 2 place
Eugène Bataillon, 34095 Montpellier, France; michel@lpm.univ-montp2.fr
Mark Sienkiewicz, NCBI, NLM, NIH, building 38A, 8600 Rockville
Pike, Bethesda MD20894,
Sumio Sugano The
Jean Thierry-Mieg, NCBI, NLM, NIH, building 38A room 3S306, 8600
Rockville Pike, Bethesda MD20894,
Running
title: Gold alignments of human mRNAs and gene annotation
Abstract
Background:
Identifying
the genes, an essential task in post-genome biology, requires aligning the many
cDNA sequences available in the public databases on the genome. However, the
various programs used at the main genome annotation sites propose different
solutions to the exact exon structure in about half the cases.
Results:
To
help resolve this problem, we pool the mRNA-to-genome alignments proposed by
NCBI, UCSC, ensembl, AceView, and H-inv, for 74,106 mRNA from 29,194 human genes.
We carefully define a cost function and let “GOLD”, Genomewide Optimization of
Locus Description, select the best alignment for each clone. We annotate the
Gold alignments, discuss the distribution of introns and minimal size of exons,
classify the frequent rearrangements, and propose that variable
tandem-repeat-number and micro-introns below 65 bp, which occur in 9% of the
genes, are micro-polymorphisms. We evidence striking chromosomal and regional specificity
in the control of gene duplication and discover that exact duplicates of genes
containing introns are all clustered within 3.1 megabases of each other. We also
observe interchromosomal and regional variability in the levels of base
mismatch and rearrangements, annotate suspected defects, including frameshifts,
in the genome and the cDNAs, and discuss the high frequency of intronless
genes. Finally we identify difficult alignments through programs comparison.
The current Gold, their annotations, the C program and acedb schema are
available online, or from www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD.
Contributions of new alignments are encouraged.
Conclusions:
Because
GOLD extracts the best solutions to difficult alignment problems from all programs,
it opens a new dimension toward a precise annotation of the human genes and
genome.
----------------------------------
Background
Genes
are primarily defined by their transcribed regions, and a large number of cDNA
sequences are available in the public databases. Since a high quality human
genome sequence is also at hand, one might expect that the collection of gene
models would, by now, be basically complete, and consensual between the various
public resources. But this is not the case, in part because aligning each cDNA
precisely back from where it was transcribed on the genome is a pre-requisite,
and is surprisingly difficult. Indeed, when we compare, as of July 2004, the
alignments of mRNAs displayed on the public web sites at UCSC, NCBI, ensembl,
AceView, and H-inv, we find that they strictly agree in only 50% cases. More
softly, if we ignore discrepancies at the very ends of the alignments, the
programs agree on the intron boundaries for 66% of the accessions. This does
not facilitate international collaboration on the important question of
defining and annotating the human genes, which is the ultimate goal.
When
we examined the alignments provided by each group in detail, it quickly became
apparent that each program has distinct strengths and weaknesses and that the
best collection of alignments should include results from many, maybe all,
programs. It also appeared that a consensus approach would not provide a
satisfactory solution except in the most trivial cases. So we chose a more complex
strategy. We observed that it is usually easy to decide which of many
alignments is best by just looking at them, so we reasoned we could write and
train a program to do the same thing, provided we analyzed large numbers of
cases by eye. We therefore pooled the
results from all programs into a dedicated acedb database and painstakingly
elaborated a cost function, adding successive refinements until the program
automatically identified reliably, in nearly all cases, the very alignment that
we would select by hand. We call the program “GOLD”, for Genomewide
Optimization of Locus Description. The resulting Gold reference alignments are
annotated and made freely available (supplement S2d) and can be used
as a basis to analyze the genome and the genes.
In the first part of
this article, we present the raw data: the genome and the cDNAs selected for
the benchmark. We describe the four main large scale cDNA projects that serve
as a basis to this work, and show that all are of the highest quality. We
present the set of alignment programs that contributed their data. We then
outline the principles of the GOLD selection algorithm, detail the construction
of the cost function and provide the actual Gold alignments.
The second part takes
advantage of the precision of the Gold alignments to increase our knowledge of
the genome and the transcriptome. Since this project is limited to providing a
solution to the first, but critical step in the gene building procedure, the
mRNA-to-genome alignment, we did not address directly some more elaborate
issues, such as clustering of the alignments into gene models, alternative
splicing, encoded proteins, or identification of the most useful cDNA clones;
we hope that the Gold set will be used as a building block in further studies.
But already, by analyzing the Gold alignments, we have made several
observations on completion, level of variation and quality of the genome: we
find that the July 2003 (build 34/hg16) genome is only missing between 0.5% and
1.2% of the gene containing sequence (it is at least 98.8% complete), and that
the average level of differences between the aligned cDNAs and the genome is
1.62 base per kb (99.84% identity). We observe a large interchromosomal
variability in the quality of the mRNA-to-genome match and identify in the
genome sequence a few errors leading to frameshifts in the coding regions. We
also point to potential structural problems, polymorphisms, or misassemblies in
the genome, chromosome by chromosome. We examine the exactly or almost exactly
duplicated alignments, where an mRNA aligns in two distinct genes in the genome
with less than 1 or 2 differences among the copies, and we show striking
chromosome-dependent variations in frequency and relative organization, in
tandem or in palindrome, of the repeats. An intriguing effect of the presence
or absence of introns in the two copies is also noted, and taken as a strong
indication that when a gene gets duplicated on a different chromosome, only one
copy remains transcriptionally active, unless maybe its sequence diverges. This
is not the case for closely linked copies, within 3.1 megabases of each other
on the same chromosome, which appear to both remain active. We then classify the
numerous rearrangements, close to 4%, that occur in alignments genome wide. We
analyze the length and properties of “introns” with non-standard boundaries and
propose that, together with variable copy number rearrangements, they
correspond to micro-insertion / micro-deletion polymorphisms, which affect 4%
of the alignments and close to 9% of the genes, a much more frequent occurrence
than previously thought. Finally, the main difficulties in the mRNA-to-genome
alignment problem are identified through comparison of the contributing
programs.
As
was hoped for, our cooperative approach significantly improved the resulting
alignment models over each method taken separately. Every program tested won
the Gold for at least some accessions, which implies that the current best set
can still be improved, and we welcome contributions of new alignments by any
interested party. The scripts that perform the Gold selection and analysis are
automatic. Researchers who contribute their alignments will see their best
results incorporated into the updated version, and will receive lists of
accessions for which their proposed alignment appears suboptimal. These lists,
ordered by specific types of defects, undoubtedly facilitate debugging. They
were transmitted to alignment contributors, and their exploitation is one of
the reasons why AceView strives in this analysis.
In this project, our
goal is precision, and we support all reported numbers by explicit lists of
accessions and additional details in the supplementary material online, or at our web site www.aceview.org/GOLD [1].
The annotation process is dynamic, and updated versions of the Gold alignments
will be made available on the web site as data and programs improve.
Results and
discussion
1: The selected cDNAs and genome,
and the contributing alignment programs
Alignments
were generated around February 2004, on the July 2003 genome (NCBI build
34/hg16); the mitochondrial genome was omitted. For the benchmark, we aimed at
cDNA sequences that would be homogeneous and of the highest quality, so we
selected the 74,106 fully sequenced mRNAs from four large scale ongoing cDNA
projects, available from the public databases (GenBank/EMBL/DDBJ) on February
10, 2004 (Supplement S1a).
These mRNAs represent more than half of the human mRNA accessions. They cover a
wide range of size, averaging 2,392 bp (after removal of poly-A and eventual
vector). 26 are below 200 bp (mostly from MGC and DKFZ) with a minimum at 38 bp
(BC033526); 34 are above 10 kb (all from KIAA and DKFZ), with a maximum at
14,491 bp (AB007934). According to AceView
[2], 71,740 clones with reliable Gold alignments nicely
cluster into 29,194 genes. 68% of these genes have introns with standard
boundaries, and 78% encode putative open reading frames (ORFs) of more than 100
aminoacids.
Using
the results of this work (supplement S1b), the cDNA projects
can be described as follows:
-
The Japanese KIAA
effort was the first to be launched, more than ten years ago (2,058 accessions;
[3,4]). This pioneer work aimed at
identifying the longest mRNAs, mostly from brain, that would support in
vitro translation of very large proteins. Indeed the average length aligned
in GOLD, 5,072 bp, and average coding potential 957 aminoacids (as measured by
the longest open reading frame of the genomic footprint of the cDNA, for 1902
clones with excellent alignments and ORF > 100 aa), are by far the largest
of all projects. The very best quality of alignments is also observed, with on
average 0.9 base mismatch per kb between the cDNA and the genome (Supplement S1b1).
Indeed, this library is the only one where the alignments evidence more
frameshifts or mutations in the genome than in the cDNA (S1c). The frequency of
rearrangements (0.8 per 100 kb aligned, S1b2)
and of non-standard introns (1.5 per 100 kb aligned, S1b3)
is also the lowest. These impressive figures highlight the utmost interest of
requiring in vitro translatability if one aims at protein-coding clones.
-
The Japanese NEDO Full-Length
cDNA sequencing project (FLJ) is a very large scale project using a remarkably
efficient method, developed by Sugano, Suzuki and Isogai, for cloning complete
mRNAs, from the capped first base to the poly-A [5, 6]. This project is unique in that it aims at faithfully
depicting the whole transcriptome, with no preconceived idea. For example, it
does not select clones for complete sequencing based on conditions on the CDS
or on the presence of introns. This collection contributed 29,923 cDNAs; the
average length aligned is 2,442 bp and average Gold coding potential 355
aminoacids (for 22,018 clones, with excellent alignment and ORF above 100
residues). The slightly elevated frequencies of mismatches and single base
insertions or deletions (1.9 per kb on average, S1b1), of
rearrangements (2.5 per 100 kb aligned, S1b2) and of
non-standard introns (4.9 per 100 kb when including partial deletions of the
insert, S1b3)
are probably due to the PCR amplification step used before cloning. Yet, this
collection detects the largest number of frameshifts in the genome (S1c).
-
The German DKFZ-based
cDNA consortium aims at the identification and biological characterization of
novel long cDNAs encoding complete ORFs [7]; it contributed
9,258 accessions. The average length aligned is 3,307 bp, and average Gold
coding potential 451 aminoacids (for 7,067 clones, with excellent alignment and
ORF above 100 residues), making it the second longest protein-producing set.
Like FLJ, it uses a PCR step before cloning, and has similar levels of single
base differences from the genome (1.8 per kb, S1b1;
3% of the cDNAs appear to contain a frameshift in the CDS, S1c),
rearrangements (1.5 per 100 kb aligned, S1b2) and
non-standard introns (2.6 per 100 kb, S1b3).
-
The American-led
Mammalian Gene Collection (MGC) effort was the last to be launched [8]. It does not focus on novel genes, but rather aims at
identifying one good representative encoding a complete CDS for each
protein-coding gene. The consortium submitted 32,859 human accessions to
GenBank. A stringently filtered subset, the MGC reference collection, includes
the best non-redundant 14,961 clones, matching the genome if possible without
structural anomaly or deleterious protein mutation. The average length aligned
is the shortest, 1,933 bp for the bulk and 1,970 bp for the filtered set; yet
the average coding potential is 423 aminoacids (for 13,921 ORFs above 100 aa in
the filtered set). As expected, the filter for choosing the reference clones
improves the quality of the match, from 1.3 to 1.0 mismatches per kb (S1b1),
efficiently discarding the numerous frameshifts from the collection (S1c);
it also reduces the level of rearrangements (from 1.6 to 0.7 per 100 kb
aligned, S1b2)
and of non-standard introns (from 2.7 to 2.3 per 100 kb, S1b3).
However, it discards some good protein coding genes not yet known to RefSeq
(108 complete CDS mRNAs with more than 300 aminoacids are listed in S1d).
Together,
the four cDNA projects constitute a valuable resource for the community,
because they aim at providing the researchers with a comprehensive collection
of clones that will ultimately represent the human transcriptome. The clones
also serve as a basis for gene annotation by public efforts, such as the new H-inv project [9-10] or the well-known NCBI RefSeq project [11] (S1d).
Yet
notice that the 21,542 RefSeq mRNAs (from August 2, 2004) identify only 16,832
mapped genes genomewide (some with alternative variants). That is because the
RefSeq collection has a representative mRNA/Protein for only 51% of the genes
from the Gold project (14,935/29,194 genes), more precisely 25% of the FLJ, 29%
of the DKFZ, 46% of the KIAA and 71% of the MGC reference clones. On the other
hand, 1,897 genes with a RefSeq do not yet have a representative cDNA in any of
the four collections. It appears that the otherwise excellent quality RefSeq
project is not yet comprehensive, but we may help by providing in supplement S1d
a selected list of 610 cDNAs encoding complete CDS proteins of 635 aa on
average (from 300 aa to 2,089 (D79992)
aminoacids), mapping at excellent quality with no anomaly in 587 genes that do
not yet have a RefSeq mRNA.
To
generate the Gold set, we gathered the alignments proposed by the following
groups
-
AceView [2] (Thierry-Mieg et al., unpublished; http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly): this program, developed at NCBI, is the only
one to our knowledge to use a co-alignment strategy; all public mRNAs and ESTs
are coaligned on the genome, the alignments are refined using information from
other mRNAs in the cluster, but no prediction input. They are then clustered
into transcripts and genes. The final output is a minimal set of alternative
transcripts synthesized from all of the alignments, as well as a refined
alignment of each mRNA. 4 of the 5 million of mRNA sequences in GenBank/dBest
(in build 34/July 2003) contribute to the reconstruction of the AceView
transcripts annotated on the AceView web site.
-
NCBI human gene
alignments available on the public site [12] from February
to August 2004 (version 34.3, www.ncbi.nlm.nih.gov)
-
UCSC Blat [13-14]; Blat alignments are the basis of
many of the tracks at www.genome.ucsc.edu
.
-
JBIRC/AIST, one of
the Japanese teams associated to the H-inv project; JBIRC alignments are used
as a basis for the H-inv annotation project [9-10] (http://www.jbirc.aist.go.jp/hinv/index.jsp)
-
the Japanese NEDO “Analysis of
Protein Function and Practical Application”,
also associated to the H-inv project ([6] and unpublished
results);
-
Exonerate, a program
developed by Guy Slater and Ewan Birney and used internally as an element in
the gene building process displayed at the EBI/Sanger ensembl site http://www.ensembl.org/Homo_sapiens
[15]. The ensembl
transcript models then add predictive elements, such as similarity to other
proteins.
NCBI,
UCSC and AceView made their alignments public, others kindly provided them
privately for inclusion in Gold; some groups contributed partial sets
(supplement S1e).
The test dataset, the format for
submission by interested parties and the public alignments, in that format, are
available online.
2: The GOLD selection procedure:
quality evaluation and choice of the best alignment
Evaluating
the quality of an alignment involves comparing the sequence of the cDNA to its
image on the genome and assigning a quality score to each alignment. Yet,
despite a clear need to respect a few simple rules derived from basic molecular biology, choosing the best alignment for each cDNA is a
difficult and somewhat subjective task, and we would like to explain how we derived our cost function. Of course we
did not use the function implicit inside AceView or any of the other
contributing programs. Rather, we started by favoring the longest possible
alignment and evaluated the match to the genome using the absolute
Smith-Waterman algorithm. Then, in an iterative process, we went back and forth
between visual inspection of hundreds of alignments and modification of the
cost function. Of special interest were the cases where a Gold alignment was
found by just one method. Some of those were unconvincing for reasons that
could be expressed (hence programmed) and we added progressively
different penalties. As the results of
the GOLD selection procedure were improving, we could address more and more
subtle problems and fine tune the cost function. We stopped only when the
overwhelming majority of the Gold alignments appeared well chosen.
The
principal elements involved in the calculation of the score are:
•
Scoring the
mRNA to genome match: We first
consider each proposed exon on the cDNA aligned onto its corresponding genome
segment and use the Smith-Waterman algorithm to count, in each exon, the single
base differences between the cDNA and the genome sequence. A “difference” may
be a single base deletion, a single base insertion, a single base mismatch or
an uncalled base (n) in the cDNA or the genome. We score 1 point per base
aligned and we substract 2 points per single base difference. We then consider
the set of proposed exons and substract 1 point per unaligned base between
exons, with a maximal penalty of 6 points per gap, de facto discouraging
microdeletions of less than 6 basepairs.
Note that the total score is dominated by and of
the same order of magnitude as the number of aligned bases that exactly match
the genome, because the average measured “error” rate is less than 0.2% and all
penalties described below are minor.
•
Rearrangements: Note that this way of scoring does not
constrain the geometry of the alignment. Indeed, rearrangements do occur in
about 4% cases, in the cDNA or in the genome: some arise from technical
artifacts, some from interesting biological reasons. To treat correctly these
frequent events, GOLD will prefer a solution aligning the entire cDNA clone to
a partial alignment, even if it violates the strict colinearity rule. For
example, mosaic clones, matching two different areas of the genome in
successive segments, or local alignments demanding an inversion, duplication or
deletion, for a segment of the cDNA or of the genomic clone, are acceptable. When this happens, we simply select the
combination of partial alignments representing the best coverage of the cDNA
with the minimal number of base differences and consider it as a single
Gold-split alignment. The parts may come from any method. The score of a
Gold-split alignment is the sum of the scores of the parts, rationalized by
multiplying by the overall length of aligned cDNA divided by the sum of the
length of the parts, minus 6 points. In this way, we favor a single alignment
over two split overlapping ones, yet a rearranged alignment wins over a
partial.
•
Intron
boundaries and splice site consensus:
By
respect for the spliceosomes, introns with standard boundaries gt-ag or gc-ag or at-ac are preferred
to introns with any other feet. Yet we do not forcefully impose standard
introns irrespective of the cDNA sequence: we substract only 3 points per non-standard intron. Thus an alignment
accepting a single base difference to splice out a standard intron is slightly
advantaged with respect to an error-less alignment using a non-standard intron.
But accepting two base differences locally to gain a standard intron is disfavored,
since the observed level of differences between genome and aligned mRNA, below
0.2%, makes this hypothesis farfetched.
•
Poly-A
addition site and vector sequence recognition: Most mRNAs are submitted to the public databases with a poly-A tail,
as expected, but a few thousand also contain short stretches of vector, which
is unfortunate. Both are problematic, because often, one of the programs will
manage to align these sequences as additional exons, extending the image of the
gene beyond its natural boundary, sometimes intertwining it with nearby genes.
It is therefore critical to precisely determine the part of the cDNA sequence
that was truly copied from the genome locally and needs to be aligned.
We developed tools to tentatively identify the
poly-A addition sites, the vector sequences on both sides of the insert, and
the eventual non-matching poly-A or poly-T on the 5’ side of inserts, and we
clipped these probably irrelevant sequences. This analysis was difficult and we
validated the ambiguous cases by hand (details and lists online, supplement S2a).
Once the 5’ and 3’ boundaries of the area to align
have been defined, we do not score points for aligned bases outside this area, but
keep counting the base differences. This means that a program aligning the
poly-A matching the genome locally is not penalized. However, we substract a
penalty of 3 points per first or last exon entirely or almost entirely outside
of the defined area, in order to avoid pure poly-A or pure vector exons in the
Gold.
·
Invalidating
dubious first exons: Some short
artificial sequences at the 5’ ends of cDNAs may still have escaped detection,
and be aligned untrustworthily. For example, we noticed a few hundred cases of
Gold alignments with relatively short 5’ exons and a concentration of base
differences 50 to 100 times higher than the rest of the alignment. To
discourage this, we count each single base difference twice in any first exon
shorter than 50 bases. Note however that this mild treatment does not eliminate
all first exons with visually dubious anchorage.
A related problem is the design by some programs of
very small first exons that appear to be random matches, since they are at a
distance from the next exon well beyond statistical significance. There are
numerous instances where for example a 4 letter first exon lies 30 kb 5’ to the
second exon. Among these are some justified cases where the short first exon
exactly matches the 3’ end of an exon of the same gene defined by another
clone. In fact, AceView intentionally uses co-alignment and favors such short
distant exons supported by other mRNAs, what we call “team jump”. But in order
not to advantage any program, we penalized all cases irrespective of this
argument. In practice, if nBp is the number of bases in the first exon, nDiff
the number of base differences in that exon and IntronBp the size of the
bordering intron, we demand that 4nBp-2nDiff-1 > IntronBp. We use
twice the number of base differences to mildly take into account the
combinatorial nature of mismatch positions. When the condition is not met, the
few aligned bases do not score points and we again penalize such cases by 3
points, because they tend to give a spider web image of the genes that is not
vindicated by the data.
Because of the softness of our criterion however, a
few tens of Gold alignments use multiple dubious short exons separated by very
long non-standard introns. But considering the low number of such instances, we
did not try to force their exclusion from the Gold set.
·
Extent on
the genome and number of introns:
We have not introduced any advantage or penalty for the number of introns or
the extent on the genome, because none of the methods compared so far badly
exaggerated the size and number of introns and still won. In fact, a rather
large number (3138 accessions, 4%) of the alignments lying in the same genomic
region and sharing the best score differ in the position of an intron, but the
intron is usually non-standard (“alternate” list in S2a). In just a few cases,
the alignments also differ in their extent on the genome. Then, by consideration
for the RNA polymerase and the times it takes to synthesize the premessenger,
we select for Gold the most
compact at equal score. If on the other hand the extent on the genome is the
same, we pick the Gold at random.
·
Coding
potential: We do not use protein
coding potential as a major component. Yet the regularization of intron
boundaries is not always innocuous: accepting a single base insertion/deletion
to favor a standard intron may disrupt the ORF. So, when different alignments
of a cDNA provide ORFs of different lengths, we give a bonus of three points to
the alignments with the longest ORF, so that they win in spite of the
non-standard intron.
The minor importance we give to the coding
potential is justified by the fact that most of the frameshifts or stop
mutations in the cDNAs are efficiently corrected in Gold by just maximizing the
DNA to RNA match: the average ORF length read on the cDNA sequence is 386
aminoacids, but it rises to 392 aminoacids on the Gold genome footprint (over
61,807 well aligned cDNAs with ORF > 100 aa). This observation confirms that
the genome sequence is usually a better reference than the cDNA, in particular,
it contains about 10 times less frameshifts: we identified and list in
supplement S1c 1,536
suspected frameshifts in the cDNAs and 171 in the genome.
The
cost function is implemented in the program GOLD, Genomewide Optimization of
Locus Description, written in C and relying on the acedb object
oriented database system [16] and its new AceC
programmer’s interface (Thierry-Mieg and Sienkiewicz, unpublished). The code, the schema of the
database, and the annotations, sequences, and coordinates of the Gold alignments are
available online.
The
selection method turns out to be both reliable and sensitive when applied to
the 95% alignments of good quality, matching with no rearrangement: we have
seen no example where a difference of 5 points does not distinguish the
visually best alignment in a sample of 300 alignments. Even a single point
difference is diagnostic in a majority of cases, although some are really
equivalent.
We
do not mean to say that all current Gold alignments are perfect: we have
already mentioned a few situations where we have doubts, for example we do not
penalize multiple non-standard introns, although some do not look convincing.
And a few tens of Gold alignments have annotated defects, such as a pure poly-A
last exon, yet the solution they propose to another problem makes them Gold in
spite of the defect.
Another
open question is that of very small exons bounded by two standard introns;
their size histogram in Gold may be found in supplement S2b. Such micro-exons
“look” good, but are they real? In the cell, for mechanistic reasons, there is
likely a low limit to exon size. The
literature provides a number of reliable micro-exons 6 bases long (e.g. [17]), but we found no convincing example
below 6 bp and to our knowledge, the minimal size has not yet been assessed
experimentally. 88
Gold examples proposing micro-exons 2 to 5 bases long could serve as a support
for such experiments. But in the absence of hard data, we chose not to penalize
very short exons either.
Altogether
we believe that the current choice of perhaps up to 1% of the Gold alignments
may be controversial, but this number should decrease in the future, because
Gold provides debugging tools while the contributed data allow incremental
refinements of the cost function.
3: The quality of the Gold
alignments reflects quality and completion of genome and cDNA sequences.
To evaluate the quality and completion of the human
genome sequence, we quantified the quality of the Gold alignments by measuring
the proportion of the length of the mRNA (as found in GenBank minus the vector
and poly-A) that aligns and the percentage of “single base differences” (either
a mismatch or a single base insertion or deletion) between the cDNA and the
genome in the aligned region. We then defined criteria to link these two
numbers to the reliability of the genomic map assignment, by using the knowledge
of single base difference rates we gained from hand-editing the 205,205
sequencing traces from the worm transcriptome project (Kohara,
Y., Shin-i, T., Thierry-Mieg, J., Thierry-Mieg, D., Suzuki, Y., Sugano, S.,
unpublished). The reference strain in C.
elegans is a true homozygote at the basepair level, and after manually
correcting the many errors due to the basecall programs, the worm cDNA
libraries contain between 0.08 and 1.2 mutations per kilobase, depending on the
library construction protocol. On the other hand, the genome has 0.05 confirmed
mutations per kb, as measured by sampling 6 Mb of sequence covered by multiple
independent mRNA sequences, all with excellent quality sequencing traces
locally. If we add room for the polymorphisms expected in human sequences, we
consider that an alignment with less than 4 differences per kilobase is
certainly reliable in terms of genomic map assignment, and up to 15 per kb
acceptable for a region with more variations. Above 15 per kb, we have doubts:
either sequencing is of bad quality, or the gene is especially polymorphic, or
the chromosomal region is incorrectly assigned. Perhaps the correct region has
not yet been sequenced. Above 30 differences per kb, the alignment is unacceptable.
Figure 1 defines the terms excellent, good,
partial, dubious and bad, as we use them throughout the article, and shows the
partition of the Gold by those criteria (lists and maps are in supplement S3a,b,c,d).
-
Excellent and good
alignments match the genome over
their full length and with few base differences. With 72,198 mRNAs, 97.43% of
all mRNAs, they are by far the largest groups. To be precise, the average percentage
of length aligned for this major class of accessions is 99.98% and the combined
average sequencing error, mutation or polymorphism rate in the cDNAs and the
genome is 1.62 per kb (276,340 base differences in 170,090,311 bp aligned).
This confirms the high degree and quality of completion of today’s human genome
sequence (NCBI Build 34/July 2003).
Still, the genome sequence is not devoid of
mutations or sequencing errors. For example, we provide in S3e a list and
map of 171 cases where the protein on the cDNA is longer than that on the Gold
genomic image by at least 100 residues. Using the AceView database [2], it appears that more than half of these cases are
mutations or sequencing errors in the genome (single base insertions or
deletions, or nonsense mutations) confirmed by all clones sequenced in the
area, including large numbers of ESTs. For example, there is a single base
insertion or deletion in the current genome in genes INPP5B (AK022846), PLCD3
(AK074240), PKP4 (BC050308) or XTP2 (AB029019), affecting nucleotide doublets
in each of these cases.
Interesting polymorphisms in the genome sequence
are also observed. For example, the gene GLULC (AK122784), possibly involved in
the regulation of the level of the neurotransmitter (and aminoacid) glutamate,
bears in the 5’ region of the gene a G(20%) or C(80%) polymorphism, leading to
a Stop versus a tyrosine. The most frequent allele, found in 80% of the cDNAs,
is predicted to have a considerably longer form of the protein than the allele
represented in the reference genome of July 2003.
In other puzzling instances, a unique indel
polymorphism in an mRNA allows specific alternative variants that should
include an early stop and might be degraded by the mRNA surveillance machinery
to become open up to the last exon, hence probably stable as messenger RNA.
Genes SLC35A1 (AL122119), encoding through alternative splicing a large family
of nucleotide sugar transporters, or SUMF2 (AL050037), encoding multiple forms
of the sulfatase modifying factor 2 are good examples. Such variations being of
rare occurrence, one may wonder whether they represent rare RNA polymerase
errors or modifications by RNA editing that make the transcript translatable
thus stable and observable, or if they represent rare genomic
polymorphisms. More experimental work
will be required to answer this question.
-
The partial
alignments come from 907 mRNAs
that align well, but incompletely. Upon examination, a good proportion cluster
in specific loci, suggesting that they highlight defects in the genome.
As expected, some mRNAs border a known genome gap
(e.g. AL831825 on chr2, BX647560 from gene ABCA13 on 7, AL117478 from gene AGS3
on 9 or BX648928 from gene PCCX2 on 12). But even more frequently, in at least
181 accessions from 128 genes, partial alignments uncover a deletion in the
genome.
For example, in locus GRIPAP1, mapping in a
finished area of chromosome X, the 5’ ends of mRNAs AB032993, BX647801,
BX640704 fail to align over 757 to 1018 bases. Although the unaligned sequences
are not found in any genomic clone in GenBank, they form a consensus which is
furthermore conserved in the homologous genes Grasp1 in rat and Uqcrc1 in mouse
(where the gene encodes ubiquinol cytochrome c reductase core protein 1): this
human sequence is artificially missing from the July 2003 genome.
In another example, gene VPS13D on chromosome 1, a
large genomic piece is missing right in the middle of the gene. This leads to
both partial alignments of 5’ ends of clones (AK057675 aligns from base 557 and
AL833079 from base 199) and to large internal sequences unaligned (1553 bp in
AK125118, 1435 in AB007922 and 1363 in BX641035). The problem will probably be
solved in the next genome build, since the Sanger Institute submitted in
November 03, after build 34/July 2003 was out, a genomic clone BX682532 that
contains the missing piece.
Finally, some deletions arose in the course of
building the reference genome, possibly due to a suboptimal choice of the
genomic path across the numerous overlapping genomic accessions. An example is
the first 122 bp of AK001244, which are missing from build 34, yet are present
in genomic accession AC115284, partly used to build the reference genome just
upstream of this gene.
Altogether,
we estimate that more than half of the partial alignments are due to deletions
or missing pieces in the genome. Other causes of partials are unrecognized
vector or other irrelevant sequences and rearrangements.
-
The dubious, the
bad and the unaligned: 957 mRNAs
give unreliable alignments and genome map assignment and are ignored in all
subsequent analyses. Another 44 mRNA accessions are still unmapped by all
programs that tested them. For many of these, the genome is probably still
missing. Two classes of poorly aligned sequences may be recognized:
•
543 mRNAs
align over their entire length but with many base differences, usually spread
over the entire sequence length. One third of these come from 15 highly
polymorphic genes encoding the hypervariable immunoglobulins and
histocompatibility complex proteins HLA/MHC. About a fifth appear to be
pseudogenes or members of multigene families for which the real gene has not
yet been sequenced (e.g. tubulin BC019829 or alpha tryptase BC028059). A few
others raise a contamination warning for the Alu repeat and some may result
from small batches of bad quality sequencing. 26 in this category may be
excellent, but the coordinates of the proposed alignments could not be
correctly decoded.
•
Most of the
remaining 414 mRNAs have a partial match, usually with a high level of
mismatches. Some 75 good but very short matches may indeed turn out to be solid
anchors, but they will need confirmation on future genome builds.
In
conclusion, if we add all the weak map data so far, we find that at least 0.4%
but at most 1.2% of the human genome bearing expressed genes is still missing
in build 34/July 2003.
4: Multiple alignments point to
chromosome-specific gene duplication control mechanisms.
About
1% of the mRNAs align well, but in multiple non overlapping regions of the
genome: their map is ambiguous, because they belong to repeated, or paralogous genes
that did not yet diverge. To analyze this type of events, we considered the
exact duplicate alignments of excellent quality, and also included the quasi
exact duplicates, with score within 5 points from Gold, for example with 2 base
differences. These criteria select 627 mRNAs producing 1,379 duplicate
alignments, 763 exact and 616 nearly exact (details and lists in supplement S4a). The sequence
similarity between both copies is on average 99.97%, with a minimum of 99.5%
for some ribosomal pseudogenes.
34%
of these apparent duplicates are disregarded because they simply reveal a
transient state in unfinished areas of the genome. Indeed, when conflicting
situations arise during assembly, some fragments are left loosely associated to
the chromosome, as “randomly mapped” contigs. These often contain duplications
that get merged in a later genome release. Supplement 4b shows the
percent of such alignments per chromosome, and provides a measure of completion
of the chromosomes’ sequence: the least finished chromosomes would be, in that
order, 9, 1, X, 17, 2 and 8 (in build 34/July 2003).
On
the other hand, provided the finished genome sequence can be trusted, we may
hope that the remaining 908 duplicate alignments, where both copies lay outside
of the random contigs, are real. Their main properties are summarized in Table 1: 88% have both copies on the same chromosome
(intrachromosomal), rising to 95% if we restrict to exact duplicates.
Localization of the duplicated genes on the genome (Figure 2 and supplement S4d for exact
duplicates) shows a striking regional heterogeneity: duplications tend to
cluster in a few megabase-sized fragments, reminiscent of what has been
described, from human and mouse genomes comparison, for single base variations,
specific rearrangements and meiotic recombination [18]. Possibly
as a result, some chromosomes have no duplicates (18 and 21) or very few (3,
14, 6, 20, 12, 11, ordered by percentage of duplicate alignments), while others
have very many (Y, X, 16, 5, 15, 7, 10, 8, 2, in decreasing order). Note the
absence of clear dependency on chromosome size.
The
sexually dimorphic chromosomes are especially rich in duplicate genes, in part
because of the pseudo-autosomal region, which consists of two telomeric genomic
segments 2.2 Mb and 200 kb long highly similar between X and Y and represented
as identical in build 34. But even outside these regions, the X and Y have a
propensity for duplications [19], affecting 7% and 36% of
their alignments respectively. This might be advantageous for dosage purposes
and for protection against deleterious mutations in males.
The
few inter-chromosomal duplicates are special in two ways, both indicating that
very few of these remain functional (Table 1). First they
include 84% divergent pairs in which only one member is expressed, to be
compared to 46% for the intra-chromosomal.
Second they include a large number of retroposon-like pseudogenes, in
which the Gold has standard introns and the copy does not. Such retroposons are
known to result from random insertion in the genome of the retro-transcription
product of a mature mRNA. Pairs with typical retroposons account for 8% of the
duplicates; they affect a dozen highly expressed genes, such as those encoding
ribosomal proteins or histones. And because reinsertion is random, all such
genes have their intronless pseudogenes landing on a different chromosome.
In
contrast, duplicates with introns in both copies are extremely rare among
interchromosomal pairs. The only cases, mRNAs BC015119 and AK056232, align as
almost exact repeats twice at the telomeric end of a chromosome, respectively
14/22 and 15/1. They might possibly correspond to genome assembly errors, or
else be living examples of very recent telomeric born duplications:
experimental assessment would be especially appreciated.
The
remaining interchromosomal duplicates belong to intronless pairs of genes.
Because they occur in the right proportion (12% of intronless pairs, versus 12%
of intron-noIntron pairs relative to intron-intron) and have started diverging
at the same rate as the pseudogenes, we would like to propose that some if not
all of the intronless interchromosomal duplicates are retroposons of intronless
genes.
88%
of the duplications are intrachromosomal (Table 1). The
462 pairs where both members have introns and the 255 pairs where both are
intronless have similar properties. More than half are exact, in line with the
idea that both genes are expressed. The duplicate genes have a strong tendency
to be in close proximity: 81% (581/717) occur within 1 megabase of one another.
The median is 230 kb: half of the pairs are within that distance, ranging from
4,860 bp (BC029540, gene LOC255313, AceView gene koby) to 25.2 Mb (BC005089,
gene ANAPC1). If we restrict to exact duplications where both copies have
introns, the median drops to 180 kb, ranging from 4,860 bp to 3.1 Mb (BC026100,
gene TTTY8).
On
average, duplications tend to be in palindrome slightly more often than in
tandem (312 palindrome versus 269 tandem), inward and outward palindromes occur
at the same rate overall (151 versus 161). Again, interestingly, there are
clear large scale regional variations (supplement 4c),
emphasized by the fact that chromosome 8 has 31 tandem in 8 genes and no other
type of duplication while chromosome X has ten times more palindromes than the
average chromosome (62 inward and 33 outward), but only twice as many tandems
(19). This ability of the X chromosome to undergo close palindromic
duplications is largely responsible for the increased frequency of repeats on
this chromosome. It is tempting to imagine a mechanistic link between the
frequent palindromic gene duplications and the phenomenon of
X-inactivation.
Clustered
gene duplications are usually interspersed with genes that are not duplicated
(by our standards). The number and position of introns is identical in the
copies, but intron sequences are conserved to a lesser extent. Most of the
repeated genes are simply duplications. However, as expected, the multiply
repeated ones tend to be in tandem, since this arrangement, through paralogous
recombination and sister chromatid exchange, allows increases and decreases of
the copy number. The record so far is held by BC029540, which maps in a 52 kb
cluster of 10 close tandem repeats on chromosome X, 8 exact and 2 with 1 base
difference. The cluster is supported by 12 other GenBank or dbest entries (gene
koby in AceView). It is expressed mainly in testis and encodes a family of
human specific putative cytoplasmic proteins.
The
pattern we observe for duplicate genes that barely diverged: a strict
inter-chromosomal barrier which prevents genes with introns from being
duplicated on different chromosomes, added to the tendency for duplicates to be
closer than a few megabase apart, contrasts with the pattern described for
segmental duplications of genomic fragments: those occur at comparable levels
inter (76 Mb) and intra (95 Mb) chromosomes and are often far away [21, 22, 23]. A main
difference is in the choice of the minimal degree of sequence identity: 99.5%
here (average 99.97%) versus 90% there, and in the genome length we
interrogate: 18.8 kb here (the average extent of these 1.9 kb mRNA alignments
on the genome) versus 1 to 10 kb there. Our functional annotation also seems
different (supplement 4e):
we find many duplicate genes encoding RNA/DNA binding factors, much like the
mouse repeated genes [23], but we do not identify
multigene families with conserved motifs because they are not identical all the
way and are too divergent. Yet, as expected, a majority of the duplicate genes
fall within segmental duplications. Our interpretation is that genes that
belonged to far away segmental duplications have diverged, while duplications
physically closer on the chromosome, in the range of 5 kb to 3000 kb, could be
mechanically acted upon by gene conversion or other mechanisms maintaining them
identical or nearly, while often remaining both active.
There
are legitimate concerns that this entire analysis rests on the hypothesis that
the finished chromosomes sequence did not merge or split the repeats during
sequence assembly. It is clear that the clone based physical map approach taken
by the publicly funded human genome project [20] is
more secure than a whole genome shotgun approach. The mapped-clone approach has
been validated, in the case of the nematode, by thousands of careful C. elegans
researchers. Another reassuring observation is that dramatic variations in
frequency of duplicates have been observed by at least four large scale genome
sequencing centers, which produced some chromosomes very rich in repeats and
some deprived of those (Sanger: chromosome X versus 6, Washington University:
chromosome 7 versus 4, Whitehead Institute: chromosome 15 versus 18 and
SHGC/JGI: chromosome 5 versus 19). So it is probably correct to assume that the
differences among large megabase-sized regions in the chromosomes do not
reflect technical artifacts in the genome assembly, but instead indicate a regional
specific control mechanism on the number, orientation, organization and
clustering of the identical or barely divergent duplicated genes. Whether the
control is positive or negative, acts on the generation of duplications,
prevents their accumulation, actively removes duplications, or controls
mutation rate is not known. Potential actors include replication, repair,
conversion or recombination, chromosome breakage, sister chromatid exchange and
selective factors such as those that control the length of the chromosome or
the level of expression. But Figure 2 shows
that whichever mechanism controls the genes duplications has to be remarkably
chromosome-specific and act at a large molecular scale, in megabase-sized
segments, akin to position effect variegation and X inactivation.
5:
In
most instances, the paradigm linking DNA to its mature RNA copy is verified,
and the entire length of the cDNA clone aligns locally on the chromosome in a
collinear fashion. But a surprisingly large number of mRNAs, almost 4%, do not
follow this rule. While their entire length may ultimately be aligned at
excellent quality, this requires splitting the sequence of the mRNA into
separate fragments or aligning in a non-collinear fashion.
Rearrangements
may be generated in silico, in vitro or in vivo. Genome assembly, the
construction of genomic or cDNA libraries, the tissue or individual origin of
the samples can be a cause. For example, some cDNA libraries come from tumor
tissues in which reorganization of the genetic material is known to occur.
Possible sources of rearrangements also include remodeling of the genome and
structural variations across individuals. Some genomic areas occur in variable
Copy Number (CNP) in different individuals [24], and any
CNP integrated in a chromosome should be bounded by rearrangements. Rearranged
cDNA alignments may also be useful to spot defects in the genome sequence. But
they are currently among the most difficult to pinpoint and benefit most from
using simultaneously the alignments generated by all methods. UCSC and the
Japanese NEDO “Analysis of Protein Function and Practical Application” (unpublished data) are especially good at
providing those, while most other groups simply forbid it in the current
version of their program.
We
classified the rearranged alignments of excellent, good or partial quality and
distinguished six cases, each defined by a particular geometry (Table 2; lists of accessions are in supplement S5a):
•
Mosaic: For 1,197 mRNAs, the split fragments align best
in distant places in the genome, at least 300 kb apart, a property one might
expect for mosaic cDNA clones (also called chimeric), in which the insert is a
concatenation of more than one mRNA molecule, possibly because reverse
transcribed copies of two mRNA molecules were ligated artificially during the
cloning procedure. A proportion of mosaics could be due to genome
rearrangements, misassemblies or genomic clone mosaicism. Alternatively, some
may represent rare events of trans-splicing, although none has yet been documented
in human.
•
Inversion: For 128 mRNAs, the split fragments align in the
same area of the genome within 300 kb, but on different strands, suggesting an
inversion in the cDNA or the genome. Inversions typically require two
breakpoints, enough to define the segment to be flipped (in the chromosome or
the cDNA).
•
Transposition: For 293 mRNAs, the split fragments align in the
same area of the genome, within 300 kb, and on the same strand, but the 5’
fragment lies downward of the 3’ fragment. This is known as a transposition in
classical genetics. Transpositions require 3 breakpoints, two to define the
segment that will transpose and one for the insertion site. Therefore,
transpositions are expected to be less frequent than inversions, but we observe
they are more than twice as frequent. One possibility would be that there is an
active motor or mechanism for such events and that pieces of genes can
physically move close by in the genome, much like transposable elements.
Another would be that the event occurred before ligation of the insert in the
vector, during library construction, then a single breakpoint would suffice.
And of course, genome assembly errors could also lead to transpositions.
• Suspected genome deletion: For 283 mRNAs, the alignment is monotonous
along the genome, yet a piece of the mRNA fails to match the genome and sticks
out unmatched, often in between two exons but sometimes within a single exon.
These cases highlight putative deletions in the genome, or possibly for some of
the 10 shorter than 35 bp, a length polymorphism. Examples supported by
multiple clones were given in the discussion of partial alignments. The average
length of unmatched cDNA is 176 bp, and the genome deletion is of course at
least as large as the unaligned cDNA stretch. Each program taken separately
seldom misses an exon of 12 bp. So, to be secure, we only flagged genome
deletions longer than 18 bp, ensuring that it could not escape detection by at
least one program, even if the exon contained one or two base differences.
•
Variable
tandem repeat number: For 893
mRNAs, originating from 704 genes, or about 3% of the genes (see map in
supplement 6c),
successive fragments of the sequence align in overlapping areas, a region on a
given strand of the genome being used more than once, in tandem, in the cDNA.
These very frequent events could result from artificial tandem duplication in
the cDNA, a mechanistically difficult event, or more likely from partial
deletion of tandemly repeated sequence in the genome. They are probably
associated to natural copy number polymorphisms affecting tandem repeats. As
expected, the longest forms of such variants are enriched in the two libraries
that chose to isolate the longest mRNA variants: KIAA (3.0%) and DKFZ (1.7%).
In addition, the localization of the event in the gene is not random: it
affects almost systematically the 3’ or 5’UTRs, as discussed in paragraph 6.
•
Pseudo-split
or mRNA bridging two correctly ordered and oriented contigs: 36 such mRNAs were recognized. For about half, the
two parts of the alignment lie in separate genomic contigs, but the genome
assembly positioned them correctly in the scaffold. For the other half, the two
alignments could be connected with no geometrical anomaly on a single contig,
but the best scoring alignment was discontinuous.
6: Splice usage in Gold
alignments: the frequency of structural polymorphisms and intronless genes is
unexpectedly high.
Let
us now analyze the splicing pattern of the human genes (supplement 6). Limiting our analysis to the excellent Gold
alignments, we find that only 76.5% have standard introns, 21.6% are intronless
and the remaining 1.9% have only non-standard introns, micro-rearrangements or
alignment gaps. The frequency of non-standard introns, 1.23%, is high, and
affects 7.1% of the excellent alignments. These findings were unexpected, and
we will discuss successively the standard introns types and size distribution,
the non-standard introns and their relation to micro-rearrangements, and the
intronless genes.
An
excellent Gold with introns has on average 7.9 standard introns per alignment.
Examination of the types of intron boundaries shows that most of the 406,003
introns follow the consensus rule for splice junctions, a feature only slightly
favored by our cost function: 97.52% have standard gt-ag intron boundaries,
1.08% the gc-ag variant and 0.13% (534 cases) the at-ac boundaries known to be
spliced by the minor U12 spliceosome. As we had shown before on a smaller
sample [19], the length distribution of the standard
introns is far from normal (Figure 3a): it
starts sharply at 65bp, peaks in the 81 to 95 bases range, decreases to around
160 bp to end in a long flat tail driving the average intron size to 5,372 bp.
Another
remarkable feature in Figure 3a is the
distinctive size distribution of the non-standard introns, which peaks in the 5
to 8 bp range and has very little overlap with that of the standard introns.
Indeed 86% of the introns below 65 bp are non-standard, and the others are
probably standard by chance, because the precise position of an intron may
slide and our cost function favors a solution using one of the three standard
intron types.
How
could we explain the large number of non-standard introns? We propose that at
least the 33% short ones, below 65 bp, are micro-deletions in the cDNAs or
micro-duplications in the genome, akin to the micro-deletions in the genome
that are associated to “variable tandem repeat number” rearrangements. In
support of this idea, we observe that the non-standard introns contribute most
to the 3,139 ‘alternate alignments’ (supplement S2a), where programs give
different splicing structures with the exact same score by choosing different
coordinates for the non-standard “intron”. These non-standard pseudo-introns can
slide because of the repetitive nature of the underlying sequence. Direct
sequence analysis of the short pseudo-introns confirms that one third are pure
dinucleotide repeats, mainly and almost exclusively GT, TT, AC, AA, AT (in this
order); longer repeats are seen in others.
In
fact, within an array of genomic repeats, reciprocal events such as unequal
recombination will lead symmetrically to both micro-duplications (“variable
tandem repeat number”) and micro-deletions (very short introns). Events generating micro-deletions are less
constrained in sequence relative to the cDNA alignment, because they do not
need to be exact tandem repeats, so they should appear more frequent. Figure 3b plots the two types of events on the
same diagram, and strongly suggests that a majority of short non-standard and
standard “introns” do not result from splicing, but rather correspond to the
expected micro-rearrangements. These micro-introns would have no reason to
follow the splice rules, except by chance. Interestingly, if we restrict the
analysis to the intron lying in the last exon, therefore usually in the 3’ UTR
region (supplement 6b),
we observe that micro-introns and
variable tandem repeat number rearrangements are ten times enriched there
relative to any other position in the transcript. On the other hand, the
prevalence of genes harboring one or the other type of event is impressively
high, 8.7% (2,535 genes out of 29,174)!
We
propose that the micro-rearrangements (micro-duplications and microdeletions)
represent mostly natural structural polymorphisms. Polymorphic variations in
number of tandem repeats, for example di or trinucleotides, are known to occur
naturally. Their biological importance is exemplified by their implication in
many genetic diseases (at least 15 so far). As we confirm, the majority occur
in the transcribed untranslated gene regions, where they could play a role in
regulating transcription level or messenger stability or translatability. One
mechanism recently elucidated involves pairing to micro RNAs, as was first
described in the nematode (e.g. [25]). Another interesting
possibility is silencing mediated by the HP1 protein, which also participates
in position effect variegation [26].
A
second important factor generating non-standard introns is partial deletion of
the insert, which according to AceView accounts for at least half of all
non-standard introns of all sizes. These are the predominant cause for
non-standard introns with feet in two separate exons, and generally for
pseudo-introns in coding regions. Partial deletions of the insert commonly
occur in the bacterial host and may be selected for, because bacteria with
shorter insert clones often have a growth advantage. To identify such putative
insert deletions, we use the AceView human genome database including all mRNAs
and ESTs [2]. In that database, if a non-standard intron is
supported by a single cDNA clone and both its ends fall within exons of other
transcript variants, we flag the pseudo-intron as a suspected internal
deletion. These annotations transfer back to 2,634 Gold cDNAs, leaving at most
0.4% non-standard introns unaccounted for. A few of those might truly
represent novel splice junction usage. Indeed, in the worm C. elegans,
we have definitive evidence that non-standard splice junctions are used for a
number of well characterized genes, in an alternative or constitutive way, a
fraction of the junctions, but not all, being single base variants of gt-ag
(Kohara, Thierry-Mieg, Sulston et al, unpublished).
The
third puzzle is the high frequency, 21.6%, of mRNAs that align over their
entire length at excellent quality but without introns. Some could be insidious
contaminating genomic fragments that we tried to identify as follows. We look
for the absence of introns, the lack of a substantial ORF (less than 20% of the
length of the corresponding AceView transcript, eventually lengthened by
concatenation with other mRNAs and ESTs), the presence of a 3’ A-rich stretch
(more than 12 A in a row in the genome just 3’ to the mRNA) and the absence of
a polyadenylation signal in the correct position (AATAAA or any single base
variant in the last 35 bp of the transcript). We demand in addition that the
transcript does not belong to a gene with a locusID (it would then more likely
be an unspliced by-product of transcription) and that it be supported by less
than 5 clones in the complete database, four times less than the average
comparable transcript in the current AceView. Finally, we rescue the few where
the small putative protein has homologies to other known proteins, or has
special features that make it a candidate for being a real protein, such as a
signal peptide or a transmembrane domain. Altogether 589 accessions (0.8% of
mRNAs aligned at this quality) have been flagged as possible genomic
contaminations: chances are that they are primed on A rich regions of traces of
genomic DNA contaminating the RNA preparation rather than on actual poly-A in
the reverse transcribed mRNA. Some are found in all unfiltered libraries (List
in S1b).
This
leaves us with 20.7% mRNAs with intronless alignments. In AceView, we observe
that 45% of these actually belong to genes with introns: 27% appear to be
partial and encode only the last exon, and 18% represent intronless alternative
variants, usually interpreted as immature transcripts. The remaining 7,599
intronless cDNAs belong to 6,388 intronless genes, which represent 24% of all
genes with excellent Gold alignments, an intriguingly high percentage that
should trigger new experimental studies.
7: Chromosomes differ in their
frequency of single base mismatches and rearrangements.
The
72,198 excellent or good Gold alignments offer a sensitive probe to measure the
mRNA-to-genome match, providing us with more than 170 Mb of high quality cDNA
sequence aligned on finished genome sequence.
To
our surprise, significant interchromosomal variations occur (Figure 4a; supplement 7) when we
monitor the 276,340 single base mismatches, insertions and deletions identified
using Smith Waterman. Chromosome X has the best mRNA-to-genome match, with on
average 0.5 differences per kb fewer than chromosomes 21 or 22. Chromosome 21
is the most deviant of autosomes, with 0.36 differences per kb in excess of the
best matching autosome, chromosome 11. If we accept that mutations or
sequencing errors occur in the current genome sequence at a constant and low
rate across chromosomes, of the order of 0.05/kb, and if we take into account
the fact that, although there are mutations or sequencing errors in the cDNA
sequences, these should not a priori depend
on the genome map position of the gene, then most of the variations evidenced
in Figure 4a should reflect true
interchromosomal differences in the level of single nucleotide variations.
Our
focus here is exclusively transcribed genes, yet our results are in line with
previous observations showing the existence of large scale regional neutral
mutation rate variations [27, 18].
For example, it is well known that mutations occur more frequently in the male
germ line and we observe the expected bias, X < autosome < Y, despite the
diluting effect of the pseudo-autosomal region. We also confirm that
chromosomes 21, followed by 22, 18, 19, 7, 8 and 4 have the highest single base
variation rates among autosomes.
Similarly,
the frequency of structural rearrangements (Figure
4b) shows mild variations per chromosome; once again the X chromosome has
fewer rearrangements. Our hope is that the lists and maps of accessions
(supplement 7a2)
might become useful, in the ultimate phase of verification of the genome
sequence, to distinguish between local variations reflecting natural
polymorphisms [18] and defects during cloning or genome
assembly.
Note
that initially, chromosome 14 seemed to be among the worse for both criteria,
but that was in large part due to the 1,100 kb highly expressed immunoglobulin
region, known to be hyper variable and somatically rearranged. Thus we decided
for this analysis to identify and mask all distinctly highly polymorphic loci,
this region on 14, the immunoglobulin region on chromosome 2 (450 kb), some of
the HLA genes on chromosome 6 (a total of 550 kb), and the immunoglobulin
region on chromosome 22 (1000 kb) (lists in supplement 3b). Altogether
379 accessions map best in these regions, 52 at excellent quality, 60 at good,
22 at partial, 105 at dubious and the remaining 140 at bad quality.
8. Identifying the problems
through alignment programs comparison.
Programming is
difficult: the human genome is large and requires parallel processing and
multiple ancillary data files, and computers are temperamental. As a result, it
may happen that some mRNAs, some genome fragments, or some intermediate results
get lost without being noticed. Such problems, often referred to as bookkeeping
errors, are uninteresting but they have a large impact on the quality of the
results. A clear advantage of our collaborative approach is that we are fully
immune to bookkeeping errors, because they are not reproducible across laboratories.
The second problem is
the existence of programming bugs which only become apparent in marginal and
rare conditions. They are hard to spot. For example, there may be a stretch of
NNN in the middle of a quasi finished piece of genome, and this complication
may have been overlooked. The alignment problem is so complex that there are
many such pitfalls. In our approach, these wild events can be identified
because they are not reproducible across methods.
Then come the more
conceptual limitations. Each program implements its idea of the gene, and by
comparing the different programs, we compare these implicit models. For
example, most of the introns are bounded by gt-ag, but a few are bounded by
gc-ag or other variants. If a program forces gt-ag in all cases, some of its
alignments will not optimally match the genome. Another example is that of
exactly repeated alignments: if a gene maps in two loci, this should be made
explicit because it has important biological implications, yet many programs forbid
it. By combining all programs, we increase the total number of gene copies!
Another case is that of the rearrangements, which are frequent. They break the
central rule of colinearity and are hard to balance. If one allows breakpoints,
one may allow too many. Other challenging features include long genes extending
over 800 kb, introns that are either long (>300 kb) or very short (from 6 to
12 bp), and micro-exons (below 6 bp): we provide lists of accessions with
examples of these difficulties in supplement 8. Comparing independent
alignment programs reveals the excess or the lack of desirable features.
In Gold, we tuned the
cost function not to match our preconceived idea, which we had implemented a priori in our own code, but to instead
match the collaborative best picture which came from the union of all codes.
Namely, we paid great attention to the cases where the programs disagreed. This
helped deepen our understanding of the human genes.
Another benefit of the
comparative approach is to allow the identification and classification of
examples where a program is suboptimal. To illustrate the procedure, we give in
supplement 8
the comparison between the three best programs: AceView, UCSC and NCBI. Of
course, programs are in a rapid phase of improvement, so the snapshot shown in
the tables reflects the period from February to August 2004 and will certainly
evolve. Currently, AceView is the most precise and finds the Gold alignments
most frequently (92.98% Gold, or 97.75% within 5 points from Gold) while UCSC
is the most exhaustive and gets the largest proportion of alignments close to
Gold (91.68% Gold, 97.95% within 5 points from Gold).
If
we consider only the cDNAs for which at least one program provided an excellent
Gold alignment, any program doing less well missed a subtlety or had a bug.
When we compare the second alignment to the Gold reference, the mRNA either
maps in the same locus but with a different proposed exon structure, or maps in
another locus, or is not aligned at all. In the first case, we can be more
specific. There may be too few or too many exons or the exon boundaries may be
chosen differently, and it is instructive to separate the first, central and
last exons. As expected, all programs have some errors in each category, with
some biases (supplement 8b). We
reported these detailed lists of putative bugs to the authors of each program,
who usually found them useful because they facilitate debugging. This helped us
in particular to significantly improve the quality of the AceView alignments
for build 35.
In
turn, the few cases where the Gold choice seems disputable are used to improve
the cost function, and we need feedback to the bug reports! This is why we
conceived Gold as a benchmark: we hope to receive further contributions from
any interested party. If people find this approach interesting and send new
alignments, we will incorporate them in Gold if they win, and will send
privately the debugging lists. The annotation process is dynamic and updated
versions of the Gold alignments will be made available as data and programs
improve. We hope that, using this work, all mRNA to genome alignment methods
will soon converge toward a clear and solid picture of the structure and
organization of the human genes.
Conclusions
Our
long term goal is to understand the human transcriptome, its structure, its
polymorphisms, its expression, its regulation and its functions. This work focuses on the critical, but not
yet consensual first step: the mRNA to genome alignment. The GOLD cooperative
approach presented here chooses the best alignment from a collection of
programs, including the most prestigious, on a cDNA per cDNA basis. The
solution to different problems requires different, sometimes incompatible
strategies, but in Gold we benefit from the combined brainpower behind all
programs. Gold is therefore empowered with hybrid vigor, and provides better
alignments than any program taken separately.
We
selected 74,106 cDNAs corresponding to the four large scale human transcriptome
projects and annotated some basic properties of their Gold alignments; we hope
that these more precise alignments will be used by others for in depth
analyses. We are of course open to collaborations.
In
the course of the analysis, we tentatively identified “contaminating” sequences,
vectors, poly-A or others (supplement 2), which tend to be
aligned on the genome as spurious supplementary exons. We would welcome
comments and corrections on these lists. At the same time, we actively looked
for trans-spliced leaders and confirm that there are none in human: a majority
of clones in the oligo-cap selected FLJ collections start at the capped first
base of the transcript, and would undoubtedly evidence recurrent 5’ additions,
yet the only extra 5’ sequences routinely found in these libraries are poly-A
stretches, which occur in 1.8% of the oligo-capped cDNAs. The significance of
the 5’ poly-A additions is unknown.
The
lists of possible defects in the cDNAs may help select the best clones for use
in future experiments: researchers interested in using the cDNA clones for
protein experiments should prefer clones with an identical CDS on the cDNA and
on Gold, and be cautious with clones evidencing frameshifts in the genome or
with rearranged clones. They should be aware of ambiguities in the genomic map
position in the case of duplicated genomic loci, and all these details are
explicitly annotated in Gold. On the other hand, we have not yet labeled the
best representative clone for each transcript variant, annotated the
completeness of its CDS, categorized single base differences as putative
polymorphic variations or suspected mutations in the clone, or identified an
eventual match in the RefSeq collection, because all these annotations require
mRNA and EST clustering and co-alignment, something we only do in AceView so
far.
However,
we have gathered some elements on the relative levels of natural polymorphisms
and cloning mutations or sequencing errors. In fact, one of the libraries,
KIAA, has the least differences from the genome, with only 0.91+/-0.02 single
base difference per kb, and its alignments uncover more frameshifts or Stop
mutations in the genome than in the cDNAs (supplement 1c): the quality of its
sequence is therefore equal or better than that of the genome sequence,
evaluated at 0.05/kb. The average level of natural single base polymorphisms
genomewide would therefore be of the order of 0.86 per kb, including mismatches
and single base insertion or deletion. By comparison, the MGC reference, MGC
complete, DKFZ and FLJ collections might harbor respectively 0.11, 0.36, 0.94
and 1.02 mutations or sequencing errors per kb (supplement 1a).
We
also annotated the presence of micro-rearrangements of type “variable tandem
repeat number” and “micro-introns” (less than 65 bp long), which essentially
affect the UTR regions, and we propose that they correspond to copy number
polymorphisms or structural polymorphisms that may have biological
consequences. The very high frequency of such events, affecting almost 9%
(2,535 genes /29,194) of the genes currently identified by the large scale cDNA
collections, provides an interesting complement to the catalog of SNPs, and
could be further analyzed in relation to tissue specificity, phenotype, binding
to regulatory RNAs such as micro-RNAs or binding to chromatin remodeling
proteins [e.g. 25, 26].
On
the genome side, we evaluated the completion and minimal quality of the genome:
as of July 2003 (Build 34/hg16), the 2,865 Mb reference genome was only missing
between 0.4% and 1.2% of the gene-bearing sequence. 97.4% of the cDNAs were
aligning fully (within 1% of their length) on the genome, with few base
differences (1.62 per kb on average), except for 15 highly polymorphic loci.
However, the genome is not error free and we identified a series of putative
frameshift or Stop mutation, but for the reasons mentioned before we did not
yet try to identify putative point mutations in the genome sequence. However,
we found highly significant interchromosomal variations in the rates of single
nucleotide differences (mismatch or insertion/deletion) in the cDNA to genome
alignments, up to 0.36 per kb among autosomes, from 0.25 to 0.62 per kb between
X and any autosome, and at least 0.85 per kb between X and Y (Figure 4a). In the same vein, local variations
in the frequency of rearrangements and partial alignments are also observed (Figure 4b).
Because
there are two possible causes for interchromosomal or large scale local
variations, namely variable quality of the chromosome sequence and intrinsic
variability of the single nucleotide polymorphism or mutation rate, we were
happy to observe that the variations we see in the transcribed genes parallel
those observed for selected sets of neutral base substitutions and rearrangements,
which vary together with the rate of meiotic recombination in megabase-sized
segments and may correspond to hotspots of mammalian chromosome evolution [18, 27]. Conservation of the phenomenon
across species is a good argument. Nevertheless, since all these studies are
made on the same human genome, the question of whether the variations
correspond to regions naturally more polymorphic or to regions where the genome
sequence and assembly are of lower quality remains somewhat open. More
experiments will be required to distinguish the two hypotheses.
The
analysis of exact and quasi exact gene duplicates, which amount to close to
more than 1% of all alignments, is also striking. Multiple aspects of the
control of gene duplications appear chromosome specific, from the overall level
to the organization of the duplicate gene pairs, in tandem or in palindrome.
The frequency of duplicated genes per chromosome is highly heterogeneous, but
this difference seems reliable since it does not correlate to the sequencing
laboratory. Interchromosomal duplicated
genes almost invariably contain one active gene and one pseudogene, while exact
and presumably active duplicates are nearly always on the same chromosome and
less than 3 megabases away from each other. The only 2 cases of nearly exact
inter-chromosomal duplicates with introns outside the XY pseudo-autosomal
region occur at the telomeres of chromosomes 14/22 and 15/1 and would deserve
verification, since these exceptions could either be misassemblies or
interesting examples of telomere borne duplications.
The
existence of a multitude of large scale chromosome specific aspects, including
control of the level and organization of gene duplications, control of the
local frequency of single base variations and of structural rearrangements, and
possibly also control of gene expression in larger neighborhoods raise
interesting questions. It is likely that some mechanistic time or space constraints
acting on some basic machinery, for example that involved in chromosome
replication or chromatin compaction, account for these effects: as a result,
all genes are not born equal and their localization on the genome has important
implications.
The
distribution of exons and introns lengths confirms our previous observations [19], however the existence of very short exons, under 6 bp
and bounded by 2 standard introns, needs experimental confirmation. A
relatively high frequency of introns with non-standard boundaries is observed,
but we propose that one third correspond to micro-polymorphisms and most of the
others to putative partial deletions of the insert in the cDNA (supplement 6a), leaving only few
confirmed introns that do not follow the consensual splice rules (gt-ag, gc-ag
or at-ac).
We observe that 22% of the mRNAs align without
introns, after discounting the few (0.8%) that we identified as putative
genomic contaminations. However, many are probably full length clones,
especially from the FLJ cap-selected collection. This implies that a large
proportion of the human genes, around 24%, would have no introns, a distinctive
feature also seen in mouse, but not in Caenorhabditis, Drosophila or
Arabidopsis [2]. Similarly, some micro-rearrangements may turn out to be
artifacts in the cDNA libraries rather than structural polymorphisms, and the
large scale chromosomal effects we observed for the duplications and
polymorphisms may seem trivial once we know the cause, but in a global genome
and gene function perspective, we believe that we have to understand the
preeminence of all these unexpected features in the eutherian genes. Recall how
spliced genes or micro RNAs seemed awkward not so long ago, before the roles of
alternative variants and RNA interference were even imagined. And it is in our
view the combination of expected and unexpected which vindicates best, if this
was necessary, the interest of the systematic cDNA projects, the experimental
material which underlies all this analysis.
Materials and
methods
Details
on the method, including the raw data, the format to submit alignments, the
current lists of poly-A and vector identification, and the C code to select the
Gold and do the analyses are provided in the supplementary material document 1.
The NCBI version of the acedb database [16] is used throughout this project.
The
supplementary online material
contains supporting tables of results and explicit lists of occurrences. It
consists of one section per section of the article. The entire primary and
associated material is downloadable by ftp from the GOLD web site www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD.
List
of abbreviations
aa aminoacid
bp basepair
CDS Coding sequence, from Met to Stop
DKFZ The German cDNA Consortium [7]
FLJ Full Length Japanese cDNA project, from NEDO
[5, 6]
KIAA Kazusa DNA research Institute cDNA project [3, 4]
MGC Mammalian Gene Collection [8]
ORF Open reading frame: a region of a sequence
which in a given frame contains no stop codon
Acknowledgments
We
thank our colleagues from the JBIRC informatics team for their involvement in
this project since its birth, Guy Slater and Ewan Birney for contributing the
exonerate alignments and the list of duplicate sequences, our colleagues at H-inv,
IDRIS and NCBI for their help and fruitful interactions, Nicolas Thierry-Mieg
for discussions about algorithms and Vahan Simonyan for presentation of the
work. We are grateful to David Lipman for his stimulating ideas and insightful
comments on the manuscript, and for the highly productive and pleasant
environment he and Jim Ostell create at NCBI.
References
1.
GOLD, Genomewide Optimization of
Locus Description [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly/GOLD]
or [http://www.aceview.org/GOLD]
2.
AceView: identification and functional annotation of cDNA-supported genes in
higher organisms [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly].
3. Nomura N, Miyajima N, Sazuka T, Tanaka A, Kawarabayasi
Y, Sato S, Nagase T, Seki N, Ishikawa K, Tabata S: Prediction of the coding sequences of unidentified
human genes. I. The coding sequences of 40 new genes (KIAA0001-KIAA0040)
deduced by analysis of randomly sampled cDNA clones from human immature myeloid
cell line KG-1. DNA Res 1994, 1:27-35.
4.
Kikuno R, Nagase T, Nakayama M, Koga H, Okazaki N, Nakajima D, Ohara O: HUGE:
a database for human KIAA proteins, a 2004 update integrating HUGEppi and
ROUGE. Nucleic Acids Res. 2004, 32:D502-504.
5.
Suzuki Y, Sugano S: Construction of a full-length enriched and a 5'-end
enriched cDNA library using the oligo-capping method. Methods Mol Biol 2003,
221:73-91.
6. Ota
T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, Irie R, Wakamatsu A, Hayashi K,
Sato H, Nagai K, et al.: Complete sequencing and characterization of 21,243
full-length human cDNAs. Nat Genet 2004, 36:40-45.
7.
Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher
M, Blocker H, Bauersachs S, Blum H, et al.: Toward a catalog of human genes
and proteins: sequencing and analysis of 500 novel complete protein coding human
cDNAs. Genome Res 2001, 11:422-435.
8.
Strausberg RL, FeinGold EA, Grouse LH, Derge JG,
9.
The H-inv international database for integrative human gene annotation [http://www.jbirc.aist.go.jp/hinv/index.jsp].
10. Imanishi T, Itoh T, Suzuki
Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata
Y, Tanino M, et al.: Integrative annotation of 21,037 human genes validated
by full-length cDNA clones. PLoS Biol 2004, 2:856-75.
11.
Pruitt KD, Tatusova T,
12.
NCBI human genome resources [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/genome/guide/human/]
and
[ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/hs_esttrn.md.gz]
for downloading the NCBI public alignments.
13.
Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12:656-664.
14. UCSC Genome Browser [http://genome.ucsc.edu] and
[ftp://genome.ucsc.edu/GoldenPath/hg16/database/all_mrna.txt.gz],
for downloading the UCSC public alignments.
15. The ensembl human genome browser [http://www.ensembl.org/Homo_sapiens/].
Note that exonerate alignments are not yet publicly available on this server.
16. The acedb object-oriented database system at
NCBI [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly/Acedb].
17. Carlo T, Sierra R, Berget SM: A 5' splice
site-proximal enhancer binds SF1 and activates exon bridging of a microexon.
Mol Cell Biol 2000, 20:3988-95.
18. Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ,
Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, et al. : Covariation in
frequencies of substitution, deletion, transposition, and recombination during
eutherian evolution. Genome Res. 2003, 13:13-26.
19. Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum
HS, Hillier L, Brown LG, Repping S, Pyntikova T, Ali J, Bieri T, et al.: The male-specific region of the human Y
chromosome is a mosaic of discrete sequence classes. Nature 2003, 423:825-37.
20. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC,
Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al.: Initial sequencing
and analysis of the human genome. Nature 2001, 409:860-921.
21. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV,
Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE: Recent segmental
duplications in the human genome. Science 2002, 297:1003-7
22. Cheung J, Estivill X, Khaja R,
MacDonald JR, Lau K, Tsui LC, Scherer SW: Genome-wide detection of segmental
duplications and potential assembly errors in the human genome sequence. Genome
Biol 2003, 4: R25.
23. Bailey JA, Church DM, Ventura M, Rocchi M,
Eichler EE: Analysis of segmental duplications and genome assembly in the
mouse. Genome Res 2004, 14:789-801.
24. Sebat J, Lakshmi B, Troge J, Alexander J, Young
J, Lundin P, Maner S, Massa H, Walker M, Chi M, et al.: Large-scale
copy number polymorphism in the human genome. Science 2004, 305:525-8.
25. Ambros V, Lee RC, Lavanway A, Williams PT, Jewell
D: MicroRNAs and other tiny endogenous RNAs in C. elegans. Curr Biol
2003, 13:807-18.
26. Saveliev A, Everett C, Sharpe T, Webster Z,
Festenstein R: DNA triplet repeats mediate heterochromatin-protein-1-sensitive
variegated gene silencing. Nature 2003, 422:909-13.
27. Ellegren H, Smith NG, Webster MT: Mutation
rate variation in the mammalian genome. Curr Opin Genet Dev. 2003, 13:562-8.
Table 1:
Properties of duplicate alignments.
The normalized number of inter and intra-chromosome
duplicates is recorded, for exact and almost exact duplicates, then (after the
dash /) for exact duplicates only, together with information on the presence
(Yes) or absence (No) of standard introns in one or both copies. The yes-no
category corresponds to retroposon type pseudogenes. Only pairs with excellent
alignments in finished areas of the genome are considered (out of 66,018 such
alignments).
Mapping on… |
Duplicate alignments (all/ exact ) |
Existence of
introns in the copies (exact or
almost / exact only) |
||
Yes-Yes |
No - No |
Yes - No |
||
Different chromosomes The same chromosome Total outside pseudo-autosomal |
101 / 19 717 / 390 818 / 409 |
4 / 0 462 / 243 466 / 243 |
32 / 6 255 / 147 287 / 153 |
65 / 13 0 / 0 65 / 13 |
Pseudo-autosomal XY |
90 / 90 |
78 /78 |
12 / 12 |
0 / 0 |
Table 2:
Rearranged Gold alignments.
Classification of the rearranged alignments of
combined score excellent, good or partial (out of 72,691 mRNAs). 1Note
that nearly 7% of the clones with rearranged alignments have rearrangements of
more than one type, hence the total is not additive. Lists of accessions are
available in supplement 5.
Type
of rearranged alignment |
mRNAs |
% mRNA |
Mosaic Inversion Transposition
Suspected
genome deletion Variable
tandem repeat number Pseudo-split
or bridgeable genomic contigs |
1,197 128 293 283 893 36 |
1.65% 0.18% 0.40% 0.39% 1.23% 0.05% |
Total
rearranged1 |
2,637 |
3.63% |
FIGURE LEGENDS
Figure 1:
Quality of the Gold alignments
The quality of the alignment is defined as a
function of the percent length aligned and the number of single base
differences per kb (single base mismatch or single base insertion/deletion
between the cDNA and the matching genome). The number and percentage of the
mRNAs at each quality is shown (total 74,106). Each exact repeat counts as 1.
Rearranged alignments (e.g. from mosaic or rearranged cDNA or genome) count
once at their combined rationalized score. Lists of accessions and more details
are online, in supplement 3.
Figure 2:
Frequency and map on the genome of identical or nearly identical duplicate
genes.
404 mRNAs, with excellent quality Gold, map in
multiple loci, both in finished areas of the genome (July 2003). The resulting
908 alignments are represented here. a) Duplicates for which both copies map on the
same chromosome (in blue) or on different chromosomes (in red) are shown as a
percentage of excellent alignments on each chromosome. The presence or absence of standard introns is coded by a pattern: wide
oblique stripes if both copies have introns, solid color for putative
pseudogenes, and narrow stripes for pairs of intronless duplicates. The X and Y
chromosomes are highly enriched
in gene duplications, affecting 8.7% and 73.9% of their alignments
respectively: for convenience, the data for X is shown divided by 2 and for Y
divided by 20. Notice the huge variations per chromosome and the scarcity of interchromosomal duplicates that do not
behave as pseudogenes.
b) Map of
the 908 duplicate alignments on the genome. Each horizontal line
represents a chromosome, in megabases. A cDNA alignment is represented at its
physical map position by a colored line, blue for intrachromosomal
duplications, red for interchromosomal, green for a cDNA with both inter and
intrachromosomal duplications. The presence or absence of standard introns is
also encoded: on top of the line, a v
indicates the copy at this site has introns, a plain bar l that it is intronless. The bottom of the same
glyph, below the chromosome line, shows the type of the duplicate gene(s) at
the alternate location(s): ^ if all copies elsewhere have introns, no sign if
all are intronless, o if some have introns,
some don’t. Chromosomes 18 and 21 do not contain any duplicates and are omitted
from the diagram. See supplement 4b
for a similar map showing only exact duplicates.
Figure 3: Histograms of
the size of standard introns, non-standard introns and micro-rearrangements.
Analysis of introns and micro-rearrangements below
300 bp, in 52,089 excellent alignments with introns. a) Histogram
of intron size, limited to introns below 300 bp (18.6% of a total of 406,003).
73,081 standard introns (gt-ag, gc-ag or at-ac boundaries; in pink) and 2,737
non-standard (any other boundaries, in blue) are shown. Notice the sharply
distinctive size distribution of standard versus non-standard introns. b) Size histogram of the
2,737 non standard introns, in blue, complemented, on the negative side, by the
493 micro-rearrangements of less than 100 bases and of type “variable tandem
repeat number” (in green): a value of
-20 bp corresponds to a micro-duplication where the cDNA contains 20 bp
of the genome twice, in tandem. We propose that microintrons below 65 bp and
variable tandem repeat numbers are two manifestations of the same event,
structural micro-polymorphisms.
Figure 4:
Comparison of the quality of the mRNA-to-genome alignments, per chromosome
(Build 34/July 2003).
a) Rates of single base differences (mismatch or
single base insertion or deletion) per kilobase of cDNA-to-genome alignment are
measured on each chromosome for excellent and good Gold, aligning completely
outside of the highly polymorphic loci (masked), and not rearranged. All
chromosomes except the Y have between 750 and 6,963 such alignments,
constituting a test for 1.7 to 16.5 Mb of sequence (Y has only 122 alignments,
over 250 kb). The standard errors (confidence 95%; 2s) are indicated by a bar (data in supplement 7a3).
b) Rates
of occurrence of rearrangements, assembly problems or possible polymorphisms may be
compared across chromosomes through the frequency of structural rearrangements
seen in the Gold alignments, per 100 kb aligned at excellent, good or partial
quality. From top to bottom: Frequency per chromosome of partial alignments,
mosaic, inversion, variable tandem repeat number, transposition and suspected
genome deletion. Corresponding lists are in supplement 7a2
and genome maps in supplement 7b.