GOLD alignments of human mRNAs and gene annotation

GOLD: a cooperative approach to choose the best mRNA-to-genome alignment and what we learn about human genes and the genome.

Jean Thierry-Mieg¹, Danielle Thierry-Mieg¹, Michel Potdevin², Mark Sienkiewicz¹, Yuji Kohara³, Sumio Sugano^4,5, W. James Kent⁶, Kouichi Kimura⁷, Tetsuo Nishikawa^7,8, Takao Isogai⁸, Takashi Gojobori⁵, Nobuo Nomura⁵.

¹NCBI, NLM, NIH, Bethesda, MD20894, USA; ²LPM, CNRS, Montpellier, France ; ³National Institute of Genetics, Mishima, Japan;⁴The University of Tokyo, Japan;⁵METI/Japan Biological Information Research Center, Tokyo, Japan;⁶University of California, Santa Cruz, USA; ⁷Hitachi Ltd., Central Research Laboratory, Kokubunji, Tokyo, Japan; ⁸Reverse Proteomics Research Institute, Kisarazu, Chiba, Japan.

Corresponding author: Danielle Thierry-Mieg, NCBI, NLM, NIH, building 38A room 3S306, 8600 Rockville Pike, Bethesda MD20894, USA. Tel: 1 301 5947092; Fax: 1 301 4809241 mieg@ncbi.nlm.nih.gov

Takao Isogai Reverse Proteomics Research Institute, 2-6-7, Kazusa-Kamatari, Kisarazu, Chiba 292-0818, Japan; isogai-t@reprori.jp

Takashi Gojobori CIB-DDBJ, National Institute of Genetics, Mishima, Japan and Japan Biological Information Research Center, AIST Tokyo Waterfront, 2-41-6, Aomi, Koto-ku, Tokyo 135-0064, Japan; Tel: 81 355 318550, 81 559 816847; Fax: 81 355 318551; tgojobor@jbirc.aist.go.jp .

W. James Kent, Center for Biomolecular Science and Engineering, University of California Santa Cruz, CA 95064 kent@soe.ucsc.edu

Kouichi Kimura, Hitachi Ltd., Central Research Laboratory, 1-280 Higashi-Koigakubo, Kokubunji, Tokyo 185-8601, Japan; kokimura@crl.hitachi.co.jp

Yuji Kohara, National Institute of Genetics, Mishima, Japan; Tel: 81- 559 816854; Fax: 81 559 816855 ykohara@lab.nig.ac.jp

Tetsuo Nishikawa Reverse Proteomics Research Institute, 2-6-7, Kazusa-Kamatari, Kisarazu, Chiba 292-0818, Japan; Tel: 81 438 523975, Fax: 81 438 523986; nisikawa-t@reprori.jp

Nobuo Nomura METI/ Japan Biological Information Research Center, AIST Tokyo Waterfront, 2-41-6, Aomi, Koto-ku, Tokyo 135-0064, Japan; Tel: 81 335 998142, Fax: 81 335 998141; nnomura@jbirc.aist.go.jp

Michel Potdevin, Laboratoire de Physique Mathématique, CNRS, 2 place Eugène Bataillon, 34095 Montpellier, France; michel@lpm.univ-montp2.fr

Mark Sienkiewicz, NCBI, NLM, NIH, building 38A, 8600 Rockville Pike, Bethesda MD20894, USA (at the time of the work).

Sumio Sugano The University of Tokyo, Tokyo, Japan; Tel: 81 354 495286, Fax: 81 354 495416; ssugano@ims.u-tokyo.ac.jp

Jean Thierry-Mieg, NCBI, NLM, NIH, building 38A room 3S306, 8600 Rockville Pike, Bethesda MD20894, USA. Tel: 1 301 5947092; Fax: 1 301 4809241 mieg@ncbi.nlm.nih.gov

Running title: Gold alignments of human mRNAs and gene annotation

Abstract

Background:

Identifying the genes, an essential task in post-genome biology, requires aligning the many cDNA sequences available in the public databases on the genome. However, the various programs used at the main genome annotation sites propose different solutions to the exact exon structure in about half the cases.

Results:

To help resolve this problem, we pool the mRNA-to-genome alignments proposed by NCBI, UCSC, ensembl, AceView, and H-inv, for 74,106 mRNA from 29,194 human genes. We carefully define a cost function and let “GOLD”, Genomewide Optimization of Locus Description, select the best alignment for each clone. We annotate the Gold alignments, discuss the distribution of introns and minimal size of exons, classify the frequent rearrangements, and propose that variable tandem-repeat-number and micro-introns below 65 bp, which occur in 9% of the genes, are micro-polymorphisms. We evidence striking chromosomal and regional specificity in the control of gene duplication and discover that exact duplicates of genes containing introns are all clustered within 3.1 megabases of each other. We also observe interchromosomal and regional variability in the levels of base mismatch and rearrangements, annotate suspected defects, including frameshifts, in the genome and the cDNAs, and discuss the high frequency of intronless genes. Finally we identify difficult alignments through programs comparison. The current Gold, their annotations, the C program and acedb schema are available online, or from www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD. Contributions of new alignments are encouraged.

Conclusions:

Because GOLD extracts the best solutions to difficult alignment problems from all programs, it opens a new dimension toward a precise annotation of the human genes and genome.

----------------------------------

Background

Genes are primarily defined by their transcribed regions, and a large number of cDNA sequences are available in the public databases. Since a high quality human genome sequence is also at hand, one might expect that the collection of gene models would, by now, be basically complete, and consensual between the various public resources. But this is not the case, in part because aligning each cDNA precisely back from where it was transcribed on the genome is a pre-requisite, and is surprisingly difficult. Indeed, when we compare, as of July 2004, the alignments of mRNAs displayed on the public web sites at UCSC, NCBI, ensembl, AceView, and H-inv, we find that they strictly agree in only 50% cases. More softly, if we ignore discrepancies at the very ends of the alignments, the programs agree on the intron boundaries for 66% of the accessions. This does not facilitate international collaboration on the important question of defining and annotating the human genes, which is the ultimate goal.

When we examined the alignments provided by each group in detail, it quickly became apparent that each program has distinct strengths and weaknesses and that the best collection of alignments should include results from many, maybe all, programs. It also appeared that a consensus approach would not provide a satisfactory solution except in the most trivial cases. So we chose a more complex strategy. We observed that it is usually easy to decide which of many alignments is best by just looking at them, so we reasoned we could write and train a program to do the same thing, provided we analyzed large numbers of cases by eye. We therefore pooled the results from all programs into a dedicated acedb database and painstakingly elaborated a cost function, adding successive refinements until the program automatically identified reliably, in nearly all cases, the very alignment that we would select by hand. We call the program “GOLD”, for Genomewide Optimization of Locus Description. The resulting Gold reference alignments are annotated and made freely available (supplement S2d) and can be used as a basis to analyze the genome and the genes.

In the first part of this article, we present the raw data: the genome and the cDNAs selected for the benchmark. We describe the four main large scale cDNA projects that serve as a basis to this work, and show that all are of the highest quality. We present the set of alignment programs that contributed their data. We then outline the principles of the GOLD selection algorithm, detail the construction of the cost function and provide the actual Gold alignments.

The second part takes advantage of the precision of the Gold alignments to increase our knowledge of the genome and the transcriptome. Since this project is limited to providing a solution to the first, but critical step in the gene building procedure, the mRNA-to-genome alignment, we did not address directly some more elaborate issues, such as clustering of the alignments into gene models, alternative splicing, encoded proteins, or identification of the most useful cDNA clones; we hope that the Gold set will be used as a building block in further studies. But already, by analyzing the Gold alignments, we have made several observations on completion, level of variation and quality of the genome: we find that the July 2003 (build 34/hg16) genome is only missing between 0.5% and 1.2% of the gene containing sequence (it is at least 98.8% complete), and that the average level of differences between the aligned cDNAs and the genome is 1.62 base per kb (99.84% identity). We observe a large interchromosomal variability in the quality of the mRNA-to-genome match and identify in the genome sequence a few errors leading to frameshifts in the coding regions. We also point to potential structural problems, polymorphisms, or misassemblies in the genome, chromosome by chromosome. We examine the exactly or almost exactly duplicated alignments, where an mRNA aligns in two distinct genes in the genome with less than 1 or 2 differences among the copies, and we show striking chromosome-dependent variations in frequency and relative organization, in tandem or in palindrome, of the repeats. An intriguing effect of the presence or absence of introns in the two copies is also noted, and taken as a strong indication that when a gene gets duplicated on a different chromosome, only one copy remains transcriptionally active, unless maybe its sequence diverges. This is not the case for closely linked copies, within 3.1 megabases of each other on the same chromosome, which appear to both remain active. We then classify the numerous rearrangements, close to 4%, that occur in alignments genome wide. We analyze the length and properties of “introns” with non-standard boundaries and propose that, together with variable copy number rearrangements, they correspond to micro-insertion / micro-deletion polymorphisms, which affect 4% of the alignments and close to 9% of the genes, a much more frequent occurrence than previously thought. Finally, the main difficulties in the mRNA-to-genome alignment problem are identified through comparison of the contributing programs.

As was hoped for, our cooperative approach significantly improved the resulting alignment models over each method taken separately. Every program tested won the Gold for at least some accessions, which implies that the current best set can still be improved, and we welcome contributions of new alignments by any interested party. The scripts that perform the Gold selection and analysis are automatic. Researchers who contribute their alignments will see their best results incorporated into the updated version, and will receive lists of accessions for which their proposed alignment appears suboptimal. These lists, ordered by specific types of defects, undoubtedly facilitate debugging. They were transmitted to alignment contributors, and their exploitation is one of the reasons why AceView strives in this analysis.

In this project, our goal is precision, and we support all reported numbers by explicit lists of accessions and additional details in the supplementary material online, or at our web site www.aceview.org/GOLD [1]. The annotation process is dynamic, and updated versions of the Gold alignments will be made available on the web site as data and programs improve.

Results and discussion

1: The selected cDNAs and genome, and the contributing alignment programs

Alignments were generated around February 2004, on the July 2003 genome (NCBI build 34/hg16); the mitochondrial genome was omitted. For the benchmark, we aimed at cDNA sequences that would be homogeneous and of the highest quality, so we selected the 74,106 fully sequenced mRNAs from four large scale ongoing cDNA projects, available from the public databases (GenBank/EMBL/DDBJ) on February 10, 2004 (Supplement S1a). These mRNAs represent more than half of the human mRNA accessions. They cover a wide range of size, averaging 2,392 bp (after removal of poly-A and eventual vector). 26 are below 200 bp (mostly from MGC and DKFZ) with a minimum at 38 bp (BC033526); 34 are above 10 kb (all from KIAA and DKFZ), with a maximum at 14,491 bp (AB007934). According to AceView [2], 71,740 clones with reliable Gold alignments nicely cluster into 29,194 genes. 68% of these genes have introns with standard boundaries, and 78% encode putative open reading frames (ORFs) of more than 100 aminoacids.

Using the results of this work (supplement S1b), the cDNA projects can be described as follows:

- The Japanese KIAA effort was the first to be launched, more than ten years ago (2,058 accessions; [3,4]). This pioneer work aimed at identifying the longest mRNAs, mostly from brain, that would support in vitro translation of very large proteins. Indeed the average length aligned in GOLD, 5,072 bp, and average coding potential 957 aminoacids (as measured by the longest open reading frame of the genomic footprint of the cDNA, for 1902 clones with excellent alignments and ORF > 100 aa), are by far the largest of all projects. The very best quality of alignments is also observed, with on average 0.9 base mismatch per kb between the cDNA and the genome (Supplement S1b1). Indeed, this library is the only one where the alignments evidence more frameshifts or mutations in the genome than in the cDNA (S1c). The frequency of rearrangements (0.8 per 100 kb aligned, S1b2) and of non-standard introns (1.5 per 100 kb aligned, S1b3) is also the lowest. These impressive figures highlight the utmost interest of requiring in vitro translatability if one aims at protein-coding clones.

- The Japanese NEDO Full-Length cDNA sequencing project (FLJ) is a very large scale project using a remarkably efficient method, developed by Sugano, Suzuki and Isogai, for cloning complete mRNAs, from the capped first base to the poly-A [5, 6]. This project is unique in that it aims at faithfully depicting the whole transcriptome, with no preconceived idea. For example, it does not select clones for complete sequencing based on conditions on the CDS or on the presence of introns. This collection contributed 29,923 cDNAs; the average length aligned is 2,442 bp and average Gold coding potential 355 aminoacids (for 22,018 clones, with excellent alignment and ORF above 100 residues). The slightly elevated frequencies of mismatches and single base insertions or deletions (1.9 per kb on average, S1b1), of rearrangements (2.5 per 100 kb aligned, S1b2) and of non-standard introns (4.9 per 100 kb when including partial deletions of the insert, S1b3) are probably due to the PCR amplification step used before cloning. Yet, this collection detects the largest number of frameshifts in the genome (S1c).

- The German DKFZ-based cDNA consortium aims at the identification and biological characterization of novel long cDNAs encoding complete ORFs [7]; it contributed 9,258 accessions. The average length aligned is 3,307 bp, and average Gold coding potential 451 aminoacids (for 7,067 clones, with excellent alignment and ORF above 100 residues), making it the second longest protein-producing set. Like FLJ, it uses a PCR step before cloning, and has similar levels of single base differences from the genome (1.8 per kb, S1b1; 3% of the cDNAs appear to contain a frameshift in the CDS, S1c), rearrangements (1.5 per 100 kb aligned, S1b2) and non-standard introns (2.6 per 100 kb, S1b3).

- The American-led Mammalian Gene Collection (MGC) effort was the last to be launched [8]. It does not focus on novel genes, but rather aims at identifying one good representative encoding a complete CDS for each protein-coding gene. The consortium submitted 32,859 human accessions to GenBank. A stringently filtered subset, the MGC reference collection, includes the best non-redundant 14,961 clones, matching the genome if possible without structural anomaly or deleterious protein mutation. The average length aligned is the shortest, 1,933 bp for the bulk and 1,970 bp for the filtered set; yet the average coding potential is 423 aminoacids (for 13,921 ORFs above 100 aa in the filtered set). As expected, the filter for choosing the reference clones improves the quality of the match, from 1.3 to 1.0 mismatches per kb (S1b1), efficiently discarding the numerous frameshifts from the collection (S1c); it also reduces the level of rearrangements (from 1.6 to 0.7 per 100 kb aligned, S1b2) and of non-standard introns (from 2.7 to 2.3 per 100 kb, S1b3). However, it discards some good protein coding genes not yet known to RefSeq (108 complete CDS mRNAs with more than 300 aminoacids are listed in S1d).

Together, the four cDNA projects constitute a valuable resource for the community, because they aim at providing the researchers with a comprehensive collection of clones that will ultimately represent the human transcriptome. The clones also serve as a basis for gene annotation by public efforts, such as the new H-inv project [9-10] or the well-known NCBI RefSeq project [11] (S1d).

Yet notice that the 21,542 RefSeq mRNAs (from August 2, 2004) identify only 16,832 mapped genes genomewide (some with alternative variants). That is because the RefSeq collection has a representative mRNA/Protein for only 51% of the genes from the Gold project (14,935/29,194 genes), more precisely 25% of the FLJ, 29% of the DKFZ, 46% of the KIAA and 71% of the MGC reference clones. On the other hand, 1,897 genes with a RefSeq do not yet have a representative cDNA in any of the four collections. It appears that the otherwise excellent quality RefSeq project is not yet comprehensive, but we may help by providing in supplement S1d a selected list of 610 cDNAs encoding complete CDS proteins of 635 aa on average (from 300 aa to 2,089 (D79992) aminoacids), mapping at excellent quality with no anomaly in 587 genes that do not yet have a RefSeq mRNA.

To generate the Gold set, we gathered the alignments proposed by the following groups

- AceView [2] (Thierry-Mieg et al., unpublished; http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly): this program, developed at NCBI, is the only one to our knowledge to use a co-alignment strategy; all public mRNAs and ESTs are coaligned on the genome, the alignments are refined using information from other mRNAs in the cluster, but no prediction input. They are then clustered into transcripts and genes. The final output is a minimal set of alternative transcripts synthesized from all of the alignments, as well as a refined alignment of each mRNA. 4 of the 5 million of mRNA sequences in GenBank/dBest (in build 34/July 2003) contribute to the reconstruction of the AceView transcripts annotated on the AceView web site.

- NCBI human gene alignments available on the public site [12] from February to August 2004 (version 34.3, www.ncbi.nlm.nih.gov)

- UCSC Blat [13-14]; Blat alignments are the basis of many of the tracks at www.genome.ucsc.edu .

- JBIRC/AIST, one of the Japanese teams associated to the H-inv project; JBIRC alignments are used as a basis for the H-inv annotation project [9-10] (http://www.jbirc.aist.go.jp/hinv/index.jsp)

- the Japanese NEDO “Analysis of Protein Function and Practical Application”, also associated to the H-inv project ([6] and unpublished results);

- Exonerate, a program developed by Guy Slater and Ewan Birney and used internally as an element in the gene building process displayed at the EBI/Sanger ensembl site http://www.ensembl.org/Homo_sapiens [15]. The ensembl transcript models then add predictive elements, such as similarity to other proteins.

NCBI, UCSC and AceView made their alignments public, others kindly provided them privately for inclusion in Gold; some groups contributed partial sets (supplement S1e). The test dataset, the format for submission by interested parties and the public alignments, in that format, are available online.

2: The GOLD selection procedure: quality evaluation and choice of the best alignment

Evaluating the quality of an alignment involves comparing the sequence of the cDNA to its image on the genome and assigning a quality score to each alignment. Yet, despite a clear need to respect a few simple rules derived from basic molecular biology, choosing the best alignment for each cDNA is a difficult and somewhat subjective task, and we would like to explain how we derived our cost function. Of course we did not use the function implicit inside AceView or any of the other contributing programs. Rather, we started by favoring the longest possible alignment and evaluated the match to the genome using the absolute Smith-Waterman algorithm. Then, in an iterative process, we went back and forth between visual inspection of hundreds of alignments and modification of the cost function. Of special interest were the cases where a Gold alignment was found by just one method. Some of those were unconvincing for reasons that could be expressed (hence programmed) and we added progressively different penalties. As the results of the GOLD selection procedure were improving, we could address more and more subtle problems and fine tune the cost function. We stopped only when the overwhelming majority of the Gold alignments appeared well chosen.

The principal elements involved in the calculation of the score are:

• Scoring the mRNA to genome match: We first consider each proposed exon on the cDNA aligned onto its corresponding genome segment and use the Smith-Waterman algorithm to count, in each exon, the single base differences between the cDNA and the genome sequence. A “difference” may be a single base deletion, a single base insertion, a single base mismatch or an uncalled base (n) in the cDNA or the genome. We score 1 point per base aligned and we substract 2 points per single base difference. We then consider the set of proposed exons and substract 1 point per unaligned base between exons, with a maximal penalty of 6 points per gap, de facto discouraging microdeletions of less than 6 basepairs.

Note that the total score is dominated by and of the same order of magnitude as the number of aligned bases that exactly match the genome, because the average measured “error” rate is less than 0.2% and all penalties described below are minor.

• Rearrangements: Note that this way of scoring does not constrain the geometry of the alignment. Indeed, rearrangements do occur in about 4% cases, in the cDNA or in the genome: some arise from technical artifacts, some from interesting biological reasons. To treat correctly these frequent events, GOLD will prefer a solution aligning the entire cDNA clone to a partial alignment, even if it violates the strict colinearity rule. For example, mosaic clones, matching two different areas of the genome in successive segments, or local alignments demanding an inversion, duplication or deletion, for a segment of the cDNA or of the genomic clone, are acceptable. When this happens, we simply select the combination of partial alignments representing the best coverage of the cDNA with the minimal number of base differences and consider it as a single Gold-split alignment. The parts may come from any method. The score of a Gold-split alignment is the sum of the scores of the parts, rationalized by multiplying by the overall length of aligned cDNA divided by the sum of the length of the parts, minus 6 points. In this way, we favor a single alignment over two split overlapping ones, yet a rearranged alignment wins over a partial.

• Intron boundaries and splice site consensus: By respect for the spliceosomes, introns with standard boundaries gt-ag or gc-ag or at-ac are preferred to introns with any other feet. Yet we do not forcefully impose standard introns irrespective of the cDNA sequence: we substract only 3 points per non-standard intron. Thus an alignment accepting a single base difference to splice out a standard intron is slightly advantaged with respect to an error-less alignment using a non-standard intron. But accepting two base differences locally to gain a standard intron is disfavored, since the observed level of differences between genome and aligned mRNA, below 0.2%, makes this hypothesis farfetched.

• Poly-A addition site and vector sequence recognition: Most mRNAs are submitted to the public databases with a poly-A tail, as expected, but a few thousand also contain short stretches of vector, which is unfortunate. Both are problematic, because often, one of the programs will manage to align these sequences as additional exons, extending the image of the gene beyond its natural boundary, sometimes intertwining it with nearby genes. It is therefore critical to precisely determine the part of the cDNA sequence that was truly copied from the genome locally and needs to be aligned.

We developed tools to tentatively identify the poly-A addition sites, the vector sequences on both sides of the insert, and the eventual non-matching poly-A or poly-T on the 5’ side of inserts, and we clipped these probably irrelevant sequences. This analysis was difficult and we validated the ambiguous cases by hand (details and lists online, supplement S2a).

Once the 5’ and 3’ boundaries of the area to align have been defined, we do not score points for aligned bases outside this area, but keep counting the base differences. This means that a program aligning the poly-A matching the genome locally is not penalized. However, we substract a penalty of 3 points per first or last exon entirely or almost entirely outside of the defined area, in order to avoid pure poly-A or pure vector exons in the Gold.

· Invalidating dubious first exons: Some short artificial sequences at the 5’ ends of cDNAs may still have escaped detection, and be aligned untrustworthily. For example, we noticed a few hundred cases of Gold alignments with relatively short 5’ exons and a concentration of base differences 50 to 100 times higher than the rest of the alignment. To discourage this, we count each single base difference twice in any first exon shorter than 50 bases. Note however that this mild treatment does not eliminate all first exons with visually dubious anchorage.

A related problem is the design by some programs of very small first exons that appear to be random matches, since they are at a distance from the next exon well beyond statistical significance. There are numerous instances where for example a 4 letter first exon lies 30 kb 5’ to the second exon. Among these are some justified cases where the short first exon exactly matches the 3’ end of an exon of the same gene defined by another clone. In fact, AceView intentionally uses co-alignment and favors such short distant exons supported by other mRNAs, what we call “team jump”. But in order not to advantage any program, we penalized all cases irrespective of this argument. In practice, if nBp is the number of bases in the first exon, nDiff the number of base differences in that exon and IntronBp the size of the bordering intron, we demand that 4^nBp-2nDiff-1 > IntronBp. We use twice the number of base differences to mildly take into account the combinatorial nature of mismatch positions. When the condition is not met, the few aligned bases do not score points and we again penalize such cases by 3 points, because they tend to give a spider web image of the genes that is not vindicated by the data.

Because of the softness of our criterion however, a few tens of Gold alignments use multiple dubious short exons separated by very long non-standard introns. But considering the low number of such instances, we did not try to force their exclusion from the Gold set.

· Extent on the genome and number of introns: We have not introduced any advantage or penalty for the number of introns or the extent on the genome, because none of the methods compared so far badly exaggerated the size and number of introns and still won. In fact, a rather large number (3138 accessions, 4%) of the alignments lying in the same genomic region and sharing the best score differ in the position of an intron, but the intron is usually non-standard (“alternate” list in S2a). In just a few cases, the alignments also differ in their extent on the genome. Then, by consideration for the RNA polymerase and the times it takes to synthesize the premessenger, we select for Gold the most compact at equal score. If on the other hand the extent on the genome is the same, we pick the Gold at random.

· Coding potential: We do not use protein coding potential as a major component. Yet the regularization of intron boundaries is not always innocuous: accepting a single base insertion/deletion to favor a standard intron may disrupt the ORF. So, when different alignments of a cDNA provide ORFs of different lengths, we give a bonus of three points to the alignments with the longest ORF, so that they win in spite of the non-standard intron.

The minor importance we give to the coding potential is justified by the fact that most of the frameshifts or stop mutations in the cDNAs are efficiently corrected in Gold by just maximizing the DNA to RNA match: the average ORF length read on the cDNA sequence is 386 aminoacids, but it rises to 392 aminoacids on the Gold genome footprint (over 61,807 well aligned cDNAs with ORF > 100 aa). This observation confirms that the genome sequence is usually a better reference than the cDNA, in particular, it contains about 10 times less frameshifts: we identified and list in supplement S1c 1,536 suspected frameshifts in the cDNAs and 171 in the genome.

The cost function is implemented in the program GOLD, Genomewide Optimization of Locus Description, written in C and relying on the acedb object oriented database system [16] and its new AceC programmer’s interface (Thierry-Mieg and Sienkiewicz, unpublished). The code, the schema of the database, and the annotations, sequences, and coordinates of the Gold alignments are available online.

The selection method turns out to be both reliable and sensitive when applied to the 95% alignments of good quality, matching with no rearrangement: we have seen no example where a difference of 5 points does not distinguish the visually best alignment in a sample of 300 alignments. Even a single point difference is diagnostic in a majority of cases, although some are really equivalent.

We do not mean to say that all current Gold alignments are perfect: we have already mentioned a few situations where we have doubts, for example we do not penalize multiple non-standard introns, although some do not look convincing. And a few tens of Gold alignments have annotated defects, such as a pure poly-A last exon, yet the solution they propose to another problem makes them Gold in spite of the defect.

Another open question is that of very small exons bounded by two standard introns; their size histogram in Gold may be found in supplement S2b. Such micro-exons “look” good, but are they real? In the cell, for mechanistic reasons, there is likely a low limit to exon size. The literature provides a number of reliable micro-exons 6 bases long (e.g. [17]), but we found no convincing example below 6 bp and to our knowledge, the minimal size has not yet been assessed experimentally. 88 Gold examples proposing micro-exons 2 to 5 bases long could serve as a support for such experiments. But in the absence of hard data, we chose not to penalize very short exons either.

Altogether we believe that the current choice of perhaps up to 1% of the Gold alignments may be controversial, but this number should decrease in the future, because Gold provides debugging tools while the contributed data allow incremental refinements of the cost function.

3: The quality of the Gold alignments reflects quality and completion of genome and cDNA sequences.

To evaluate the quality and completion of the human genome sequence, we quantified the quality of the Gold alignments by measuring the proportion of the length of the mRNA (as found in GenBank minus the vector and poly-A) that aligns and the percentage of “single base differences” (either a mismatch or a single base insertion or deletion) between the cDNA and the genome in the aligned region. We then defined criteria to link these two numbers to the reliability of the genomic map assignment, by using the knowledge of single base difference rates we gained from hand-editing the 205,205 sequencing traces from the worm transcriptome project (Kohara, Y., Shin-i, T., Thierry-Mieg, J., Thierry-Mieg, D., Suzuki, Y., Sugano, S., unpublished). The reference strain in C. elegans is a true homozygote at the basepair level, and after manually correcting the many errors due to the basecall programs, the worm cDNA libraries contain between 0.08 and 1.2 mutations per kilobase, depending on the library construction protocol. On the other hand, the genome has 0.05 confirmed mutations per kb, as measured by sampling 6 Mb of sequence covered by multiple independent mRNA sequences, all with excellent quality sequencing traces locally. If we add room for the polymorphisms expected in human sequences, we consider that an alignment with less than 4 differences per kilobase is certainly reliable in terms of genomic map assignment, and up to 15 per kb acceptable for a region with more variations. Above 15 per kb, we have doubts: either sequencing is of bad quality, or the gene is especially polymorphic, or the chromosomal region is incorrectly assigned. Perhaps the correct region has not yet been sequenced. Above 30 differences per kb, the alignment is unacceptable. Figure 1 defines the terms excellent, good, partial, dubious and bad, as we use them throughout the article, and shows the partition of the Gold by those criteria (lists and maps are in supplement S3a,b,c,d).

- Excellent and good alignments match the genome over their full length and with few base differences. With 72,198 mRNAs, 97.43% of all mRNAs, they are by far the largest groups. To be precise, the average percentage of length aligned for this major class of accessions is 99.98% and the combined average sequencing error, mutation or polymorphism rate in the cDNAs and the genome is 1.62 per kb (276,340 base differences in 170,090,311 bp aligned). This confirms the high degree and quality of completion of today’s human genome sequence (NCBI Build 34/July 2003).

Still, the genome sequence is not devoid of mutations or sequencing errors. For example, we provide in S3e a list and map of 171 cases where the protein on the cDNA is longer than that on the Gold genomic image by at least 100 residues. Using the AceView database [2], it appears that more than half of these cases are mutations or sequencing errors in the genome (single base insertions or deletions, or nonsense mutations) confirmed by all clones sequenced in the area, including large numbers of ESTs. For example, there is a single base insertion or deletion in the current genome in genes INPP5B (AK022846), PLCD3 (AK074240), PKP4 (BC050308) or XTP2 (AB029019), affecting nucleotide doublets in each of these cases.

Interesting polymorphisms in the genome sequence are also observed. For example, the gene GLULC (AK122784), possibly involved in the regulation of the level of the neurotransmitter (and aminoacid) glutamate, bears in the 5’ region of the gene a G(20%) or C(80%) polymorphism, leading to a Stop versus a tyrosine. The most frequent allele, found in 80% of the cDNAs, is predicted to have a considerably longer form of the protein than the allele represented in the reference genome of July 2003.

In other puzzling instances, a unique indel polymorphism in an mRNA allows specific alternative variants that should include an early stop and might be degraded by the mRNA surveillance machinery to become open up to the last exon, hence probably stable as messenger RNA. Genes SLC35A1 (AL122119), encoding through alternative splicing a large family of nucleotide sugar transporters, or SUMF2 (AL050037), encoding multiple forms of the sulfatase modifying factor 2 are good examples. Such variations being of rare occurrence, one may wonder whether they represent rare RNA polymerase errors or modifications by RNA editing that make the transcript translatable thus stable and observable, or if they represent rare genomic polymorphisms. More experimental work will be required to answer this question.

- The partial alignments come from 907 mRNAs that align well, but incompletely. Upon examination, a good proportion cluster in specific loci, suggesting that they highlight defects in the genome.

As expected, some mRNAs border a known genome gap (e.g. AL831825 on chr2, BX647560 from gene ABCA13 on 7, AL117478 from gene AGS3 on 9 or BX648928 from gene PCCX2 on 12). But even more frequently, in at least 181 accessions from 128 genes, partial alignments uncover a deletion in the genome.

For example, in locus GRIPAP1, mapping in a finished area of chromosome X, the 5’ ends of mRNAs AB032993, BX647801, BX640704 fail to align over 757 to 1018 bases. Although the unaligned sequences are not found in any genomic clone in GenBank, they form a consensus which is furthermore conserved in the homologous genes Grasp1 in rat and Uqcrc1 in mouse (where the gene encodes ubiquinol cytochrome c reductase core protein 1): this human sequence is artificially missing from the July 2003 genome.

In another example, gene VPS13D on chromosome 1, a large genomic piece is missing right in the middle of the gene. This leads to both partial alignments of 5’ ends of clones (AK057675 aligns from base 557 and AL833079 from base 199) and to large internal sequences unaligned (1553 bp in AK125118, 1435 in AB007922 and 1363 in BX641035). The problem will probably be solved in the next genome build, since the Sanger Institute submitted in November 03, after build 34/July 2003 was out, a genomic clone BX682532 that contains the missing piece.

Finally, some deletions arose in the course of building the reference genome, possibly due to a suboptimal choice of the genomic path across the numerous overlapping genomic accessions. An example is the first 122 bp of AK001244, which are missing from build 34, yet are present in genomic accession AC115284, partly used to build the reference genome just upstream of this gene.

Altogether, we estimate that more than half of the partial alignments are due to deletions or missing pieces in the genome. Other causes of partials are unrecognized vector or other irrelevant sequences and rearrangements.

- The dubious, the bad and the unaligned: 957 mRNAs give unreliable alignments and genome map assignment and are ignored in all subsequent analyses. Another 44 mRNA accessions are still unmapped by all programs that tested them. For many of these, the genome is probably still missing. Two classes of poorly aligned sequences may be recognized:

• 543 mRNAs align over their entire length but with many base differences, usually spread over the entire sequence length. One third of these come from 15 highly polymorphic genes encoding the hypervariable immunoglobulins and histocompatibility complex proteins HLA/MHC. About a fifth appear to be pseudogenes or members of multigene families for which the real gene has not yet been sequenced (e.g. tubulin BC019829 or alpha tryptase BC028059). A few others raise a contamination warning for the Alu repeat and some may result from small batches of bad quality sequencing. 26 in this category may be excellent, but the coordinates of the proposed alignments could not be correctly decoded.

• Most of the remaining 414 mRNAs have a partial match, usually with a high level of mismatches. Some 75 good but very short matches may indeed turn out to be solid anchors, but they will need confirmation on future genome builds.

In conclusion, if we add all the weak map data so far, we find that at least 0.4% but at most 1.2% of the human genome bearing expressed genes is still missing in build 34/July 2003.

4: Multiple alignments point to chromosome-specific gene duplication control mechanisms.

About 1% of the mRNAs align well, but in multiple non overlapping regions of the genome: their map is ambiguous, because they belong to repeated, or paralogous genes that did not yet diverge. To analyze this type of events, we considered the exact duplicate alignments of excellent quality, and also included the quasi exact duplicates, with score within 5 points from Gold, for example with 2 base differences. These criteria select 627 mRNAs producing 1,379 duplicate alignments, 763 exact and 616 nearly exact (details and lists in supplement S4a). The sequence similarity between both copies is on average 99.97%, with a minimum of 99.5% for some ribosomal pseudogenes.

34% of these apparent duplicates are disregarded because they simply reveal a transient state in unfinished areas of the genome. Indeed, when conflicting situations arise during assembly, some fragments are left loosely associated to the chromosome, as “randomly mapped” contigs. These often contain duplications that get merged in a later genome release. Supplement 4b shows the percent of such alignments per chromosome, and provides a measure of completion of the chromosomes’ sequence: the least finished chromosomes would be, in that order, 9, 1, X, 17, 2 and 8 (in build 34/July 2003).

On the other hand, provided the finished genome sequence can be trusted, we may hope that the remaining 908 duplicate alignments, where both copies lay outside of the random contigs, are real. Their main properties are summarized in Table 1: 88% have both copies on the same chromosome (intrachromosomal), rising to 95% if we restrict to exact duplicates. Localization of the duplicated genes on the genome (Figure 2 and supplement S4d for exact duplicates) shows a striking regional heterogeneity: duplications tend to cluster in a few megabase-sized fragments, reminiscent of what has been described, from human and mouse genomes comparison, for single base variations, specific rearrangements and meiotic recombination [18]. Possibly as a result, some chromosomes have no duplicates (18 and 21) or very few (3, 14, 6, 20, 12, 11, ordered by percentage of duplicate alignments), while others have very many (Y, X, 16, 5, 15, 7, 10, 8, 2, in decreasing order). Note the absence of clear dependency on chromosome size.

The sexually dimorphic chromosomes are especially rich in duplicate genes, in part because of the pseudo-autosomal region, which consists of two telomeric genomic segments 2.2 Mb and 200 kb long highly similar between X and Y and represented as identical in build 34. But even outside these regions, the X and Y have a propensity for duplications [19], affecting 7% and 36% of their alignments respectively. This might be advantageous for dosage purposes and for protection against deleterious mutations in males.

The few inter-chromosomal duplicates are special in two ways, both indicating that very few of these remain functional (Table 1). First they include 84% divergent pairs in which only one member is expressed, to be compared to 46% for the intra-chromosomal. Second they include a large number of retroposon-like pseudogenes, in which the Gold has standard introns and the copy does not. Such retroposons are known to result from random insertion in the genome of the retro-transcription product of a mature mRNA. Pairs with typical retroposons account for 8% of the duplicates; they affect a dozen highly expressed genes, such as those encoding ribosomal proteins or histones. And because reinsertion is random, all such genes have their intronless pseudogenes landing on a different chromosome.

In contrast, duplicates with introns in both copies are extremely rare among interchromosomal pairs. The only cases, mRNAs BC015119 and AK056232, align as almost exact repeats twice at the telomeric end of a chromosome, respectively 14/22 and 15/1. They might possibly correspond to genome assembly errors, or else be living examples of very recent telomeric born duplications: experimental assessment would be especially appreciated.

The remaining interchromosomal duplicates belong to intronless pairs of genes. Because they occur in the right proportion (12% of intronless pairs, versus 12% of intron-noIntron pairs relative to intron-intron) and have started diverging at the same rate as the pseudogenes, we would like to propose that some if not all of the intronless interchromosomal duplicates are retroposons of intronless genes.

88% of the duplications are intrachromosomal (Table 1). The 462 pairs where both members have introns and the 255 pairs where both are intronless have similar properties. More than half are exact, in line with the idea that both genes are expressed. The duplicate genes have a strong tendency to be in close proximity: 81% (581/717) occur within 1 megabase of one another. The median is 230 kb: half of the pairs are within that distance, ranging from 4,860 bp (BC029540, gene LOC255313, AceView gene koby) to 25.2 Mb (BC005089, gene ANAPC1). If we restrict to exact duplications where both copies have introns, the median drops to 180 kb, ranging from 4,860 bp to 3.1 Mb (BC026100, gene TTTY8).

On average, duplications tend to be in palindrome slightly more often than in tandem (312 palindrome versus 269 tandem), inward and outward palindromes occur at the same rate overall (151 versus 161). Again, interestingly, there are clear large scale regional variations (supplement 4c), emphasized by the fact that chromosome 8 has 31 tandem in 8 genes and no other type of duplication while chromosome X has ten times more palindromes than the average chromosome (62 inward and 33 outward), but only twice as many tandems (19). This ability of the X chromosome to undergo close palindromic duplications is largely responsible for the increased frequency of repeats on this chromosome. It is tempting to imagine a mechanistic link between the frequent palindromic gene duplications and the phenomenon of X-inactivation.

Clustered gene duplications are usually interspersed with genes that are not duplicated (by our standards). The number and position of introns is identical in the copies, but intron sequences are conserved to a lesser extent. Most of the repeated genes are simply duplications. However, as expected, the multiply repeated ones tend to be in tandem, since this arrangement, through paralogous recombination and sister chromatid exchange, allows increases and decreases of the copy number. The record so far is held by BC029540, which maps in a 52 kb cluster of 10 close tandem repeats on chromosome X, 8 exact and 2 with 1 base difference. The cluster is supported by 12 other GenBank or dbest entries (gene koby in AceView). It is expressed mainly in testis and encodes a family of human specific putative cytoplasmic proteins.

The pattern we observe for duplicate genes that barely diverged: a strict inter-chromosomal barrier which prevents genes with introns from being duplicated on different chromosomes, added to the tendency for duplicates to be closer than a few megabase apart, contrasts with the pattern described for segmental duplications of genomic fragments: those occur at comparable levels inter (76 Mb) and intra (95 Mb) chromosomes and are often far away [21, 22, 23]. A main difference is in the choice of the minimal degree of sequence identity: 99.5% here (average 99.97%) versus 90% there, and in the genome length we interrogate: 18.8 kb here (the average extent of these 1.9 kb mRNA alignments on the genome) versus 1 to 10 kb there. Our functional annotation also seems different (supplement 4e): we find many duplicate genes encoding RNA/DNA binding factors, much like the mouse repeated genes [23], but we do not identify multigene families with conserved motifs because they are not identical all the way and are too divergent. Yet, as expected, a majority of the duplicate genes fall within segmental duplications. Our interpretation is that genes that belonged to far away segmental duplications have diverged, while duplications physically closer on the chromosome, in the range of 5 kb to 3000 kb, could be mechanically acted upon by gene conversion or other mechanisms maintaining them identical or nearly, while often remaining both active.

There are legitimate concerns that this entire analysis rests on the hypothesis that the finished chromosomes sequence did not merge or split the repeats during sequence assembly. It is clear that the clone based physical map approach taken by the publicly funded human genome project [20] is more secure than a whole genome shotgun approach. The mapped-clone approach has been validated, in the case of the nematode, by thousands of careful C. elegans researchers. Another reassuring observation is that dramatic variations in frequency of duplicates have been observed by at least four large scale genome sequencing centers, which produced some chromosomes very rich in repeats and some deprived of those (Sanger: chromosome X versus 6, Washington University: chromosome 7 versus 4, Whitehead Institute: chromosome 15 versus 18 and SHGC/JGI: chromosome 5 versus 19). So it is probably correct to assume that the differences among large megabase-sized regions in the chromosomes do not reflect technical artifacts in the genome assembly, but instead indicate a regional specific control mechanism on the number, orientation, organization and clustering of the identical or barely divergent duplicated genes. Whether the control is positive or negative, acts on the generation of duplications, prevents their accumulation, actively removes duplications, or controls mutation rate is not known. Potential actors include replication, repair, conversion or recombination, chromosome breakage, sister chromatid exchange and selective factors such as those that control the length of the chromosome or the level of expression. But Figure 2 shows that whichever mechanism controls the genes duplications has to be remarkably chromosome-specific and act at a large molecular scale, in megabase-sized segments, akin to position effect variegation and X inactivation.

5: Split or discontinuous alignments identify cDNA or genome rearrangements, polymorphisms, or assembly errors.

In most instances, the paradigm linking DNA to its mature RNA copy is verified, and the entire length of the cDNA clone aligns locally on the chromosome in a collinear fashion. But a surprisingly large number of mRNAs, almost 4%, do not follow this rule. While their entire length may ultimately be aligned at excellent quality, this requires splitting the sequence of the mRNA into separate fragments or aligning in a non-collinear fashion.

Rearrangements may be generated in silico, in vitro or in vivo. Genome assembly, the construction of genomic or cDNA libraries, the tissue or individual origin of the samples can be a cause. For example, some cDNA libraries come from tumor tissues in which reorganization of the genetic material is known to occur. Possible sources of rearrangements also include remodeling of the genome and structural variations across individuals. Some genomic areas occur in variable Copy Number (CNP) in different individuals [24], and any CNP integrated in a chromosome should be bounded by rearrangements. Rearranged cDNA alignments may also be useful to spot defects in the genome sequence. But they are currently among the most difficult to pinpoint and benefit most from using simultaneously the alignments generated by all methods. UCSC and the Japanese NEDO “Analysis of Protein Function and Practical Application” (unpublished data) are especially good at providing those, while most other groups simply forbid it in the current version of their program.

We classified the rearranged alignments of excellent, good or partial quality and distinguished six cases, each defined by a particular geometry (Table 2; lists of accessions are in supplement S5a):

• Mosaic: For 1,197 mRNAs, the split fragments align best in distant places in the genome, at least 300 kb apart, a property one might expect for mosaic cDNA clones (also called chimeric), in which the insert is a concatenation of more than one mRNA molecule, possibly because reverse transcribed copies of two mRNA molecules were ligated artificially during the cloning procedure. A proportion of mosaics could be due to genome rearrangements, misassemblies or genomic clone mosaicism. Alternatively, some may represent rare events of trans-splicing, although none has yet been documented in human.

• Inversion: For 128 mRNAs, the split fragments align in the same area of the genome within 300 kb, but on different strands, suggesting an inversion in the cDNA or the genome. Inversions typically require two breakpoints, enough to define the segment to be flipped (in the chromosome or the cDNA).

• Transposition: For 293 mRNAs, the split fragments align in the same area of the genome, within 300 kb, and on the same strand, but the 5’ fragment lies downward of the 3’ fragment. This is known as a transposition in classical genetics. Transpositions require 3 breakpoints, two to define the segment that will transpose and one for the insertion site. Therefore, transpositions are expected to be less frequent than inversions, but we observe they are more than twice as frequent. One possibility would be that there is an active motor or mechanism for such events and that pieces of genes can physically move close by in the genome, much like transposable elements. Another would be that the event occurred before ligation of the insert in the vector, during library construction, then a single breakpoint would suffice. And of course, genome assembly errors could also lead to transpositions.

• Suspected genome deletion: For 283 mRNAs, the alignment is monotonous along the genome, yet a piece of the mRNA fails to match the genome and sticks out unmatched, often in between two exons but sometimes within a single exon. These cases highlight putative deletions in the genome, or possibly for some of the 10 shorter than 35 bp, a length polymorphism. Examples supported by multiple clones were given in the discussion of partial alignments. The average length of unmatched cDNA is 176 bp, and the genome deletion is of course at least as large as the unaligned cDNA stretch. Each program taken separately seldom misses an exon of 12 bp. So, to be secure, we only flagged genome deletions longer than 18 bp, ensuring that it could not escape detection by at least one program, even if the exon contained one or two base differences.

• Variable tandem repeat number: For 893 mRNAs, originating from 704 genes, or about 3% of the genes (see map in supplement 6c), successive fragments of the sequence align in overlapping areas, a region on a given strand of the genome being used more than once, in tandem, in the cDNA. These very frequent events could result from artificial tandem duplication in the cDNA, a mechanistically difficult event, or more likely from partial deletion of tandemly repeated sequence in the genome. They are probably associated to natural copy number polymorphisms affecting tandem repeats. As expected, the longest forms of such variants are enriched in the two libraries that chose to isolate the longest mRNA variants: KIAA (3.0%) and DKFZ (1.7%). In addition, the localization of the event in the gene is not random: it affects almost systematically the 3’ or 5’UTRs, as discussed in paragraph 6.

• Pseudo-split or mRNA bridging two correctly ordered and oriented contigs: 36 such mRNAs were recognized. For about half, the two parts of the alignment lie in separate genomic contigs, but the genome assembly positioned them correctly in the scaffold. For the other half, the two alignments could be connected with no geometrical anomaly on a single contig, but the best scoring alignment was discontinuous.

6: Splice usage in Gold alignments: the frequency of structural polymorphisms and intronless genes is unexpectedly high.

Let us now analyze the splicing pattern of the human genes (supplement 6). Limiting our analysis to the excellent Gold alignments, we find that only 76.5% have standard introns, 21.6% are intronless and the remaining 1.9% have only non-standard introns, micro-rearrangements or alignment gaps. The frequency of non-standard introns, 1.23%, is high, and affects 7.1% of the excellent alignments. These findings were unexpected, and we will discuss successively the standard introns types and size distribution, the non-standard introns and their relation to micro-rearrangements, and the intronless genes.

An excellent Gold with introns has on average 7.9 standard introns per alignment. Examination of the types of intron boundaries shows that most of the 406,003 introns follow the consensus rule for splice junctions, a feature only slightly favored by our cost function: 97.52% have standard gt-ag intron boundaries, 1.08% the gc-ag variant and 0.13% (534 cases) the at-ac boundaries known to be spliced by the minor U12 spliceosome. As we had shown before on a smaller sample [19], the length distribution of the standard introns is far from normal (Figure 3a): it starts sharply at 65bp, peaks in the 81 to 95 bases range, decreases to around 160 bp to end in a long flat tail driving the average intron size to 5,372 bp.

Another remarkable feature in Figure 3a is the distinctive size distribution of the non-standard introns, which peaks in the 5 to 8 bp range and has very little overlap with that of the standard introns. Indeed 86% of the introns below 65 bp are non-standard, and the others are probably standard by chance, because the precise position of an intron may slide and our cost function favors a solution using one of the three standard intron types.

How could we explain the large number of non-standard introns? We propose that at least the 33% short ones, below 65 bp, are micro-deletions in the cDNAs or micro-duplications in the genome, akin to the micro-deletions in the genome that are associated to “variable tandem repeat number” rearrangements. In support of this idea, we observe that the non-standard introns contribute most to the 3,139 ‘alternate alignments’ (supplement S2a), where programs give different splicing structures with the exact same score by choosing different coordinates for the non-standard “intron”. These non-standard pseudo-introns can slide because of the repetitive nature of the underlying sequence. Direct sequence analysis of the short pseudo-introns confirms that one third are pure dinucleotide repeats, mainly and almost exclusively GT, TT, AC, AA, AT (in this order); longer repeats are seen in others.

In fact, within an array of genomic repeats, reciprocal events such as unequal recombination will lead symmetrically to both micro-duplications (“variable tandem repeat number”) and micro-deletions (very short introns). Events generating micro-deletions are less constrained in sequence relative to the cDNA alignment, because they do not need to be exact tandem repeats, so they should appear more frequent. Figure 3b plots the two types of events on the same diagram, and strongly suggests that a majority of short non-standard and standard “introns” do not result from splicing, but rather correspond to the expected micro-rearrangements. These micro-introns would have no reason to follow the splice rules, except by chance. Interestingly, if we restrict the analysis to the intron lying in the last exon, therefore usually in the 3’ UTR region (supplement 6b), we observe that micro-introns and variable tandem repeat number rearrangements are ten times enriched there relative to any other position in the transcript. On the other hand, the prevalence of genes harboring one or the other type of event is impressively high, 8.7% (2,535 genes out of 29,174)!

We propose that the micro-rearrangements (micro-duplications and microdeletions) represent mostly natural structural polymorphisms. Polymorphic variations in number of tandem repeats, for example di or trinucleotides, are known to occur naturally. Their biological importance is exemplified by their implication in many genetic diseases (at least 15 so far). As we confirm, the majority occur in the transcribed untranslated gene regions, where they could play a role in regulating transcription level or messenger stability or translatability. One mechanism recently elucidated involves pairing to micro RNAs, as was first described in the nematode (e.g. [25]). Another interesting possibility is silencing mediated by the HP1 protein, which also participates in position effect variegation [26].

A second important factor generating non-standard introns is partial deletion of the insert, which according to AceView accounts for at least half of all non-standard introns of all sizes. These are the predominant cause for non-standard introns with feet in two separate exons, and generally for pseudo-introns in coding regions. Partial deletions of the insert commonly occur in the bacterial host and may be selected for, because bacteria with shorter insert clones often have a growth advantage. To identify such putative insert deletions, we use the AceView human genome database including all mRNAs and ESTs [2]. In that database, if a non-standard intron is supported by a single cDNA clone and both its ends fall within exons of other transcript variants, we flag the pseudo-intron as a suspected internal deletion. These annotations transfer back to 2,634 Gold cDNAs, leaving at most 0.4% non-standard introns unaccounted for. A few of those might truly represent novel splice junction usage. Indeed, in the worm C. elegans, we have definitive evidence that non-standard splice junctions are used for a number of well characterized genes, in an alternative or constitutive way, a fraction of the junctions, but not all, being single base variants of gt-ag (Kohara, Thierry-Mieg, Sulston et al, unpublished).

The third puzzle is the high frequency, 21.6%, of mRNAs that align over their entire length at excellent quality but without introns. Some could be insidious contaminating genomic fragments that we tried to identify as follows. We look for the absence of introns, the lack of a substantial ORF (less than 20% of the length of the corresponding AceView transcript, eventually lengthened by concatenation with other mRNAs and ESTs), the presence of a 3’ A-rich stretch (more than 12 A in a row in the genome just 3’ to the mRNA) and the absence of a polyadenylation signal in the correct position (AATAAA or any single base variant in the last 35 bp of the transcript). We demand in addition that the transcript does not belong to a gene with a locusID (it would then more likely be an unspliced by-product of transcription) and that it be supported by less than 5 clones in the complete database, four times less than the average comparable transcript in the current AceView. Finally, we rescue the few where the small putative protein has homologies to other known proteins, or has special features that make it a candidate for being a real protein, such as a signal peptide or a transmembrane domain. Altogether 589 accessions (0.8% of mRNAs aligned at this quality) have been flagged as possible genomic contaminations: chances are that they are primed on A rich regions of traces of genomic DNA contaminating the RNA preparation rather than on actual poly-A in the reverse transcribed mRNA. Some are found in all unfiltered libraries (List in S1b).

This leaves us with 20.7% mRNAs with intronless alignments. In AceView, we observe that 45% of these actually belong to genes with introns: 27% appear to be partial and encode only the last exon, and 18% represent intronless alternative variants, usually interpreted as immature transcripts. The remaining 7,599 intronless cDNAs belong to 6,388 intronless genes, which represent 24% of all genes with excellent Gold alignments, an intriguingly high percentage that should trigger new experimental studies.

7: Chromosomes differ in their frequency of single base mismatches and rearrangements.

The 72,198 excellent or good Gold alignments offer a sensitive probe to measure the mRNA-to-genome match, providing us with more than 170 Mb of high quality cDNA sequence aligned on finished genome sequence.

To our surprise, significant interchromosomal variations occur (Figure 4a; supplement 7) when we monitor the 276,340 single base mismatches, insertions and deletions identified using Smith Waterman. Chromosome X has the best mRNA-to-genome match, with on average 0.5 differences per kb fewer than chromosomes 21 or 22. Chromosome 21 is the most deviant of autosomes, with 0.36 differences per kb in excess of the best matching autosome, chromosome 11. If we accept that mutations or sequencing errors occur in the current genome sequence at a constant and low rate across chromosomes, of the order of 0.05/kb, and if we take into account the fact that, although there are mutations or sequencing errors in the cDNA sequences, these should not a priori depend on the genome map position of the gene, then most of the variations evidenced in Figure 4a should reflect true interchromosomal differences in the level of single nucleotide variations.

Our focus here is exclusively transcribed genes, yet our results are in line with previous observations showing the existence of large scale regional neutral mutation rate variations [27, 18]. For example, it is well known that mutations occur more frequently in the male germ line and we observe the expected bias, X < autosome < Y, despite the diluting effect of the pseudo-autosomal region. We also confirm that chromosomes 21, followed by 22, 18, 19, 7, 8 and 4 have the highest single base variation rates among autosomes.

Similarly, the frequency of structural rearrangements (Figure 4b) shows mild variations per chromosome; once again the X chromosome has fewer rearrangements. Our hope is that the lists and maps of accessions (supplement 7a2) might become useful, in the ultimate phase of verification of the genome sequence, to distinguish between local variations reflecting natural polymorphisms [18] and defects during cloning or genome assembly.

Note that initially, chromosome 14 seemed to be among the worse for both criteria, but that was in large part due to the 1,100 kb highly expressed immunoglobulin region, known to be hyper variable and somatically rearranged. Thus we decided for this analysis to identify and mask all distinctly highly polymorphic loci, this region on 14, the immunoglobulin region on chromosome 2 (450 kb), some of the HLA genes on chromosome 6 (a total of 550 kb), and the immunoglobulin region on chromosome 22 (1000 kb) (lists in supplement 3b). Altogether 379 accessions map best in these regions, 52 at excellent quality, 60 at good, 22 at partial, 105 at dubious and the remaining 140 at bad quality.

8. Identifying the problems through alignment programs comparison.

Programming is difficult: the human genome is large and requires parallel processing and multiple ancillary data files, and computers are temperamental. As a result, it may happen that some mRNAs, some genome fragments, or some intermediate results get lost without being noticed. Such problems, often referred to as bookkeeping errors, are uninteresting but they have a large impact on the quality of the results. A clear advantage of our collaborative approach is that we are fully immune to bookkeeping errors, because they are not reproducible across laboratories.

The second problem is the existence of programming bugs which only become apparent in marginal and rare conditions. They are hard to spot. For example, there may be a stretch of NNN in the middle of a quasi finished piece of genome, and this complication may have been overlooked. The alignment problem is so complex that there are many such pitfalls. In our approach, these wild events can be identified because they are not reproducible across methods.

Then come the more conceptual limitations. Each program implements its idea of the gene, and by comparing the different programs, we compare these implicit models. For example, most of the introns are bounded by gt-ag, but a few are bounded by gc-ag or other variants. If a program forces gt-ag in all cases, some of its alignments will not optimally match the genome. Another example is that of exactly repeated alignments: if a gene maps in two loci, this should be made explicit because it has important biological implications, yet many programs forbid it. By combining all programs, we increase the total number of gene copies! Another case is that of the rearrangements, which are frequent. They break the central rule of colinearity and are hard to balance. If one allows breakpoints, one may allow too many. Other challenging features include long genes extending over 800 kb, introns that are either long (>300 kb) or very short (from 6 to 12 bp), and micro-exons (below 6 bp): we provide lists of accessions with examples of these difficulties in supplement 8. Comparing independent alignment programs reveals the excess or the lack of desirable features.

In Gold, we tuned the cost function not to match our preconceived idea, which we had implemented a priori in our own code, but to instead match the collaborative best picture which came from the union of all codes. Namely, we paid great attention to the cases where the programs disagreed. This helped deepen our understanding of the human genes.

Another benefit of the comparative approach is to allow the identification and classification of examples where a program is suboptimal. To illustrate the procedure, we give in supplement 8 the comparison between the three best programs: AceView, UCSC and NCBI. Of course, programs are in a rapid phase of improvement, so the snapshot shown in the tables reflects the period from February to August 2004 and will certainly evolve. Currently, AceView is the most precise and finds the Gold alignments most frequently (92.98% Gold, or 97.75% within 5 points from Gold) while UCSC is the most exhaustive and gets the largest proportion of alignments close to Gold (91.68% Gold, 97.95% within 5 points from Gold).

If we consider only the cDNAs for which at least one program provided an excellent Gold alignment, any program doing less well missed a subtlety or had a bug. When we compare the second alignment to the Gold reference, the mRNA either maps in the same locus but with a different proposed exon structure, or maps in another locus, or is not aligned at all. In the first case, we can be more specific. There may be too few or too many exons or the exon boundaries may be chosen differently, and it is instructive to separate the first, central and last exons. As expected, all programs have some errors in each category, with some biases (supplement 8b). We reported these detailed lists of putative bugs to the authors of each program, who usually found them useful because they facilitate debugging. This helped us in particular to significantly improve the quality of the AceView alignments for build 35.

In turn, the few cases where the Gold choice seems disputable are used to improve the cost function, and we need feedback to the bug reports! This is why we conceived Gold as a benchmark: we hope to receive further contributions from any interested party. If people find this approach interesting and send new alignments, we will incorporate them in Gold if they win, and will send privately the debugging lists. The annotation process is dynamic and updated versions of the Gold alignments will be made available as data and programs improve. We hope that, using this work, all mRNA to genome alignment methods will soon converge toward a clear and solid picture of the structure and organization of the human genes.

Conclusions

Our long term goal is to understand the human transcriptome, its structure, its polymorphisms, its expression, its regulation and its functions. This work focuses on the critical, but not yet consensual first step: the mRNA to genome alignment. The GOLD cooperative approach presented here chooses the best alignment from a collection of programs, including the most prestigious, on a cDNA per cDNA basis. The solution to different problems requires different, sometimes incompatible strategies, but in Gold we benefit from the combined brainpower behind all programs. Gold is therefore empowered with hybrid vigor, and provides better alignments than any program taken separately.

We selected 74,106 cDNAs corresponding to the four large scale human transcriptome projects and annotated some basic properties of their Gold alignments; we hope that these more precise alignments will be used by others for in depth analyses. We are of course open to collaborations.

In the course of the analysis, we tentatively identified “contaminating” sequences, vectors, poly-A or others (supplement 2), which tend to be aligned on the genome as spurious supplementary exons. We would welcome comments and corrections on these lists. At the same time, we actively looked for trans-spliced leaders and confirm that there are none in human: a majority of clones in the oligo-cap selected FLJ collections start at the capped first base of the transcript, and would undoubtedly evidence recurrent 5’ additions, yet the only extra 5’ sequences routinely found in these libraries are poly-A stretches, which occur in 1.8% of the oligo-capped cDNAs. The significance of the 5’ poly-A additions is unknown.

The lists of possible defects in the cDNAs may help select the best clones for use in future experiments: researchers interested in using the cDNA clones for protein experiments should prefer clones with an identical CDS on the cDNA and on Gold, and be cautious with clones evidencing frameshifts in the genome or with rearranged clones. They should be aware of ambiguities in the genomic map position in the case of duplicated genomic loci, and all these details are explicitly annotated in Gold. On the other hand, we have not yet labeled the best representative clone for each transcript variant, annotated the completeness of its CDS, categorized single base differences as putative polymorphic variations or suspected mutations in the clone, or identified an eventual match in the RefSeq collection, because all these annotations require mRNA and EST clustering and co-alignment, something we only do in AceView so far.

However, we have gathered some elements on the relative levels of natural polymorphisms and cloning mutations or sequencing errors. In fact, one of the libraries, KIAA, has the least differences from the genome, with only 0.91+/-0.02 single base difference per kb, and its alignments uncover more frameshifts or Stop mutations in the genome than in the cDNAs (supplement 1c): the quality of its sequence is therefore equal or better than that of the genome sequence, evaluated at 0.05/kb. The average level of natural single base polymorphisms genomewide would therefore be of the order of 0.86 per kb, including mismatches and single base insertion or deletion. By comparison, the MGC reference, MGC complete, DKFZ and FLJ collections might harbor respectively 0.11, 0.36, 0.94 and 1.02 mutations or sequencing errors per kb (supplement 1a).

We also annotated the presence of micro-rearrangements of type “variable tandem repeat number” and “micro-introns” (less than 65 bp long), which essentially affect the UTR regions, and we propose that they correspond to copy number polymorphisms or structural polymorphisms that may have biological consequences. The very high frequency of such events, affecting almost 9% (2,535 genes /29,194) of the genes currently identified by the large scale cDNA collections, provides an interesting complement to the catalog of SNPs, and could be further analyzed in relation to tissue specificity, phenotype, binding to regulatory RNAs such as micro-RNAs or binding to chromatin remodeling proteins [e.g. 25, 26].

On the genome side, we evaluated the completion and minimal quality of the genome: as of July 2003 (Build 34/hg16), the 2,865 Mb reference genome was only missing between 0.4% and 1.2% of the gene-bearing sequence. 97.4% of the cDNAs were aligning fully (within 1% of their length) on the genome, with few base differences (1.62 per kb on average), except for 15 highly polymorphic loci. However, the genome is not error free and we identified a series of putative frameshift or Stop mutation, but for the reasons mentioned before we did not yet try to identify putative point mutations in the genome sequence. However, we found highly significant interchromosomal variations in the rates of single nucleotide differences (mismatch or insertion/deletion) in the cDNA to genome alignments, up to 0.36 per kb among autosomes, from 0.25 to 0.62 per kb between X and any autosome, and at least 0.85 per kb between X and Y (Figure 4a). In the same vein, local variations in the frequency of rearrangements and partial alignments are also observed (Figure 4b).

Because there are two possible causes for interchromosomal or large scale local variations, namely variable quality of the chromosome sequence and intrinsic variability of the single nucleotide polymorphism or mutation rate, we were happy to observe that the variations we see in the transcribed genes parallel those observed for selected sets of neutral base substitutions and rearrangements, which vary together with the rate of meiotic recombination in megabase-sized segments and may correspond to hotspots of mammalian chromosome evolution [18, 27]. Conservation of the phenomenon across species is a good argument. Nevertheless, since all these studies are made on the same human genome, the question of whether the variations correspond to regions naturally more polymorphic or to regions where the genome sequence and assembly are of lower quality remains somewhat open. More experiments will be required to distinguish the two hypotheses.

The analysis of exact and quasi exact gene duplicates, which amount to close to more than 1% of all alignments, is also striking. Multiple aspects of the control of gene duplications appear chromosome specific, from the overall level to the organization of the duplicate gene pairs, in tandem or in palindrome. The frequency of duplicated genes per chromosome is highly heterogeneous, but this difference seems reliable since it does not correlate to the sequencing laboratory. Interchromosomal duplicated genes almost invariably contain one active gene and one pseudogene, while exact and presumably active duplicates are nearly always on the same chromosome and less than 3 megabases away from each other. The only 2 cases of nearly exact inter-chromosomal duplicates with introns outside the XY pseudo-autosomal region occur at the telomeres of chromosomes 14/22 and 15/1 and would deserve verification, since these exceptions could either be misassemblies or interesting examples of telomere borne duplications.

The existence of a multitude of large scale chromosome specific aspects, including control of the level and organization of gene duplications, control of the local frequency of single base variations and of structural rearrangements, and possibly also control of gene expression in larger neighborhoods raise interesting questions. It is likely that some mechanistic time or space constraints acting on some basic machinery, for example that involved in chromosome replication or chromatin compaction, account for these effects: as a result, all genes are not born equal and their localization on the genome has important implications.

The distribution of exons and introns lengths confirms our previous observations [19], however the existence of very short exons, under 6 bp and bounded by 2 standard introns, needs experimental confirmation. A relatively high frequency of introns with non-standard boundaries is observed, but we propose that one third correspond to micro-polymorphisms and most of the others to putative partial deletions of the insert in the cDNA (supplement 6a), leaving only few confirmed introns that do not follow the consensual splice rules (gt-ag, gc-ag or at-ac).

We observe that 22% of the mRNAs align without introns, after discounting the few (0.8%) that we identified as putative genomic contaminations. However, many are probably full length clones, especially from the FLJ cap-selected collection. This implies that a large proportion of the human genes, around 24%, would have no introns, a distinctive feature also seen in mouse, but not in Caenorhabditis, Drosophila or Arabidopsis [2]. Similarly, some micro-rearrangements may turn out to be artifacts in the cDNA libraries rather than structural polymorphisms, and the large scale chromosomal effects we observed for the duplications and polymorphisms may seem trivial once we know the cause, but in a global genome and gene function perspective, we believe that we have to understand the preeminence of all these unexpected features in the eutherian genes. Recall how spliced genes or micro RNAs seemed awkward not so long ago, before the roles of alternative variants and RNA interference were even imagined. And it is in our view the combination of expected and unexpected which vindicates best, if this was necessary, the interest of the systematic cDNA projects, the experimental material which underlies all this analysis.

Materials and methods

Details on the method, including the raw data, the format to submit alignments, the current lists of poly-A and vector identification, and the C code to select the Gold and do the analyses are provided in the supplementary material document 1. The NCBI version of the acedb database [16] is used throughout this project.

The supplementary online material contains supporting tables of results and explicit lists of occurrences. It consists of one section per section of the article. The entire primary and associated material is downloadable by ftp from the GOLD web site www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD.

List of abbreviations

aa aminoacid

bp basepair

CDS Coding sequence, from Met to Stop

DKFZ The German cDNA Consortium [7]

FLJ Full Length Japanese cDNA project, from NEDO [5, 6]

KIAA Kazusa DNA research Institute cDNA project [3, 4]

MGC Mammalian Gene Collection [8]

ORF Open reading frame: a region of a sequence which in a given frame contains no stop codon

Acknowledgments

We thank our colleagues from the JBIRC informatics team for their involvement in this project since its birth, Guy Slater and Ewan Birney for contributing the exonerate alignments and the list of duplicate sequences, our colleagues at H-inv, IDRIS and NCBI for their help and fruitful interactions, Nicolas Thierry-Mieg for discussions about algorithms and Vahan Simonyan for presentation of the work. We are grateful to David Lipman for his stimulating ideas and insightful comments on the manuscript, and for the highly productive and pleasant environment he and Jim Ostell create at NCBI.

References

1. GOLD, Genomewide Optimization of Locus Description [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly/GOLD] or [http://www.aceview.org/GOLD]

2. AceView: identification and functional annotation of cDNA-supported genes in higher organisms [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly].

3. Nomura N, Miyajima N, Sazuka T, Tanaka A, Kawarabayasi Y, Sato S, Nagase T, Seki N, Ishikawa K, Tabata S: Prediction of the coding sequences of unidentified human genes. I. The coding sequences of 40 new genes (KIAA0001-KIAA0040) deduced by analysis of randomly sampled cDNA clones from human immature myeloid cell line KG-1. DNA Res 1994, 1:27-35.

4. Kikuno R, Nagase T, Nakayama M, Koga H, Okazaki N, Nakajima D, Ohara O: HUGE: a database for human KIAA proteins, a 2004 update integrating HUGEppi and ROUGE. Nucleic Acids Res. 2004, 32:D502-504.

5. Suzuki Y, Sugano S: Construction of a full-length enriched and a 5'-end enriched cDNA library using the oligo-capping method. Methods Mol Biol 2003, 221:73-91.

6. Ota T, Suzuki Y, Nishikawa T, Otsuki T, Sugiyama T, Irie R, Wakamatsu A, Hayashi K, Sato H, Nagai K, et al.: Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat Genet 2004, 36:40-45.

7. Wiemann S, Weil B, Wellenreuther R, Gassenhuber J, Glassl S, Ansorge W, Bocher M, Blocker H, Bauersachs S, Blum H, et al.: Toward a catalog of human genes and proteins: sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res 2001, 11:422-435.

8. Strausberg RL, FeinGold EA, Grouse LH, Derge JG, Klausner RD, Collins FS, Wagner L, Shenmen CM, Schuler GD, Altschul SF, et al.: Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc Natl Acad Sci U S A 2002, 99:16899-16903.

9. The H-inv international database for integrative human gene annotation [http://www.jbirc.aist.go.jp/hinv/index.jsp].

10. Imanishi T, Itoh T, Suzuki Y, O'Donovan C, Fukuchi S, Koyanagi KO, Barrero RA, Tamura T, Yamaguchi-Kabata Y, Tanino M, et al.: Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol 2004, 2:856-75.

11. Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence project: update and current status. Nucleic Acids Res 2003, 31:34-7.

12. NCBI human genome resources [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/genome/guide/human/] and

[ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/hs_esttrn.md.gz] for downloading the NCBI public alignments.

13. Kent WJ: BLAT--the BLAST-like alignment tool. Genome Res 2002, 12:656-664.

14. UCSC Genome Browser [http://genome.ucsc.edu] and

[ftp://genome.ucsc.edu/GoldenPath/hg16/database/all_mrna.txt.gz], for downloading the UCSC public alignments.

15. The ensembl human genome browser [http://www.ensembl.org/Homo_sapiens/]. Note that exonerate alignments are not yet publicly available on this server.

16. The acedb object-oriented database system at NCBI [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/IEB/Research/Acembly/Acedb ].

17. Carlo T, Sierra R, Berget SM: A 5' splice site-proximal enhancer binds SF1 and activates exon bridging of a microexon. Mol Cell Biol 2000, 20:3988-95.

18. Hardison RC, Roskin KM, Yang S, Diekhans M, Kent WJ, Weber R, Elnitski L, Li J, O'Connor M, Kolbe D, et al. : Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 2003, 13:13-26.

19. Skaletsky H, Kuroda-Kawaguchi T, Minx PJ, Cordum HS, Hillier L, Brown LG, Repping S, Pyntikova T, Ali J, Bieri T, et al.: The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 2003, 423:825-37.

20. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921.

21. Bailey JA, Gu Z, Clark RA, Reinert K, Samonte RV, Schwartz S, Adams MD, Myers EW, Li PW, Eichler EE: Recent segmental duplications in the human genome. Science 2002, 297:1003-7

22. Cheung J, Estivill X, Khaja R, MacDonald JR, Lau K, Tsui LC, Scherer SW: Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol 2003, 4: R25.

23. Bailey JA, Church DM, Ventura M, Rocchi M, Eichler EE: Analysis of segmental duplications and genome assembly in the mouse. Genome Res 2004, 14:789-801.

24. Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, et al.: Large-scale copy number polymorphism in the human genome. Science 2004, 305:525-8.

25. Ambros V, Lee RC, Lavanway A, Williams PT, Jewell D: MicroRNAs and other tiny endogenous RNAs in C. elegans. Curr Biol 2003, 13:807-18.

26. Saveliev A, Everett C, Sharpe T, Webster Z, Festenstein R: DNA triplet repeats mediate heterochromatin-protein-1-sensitive variegated gene silencing. Nature 2003, 422:909-13.

27. Ellegren H, Smith NG, Webster MT: Mutation rate variation in the mammalian genome. Curr Opin Genet Dev. 2003, 13:562-8.

Table 1: Properties of duplicate alignments.

The normalized number of inter and intra-chromosome duplicates is recorded, for exact and almost exact duplicates, then (after the dash /) for exact duplicates only, together with information on the presence (Yes) or absence (No) of standard introns in one or both copies. The yes-no category corresponds to retroposon type pseudogenes. Only pairs with excellent alignments in finished areas of the genome are considered (out of 66,018 such alignments).

Mapping on…	Duplicate alignments (all/ exact )	Existence of introns in the copies (exact or almost / exact only)
Mapping on…	Duplicate alignments (all/ exact )	Yes-Yes	No - No	Yes - No
Different chromosomes The same chromosome Total outside pseudo-autosomal	101 / 19 717 / 390 818 / 409	4 / 0 462 / 243 466 / 243	32 / 6 255 / 147 287 / 153	65 / 13 0 / 0 65 / 13
Pseudo-autosomal XY	90 / 90	78 /78	12 / 12	0 / 0

Table 2: Rearranged Gold alignments.

Classification of the rearranged alignments of combined score excellent, good or partial (out of 72,691 mRNAs). ¹Note that nearly 7% of the clones with rearranged alignments have rearrangements of more than one type, hence the total is not additive. Lists of accessions are available in supplement 5.

Type of rearranged alignment

mRNAs

% mRNA

Mosaic

Inversion

Transposition

Suspected genome deletion

Variable tandem repeat number

Pseudo-split or bridgeable genomic contigs

1,197

128

293

283

893

1.65%

0.18%

0.40%

0.39%

1.23%

0.05%

Total rearranged¹

2,637

3.63%

FIGURE LEGENDS

Figure 1: Quality of the Gold alignments

The quality of the alignment is defined as a function of the percent length aligned and the number of single base differences per kb (single base mismatch or single base insertion/deletion between the cDNA and the matching genome). The number and percentage of the mRNAs at each quality is shown (total 74,106). Each exact repeat counts as 1. Rearranged alignments (e.g. from mosaic or rearranged cDNA or genome) count once at their combined rationalized score. Lists of accessions and more details are online, in supplement 3.

Figure 2: Frequency and map on the genome of identical or nearly identical duplicate genes.

404 mRNAs, with excellent quality Gold, map in multiple loci, both in finished areas of the genome (July 2003). The resulting 908 alignments are represented here. a) Duplicates for which both copies map on the same chromosome (in blue) or on different chromosomes (in red) are shown as a percentage of excellent alignments on each chromosome. The presence or absence of standard introns is coded by a pattern: wide oblique stripes if both copies have introns, solid color for putative pseudogenes, and narrow stripes for pairs of intronless duplicates. The X and Y chromosomes are highly enriched in gene duplications, affecting 8.7% and 73.9% of their alignments respectively: for convenience, the data for X is shown divided by 2 and for Y divided by 20. Notice the huge variations per chromosome and the scarcity of interchromosomal duplicates that do not behave as pseudogenes.

b) Map of the 908 duplicate alignments on the genome. Each horizontal line represents a chromosome, in megabases. A cDNA alignment is represented at its physical map position by a colored line, blue for intrachromosomal duplications, red for interchromosomal, green for a cDNA with both inter and intrachromosomal duplications. The presence or absence of standard introns is also encoded: on top of the line, a v indicates the copy at this site has introns, a plain bar l that it is intronless. The bottom of the same glyph, below the chromosome line, shows the type of the duplicate gene(s) at the alternate location(s): _^ if all copies elsewhere have introns, no sign if all are intronless, o if some have introns, some don’t. Chromosomes 18 and 21 do not contain any duplicates and are omitted from the diagram. See supplement 4b for a similar map showing only exact duplicates.

Figure 3: Histograms of the size of standard introns, non-standard introns and micro-rearrangements.

Analysis of introns and micro-rearrangements below 300 bp, in 52,089 excellent alignments with introns. a) Histogram of intron size, limited to introns below 300 bp (18.6% of a total of 406,003). 73,081 standard introns (gt-ag, gc-ag or at-ac boundaries; in pink) and 2,737 non-standard (any other boundaries, in blue) are shown. Notice the sharply distinctive size distribution of standard versus non-standard introns. b) Size histogram of the 2,737 non standard introns, in blue, complemented, on the negative side, by the 493 micro-rearrangements of less than 100 bases and of type “variable tandem repeat number” (in green): a value of -20 bp corresponds to a micro-duplication where the cDNA contains 20 bp of the genome twice, in tandem. We propose that microintrons below 65 bp and variable tandem repeat numbers are two manifestations of the same event, structural micro-polymorphisms.

Figure 4: Comparison of the quality of the mRNA-to-genome alignments, per chromosome (Build 34/July 2003).

a) Rates of single base differences (mismatch or single base insertion or deletion) per kilobase of cDNA-to-genome alignment are measured on each chromosome for excellent and good Gold, aligning completely outside of the highly polymorphic loci (masked), and not rearranged. All chromosomes except the Y have between 750 and 6,963 such alignments, constituting a test for 1.7 to 16.5 Mb of sequence (Y has only 122 alignments, over 250 kb). The standard errors (confidence 95%; 2s) are indicated by a bar (data in supplement 7a3). b) Rates of occurrence of rearrangements, assembly problems or possible polymorphisms may be compared across chromosomes through the frequency of structural rearrangements seen in the Gold alignments, per 100 kb aligned at excellent, good or partial quality. From top to bottom: Frequency per chromosome of partial alignments, mosaic, inversion, variable tandem repeat number, transposition and suspected genome deletion. Corresponding lists are in supplement 7a2 and genome maps in supplement 7b.