SUPPLEMENTARY MATERIAL

associated with the article

GOLD: a cooperative approach to choose the best mRNA to genome alignment and what we learn about human genes and the genome.

by Jean Thierry-Mieg et al. August 11, 2004

 

Updated versions of this document will be available on the GOLD web site: www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD

 

Table of contents

1. The selected cDNAs and genome, and the various contributing alignment programs.
   a. The test dataset: cDNA accessions and genome
   b. The cDNA projects: analysis of the properties of the GOLD alignments, per cDNA project: quality of the alignment, structural anomalies, submission defects and splicing anomalies (includes 4 figures and associated tables of lists).
   c. Analysis of suspected frameshift errors, in the cDNA or in the genome (with lists).
   d. Representation of GOLD in the NCBI RefSeq project (lists of genes missing from RefSeq).
   e. The seven contributing alignment programs (and alignments from UCSC, NCBI and AceView).
   f. The .ali alignment format

2. The GOLD selection procedure and results
   a. The GOLD cost function: lists of poly-A and vector clipping sites, and alternate alignments.
   b. An open question: the shortest exons
   c. The GOLD selection program
   d. The current GOLD alignments

3. Quality of the GOLD alignments and inferences on the quality of the genome.
   a. Classification of the mRNAs by the quality of their GOLD alignment, with lists of cDNAs from masked polymorphic areas.
   b. Map of all GOLD alignments on the genome (highly polymorphic loci highlighted).
   c. Map of partial alignments.
   d. Lists and map of suspected frameshifts in the genome.

4. Multiple alignments identify duplicated genes or genome assembly problems.
   a. Lists of exact and nearly exact duplicate alignments (excellent, good or partial).
   b. Genome map of the exact duplicate alignments
   c. Figure 4: Classification of the duplicate alignments, in %alignments per chromosome: intra versus inter-chromosomal duplicates, and presence of introns in 0/1/2 members of a pair, pseudo-duplicates in unfinished genome areas.
   d. Supporting tables for Figure 4: Table of properties of excellent quality duplicate alignments, per chromosome; intron exon structure and relative organization of the excellent duplicate alignments, per chromosome.
   e. Functional analysis of the duplicate genes

5. Split or discontinuous alignments identify genome polymorphisms or assembly errors, or cDNA rearrangements.
   a. List of cDNAs with rearranged alignment.
   b. Genome map of the rearrangements.

6. Intron properties: Splicing in GOLD alignments: The frequency of structural polymorphisms and intronless genes is unexpectedly high.
   a. Statistics on the presence or absence of introns in the GOLD alignments, and on the types of intron boundaries.
   b. Size histograms for standard introns, non-standard introns and other micro-rearrangements (e.g. variable repeat number polymorphisms), restricted or not to those in the last exon.
   c. Genome map of variable repeat number polymorphisms and anomalous introns that do not correspond to suspected internal deletions in the cDNA.

7. Comparing the quality of the sequence of the chromosomes
   a. Tables and lists of cDNA per quality; includes cDNAs from masked areas; tables and lists of structural defects per chromosome.
   b. Human Genome Project Coordinators by Chromosome.

8. A snapshot of the compared performance, strengths and weaknesses of public alignment programs displayed at UCSC, NCBI 34.3 and AceView.

 

This supplementary online material consists of one section per section of the article. Each contains tables in which the numbers of mRNA alignments with a given property are linked to an html document listing the corresponding annotated mRNAs. These tables show, for each accession, the main features of an alignment compared to the GOLD: the length, position and number of errors of the match, the number of exons and exact map position on the genome are also reported.

There may be some redundancy in the lists, because they are aimed at groups with different interests. The rearranged alignments, for example, appear by themselves in paragraph 5, sorted by rearrangement type: mosaic, inversion, transposition, variable repeat number, genome deletion. They appear again in paragraph 1, sorted by cDNA project, for the use of that interest group. They also appear in paragraph 7, sorted by chromosome, because they may be useful to genome sequencing centers for rechecking some areas.

The raw data, the GOLD alignments, the program and the acedb database schema are provided in the first paragraph. The entire primary and associated material is downloadable by ftp from the GOLD web site as a single tar.gz file (about 50 Mbytes, plus 800 Mbytes for the genome), which expands as an autonomous hyperlinked collection of html pages.

 

 1: The selected cDNAs and genome, and the contributing alignment programs

 

1a The test dataset: cDNA and genome

We welcome submissions of alignments to the benchmark, in the format explained below.

The test data set is available online here and on the web site www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD .

           The cDNA data set (cDNA.fasta.gz) contains the DNA sequences of 74,106 cDNAs, as they were in GenBank on February 10, 2004.

           We also provide just the list of 74,106 mRNA GenBank accessions used in this project

We have used the genome from NCBI build 34, July 2003, which also supports NCBI annotations release 34_3 (Feb 2004 to August 2004). It consists of the 24 chromosomes and the unmapped contigs. The downloadable fasta files are:

The chromosome fragments exported here were reconstructed using the NCBI agp file, and are identical to the main chromosome fragments available from UCSC, except that we have not remapped the floating NT contigs into the random contigs used by UCSC. The genome is exported here as a pair of large fasta.gz files (each expands to a little under 2 Gb), which should be downloaded by ftp.

 

 1b The cDNA projects

A review of the various cDNA projects can be found in Imanishi et al., 2004 [10]. The 4 diagrams below, and the support tables linking to the lists of accessions, show some properties of the Gold alignments (as discussed in the article). A FLJ filtered protein coding set, including 22,167 of the 29,923 accessions is also available [Isogai and Nishikawa, personal communication].

 

Figure 1.1: Quality of the Gold alignments, by project.

Figure 1.1: Quality as defined in Figure 1 of the article, for all clones in the 4 collections, except those from hypervariable regions, which are masked; in total: 2057 KIAA (D or AB accessions), 32,723 MGC (BC) of which 14,867 belong to the reference set, 9,228 DKFZ (AL or BX) and 29,711 FLJ (AK) accessions, of which 21,959 belong to the FLJ filtered set.

Lists of accessions by quality and by project:

Absolutely all clones are considered here. Note: the masked clones may align at any quality, from excellent to bad.

Project        #cDNA    Excellent  Good    Partial  Dubious  Bad   Masked  Unaligned
KIAA           2,058    1,985      43      23       6        0     1       0
MGC.filtered   14,961   14,168     530     115      32       10    94      12
MGC            32,859   29,932     2,071   420      162      87    136     51
DKFZ           9,258    8,188      843     133      42       10    30      12
FLJ            29,923   25,856     3,163   311      215      108   212     58
FLJ.filtered   22,167   19,575     1,884   241      134      74    208     51
other          8        5          0       0        2        0     0       1
Total          74,106   65,966     6,120   887      427      205   379     122

 

Figure 1.2: Structural anomalies in the genome or the cDNA, per 100 kilobases of mRNA aligned in excellent, good or partial GOLD alignments, per project.

 

Figure 1b: Partial or rearranged Gold alignments. From left to right: suspected genome deletion, transposition, inversion, mosaic, partial alignment and suspected genomic contamination, as defined in paragraphs 5 and 6 of the article. Only excellent, good and partial alignments outside of the highly polymorphic regions are considered here.

Lists of accessions by rearrangement type:

Partials may be found in the previous list. Perhaps the essential observation from this type of data is that only a small number of cDNA clones in these collections are affected by any kind of rearrangement or anomaly (*): 5.0% of the KIAA accessions, 3.4% of the MGC (2.2% of the selected MGC), 5.9% of the DKFZ and 10.1% of the FLJ. The numbers in Figure 1b are per kb aligned, and include genome but not cDNA suspected deletions, hence the difference.

Project        cDNA     Structural problem in cDNA*  Suspected genome deletion  Transposition  Inversion  Mosaic  Suspected genomic contamination
KIAA           2,051    103                          11                         5              17         28      1
MGC.filtered   14,813   326                          24                         4              6          58      0
MGC            32,423   1,108                        84                         38             36         336     116
DKFZ           9,164    543                          30                         21             38         140     99
FLJ            29,330   2,970                        140                        230            30         688     373
other          5        3                            0                          0              0          3       0
Total          72,973   4,727                        265                        294            121        1,195   589

*Putative structural problems in cDNA include transposition, inversion, mosaic, suspected genomic contamination and suspected internal deletion (partial) of the insert. The other anomalies (suspected genome deletion, partial alignment, or variable repeat number polymorphisms) are not counted here, because they have a good chance of reflecting structural problems in the genome. However, recall that all types of structural defects may be due to anomalies in either the cDNA or the genome.

 

Figure 1.3: Defects in the sequence submission to GenBank/DDBJ/EMBL

Figure 1.3: Defects in the sequence submission to the public databases: improper clipping of vector or other putative 5' exogenous sequences (4,927 in total), accessions clearly submitted from 3' to 5', on the wrong strand (114), and identical sequences, submitted in duplicate under different clone names (787). Only cDNAs with excellent, good or partial alignments outside of the hyper variable regions are considered.

We flagged only 114 submissions on the wrong strand, because this project does not use protein annotation or clustering. But we believe, based on AceView complete gene analysis, that this defect is more frequent (0.8%). Another problem is the presence in GenBank of exactly identical (redundant) sequences submitted as different accessions. Some 39 were submitted twice by MGC, with the same sequence under the same cDNA name. These were clearly bugs in the submissions and were filtered out from our initial selection. But an additional 787 exactly redundant sequences, from clones with different names but from the same project, most of them from MGC, were pointed out to us only recently, by Guy Slater, from EBI. These were kept in the analysis and show up as duplicate clones in Figure 1.3.

Lists of accessions with submission defects, by project:

Project        Out of #cDNA  Unclipped vector  Submitted on wrong strand  Duplicate clone
KIAA           2,051         10                1                          1
MGC.filtered   14,813        1,316             1                          377
MGC            32,423        3,528             56                         721
DKFZ           9,164         637               15                         40
FLJ            29,330        751               42                         25
other          5             1                 0                          0
Total          72,973        4,927             114                        787

 

Figure 1.4: Number of splicing anomalies, per 100 kb aligned.

 

Figure 1.4: Copy number polymorphisms of type variable repeat number (micro-deletion in the genome) and all non-standard introns, split among those recognized as suspected partial deletion of the insert by AceView, and the other non-standard introns (as explained in paragraph 6 of the article). Only cDNAs with excellent, good or partial alignments outside of the hyper variable regions are considered.

Lists of accessions with non-standard intron boundaries, variable repeat number polymorphisms, and also intronless GOLD alignments, per project:

Project        Out of #cDNA  Variable repeat number polymorphism  Suspected internal deletion  Other non-standard intron*  Aligns with no intron+
KIAA           2,051         62                                   54                           143                         86
MGC.filtered   14,813        93                                   260                          488                         675
MGC            32,423        249                                  592                          1,281                       4,742
DKFZ           9,164         156                                  257                          656                         2,498
FLJ            29,330        429                                  1,731                        2,069                       8,827
FLJ.filtered   21,700        287                                  1,535                        1,448                       3,970
other          5             0                                    0                            0                           1
Total          72,973        896                                  2,634                        4,149                       16,154

*Not flagged as suspected internal deletion of the insert by AceView. +In the rare cases where there are exactly duplicate alignments (repeated genes) and one is intronless, it counts in this column (by convention).

 

1c. Identifying putative frameshift errors in the cDNA or in the genome sequences:

 

The frequency of suspected frameshifts per library was monitored independently, by measuring the difference in length between the ORF read on the cDNA clone and on its genomic GOLD footprint: as hoped for, the MGC reference collection has, on average, identical lengths in cDNA and Gold; FLJ comes next, with an average increase of 6 residues in Gold; the non-reference MGC increases by 9 and DKFZ by 13. Interestingly, KIAA in fact loses 2 residues on average in Gold, because these clones reveal more frameshifts or mutations in the genome than in the cDNA.

 

We selected accessions (from cDNAs with excellent quality alignment, with no rearrangement and not belonging to a repeated gene) for which the ORF length is at least 100 amino acids smaller than on the genomic Gold footprint. The corresponding list of 1,536 suspected frameshifts in the cDNAs is provided here. The complementary situation, in which the ORF length on the cDNA is at least 100 amino acids larger than on the genomic Gold, points to 171 suspected frameshifts in the genome. The corresponding genome map is here.
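The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the program used for GOLD: the function names are hypothetical, and the ORF is assumed here to be the longest stop-free stretch over the three forward reading frames, a definition the text does not spell out.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq):
    """Longest stop-free stretch, in codons, over the three forward frames.

    Assumption: this stop-to-stop definition of the ORF is ours.
    """
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0          # a stop codon closes the current ORF
            else:
                run += 1
                best = max(best, run)
    return best


def flag_frameshift(orf_cdna_aa, orf_genome_aa, min_diff=100):
    """Apply the 100 amino acid threshold described in the text.

    Returns "cDNA" when the ORF is at least min_diff residues shorter on the
    cDNA than on the genomic footprint, "genome" in the opposite case, and
    None otherwise.
    """
    if orf_genome_aa - orf_cdna_aa >= min_diff:
        return "cDNA"
    if orf_cdna_aa - orf_genome_aa >= min_diff:
        return "genome"
    return None
```

For instance, a clone whose ORF is 250 residues on the cDNA and 400 on its genomic footprint would be flagged as a suspected frameshift in the cDNA.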

 

A comparison of the levels of the two types of frameshifts, in the cDNA and the genome, is shown in the table below, per library. Frameshifts appear to be about 10 times more frequent in the cDNAs than in the genome, except when selective criteria are applied, through informatics, as in MGC reference, or through molecular biology, as in KIAA.

 

Table 1.2: Suspected frameshifts evidenced in cDNA and genome, per library

 

cDNA project   #cDNA considered  #cDNA with ORF length on genome>cDNA  %suspected frameshift in cDNA  #cDNA with ORF length on cDNA>genome  %suspected frameshift in genome
FLJ            24,991            625                                   2.50                           60                                    0.24
DKFZ           7,998             300                                   3.75                           32                                    0.40
MGC other      15,407            522                                   3.39                           22                                    0.14
MGC reference  13,955            77                                    0.55                           34                                    0.24
KIAA           1,936             12                                    0.62                           23                                    1.19
Total          64,289            1,536                                 2.38                           171                                   0.26
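The per-library percentages in Table 1.2 are simple ratios of the counts to the number of cDNAs considered. The short Python sketch below recomputes them from the counts given in the table; the dictionary and function names are ours, chosen for illustration only.

```python
# Counts copied from Table 1.2:
# library -> (#cDNA considered, ORF longer on genome, ORF longer on cDNA)
TABLE_1_2 = {
    "FLJ": (24991, 625, 60),
    "DKFZ": (7998, 300, 32),
    "MGC other": (15407, 522, 22),
    "MGC reference": (13955, 77, 34),
    "KIAA": (1936, 12, 23),
}


def frameshift_rates(considered, in_cdna, in_genome):
    """Percent of suspected frameshifts in the cDNA and in the genome."""
    pct_cdna = round(100.0 * in_cdna / considered, 2)
    pct_genome = round(100.0 * in_genome / considered, 2)
    return pct_cdna, pct_genome


for library, counts in TABLE_1_2.items():
    pct_cdna, pct_genome = frameshift_rates(*counts)
    print(f"{library:15s} cDNA {pct_cdna:5.2f}%   genome {pct_genome:5.2f}%")
```

The KIAA row is the only one in which the genome percentage exceeds the cDNA percentage, in line with the remark above that KIAA clones reveal more frameshifts in the genome than in the cDNA.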

 

 

1d. Representation of the GOLD mRNAs in RefSeq

 

As of August 2, 2004, the NCBI RefSeq reference sequence collection does not yet report 49% of the genes identified by the cDNAs from the four large scale projects. A number of reasons why this may be so come to mind: to create a new NM accession associated to a validated LocusLink gene, RefSeq prefers coding genes, and in principle imposes that the CDS should be complete. It favors genes with introns, avoiding a chance of validating possible genomic contaminations or unspliced variants, and it is biased for long CDS. The ideal cDNA would align on the genome at high quality, with no rearrangement and over its entire length. There would be no evidence for a frameshift or a suspect Stop codon in either the cDNA or the genome, and the protein sequence read on the cDNA should closely match the protein sequence read on the genome footprint. Ideally, the predicted encoded protein would be conserved in evolution so that putative orthologs could be found. In addition, a good RefSeq should map unambiguously in a single gene rather than in repeated genes.

We therefore selected, using related conditions on all mRNAs currently aligning in a gene with no RefSeq, as of August 2, 2004, a set of 4,373 genes with no RefSeq, in which 4,928 cDNAs fulfill all of the following criteria:

    They align as excellent Gold on the July 2003/build 34 genome, i.e. over 99.98% of their length, with 99.88% bases identical to the genome on average (and >99% length and >99.6% identity). The alignments have no gap and are not rearranged. They map unambiguously in a single gene each, and this gene has no NM yet in August 2004. They constitute a non-redundant set, and when more than one clone is selected in a gene, these clones represent non-mergeable alternative variants.

    They encode a predicted CDS of more than 100 amino acids (the CDS goes from Met to Stop, Met being coded by any of the recognized possible codons: ATG, TTG, CTG or GTG; all the selected cDNAs are complete on the C-terminal side). In addition there is no frameshift in the genome or the cDNA, since the length of the CDS on the cDNA submitted to GenBank equals that on the Gold genomic footprint.

 

These accessions can then be split (Table 1d below) according to the other desirable features for RefSeq: presence or absence of standard introns; completeness of the CDS on the N-terminal side, which we define by the presence of an upstream Stop, 5' and in frame with the main ORF (in AceView); conservation of the protein, which we define by the presence of at least one hit, using BlastP with expect 10^-3, to a protein from another species. The presence of a significant Pfam hit is also monitored.

 

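Columns: Completion on NH2 side; Standard introns; CDS average length (bp); Have putative orthologs; Have Pfam motifs; Complete CDS with orthologs or Pfam (average #aa)

Completion on NH2 side: Yes: 4138 mRNA complete CDS (from 3,719 genes*)
    Standard introns: Yes: 2352 mRNAs (1934 genes*); CDS average length: 829 bp (300 to 6000 bp)
        Have putative orthologs: Yes; have Pfam motifs: Yes: / No:
        Have putative orthologs: No (Homo sapiens specific); have Pfam motifs: Yes: / No:
    Standard introns: No: 1786 mRNAs (1785 genes*); CDS average length: 441 bp (300 to 2400 bp)
        Have putative orthologs: Yes; have Pfam motifs: Yes: / No:
        Have putative orthologs: No; have Pfam motifs: Yes: / No:

Completion on NH2 side: No proof: 790 mRNA partial CDS (from 680 genes*)
    Standard introns: Yes: / No:; Min CDS: 1343 (300 to ?)
        Have putative orthologs: Yes; have Pfam motifs: Yes / No
        Have putative orthologs: No; have Pfam motifs: Yes / No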

*None has a RefSeq in August 04.

 

One correlation we found was that most of the genes excluded from RefSeq are novel in human, although most have putative orthologs in the protein databases. They are expressed at a relatively low level, hence are not found in all libraries. One restriction we thought might apply is that of protein length, because it is well known that RefSeq has a bias in favor of longer than average proteins. But many of the missing genes encode large proteins. An example list of 791 accessions, encoding complete CDS proteins of 652 aa on average (and at least 300 amino acids), and mapping at excellent quality with no anomaly in 587 genes that do not yet have a RefSeq mRNA, is available here.

Similarly, another 703 accessions, from 557 other genes with no NM as of August 2, 2004, contain an ORF of more than 900 bp (average ORF size is 1944 bases, 646 aa), but their products are not proven to be 5' complete. In addition, there are thousands of highly-coding alternatively spliced variants not yet represented in an NM.

 

1e. Seven alignment projects participated in the endeavor so far:

You are welcome to download the alignments from UCSC, NCBI and AceView, in the format defined below. These alignment files were constructed by reformatting the sources as explained in the README file in the same directory. Each table has around 300,000 lines and is named according to the method. The other alignments remain private.

To generate GOLD, we picked from

-   AceView alignments from April 4th, 2004, for the entire set of 74,106 accessions;

-   NCBI alignments, version 34.3, March 10th, 2004, for the entire set of 74,106 accessions, available from ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/maps/mapview/BUILD.34.3/hs_esttrn.md.gz;

-   UCSC alignments from February 15th, 2004, for a subset of 73,602 accessions, available from ftp://hgdownload.cse.ucsc.edu/GoldenPath/hg16/database/all_mrna.txt.gz. We had to remove 504 accessions which mapped on random contigs, because we did not succeed in reliably decoding the non-standard UCSC coordinates for those;

-   JBIRC/AIST alignments from February 5th, 2004, for a subset of 69,521 accessions, kindly contributed by Yasuyuki Fujii, Tadashi Imanishi and Takashi Gojobori (pers. comm.);

-   NEDO alignments from November 21, 2003, for a subset of 54,901 accessions, contributed by Kimura, Nishikawa, Isogai and Sugano (unpublished);

-   Exonerate alignments from April 8th, 2004, for a subset of 59,235 sequences, kindly contributed by Guy Slater (Slater and Birney, pers. comm.);

-   NCBI Splign alignments from February 13th, 2004, for the entire set, kindly provided by Kapustin and Lipman (pers. comm.).

 

1f. Alignment format definition:

 

This format is used in the alignments available above and for the GOLD alignments. If you plan to contribute your alignments, please use the following .ali format:

Alignments are given as tab delimited tables; optionally lines may start with # and contain a comment, otherwise each line corresponds to the alignment of an exon on the genome; the successive exons of a single alignment follow each other. The table has 9 columns:

  1. A unique identifier for your alignment
  2. The cDNA accession.version
  3. The name of the method.
  4. The exon number, given in the direction of the gene, increasing from one line of the table to the next. Each alignment starts on exon 1.
  5. x1: the first base of this exon on the cDNA, calling 1 the first base of the sequence.
  6. x2: the last base of this exon on the cDNA
  7. The chromosome number: 1, 2, 3 ... 22, X, Y, n_random or n|NTxxx
  8. a1: the first base of this exon on the chromosome, i.e. base x1 aligns on a1.
  9. a2: the last base of this exon on the chromosome, i.e. base x2 aligns on a2.


Note that the orientation of the cDNA and the strand encoding the gene on the genome can be deduced unambiguously from the relative values of x1, x2 on the cDNA and of a1, a2 on the genome. If the cDNA was submitted to GenBank in the direct orientation, from 5' to 3', and there is no rearrangement, each line for this cDNA will have x1 < x2. If on the contrary the cDNA was submitted to GenBank in the reverse orientation, each line in our table will have x1 > x2. If the gene is encoded on the direct (Watson) strand of the genome, each line will have a1 < a2, and the a coordinates will increase with exon number. Vice versa, if the gene is encoded on the reverse (Crick) strand, each line will have a1 > a2 and the a coordinates will decrease from one exon to the next.

If a cDNA has several independent alignments on the genome, for example if the gene is duplicated or the corresponding genome section inadvertently duplicated in the current human genome build, or if the cDNA or genomic clone is mosaic, the exon number starts at 1 for each independent alignment. If a cDNA is rearranged relative to the genome, the x coordinates will not be a strictly increasing or decreasing function of the a coordinates.
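As an illustration of the format, here is a minimal Python reader for .ali tables. It is a sketch under stated assumptions, not part of the GOLD distribution: the names `parse_ali` and `orientation` are hypothetical, a new alignment is assumed to begin on every line whose exon number is 1, and lines that disagree on direction (as in rearranged alignments) are simply reported as "mixed".

```python
from collections import namedtuple

# One line of a .ali table: the 9 tab-delimited columns described above.
Exon = namedtuple("Exon", "aln_id accession method exon x1 x2 chrom a1 a2")


def parse_ali(lines):
    """Group the exon lines of a .ali table into a list of alignments.

    A new alignment starts each time the exon number is 1, since the
    successive exons of one alignment follow each other in the table.
    """
    alignments = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError("expected 9 tab-delimited columns: %r" % line)
        exon = Exon(fields[0], fields[1], fields[2], int(fields[3]),
                    int(fields[4]), int(fields[5]), fields[6],
                    int(fields[7]), int(fields[8]))
        if exon.exon == 1 or not alignments:
            alignments.append([])
        alignments[-1].append(exon)
    return alignments


def _direction(pairs, up, down):
    """Return `up` if all pairs increase, `down` if all decrease, else 'mixed'."""
    signs = {(b > a) - (b < a) for a, b in pairs}
    signs.discard(0)  # single-base exons carry no direction
    if signs == {1}:
        return up
    if signs == {-1}:
        return down
    return "mixed"


def orientation(exons):
    """Deduce (cDNA orientation, genome strand) from x1/x2 and a1/a2."""
    cdna = _direction([(e.x1, e.x2) for e in exons], "direct", "reverse")
    strand = _direction([(e.a1, e.a2) for e in exons], "Watson", "Crick")
    return cdna, strand


# Hypothetical example lines (accessions and coordinates are invented).
EXAMPLE = [
    "# comment line",
    "aln1\tXX000001.1\tdemo\t1\t1\t100\t7\t5000\t5099",
    "aln1\tXX000001.1\tdemo\t2\t101\t250\t7\t6000\t6149",
    "aln2\tXX000002.1\tdemo\t1\t1\t100\tX\t9099\t9000",
]
```

Calling `parse_ali(EXAMPLE)` yields two alignments; `orientation` reports the first as a directly submitted cDNA on the Watson strand and the second as a directly submitted cDNA on the Crick strand.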

 

2: The GOLD selection procedure: quality evaluation and choice of the best alignment

 

2a. The GOLD cost function:

 

1.    Scoring the mRNA to genome match: To win the GOLD, the game is effectively to align as much of the cDNA as possible while minimizing the errors. Some cases are really equivalent. For example, if the 3' end of a cDNA does not match the genome perfectly, one may either trim the end of the alignment and keep a shorter but perfectly matching last exon, or keep a longer last exon with a few mismatches. But often one solution makes better biological sense. Because a random alignment is expected to yield about 1 in 4 correct bases, the error cost has to be above 1. At 2 points per error, alignments extending beyond a reasonable match (less than 2 matching bases after an error on average) are disfavored relative to incomplete alignments.

 

2.    Intron feet and splice site consensus: The ribonucleoprotein complex in charge of performing the splicing is equipped with various small RNA guides and proteins ensuring proper recognition and removal of introns with conserved sequences at their feet: gt-ag, and more rarely gc-ag or at-ac. The cost function slightly favors these.

 

3.    Poly A addition site and vector sequence recognition:

 

        As expected, more than 60% of the mRNAs submitted to GenBank contain a poly-A tail, naturally added at the 3' end of mature transcripts. Yet there are A-rich regions in the genome downstream of any gene, and most programs get fooled and occasionally create pure A exons. The average number of A in the GenBank accessions is 29; the largest poly-A tails are submitted by the DKFZ project (Wiemann et al, 2001; AL512733 has 245 A at the 3' end). We established a list of poly-A addition sites by looking for pure or almost pure A stretches beyond the ends of alignments. We then analyzed the base composition of the last exon of all alignments winning the Gold (temporarily) thanks to an extra 3' exon, and identified more poly-A tails. During that process, we occasionally recognized and clipped exit vector sequences. This analysis was not always trivial and we validated the ambiguous cases by hand. The current list tentatively flags poly-A addition sites in 45,865 cDNAs. This table gives the coordinate of the last base of the accession to be aligned (i.e. the base just before the poly-A).

By convention, we chose to place the poly-A addition site right before the first A of the poly-A stretch, as also done by UCSC, but not by AceView. As explained in the article, this convention does not influence the score, yet one weak argument in favor of this choice is cDNA AK024744, from the RPC155 gene encoding RNA polymerase III, where 12 of the 21 terminal A match the genome in direct extension of the last exon. So the poly-A could start at any of these 12 A. A poly-A signal variant, AACAAA, found in 2.2% of all human transcripts, occurs however 16-21 bases before the first A, just in the range to suggest that the first A belongs to the poly-A tail rather than to the premessenger. Almost all other cases, where a stretch of at least 12 A matches the genome at the 3' end, with or without standard introns, lack a poly-A signal. Clearly the oligo-dT method allows binding and priming in such A-rich sites, leading to internally primed (3' incomplete) cDNAs and possibly also, in rare instances, to genomic contaminations.

     Vector recognition: Similarly, first exons composed mostly of vector sequence should be penalized. We noticed that cDNAs for which the first few bases fail to align often start with identical sequences, usually found in vector polylinkers. We therefore developed a tool to identify common sequences found at the same position in multiple clones from the same library, and tentatively identified altogether 5,031 suspected vector sequences or other sequences to clip on the 5' or the 3' side of the insert. The coordinates in the accession of the first and last base to be aligned are given in this file [accession   a   b   seq]: the vector or other foreign sequence that we propose to clip is base 1 to (a-1) on the 5' side. The explicit sequence that we would clip is given in the fourth column. Other 5' additions we monitor in this list are unaligned poly-T stretches, which probably correspond to inversions with breakpoint in the polyA, short non-matching poly-G stretches, and poly-A stretches that sometimes occur in cap-selected libraries at the 5' end of the mRNA (Ota et al., 2004): 295 such clones are currently identified in the collection; the longest has 59 A on the 5' end (AK074948). The significance of the 5' poly A is unknown.

As shown in this diagram, the MGC consortium [8] contributes most to this problem, with 10.9% accessions apparently deserving a 5' clip (8.9% in the MGC reference collection), and this despite the fact that a large number of MGC accessions were resubmitted and fixed just before we imported the GenBank data (possibly thanks to the list of 3,938 BC that we sent to this group in Sept 2002). We also propose to clip 6.9% accessions from DKFZ, 2.6% from FLJ and 0.5% from KIAA.

 

4.  These 3,141 accessions have "alternate" alignments, i.e. a different exon structure, but they lie in the same genomic area and have the same score as the GOLD.
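The scoring logic of point 1 and the handling of equal scores can be sketched as follows. This is a deliberately simplified illustration, not the GOLD selection program: we assume one point per matching base and 2 points per error, and we ignore the bonus for standard intron feet and the poly-A and vector penalties of points 2 and 3. The function names and the candidate alignments are hypothetical.

```python
def gold_score(matches, errors, error_cost=2):
    """Simplified score: one point per matching base, minus 2 per error.

    Assumption: the real cost function has further terms (intron feet,
    poly-A and vector recognition) that are not modelled here.
    """
    return matches - error_cost * errors


def select_gold(candidates):
    """Return (best score, names of all candidates reaching it).

    `candidates` is a list of (name, matches, errors) tuples, one per
    candidate alignment of the same cDNA. Ties are all kept.
    """
    if not candidates:
        return None, []
    scored = [(gold_score(m, e), name) for name, m, e in candidates]
    best = max(score for score, _ in scored)
    return best, [name for score, name in scored if score == best]


# Hypothetical candidates: "trimmed" stops before a mismatched 3' end,
# "extended" keeps a longer last exon with a few mismatches.
CANDIDATES = [
    ("extended", 1000, 3),
    ("over-extended", 1010, 9),
    ("trimmed", 994, 0),
]
```

With these invented numbers, "extended" and "trimmed" both score 994 and are both returned, which mirrors the equivalent cases mentioned in point 1, while "over-extended" (992) is disfavored because its extra matches do not pay for the extra errors.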

 

2b. An open question: the short exons

We do not penalize short exons. However, the smallest experimentally confirmed exons in worm are nine bases long, and both are alternative, indicating they are difficult to splice out (gene 2G876 and clp-1; Kohara, Y., Shin-i, T., Thierry-Mieg, J., Thierry-Mieg, D., Suzuki, Y., Sugano, S., A complementary view of the C. elegans genome; unpublished; sequences are in the public databases). There are a number of independent, well documented examples of 6 base long exons in vertebrates, for example in the gene encoding Troponin T. Yet to our knowledge, below 6 bp, there is no hard evidence in human on how small an exon can be.

 

A recent article by a bioinformatics team at TIGR claims to discover a number of micro-exons shorter than 6 bases (Volfovsky N, Haas BJ, Salzberg SL: 2003. Computational discovery of internal micro-exons. Genome Res 13:1216-21). Unfortunately, our examination of the five cases they give for human mRNAs leads us to invalidate 3, and to offer a more likely hypothesis for the two others.

Indeed, in the three cases of X15949, AF144241 and BC014605, alignments to the genome are complete and fully error-free without invoking micro-exons. The actual exon 2 of X15949 is 93 bases long, not 5 bp; intron 1 is standard gt-ag but 45,308 bp long. Exon 3 of AF144241 is 176 bp long not 5 bp, but the second intron is non standard gt-gt and 3700 bp long (ATCAT[gt-gt]GTGAGCCACC). It is interpreted by AceView as having undergone partial deletion of the insert. Finally, the case of BC014605 is one where the 2,084 bp intron is gc-ag and not gt-ag, and the 5 bp proposed to be an independent exon belong to exon 1. The problem seems to be a bug in the alignment program used by the authors: it refuses to splice a non gt-ag intron or even long gt-ag introns. As expected statistically, it then finds a few examples of solutions with very small exons bounded by two [gt-ag] introns. None of the seven programs we used in Gold had such dramatic defects.

The two remaining cases, U43586 and AB053301, could align with a 5 bp exon, yet it seems more likely that the underlying reference genome sequence has errors (or represents a rare polymorphism). Indeed, in both cases, there are locally more than 30 clones with the exact same sequence; there are no alternatives (mRNA or EST; see AceView). So the 5 bp exon would be constitutive in these highly expressed genes. This is contrary to the multiple observations indicating that very short exons are always alternative. In addition, this area of the gene ALS2CR4 is highly conserved in rat, mouse, and even drosophila, and the 5 bases of the proposed micro-exon of AB053301 are found right in front of the homolog of the next exon, exon 3, as TCAAGatgatattc. The conclusions of this article do not seem well supported, so the problem of the existence and size of micro-exons remains open.

 

Below is the histogram showing the exon size in the GOLD set. Notice the strong bias toward exon lengths that are multiples of three bases. This effect may be slightly less prominent for 3 bp, possibly making exons in such a short range questionable. But notice it is also unexpectedly low for 24, weakening the argument.

 

 

 

    Micro-exons: A list of 88 GOLD alignments (of excellent quality and not rearranged) containing a very short exon, less than 6 bp long, bounded by two standard [gt-ag] introns, is given here in the hope that some researcher may test, for example by mutagenizing the underlying genome in the region of the short exon and testing the transcripts generated in an in vitro or in vivo transcription system, whether or not the mutation is later found in the transcript, thereby fixing the biological limit to how small an exon might be. Note that the NCBI program is the only one to routinely find and propose such short good looking exons (in build 34.3, February to July 2004); this is technically a remarkable achievement.

 

2c. The GOLD selection program and the dedicated database schema are available

 

2d. The current Gold alignments

The descriptions, coordinates and sequences of the current Gold alignments are available as 5 files:

         the coordinates on the genome of the excellent or good quality Gold alignments, which align over more than 99% of their length with less than 4 (for the excellent) or 15 (for the good) differences per kb from the genome

         the coordinates on the genome of the other Gold alignments of lesser quality (not so interesting for the general user)

         The sequences of the GOLD mRNAs, as derived from the genome, in fasta format, limited to excellent, good and partial Gold alignments.

         The sequences, in amino acids, of the longest GOLD open reading frames (ORFs), also in fasta format, deduced from the GOLD mRNA genome footprint above. Note that we export the ORFs, not the coding sequences (CDS), because the ORF is not disputable, while the CDS implies a hypothesis on the completeness of the transcript and on the initiator codon.

         A description of the properties of the mRNA alignments, in .ace format, i.e. presented as a human-readable, well-organized document. This document can also be read into an acedb database for higher-level querying, using the schema provided.

 

3: The quality of the GOLD alignments reflects quality and completion of the genome and cDNA sequences.

 

3a. Quality of the GOLD alignment, or lack of alignment, for all 74,106 cDNAs without exception:

 

Quality     Gold alignments (1)   mRNA accessions   % mRNA    mRNA with rearranged GOLD   mRNA with exact duplications
Excellent   66,419                66,018            89.09%    1,634                       362
Good        6,229                 6,180             8.34%     555                         47
Partial     914                   909               1.23%     453                         5
Dubious     541                   532               0.72%     139                         8
Bad         425                   423               0.57%     90                          2
Unaligned   0                     44                0.06%     0                           0
Total       74,528                74,106            100.00%   2,871                       424

 

  1 An mRNA aligning with the exact same score at n different positions in the genome counts n times. Rearranged alignments (e.g. from a mosaic or rearranged cDNA or genome) count once, at their combined rationalized score. If a cDNA aligns in multiple places in the genome at the exact best score, all these alignments become Gold. This happens for genes that are exactly duplicated, and explains why there are more Gold alignments than mRNAs.

 

For convenience, the categories are directly linked as html tables. We also provide a complete excel-compatible tab delimited document. The complete list of rearranged accessions and those with duplicate alignments can be found below.

A more detailed partition of the mRNAs, by the percentage of their length that aligns and by the number of errors per kb, is given below (the 78 unclassifiable mRNAs, with no score, were added to the bad category).

                 % length aligned
% error per kb   >=99%    >=60%    <60%    0%      Total    % mRNA
<=4              66,018   669      75      0       66,762   90.09%
4 to 15          6,180    240      23      0       6,443    8.69%
15 to 30         344      90       15      0       449      0.61%
> 30             200      81       127             408      0.55%
Unaligned                                  44      44       0.06%
Total            72,742   1,080    240     44      74,106   100.00%
% mRNA           98.16%   1.46%    0.32%   0.06%   100.00%
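
The quality classes can be restated as a small decision function. The sketch below is an illustration under stated assumptions, not the GOLD selection program. The function name `classify_gold_quality` is hypothetical. The excellent, good and partial limits follow the definitions given in this supplement (at least 99% of the length aligned with at most 4 or 15 differences per kb; 60 to 99% aligned at good quality). The dubious and bad boundaries are inferred from the counts in the two tables above, not stated in the text: they are chosen so that the cells of the detailed partition add up to the 532 dubious and 423 bad mRNAs. The treatment of values exactly at 4, 15 or 30 differences per kb is also an assumption.

```python
def classify_gold_quality(pct_length_aligned, errors_per_kb):
    """Return the quality class of one alignment (boundaries partly inferred)."""
    if pct_length_aligned <= 0:
        return "unaligned"
    if errors_per_kb <= 15:
        if pct_length_aligned >= 99:
            return "excellent" if errors_per_kb <= 4 else "good"
        if pct_length_aligned >= 60:
            return "partial"
        return "dubious"          # inferred: short alignment of good quality
    if errors_per_kb <= 30:
        # inferred: 15 to 30 errors per kb is dubious unless under 60% aligns
        return "dubious" if pct_length_aligned >= 60 else "bad"
    return "bad"                  # inferred: more than 30 errors per kb


# Cells of the detailed partition, keyed by one representative
# (% length aligned, errors per kb) value per bin.
cells = {
    (99.5, 2): 66018, (80, 2): 669, (30, 2): 75,
    (99.5, 10): 6180, (80, 10): 240, (30, 10): 23,
    (99.5, 20): 344, (80, 20): 90, (30, 20): 15,
    (99.5, 40): 200, (80, 40): 81, (30, 40): 127,
}
totals = {}
for (pct, err), n in cells.items():
    quality = classify_gold_quality(pct, err)
    totals[quality] = totals.get(quality, 0) + n
print(totals)
```

With these inferred boundaries, the twelve cells regroup into the mRNA counts of the quality table, which is the only check available from the data shown here.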

 

 

3b. Map of all Gold mRNA on the genome:

 

The highly polymorphic loci (masked mRNA) are highlighted (red superscript). The lists of alignments per chromosome, sorted by quality, and the lists of cDNAs from masked areas of the genome, are here.

The seemingly gene-poor regions may correspond to gaps in the genome sequence (July 2003), or be truly poor in genes, like parts of chromosome 13 or some centromeric regions.

 

3c Partial alignments usually highlight problems in the genome assembly or in the cDNA submission.

 

Figure 3c: Genome map of the partial alignments, where 60 to 99% length of the mRNA aligns at good quality, i.e. with less than 15 differences per kb.

 

Clustering of partial alignments is a strong indication of a problem in the genome.

 

3d. Identifying suspected frameshifts in the genome:

 

We have not done a systematic search for coherent disagreements between the genome and all mRNAs and ESTs for this project. Yet here is a list of 171 clones where the ORF in the cDNA is larger than that on GOLD by at least 100 amino acids. A good proportion of these identify frameshift mutations in the genome that the sequencing centers might be happy to fix. Their genome map is shown below: notice the absence of suspected frameshifts on chromosome 21, and the large number on chromosome 17. Clustering of multiple clones (superimposed glyphs) indicates a confirmed genome frameshift. About half of the others are confirmed in the complete EST/mRNA AceView database.
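
The 100 amino acid criterion can be sketched as follows. This is a hedged illustration, not the procedure actually used: the function names are hypothetical, the ORF is taken here to be the longest stop-free stretch over the three forward frames (an assumption), and the two sequences are artificial, built so that a single missing base in the genome-derived copy truncates the reading frame.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(seq):
    """Length, in codons, of the longest stop-free stretch over the 3 forward frames."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                run = 0
            else:
                run += 1
                best = max(best, run)
    return best


def suspect_genome_frameshift(cdna_seq, genome_derived_seq, min_gain_aa=100):
    """True when the cDNA ORF exceeds the genome-derived ORF by min_gain_aa or more."""
    gain = longest_orf_aa(cdna_seq) - longest_orf_aa(genome_derived_seq)
    return gain >= min_gain_aa


# Artificial example: the 9-base unit is stop-free in frame 0 only.
unit = "CTAACTAGC"
cdna = "ATG" + unit * 120 + "TAA"
cut = 3 + 9 * 60                              # drop one base mid-sequence
genome_footprint = cdna[:cut] + cdna[cut + 1:]
print(longest_orf_aa(cdna), longest_orf_aa(genome_footprint))
print(suspect_genome_frameshift(cdna, genome_footprint))
```

A real test would compare the ORF of the submitted cDNA with the ORF of the GOLD mRNA exported from the genome footprint.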

 

Figure 3d: Suspected frameshifts in the genome sequence:

 

 

4: Multiple alignments identify duplicated genes (or incomplete genome assembly), and point to chromosome-specific control mechanisms.

4a. Lists of exact or nearly exact gene duplications: Out of 71,421 mRNAs aligning at excellent, good or partial quality, and not split, we identified and list below the clones with multiple alignments in different areas of the genome, either with the exact same Gold score (from exact duplicate genes) or nearly exact (within 5 points of the Gold score, i.e. on average 99.97% sequence identity).

 

 

Exact duplicate alignments

Quality     mRNA   Gold alignments   mRNA with 2 copies   mRNA with >2 copies
Excellent   362    763               342                  20
Good        47     96                45                   2
Partial     5      10                5                    0
Total       414    869               392                  22

Exact or nearly exact duplicate alignments

Quality     mRNA   Exact or nearly exact duplicates   mRNA with 2 copies   mRNA with >2 copies   1 copy in a random contig (1)
Excellent   627    1,379                              544                  83                    249
Good        107    237                                91                   16                    25
Partial     11     24                                 10                   1                     5
Total       745    1,640                              645                  100                   279

 

1 About the 249 excellent duplications in random contigs (build 34/July 2003): we noticed that all cDNAs in that group consistently align on a given chromosome and on one of its associated random contigs, except for 5 accessions mapping in finished chromosome 5 but in random chromosome 8, suggesting that a mosaic genomic clone is involved (NT079526, build 34).
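
The selection of exact and nearly exact duplicates can be sketched as below. This is an assumption-laden illustration, not the GOLD code: the function name, the tuple layout (mRNA, locus, score) and the toy data are hypothetical; only the 5-point tolerance comes from the text.

```python
from collections import defaultdict


def duplicate_alignments(alignments, tolerance=5):
    """Group alignments per mRNA and keep mRNAs with several near-best hits.

    alignments: iterable of (mrna, locus, score) tuples.
    Returns {mrna: {"exact": [...], "near": [...]}}.
    """
    by_mrna = defaultdict(list)
    for mrna, locus, score in alignments:
        by_mrna[mrna].append((locus, score))
    result = {}
    for mrna, hits in by_mrna.items():
        best = max(score for _, score in hits)
        exact = sorted(locus for locus, score in hits if score == best)
        near = sorted(locus for locus, score in hits
                      if best - tolerance <= score < best)
        if len(exact) + len(near) > 1:
            result[mrna] = {"exact": exact, "near": near}
    return result


# Hypothetical toy alignments.
toy = [
    ("mrnaA", "chr1:100", 1000), ("mrnaA", "chr1:900", 1000),
    ("mrnaA", "chr7:50", 996),
    ("mrnaB", "chr2:10", 800), ("mrnaB", "chr3:10", 790),
    ("mrnaC", "chrX:5", 500),
]
duplicates = duplicate_alignments(toy)
print(duplicates)
```

In this toy set only mrnaA is reported: mrnaB has a second hit 10 points below the best score, outside the tolerance.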

 

 

4b Genome wide map of the exact duplicate alignments.

This map complements the paper's Figure 2, which shows the combination of exact and almost exact duplicates, while the map below shows only the exact duplicates. Chromosomes 3, 11, 14, 18, 19, 20 and 21 do not contain any exact duplicates and are omitted from the diagram. Notice the dominance of the intrachromosomal duplications (in blue), with 243 duplicates with introns in both copies and 147 with no intron in either. There are no other types of intrachromosomal repeats. Notice also the huge reduction in interchromosomal (red) duplicates, compared to Figure 2, especially in those with introns in one copy and not in the other: these retroposon-like pseudogenes usually have started diverging and are no longer exact. Here we see only 5 exact pseudogenes of yes-no type (intron in one, not the other; 7 cDNAs), from two genes with introns (6 cDNAs), both ribosomal proteins. We also see 5 sites (18 cDNAs) for intronless interchromosomal repeats. Finally, the 78 intron-intron exact interchromosomal duplicates are all in the pseudoautosomal region: the four telomeric copies from 1/15 and 14/22 already differ by one or two bases.

Red bar : interchromosomal duplicate alignment

Blue bar: intrachromosomal duplicate alignment

Icon on top of vertical bar: v if the copy mapped here has intron, no sign if it is intronless.

Bottom icon: ^ if the copy elsewhere has intron, none if intronless, o if there are copies with introns as well as intronless.

 

4c Figure 4: Classification of the duplicate alignments, per chromosome.

 

The figures in this paragraph display graphically the classification of the exact or almost exact duplicate genes (within 5 points of the GOLD score, on average 99.97% similarity), per chromosome, on genome build 34/July 2003.

The 1,377 repeats diagramed here are generated by 625 mRNA aligned at excellent quality and not split (list available in 4a; there are altogether 66,004 alignments). Results are shown in percent of all excellent non-split alignments on each chromosome. In cases where more than 2 equivalent alignments are generated, a given alignment may contribute to more than one class, for example inter and intra-chromosomal, yet its total contribution is normalized to 1. The corresponding genome map, restricted to duplicates in finished areas of the genome, is provided in the article (Figure 2). The data tables that allowed building these diagrams are available below.
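
The normalization can be illustrated with a short sketch. The equal split of an alignment's unit weight among its partner copies is an assumption about how the normalization is done (the fractional values such as 9.67 in the supporting tables are consistent with it), and the function name and class labels are hypothetical.

```python
from collections import Counter


def class_contributions(partner_classes):
    """Split the unit weight of one alignment equally among its partner copies.

    partner_classes: one class label per partner copy of this alignment,
    e.g. ["intra", "intra", "inter"] for an alignment with three other copies.
    """
    weight = 1.0 / len(partner_classes)
    contributions = Counter()
    for label in partner_classes:
        contributions[label] += weight
    return dict(contributions)


# An alignment with two partners on the same chromosome and one elsewhere.
example = class_contributions(["intra", "intra", "inter"])
print(example)
```

Summing such dictionaries over all alignments of a chromosome gives per-class totals whose sum equals the number of duplicated alignments on that chromosome.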

 

a) Repetitions for which at least one member lies on a random contig, usually in an unfinished chromosome. These 470 repeated alignments do not have biological meaning; they just reflect lack of completion of the chromosome sequence.

 

b) Duplicates for which both members map on the same autosome, with indication of the presence or absence of standard introns. Alignments in this category include 327 where both copies have standard introns, 2 where only one copy has introns and 232 where neither copy has introns. See Figure 1d for sex chromosomes, which are highly enriched in repetitions.

 

c) Duplicates where both members map on different autosomes. Notice the much lower number of alignments and the majority of retroposon-like pseudogenes: there are only 4 alignments (2 pairs) where both copies have standard introns, 65 where only one copy has introns and 25 where neither copy has introns.

 

d) Same for the X and Y chromosomes. Notice the much larger scale: X has 114 repeats of X on X (94 with introns in both and 20 with no introns in either) and Y has 37 repeats of Y on Y (35 with introns in both and 2 with no introns in either). For inter-chromosomal repeats, X and Y share 45 repeats in two blocks of homology, 2.4 Mb in total (39 with introns in both copies and 6 with no introns in either). X has an additional 10 repeats shared with autosomes, many of which could be pseudogenes (5 have introns in only one copy and 5 have no intron in either). For reference, 3.05% of the excellent GOLD alignments lie on the X and Y chromosomes rather than on the autosomes.

 

4d. Supporting tables for Figure 4: 1. Table of properties of exact and almost exact duplications, for excellent quality alignments, sorted by chromosome

Exact and almost exact duplicate alignments, within 5 points from the Gold, are counted. Only excellent alignments, not rearranged, are considered. Each alignment is normalized to 1. 

Chromosome   cDNA     Duplication   Duplication on the same chromosome   On another chromosome   Duplication involving isolated contig
1            6,585    219           42                                   12                      165
2            4,731    112           56                                   4                       52
3            4,013    17            0                                    2                       15
4            2,672    6             0                                    6                       0
5            3,077    108           92                                   1                       15
6            3,222    23            0                                    3                       20
7            3,287    81            72                                   9                       0
8            2,335    57            31                                   3                       23
9            2,771    110           27                                   2                       81
10           2,848    53            38                                   5                       10
11           3,648    7             0                                    7                       0
12           3,466    9             0                                    6                       3
13           1,304    11            5                                    6                       0
14           2,149    2             0                                    2                       0
15           2,320    76            55                                   10                      10
16           2,955    114           110                                  4                       0
17           3,736    65            22                                   1                       41
19           3,793    10            4                                    4                       2
20           1,834    2             2                                    0                       0
22           1,518    13            9                                    4                       0
X            1,906    194           114                                  55                      25
Y            104      82            37                                   45                      0
Un           63       8             0                                    0                       8
Total        66,018   1,379         717                                  191                     470

 

4d.2. Intron exon structure and relative organization of the excellent duplicate alignments

Same set as above, same rule for normalizing.

Note: Our analysis of this dataset indicates that chromosome 8 has 31 tandem duplications, in 8 genes, and no other repeat. Chromosomes 5 (30 genes), 15 (31 genes) and 17 (16 genes) show similar but milder biases toward tandems. On the other hand, some chromosomes, such as X, tend to have many more palindromic than tandem duplications. Other chromosomes with a similar but milder bias are 2 (19 genes), 7 (30 genes), 9 (13 genes) and Y (15 genes). We do not know what determines this difference, nor do we understand why chromosomes 3, 11, 14, 18, 19, 20 and 21 simply do not contain any exact duplicates, but it is clearly a problem of the utmost importance.

Chromosome    Same chrom    Same chrom    Same chrom    Other chrom   Other chrom   Other chrom   Tandem        Inward        Outward       Further
              both intron   one intron    no intron     both intron   one intron    no intron                   palindrome    palindrome    than 1 Mb
1             19.00         0.00          23.00         1.00          10.00         1.00          18.00         7.00          13.00         4.00
2             35.00         0.00          21.00         0.00          3.00          1.00          9.67          26.67         12.00         7.67
3             0.00          0.00          0.00          0.00          1.50          0.50          0.00          0.00          0.00          0.00
4             0.00          0.00          0.00          0.00          5.50          0.50          0.00          0.00          0.00          0.00
5             57.00         0.00          35.00         0.00          1.00          0.00          60.73         6.00          7.47          17.80
6             0.00          0.00          0.00          0.00          1.00          2.00          0.00          0.00          0.00          0.00
7             39.00         0.00          33.00         0.00          3.50          5.50          9.00          6.00          29.00         28.00
8             17.00         0.00          14.00         0.00          2.00          1.00          31.00         0.00          0.00          0.00
9             12.00         0.00          15.00         0.00          1.00          1.00          2.00          7.00          2.00          16.00
10            16.00         0.00          22.00         0.00          2.50          2.50          10.00         3.00          9.00          16.00
11            0.00          0.00          0.00          0.00          7.00          0.00          0.00          0.00          0.00          0.00
12            0.00          0.00          0.00          0.00          5.50          0.50          0.00          0.00          0.00          0.00
13            5.00          0.00          0.00          0.00          5.50          0.50          5.00          0.00          0.00          0.00
14            0.00          0.00          0.00          1.00          1.00          0.00          0.00          0.00          0.00          0.00
15            36.00         0.00          19.33         1.00          3.33          6.33          35.33         8.00          6.00          6.00
16            72.00         0.00          38.00         0.00          3.00          1.00          51.67         6.00          26.67         25.67
17            17.00         2.00          3.67          0.00          0.00          1.00          15.00         4.00          3.67          0.00
19            4.00          0.00          0.00          0.00          4.00          0.00          0.00          0.00          4.00          0.00
20            0.00          0.00          2.00          0.00          0.00          0.00          0.00          0.00          2.00          0.00
22            2.00          0.00          7.00          1.00          1.00          2.00          0.00          3.00          2.00          4.00
X             94.00         0.00          20.00         39.00         4.67          11.33         19.00         62.50         32.50         0.00
Y             35.00         0.00          2.00          39.00         0.00          6.00          3.00          12.00         11.33         10.67
Un            0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00          0.00
Total         460.00        2.00          255.00        82.00         66.00         43.67         269.40        151.17        160.63        135.80
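
The last four columns of the table partition the duplicates lying on the same chromosome by their relative organization. The definitions are not spelled out in this supplement, so the sketch below rests on assumptions that should be checked against the article: copies separated by more than 1 Mb form their own class regardless of orientation, copies on the same strand are tandem, copies pointing toward each other form an inward palindrome, and copies pointing away from each other form an outward palindrome. The function name and the coordinate convention are hypothetical.

```python
def classify_copy_pair(copy_a, copy_b, max_gap=1_000_000):
    """Classify two copies on the same chromosome.

    Each copy is a (start, end, strand) tuple with strand "+" or "-".
    """
    first, second = sorted([copy_a, copy_b])      # order by start coordinate
    gap = second[0] - first[1]
    if gap > max_gap:
        return "further than 1 Mb"
    if first[2] == second[2]:
        return "tandem"
    # Opposite strands: "+" then "-" means the copies face each other.
    return "inward palindrome" if first[2] == "+" else "outward palindrome"


print(classify_copy_pair((1_000, 2_000, "+"), (5_000, 6_000, "+")))
print(classify_copy_pair((1_000, 2_000, "+"), (5_000, 6_000, "-")))
print(classify_copy_pair((1_000, 2_000, "-"), (5_000, 6_000, "+")))
print(classify_copy_pair((1_000, 2_000, "+"), (3_000_000, 3_001_000, "-")))
```

Under these assumptions the four classes are mutually exclusive, which matches the fact that, in each row of the table, the four columns add up to the same-chromosome total.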

 

 

4e. Functional analysis of a sample of the duplicated genes on the July 2003 genome:

 

Selected genes (with a LocusID and a meaningful name) that have multiple excellent GOLD alignments, with exact or almost exact copies duplicated outside of the pseudo-autosomal region, are shown in the table below. Each gene is linked to its AceView representations (updated automatically). Some of these genes, but not all, are recognized as having duplicate genes in the standard gene/RefSeq/LocusLink annotation at NCBI, as evidenced by multiple LocusIDs in the second column. The others are not annotated as duplicates at NCBI (July 2003 to 2004); this problem will probably be fixed in the next releases. (Caution: the LocusID identifications result from the AceView analysis, not directly from GOLD, since GOLD only deals with alignments and does not provide clustering or protein analysis. For this reason, and because the public version of AceView was missing one quarter of the repeats at the time of this work, this list is not exhaustive.)

 

Functional analysis of the duplicated genes: The duplicated gene title or description, and its involvement in molecular and biological processes (using GO classification), as annotated by AceView, are displayed in the table.

Notice the unexpectedly high number of genes whose products bind DNA or RNA, compared to the description of segmental duplications by Bailey et al, 2002. Transcription factors, helicases, splicing factors and chromatin structure proteins are as prominent in the list as genes involved in development and signal transduction. So, the picture we describe for repeated genes matches more closely that described by the Bailey team for mouse or rat segmental duplications, both by their relative location and by their function (Bailey et al, 2004).

A more complete list, including genes duplicated in the pseudo-autosomal region (e.g. ALTE, ASMT, ASMTL, CD99, ELK4, IL3RA, IL9R, P2RY8, PR48, SLC25A6, SPRY3, SYBL1 and TXNRD1) and genes with a LocusID but no official name, for which the protein has not been studied, is given in another, more complete table. That table also includes annotation of the GO cellular compartment: duplicated gene products can be located in any part of the cell, with no apparent bias.

Phenotype: Only 5 diseases have so far been described in these duplicate genes (CLN3, SMN1, SSX1, SSX2, STRC), a roughly three-fold lower frequency than in all genes combined: 5 duplicated loci with an OMIM phenotype out of 199 duplicated genes (161 with a LocusID, i.e. 3.1%), versus 1,543 genes with an OMIM phenotype out of 19,413 genes with a LocusID (7.9%) in LocusLink, July 2003. This is expected, since a loss-of-function mutation, e.g. a hypomorphic or null mutation in the sense of Muller (Muller HJ: Further studies on the nature and causes of gene mutations. Proc. Int. Cong. Genet. 1932, 6:213-255.), will, except for the very few haplo-insufficient loci, lead only to dosage-type effects, that is to mild or undetectable phenotypic effects. Neomorphic or antimorphic mutations, in contrast, are expected to lead to strong effects. This type of gain-of-function mutation is much less frequent. In practice, it is usually, although not always, associated with dominant or semi-dominant effects. It may, for example, be associated with rearrangements, such as translocations that bring the promoter of one gene upstream and in control of the transcribed region of the other. The various translocations involving the duplicated SSX genes on chromosome X and the gene SYT on chromosome 18, which lead to synovial sarcomas, are examples of the neomorphic type.
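
The frequencies quoted in this paragraph can be recomputed from the counts it gives. The short calculation below is only an arithmetic check; the helper name `percent` is hypothetical, and it shows how the fold difference depends on whether the 161 genes with a LocusID or all 199 duplicated genes are used as the denominator.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`."""
    return 100.0 * part / whole


all_genes = percent(1543, 19413)        # genes with OMIM phenotype, LocusLink
dup_with_locusid = percent(5, 161)      # duplicated genes with a LocusID
dup_all = percent(5, 199)               # all duplicated genes

fold_with_locusid = all_genes / dup_with_locusid
fold_all = all_genes / dup_all

print(f"all genes: {all_genes:.1f}%")
print(f"duplicated, with LocusID: {dup_with_locusid:.1f}% ({fold_with_locusid:.1f}-fold lower)")
print(f"duplicated, all: {dup_all:.1f}% ({fold_all:.1f}-fold lower)")
```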

 

 

Gene

LocusID

Title

GO molecular processes

GO Biological process

Phenotype

ANAPC1

64682;285069

gene ANAPC1 encoding anaphase-promoting complex 1 (meiotic checkpoint regulator).

 Possibly chaperone

 Cell cycle control

 

ank.12/ ank.13

57730

gene ank.12/ ank.13 encoding KIAA1641 protein.

ankyrin

 

 

ArfGap.8/ glamu

255326

gene ArfGap.8 (and glamu) encoding hypothetical protein LOC255326.

DNA binding

Possibly nucleotide excision repair

 

CAPRI

10156;246721

gene CAPRI encoding Ca2+-promoted Ras inactivator.

DNA binding activity; GTPase activator activity

intracellular signaling cascade; transcription regulation

 

CES1/ CESR

1066;51716

gene CES1 and CESR, encoding carboxylesterase 1 (monocyte/macrophage serine esterase 1).

serine esterase activity; hydrolase activity

embryogenesis and morphogenesis; metabolism; response to toxin; xenobiotic metabolism

 

CLN3

1201

gene CLN3 encoding ceroid-lipofuscinosis, neuronal 3, juvenile (Batten, Spielmeyer-Vogt disease).

chaperone activity

protein folding

[OMIM 204200] Ceroid, lipofuscinosis, neuronal, 3, juvenile

CPEB1

64506;

338963;6218

gene CPEB1 encoding cytoplasmic polyadenylation element binding protein 1.

nucleic acid binding activity

protein biosynthesis

 

CSAGE

9598;158511

gene CSAGE encoding taxol resistance associated gene 3.

 

 

 

DAZ.1/ DAZ3

1617;57055;

57054;57135

gene DAZ.1 encoding deleted in azoospermia ; DAZ3 encoding deleted in azoospermia 3..

RNA binding activity;nucleic acid binding activity

fertilization (sensu Animalia); spermatogenesis

 

DDX41

51428

gene DDX41 encoding DEAD (Asp-Glu-Ala-Asp) box polypeptide 41.

RNA binding activity; ATP dependent helicase activity

apoptosis; development; RNA processing

 

DEFA1

1667;1668

gene DEFA1 encoding defensin, alpha 1, myeloid-related sequence.

antifungal peptide activity

defense response; xenobiotic metabolism

 

DMRTC1

63947;203429

gene DMRTC1 encoding DMRT-like family C1.

 

 

 

EIF3S8

8663

gene EIF3S8 encoding eukaryotic translation initiation factor 3, subunit 8, 110kDa.

translation initiation factor activity

regulation of translational initiation

 

ENIGMA/ LIM.4

9260

gene ENIGMA encoding enigma (LIM domain protein).

 

intracellular signaling cascade; receptor mediated endocytosis

 

FKBP6 / FKBP6.1/ FKBP6.2 / FKBP6.3

 

8468; 54441

gene FKBP6 or .1,.2,.3, encoding FK506 binding protein 6, 36kDa.

FK506 binding activity; zinc ion binding

protein folding

 

Gaa1.0

96626;339692

gene Gaa1.0 encoding pinch-2.

 

 

 

GAGE1

2543;2574;2575; 2576;2577;2578; 2579;26748; 26749;89801

gene GAGE1 encoding G antigen 1.

 

cellular defense response

 

GAGED2/ GAGED2.1

9503

gene GAGED2 or GAGED2.1, encoding G antigen, family D, 2.

 

 

 

GAGED3 / GAGED3.1

9502

gene GAGED3 or GAGED3.1, encoding G antigen, family D, 3.

 

 

 

GOLGIN-67

23015

gene GOLGIN-67 encoding golgin-67.

 

 

 

GSTT2/ DDT

2953

gene GSTT2 encoding glutathione S-transferase theta 2.

 

 

 

GTF2H2/ GTF2H2.1

2966

gene GTF2H2.1 encoding general transcription factor IIH, polypeptide 2, 44kDa.

 

DNA repair; regulation of transcription, DNA-dependent

 

GTF2I/ GTF2I.1

2969;2970

gene GTF2I encoding general transcription factor II, i.

protein binding activity; transcription factor activity

regulation of transcription initiation from Pol II promoter, DNA-dependent; signal transduction

 

GTF2IRD2/ GTF2I.4

54441;84163

gene GTF2IRD2 encoding DKFZp434A0131 protein./ gene GTF2I.4 encoding transcription factor GTF2IRD2.

 

 

 

HEAT.2/ HEAT.4

84500

gene HEAT.2 or HEAT.4, encoding KIAA1833 protein.

 

 

 

heyu

374676

gene heyu, similar to golgi autoantigen, golgin subfamily a, 2; SY11 protein.

 

 

 

HIST2H4

8370

gene HIST2H4 encoding histone 2, H4.

DNA binding

 

 

HSFY

55410;86614; 159119;83869

gene HSFY encoding chromosome Y open reading frame 14.

transcription factor activity

regulation of transcription, DNA-dependent

 

HSGP25L2G

54732

gene HSGP25L2G encoding gp25L2 protein.

protein carrier activity

intracellular protein transport

 

IRS.4

79930

gene IRS.4 encoding Dok-like protein.

insulin receptor binding

 

 

LAT1-3TM

81893

gene LAT1-3TM encoding LAT1-3TM protein.

transcription regulator activity

 

 

LGALS7

3963

genes LGALS7 encoding lectin, galactoside-binding, soluble, 7 (galectin 7).

lectin;sugar binding activity

heterophilic cell adhesion; cell growth and/or maintenance

 

LW-1

51402

gene LW-1 encoding LW-1.

transcription factor activity

regulation of transcription, DNA-dependent

 

MAGE.2/ MAGED4

81557

gene MAGE.2 encoding melanoma antigen, family D, 4. or MAGED4 melanoma antigen, family D, 4..

 

 

 

MAGEA2

9598;266740;

4101;4111

gene MAGEA2 encoding taxol resistance associated gene 3

 

drug resistance

 

MAGEA2b

266740;

4101;4105

gene MAGEA2b encoding melanoma antigen, family A, 2, copy b.

 

 

 

MAGEA9

4108

gene MAGEA9 encoding melanoma antigen, family A, 9.

 

 

 

MORN.6

222967;285927

gene MORN.6 encoding hypothetical protein LOC222967.

 

 

 

MRC1

4360

gene MRC1 encoding mannose receptor, C type 1.

sugar binding activity; mannose binding activity; calcium ion binding activity; receptor activity

heterophilic cell adhesion; pinocytosis; receptor mediated endocytosis

 

NBR2.1/ arf.11

51326;10230

gene NBR2.1 encoding ARF protein./gene arf.11 encoding ARF protein.

 

 

 

novar

51402

gene novar encoding LW-1.

transcription factor activity

regulation of transcription, DNA-dependent

 

NXF2/ NXF2.1

56001

gene NXF2 or NXF2.1 encoding nuclear RNA export factor 2.

protein transporter activity; RNA binding activity

protein-nucleus import; mRNA processing; mRNA-nucleus export;

 

PEPP-2

84528

gene PEPP-2 encoding PEPP subfamily gene 2.

transcription factor activity

regulation of transcription, DNA-dependent

 

PLGL.1

112597;5342

gene PLGL.1 encoding hypothetical protein MGC4677.

plasmin activity

 

 

PM5 / Cna_B.0

283820;23420

gene PM5 (or Cna_B.0) encoding hypothetical protein LOC283820.

 

 

 

RANBP2L1/ RANBP2L1.1/ RANBP2L1.2

84220

gene RANBP2L1 encoding RAN binding protein 2-like 1.

RAN protein binding activity

 

 

RBMY1A1.2/ RBMY1A1.3

5940; 159163

gene RBMY1A1.2/3 encoding RNA binding motif protein, Y-linked, family 1, member A1.

RNA binding activity

RNA processing; spermatogenesis

 

RPS17

6218

gene RPS17 encoding ribosomal protein S17.

RNA binding activity

protein biosynthesis; cytosolic small ribosomal subunit (sensu Eukarya)

 

SCP.0/ SCP.7

283971;348174

gene SCP.0 or 7, encoding hypothetical protein MGC34761.

sugar binding

 

 

SERF1A.1

8293;6606;6607

gene SERF1A.1 encoding small EDRK-rich factor 1A (telomeric).

 

 

 

SFTPA1/ SFTPA1.1

6435

gene SFTPA1 encoding surfactant, pulmonary-associated protein A1.

sugar binding

 

 

SMN1/ SMN2

6606; 6607

gene SMN1 encoding survival of motor neuron 1, telomeric and SMN2 survival of motor neuron 2,centromeric.

 

 RNA processing; pre-mRNA splicing; snRNP assembly

[OMIM 158590] Spinal muscular atrophy, 4

SMP1/ UPF0220.0

23585

gene SMP1 or UPF0220.0, encoding small membrane protein 1.

 

 

 

SSX1

6756;6759

gene SSX1 encoding synovial sarcoma, X breakpoint 1.

nucleic acid binding activity; transcription co-repressor activity

regulation of transcription, DNA-dependent; cell growth and/or maintenance

[OMIM 300326] Sarcoma, synovial

SSX2

6757

gene SSX2 encoding synovial sarcoma, X breakpoint 2.

nucleic acid binding

regulation of transcription, DNA-dependent

[OMIM 300192] Sarcoma, synovial

SSX4

6759

gene SSX4 encoding synovial sarcoma, X breakpoint 4.

nucleic acid binding

regulation of transcription, DNA-dependent

 

steeju

24150

gene steeju encoding TP53TG3 protein.

 

 

 

STRC

161497

gene STRC encoding stereocilin.

 

 

[OMIM 603720] Deafness, autosomal recessive 16

SULT1A3

79008;6818

gene SULT1A3 encoding hypothetical protein MGC5178.

aryl sulfotransferase activity; transferase activity

steroid metabolism; synaptic transmission

 

TBC1D3

339284;84218

gene TBC1D3 encoding hypothetical protein FLJ11822.

membrane alanyl aminopeptidase activity

proteolysis and peptidolysis

 

THOC3

84321

gene THOC3 encoding THO complex 3.

RNA binding activity

mRNA-nucleus export; nuclear mRNA splicing, via spliceosome; transport

 

titori

55911

gene titori encoding apolipoprotein B48 receptor.

 

 

 

TTTY8

84673

gene TTTY8 encoding testis-specific transcript, Y-linked 8.

 

 

 

VCY

9084

gene VCY encoding variable charge, Y-linked.

 

 

 

WBSCR16

81554

gene WBSCR16 encoding Williams-Beuren syndrome chromosome region 16.

 

 

 

Y_phosphatase.7/ kloraby

26095

gene Y_phosphatase.7 or kloraby, encoding DKFZP566K0524 protein.

protein tyrosine phosphatase activity

protein amino acid dephosphorylation

 

 

 

5: Split or discontinuous alignments identify genome or cDNA rearrangements.

 

Background on rearrangements can be found in the book:

Sturtevant AH and Beadle GW: An introduction to genetics. 1940 ed: Saunders, Philadelphia. (class textbook recently reprinted).

5a. Partition of the rearranged Gold alignments, of excellent, good or partial quality, excluding repetitions, by type of rearrangement:

 

Type                                                                       mRNA     Parts    % mRNA
Mosaic                                                                     1,199    2,444    1.65%
Inversion                                                                  126      254      0.17%
Transposition                                                              294      608      0.40%
Suspected genome deletion                                                  283      283      0.39%
Variable tandem repeat number                                              894      894      1.23%
Pseudo-split or mRNA bridging two correctly ordered and oriented contigs   36       72       0.05%
Total rearrangements                                                       2,639    4,438    3.63%
Total alignments at this quality                                           72,693   76,371   100.00%

 

5b. Map of the rearranged alignments on the finished part of the chromosomes

 

The map positions of the major fragment from 248 transpositions (red, above the line), 116 inversions (red, under the line), 209 suspected genome deletions (blue, above the line) and 1,241 mosaics (light blue, under the line) are shown. Some chromosomes definitely house a higher density of rearrangements than others: 21 or X, for example, look good, while 19, 22 or 17 should probably be rechecked in the suspect areas. The control is given here. To shorten the figure, the rearrangements mapping on the random contigs are not shown.

6: Intron properties. Splicing in Gold alignments: the frequency of structural polymorphisms and intronless genes is unexpectedly high.

 

6a. Statistics on the presence or absence of introns in the GOLD alignments,

Out of the 66,419 excellent GOLD alignments, including exact duplicates (but not the almost exact duplicates):

Alignment has                       # alignments   % alignments
At least one standard intron        50,830         76.5%
No intron and no alignment gap      14,343         21.6%
Only non-standard introns or gaps   1,246          1.9%
At least one non-standard intron    5,068          7.6%

 

Statistics on the types of intron boundaries, on the same sample of excellent alignments.

There are on average 7.8 introns per alignment that is not intronless. The partition by type of intron boundary is as follows:

Intron boundary                   Number of introns   % of all introns
Standard gt-ag                    395,953             97.41%
Standard gc-ag                    4,427               1.09%
Standard U12 at-ac                552                 0.14%
Any other non-standard type (#)   5,675               1.40%
Total                             406,501             100.00%

# Interpretations for the high level of such anomalous pseudo-introns are discussed in the article.
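
A tally of intron boundary types such as the one above can be sketched as follows. This is a minimal illustration, not the code used for the table: the function names are hypothetical, intron coordinates are assumed to be 0-based with an exclusive end on the forward strand of the sequence passed in, and the toy sequence is artificial.

```python
from collections import Counter

STANDARD_BOUNDARIES = ("gt-ag", "gc-ag", "at-ac")


def intron_boundary_type(genome, start, end):
    """Boundary type of the intron occupying genome[start:end]."""
    donor = genome[start:start + 2].lower()       # first two bases of the intron
    acceptor = genome[end - 2:end].lower()        # last two bases of the intron
    boundary = f"{donor}-{acceptor}"
    return boundary if boundary in STANDARD_BOUNDARIES else "other"


def tally_boundaries(genome, introns):
    """Count boundary types over a list of (start, end) introns."""
    return Counter(intron_boundary_type(genome, s, e) for s, e in introns)


# Artificial sequence: exon AAA, intron GTCCCCAG, exon TTT, intron GCAAAAAG, exon CC.
toy_genome = "AAAGTCCCCAGTTTGCAAAAAGCC"
toy_introns = [(3, 11), (14, 22), (5, 9)]
print(tally_boundaries(toy_genome, toy_introns))
```

Introns aligned on the reverse strand would first need to be reverse-complemented, a step left out of this sketch.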

 

6b. Size histograms for standard introns, non-standard introns and variable repeat number polymorphisms, restricted or not to that in the last exon.

 

An analysis of standard and non-standard introns sizes and micro-rearrangements: comparison between those lying in last exons and all others

a) Same histogram as Figure 3 (or panel b below), but showing only the last intron and micro-rearrangements below 300 bp, out of a total of 52,054 last introns: 8,574 standard introns (pink), 1,594 non-standard introns (blue), 356 variable repeat number polymorphisms plus 29 suspected genome deletions (green). Notice the prevalence of micro-rearrangements, i.e. non-standard introns or variable repeat numbers, in the last exon, compared to the histogram for all exons (presented again right below). Notice also, in the last exon, a small peak of very short standard introns; this is expected, because the cost function slightly favors gt-ag, gc-ag or at-ac, especially in non-coding areas. The marked dip from -6 to +6 is also a direct consequence of the cost function, which effectively penalizes such short exons/introns.

b) Same data as Figure 3, but collapsed. Data on the same set of 52,054 excellent alignments as in a: 72,797 standard introns (pink), 2,693 non-standard introns (blue), 516 micro-rearrangements (green).

 

a)

b) For comparison, same diagram but for all introns and micro-rearrangements, not just the last one.

 

 

6c: Genome map of variable repeat number polymorphisms and anomalous introns that do not correspond to suspected internal deletions in the cDNA.

 

894 variable repeat number polymorphisms are shown in red above the line, and 2417 anomalous introns (not gt-ag, gc-ag or at-ac), less than 100 bp long and not yet clearly recognized as suspected internal deletions in the cDNAs, hence possibly corresponding to polymorphisms in the genome sequence (or more rarely to actual use of non-standard introns), are represented in light blue under the line. The map is limited to the finished chromosomes and does not include the random contigs.

Correlation between the two types might indicate regions of the genome that are more fluid (or regions where the genome assembly is of lesser quality). Note that there also appears to be some correlation with the map of the other rearrangements.

 

7: Comparing the quality of the sequence of the chromosomes

 

7a1) Number of Gold alignments of different qualities, or masked, sorted by chromosome

Chromosome  cDNA    Excellent  Good   Partial  Dubious  Bad  Masked
1           7,328   6,585      561    89       62       31   0
2*          5,227   4,731      410    57       23       6    30
3           4,432   4,013      338    50       17       14   0
4           2,958   2,672      222    33       16       15   0
5           3,413   3,077      274    38       16       8    0
6*          3,574   3,222      290    37       20       5    59
7           3,729   3,287      365    43       22       12   0
8           2,642   2,335      257    32       12       6    0
9           3,135   2,771      262    57       28       17   0
10          3,157   2,848      255    34       12       8    0
11          3,981   3,648      261    55       13       4    0
12          3,844   3,466      297    53       23       5    0
13          1,444   1,304      122    9        7        2    0
14*         2,355   2,149      184    14       7        1    234
15          2,584   2,320      216    35       10       3    0
16          3,318   2,955      288    44       19       12   0
17          4,187   3,736      350    57       29       15   0
18          1,105   965        123    10       2        5    0
19          4,307   3,793      425    49       29       11   0
20          2,045   1,834      162    25       15       9    0
21          775     664        92     7        5        7    0
22*         1,771   1,518      202    33       17       1    56
X           2,064   1,906      120    19       12       7    0
Y           135     104        21     3        7        0    0
Un          95      63         23     4        4        1    0
Masked*     379     52         60     22       105      140  379
Total       73,984  66,018     6,180  909      532      345  758

* Counts after removal of the masked highly polymorphic areas.

The diagrams corresponding to the 2 tables below are in Figure 4 of the paper. The related list and map of suspected frameshifts in the genome may be found here.

 

7.1b) Number of structural defects, rearranged and partial alignments, per chromosome

Only excellent, good and partial alignments are considered; hypervariable regions are excluded (*).

Chromosome  # Gold alignments  Suspected genome deletion  Transposition  Variable repeat number  Inversion  Mosaic  Non-standard intron
1           7,235              25                         32             78                      12         128     381
2*          5,198              19                         29             68                      7          69      301
3           4,401              12                         17             43                      10         69      216
4           2,927              17                         10             32                      2          34      157
5           3,389              15                         16             38                      5          61      217
6*          3,549              9                          15             43                      5          59      204
7           3,695              15                         11             39                      7          60      252
8           2,624              6                          9              21                      4          43      134
9           3,090              12                         11             65                      7          71      171
10          3,137              7                          14             40                      3          51      195
11          3,964              18                         15             60                      11         62      219
12          3,816              11                         9              44                      5          58      207
13          1,435              5                          6              21                      0          20      103
14*         2,347              2                          6              30                      4          44      139
15          2,571              8                          9              31                      9          46      131
16          3,287              16                         12             45                      7          50      186
17          4,143              26                         24             56                      6          68      225
18          1,098              3                          5              10                      3          20      65
19          4,267              13                         24             56                      5          69      264
20          2,021              9                          8              29                      1          30      99
21          763                3                          0              11                      0          13      50
22*         1,753              10                         9              22                      3          33      116
X           2,045              4                          3              12                      5          29      102
Y           128                0                          0              2                       0          6       10
Un          90                 0                          0              0                       0          2       5
Masked*     134                9                          0              1                       5          4       46
Total       73,107             274                        294            897                     126        1,199   4,195

 

7b) Human Genome Project Coordinators by Chromosome (From the Washington University, St Louis, Genome Sequencing Center website, April 2004)

For reference

Chr  Coordinating Center                             Contact
1    Sanger Centre                                   Jane Rogers
2    Washington University Genome Sequencing Center  Rick Wilson
3    Baylor College of Medicine                      Steve Scherer
4    Washington University Genome Sequencing Center  Rick Wilson
5    SHGC(JGI)                                       Jane Grimwood
6    Sanger Centre                                   Jane Rogers
7    Washington University Genome Sequencing Center  Rick Wilson
8    Whitehead Institute                             Chad Nusbaum
9    Sanger Centre                                   Jane Rogers
10   Sanger Centre                                   Jane Rogers
11   Riken                                           Todd Taylor
12   Baylor College of Medicine                      Steve Scherer
13   Sanger Centre                                   Jane Rogers
14   Genoscope                                       Jean Weissenbach
15   Whitehead Institute                             Chad Nusbaum
16   Los Alamos National Laboratory                  Norman Doggett
17   Whitehead Institute                             Chad Nusbaum
18   Whitehead Institute                             Chad Nusbaum
19   SHGC(JGI)                                       Jane Grimwood
20   Sanger Centre                                   Jane Rogers
21   21 Consortium                                   Todd Taylor
22   Sanger Centre                                   Jane Rogers
X    Sanger Centre                                   Jane Rogers
Y    Washington University Genome Sequencing Center  Rick Wilson



8: A snapshot of the compared performance, strengths and weaknesses of public alignment programs: UCSC, NCBI, AceView

 

Users of the main public sites rely on the results displayed, so we analyze here the properties of three public programs: NCBI 34.3, UCSC and AceView. Blat, used at UCSC, has been excellent and very stable for years now, but NCBI and AceView are still evolving, so this analysis should be considered a snapshot in time, for public data from March 10 to August 21, 2004. Updates on the comparison will certainly converge within a year or two.

Figure 5a shows the overall performance of the programs. Although the choice of the GOLD may not always be optimal, these cases are now too rare to notably influence the results presented in this section. Let us recall that the score is of the same order of magnitude as the number of bases that exactly match the genome. The overall success rate in Figure 5a is evaluated by counting the percentage of alignments whose score reaches the Gold score, or is off by 1 to 5 points, 6 to 50, 51 to 500 or more than 500 points. In addition, some mRNAs may remain unaligned.
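The binning used for Figure 5a can be sketched as follows. This is only an illustration, not the code used to produce the figure: the function names and the input layout (pairs of integer scores, with None for an unaligned mRNA) are assumptions, and only the category boundaries come from the text.

```python
from collections import Counter

# Category boundaries follow the text; names and input layout are
# assumptions made for this illustration.
def score_loss_category(gold_score, program_score):
    """Return the Figure 5a category for one alignment.

    program_score is None when the program produced no alignment.
    """
    if program_score is None:
        return "Unaligned"
    loss = gold_score - program_score
    if loss <= 0:
        return "Gold"            # the program reaches the Gold score
    if loss <= 5:
        return "loses 1 to 5"
    if loss <= 50:
        return "loses 6 to 50"
    if loss <= 500:
        return "loses 51 to 500"
    return "loses >500"


def tally(pairs):
    """Count categories over an iterable of (gold_score, program_score)."""
    return Counter(score_loss_category(g, p) for g, p in pairs)


# Invented scores, for illustration only.
pairs = [(2000, 2000), (2000, 1997), (1500, 1460), (900, None)]
print(tally(pairs))
```

Dividing each count by the number of mRNAs tested gives the percentages plotted per program.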

 

We then try to understand in which way the programs differ from the GOLD. To usefully pinpoint defects, we limited this analysis to the 64,938 GOLD with excellent and non-split alignments, aligning on average over 99.98% of their length with 1.4 base differences from the genome per kb. There are two broad types of difficulties: the proper determination of the area of the genome where the cDNA originated and the precise definition of the exons and introns. To quantify the precision of the local alignment (Figure 4b), we simply compare any alignment that overlaps the corresponding Gold alignment on the genome, and count the number of cases where a program

o       misses an exon. We distinguish the first exon, most often missed by programs, any internal exon, and the last exon, missed in rare instances. AceView currently has the best record there, with altogether 1.18% of the alignments missing exons; UCSC has 1.47%, NCBI 1.56%.

o       produces too many exons. We again distinguish by position: first extra exons are often spurious vector matches; others, such as most of the AceView lot, are validated cooperative first short exons. Internal extra exons are rarer. Last extra exons often consist of pure poly-A. UCSC holds the record with only 0.16% alignments with too many exons, AceView has 0.57% and NCBI 4.00%.

o       produces inaccurate exons with suboptimal boundaries. AceView has 1.18%, NCBI 2.35% and UCSC 3.04% (its weakest point).

 

If we now look at a larger scale picture, and ask how often a program assembles the mRNA to its best position on the genome and only there, different measures can be made:

o                   How many accessions are assigned to a wrong map position? We call "wrong map" a position that does not overlap the GOLD and where the alignment scores at least 50 points below the Gold score. We find that mapping is highly reliable: less than 0.02% of the accessions are mapped in wrong positions by the programs discussed. The only caveat is redundancy: if an accession may align in several places, some programs do not always correctly discard the lower quality alignments, thus keeping, in addition to the good map, a number of wrong map positions. NCBI 34.3 has no such cases, AceView has 3 and UCSC has 54 instances (0.08%).

o                   A related problem is how many of the exact duplications go undetected. Of course, a program with high redundancy is expected to miss fewer of the true repeats, and indeed UCSC is the best, missing only 6% of the duplications. AceView misses 22% and NCBI 87%. This factor is certainly tunable, but it clearly seems important to show the exactly duplicated alignments in the 1% of cases where they occur.

o                   Finally, as we showed above, more than 3% of the cDNAs align best as split alignments, different pieces landing in more than one place/one strand on the genome. Most programs apart from UCSC and Nedo do not propose split alignments in an efficient manner. We hope to have convinced the reader of the interest of such split alignments in the last phase of quality assessment of the genome assembly and that more of us will build this feature into their programs.

Every data contributor receives privately extensive lists of problems that should help them debug their program. Example lists are available in supplementary material 9.

Comparing the results of any program to the Gold allows us to classify the defects of that program and to provide lists of suboptimal alignments to the contributors. It also allows us to identify the main difficulties and to compile lists of examples for each type of problem.

 

Performance of the methods:

We only consider mRNAs with excellent non-split Gold alignment; exact duplications are included

 

Method             Gold   AceView  UCSC   NCBI 34.3
Gold               65287  60705    59547  55336
loses 1 to 5       0      3115     4076   3233
loses 6 to 50      0      1527     1828   4688
loses 51 to 500    0      83       97     747
loses >500         0      42       41     34
Unaligned          0      61       0      893
No score           0      2        0      0
Total              65287  65535    65589  64931
mRNA tested total  64886  64886    64548  64886

Lists of alignments that could be improved...

We compared the alignments to the GOLD alignments, which were selected anonymously from any of the contributing programs. We restricted the comparison to the 64,780 accessions with excellent GOLD, i.e. aligning completely (>99%) and with very few errors (<0.4% differences from the genome). Mosaic clones, and clones with inversions or transpositions, were also excluded: all the mRNAs here should align in a straightforward fashion all the way with very few errors.

Defect                           AceView  UCSC    NCBI 34.3
Total comparable alignments      64,652   64,284  63,494
missed first exon                230      521     788
missed central exon              327      267     119
missed last exon                 107      175     94
inaccurate exon                  1,056    1,856   1,593
extra first exon                 223      22      1,064
extra central exon               55       77      192
extra last exon                  5        21      1,409
missed micro-insertion/deletion  546      4       1,456
extra micro insertion/deletion   4        35      0

 

The line "Total" gives the number of accessions that were mapped to the same place on the genome as the GOLD, and that could thus be compared to the GOLD in terms of exon and intron structure. An exon is "missed" if it is part of the GOLD but the proposed alignment has no overlapping exon. An exon is "extra" if it belongs to an alignment but does not overlap any exon in the GOLD. The first exon is the 5'-most, the last the 3'-most, and central is any exon between the first and the last. An inaccurate exon is an internal exon, bounded by two introns and overlapping a Gold exon, that has at least one intron boundary not optimally chosen according to our analysis. Sometimes, the Gold solution is very close... The number of accessions with each defect is shown. The same accession may appear in more than one line in the table.
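The exon-level definitions above can be sketched in a few lines. This is an illustration under stated assumptions, not the comparison program itself: exons are taken as (start, end) genome coordinates listed 5' to 3', the function names are hypothetical, and "not optimally chosen" is simplified to "boundaries differ from those of the overlapping Gold exon".

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def position(index, count):
    """Name an exon by its rank: first (5'-most), last (3'-most) or central."""
    if index == 0:
        return "first"
    if index == count - 1:
        return "last"
    return "central"


def compare_to_gold(gold, other):
    """Return the set of defect names of `other` relative to `gold`.

    Both arguments are lists of (start, end) exons ordered 5' to 3'.
    Simplification (assumption): an internal exon is called inaccurate
    when it overlaps a Gold exon without being identical to one.
    """
    defects = set()
    for i, g in enumerate(gold):
        if not any(overlaps(g, e) for e in other):
            defects.add(f"missed {position(i, len(gold))} exon")
    for j, e in enumerate(other):
        hits = [g for g in gold if overlaps(g, e)]
        if not hits:
            defects.add(f"extra {position(j, len(other))} exon")
        elif 0 < j < len(other) - 1 and e not in hits:
            defects.add("inaccurate exon")
    return defects


# Invented coordinates, for illustration only.
gold = [(100, 200), (500, 600), (900, 1000), (1500, 1700)]
other = [(500, 600), (900, 1005), (1500, 1700)]
print(compare_to_gold(gold, other))
```

Because the function returns a set, one accession can carry several defects at once, which matches the remark that the same accession may appear in more than one line of the table.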

2:  Hit or miss: problems finding the map on the genome

Defect       AceView  UCSC    NCBI 34.3
Unaligned    61       0       893
Wrong map    5        1       14
Redundant    7        54      0
mRNA tested  64,886   64,548  64,886

 

This table lists all accessions for which finding the place in the genome where the cDNA aligns was problematic. If an accession was tested but failed to produce any alignment, the accession is "Unaligned". An accession that was aligned only in a position where the quality of the match is far worse than the GOLD (50 points below) is classified as "Wrong map". An accession for which two or more independent alignments of the same part of the clone were provided, but one was far worse than the GOLD (50 points below the Gold score, hence its map position is not correct), is called "Redundant": the lower quality alignment should have been discarded.
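A minimal sketch of this three-way classification is given below. It is an assumption-laden illustration: the Hit record and function names are hypothetical, strand and clone coordinates are ignored (so "the same part of the clone" is not checked), and a position is treated as wrong when it does not overlap the Gold locus and scores at least 50 points below the Gold score.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment of an accession on the genome (hypothetical layout)."""
    chrom: str
    start: int
    end: int
    score: int


MARGIN = 50  # "50 points below Gold score", from the text


def is_wrong_position(hit, gold):
    """True if the hit lies away from the Gold locus and scores far below it."""
    same_place = (hit.chrom == gold.chrom
                  and hit.start <= gold.end and gold.start <= hit.end)
    return (not same_place) and gold.score - hit.score >= MARGIN


def map_defect(gold, hits):
    """Return "Unaligned", "Wrong map", "Redundant", or None if no defect."""
    if not hits:
        return "Unaligned"
    wrong = [h for h in hits if is_wrong_position(h, gold)]
    if len(wrong) == len(hits):
        return "Wrong map"   # aligned only in far worse positions
    if wrong:
        return "Redundant"   # a good map plus one that should be discarded
    return None


# Invented coordinates and scores, for illustration only.
gold = Hit("7", 1000, 9000, 2000)
good = Hit("7", 1200, 9000, 1998)
stray = Hit("12", 500, 4000, 1900)
print(map_defect(gold, [good, stray]))
```

Exact duplications, which score close to the Gold elsewhere in the genome, fall under neither defect in this sketch because they do not lose 50 points.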

3: Missing exact duplications and not recognizing mosaics and other split alignments

Defect                         AceView  UCSC   NCBI 34.3
mRNA with missed duplication   29       15     317
mRNA with missed split         1,108    44     1,132
#mRNA with duplication tested  362      235    362
#mRNA which are split tested   1,132    1,122  1,132

 

10 Discussion: Lists of challenging problems and questions that require experimental input

 

    Long genes, extending over 800 kb of genome sequence: A list of 72 excellent GOLD alignments spread over more than 800 kb of genome is given. The longest alignment in the set is that of AB020675, 2.298 Mb long, on chromosome 7. The solution was found only by the NEDO group, congratulations!

    Long introns, above 400 kb: A list of 61 excellent GOLD with challenging standard introns above 300 kb is provided. The longest intron, 1136 kb, is found by NEDO in accession AK097990, but there is a gene in the proposed intron, leaving open the possibility of a genome misassembly or of a missed duplication. Similar doubts arise for other examples from this list. One set of examples that seem to lie on a correct genome assembly, so they would be real biologically, not just examples for programmers, includes BC030832, with an intron 866 kb long. This one appears less dubious because another variant, BC038514, has an alternative exon in the same area, with an intron 846 kb long.

    Micro-introns, between 6 bp and 12 bp, with standard boundaries: A list of 21 accessions with this feature is provided. These gt-ag or gc-ag introns are probably not a result of spliceosome-dependent splicing, but rather examples of length polymorphisms. Yet they provide an excellent training set.