Background:
Identifying the genes, an essential task in
post-genome biology, requires aligning the many cDNA sequences available in the
public databases to the genome. However, the various programs used at the main
genome annotation sites propose different solutions for the exact exon structure
in about half the cases.
Results:
To help resolve this problem, we pool the
mRNA-to-genome alignments proposed by NCBI, UCSC, Ensembl, AceView, and H-Inv,
for 74,106 mRNAs from 29,194 human genes. We carefully define a cost function
and let “GOLD”, Genomewide Optimization of Locus Description, select the best
alignment for each clone. We annotate the Gold alignments, discuss the
distribution of introns and minimal size of exons, classify the frequent
rearrangements, and propose that variable tandem-repeat numbers and
micro-introns below 65 bp, which occur in 9% of the genes, are
micro-polymorphisms. We show striking chromosomal and regional specificity
in the control of gene duplication and discover that exact duplicates of genes
containing introns are all clustered within 3.1 megabases of each other. We
also observe interchromosomal and regional variability in the levels of base
mismatch and rearrangements, annotate suspected defects, including frameshifts,
in the genome and the cDNAs, and discuss the high frequency of intronless
genes. Finally, we identify difficult alignments by comparing the programs.
The current Gold alignments, their annotations, the C program and the acedb schema are
available online, or from www.ncbi.nlm.nih.gov/IEB/Research/Acembly/GOLD.
Contributions of new alignments are encouraged.
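
As an illustration of the selection step described above, here is a minimal sketch of choosing, for each clone, the lowest-cost alignment among those proposed by several programs. The cost terms, the weights, the field names and the clone identifiers are all assumptions made for this example; the actual GOLD cost function is defined in the paper and its C program and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    clone: str                # cDNA clone identifier (made up for this sketch)
    source: str               # program that proposed the alignment
    mismatches: int           # base mismatches against the genome
    unaligned_bases: int      # cDNA bases left unaligned
    nonstandard_introns: int  # introns without standard boundaries


def cost(a, w_mismatch=1.0, w_unaligned=1.0, w_intron=5.0):
    """Hypothetical linear cost; the weights are illustrative assumptions."""
    return (w_mismatch * a.mismatches
            + w_unaligned * a.unaligned_bases
            + w_intron * a.nonstandard_introns)


def select_best(alignments):
    """Keep the lowest-cost alignment per clone (first seen wins on ties)."""
    best = {}
    for a in alignments:
        current = best.get(a.clone)
        if current is None or cost(a) < cost(current):
            best[a.clone] = a
    return best


# Toy candidates: invented numbers, not data from the paper.
candidates = [
    Alignment("clone_A", "NCBI", 3, 0, 1),
    Alignment("clone_A", "UCSC", 2, 10, 0),
    Alignment("clone_A", "AceView", 1, 0, 0),
    Alignment("clone_B", "Ensembl", 0, 4, 0),
    Alignment("clone_B", "H-Inv", 5, 0, 0),
]

gold = select_best(candidates)
for clone, a in sorted(gold.items()):
    print(clone, a.source, cost(a))
```

Pooling first and scoring afterwards means a new program's alignments can be added to the candidate list without changing the selection logic, which is in the spirit of the invitation to contribute new alignments.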
Conclusions:
Because GOLD extracts the best solutions to
difficult alignment problems from all programs, it opens a new dimension toward
a precise annotation of the human genes and genome.
The aim of the Gold project is to gather the most accurate alignments
of human cDNAs from large-scale public projects on the human genome, and to use
them to extend our understanding of the human genes.
The paper describes our main results and is linked to a detailed
supplementary material, itself linked to many lists of clones.
We would be happy to receive your questions, comments or suggestions; please send us an email.