Reciprocal illumination in the gene content tree of life

Syst Biol. 2006 Jun;55(3):441-53. doi: 10.1080/10635150600697416.

Abstract

Phylogenies based on gene content rely on statements of primary homology to characterize gene presence or absence. These statements (hypotheses) are usually determined by techniques based on threshold similarity or distance measurements between genes. This fundamental but problematic step can be examined by evaluating each homology hypothesis by the extent to which it is corroborated by the rest of the data. Here we test the effects of varying the stringency for making primary homology statements using a range of similarity (e-value) cutoffs in 166 fully sequenced and annotated genomes spanning the tree of life. By evaluating each resulting data set with tree-based measurements of character consistency and information content, we find a set of homology statements that optimizes overall corroboration. The resulting data set produces well-resolved and well-supported trees of life and greatly ameliorates previously noted inconsistencies such as the misclassification of small genomes. The method presented here, which can be used to test any technique for recognizing primary homology, provides an objective framework for evaluating phylogenetic hypotheses and data sets for the tree of life. It also can serve as a technique for identifying well-corroborated sets of homologous genes for functional genomic applications.
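The workflow described in the abstract (score gene presence or absence at a given similarity cutoff, then ask how consistently the resulting characters fit a tree) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy E-values, the four-taxon tree, the helper names `presence_matrix`, `fitch_steps`, and `ensemble_ci`, and the choice of Fitch parsimony with the ensemble consistency index as the "tree-based measurement of character consistency". The abstract does not specify which measures were used.

```python
from typing import Dict, List, Set, Tuple, Union

# Hypothetical tree encoding: a leaf is a genome name, an internal node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def presence_matrix(
    hits: Dict[str, Dict[str, float]], genomes: List[str], cutoff: float
) -> Dict[str, Dict[str, int]]:
    """Score a gene family as present (1) in a genome if its best hit
    has an E-value at or below the cutoff, otherwise absent (0)."""
    return {
        family: {g: int(evalues.get(g, float("inf")) <= cutoff) for g in genomes}
        for family, evalues in hits.items()
    }


def fitch_steps(tree: Tree, states: Dict[str, int]) -> Tuple[Set[int], int]:
    """Fitch parsimony on a rooted binary tree: return the state set at
    this node and the number of state changes required below it."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_steps(tree[0], states)
    right_set, right_cost = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def ensemble_ci(tree: Tree, matrix: Dict[str, Dict[str, int]]) -> float:
    """Ensemble consistency index: minimum possible steps divided by the
    steps observed on the tree, summed over all variable characters."""
    min_steps = 0
    observed_steps = 0
    for states in matrix.values():
        n_states = len(set(states.values()))
        if n_states < 2:
            continue  # constant characters carry no grouping information
        _, steps = fitch_steps(tree, states)
        min_steps += n_states - 1
        observed_steps += steps
    return min_steps / observed_steps if observed_steps else float("nan")


# Toy data (invented): best-hit E-values per gene family and genome.
genomes = ["A", "B", "C", "D"]
tree: Tree = (("A", "B"), ("C", "D"))
hits = {
    "fam1": {"A": 1e-50, "B": 1e-40, "C": 1e-3},
    "fam2": {"A": 1e-30, "B": 1e-2, "C": 1e-30, "D": 1e-2},
}

# Evaluate each candidate cutoff by the consistency of the characters it yields.
for cutoff in (1e-10, 1e-1):
    matrix = presence_matrix(hits, genomes, cutoff)
    print(cutoff, ensemble_ci(tree, matrix))
```

On this toy data the stricter cutoff yields two variable characters, one of which conflicts with the tree, while the looser cutoff leaves a single variable character that fits perfectly. This is why a consistency measure alone is not enough, and why the abstract pairs it with a measure of information content when choosing among cutoffs.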

Publication types

  • Research Support, N.I.H., Extramural
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Classification / methods*
  • Computational Biology / methods*
  • Computer Simulation
  • Databases, Genetic
  • Models, Genetic
  • Phylogeny*
  • Sequence Homology