Finding optimal degenerate patterns in DNA sequences

Bioinformatics. 2003 Oct:19 Suppl 2:ii206-14. doi: 10.1093/bioinformatics/btg1079.

Abstract

Motivation: The problem of finding transcription factor binding sites in the upstream regions of given genes is algorithmically an interesting and challenging problem in computational biology. A degenerate pattern over a finite alphabet Sigma is a sequence of subsets of Sigma. A string over IUPAC nucleic acid codes is also a degenerate pattern over Sigma = {A, C, G, T}, and is used as one of the major patterns modeling transcription factor binding sites in the upstream regions of genes. However, it is known that the problem of finding a degenerate pattern consistent with both positive and negative string sets is in general NP-complete. Our aim is to devise a heuristic algorithm to find a degenerate pattern which is optimal for positive and negative string sets w.r.t. a given score function.

Results: We have proposed an enumerative algorithm called SUPERPOSITION for finding optimal degenerate patterns with a pruning technique, which works with most all reasonable score functions. The performance score of the algorithm has been compared with those of other popular motif-finding algorithms YMF, MEME and AlignACE on various sets of co-regulated genes of yeast. In the computational experiment, SUPERPOSITION has outperformed the others on several gene sets.

Availability: The python script SUPERPOSITION is available at http://www.math.kyushu-u.ac.jp/~om/softwares.html

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Base Sequence
  • Binding Sites
  • Chromosome Mapping / methods*
  • DNA Mutational Analysis / methods
  • Molecular Sequence Data
  • Mutation
  • Protein Binding
  • Regulatory Elements, Transcriptional / genetics*
  • Sequence Analysis, DNA / methods*
  • Software*
  • Transcription Factors / genetics*
  • Transcription, Genetic / genetics*

Substances

  • Transcription Factors