Improved BLAST searches using longer words for protein seeding

Bioinformatics. 2007 Nov 1;23(21):2949-51. doi: 10.1093/bioinformatics/btm479. Epub 2007 Oct 6.

Abstract

Motivation: The blastp and tblastn modules of BLAST are widely used methods for searching protein queries against protein and nucleotide databases, respectively. One heuristic used in BLAST is to consider only database sequences that contain a high-scoring word match to the query, where the word length is at most 5. We implemented the capability to use words of length 6 or 7. We demonstrate an improved trade-off between running time and retrieval accuracy, controlled by the score threshold used for short word matches. For example, the running time can be reduced by 20-30% while achieving ROC (receiver operating characteristic) scores similar to those obtained with current default parameters.
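The seeding heuristic described above can be illustrated with a minimal sketch. It is not the NCBI implementation: it scans all word pairs by brute force instead of using a lookup table, and the scoring function `toy_score` (5 for a match, -1 for a mismatch) is a hypothetical stand-in for a real substitution matrix such as BLOSUM62. The function names, sequences, and threshold values are assumptions chosen for illustration only.

```python
def toy_score(a, b):
    # Hypothetical scoring: stand-in for a substitution matrix such as BLOSUM62.
    return 5 if a == b else -1


def word_score(w1, w2):
    # Sum of per-residue scores for two words of equal length.
    return sum(toy_score(a, b) for a, b in zip(w1, w2))


def find_seeds(query, subject, word_len, threshold):
    """Return (query_pos, subject_pos) pairs whose words of length
    word_len score at least threshold against each other."""
    seeds = []
    for i in range(len(query) - word_len + 1):
        q_word = query[i:i + word_len]
        for j in range(len(subject) - word_len + 1):
            if word_score(q_word, subject[j:j + word_len]) >= threshold:
                seeds.append((i, j))
    return seeds


if __name__ == "__main__":
    query = "MKVLAAGIW"
    subject = "TTKVLAAGTT"
    # Short words with a strict threshold (exact matches only under toy_score).
    print(find_seeds(query, subject, word_len=3, threshold=15))
    # Longer words: a strict threshold yields fewer seeds,
    # a relaxed one admits near-matches as well.
    print(find_seeds(query, subject, word_len=6, threshold=30))
    print(find_seeds(query, subject, word_len=6, threshold=24))
```

In this toy setting, raising the word length reduces the number of seeds that must be extended, while lowering the score threshold recovers some of the near-matches; this is the kind of trade-off between running time and retrieval accuracy that the threshold controls.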

Availability: The option to use long words is in the NCBI C and C++ toolkit code for BLAST, starting with version 2.2.16 of blastall. A Linux executable used to produce the results herein is available at: ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/protein_longwords

Publication types

  • Research Support, N.I.H., Intramural

MeSH terms

  • Algorithms
  • Computer Graphics
  • Database Management Systems*
  • Databases, Protein*
  • Information Storage and Retrieval / methods*
  • Proteins / chemistry*
  • Proteins / genetics*
  • Sequence Alignment / methods*
  • User-Computer Interface*

Substances

  • Proteins