False discovery rate control incorporating phylogenetic tree increases detection power in microbiome-wide multiple testing

Bioinformatics. 2017 Sep 15;33(18):2873-2881. doi: 10.1093/bioinformatics/btx311.

Abstract

Motivation: Next-generation sequencing technologies have enabled the study of the human microbiome through direct sequencing of microbial DNA, producing an enormous amount of microbiome sequencing data. One unique characteristic of microbiome data is the phylogenetic tree that relates all the bacterial species. Closely related bacterial species tend to exhibit a similar relationship with the environment or disease. Incorporating the phylogenetic tree can therefore potentially improve the detection power of microbiome-wide association studies, where hundreds or thousands of tests are conducted simultaneously to identify bacterial species associated with a phenotype of interest. Despite much progress in multiple testing procedures such as false discovery rate (FDR) control, methods that take the phylogenetic tree into account remain limited.

Results: We propose a new FDR control procedure that incorporates prior structure information and apply it to microbiome data. The procedure is based on a hierarchical model in which a structure-based prior distribution is designed to exploit the phylogenetic tree. By borrowing information from neighboring bacterial species, we improve the statistical power to detect associated bacterial species while controlling the FDR at the desired level. When the phylogenetic tree is mis-specified or non-informative, our procedure achieves power similar to that of traditional procedures that do not take the tree structure into account. We demonstrate the performance of our method through extensive simulations and real microbiome datasets. Our method identified far more bacterial species associated with alcohol drinking than traditional methods did.
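The two ideas in this paragraph, tree-agnostic FDR control and borrowing information from neighboring species, can be illustrated with a minimal sketch. It is not the StructFDR implementation. The function names, the exponential weight `exp(-2 * rho * d)`, and the parameter `rho` are all assumptions chosen for illustration, and the kernel average stands in for the posterior quantities of the hierarchical model, which the abstract does not specify. The Benjamini-Hochberg step-up procedure is shown as an example of a traditional method that ignores the tree.

```python
import math


def benjamini_hochberg(pvals, alpha=0.05):
    """Traditional (tree-agnostic) BH step-up procedure.

    Returns the sorted indices of the rejected hypotheses.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value falls below its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


def smooth_z(z, dist, rho=1.0):
    """Kernel-weighted average of z-scores over all species.

    dist[i][j] is a hypothetical pairwise tree distance between species
    i and j. Weights decay as exp(-2 * rho * dist); this form is an
    assumption for illustration, not the model from the paper.
    """
    m = len(z)
    smoothed = []
    for i in range(m):
        weights = [math.exp(-2.0 * rho * dist[i][j]) for j in range(m)]
        total = sum(w * zj for w, zj in zip(weights, z))
        smoothed.append(total / sum(weights))
    return smoothed


# Tree-agnostic baseline on made-up p-values.
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6], alpha=0.05)

# Two species at distance zero share their signal completely;
# two very distant species keep their own z-scores.
close = smooth_z([3.0, 1.0], [[0.0, 0.0], [0.0, 0.0]])
far = smooth_z([3.0, 1.0], [[0.0, 100.0], [100.0, 0.0]])
```

Smoothed z-scores no longer follow the standard normal null, so they cannot be passed directly to `benjamini_hochberg`. A tree-based procedure needs its own null calibration to control the FDR, which this sketch does not attempt.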

Availability and implementation: R package StructFDR is available from CRAN.

Contact: chen.jun2@mayo.edu.

Supplementary information: Supplementary data are available at Bioinformatics online.

MeSH terms

  • Bacteria / genetics*
  • Genomics / methods
  • High-Throughput Nucleotide Sequencing / methods*
  • Humans
  • Microbiota / genetics*
  • Phylogeny*
  • Polymorphism, Genetic
  • Sequence Analysis, DNA / methods
  • Software*