Predicting the performance of anaerobic digestion using machine learning algorithms and genomic data

Water Res. 2021 Jul 1:199:117182. doi: 10.1016/j.watres.2021.117182. Epub 2021 Apr 22.

Abstract

Modeling of anaerobic digestion (AD) is crucial for understanding the process dynamics and improving digester performance. It is an essential yet difficult task because of the complex and unknown interactions within the system. The application of well-developed data mining technologies, such as machine learning (ML), together with microbial gene sequencing techniques, is promising for overcoming these challenges. In this study, we investigated the feasibility of 6 ML algorithms for predicting methane yield from genomic data and the corresponding operational parameters collected from 8 research groups. For classification models, random forest (RF) achieved accuracies of 0.77 using operational parameters alone and 0.78 using genomic data at the bacterial phylum level alone. Combining operational parameters and genomic data improved the prediction accuracy to 0.82 (p < 0.05). For regression models, a neural network achieved a low root mean square error of 0.04 (relative root mean square error = 8.6%) using genomic data at the bacterial phylum level alone. Feature importance analysis by RF suggested that Chloroflexi, Actinobacteria, Proteobacteria, Fibrobacteres, and Spirochaeta were the 5 most important phyla, although their relative abundances ranged only from 0.1% to 3.1%. The important features identified could provide guidance for early warning and proactive management of microbial communities. This study demonstrated the promising application of ML techniques for predicting and controlling AD performance.
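The regression result above is reported as a root mean square error (RMSE) and a relative RMSE. As a minimal sketch of how these two metrics relate, the Python snippet below computes both for a small set of values. It assumes that the relative RMSE is the RMSE divided by the mean of the observed values; the abstract does not state which normalization the authors used, and other conventions (for example, dividing by the observed range) exist. The function names and the sample methane yields are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse(observed, predicted):
    """RMSE normalized by the mean observed value.

    Assumption: normalization by the mean; the abstract does not
    specify the definition used in the study.
    """
    mean_observed = sum(observed) / len(observed)
    return rmse(observed, predicted) / mean_observed


# Hypothetical methane yields (units not given in the abstract).
observed = [0.40, 0.50, 0.45, 0.55]
predicted = [0.44, 0.46, 0.49, 0.51]

print(f"RMSE: {rmse(observed, predicted):.3f}")
print(f"Relative RMSE: {relative_rmse(observed, predicted):.1%}")
```

Under this assumed definition, a fixed RMSE corresponds to a smaller relative RMSE as the mean observed yield increases, which is why the two figures are usually reported together.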

Keywords: Anaerobic digestion; Control strategy; Genomic data; Machine learning; Performance prediction.

MeSH terms

  • Algorithms*
  • Anaerobiosis
  • Genomics
  • Machine Learning*
  • Methane

Substances

  • Methane