Are all data useful? Inferring causality to predict flows across sewer and drainage systems using directed information and boosted regression trees

Water Res. 2018 Nov 15:145:697-706. doi: 10.1016/j.watres.2018.09.009. Epub 2018 Sep 4.

Abstract

As more sensor data become available across urban water systems, it is often unclear which of these new measurements are actually useful and how they can be efficiently ingested to improve predictions. We present a data-driven approach for modeling and predicting flows across combined sewer and drainage systems, which fuses sensor measurements with output of a large numerical simulation model. Rather than adjusting the structure and parameters of the numerical model, as is commonly done when new data become available, our approach instead learns causal relationships between the numerically-modeled outputs, distributed rainfall measurements, and measured flows. By treating an existing numerical model - even one that may be outdated - as just another data stream, we illustrate how to automatically select and combine features that best explain flows for any given location. This allows for new sensor measurements to be rapidly fused with existing knowledge of the system without requiring recalibration of the underlying physics. Our approach, based on Directed Information (DI) and Boosted Regression Trees (BRT), is evaluated by fusing measurements across nearly 30 rain gages, 15 flow locations, and the outputs of a numerical sewer model in the city of Detroit, Michigan: one of the largest combined sewer systems in the world. The results illustrate that the Boosted Regression Trees provide skillful predictions of flow, especially when compared to an existing numerical model. The innovation of this paper is the use of the Directed Information step, which selects only those inputs that are causal with measurements at locations of interest. Better predictions are achieved when the Directed Information step is used because it reduces overfitting during the training phase of the predictive algorithm. In the age of "big water data", this finding highlights the importance of screening all available data sources before using them as inputs to data-driven models, since more may not always be better. We discuss the generalizability of the case study and the requirements of transferring the approach to other systems.

Keywords: Boosted regression trees; Causality; Data-driven model; Directed information; Flow prediction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cities
  • Michigan
  • Models, Theoretical*
  • Rain*