Use of machine learning to extract patterns from long-term monitoring data across the US

Underwood, Kristen

Background/Question/Methods: Collaborative research at long-term monitoring sites over recent decades has generated a great volume, variety and frequency of data (i.e. “big data”) for analysis of water quality. Data at this scale can challenge traditional statistical methods when the goal is to identify drivers of solute dynamics across these networks. Machine-learning tools have evolved to identify patterns in big data and are increasingly used for dimension reduction, feature extraction, and trend identification. We applied a tandem evolutionary algorithm to dissolved organic carbon (DOC) dynamics, examining geologic, topographic, hydrologic and land cover variables for long-term monitoring sites. We clustered 449 dominantly forested US catchments using two approaches. The first was based on mean DOC concentration in catchment stream water (mined from USGS NWIS), split into high and low categories using Jenks natural breaks. The second was based on 54 catchment biogeophysical attributes (mined from the “Catchment Attributes and MEteorology for Large-sample Studies” (CAMELS) dataset), used as inputs to a hierarchical agglomerative clustering algorithm. The evolutionary algorithm was then used as a feature-selection tool to search the multidimensional data space and identify the combinations of catchment attributes, and the specific value ranges of those attributes, that were most important in driving catchment membership in each of these clusters.
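The two clustering approaches described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy values, the function names (`jenks_two_class_break`, `standardize`, `agglomerative`), the z-score scaling, and the choice of average linkage with Euclidean distance are all assumptions made for illustration, since the abstract does not specify them. For two classes, Jenks natural breaks reduces to finding the single split of the sorted values that minimizes the total within-class sum of squared deviations.

```python
from statistics import mean


def jenks_two_class_break(values):
    """Two-class Jenks natural break: return the upper bound of the low class,
    chosen to minimize the total within-class sum of squared deviations."""
    xs = sorted(values)
    if len(xs) < 2:
        raise ValueError("need at least two values")

    def ssd(group):
        m = mean(group)
        return sum((v - m) ** 2 for v in group)

    # Try every split point of the sorted values and keep the cheapest one.
    _, best_i = min((ssd(xs[:i]) + ssd(xs[i:]), i) for i in range(1, len(xs)))
    return xs[best_i - 1]


def standardize(rows):
    """Z-score each attribute column (assumed preprocessing, not from the abstract)."""
    cols = list(zip(*rows))
    means = [mean(c) for c in cols]
    sds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def agglomerative(rows, n_clusters):
    """Average-linkage agglomerative clustering (linkage choice is an assumption).
    Returns one integer cluster label per input row."""
    clusters = [[i] for i in range(len(rows))]

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(rows[a], rows[b])) ** 0.5

    def linkage(c1, c2):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        _, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    labels = [0] * len(rows)
    for k, members in enumerate(clusters):
        for m in members:
            labels[m] = k
    return labels


# Hypothetical mean DOC concentrations for six catchments (toy values).
mean_doc = [1.2, 1.5, 1.1, 8.0, 9.5, 7.8]
doc_break = jenks_two_class_break(mean_doc)
high_doc = [v > doc_break for v in mean_doc]

# Hypothetical attribute table: two made-up attributes per catchment.
attributes = [[0.10, 10.0], [0.20, 12.0], [0.15, 11.0],
              [0.90, 50.0], [0.80, 55.0], [0.85, 52.0]]
attr_labels = agglomerative(standardize(attributes), n_clusters=2)
```

In this sketch the DOC split and the attribute clustering are computed independently, mirroring the two approaches in the abstract; the resulting labels are what a feature-selection step would then try to explain from the attributes.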

Results/Conclusions: For the two DOC clusters, feature selection yielded a univariate response: catchments with high mean DOC concentrations were associated with catchment-average overburden depths ≥ 5 meters. The biogeophysical clustering identified five geographically distinct clusters of the 449 catchments, each with a unique combination of catchment attributes driving cluster membership. High mean DOC catchments occupied two of the five biogeophysical clusters. High DOC catchments located along the Gulf and Atlantic coastal regions receive a very low percentage of precipitation in the form of snow and are characterized by thick development of sand-rich soils overlying sedimentary rock or unconsolidated parent materials of moderate-to-high porosity. In contrast, high DOC catchments in the Great Lakes and Upper Mississippi Valley regions are characterized by seasonally uniform to summer-dominated precipitation, soils with low-to-moderate silt fraction, low-to-moderate subsurface porosity, and deciduous vegetation. These contrasting combinations of catchment attributes indicate heterogeneity in high DOC stream efflux. Collaborating researchers are using these results to further investigate drivers of DOC flux at catchment and site scales with process-based models and bench-scale soil experiments.

PS 48 Abstract - Use of machine learning to extract patterns from long-term monitoring data across the US