COS 82-4 - Identifying and characterizing extrapolation in multivariate response data

Thursday, August 15, 2019: 9:00 AM
L010/014, Kentucky International Convention Center
Meridith L. Bartley (1), Ephraim Hanks (1), Erin Schliep (2), Patricia A. Soranno (3) and Tyler Wagner (4); (1) Statistics, Pennsylvania State University, State College, PA, (2) Statistics, University of Missouri, Columbia, MO, (3) Department of Fisheries and Wildlife, Michigan State University, East Lansing, MI, (4) Ecosystem Science and Management, The Pennsylvania State University, University Park, PA
Background/Question/Methods

Extrapolation is the act of making predictions beyond the range of the data used to estimate a statistical model. In ecological studies, it is not always obvious when and where extrapolation occurs. Previous work on identifying extrapolation has focused on univariate response data, but these methods are not directly applicable to multivariate response data, which are increasingly common in ecological investigations. We extend previous work on identifying extrapolation by examining predictive variance within a univariate setting and applying novel methods to the multivariate case. We illustrate our approach through an analysis of jointly modeled lake nutrients, productivity, and clarity variables in over 7,000 inland lakes from across the Northeast and Midwest US. In addition, we illustrate novel exploratory approaches, based on classification and regression trees, for identifying regions of parameter space where extrapolation is more likely to occur.

Results/Conclusions

We fit a multivariate response linear model to data from 8,910 lakes, calculate the posterior predictive variance, and then obtain our novel numeric multivariate prediction variance (MVPV) measure associated with extrapolation. We examine the choice of a cutoff, or range of cutoffs, separating extrapolation from interpolation and, given a cutoff, are able to identify novel locations that are extrapolations. The cutoffs investigated (max value, leverage max, 0.99 quantile, and 0.95 quantile) resulted in 1, 18, 91, and 443 lake multivariate response predictions, respectively, being identified as extrapolations. Most lakes' predictions therefore remain within the extrapolation index cutoff and are not identified as extrapolations. As the cutoff values become more conservative, the number of extrapolations identified increases, which highlights the importance of exploring different choices for a cutoff value. We further explore where extrapolations occur using a classification and regression tree (CART) model. Our CART approach reveals that the most important factors associated with extrapolation include shoreline length, elevation, stream density, and lake SDF. This work results in the identification of lake locations where predictions are extrapolations, as well as a further understanding of the unique parameter space these lakes occupy. The resulting caution when using joint nutrient models to estimate water quality variables at lakes with partially or completely unsampled measures is necessary for larger goals, such as estimating the overall combined levels of the various water quality variables in all US inland lakes.
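
The cutoff step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that each lake's prediction has already been reduced to a single non-negative scalar (standing in for the MVPV measure, whose exact definition is not given in this abstract), that cutoffs are computed from the values at lakes used to fit the model, and that a prediction is flagged as an extrapolation when its value exceeds the cutoff. The function names and the toy numbers are hypothetical, and the "leverage max" cutoff is omitted because its definition is not stated here.

```python
import statistics


def extrapolation_cutoffs(reference_values):
    """Candidate cutoffs computed from scalar index values at fitted lakes.

    ASSUMPTION: reference_values holds one scalar per lake used in model
    fitting (a stand-in for the MVPV measure).
    """
    # With n=100, statistics.quantiles returns the 99 percentile cut points,
    # so index 94 is the 0.95 quantile and index 98 is the 0.99 quantile.
    percentiles = statistics.quantiles(reference_values, n=100, method="inclusive")
    return {
        "max": max(reference_values),
        "q99": percentiles[98],
        "q95": percentiles[94],
    }


def count_extrapolations(prediction_values, cutoff):
    """Number of predictions whose index value strictly exceeds the cutoff."""
    return sum(1 for value in prediction_values if value > cutoff)


def flagged_by_cutoff(reference_values, prediction_values):
    """Map each cutoff name to the number of flagged predictions."""
    cutoffs = extrapolation_cutoffs(reference_values)
    return {
        name: count_extrapolations(prediction_values, cutoff)
        for name, cutoff in cutoffs.items()
    }


if __name__ == "__main__":
    # Toy data only: 100 hypothetical in-sample values and 4 new predictions.
    reference = [float(i) for i in range(1, 101)]
    predictions = [50.0, 96.5, 99.5, 101.0]
    for name, count in flagged_by_cutoff(reference, predictions).items():
        print(f"{name}: {count} flagged")
```

Because the 0.95 quantile is never larger than the 0.99 quantile, which in turn is never larger than the maximum, the flagged counts in this sketch can only stay the same or grow as the cutoff moves from the maximum toward lower quantiles. This matches the pattern reported above, where more conservative cutoffs identify more extrapolations.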