2020 ESA Annual Meeting (August 3 - 6)

COS 73 Abstract - The consequences of omitting unavailable variables from big data models

Volker Bahn, Department of Biological Sciences, Wright State University, Dayton, OH

Background/Question/Methods

Big data unlocks new power to investigate macro-ecological questions. However, the distribution and fitness of organisms depend on both coarse-scale variables such as climate, which are often available, and fine-scale variables such as food, shelter, and biotic interactions, which are rarely available over large extents. The latter are therefore often omitted from distribution and other ecological models. Depending on their spatial patterns and their relationships with other factors, omitted factors can bias models and their evaluation. Here I investigate how omitting important factors affects the performance, bias, and evaluation of species distribution models, using a realistic simulation model on a 50 x 50 grid. I vary the characteristics of the omitted factors: simple or complicated functional relationships to included factors (for example, food availability depending on temperature and precipitation), and random coincidence of their spatial pattern with other factors or with the species’ distribution itself. Predictions to new landscapes, generated according to the same rules as the original landscape, serve as unbiased evaluations of the models. I contrast these with commonly used evaluation methods applied to the same landscape, such as resubstitution goodness-of-fit and different hold-out methods.
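The contrast between same-landscape and new-landscape evaluation can be illustrated with a minimal sketch. It is not the simulation used in this study; it only assumes a setup in the same spirit. All names and settings here are hypothetical: landscapes are built from sums of random Gaussian bumps to obtain spatial autocorrelation, a single functional variable (`food`) is omitted, ten spatially structured but functionally irrelevant `decoys` are offered to the model, ordinary least squares stands in for a distribution model, and R^2 stands in for the evaluation metric.

```python
import math
import random

N = 50  # grid side length, matching the 50 x 50 landscape


def smooth_field(rng, n_bumps=8, width=5.0):
    """Standardized, spatially autocorrelated field (assumed bump model), as a flat list."""
    bumps = [(rng.uniform(0, N), rng.uniform(0, N), rng.gauss(0, 1))
             for _ in range(n_bumps)]
    cells = [sum(a * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * width ** 2))
                 for cx, cy, a in bumps)
             for x in range(N) for y in range(N)]
    mean = sum(cells) / len(cells)
    sd = math.sqrt(sum((v - mean) ** 2 for v in cells) / len(cells))
    return [(v - mean) / sd for v in cells]


def make_landscape(rng, n_decoys=10, noise_sd=0.5):
    """Return (design rows, response); the rows deliberately exclude food."""
    climate = smooth_field(rng)   # available and functional
    food = smooth_field(rng)      # functional but unavailable to the model
    decoys = [smooth_field(rng) for _ in range(n_decoys)]  # available, non-functional
    y = [c + f + rng.gauss(0, noise_sd) for c, f in zip(climate, food)]
    rows = [[1.0, climate[i]] + [d[i] for d in decoys] for i in range(N * N)]
    return rows, y


def fit_ols(rows, y):
    """Least-squares coefficients via normal equations and Gauss-Jordan elimination."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(p):
            if k != col:
                factor = a[k][col] / a[col][col]
                a[k] = [v - factor * w for v, w in zip(a[k], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def r_squared(rows, y, beta):
    """1 - SSE/SST for predictions made with fixed coefficients beta."""
    pred = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = sum(y) / len(y)
    sse = sum((t - q) ** 2 for t, q in zip(y, pred))
    sst = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - sse / sst


rng = random.Random(42)
train_rows, train_y = make_landscape(rng)
beta = fit_ols(train_rows, train_y)

# Resubstitution: evaluate on the landscape the model was fit to.
r2_same = r_squared(train_rows, train_y, beta)

# Independent evaluation: same generating rules, new spatial configurations.
new_scores = []
for _ in range(10):
    rows, y = make_landscape(rng)
    new_scores.append(r_squared(rows, y, beta))
r2_new = sum(new_scores) / len(new_scores)

print(f"resubstitution R^2: {r2_same:.2f}")
print(f"mean R^2 on new landscapes: {r2_new:.2f}")
```

In this sketch the decoys can only help through chance spatial coincidence with the omitted variable, so any gap between the two printed scores reflects overfitting to one spatial configuration. Varying `n_decoys` or `width` is one way to explore how the number and smoothness of available variables affect that gap.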

Results/Conclusions

Omitting important factors led to decreased performance of distribution models, biased predictions, and overly optimistic evaluations with standard methods. However, the effects depended on the characteristics of the omitted variables and on their relationships to other factors and to the species’ distribution. The greatest performance and evaluation issues arose from omitted factors that were spatially correlated with random (but available) variables. As a consequence, functionally unimportant variables were selected in lieu of the unavailable ones, and these variables only worked in the specific landscape to which the model was fit (or, more precisely, overfit). Such models evaluated deceptively well on the same landscape but showed a large performance drop when evaluated on functionally equivalent but spatially reconfigured landscapes. It is realistic to expect that many important factors are regularly omitted from ecological models based on big data. Most of these factors will have spatial patterns at a variety of scales and will correlate with other factors. Researchers need to be aware that the resulting models likely perform less well than standard evaluations suggest and likely show significant and misleading biases. Progress will depend on improving the coverage of important ecological factors and on more rigorous modeling and evaluation techniques.