2020 ESA Annual Meeting (August 3 - 6)

COS 91 Abstract - A modified occupancy modeling approach to account for classification error in automated biodiversity surveys

Justin Kitzes, Department of Biological Sciences, University of Pittsburgh, Pittsburgh, PA, Tessa Rhinehart, Biological Sciences, University of Pittsburgh, Pittsburgh, PA and Daniel Turek, Mathematics and Statistics, Williams College, Williamstown, MA
Background/Question/Methods

Acoustic recorders, camera traps, and other similar sensor-based technologies are increasingly used to conduct field surveys for species of conservation concern. The data sets produced by these devices are almost always too large to be reviewed completely by human annotators. Machine learning-based species classifiers can be used to identify the species present in these data sets, but the identifications they produce often contain substantial errors, including both false positives and false negatives. Faced with these errors, ecologists and conservation biologists may resort to manual review of a subset of the data, discard large numbers of potential presences that may be imperfectly classified, or ignore classification error entirely. Here, we explore extensions to previously proposed occupancy models that are designed to account for both false negatives and false positives in occupancy surveys. Our main goal was to simultaneously model the continuous scores generated by an automated classifier, which express confidence in the presence of a species in a file, along with unambiguous file-level presence or absence annotations made by a human.

Results/Conclusions

Our final model is based on a Gaussian mixture model with five unknown parameters: the probability that a species vocalization will be present in a file given that the site is occupied, and the mean and variance of the classifier scores conditional on the presence or absence of a vocalization in a file. Model fitting can proceed either through maximum likelihood or with a hierarchical Bayesian approach. Using simulated data, we show that the models can be fitted, with all five parameters identifiable, for a wide range of realistic combinations of occupancy probability, vocalization rate, and classifier performance. We specifically demonstrate that a model based on continuous classifier scores returns consistently less biased estimates of parameters of interest than a model in which the continuous scores are dichotomized to presence and absence. In an empirical example using human-annotated bird songs recorded in central Pennsylvania, we show that our models provide robust estimates of occupancy even when only a small fraction of human annotations are used. These results demonstrate the importance of explicitly including richer descriptions of classifier uncertainty in models designed to perform ecological inference on machine learning-classified data sets.
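
The mixture structure described above can be illustrated with a minimal sketch of a per-site log-likelihood. This is not the authors' implementation: the function names, the parameter names (psi for occupancy probability, theta for the per-file vocalization probability, mu0/sd0 and mu1/sd1 for the score distributions without and with a vocalization), and the choice to pass psi in as a separate input are all assumptions made for illustration. The human annotations are omitted for brevity.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd**2) distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b)), allowing -inf inputs."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def safe_log(p):
    """log(p) that returns -inf instead of raising when p == 0."""
    return math.log(p) if p > 0.0 else -math.inf


def site_log_likelihood(scores, psi, theta, mu0, sd0, mu1, sd1):
    """Log-likelihood of one site's classifier scores (hypothetical sketch).

    Assumed structure: the site is occupied with probability psi. If it is
    occupied, each file contains a vocalization with probability theta, so
    its score is a two-component Gaussian mixture. If it is unoccupied,
    every score comes from the "absent" Gaussian.
    """
    log_occupied = safe_log(psi)
    log_unoccupied = safe_log(1.0 - psi)
    for x in scores:
        absent = normal_logpdf(x, mu0, sd0)
        present = normal_logpdf(x, mu1, sd1)
        # Occupied site: mix over vocalization present / absent in this file.
        log_occupied += log_add(safe_log(theta) + present,
                                safe_log(1.0 - theta) + absent)
        # Unoccupied site: no vocalization is possible.
        log_unoccupied += absent
    # Marginalize over the unknown occupancy state of the site.
    return log_add(log_occupied, log_unoccupied)
```

Summing this quantity over sites would give an objective that could, under these assumptions, be maximized numerically or used inside a Bayesian sampler. Working in log space avoids underflow when a site has many files.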