2020 ESA Annual Meeting (August 3 - 6)

OOS 64 Abstract - Improving computer vision for camera traps: Leveraging practitioner knowledge to build better models

Sara Beery1, Guanhang Wu2, Vivek Rathod2, Ronny Votel2 and Jonathan Huang2, (1)California Institute of Technology, Pasadena, CA, (2)Google
Background/Question/Methods

Camera traps are widely used to monitor animal populations and behavior, and generate vast amounts of data. There is a demonstrated need for machine learning models that can automate the process of detecting and classifying animals in camera trap images. Previous work has shown exciting results on automated species classification in camera trap data, but further analysis has shown that these models do not generalize to new cameras or new geographical regions and struggle to categorize rare species or poor-quality images. Consequently, very few organizations have successfully deployed machine learning tools for camera trap image review.

Most state-of-the-art computer vision algorithms are designed to look at one image at a time. However, when human practitioners review camera trap images, they frequently flip back and forth between images taken from a single camera, learning what species are likely to be seen at that location and using good-quality images to help categorize images that are blurry or poorly lit. Inspired by this, we adapt a traditional computer vision detection model to incorporate contextual information from up to a month of data from each camera trap when detecting and classifying animals. We use a flexible attention-based mechanism to overcome variability in temporal sampling, i.e., different trigger rates and burst lengths at each trigger. The contextual "memory" is built before inference using a frozen, object-centric feature extractor, and curated to extend the viable time horizon while maintaining representations of both species and salient background objects. This allows the model to use information across time to learn to ignore false positives, such as trees or bushes, while improving species categorization accuracy at new camera locations.
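To make the attention mechanism concrete, the following is a minimal sketch, not the authors' implementation, of aggregating a per-camera memory bank of frozen object features for each detection in the current image; names such as attend_to_memory, query_feats, and memory_feats are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_to_memory(query_feats, memory_feats, temperature=1.0):
    """Aggregate contextual memory for each query box feature.

    query_feats:  (Q, D) features for candidate boxes in the current image.
    memory_feats: (M, D) frozen features extracted from up to a month of
                  images taken at the same camera location.
    Returns:      (Q, D) context vectors that can be combined with
                  query_feats (e.g., by addition or concatenation) before
                  classification.
    """
    # Scaled dot-product attention: how relevant each memory entry is
    # to each query box, independent of trigger rate or burst length.
    scores = query_feats @ memory_feats.T / (temperature * np.sqrt(query_feats.shape[1]))
    weights = softmax(scores, axis=1)   # (Q, M)
    return weights @ memory_feats       # (Q, D)

# Toy usage: 3 query boxes, 50 memory entries, 128-d features.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 128))
m = rng.normal(size=(50, 128))
context = attend_to_memory(q, m)
print(context.shape)  # (3, 128)

Because the attention weights are computed over however many memory entries exist, the same mechanism handles cameras with sparse triggers and cameras with long bursts without any change to the model.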

Results/Conclusions

We compare our models against traditional single-image detection methods on two public camera trap datasets from very different parts of the world: Snapshot Serengeti and Caltech Camera Traps (both available on LILA.science). We explicitly hold out entire camera locations for testing, to analyze how well the models generalize to new camera locations. We find that by allowing the models to use contextual information, we improve the mean average precision (mAP) by 17.9% on Snapshot Serengeti and 19.5% on Caltech Camera Traps. Our model reduces false positives and improves categorization accuracy on challenging cases, such as animals that are blurry, highly occluded, poorly lit, or obscured by weather such as heavy fog.
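As an illustration of the evaluation protocol, here is a minimal sketch, not the authors' pipeline, of splitting a dataset so that entire camera locations are held out for testing; the field name location_id and the helper split_by_location are assumed for illustration.

import random

def split_by_location(images, test_fraction=0.2, seed=0):
    """Assign every image from a held-out camera location to the test set,
    so evaluation measures generalization to unseen locations rather than
    to unseen images from familiar locations."""
    locations = sorted({img["location_id"] for img in images})
    rng = random.Random(seed)
    rng.shuffle(locations)
    n_test = max(1, int(len(locations) * test_fraction))
    test_locations = set(locations[:n_test])
    train = [img for img in images if img["location_id"] not in test_locations]
    test = [img for img in images if img["location_id"] in test_locations]
    return train, test

# Toy usage: images from two cameras; one camera goes entirely to test.
images = [
    {"file": "a.jpg", "location_id": "cam_01"},
    {"file": "b.jpg", "location_id": "cam_02"},
    {"file": "c.jpg", "location_id": "cam_01"},
]
train, test = split_by_location(images, test_fraction=0.5)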