2020 ESA Annual Meeting (August 3 - 6)

COS 145 Abstract - Challenges in using deep learning to identify animals in the real world

Zhongqi Miao, Department of Environmental Science, Policy, and Management, University of California, Berkeley, Berkeley, CA; Ziwei Liu, The Chinese University of Hong Kong, Hong Kong, China; Kaitlyn M Gaynor, National Center for Ecological Analysis and Synthesis, University of California, Santa Barbara, Santa Barbara, CA; Stella X. Yu, Vision Science, University of California, Berkeley, Berkeley, CA; and Wayne M. Getz, Environmental Science, Policy, and Management, University of California, Berkeley, Berkeley, CA
Background/Question/Methods

Deep learning has attracted much attention from the ecological community for its ability to extract and generalize patterns from data with highly complex structure, such as images, audio recordings, and motion signals. However, despite promising case studies, deep learning can fall short when applied to real-world datasets such as camera trap images.

Results/Conclusions

Here, we consider three intrinsic limitations of deep learning in real-world applications through a case study of camera trap image classification in Gorongosa National Park.

1) Ecological datasets have long-tailed class distributions. We show that in the Gorongosa National Park camera trap dataset, fewer than 1% of the images captured are of rare and elusive animals such as pangolins, while over 60% of the images are of ubiquitous baboons and waterbucks. This extreme imbalance can lead to classification accuracy differences of over 80% when deep learning methods are applied directly (one common mitigation is sketched after this list).

2) Data collected from multiple domains, e.g., different biomes or different times of the year, can differ substantially even within the same semantic categories. In the Gorongosa camera trap dataset, the difference in background appearance between seasons (winter vs. summer) causes an accuracy decrease of over 50% (see the cross-season evaluation sketch below).

3) Ecological datasets are dynamic. Data from unseen categories and unseen domains are continually collected and must in turn be classified. Traditional “training-validation” protocols, however, fail to account for this ongoing data collection, because updating an existing model requires a large set of training samples. We examine the relationship between deep learning classification performance and the number of annotated camera trap images per category. The results show that when fewer than 50 annotated images per category are available for fine-tuning a deep learning model, performance drops drastically (by roughly 10% - 30%); the last sketch below outlines this kind of experiment.
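
To make the imbalance in point 1 concrete, the following is a minimal sketch, not the authors' pipeline, of one common way to counter a long-tailed class distribution: weighting the cross-entropy loss by inverse class frequency. The species names, image counts, and PyTorch setup are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the study's code): inverse-frequency
# class weights for a long-tailed camera trap dataset.
import torch
import torch.nn as nn
from collections import Counter

# Hypothetical per-image labels drawn from a long-tailed distribution
labels = ["baboon"] * 6000 + ["waterbuck"] * 4000 + ["pangolin"] * 30
classes = sorted(set(labels))
counts = Counter(labels)

# Inverse-frequency weights, scaled so the frequency-weighted mean is 1.0
freqs = torch.tensor([counts[c] for c in classes], dtype=torch.float)
weights = freqs.sum() / (len(classes) * freqs)

# Weighted loss: mistakes on rare classes (e.g., pangolin) cost more
criterion = nn.CrossEntropyLoss(weight=weights)
```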
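The season-to-season accuracy drop in point 2 is typically exposed by scoring one trained classifier separately on each domain. Below is a minimal sketch of such a cross-season evaluation; the folder layout, backbone, and class count are placeholders, not the study's actual model or data.

```python
# Minimal sketch (assumed setup): evaluate the same classifier on images
# from two seasons to expose the domain gap.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

# Stand-in classifier; in a real experiment this network would already be
# trained on images from one season (e.g., the dry season).
model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 20)  # assume 20 species classes
model.eval()

def accuracy(model, folder):
    """Top-1 accuracy over a folder of labeled images (one subfolder per class)."""
    loader = DataLoader(datasets.ImageFolder(folder, tf), batch_size=32)
    correct = total = 0
    with torch.no_grad():
        for images, targets in loader:
            preds = model(images).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total

# Hypothetical directory layout: one labeled image folder per season
dry = accuracy(model, "camtrap/dry_season")
wet = accuracy(model, "camtrap/wet_season")
print(f"dry-season accuracy {dry:.1%}, wet-season accuracy {wet:.1%}")
```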
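Point 3 relates performance to the number of annotated images per category. The sketch below illustrates one way such an experiment could be set up: fine-tune only the classification head of a pretrained network on k images per class for several values of k. The dataset path, backbone, and training loop are simplified assumptions rather than the authors' protocol.

```python
# Minimal sketch (assumed setup): fine-tune a classification head on k
# annotated images per category and compare performance across k.
import random
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, models, transforms

tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
full = datasets.ImageFolder("camtrap/annotated", tf)   # hypothetical path
num_classes = len(full.classes)

def subsample(dataset, k):
    """Keep at most k examples of each class."""
    by_class = {}
    for idx, (_, label) in enumerate(dataset.samples):
        by_class.setdefault(label, []).append(idx)
    keep = [i for idxs in by_class.values()
            for i in random.sample(idxs, min(k, len(idxs)))]
    return Subset(dataset, keep)

for k in (10, 50, 100):
    model = models.resnet50(weights="IMAGENET1K_V1")
    for p in model.parameters():          # freeze the backbone
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head
    loader = DataLoader(subsample(full, k), batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for images, targets in loader:        # one pass, for illustration only
        optimizer.zero_grad()
        loss_fn(model(images), targets).backward()
        optimizer.step()
    # ...evaluate on a held-out set to see how accuracy varies with k
```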