Machine learning, particularly deep-learning algorithms that employ convolutional neural networks (CNNs), has emerged over the past five years as the breakthrough technology for identifying all sorts of objects in images. To most scientists outside the artificial intelligence realm, this identification process remains somewhat mysterious: a black-box method that offers no insight into how an identification is achieved. If scientists are to apply deep-learning methods more effectively and efficiently in their particular fields of investigation, some understanding of the mechanisms involved is needed.
This is certainly true for ecologists who have captured millions of images remotely, using satellites or movement-triggered cameras installed in the field; processing these images manually can consume tens of thousands of person-hours at great expense. To demystify aspects of artificial intelligence, and hence facilitate automated visual image processing, we deconstruct the features used by a CNN that we trained to identify animal species in more than 100,000 annotated wildlife images obtained in Mozambique. To the best of our knowledge, this is the first time such a deconstruction has been undertaken for wildlife classification.
Results/Conclusions
Here we outline current state-of-the-art methods and present results obtained in training a CNN to classify 20 African wildlife species, with an overall accuracy of 87.5% on a dataset containing 111,467 images. We demonstrate the application of a gradient-weighted class activation mapping (Grad-CAM) procedure to extract the most salient pixels in the final convolution layer, and we show that these pixels highlight features in particular images that in some cases resemble those used to train humans to identify these species. Further, we used mutual-information methods to identify the neurons in the final convolution layer that consistently respond most strongly across a set of images of one particular species, and we then interpreted the features in the image where the strongest responses occur. We also used hierarchical clustering of the feature vectors associated with each image to produce a visual-similarity dendrogram of the identified species, providing a cogent view of how a machine seems to “perceive” similarities and differences among species. Finally, we evaluated where images that were not part of the training set fell within our dendrogram, contrasting images of the 20 species “known” to our CNN with images of species “unknown” to it.
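The core of the Grad-CAM step is simple once the final-layer activations and gradients are in hand: average each channel's gradients to get a per-channel weight, form the weighted sum of the feature maps, and keep only the positive evidence. A minimal NumPy sketch of that computation (array shapes and variable names are illustrative, not the authors' actual pipeline):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Compute a Grad-CAM heatmap from final-conv-layer quantities.

    feature_maps: (K, H, W) activations of the final convolution layer
    gradients:    (K, H, W) gradients of the class score w.r.t. those maps
    """
    # Global-average-pool the gradients: one importance weight per channel
    weights = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of feature maps over channels, then ReLU to keep
    # only features with a positive influence on the class score
    cam = np.tensordot(weights, feature_maps, axes=1)  # shape (H, W)
    cam = np.maximum(cam, 0)
    # Normalize to [0, 1] so the map can be overlaid on the image
    if cam.max() > 0:
        cam /= cam.max()
    return cam

# Toy example with random data (16 channels of 7x7 maps, sizes illustrative)
rng = np.random.default_rng(0)
fmaps = rng.normal(size=(16, 7, 7))
grads = rng.normal(size=(16, 7, 7))
cam = grad_cam(fmaps, grads)
```

In practice the resulting low-resolution map is upsampled to the input image size; the brightest cells mark the pixels the network found most salient for the predicted species.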
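The mutual-information step can be illustrated by scoring each neuron's activation against the species label: a neuron whose activation distribution shifts with the label carries high mutual information. A minimal histogram-based sketch (the estimator, bin count, and toy data are assumptions; the paper's exact method is not reproduced here):

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate mutual information (in nats) between a neuron's
    continuous activations x and integer class labels y."""
    # Discretize activations into equal-width bins
    edges = np.histogram_bin_edges(x, bins=bins)
    xb = np.clip(np.digitize(x, edges[1:-1]), 0, bins - 1)
    # Joint distribution of (binned activation, label)
    joint = np.zeros((bins, int(y.max()) + 1))
    np.add.at(joint, (xb, y), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Toy check: one neuron tracks the label, another is pure noise
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=2000)
informative = labels + 0.3 * rng.normal(size=2000)
uninformative = rng.normal(size=2000)
mi_hi = mutual_information(informative, labels)
mi_lo = mutual_information(uninformative, labels)
```

Ranking all final-layer neurons by this score (e.g. with `np.argsort`) picks out the units that respond most consistently across images of a given species, whose receptive fields can then be inspected.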
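The dendrogram construction follows standard agglomerative clustering of per-species feature vectors. A sketch using SciPy's hierarchical-clustering routines (the feature dimension, random data, and Ward linkage are assumptions made for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical per-species feature vectors, e.g. the mean final-conv-layer
# feature vector over all images of each species (values random here)
rng = np.random.default_rng(42)
n_species, n_features = 20, 512
species_features = rng.normal(size=(n_species, n_features))

# Ward linkage on Euclidean distances; each row of Z records one merge:
# (cluster_i, cluster_j, merge_distance, size_of_new_cluster)
Z = linkage(species_features, method="ward")

# dendrogram(Z, labels=species_names) would render the similarity tree;
# no_plot=True just extracts its structure
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]  # species ordered by visual similarity
```

Species that the network "perceives" as visually similar merge at small distances and appear as neighboring leaves; a held-out image can then be placed in the tree by the proximity of its feature vector to the species means.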