Herbaria across the United States are digitizing millions of specimens, drastically increasing the accessibility of meticulously preserved vouchers. These specimens contain a wealth of information about species’ ecology, such as flower and fruit phenology and leaf size and shape, but extracting such phenotypic data from images is time-consuming and typically requires manual user input. We use recent advances in computer vision and machine learning to automatically populate a database of phenotypic traits and metrics from digitized herbarium specimens. Purpose-built convolutional neural networks and support vector machines segment each image, identifying leaves, fruit, stems, nodes, and text for processing. Machine learning and contextual algorithms locate and interpret distance scales in images to convert pixel-distance into
Results/Conclusions
Our pipeline can generate usable data from images that vary widely in quality. Using leaf stand-ins with known characteristics and a variety of metric scales, we find that our application of machine learning has classification error rates of less than ten percent in most cases. The error rate, however, depends strongly on the composition of the voucher and the degree of overlap or demarcation between leaves. Our novel application of machine learning has the potential to vastly increase available trait information and to help test ecologically relevant hypotheses related to community dynamics, adaptation, and global climate change.
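
The scale-based conversion mentioned in the methods can be illustrated with a minimal sketch. It assumes that a distance scale has already been located in the image as a segment of known physical length, and that the target unit is millimetres. The names `ScaleBar`, `mm_per_pixel`, and `leaf_metrics_mm` are hypothetical and are not taken from the pipeline itself. The sketch covers only the arithmetic of the conversion, not the detection or segmentation steps.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleBar:
    """Hypothetical record of a detected scale segment.

    Endpoints are in pixel coordinates; length_mm is the physical
    length that the segment is assumed to represent.
    """
    x0: float
    y0: float
    x1: float
    y1: float
    length_mm: float


def mm_per_pixel(scale: ScaleBar) -> float:
    """Return the physical size of one pixel, in millimetres."""
    pixel_length = math.hypot(scale.x1 - scale.x0, scale.y1 - scale.y0)
    if pixel_length <= 0:
        raise ValueError("scale segment has zero length in pixels")
    return scale.length_mm / pixel_length


def leaf_metrics_mm(length_px: float, width_px: float, area_px: float,
                    scale: ScaleBar) -> dict:
    """Convert pixel measurements of one leaf to metric units.

    Lengths scale linearly with the conversion factor; areas scale
    with its square.
    """
    k = mm_per_pixel(scale)
    return {
        "length_mm": length_px * k,
        "width_mm": width_px * k,
        "area_mm2": area_px * k * k,
    }


# Illustrative values only: a 100 mm scale spanning 500 pixels.
example_scale = ScaleBar(x0=0, y0=0, x1=300, y1=400, length_mm=100.0)
example_leaf = leaf_metrics_mm(length_px=250, width_px=100,
                               area_px=10000, scale=example_scale)
```

Because area scales with the square of the conversion factor, an error in the detected scale length affects area estimates more strongly than it affects length or width estimates.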