Studying communities requires accurate data on the presence and abundance of multiple species coexisting within a local area. However, most datasets contain only a subset of the species that actually belong to the community. Multiple methods exist for estimating the true number of species in a community from an incomplete sample, but such methods provide no information on the identities of the predicted species. Here, we describe and compare multiple methods for estimating the identities of unsampled species that rely on data collected independently from the community dataset (e.g., checklists, range maps, and species distribution models). We use these data to construct probabilistic species pools that rank the likelihood of community membership and assign taxonomic identities to unsampled species predicted via the Chao-Shen species richness estimator. We compare predictions based on random sampling from a regional checklist to those that use additional information regarding regional abundance and environmental suitability derived from species distribution models. The accuracy of these methods is then calculated at several different degrees of undersampling. We use community data collected from Powdermill Nature Reserve in southwestern Pennsylvania to test these methods.
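As a rough illustration of the workflow described above, the following Python sketch estimates how many species are missing from a sample and then draws their identities from a weighted regional pool. It is a minimal sketch under stated assumptions, not the implementation used in this study: the bias-corrected Chao1 estimator stands in for the Chao-Shen estimator, and the function names (`chao1_missing`, `predict_missing`) and the `pool_weights` mapping are hypothetical. Equal weights correspond to random sampling from a checklist; weights set to regional abundance or to suitability scores from a species distribution model correspond to the informed pools.

```python
import random


def chao1_missing(abundances):
    """Bias-corrected Chao1 estimate of the number of unsampled species.

    Assumption: Chao1 is used here as a simple stand-in; the abstract
    itself refers to the Chao-Shen estimator.
    """
    f1 = sum(1 for n in abundances if n == 1)  # species seen exactly once
    f2 = sum(1 for n in abundances if n == 2)  # species seen exactly twice
    return f1 * (f1 - 1) / (2 * (f2 + 1))


def predict_missing(observed, pool_weights, n_missing, rng=None):
    """Draw identities for unsampled species from a weighted regional pool.

    observed     -- set of species already recorded in the sample
    pool_weights -- hypothetical dict mapping species name to a weight
                    (1.0 for every species = random checklist sampling)
    n_missing    -- number of identities to assign
    Sampling is without replacement and proportional to weight.
    """
    rng = rng or random.Random(0)
    candidates = {s: w for s, w in pool_weights.items()
                  if s not in observed and w > 0}
    picks = []
    for _ in range(min(n_missing, len(candidates))):
        names = list(candidates)
        choice = rng.choices(names, weights=[candidates[s] for s in names], k=1)[0]
        picks.append(choice)
        del candidates[choice]  # never pick the same species twice
    return picks
```

In use, the estimate from `chao1_missing` would be rounded to a whole number of species and passed to `predict_missing` as `n_missing`.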
Results/Conclusions
We sampled ~40,000 individuals from 51 species. Visual inspection of the species accumulation curve indicated that this intensive sampling adequately represented the true community. The Chao-Shen species richness estimator indicated that only 6 species were missing from our sample. We used subsets of these data to assess the accuracy of our methods at different levels of undersampling (i.e., 50%, 40%, 30%, and 20%). Randomly assigning taxonomic identities based on a regional checklist had relatively good predictive power at small degrees of undersampling (<20%), but performed very poorly when more than 25% of species were missing. In contrast, ranking species membership based on regional abundance considerably improved prediction accuracy. Likewise, predictions based on environmental suitability estimated using species distribution models greatly improved the accuracy of our species predictions. Our results suggest that not only can the number of species missing from a sample be estimated, but their taxonomic identities can be predicted as well. Our methods can be used together with models of species abundance distributions to estimate the local abundance of missing species. Ultimately, the ability to accurately estimate community membership and abundance across regions would revolutionize the field of community ecology.
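The accuracy assessment described above can be sketched as a hold-out test: hide a fraction of the known community, predict the hidden species from the ranked pool, and score the overlap. This is a hedged sketch only. The function name `holdout_accuracy`, the `pool_weights` mapping, and the choice to hide species uniformly at random are all assumptions made for illustration; the abstract does not state how the data subsets were drawn, and real undersampling would tend to miss rare species first.

```python
import random


def holdout_accuracy(community, pool_weights, frac_missing, n_trials=200, seed=0):
    """Mean fraction of hidden species recovered by a ranked species pool.

    community    -- set of species known to be present
    pool_weights -- hypothetical dict mapping every regional species to a
                    membership weight (e.g. regional abundance or
                    modelled environmental suitability)
    frac_missing -- fraction of the community to hide in each trial
    """
    rng = random.Random(seed)
    species = sorted(community)
    k = round(frac_missing * len(species))
    if k < 1:
        raise ValueError("frac_missing hides no species at this community size")
    total = 0.0
    for _ in range(n_trials):
        # Assumption: species are hidden uniformly at random.
        hidden = set(rng.sample(species, k))
        observed = set(species) - hidden
        candidates = [s for s in pool_weights if s not in observed]
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(candidates, key=lambda s: (-pool_weights[s], s))
        predicted = set(ranked[:k])
        total += len(predicted & hidden) / k
    return total / n_trials
```

Running this function with equal weights and then with abundance-based or suitability-based weights, at each level of undersampling, gives a like-for-like comparison of the pool types.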