Wed, Aug 04, 2021:On Demand
Background/Question/Methods
Soil microbial communities play critical roles in a variety of ecosystem processes, but due to data limitations involving standardization and accessibility, they have been difficult to study at large spatial and temporal scales. To facilitate their analysis, we introduce the neonMicrobe R package, a suite of downloading, pre-processing, dataset assembly, and sensitivity analysis tools for publicly available marker gene sequencing data from the National Ecological Observatory Network (NEON). NEON soil sampling is conducted multiple times per year for 47 terrestrial sites and 13 ecoclimatic domains and will continue for three decades, making it one of the largest existing standardized sampling efforts for microbial communities. We describe quality-assurance steps used to remove low-quality samples, report on the results of a sensitivity analysis used to choose appropriate processing parameters, and present basic diversity analyses. For the sensitivity analysis, we specifically examined how varying the DADA2 quality filtering parameters that control read truncation length (truncLen) and sequencing error tolerance (maxEE) in reverse reads would affect the retention rate of 16S reads and downstream estimates of bacterial alpha- and beta-diversity. We used amplicon sequence variant (ASV) richness and the Shannon index as our metrics for alpha-diversity, and Bray-Curtis dissimilarity as our metric for beta-diversity.
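For readers unfamiliar with the three diversity metrics named above, the following is a minimal sketch of their standard textbook definitions, written in plain Python for illustration. It is not the neonMicrobe implementation (which is an R package); the function names and the toy count vectors are hypothetical, and the Shannon index is assumed to use the natural logarithm.

```python
import math

def observed_richness(counts):
    """Number of ASVs with a nonzero count in one sample."""
    return sum(1 for c in counts if c > 0)

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln p_i) over ASVs with p_i > 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples' ASV count vectors.

    Both vectors must list the same ASVs in the same order.
    Returns 0 for identical samples and 1 for samples sharing no ASVs.
    """
    if len(a) != len(b):
        raise ValueError("count vectors must have the same length")
    denom = sum(a) + sum(b)
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom

# Toy example (hypothetical counts for four ASVs in two samples).
sample_a = [10, 10, 10, 10]
sample_b = [40, 0, 0, 0]
richness_a = observed_richness(sample_a)
shannon_a = shannon_index(sample_a)
dissimilarity = bray_curtis(sample_a, sample_b)
```

In this toy case the evenly distributed sample has a Shannon index of ln 4, while the single-ASV sample has a Shannon index of zero, which illustrates why the index responds to both richness and evenness.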
Results/Conclusions
Using the neonMicrobe package on the NEON soil microbial data downloaded in August 2020, we found that the 16S dataset contained over 191,000 ASVs from 5,625 samples, and the ITS dataset contained over 199,000 ASVs from 2,195 samples. From the sensitivity analysis, we found that the read retention rate varied across both truncLen and maxEE, but depended on the distribution of read quality scores in each sample, motivating careful examination of the quality profiles. Shannon diversity and observed richness were both sensitive to variation in truncLen, but not maxEE. In a given sample, varying truncLen produced significantly dissimilar representations of community composition, while varying maxEE had no effect. Therefore, we strongly caution against combining datasets processed using different values of truncLen, as this may create artifactual diversity patterns. The resulting sequence abundance tables can be linked to NEON’s other data products (e.g., soil chemical and physical data, plant community characteristics) and to soil subsamples kept in the NEON Biorepository. Using cloud computing platforms (e.g., CyVerse), the neonMicrobe package offers a fully reproducible pipeline to process DNA sequences into ecologically relevant data, which we expect to act as a valuable ecological baseline to inform future research.
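The dependence of read retention on each sample's quality profile can be illustrated with a small sketch of how truncation and expected-error filtering interact. The sketch below assumes the commonly documented DADA2 behavior: reads shorter than truncLen are discarded, longer reads are truncated to truncLen, and a truncated read is then discarded if its expected errors, EE = sum(10^(-Q/10)) over its Phred scores Q, exceed maxEE. The function names and toy quality profiles are hypothetical, and this is plain Python rather than the DADA2 or neonMicrobe code; it also assumes trunc_len is a positive integer.

```python
def expected_errors(quals):
    """Expected number of errors in a read, given its Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)

def passes_filter(quals, trunc_len, max_ee):
    """Apply truncation, then the expected-error threshold, to one read."""
    if len(quals) < trunc_len:
        return False  # too short to truncate: discarded
    return expected_errors(quals[:trunc_len]) <= max_ee

def retention_rate(reads, trunc_len, max_ee):
    """Fraction of reads (lists of Phred scores) that survive filtering."""
    if not reads:
        return 0.0
    kept = sum(passes_filter(q, trunc_len, max_ee) for q in reads)
    return kept / len(reads)

# Toy reads (hypothetical): quality drops from Q30 to Q10 after base 150
# in the first read, stays at Q30 in the second, and the third is short.
reads = [
    [30] * 150 + [10] * 100,
    [30] * 250,
    [30] * 180,
]
rate_short_trunc = retention_rate(reads, trunc_len=150, max_ee=2)
rate_long_trunc = retention_rate(reads, trunc_len=200, max_ee=2)
```

With these toy reads, a longer truncation length loses the first read to its low-quality tail and the third read to the length requirement, so the same maxEE yields a lower retention rate. This is one way a sample's quality profile could shape the outcome of a given parameter choice.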