Tools built on the R platform have come to dominate the microbial ecology toolkit. Packages like dada2 and phyloseq allow R to support the complete amplicon sequencing pipeline, from sequencer output through data analysis and presentation of results. Analysis packages such as DESeq2 and limma give R a substantial advantage in inferential statistics over other platforms. Comfort in the R environment is therefore a fundamental skill for budding microbial ecologists. There is an ongoing and robust debate about which a posteriori normalization methods are appropriate or necessary, given the known peculiarities that make ecological inference from sequence data challenging. For instance, typical DNA extraction and sequencing methods confound species relationships, and these biases must be accounted for retroactively to make accurate inferences. Methods like rarefying the dataset have been developed to account for these biases and, within ecology, are unique to sequencing projects; yet there is intense debate over whether rarefying is an appropriate approach. This raises two questions: what are the appropriate normalization techniques for microbial ecology, and how do we benchmark disparate methods to determine which is best?
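For readers unfamiliar with rarefying, the idea can be sketched as random subsampling of each sample's reads, without replacement, down to a common sequencing depth. The snippet below is a minimal illustration in Python rather than the R tooling the course uses; the function name `rarefy`, the toy count vectors, and the choice of the smallest library size as the target depth are all assumptions made for illustration and are not taken from the course materials.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a taxon count vector without replacement to a fixed depth.

    Hypothetical helper for illustration only.
    """
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the sample's library size")
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by taxon index.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    kept = rng.sample(reads, depth)  # sampling without replacement
    rarefied = [0] * len(counts)
    for taxon in kept:
        rarefied[taxon] += 1
    return rarefied


# Two toy samples with the same community structure but a tenfold
# difference in sequencing depth (made-up numbers).
sample_a = [500, 300, 150, 50]       # 1,000 reads
sample_b = [5000, 3000, 1500, 500]   # 10,000 reads

# Assumed convention: rarefy every sample to the smallest library size.
depth = min(sum(sample_a), sum(sample_b))
rare_a = rarefy(sample_a, depth)
rare_b = rarefy(sample_b, depth)

print(rare_a, sum(rare_a))
print(rare_b, sum(rare_b))
```

After rarefying, both samples have the same total read count, but 9,000 of the deeper sample's reads have been discarded. That loss of data is one reason the appropriateness of the method is debated.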
Results/Conclusions
We present a hands-on approach to teaching the bioinformatics side of microbial ecology in which students benchmark differing normalization and inferential statistical approaches using a small mock community of independently modeled OTU abundances. This three-module course introduces students to the R platform and its sequence analysis pipeline, and challenges some common assumptions about microbiome analysis. In the first module, students learn the basic data carpentry skills necessary for operating in R, and complete a simple case study demonstrating the pitfalls of treating sequence data as “relative abundance” (or simple proportions) when discerning changes in microbial abundance. They demonstrate that this approach leads to increased false positive rates and, concurrently, decreased true positive detection rates in a differential abundance analysis. In the second module, students complete the dada2 OTU picking pipeline using a mock sequence dataset. In the third module, students benchmark the DESeq2, limma-voom, and mvabund approaches to inference against one another. In addition, students implement different normalization and data preprocessing approaches to evaluate the impact of, for example, rarefying on the accuracy of inference. These exercises demonstrate the importance of appropriate preprocessing and the increased power of GLM-based approaches for determining differential abundance in sequence data.
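One mechanism behind the false positives seen when sequence data are treated as simple proportions can be shown with a few lines of arithmetic. The sketch below uses Python and made-up absolute abundances for four taxa; it is not the course's case study, and the names `proportions`, `control`, and `treated` are hypothetical. Only the first taxon changes in absolute abundance, yet every other taxon appears to decline once the counts are converted to proportions.

```python
def proportions(counts):
    """Convert a count vector to simple proportions (relative abundance)."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical absolute abundances for four taxa. Only taxon 0 truly
# changes between conditions (a tenfold increase).
control = [100, 100, 100, 100]
treated = [1000, 100, 100, 100]

p_control = proportions(control)
p_treated = proportions(treated)

# Apparent fold change of each taxon when measured as a proportion.
fold = [t / c for t, c in zip(p_treated, p_control)]

for taxon, (true_fc, apparent_fc) in enumerate(
    zip((t / c for t, c in zip(treated, control)), fold)
):
    print(f"taxon {taxon}: true fold change {true_fc:.2f}, "
          f"apparent fold change {apparent_fc:.2f}")
```

Taxa 1 through 3 are unchanged in absolute terms, but their proportions fall to roughly a third of their control values, so a test applied to proportions could flag them as differentially abundant. The true tenfold increase in taxon 0 is also understated, appearing as roughly a threefold change.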