2020 ESA Annual Meeting (August 3 - 6)

COS 3 Abstract - Zen and the art of cutting the tree: Phylofactorization of ecological big-data yields a theory for how to harness the ecological data revolution

Alex Washburne, Montana State University; Jacob B. Socolar, Ecology & Evolutionary Biology, University of Connecticut, Storrs, CT; Florent Mazel, Biology, Simon Fraser University; Giulio Dalla Riva, Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand; and Raina K. Plowright, Microbiology and Immunology, Montana State University, Bozeman, MT
Background/Question/Methods

Ecological datasets, from eBird and GBIF to microbiomes and pathogen libraries, share a common structure: many species and meta-data associated with those species. Our questions typically revolve around finding associations between meta-data and species: Which birds are vulnerable to land use changes? Which microbes are associated with disease? Which viruses spill over to humans and in which reservoirs are they found? Classically, biology has organized these findings with taxonomic "mowing": summarizing how groups at a fixed level, such as phyla or families, are associated with meta-data. Taxonomic mowing, however, misses crucial patterns in our data, patterns which we have the tools to find.

Results/Conclusions

All multi-species ecological datasets have a tree structure connecting the species, whether that's the taxonomy or the phylogeny. By embedding a means of cutting the tree into regression-based analyses, it's possible to find the lineages with the strongest association with meta-data. In this talk, I will show what happens when we step away from mowing and instead ask "which lineages best summarize our data?" Regardless of whether we're studying birds or bacteria, mammals or viruses, the lineages driving patterns in our data are never at a fixed depth, revealing that taxonomic mowing reliably misses the patterns of how species are changing in our data. This new method for agnostically searching a tree for lineages with associations - phylofactorization - was recently published in Ecological Monographs and is available in an R package, phylofactor, to help researchers maximize the value of their data.
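
As a rough illustration of what "cutting the tree" inside a regression can look like, the Python sketch below performs a single greedy step: it enumerates every edge of a small tree, contrasts the species on either side of the cut, and keeps the edge whose contrast is most strongly associated with one meta-data variable. This is not the phylofactor package's API or the published algorithm. The toy tree, the toy data, the function names (`edge_splits`, `contrast`, `best_edge`), and the scoring rule (squared correlation of a difference in mean log-abundance) are all assumptions made for illustration.

```python
# Hypothetical toy tree as nested tuples; leaves are species names.
TREE = ((("a", "b"), "c"), ("d", "e"))

# Hypothetical meta-data (one value per sample) and per-species log-abundances.
META = [0, 1, 2, 3]
LOG_ABUND = {
    "a": [1, 0, 1, 4],    # tracks META, plus noise
    "b": [-1, 2, 3, 2],   # tracks META, minus the same noise
    "c": [1, -1, -1, 1],  # noise only
    "d": [1, -3, 3, -1],  # noise only
    "e": [-2, 4, -2, 0],  # noise only
}

def leaves(node):
    """Return the species below a node."""
    if isinstance(node, str):
        return [node]
    return [sp for child in node for sp in leaves(child)]

def edge_splits(tree):
    """Yield (clade, complement) once for every distinct cut of the tree."""
    all_species = frozenset(leaves(tree))
    stack, seen = list(tree), set()
    while stack:
        node = stack.pop()
        clade = frozenset(leaves(node))
        complement = all_species - clade
        key = frozenset([clade, complement])
        if key not in seen:  # the two root edges define the same cut
            seen.add(key)
            yield clade, complement
        if not isinstance(node, str):
            stack.extend(node)

def contrast(log_abund, clade, complement):
    """Per-sample difference in mean log-abundance across a cut."""
    n_samples = len(next(iter(log_abund.values())))
    return [
        sum(log_abund[s][i] for s in clade) / len(clade)
        - sum(log_abund[s][i] for s in complement) / len(complement)
        for i in range(n_samples)
    ]

def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor regression."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)

def best_edge(tree, log_abund, meta):
    """Return (score, clade) for the cut most associated with the meta-data."""
    return max(
        ((r_squared(meta, contrast(log_abund, c, k)), c)
         for c, k in edge_splits(tree)),
        key=lambda pair: pair[0],
    )

score, clade = best_edge(TREE, LOG_ABUND, META)
print(sorted(clade), round(score, 3))
```

On this toy data the winning cut isolates the clade containing species a and b, which sits at neither the shallowest nor the deepest level of the tree. A fuller analysis would presumably repeat the search within the resulting sub-trees and use a regression model suited to the data at hand.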

Phylofactorization of many datasets reveals a paradigm and pedagogy for connecting biological theory with data analysis. For example, if we asked an introductory biology course "which vertebrates live on land, and which live in the sea?", most biologists would answer with a version of phylofactorization: tetrapods live on land, and within tetrapods there are lineages that live in the sea (Cetaceans, Pinnipeds, etc.). The phylofactor R package implements this paradigm with tools to summarize the taxonomic composition of the phylogenetic lineages uncovered, simplify parallelization, and visualize the results. With these tools, cutting trees to analyze biological data will prove invaluable for harnessing the ecological big-data revolution. This talk will be accessible to all biology audiences and will simplify concepts like "neural networks" by showing how phylofactorization can construct neural networks constrained to have a clear, evolutionary interpretation.