2020 ESA Annual Meeting (August 3 - 6)

SYMP 23 Abstract - Predicting phenotype from multi-scale genomic and environment data using neural networks and knowledge graphs

Anne E. Thessen1,2, Ryan Bartelme3, Michael Behrisch4, Emily Jean Cain5, Remco Chang6, Ishita Debnath7, P. Bryan Heidorn8, Pankaj Jaiswal9, David S. LeBauer5, Ab Mosca6, Monica C. Munoz-Torres10, Arun Ross11, Kent Shefchek12 and Tyson Swetnam13, (1)Oregon State University, Corvallis, OR, (2)Ronin Institute, (3)University of Arizona, (4)Information and Computing Sciences, Utrecht University, Utrecht, MA, Netherlands, (5)College of Agriculture and Life Sciences, University of Arizona, Tucson, AZ, (6)Department of Computer Science, Tufts University, Medford, MA, (7)Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, (8)School of Information & Data7, University of Arizona, Tucson, AZ, (9)Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR, (10)Dept. of Environmental and Molecular Toxicology, Oregon State University, Corvallis, OR, (11)Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, (12)Enviro/Molecular Toxicology, Oregon State University, Corvallis, OR, (13)CyVerse, University of Arizona, Tucson, AZ
Background/Question/Methods: To mitigate the effects of climate change on public health and conservation, we need to better understand the dynamic interplay between biological processes and environmental effects. Machine learning (ML) methods in general, and deep learning (DL) methods in particular, are a potential way forward because they can cope with the nonlinearity of natural systems. However, several barriers exist, including the absence of ML-ready data. We propose to develop a machine learning framework capable of predicting phenotypes from multi-scale data about genes and environments. A critical part of this framework is a set of data transformation methods that map the heterogeneous input data into formats consumable by ML techniques. The central hypothesis of this research is that deep learning algorithms and biological knowledge graphs will predict phenotypes more accurately across more taxa and more ecosystems than current numerical and traditional statistical modeling methods do. Our long-term goal is to develop predictive analytics for organismal response to environmental perturbations using innovative data science approaches. This pilot project on predicting emergent properties of complex systems and multidimensional interactions is funded by the NSF (Awards #1939945, #1940059, #1940062, and #1940330).

Results/Conclusions: We have established shared project governance, communication channels, a project timeline, and a shared data and computing environment across four universities. We have developed an initial data model. We have also successfully reached out to three other projects for broader collaboration.