2020 ESA Annual Meeting (August 3 - 6)

COS 73 Abstract - Novel statistical methods to leverage low-coverage whole-genome sequencing data in biodiversity studies

Shahab Sarmashghi (1), Metin Balaban (2), Eleonora Rachtman (2), Siavash Mirarab (1) and Vineet Bafna (3); (1) Electrical and Computer Engineering Department, UC San Diego, La Jolla, CA, (2) Bioinformatics and Systems Biology Graduate Program, UC San Diego, La Jolla, CA, (3) Computer Science and Engineering Department, UC San Diego, La Jolla, CA
Background/Question/Methods

Rapid climate change and anthropogenic destruction of natural habitats have resulted in an alarmingly fast erosion of biodiversity and portend mass extinction of many threatened species in the coming decades. Devising effective policies to stop and reverse the current trends hinges on extensive and reliable assessments of the inter- and intra-specific diversity across vulnerable ecosystems.

Whole-genome sequencing analysis provides accurate estimates of genetic diversity, and decreasing sequencing costs (<$10 per Gb) make it an attractive alternative to marker-based genetic approaches. However, most non-model organisms lack assembled genomes, and the resources and expertise needed for de novo assembly remain prohibitively expensive for conservation efforts with modest budgets. Methods that can use lightly sampled genomes from short reads (genome skims) without requiring genome assembly could therefore be transformative for genomic ecology.

Here, we present a collection of assembly-free and mapping-free algorithmic and machine learning methods that analyze the distribution of k-mers (words of fixed length k) in low-coverage genome skims to perform 1) sample identification and phylogenetic placement, 2) estimation of genomic properties such as total length and repeat spectra, and 3) estimation of heterozygosity, population size, and population structure.
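To make the k-mer idea concrete, the following is a minimal Python sketch of counting canonical k-mers in a set of reads and comparing two samples by the Jaccard index of their k-mer sets. It is an illustration only and not the authors' implementation: the function names (`count_kmers`, `jaccard`, `mash_distance`) are hypothetical, and the Jaccard-to-distance conversion is the well-known Mash-style formula D = -(1/k) ln(2J / (1 + J)), which assumes complete, error-free k-mer sets and does not correct for low coverage or sequencing error.

```python
import math
from collections import Counter

# Translation table for reverse-complementing DNA strings.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over an iterable of read strings.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def jaccard(kmers_a, kmers_b):
    """Jaccard index of two k-mer collections (counts are ignored)."""
    set_a, set_b = set(kmers_a), set(kmers_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash-style distance estimate from a Jaccard index j and k-mer length k.

    Assumption: both k-mer sets are complete and error-free, which does not
    hold for low-coverage genome skims without further correction.
    """
    if j <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)
```

For example, `jaccard(count_kmers(reads_a, 31), count_kmers(reads_b, 31))` would compare two skims at k = 31, where `reads_a`, `reads_b`, and the choice of k are placeholders rather than values from this abstract.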

Results/Conclusions

In our tests using hundreds of simulated genome skims and whole-genome sequencing reads from insects and birds, our tools accurately placed the query on the phylogenetic tree in 95% of cases, even with only 0.5X coverage and even when the closest match differed at >10% of nucleotides.
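As a rough illustration of why 0.5X coverage is a demanding setting, the sketch below applies the standard Poisson (Lander-Waterman style) sampling model to estimate what fraction of a genome's k-mers is expected to appear in a skim at all. This is a textbook approximation rather than the authors' model: it ignores sequencing errors and repeats, and the read length of 100 bp and k = 31 in the usage note are assumed values, not figures from this abstract.

```python
import math


def kmer_coverage(base_coverage, read_len, k):
    """Expected k-mer coverage given per-base coverage and read length.

    A read of length read_len contains read_len - k + 1 k-mers, so k-mer
    coverage is lower than base coverage by that ratio.
    """
    return base_coverage * (read_len - k + 1) / read_len


def fraction_kmers_observed(base_coverage, read_len, k):
    """Expected fraction of genomic k-mers seen at least once.

    Assumes k-mer occurrences are Poisson distributed with mean equal to the
    k-mer coverage, with no sequencing errors and no repeated k-mers.
    """
    lam = kmer_coverage(base_coverage, read_len, k)
    return 1.0 - math.exp(-lam)
```

Under these assumptions, `fraction_kmers_observed(0.5, 100, 31)` comes out at roughly 0.3, so most k-mers of the genome are missing from the skim, and any distance estimate has to account for that sampling effect.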

Using light (1X coverage) genome-skimming data sampled from 622 assembled genomes of invertebrates, vertebrates, and plants in the RefSeq database, our estimates of genome length were within 1% of the correct length, in contrast to the >30% errors seen with other tools. Importantly, we obtained good estimates of the repeat content of these genomes even at 1X coverage, allowing us to determine whether an organism had undergone a recent whole-genome duplication.

We could also estimate population-level diversity (<1 in 1000 bp) in multiple datasets. Our methods (a) successfully separated 13 genome skims of white rhinos into northern (9) and southern (4) subspecies; (b) assigned 92 white oaks to their continent of origin; and (c) estimated heterozygosity within two closely related species of finch (10^-4 precision at 2X coverage) while identifying population substructures related to their (sub)speciation and breeding patterns.

The results demonstrate that our computational tools enable ecologists to apply the cost-effective genome-skimming approach to a variety of large-scale ecological problems.