2018 ESA Annual Meeting (August 5 -- 10)

PS 66-198 - The fruits of provenance

Friday, August 10, 2018
ESA Exhibit Hall, New Orleans Ernest N. Morial Convention Center
Emery Boose (1), Aaron Ellison (1), Elizabeth Fong (2), Matthew K. Lau (1), Barbara S. Lerner (2), Jackson Okuhn (3), Thomas Pasquier (3) and Margo Seltzer (3); (1) Harvard Forest, Harvard University, Petersham, MA; (2) Computer Science, Mt. Holyoke College, South Hadley, MA; (3) Computer Science, Harvard University, Cambridge, MA
Background/Question/Methods

The software tools that scientists use to process and analyze data are typically optimized for performance and ease of use. Few if any such tools are designed to capture and record the details of what happens as the tool performs its task(s). This detailed information, and more generally the history of an item of data from its creation to its present state, is known as provenance. It is our belief that provenance has great potential to make science more transparent, reliable, and reproducible.

In this project, we have developed tools to collect provenance for scripts written in the R statistical language, which is widely used by ecologists and environmental scientists for data analysis and visualization. The resulting tools are reaching maturity and allow users to select the level of detail to be collected, execute an R script, and store the resulting provenance in a standard format. The provenance provides a detailed record of the steps that were executed and the intermediate data values that were created in a particular execution of a script.
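To make the idea of a per-execution record concrete, here is a minimal sketch written in Python rather than R. It is not the tools described above: the function name run_with_provenance and the record layout are assumptions made for illustration, and the JSON output is only a stand-in for the standard format the abstract mentions. The sketch runs a script one statement at a time and records, for each step, the statement text, the elapsed time, and the values that changed.

```python
import json
import time


def run_with_provenance(statements):
    """Execute statements one at a time and record what each step did.

    Hypothetical helper for illustration only; it handles simple top-level
    statements and compares values by their printed representation.
    """
    namespace = {}
    records = []
    for step, source in enumerate(statements, start=1):
        # Snapshot the printed form of every user variable before the step.
        before = {
            name: repr(value)
            for name, value in namespace.items()
            if not name.startswith("__")
        }
        started = time.perf_counter()
        exec(source, namespace)
        elapsed = time.perf_counter() - started
        # Keep only variables that are new or whose printed form changed.
        changed = {
            name: repr(value)
            for name, value in namespace.items()
            if not name.startswith("__") and before.get(name) != repr(value)
        }
        records.append(
            {
                "step": step,
                "source": source,
                "elapsed_seconds": elapsed,
                "values": changed,
            }
        )
    return records


script = [
    "raw = [3.0, 4.5, 6.0]",
    "total = sum(raw)",
    "mean = total / len(raw)",
]
records = run_with_provenance(script)
print(json.dumps(records, indent=2))
```

Each record pairs a step with the intermediate values it produced, which is the kind of information that downstream applications could draw on without the user reading the raw provenance.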

Results/Conclusions

Our experience with users to date suggests that few if any scientists are interested in working with provenance directly, even if it might improve their understanding of their own scripts or the scripts of others. Our efforts are therefore now focused on developing applications that use provenance to perform tasks that support scientists in their work.

Promising applications that we have created or are developing include tools to: clean a script to remove non-essential elements, identify all occurrences of a variable for quality control or error propagation, find which parts of a script require the most computation time, improve script debugging through access to intermediate data values, record details of the computing environment and versions of all libraries used, and preserve all input values (including transient values and random numbers) needed to reproduce a particular result.
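As an illustration of the first application, script cleaning, the following Python sketch walks backward through a list of provenance-like records and keeps only the statements that a chosen result depends on. This is a sketch under stated assumptions, not the authors' implementation: the function name clean_script and the record fields "source", "reads", and "writes" are hypothetical, and the records are written by hand here rather than collected from a real execution.

```python
def clean_script(records, target):
    """Return the statements needed to compute `target`, in original order.

    Hypothetical helper: each record lists the variables a statement read
    and wrote, standing in for data dependencies found in provenance.
    """
    needed = {target}
    kept = []
    # Walk from the last statement to the first, following dependencies.
    for record in reversed(records):
        writes = set(record["writes"])
        if writes & needed:
            kept.append(record["source"])
            # The written variables are now explained; their inputs are needed.
            needed = (needed - writes) | set(record["reads"])
    kept.reverse()
    return kept


# Hand-written example records (assumed layout, not a real provenance file).
records = [
    {"source": "raw = [3.0, 4.5, 6.0]", "reads": [], "writes": ["raw"]},
    {"source": "label = 'site A'", "reads": [], "writes": ["label"]},
    {"source": "total = sum(raw)", "reads": ["raw"], "writes": ["total"]},
    {"source": "count = len(raw)", "reads": ["raw"], "writes": ["count"]},
    {"source": "mean = total / count", "reads": ["total", "count"], "writes": ["mean"]},
    {"source": "print(label)", "reads": ["label"], "writes": []},
]

cleaned = clean_script(records, "mean")
for line in cleaned:
    print(line)
```

In this example the statements involving label are dropped because the value of mean does not depend on them. A real tool would also need to handle side effects such as file output, which this sketch ignores.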