The Environmental Data Initiative (EDI) is a data curation center and data archive for the ecological research community. Its goal is to accelerate the curation, archiving, and dissemination of environmental data. EDI implements the “FAIR” principles (Findable, Accessible, Interoperable, Reusable) by providing a registered and trustworthy data repository for long-term preservation and data integrity, and by documenting data with rich science metadata to enable discovery, access, reuse, and integration. The EDI data repository has been in operation for over five years and builds on almost 40 years of data management experience gained in the NSF Long-Term Ecological Research (LTER) Program. It currently houses over 42,000 data sets from LTER, biological field stations, and many other NSF-funded projects. Although 42,000 data sets is a large number, their combined size of about 9 terabytes is not, and individual data sets range from bytes to megabytes and only rarely reach gigabytes. Hence, the ‘big data’ challenges EDI faces stem not from volume but from high variability in data structure, spatial and temporal scales, sampling methods, parameters measured, and metadata semantics. For example, a simple search for ‘biodiversity’ returns well over 500 data sets, each unique in its sampling methods, data structure, and parameter naming.
Results/Conclusions
This problem is not unique to EDI’s collection of data sets; it characterizes ‘long-tail’ ecological research in general, with only a few exceptions where sampling, data, and metadata are highly standardized. Despite these problems, the data are extremely valuable, and synthesis research has successfully used them to answer important large-scale questions. In this problem space, EDI is currently exploring approaches to further accelerate scientific inquiry through data-management-driven data harmonization. Based on experience and input from synthesis scientists, we propose a distributed model in which data sets are only reformatted, without any aggregation or other manipulation based on a preconceived research question. We will report on the first such project, which harmonizes long-term community observation data sets across LTER sites, and discuss considerations for choosing this approach, its advantages and disadvantages, and tools developed by EDI to support synthesis scientists.
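To make "reformatted without any aggregation" concrete, the following minimal Python sketch reshapes a wide community table (one column per taxon) into a long table with one record per observation. It is an illustration only, not EDI's tooling or its harmonized format: the function `wide_to_long`, the column names (`site`, `date`, `taxon`, `value`), and the sample records are all hypothetical.

```python
from typing import Any, Dict, List


def wide_to_long(
    rows: List[Dict[str, Any]],
    id_columns: List[str],
    variable_name: str = "taxon",   # hypothetical output column name
    value_name: str = "value",      # hypothetical output column name
) -> List[Dict[str, Any]]:
    """Reshape wide rows into long form, one record per measurement.

    Every original value is carried over unchanged: nothing is summed,
    averaged, filtered, or dropped (zeros are kept as recorded).
    """
    long_rows = []
    for row in rows:
        # Identifying fields are repeated on every output record.
        ids = {col: row[col] for col in id_columns}
        for col, val in row.items():
            if col in id_columns:
                continue
            long_rows.append({**ids, variable_name: col, value_name: val})
    return long_rows


# Hypothetical wide-format source table: one column per taxon.
wide = [
    {"site": "A", "date": "2001-06-01", "Poa pratensis": 3, "Carex sp.": 0},
    {"site": "B", "date": "2001-06-01", "Poa pratensis": 1, "Carex sp.": 5},
]

long_table = wide_to_long(wide, id_columns=["site", "date"])
for record in long_table:
    print(record)
```

Because the reshaping is one-to-one, a synthesis scientist can still choose any aggregation later (by site, by year, by taxon), which is the point of not building a research question into the harmonized product.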