Annotating metadata to improve data discovery and reuse

O'Brien, Margaret

Background/Question/Methods

A semantic annotation is the attachment of semantic metadata to a resource such as a dataset. It provides precise definitions of concepts and clarifies the relationships between concepts in a machine-readable way, making datasets easier to discover and reuse. For example, if one dataset is annotated as being about "carbon dioxide flux" and another is annotated with "CO2 flux", an information system can recognize that the two datasets are about equivalent concepts. Similarly, if you search for datasets about "litter" (as in "plant litter"), a semantic system can disambiguate the term from the other meanings of "litter" (garbage, a group of animals born at the same time, etc.). Ecological Metadata Language (EML) version 2.2, released in 2019, can hold semantic statements as annotations to datasets, e.g., to describe characteristics of a dataset such as the biome where the research took place, or to link columns of data to external dictionaries of measurements.
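The equivalence example above can be sketched in a few lines of Python. This is only an illustration, assuming an annotation is stored as a labeled property URI plus a labeled value URI, in the style of an EML 2.2 annotation element. The URIs and the helper names make_annotation and same_concept are hypothetical placeholders, not terms from any published ontology or functions from any EDI tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder URIs; a real annotation would point at terms
# in a published ontology.
PROPERTY_URI = "https://example.org/property/containsMeasurementsOfType"
CO2_FLUX_URI = "https://example.org/concept/carbon_dioxide_flux"


def make_annotation(property_label, property_uri, value_label, value_uri):
    """Build an <annotation> element with a labeled property and value."""
    annotation = ET.Element("annotation")
    prop = ET.SubElement(annotation, "propertyURI", {"label": property_label})
    prop.text = property_uri
    value = ET.SubElement(annotation, "valueURI", {"label": value_label})
    value.text = value_uri
    return annotation


def same_concept(a, b):
    """Annotations match when their value URIs match, whatever the labels."""
    return a.findtext("valueURI") == b.findtext("valueURI")


# Two datasets described with different human-readable labels
# but annotated with the same concept URI.
first = make_annotation("contains measurements of type", PROPERTY_URI,
                        "carbon dioxide flux", CO2_FLUX_URI)
second = make_annotation("contains measurements of type", PROPERTY_URI,
                         "CO2 flux", CO2_FLUX_URI)

print(ET.tostring(first, encoding="unicode"))
print(same_concept(first, second))
```

Comparing URIs rather than labels is what lets a search system treat "carbon dioxide flux" and "CO2 flux" as the same concept. The labels remain available for display to people.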

The Environmental Data Initiative (EDI) data repository ingests metadata as EML and encourages data contributors to semantically annotate the metadata they submit. EDI must therefore be prepared to advise contributors on the process and the tools needed. In addition, researchers have many vocabularies and ontologies to choose from, and EDI needs to evaluate their suitability for different scientific domains and levels of ecological complexity. This is a major undertaking, so EDI will initially examine semantic resources for use with several types of datasets from the LTER network. We will focus on using existing ontologies, e.g., the Environment Ontology (EnvO) for concepts representing ecosystems or habitats; a domain-focused ontology such as Chemical Entities of Biological Interest (ChEBI) for annotating chemical species; and the Ecosystem Ontology (ECSO) for measurements.

Results/Conclusions

Here we report on our experiences semantically annotating datasets from Long Term Ecological Research (LTER) sites. Drawing on what we learned from annotating EML metadata ourselves, we develop instructions that will help others undertake this process. We identify useful semantic resources and assess how well they capture both the higher-level dataset concepts and the attribute-specific concepts in the target datasets. We conclude with some initial recommendations for the ecological community about using semantic resources, and share our perspective on the criteria researchers should use to select among the semantic resources that apply to their particular ecological domain.

PS 48 Abstract - Annotating metadata to improve data discovery and reuse