Dom Fripp is senior metadata curation developer at Jisc and part of the UKRDDS team. Dom has written the following post to describe the core metadata profile developed for the Discovery Service.
By way of introducing myself and my role within Jisc, I’d like to cover some of the work I’ve been doing over the last few weeks in developing a core metadata schema that underpins the UK Research Data Discovery Service.
The early phase of this project focussed on gathering use cases and requirements from our participating pilot universities and data centres. Having a clear set of prioritised requirements has helped ensure that we can develop a Discovery Service that meets the needs of our users.
The early metadata work in the project, arising from Phase 1 and further developed by the Technical and Metadata advisory group in this second phase, looked at appropriate technical standards for the repositories and the service to work together coherently.
Once the selection of CKAN had been made as the solution for the portal, it was essential to draft a metadata schema that fulfilled the following criteria:
- Meets user requirements (based on the evidence provided by use cases and correlation with common fields in use across the research data domain).
- Simple enough (in core form) to map onto the CKAN instance that underpins the portal.
- Based on existing schemas (and good practice) in use across the international research data discovery domain.
- Flexible enough to develop along with new service and user needs.
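To illustrate the second criterion, here is a minimal sketch of how a core record might map onto a CKAN dataset dictionary. The CKAN keys (`name`, `title`, `notes`, `license_id`, `tags`, `extras`) are standard CKAN dataset fields, but the function `to_ckan_package`, the record keys and the choice of which fields go into `extras` are assumptions made for illustration, not the project's actual mapping.

```python
import re


def to_ckan_package(record: dict) -> dict:
    """Map a hypothetical core metadata record onto a CKAN dataset dict."""
    # CKAN dataset names are URL-friendly slugs, so derive one from the title.
    slug = re.sub(r"[^a-z0-9_-]+", "-", record["title"].lower()).strip("-")
    # Fields assumed (for this sketch only) to have a direct CKAN equivalent.
    core = {"title", "description", "license", "keywords"}
    return {
        "name": slug,
        "title": record["title"],
        "notes": record.get("description", ""),
        "license_id": record.get("license", ""),
        "tags": [{"name": keyword} for keyword in record.get("keywords", [])],
        # Anything without a direct equivalent is carried as a key/value extra.
        "extras": [
            {"key": key, "value": str(value)}
            for key, value in sorted(record.items())
            if key not in core
        ],
    }


# Hypothetical record; the values are invented for illustration.
example = {
    "title": "Sea Surface Temperatures 2015",
    "description": "Example description.",
    "license": "cc-by",
    "keywords": ["ocean"],
    "funder": "Example Funder",
}
package = to_ckan_package(example)
```

Keeping the core small means most fields land on a native CKAN key, and only the richer fields need the `extras` mechanism.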
Metadata fields arising from user requirements
To assess which metadata fields were required, the user requirements were examined systematically and in the context of commonly used fields (here’s where some cataloguer background experience really helped). This assessment drew on a study of various related schemas for research data discovery, and on experience of researcher needs and of running a research data repository.
Data relating to the mandatory fields was generated using some content analysis functionality in OpenRefine on a variety of schemas sourced from research data aggregators around the world.
The fields and user requirement (UR) information shown are specific cases where the requirement demands a particular field to fulfil it. There are several URs that demand multiple fields and a rich metadata schema, but these have not been quantified.
related objects | Geographical coordinates | subject | Publisher |
Unique Resource Identifier | Description | Funder | format |
Creator | keywords | Organisation | license |
Table 1: Fields arising from user requirements
Related objects is a place-holder field by which the group can decide on a package of fields that can best facilitate the linkage between associated records, datasets and outputs as documented in the user requirements. In further discussion about the schema it was decided to adopt a similar approach to Datacite, whose schema permits the addition.
This was contextualised with the broader Dublin Core approach to rights, which permits free text to be added, if specific identifiers are unknown or unavailable.
Evaluating existing schemas
An integral part of the metadata project work was to assess schemas in use in other research data projects around the world, looking for evidence of community and best practice in creating a schema. The objective of this work was not to invent something new, but to incorporate well-established metadata practices from the burgeoning research data discovery movement. Parallel to this, richer and more mature schemas were also analysed for commonalities, standardisations and suitable vocabularies.
Eight metadata schemas were assessed for mandatory fields. These were Datacite, EU Data Portal, ANDS, EUDAT (B2FIND), ETSIN, INSPIRE, ReCollect and DDI Lite.
The first task was to prepare the schemas for comparison, which meant evaluating same-titled fields to ensure that their meaning and function were equivalent. The fields were then clustered using the k-nearest neighbour algorithms in OpenRefine.
These algorithms were applied to find common terms between the eight lists of mandatory metadata fields. The logic was to establish which mandatory fields were most used across the schemas and so demonstrate their statistical importance in the schema landscape, rather than the specifics of any local schema solution.
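The clustering and counting step can be sketched roughly as follows. This is not OpenRefine's implementation: it swaps in a simple greedy grouping based on Levenshtein edit distance, in the spirit of OpenRefine's nearest-neighbour clustering with a radius. The schema names and field lists are placeholders, and the function names and the `radius` parameter are assumptions for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalise(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


def cluster_terms(terms, radius=1):
    """Greedily group terms lying within `radius` edits of a cluster member."""
    clusters = []
    for term in sorted(set(terms)):
        for cluster in clusters:
            if any(levenshtein(term, member) <= radius for member in cluster):
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


def count_across_schemas(schemas, radius=1):
    """Count how many schemas contain at least one term from each cluster."""
    normalised = {
        name: {normalise(field) for field in fields}
        for name, fields in schemas.items()
    }
    all_terms = [term for fields in normalised.values() for term in fields]
    counts = Counter()
    for cluster in cluster_terms(all_terms, radius):
        # The alphabetically first member serves as the cluster label.
        counts[cluster[0]] = sum(
            1 for fields in normalised.values() if fields & set(cluster)
        )
    return counts


# Placeholder schemas; these are not the eight schemas assessed in the project.
schemas = {
    "schema_a": ["Title", "Creator", "Licence"],
    "schema_b": ["title", "License", "Subject"],
    "schema_c": ["Titles", "Subject "],
}
counts = count_across_schemas(schemas)
```

A small radius can still merge unrelated short terms, which is one reason to check that fields are semantically equivalent before clustering.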
type | 8 | language | 2 | 1 | |
name | 5 | owner | 2 | group | 1 |
title | 5 | registry objects | 2 | notes | 1 |
description | 4 | Rights | 2 | Organisation | 1 |
location | 4 | Subject | 2 | originating source | 1 |
related objects | 4 | abstract | 1 | Publication status | 1 |
Keywords | 3 | agent | 1 | PublicationYear | 1 |
License | 3 | collection type | 1 | Publisher | 1 |
access | 2 | content | 1 | terms of use | 1 |
Contact | 2 | Creator | 1 | url | 1 |
Identifier | 2 | Divisions | 1 | version | 1 |
Table 2: Mandatory fields across schema
Any field that occurred more than once was incorporated into the schema. Fields with a count of 1 that corresponded to fields established in the user requirements had already been included on that basis. This selection process was based on popularity only; no judgement was made about the usefulness of fields. Any field made mandatory in a research data related metadata schema was considered an important indicator of best practice.
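The selection rule can be expressed as a short sketch. The counts below are a subset taken from Table 2 and the user requirement fields a subset from Table 1, with names lowercased for comparison. The function `select_fields` and its `threshold` parameter are hypothetical names introduced for illustration.

```python
# Subset of Table 2: field name -> number of schemas that make it mandatory.
mandatory_counts = {
    "type": 8,
    "name": 5,
    "keywords": 3,
    "creator": 1,
    "publisher": 1,
    "version": 1,
}

# Subset of Table 1: fields arising from user requirements.
user_requirement_fields = {"creator", "publisher", "keywords", "description"}


def select_fields(counts, required, threshold=2):
    """Keep fields mandatory in at least `threshold` schemas, plus any
    single-occurrence fields already backed by a user requirement."""
    selected = {}
    for field, n in counts.items():
        if n >= threshold:
            selected[field] = "frequency"
        elif field in required:
            selected[field] = "user requirement"
    return selected


selected = select_fields(mandatory_counts, user_requirement_fields)
```

In this subset, `version` is mandatory in only one schema and has no matching user requirement, so it is the only field left out.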
Draft schema
When all of this work was complete, it was no surprise (and in itself, a strong clue about emergent practice) that the schema looked a lot like Datacite. There was also a great deal of overlap with Dublin Core, as one might expect given the objective of reusing fields emerging from good practice. This ranged from the adoption of the core fields required for citation to the burgeoning area of interoperable IDs such as ORCiD, Crossref and ORGiD. Datacite and Dublin Core offered up neat controlled vocabularies for some fields that were already in use and would be retained.
It was a first draft and needed refinement, so the schema was shared for evaluation and revision over a period of ten days. This culminated in a workshop where those on the Technical and Metadata group discussed the various questions and comments raised on the document. Fortunately, these discussions were so fruitful that consensus was reached on almost all the issues. For example, “Funder” had originally been generated from user requirements, but it was agreed that the form of this field should be taken from the Rioxx Open Access application profile, which, adapted from Dublin Core, recommends a “Project” field with associated fields for funder IDs. This adoption makes sense as it collects pre-existing metadata and, as such, adheres to the golden rule of minimising repetition.
Next steps
It is important to state that this document will be version controlled and subject to revision. While it is important to reach consensus and implement a standard that can be mapped to and from, and that can inform use, it is also vital to move with international best practice and ensure that the schema remains aligned with its sources.
This work includes close evaluation of new versions of key schemas such as Datacite (version 4) and GEMINI (the geographical schema used by NERC repositories). There is also a requirement to look for good metadata practice at an international level and ensure the bedrock of the UKRDDS schema is well aligned with projects such as EUDAT.
In addition to this consistency and coherency, it is hoped that good practice around identifiers will increase. The use of identifiers is an essential part of making research systems interoperable and minimising the need to duplicate information. This is an area of development that will be explored in a future post.
The first release of the core metadata schema, including information on content and mapping, is available as a shared document – UK Research Data Discovery Service Core metadata schema Version 1.0.