Introduction
The following post is a report from the first webinar (held on 27 April 2017) for the third phase of the Research Data Discovery Service project. The aims of the webinar were to welcome new participants, provide an update on the project, introduce the new beta version (http://researchdiscoveryservice.jisc.ac.uk), highlight progress and review requirements from phases 2 and 3.
Note: slide numbers are shown in red to show how the text corresponds with the following presentation.
Welcome and introductions
All new and existing participants were welcomed to the webinar (Slide 3). Participants from phase 2 were thanked for agreeing to continue to be part of the project. Some of the content from previous webinars and workshops is repeated for the benefit of new participants. This is the first in a series of webinars. Future webinars will provide project updates and encourage open discussion. Face-to-face workshops are still planned during the project, but only where there is a need and they benefit both the project and participants.
The project team for phase 3 were introduced and all contributed to the webinar. They are as follows:
- Christopher Brown – Project Manager
- Catherine Grout – Project Director
- Dom Fripp – Metadata Developer
- Ade Stevenson – Technical Innovations Coordinator
- Mark Winterbottom – Technical Developer
In phase 2, 9 HEIs and 6 Data Centres were funded to participate in the project, and a further 5 HEIs volunteered later in the project. Since the project was publicised in phase 3 with a call for further volunteers, a further 9 HEIs and two organisations have volunteered (Slide 4/5). The aim is to include all HEIs in the UK with a research data collection; this will include all Shared Service pilots and IRUSdataUK pilots too.
Project update and overview
The project (Slide 6) is developing a platform that enables the discovery of research data from across UK higher education institutions and data centres, which will bring a number of benefits to these organisations (Slide 7). These benefits (Slide 8) include increased visibility and transparency of research data. The project has been through a number of phases. Following the initial pilot (Slide 9), phase 2 funded a number of participants from HEIs and Data Centres to provide metadata for harvesting and work with the project to determine the requirements for a discovery service. There were a number of outputs from phase 2 (Slide 10), including the alpha test system with data harvested from participating HEIs and Data Centres. In phase 3 (Slide 11) the project will move from a test system to a production-ready, tested service, include metadata harvested from more data sources and implement further requirements. A beta version of the service is now available (http://researchdiscoveryservice.jisc.ac.uk). This will be used as the basis for further development and will include a complete re-harvest from all data sources.
So far, within phase 3, the focus has been on promoting the project to expand the number of participants and a lot of technical work has been going on behind the scenes (Slide 12). The following technical update summarises the work that’s been going on to produce this latest beta version:
- Mark has been back on the project for 2 months and working on improving infrastructure.
- Alpha site was running on a single server, which worked fine for showing the concept but had a few issues:
  - More than 8 services squeezed onto one server.
  - Single point of failure.
  - Disk was filling up with logs and data.
  - Harvesting process was taking a long time.
  - Database continued to grow without regular backups.
  - Manual process for deploying changes (slow and painful to push new updates).
- Needed a solution that was secure, scalable, reliable and backed-up.
- Decided to split the service up into containers using Docker:
  - Can spread the services across multiple servers.
  - Can expand services when doing heavy processing like harvesting and resource scanning.
  - Can shrink resources when not running process-intensive services.
- Implemented Continuous Integration:
  - Automate the process of pushing new versions to live.
  - Automate unit testing.
  - Improve speed at which we can iterate through bugs and features.
- Make use of AWS hosted services such as RDS:
  - More stable, optimised DB with automated backups.
  - Offloads database maintenance to Amazon.
- Since returning, Mark has been working on configuring infrastructure and creating container apps.
- Next steps:
  - Still need to add each organisation and harvest data.
  - Work through bug and feature tickets.
  - Add new HEIs.
  - Continue with dev process from phase 2, where we work in 2 week sprints and bi-weekly updates are sent to the email distribution list.
System status – Review of latest updates to the service
All organisations from phase 2 will have their metadata re-harvested to the new beta system (Slide 13). This includes the volunteer HEIs. Once this is complete we will start harvesting new participants from phase 3. The endpoints for all participants are listed in a Google Doc (http://bit.ly/RDDS3_harvest_status). This includes the current status for harvesting from each endpoint. The objective is to have all these working as soon as possible. When there is an issue, the JIRA ticket listed will provide the relevant details. All participants are included in this document. The new participants are currently in the “backlog” and will be added ASAP. The tickets will be set to “Done” (closed) when complete. Further issues will result in new tickets or tickets could be reopened.
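As a rough illustration of what harvesting from one endpoint involves, the sketch below assumes the endpoint speaks OAI-PMH and exposes Dublin Core (`oai_dc`) records. The function names, the example base URL and the sample response are hypothetical and are not taken from the project's own harvester; it only shows how a ListRecords request is formed and how a response page might be read.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard namespaces for OAI-PMH responses and Dublin Core elements.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint (hypothetical helper)."""
    if resumption_token:
        # In OAI-PMH a resumptionToken is sent on its own, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from one page of a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", OAI_NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=OAI_NS
        )
        titles = [t.text for t in record.iterfind(".//dc:title", OAI_NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    # An empty or missing token means there are no further pages.
    return records, (token.strip() or None)


# Made-up response used only to exercise the parser; not real project data.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example dataset</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A real harvest would fetch each URL over HTTP and keep requesting pages until no resumption token is returned; the network step is left out here so the sketch stays self-contained.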
Requirements (Slide 14) are listed and tracked using JIRA (https://jiscdev.atlassian.net/projects/RDD/). The categories of requirements were defined early in phase 2 after requirements gathering at the first workshop. User stories were collected and MoSCoW prioritisation (Slide 15) was used. Requirements were extracted from these user stories and from the HEI/Data Centres requirements reports (Slide 16). Following the latest re-harvesting, the focus will be on reviewing the prioritisation of existing and new requirements and implementing them in two week sprints. These will be implemented on the beta site with an email going out to the project mailing list showing what requirements have been implemented. The current issues are harvesting and metadata mapping (Slide 17) and we’ll look at other issues once these have been resolved.
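To make the MoSCoW step concrete, here is a minimal sketch of ordering requirements by category and picking a sprint's worth of work. The requirement identifiers, summaries and the sprint capacity are invented for illustration and do not correspond to actual JIRA tickets or to how the project team plans sprints.

```python
# MoSCoW categories in descending priority.
MOSCOW_ORDER = {"Must": 0, "Should": 1, "Could": 2, "Won't": 3}


def prioritise(requirements):
    """Sort requirements by MoSCoW category, keeping input order within a category."""
    # sorted() is stable, so requirements in the same category keep their order.
    return sorted(requirements, key=lambda r: MOSCOW_ORDER[r["priority"]])


def plan_sprint(requirements, capacity):
    """Pick up to `capacity` requirements for a sprint, skipping anything marked Won't."""
    candidates = [r for r in prioritise(requirements) if r["priority"] != "Won't"]
    return candidates[:capacity]


# Hypothetical requirements, for illustration only.
EXAMPLE_REQUIREMENTS = [
    {"id": "REQ-A", "summary": "Faceted search", "priority": "Should"},
    {"id": "REQ-B", "summary": "Harvest all endpoints", "priority": "Must"},
    {"id": "REQ-C", "summary": "Custom themes", "priority": "Won't"},
    {"id": "REQ-D", "summary": "Map metadata fields", "priority": "Must"},
    {"id": "REQ-E", "summary": "Export citations", "priority": "Could"},
]

if __name__ == "__main__":
    for req in plan_sprint(EXAMPLE_REQUIREMENTS, capacity=3):
        print(req["id"], req["priority"], req["summary"])
```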
Metadata
The two key metadata aims for phase 3 concern the quality of the harvested metadata and its representation within the CKAN client (Slide 18).
At the end of phase 2, we launched a vote on which metadata fields in the application profile would be of most benefit to a user of the service. The results of this vote are important for two reasons. Firstly, they give a broad consensus on which fields are considered most important for discovery and what the minimal metadata for a record should contain.
Secondly, the vote can be used to order the metadata on screen so that a user sees the most important metadata first. This can help simplify a record at the point of discovery (good UX), enable accurate citation and, hopefully, encourage users to click through to the original repository record, which is desirable when the source holds additional metadata that might be of use.
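The idea of using vote results to order fields on screen can be sketched as follows. The field names and vote counts are invented for illustration and are not the actual poll results; the function is an assumption about one simple way to do the ordering, not a description of how the CKAN client implements it.

```python
def order_fields(record, votes):
    """Return (field, value) pairs with the most-voted fields first.

    Fields with equal votes, or with no votes at all, keep the order they
    had in the original record because sorted() is stable.
    """
    return sorted(record.items(), key=lambda item: -votes.get(item[0], 0))


# Hypothetical record and vote counts, for illustration only.
EXAMPLE_RECORD = {
    "publisher": "Example University",
    "title": "Example dataset",
    "creator": "A. Researcher",
    "licence": "CC BY 4.0",
}
EXAMPLE_VOTES = {"title": 12, "creator": 9, "licence": 9}

if __name__ == "__main__":
    for field, value in order_fields(EXAMPLE_RECORD, EXAMPLE_VOTES):
        print(f"{field}: {value}")
```

With these made-up numbers the title is shown first, followed by creator and licence (tied, so kept in record order), with the unvoted publisher field last.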
In addition, the University of Glasgow will be developing clear information and guidance for service users about the complex area of dataset rights and licences. This will broadly follow recent work in the cultural heritage sector to solve a similar problem (see http://rightsstatements.org/en/).
There has also been discussion with CORE (https://core.ac.uk/) to compare the services and look at potential ways of working together, especially in connecting data to papers.
The poll (http://www.tricider.com/brainstorming/2mnbqfgcOJp), on which metadata fields participants think should appear at the top of the record, was reopened following the webinar to allow new participants to have the opportunity to vote. For further information, see the core metadata schema (https://goo.gl/vWCX0z) and the UKRDDS metadata profile mapping document (https://docs.google.com/spreadsheets/d/1mjatKZKdhp_tFm6xnYJFpBgPLMNDdAue9FGy-oKFBYk/edit?usp=sharing).
Phase 3 (next steps)
The next steps for phase 3 (Slide 19) include the implementation of requirements, listed in JIRA (Slide 20), via prioritisation and development sprints. The work still required (Slide 21) includes the following:
What are the aspirations for the future service (Slide 22)? The Discovery Service fits within the umbrella of the Research Data Shared Services project (Slide 23), which, under Research @ Risk, is developing a shared service (provided by Jisc) for effective Research Data Management. This offers a number of benefits:
- Cost savings and efficiencies
- Common approaches and practice
- Research system standardisation and interoperability
The discovery service fits within this as a national aggregation service. We will be looking at integrating with the shared service further into phase 3. The “caterpillar” diagram (Slide 24) shows Jisc’s R&D process. Following the discovery and alpha stages, we’re now in the beta stage. The next step is to deliver this as a service. This is most likely to involve the Discovery Service being established (Slide 25) within the Jisc Digital Resources directorate’s set of services (https://www.jisc.ac.uk/content). This will involve consideration of a number of areas including:
- Establishing a service team and how this fits with Research Data Discovery Service activities
- CKAN production specific installs – e.g. sandbox or user acceptance test machines
- Ongoing OAI and other endpoint harvesting configuration, management and documentation
- Setup and ongoing management of admin and Discovery Service user accounts, updates, patches, and security (firewall, intrusion detection, DDOS etc.)
- Various other system admin tasks such as backup, disaster recovery, log config and rotation, DNS, proxying, caching, mail routing, system performance testing, system monitoring
- Set up of any required service supporting applications, e.g. wiki for documentation, blogs etc.
- Dealing with ongoing developments including necessary developments in response to essential new requirements or ongoing service enhancements
- Community building for use of the service
- Training / Workshops
- Promotional events and social media.
An essential part of the project is ensuring that participants provide feedback on how the system is developing, confirm that the requirements are implemented and check their metadata (Slide 26). In phase 2 there were a number of advisory groups set up to support the project. Originally, there was going to be one advisory group in phase 3, but so far there hasn’t been a need as all communication is shared via the mailing list. However, we will set up groups as required, especially when we need a more focussed discussion on areas such as technical development or metadata. The JISC-UKRDDS mailing list will continue as the main communication outlet and there will be further webinars to update everyone on progress. Workshops will be held as required for feedback and face-to-face discussions.
Some useful links to support the project (Slide 27) include:
Questions
Comments were made during the webinar and these were followed up via email by participants. However, a number of questions were asked and these are collated here.
What are the plans for working with Pure/Elsevier and new Pure API (v5.9), due out in June 2017?
There have been ongoing discussions with Elsevier as part of this project and the Shared Services. The service did work with a previous version of Pure via OAI-PMH. We will endeavour to use the new functionality within v5.9 to harvest into the Discovery Service.
I note the service is still linking directly to individual files. As ever, still don’t think this appropriate! Are we retaining this model?
We will look into this functionality to see if it can be improved once the harvesting and mapping work is complete. We want the system to be as easy-to-use as possible and this includes accessing the underlying data. We’ve also been looking at how other data portals work, particularly those built using CKAN.
Do we know when next RD Shared Service pilots’ day is?
This is still to be determined but the Shared Service project will contact all the pilots to let them know.
Closing comments
Participants were thanked for joining the webinar and contributing to the project. They were reminded that progress updates will be sent to the mailing list and were encouraged to engage actively with the project. There will be a demo of the Discovery Service at the next Research Data Network event, which participants were also encouraged to attend. This event will be held at the Ron Cooke Hub, University of York on 27/28 June 2017 (Slide 28). You can register at https://www.jisc.ac.uk/events/research-data-network-27-jun-2017. The programme is available online at https://research-data-network.readme.io/docs/4th-research-data-network-york-university.