1. INTRODUCTION
Data is an essential research asset. The 15 high-level FAIR data principles support informed reuse of data by enabling the Findability, Accessibility, Interoperability, and Reusability of digital resources (). The FAIRsFAIR project contributes to the uptake of the FAIR data principles into the European Open Science Cloud (EOSC) by developing practical solutions (e.g., expertise, recommendations, training and tools) that facilitate the application of the principles throughout the research data life cycle. The European Commission Expert Group on FAIR Data recommended that assessment metrics elaborating the FAIR principles, and tools implementing those metrics, be developed and piloted to facilitate the assessment of research data FAIRness by humans and machines (). In response to these recommendations, several groups have proposed assessment metrics to evaluate the implementation of the principles, notably the work undertaken by the FAIR Data Maturity Model Working Group ().
Current FAIR data assessment work addresses ‘what’ can be evaluated through metrics. A gap remains in ‘how’ these metrics can be tested in practice. The RDA FAIR Data Maturity Model Working Group notes that “the exact way to evaluate data based on the core criteria is up to the owners of the evaluation approaches, taking into account the requirements of their community” (). The FAIRsFAIR project is implementing and testing FAIR data assessment metrics with several FAIR stakeholders following an iterative, use case-driven approach. The work presented in this paper started with the conceptualization of a set of metrics and is now moving to building pilots to support FAIR assessments of data objects from selected Trustworthy Digital Repositories (TDRs) that are FAIR-aligned, in particular those that are CoreTrustSeal certified (). The FAIR principles may be applied to any digital object. We are concerned with a subset of digital objects: research data (referred to as ‘data objects’ in this paper), i.e., data collected, measured, or created for the purposes of scientific analysis.
Following an overview of related work (section 2), this paper presents:
- A range of scenarios offering insights into FAIR assessment at different stages of the data life cycle, and two ongoing priority use cases (section 3).
- A minimum set of core metrics for the FAIR assessment of research data, building on existing work, including RDA outputs, and evaluated and refined through several iterations; experiences in adopting this work are discussed (section 4).
- Tools (FAIR-Aware and F-UJI) that apply the metrics in the selected use cases and the results of the evaluation carried out with FAIR stakeholders (section 5).
The conclusion addresses lessons learned and future work.
2. RELATED WORK
The metrics proposed in this paper were developed based on work described below.
The RDA FAIR Data Maturity Model WG developed a set of indicators with maturity levels, primarily intended to provide input to implementers of evaluation tools for measuring data FAIRness. This work focuses on ‘what’ should be evaluated and does not aim to elaborate ‘how’ the indicators could be evaluated in practice. FAIRsFAIR adopted this RDA recommendation and built the FAIRsFAIR metrics on these WG indicators. Further improvements were made to adjust the indicators to the requirements of the use cases and to define practical tests against the metrics (more details are provided in section 4).
The WDS/RDA Assessment of Data Fitness for Use WG developed criteria that cover the FAIR principles as well as data quality and data curation aspects, which are intended to serve as ‘add-ons’ to the CoreTrustSeal Repository Certification requirements. The WG prototyped an online questionnaire () for reviewers to assess data against the criteria manually. We compared these criteria and their mapping against the CoreTrustSeal requirements when developing the object metrics.
Data Archiving and Networked Services (DANS) developed two prototypes to demonstrate the assessment of data FAIRness by different stakeholders. FAIRdat is aimed at data reviewers, whereas FAIR enough? addresses researchers with less data experience, focusing on increasing their understanding of what FAIR data means. The experiences and feedback gathered on the FAIRdat tool and the FAIR enough? checklist were used as input for the FAIR-Aware self-assessment tool (section 5.1).
3. FAIR DATA ASSESSMENT SCENARIOS
The FAIRness of a data object can be assessed manually, semi-automatically, or automatically at several stages across the research lifecycle (as shown in Figure 1). To better understand what needs to be considered when implementing FAIR assessments as suggested in Figure 1, we developed a set of scenarios (). Table 1 explores implementation scenarios: the motivations of the stakeholder groups involved in carrying out the assessment, the stage of the research lifecycle during which the assessment would occur, and the resources that would be needed to implement the described assessment. Additional scenarios may be identified through ongoing consultation with relevant stakeholders. Dotted lines in Figure 1 represent the use cases addressed by the project.
SCENARIO | SHORT DESCRIPTION | IMPLEMENTATION |
---|---|---|
1 | Researchers want to check their plans for producing FAIR data at the outset of their project as part of a data management plan (DMP) process and to periodically assess FAIRness over the life of their project through updating their DMP. They also want to check that selected data are as FAIR as possible before depositing the data in any repository for wider sharing (e.g., using FAIR-Aware as detailed in section 5.1). | The assessment can be implemented by providing manual checklists or automated assessment as part of e.g., data management planning tools and could involve Research Performing Organization, Funders and/or Publishers. |
2 | Data repositories and researchers want to make it easier to provide FAIR data during the deposit process and at the point of submission to the repository. | This can be implemented by either a manual, automatic or semi-automatic checklist tool tailored to a repository’s data curation practice or by implementing a repository feature to automatically check certain aspects as part of the deposit workflow. |
3 | Data repositories want to periodically re-assess the FAIRness of the datasets they hold (e.g., using F-UJI as detailed in section 5.2). | This would support an internal review of data service provision and can be implemented by an automated assessment tool for published datasets. |
4 | Additional stakeholders (e.g., funding bodies, publishers, and certification bodies) may want to monitor research data compliance and adjust their policies and requirements accordingly. | The assessment tool to address scenario 3 can be adapted and integrated with the stakeholder’s processes. |
4. DATA OBJECT ASSESSMENT METRICS
To systematically measure the extent to which research data objects are FAIR, we propose a set of 15 core metrics (v0.3) (see Table 2). Figure 2 illustrates the development stages of the metrics. The first release (v0.1) of the FAIRsFAIR candidate metrics was derived from the consolidation of the draft data maturity indicators proposed by the FAIR Data Maturity Model Working Group and prior work carried out by the project partners, such as the WDS/RDA Assessment of Data Fitness for Use checklist, FAIRdat, and FAIR enough?. A mapping of the FAIRsFAIR metrics to the criteria used in the above frameworks was developed to identify similarities and differences in their interpretation and representation. The comparison resulted in the first release of domain-agnostic core metrics, detailed in the report (). In the next release (), the project partners further refined the metrics, taking into account the scope and requirements of the primary use cases (section 5). The metrics were further improved based on focus group feedback, and descriptions were updated based on the FAIR data maturity model guidelines and specification (). A total of 33 FAIR stakeholders, including research communities, data service providers, standards bodies, and coordination fora, participated in the focus group activity between 1 May and 25 May 2020.
FAIRSFAIR OBJECT METRIC | RDA FAIR DATA MATURITY MODEL | ADOPTION AND IMPROVEMENT |
---|---|---|
FsF-F1-01D Data is assigned a globally unique identifier. | RDA-F1-02D Data is identified by a globally unique identifier | No changes to the indicator, but assessment details and related resources are specified. |
FsF-F1-02D Data is assigned a persistent identifier. | RDA-F1-01D Data is identified by a persistent identifier; RDA-A1-03D Data identifier resolves to a digital object | Merged two overlapping indicators on persistence and resolvability. |
FsF-F2-01M Metadata includes descriptive core elements (creator, title, data identifier, publisher, publication date, summary, and keywords) to support data findability. | RDA-F2-01M Rich metadata is provided to allow discovery | Refined the indicator by clarifying core metadata descriptors. |
FsF-F3-01M Metadata includes the identifier of the data it describes. | RDA-F3-01M Metadata includes the identifier for the data | No changes to the indicator, but its assessment verifies the identifiers of the data and data content. |
FsF-F4-01M Metadata is offered in such a way that it can be retrieved by machines. | RDA-F4-01M Metadata is offered in such a way that it can be harvested and indexed | Rephrased to avoid jargon and put emphasis on automated (machine-aided) retrieval. |
FsF-A1-01M Metadata contains access level and access conditions of the data. | RDA-A1-01M Metadata contains information to enable the user to get access to the data | Extended the assessment by distinguishing access conditions by different data types. |
FsF-A2-01M Metadata remains available, even if the data is no longer available. | RDA-A2-01M Metadata is guaranteed to remain available after data is no longer available | Narrowed down the scope of the assessment to deleted or replaced objects. On a practical level, this indicator applies to repository assessment as continued access to metadata depends on a data repository’s preservation practice. |
FsF-I1-01M Metadata is represented using a formal knowledge representation language. | RDA-I1-02M Metadata uses machine-understandable knowledge representation | No changes to the indicator, but assessment details and related resources are specified. |
FsF-I1-02M Metadata uses semantic resources. | RDA-I1-01M Metadata uses knowledge representation expressed in standardized format | Distinguished two types of semantic resources: those for modelling data (e.g., DCAT) and those for describing ‘contents’ (e.g., taxonomies). |
FsF-I3-01M Metadata includes links between the data and its related entities. | RDA-I3-01M Metadata includes references to other metadata; RDA-I3-02M Metadata includes references to other data; RDA-I3-02D Metadata includes references to other metadata; RDA-I3-04M Metadata includes qualified references to other data | Merged overlapping indicators as a data object may be linked to n-types of related entities. |
FsF-R1-01MD Metadata specifies the content of the data. | RDA-R1-01M Plurality of accurate and relevant attributes are provided to allow reuse | Addressed a specific aspect of metadata plurality: whether the contents of a dataset are specified in the metadata and accurately reflect the actual data deposited. |
FsF-R1.1-01M Metadata includes license information under which data can be reused. | RDA-R1.1-01M Metadata includes information about the licence under which the data can be reused; RDA-R1.1-02M Metadata refers to a standard reuse licence | Combined indicators. Standard and bespoke licenses are verified as part of the assessment. |
FsF-R1.2-01M Metadata includes provenance information about data creation or generation. | RDA-R1.2-01M Metadata includes provenance information according to community-specific standards | Refined by providing minimal metadata properties representing data provenance. |
FsF-R1.3-01M Metadata follows a standard recommended by the target research community of the data. | RDA-R1.3-01M Metadata complies with a community standard | Rephrased for clarity and to highlight the research community. |
FsF-R1.3-02D Data is available in a file format recommended by the target research community. | RDA-R1.3-01D Data complies with a community standard | Rephrased for clarity and extended the assessment to cover both open and future-proof file formats. |
4.1 METRICS SPECIFICATION
For a detailed specification of the metrics, see (). The specification follows the template modified from (). The specification covers both the ‘what’ (metrics) and ‘how’ aspects (assessment details). Each metric is aligned with the FAIR principles and the CoreTrustSeal requirements (). The mapping is critical as it indicates to what extent a CoreTrustSeal-certified repository can enable objects’ compliance with the FAIR principles as tested through the metrics. Each metric is identified following a standard naming convention. As shown in Figure 3, the identifier starts with the shortened form of the project’s name, followed by the related FAIR principle identifier and a local identifier. The last part of the identifier clarifies whether the metric will evaluate data or metadata.
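To make the convention concrete, the following minimal sketch (in Python) parses a metric identifier into its components as described above; the helper name and return structure are illustrative and not part of the specification.

```python
import re

# Illustrative parser for the FsF metric naming convention described above:
# project prefix, FAIR principle identifier, local identifier, and a suffix
# indicating whether data (D), metadata (M), or both (MD) are assessed.
METRIC_ID = re.compile(
    r"^FsF-"                               # shortened project name
    r"(?P<principle>[FAIR]\d(?:\.\d)?)-"   # FAIR principle, e.g. F1, R1.1
    r"(?P<local>\d{2})"                    # local identifier, e.g. 01
    r"(?P<target>MD|M|D)$"                 # assessed object: metadata, data, or both
)

def parse_metric_id(metric_id: str) -> dict:
    """Split a FAIRsFAIR metric identifier into its named components."""
    match = METRIC_ID.match(metric_id)
    if match is None:
        raise ValueError(f"Not a valid FsF metric identifier: {metric_id}")
    return match.groupdict()

print(parse_metric_id("FsF-F1-02D"))
# {'principle': 'F1', 'local': '02', 'target': 'D'}
print(parse_metric_id("FsF-R1.3-02D"))
# {'principle': 'R1.3', 'local': '02', 'target': 'D'}
```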
The metrics correspond to all or part of one or more FAIR principles with the following exceptions:
- A1.1, A1.2 (communication protocol). We added the metric ‘FsF-A1-01M’, which evaluates the inclusion of data access levels (e.g., public, restricted) and conditions in the metadata. We have defined metrics on standard communication protocols as part of a new version of the metrics (), which is currently under public consultation and will be finalized in the next release of the specification.
- I2 (FAIR vocabularies). The criteria for a FAIR vocabulary require further clarification before an assessment can be designed and implemented.
The following factors influence the assessment metrics:
- In a functioning FAIR ecosystem (), a FAIR assessment depends on context beyond the object itself and beyond the repository as primary curator. Software and services are also key operational dependencies which in turn depend on a range of registries and evaluation approaches. FAIR enabling services and repositories are vital to ensure that research data objects remain FAIR over time (preservation).
- FAIR-aligned repository certification clarifies that a comprehensive FAIRness assessment of digital objects (data and metadata) also requires business information management (e.g., policies, procedures, and workflows).
- Automated testing depends on clear, machine assessable criteria. Some aspects (rich, plurality, accurate, relevant) specified in the FAIR principles still require human mediation and interpretation.
- Until mechanisms for agreeing and managing domain/community-driven criteria such as schemas and usage elements are in place, the tests based on the metrics must focus on generally applicable data and metadata characteristics.
- We recognize that data quality elements (e.g., completeness, correctness, validity, ease of data use) are important for data reuse but are not within the scope of this work.
4.2 ADOPTION AND DISCUSSION
In Table 2, we present how the FAIRsFAIR object metrics (prefixed with ‘FsF’) are related to the RDA FAIR Data Maturity Model indicators, noting cases where improvements or amendments have been made with respect to the maturity model. For further comments on the indicators, see (). At present, the metrics primarily address the indicators classified by the RDA WG as ‘essential’, as well as a subset of the ‘important’ and ‘useful’ indicators. Ultimately, we strive to define metrics that cover all FAIR principles as explicitly as possible, addressing data and metadata, and the human and machine perspectives.
As part of the adoption process, we defined assessment details and related resources for each of the object metrics through which practical tests against the metrics can be implemented. To ensure that the proposed methods are transparent and can be further improved, assessment constraints and limitations are specified. The alignment of the object metrics with CoreTrustSeal requirements helps identify areas of overlap, which can be used to unify repository requirements and FAIR object assessment.
Selected indicators have been rephrased and reinterpreted. For example, following the focus group’s feedback, we rephrased the indicator ‘RDA-F4-01M’ to avoid technical jargon (i.e., ‘harvested and indexed’). We refined the indicator ‘RDA-F2-01M’ by proposing a minimum set of metadata properties required to enable data findability and citation as specified in existing guidelines, e.g., DataCite, Earth Science Information Partners (ESIP), International Association for Social Science Information Services & Technology (IASSIST), and EOSC Datasets Minimum Information (EDMI). The indicator ‘RDA-A1-01M’ is further extended by distinguishing access condition properties by access type (public, embargoed, restricted, and metadata-only) as part of the assessment.
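As a minimal sketch of how such a metadata completeness check could be implemented (assuming harvested metadata has already been mapped to generic element names; the function and field names are hypothetical, not the official ‘FsF-F2-01M’ test):

```python
# Core descriptive elements listed for FsF-F2-01M; empty values are treated
# as missing. The field names are a hypothetical generic mapping.
CORE_ELEMENTS = {
    "creator", "title", "object_identifier", "publisher",
    "publication_date", "summary", "keywords",
}

def check_core_metadata(metadata: dict) -> dict:
    """Report which core descriptive elements are present and non-empty."""
    present = {key for key, value in metadata.items()
               if key in CORE_ELEMENTS and value}
    return {
        "present": sorted(present),
        "missing": sorted(CORE_ELEMENTS - present),
        "passed": present == CORE_ELEMENTS,
    }

example = {
    "title": "Sea surface temperature at station X",
    "creator": "Doe, Jane",
    "publisher": "Example Data Centre",
    "publication_date": "2020-05-01",
    "object_identifier": "https://doi.org/10.1234/example",
    "summary": "",                 # empty, so the check fails
    "keywords": ["temperature", "ocean"],
}
print(check_core_metadata(example))   # passed: False, missing: ['summary']
```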
In some cases, we merged overlapping indicators. For instance, in agreement with current PID practice (), we consider PID resolution a core functionality of persistent identifiers. Therefore, we combined the indicators RDA-F1-01D and RDA-A1-03D into one metric, ‘FsF-F1-02D’. Inconsistencies in the permanent resolution of PIDs () as implemented by PID providers should be handled when implementing the metric. We merged the indicators addressing the I3 principle (qualified references to other (meta)data) into one metric (‘FsF-I3-01M’), which examines whether metadata includes the links (relations) between the data and its related entities. We do not prescribe a specific set of related entities, as a data object may be linked to n types of entities (e.g., a prior version, associated datasets, a scholarly article, a physical specimen, a funder, a repository, or a platform). The emphasis is on representing the links between data and its associated entities, expressed through relation types, preferably with persistent identifiers provided for the related entities.
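The sketch below illustrates one way a combined persistence-and-resolvability test in the spirit of ‘FsF-F1-02D’ could be implemented; the identifier scheme patterns and function name are assumptions for illustration, not the official tests.

```python
import re
import requests  # third-party HTTP client

# Recognize a few common persistent identifier schemes (illustrative patterns).
PID_PATTERNS = {
    "doi": re.compile(r"^https?://(dx\.)?doi\.org/10\.\d{4,9}/\S+$"),
    "handle": re.compile(r"^https?://hdl\.handle\.net/\S+$"),
}

def check_pid(identifier: str, timeout: int = 10) -> dict:
    """Check that an identifier uses a persistent scheme and resolves via HTTP."""
    scheme = next((name for name, pattern in PID_PATTERNS.items()
                   if pattern.match(identifier)), None)
    result = {"scheme": scheme, "resolves": False, "landing_page": None}
    if scheme is None:
        return result
    try:
        # PID resolvers typically redirect to a landing page; follow redirects.
        response = requests.head(identifier, allow_redirects=True, timeout=timeout)
        result["resolves"] = response.status_code < 400
        result["landing_page"] = response.url
    except requests.RequestException:
        pass  # network errors leave 'resolves' as False
    return result
```

In line with the resolution inconsistencies noted above, a production implementation would also need to fall back to GET requests for servers that reject HEAD and to accommodate content-negotiation differences across PID providers.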
We defined a new metric (FsF-R1-01MD) that establishes a data content description as essential for assessing data fitness for use (). This metric evaluates whether the dataset’s content is specified in the metadata and whether that description accurately reflects the actual data deposited. Data content properties are addressed as part of a plurality of metadata elements; therefore, we map this metric to its closest principle, R1 ((Meta)data are richly described with a plurality of accurate and relevant attributes).
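A rough sketch of such a content check, assuming file-level descriptors declared in the metadata can be compared against the properties of the deposited files (the record structure is hypothetical):

```python
# Compare declared data content descriptors with the files actually deposited.
# Each record is assumed to look like {"name": ..., "format": ...}.
def check_data_content(declared: list, observed: list) -> dict:
    declared_by_name = {item["name"]: item for item in declared}
    mismatches = []
    for actual in observed:
        spec = declared_by_name.get(actual["name"])
        if spec is None:
            mismatches.append(f"{actual['name']}: not described in the metadata")
        elif spec.get("format") != actual.get("format"):
            mismatches.append(f"{actual['name']}: declared format "
                              f"{spec.get('format')!r} differs from {actual.get('format')!r}")
    return {"passed": not mismatches, "mismatches": mismatches}
```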
5. FROM USE CASES TO APPLICATIONS
Table 3 includes two primary use cases of the project selected from the scenarios developed. Sections 5.1 and 5.2 present the two tools we developed to support FAIR assessment based on the FAIRsFAIR metrics, in support of the use cases.
USE CASE | ASSESSMENT SCENARIO (AS LISTED IN TABLE 1) | ASSESSMENT TOOL |
---|---|---|
Stakeholders (e.g., institutions, data service providers) offer a generic manual self-assessment tool to educate and raise awareness of researchers on making their data FAIR before publishing the data. | 1 | FAIR-Aware (section 5.1) |
A data service provider (e.g., data repository, data portal or registry) committed to FAIR data provision wants to programmatically measure datasets for their level of FAIRness over time. | 3 | F-UJI (section 5.2) |
5.1 RAISING AWARENESS OF FAIR DATA
The FAIR-Aware online self-assessment tool (Figure 4) aims at raising researchers’ awareness of the value of making data FAIR before depositing it into a repository. Making data FAIR is still an unclear process for many researchers across various disciplines. To help researchers bridge this knowledge gap, FAIR-Aware emphasizes educating and raising awareness of FAIR data rather than measuring the extent to which their datasets are FAIR. It promotes a practical understanding of the FAIR data principles and how they can increase data value and impact. The 10 assessment questions are derived from the FAIRsFAIR object metrics specification () and cover all aspects of the FAIR data principles. Information tips available for each question provide additional explanations and context with practical examples and guidance.
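To illustrate how a self-assessment question can be traced back to an object metric, the record below models one hypothetical questionnaire item; the wording of the question and tip is invented for illustration and does not reproduce the FAIR-Aware questionnaire.

```python
from dataclasses import dataclass

@dataclass
class AssessmentQuestion:
    metric_id: str      # FAIRsFAIR object metric the question is derived from
    question: str       # plain-language question shown to the researcher
    info_tip: str       # additional explanation with practical guidance
    answers: tuple = ("Yes", "Partially", "No", "I don't know")

example_question = AssessmentQuestion(
    metric_id="FsF-F1-02D",
    question="Will your dataset receive a persistent identifier (e.g., a DOI) "
             "once it is deposited?",
    info_tip="Most trustworthy repositories assign a DOI or Handle on deposit; "
             "check the repository's documentation before submitting your data.",
)
```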
Testing of the beta version was undertaken by 49 external volunteers representing several research community stakeholders (see Figure 5). The overall feedback is positive, with most respondents (60%) finding FAIR-Aware useful for raising awareness of the FAIR data principles. This finding highlights the value to the research community of making such support tools available. Practical information tips, clear guidance, and the accessible language used to explain often complex terms and definitions were all identified as strengths by the respondents. However, respondents reported difficulties with some questions relating to the use of semantic vocabularies in metadata, community-endorsed metadata, and provenance metadata. The project will address these shortcomings in the information tips and with additional guidance.
The development of FAIR-Aware is iterative; extensive feedback from the testing phase on both the metrics and the user interface is being incorporated into the next version. Suggestions include providing more discipline- and data-type-specific examples, elaborating on levels of data access and related legal obligations, and making FAIR-Aware available in other languages. The source code is available online and customizable to facilitate adoption by other repositories and as part of FAIRsFAIR engagement and training activities.
5.2 ASSESSING PUBLISHED DATASETS FROM TRUSTWORTHY REPOSITORIES
‘FAIR differs in that it describes concise, domain-independent, high-level principles that can be applied to a wide range of scholarly outputs’ (). An automated assessment of data object FAIRness needs to cover a broad range of disciplinary data offerings and consequently should focus on a rather small set of domain-agnostic best practices and standards that have evolved in recent years. We developed an automated assessment tool, F-UJI (), to pilot the FAIR assessment of published data objects from selected trustworthy data repositories based on the core metrics outlined in section 4.1. The tool (Figure 6) performs an assessment starting from a data object identifier (e.g., a persistent identifier (PID) or URL) and is based on existing Web standards and best practices endorsed by PID providers for research data. It utilizes several external resources to enable programmatic assessment of a data object, such as the re3data and DataCite APIs, the SPDX License List, the RDA Metadata Standards Catalog, and Linked Open Vocabularies (LOV). For comprehensive information on the tests implemented through the service, see ().
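As an illustration of the kind of test such a service can automate, the sketch below harvests schema.org JSON-LD metadata embedded in a dataset landing page and checks whether a license is declared; it is a simplified example written for this paper, not the F-UJI implementation, and the function names are hypothetical.

```python
import json

import requests                  # third-party HTTP client
from bs4 import BeautifulSoup    # third-party HTML parser (beautifulsoup4)

def harvest_jsonld(landing_page_url: str) -> list:
    """Collect JSON-LD blocks embedded in a dataset landing page."""
    html = requests.get(landing_page_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(tag.string or ""))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the assessment
    return blocks

def license_is_declared(jsonld_blocks: list) -> bool:
    """A check in the spirit of FsF-R1.1-01M: is any license URL or name present?"""
    return any(isinstance(block, dict) and block.get("license")
               for block in jsonld_blocks)
```

A fuller assessment would, in addition, match declared license strings against the SPDX License List and combine embedded metadata with records retrieved from the DataCite API, as outlined above.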
We first tested the service with 500 published data objects from two data repositories (PANGAEA and WDCC CERA). Based on the results of an iterative consultation process, we provided recommendations to the repositories for improving the FAIRness of the data objects. For example, Figure 7 shows the scores of 500 PANGAEA datasets for each FAIR principle, before and after the improvement of the datasets’ metadata. As part of the first iterative improvement, PANGAEA prioritized improving the access level (Accessibility) and data content descriptions (Reusability) following our recommendations. We anticipate the data center will improve its data findability and interoperability as part of the next iteration of the process. The improvement of the WDCC datasets is in progress. The two repositories provided feedback to improve the F-UJI service, primarily on fine-tuning the assessment for different levels of objects (e.g., experiment, data group, and dataset) and on elaborating the properties representing data provenance information. Further pilots will be undertaken with the repositories selected for in-depth collaboration through the FAIRsFAIR open calls, for example Phaidra-Italy and DataverseNO, and with repositories collaborating with the project such as the CSIRO Data Access Portal and DataverseNL.
FAIR-enabling services (e.g., repositories, metadata standards, licenses, and policy registries) are essential to support a fully automated evaluation, specifically when not all relevant metadata required by the assessment service are embedded in the data landing pages or metadata of datasets. Thus, planned work includes exploring the potential for interfacing the assessment tool with registries (e.g., FAIRsharing) to increase the level of automation.
6. CONCLUSIONS
The paper described how the RDA recommendation ‘FAIR Data Maturity Model: specification and guidelines’ has been adopted and adapted to what can realistically be tested to assess FAIR data objects. We presented core object assessment metrics built on the recommendation and their pilot applications in two priority use cases, before or after data deposit in trustworthy data repositories. Testing of the tools to date has helped researchers to become more FAIR-Aware and repositories to assess their FAIR-enabling services, and it has already motivated participating repositories to begin improving their practices. Below we share some of the initial lessons learned.
From our experience of elaborating the RDA FAIR Data Maturity indicators into metrics and of implementing and testing these metrics in practice, we want to emphasize that the development of FAIR metrics is a continuous process; it should therefore be embedded in an iterative consultation process that incorporates feedback from the metrics’ implementation.
Since FAIR is a journey, testing the metrics iteratively with actual datasets and different users (e.g., through assessment tools) at different stages of the research data cycle is vital to putting the metrics into practice and to identifying opportunities and critical areas for improving FAIR data and enabling services. We significantly improved the metrics and their assessment methods by communicating their known limitations and opportunities openly and transparently in the metrics’ specification.
As technology and community requirements evolve over time, it is crucial to revisit the FAIR principles and the corresponding indicators to better interpret and implement them. For example, assumptions that overgeneralize the permanent identification of data and metadata objects will influence the implementation of assessments and their results; likewise, continued access to metadata depends on a data repository’s sustainability and its preservation practices.
FAIR object assessment is a component of the FAIR ecosystem and requires FAIR-enabling services to be developed and refined in parallel. Trustworthy data repositories play an important role in ensuring continued access to and long-term preservation of the objects and their metadata. The role of data service providers in enabling FAIR should be recognized and appreciated, as they are ‘proxies’ between different FAIR stakeholders (e.g., researchers and funders).
As part of the planned work, the metrics and pilot applications will go through further iterations of improvement based on feedback from the community. FAIRsFAIR will continue to support the wider use of these tools and to refine them over the final 18 months of the project. Any resulting revisions to the indicators, and their validation (or otherwise) through testing, will be incorporated into the planned RDA Maintenance phase for the RDA Data Maturity Indicators. In addition, the project team will propose a badging scheme to present the results of the data object assessment. Furthermore, FAIRsFAIR will explore a broader range of use cases to evaluate the metrics in more lifecycle phases and with a larger variety of stakeholders, addressing a wider range of the scenarios laid out in Table 1. The focus will be on working with CoreTrustSeal and on developing concepts and workflows to integrate object assessment into repository certification (scenario 4).
The goal of the FAIRsFAIR project is to increase the availability and reuse of FAIR data. The metrics and assessment tools that we have developed help to realize this goal by supporting researchers and service providers to put FAIR into practice.