3. Data interoperability framework for dataset cataloguing

The first step in conducting research in the health domain is finding and requesting access to datasets that fulfill certain criteria based on the clinical use cases that need to be addressed. To achieve this, it is essential to appropriately catalogue the information held by the various data sources and to make these catalogues accessible for browsing. Typically, such catalogues include metadata outlining the fundamental, high-level characteristics of the datasets.

In the context of EUCAIM, Tier 1 focuses on achieving interoperability at a dataset metadata level. This entails standardizing the definition, documentation, and exchange of aggregated dataset metadata across diverse systems. Key steps for achieving interoperability at this level include adopting widely recognized metadata standards, using controlled vocabularies to prevent ambiguity, and facilitating automatic metadata exchange between systems. By achieving interoperability at the dataset metadata level and standardizing key characteristics of the EUCAIM cancer imaging datasets, we simplify the process for users and applications to find and assess whether a specific EUCAIM dataset meets their needs.

Within the European context, various activities and regulations, notably the EU regulation on the European Health Data Space (Article 55), aim to enhance and promote data sharing. The regulation emphasizes the need for health data access bodies to maintain a systematically arranged dataset catalogue that is accessible online. To fulfill this requirement, a common generic framework is necessary.

As already described and analyzed in D5.1 (section 3.5)[2], DCAT-AP v3.0.0, along with an extension, has been adopted as the metadata standard for dataset cataloguing and for describing the cancer imaging datasets to be registered in the EUCAIM public catalogue. The extension is necessary to incorporate the domain-specific imaging and clinical metadata required for discovering the EUCAIM cancer imaging datasets. A parallel effort at the European level, specifically for the health domain, is the HealthDCAT-AP specification, which extends the general DCAT-AP to describe health-related datasets in compliance with the European Health Data Space regulation. EUCAIM has built upon the unofficial HealthDCAT-AP specification[3] currently available and has analyzed its alignment with the EUCAIM DCAT-AP extension defined within this project for cancer imaging-related information. The following sections describe an updated version of the metadata model described in D5.1, section 3.5, and provide detailed mappings between the generic DCAT-AP, the HealthDCAT-AP and the EUCAIM DCAT-AP.

3.1 EUCAIM DCAT-AP

Extending DCAT, and thereby creating a so-called DCAT Application Profile for a specific domain, comes with a set of requirements that should be met:

  • The mandatory requirements defined in the DCAT-AP should be respected.

  • The controlled vocabularies of the DCAT-AP specification must be respected.

  • Recommended and optional properties may become mandatory (stricter semantics).

  • Recommended properties may become optional (less strict semantics).

  • New domain-specific controlled vocabularies may be defined for newly added properties.
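The rules above can be read as a table of permitted changes to a property's obligation level. The following is a minimal sketch of such a check, not part of the EUCAIM tooling: the function name, the dictionary layout and the sample obligation levels are illustrative assumptions, and allowing an optional property to become recommended is an assumption that the rules do not state explicitly.

```python
# Obligation levels a base-profile property may take in an extension.
# "optional" -> "recommended" is an assumption; the rules above do not mention it.
ALLOWED_CHANGES = {
    "mandatory": {"mandatory"},
    "recommended": {"mandatory", "recommended", "optional"},
    "optional": {"mandatory", "recommended", "optional"},
}


def check_extension(base: dict[str, str], extension: dict[str, str]) -> list[str]:
    """Return the violations found when `extension` refines `base`."""
    problems = []
    for prop, base_level in base.items():
        ext_level = extension.get(prop)
        if ext_level is None:
            # Dropping a property is only a problem when the base requires it.
            if base_level == "mandatory":
                problems.append(f"{prop}: mandatory property is missing")
        elif ext_level not in ALLOWED_CHANGES[base_level]:
            problems.append(f"{prop}: {base_level} -> {ext_level} is not allowed")
    return problems


# Illustrative obligation levels, not a full copy of either profile.
base_profile = {
    "dct:title": "mandatory",
    "dcat:keyword": "recommended",
    "dcat:version": "optional",
}
eucaim_profile = {
    "dct:title": "mandatory",
    "dcat:keyword": "recommended",
    "dcat:version": "mandatory",   # stricter than the base: allowed
    "eucaim:ageLow": "mandatory",  # new domain-specific property: allowed
}
```

Properties that exist only in the extension are never reported, which reflects the freedom to add domain-specific properties with their own controlled vocabularies.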

Our methodology for extending DCAT-AP, as described in D5.1, section 3.5.2, was to first establish the minimum (mandatory) information that should accompany the medical images and describe the datasets to be registered in the EUCAIM public catalogue. As a reminder, we adopted a bottom-up approach, gathering the information mandated by the AI4HI projects for the various cancer types considered within these projects. Additionally, we explored the initiatives undertaken by the European Network of Cancer Registries (ENCR), focusing on the Standard Dataset specifications document and the cancer data quality checks proposal. At the same time, we sought to build upon the work of other European initiatives, such as the BBMRI-ERIC biobank metadata catalogue and the AI4HI project metadata catalogues, as well as the AI interoperability in imaging White Paper, which includes a set of required data elements useful for AI model development. Finally, to specify the semantics and mappings of the clinical terms, and therefore to define the controlled vocabularies for the newly added properties, we use the EUCAIM Hyper-ontology specification (described in section 4). All the details of the approach and an initial outcome are described in deliverable D5.1, section “3.5. Public Catalogue-Metadata Model”. An updated metadata model that aims to comply with the work on HealthDCAT-AP is given below in Tables 1 and 2, which outline the general dataset metadata and the domain-specific EUCAIM dataset metadata, respectively. For the general metadata defined in DCAT-AP v3.0.0, only the mandatory and recommended properties are listed; the optional ones are omitted for conciseness.

Table 1: General dataset metadata (DCAT-AP specification with stricter semantics in some cases)

EUCAIM DCAT-AP

| Property Type | Property | Description | Property IRI | Range | Cardinality | Example |
| --- | --- | --- | --- | --- | --- | --- |
| Mandatory | Title | A clear and concise name for the dataset. | dct:title | rdfs:Literal | 1..n | dct:title "Open Challenge Prostate Cancer V1"@en; |
| Mandatory | Description | A detailed description of the dataset. | dct:description | rdfs:Literal | 1..n | dct:description "This ProCAncer-I project imaging dataset contains a collection of patients with mpMRI examinations (T2ax, DWI, DCE) who have confirmed PCa at biopsy and/or prostatectomy."@en |
| Recommended | Acronym | An acronym that identifies the dataset. | dct:alternative | rdfs:Literal | 0..n | dct:alternative "TCGA"@en |
| Recommended | keyword | A keyword describing the dataset. | dcat:keyword | rdfs:Literal | 0..n | dcat:keyword "prostate cancer"@en, "MRI performed"@en, "positive histology"@en; |
| Recommended | images creation year | A temporal period that the dataset covers. This corresponds to the year range in which the actual (DICOM) images were created/acquired (if this has not been changed in the anonymization process). If this is not available, an estimation can be added. | dct:temporal | dct:PeriodOfTime | 0..n | dct:temporal [ a dct:PeriodOfTime; dcat:endDate "2023-12-31"^^<http://www.w3.org/2001/XMLSchema#date>; dcat:startDate "2021-01-01"^^<http://www.w3.org/2001/XMLSchema#date> ]; |
| Mandatory | contact point | Contact information of the individual/managing organization of the Dataset. | dcat:contactPoint | vcard:Kind | 1..n | dcat:contactPoint [ a vcard:Organization; vcard:hasEmail <mailto:access-commitee@procancer-i.com> ]; |
| Recommended | geographical coverage | A geographic region that is covered by the Dataset. | dct:spatial | | 1..n | dct:spatial <http://publications.europa.eu/resource/authority/country/GRC>; |
| Mandatory | Publisher | An entity (organisation) responsible for making the Dataset available. | dct:publisher | foaf:Organization | 1..1 | dct:publisher [ a foaf:Organization; locn:address [ a locn:Address; foaf:name "FORTH"; foaf:mbox <mailto:access-commitee@procancer-i.com>; foaf:homepage <https://forth.ics.gr>; ];]; |
| Mandatory | Theme | A category of the dataset. | dcat:theme | fixed to: http://publications.europa.eu/resource/authority/data-theme OR subproperty of dct:subject skos:Concept | 1..n | dcat:theme <http://publications.europa.eu/resource/authority/data-theme/HEAL>; |
| Mandatory | Identifier | A unique persistent identifier of the dataset (in compliance with the findability aspect of the FAIR principles). | dct:identifier | rdfs:Literal | 1..n | dct:identifier "https://catalogue.eucaim.cancerimage.eu/api/fdp/fdp_Dataset/2081ac523632f434cd5bc4056a30ad5b"^^<http://www.w3.org/2001/XMLSchema#anyURI>; |
| Mandatory (NSIP) | accessRights | The access rights of the dataset. | dct:accessRights | | 1..n | |
| Mandatory | rights | A statement about the conditions of access and usage of the dataset. | dct:rights | dct:RightsStatement (fixed to a predefined set of values presented in D5.1) | | dct:rights [ a dct:RightsStatement; rdfs:label "Authorisation to access, view and process in-situ the datasets"@en ]; |
| Mandatory | applicableLegislation | The legislation that mandates the creation or management of the Dataset. | dcatap:applicableLegislation | rdfs:Resource | 1..n | dcatap:applicableLegislation <http://data.europa.eu/eli/reg/2022/868/oj>; |
| Recommended | modification date | The most recent date on which the Dataset was changed or modified. | dct:modified | rdfs:Literal typed as xsd:date, xsd:dateTime, xsd:gYear or xsd:gYearMonth | 0..1 | dct:modified "2024-02-05T18:47:54Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>; |
| Recommended | sample | A sample distribution of the dataset. | adms:sample | dcat:Distribution | 0..n | adms:sample [ a dcat:Distribution ; dct:description "Synthetic data of the HealthPilot Use Case"@en; dcat:downloadURL <https://github.com/CAVDgit/EHDS2_UC_Sciensano/blob/main/use_case_1_synthetic_data_10K_individuals.csv>; dcat:mediaType <http://www.iana.org/assignments/media-types/text/tab-separated-values> ; ]; |
| Mandatory | provenance | A statement about the lineage of a Dataset, including information about how the data was created or processed, including methodologies, tools, and protocols used. | dct:provenance | dct:ProvenanceStatement | | dct:provenance [ a dct:ProvenanceStatement; rdfs:label "This data is sourced from several existing datasets, including the Duke dataset, ParcTauli and TCGA datasets. These datasets collectively provide comprehensive demographic and clinical data relevant to the project's objectives"@en ]; |
| Mandatory | Type | A type of the Dataset. | dct:type | skos:Concept (there is a predefined set of values presented in D5.1) | 1..n | dct:type [ a skos:Concept ; skos:prefLabel "Annotated Dataset"@en ]; |
| Mandatory | Version | The version of the dataset. | dcat:version | rdfs:Literal (in SemVer or CalVer format) | 1..1 | dcat:version "20231122" |
| Mandatory | accessURL* | A URL that gives information about accessing the dataset. In EUCAIM, this is the URL of the negotiator service. | dcat:accessURL | rdfs:Resource | 1..1 | dcat:accessURL <https://negotiator.eucaim.cancerimage.eu/collection/a96b56cd-59d4-444a-8e59-32a7fb0d7dea> ; |
| Recommended | license* | A license under which the Dataset is made available, assuming there is one license for all Dataset Distributions. If the Distributions have different licenses, these should be included at the Distribution level with a 1..1 relationship. | dcterms:license | dcterms:LicenseDocument (ideally under CC licenses for interoperability) | 0..n | dcterms:license [ a dcterms:LicenseDocument; dcterms:identifier <http://creativecommons.org/licenses/by/4.0/> ; ]; |
| Recommended | imageSize (in GB)* | The total size of all Distributions in the dataset, which is mainly the image size. | dcat:byteSize | xsd:decimal | 0..1 | dcat:byteSize "325"^^xsd:decimal |
| Recommended | format* | The file format of the Distributions included in the Dataset. | dct:format | dct:MediaTypeOrExtent (IANA Media Types) | 0..n | dct:format <https://www.iana.org/assignments/media-types/application/dicom>; |

*These properties belong to the “Distribution” entity. However, they will be included in the metadata catalogue at the dataset level as multivalued attributes.
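The cardinalities in Table 1 lend themselves to an automated check before a dataset record is registered. The sketch below illustrates the idea under stated assumptions: the record is represented as a plain dictionary from property IRI to a list of values, only a handful of rows from Table 1 are included, and the function names are hypothetical rather than part of any EUCAIM service.

```python
from typing import Optional

# Cardinalities copied from a few rows of Table 1 (property IRI -> cardinality).
TABLE_1_RULES = {
    "dct:title": "1..n",
    "dct:description": "1..n",
    "dct:publisher": "1..1",
    "dcat:version": "1..1",
    "dcat:keyword": "0..n",
    "dct:modified": "0..1",
}


def parse_cardinality(cardinality: str) -> tuple[int, Optional[int]]:
    """Turn '1..n' into (1, None) and '0..1' into (0, 1)."""
    low, high = cardinality.split("..")
    # "n" (or "*") means there is no upper bound.
    return int(low), (None if high in ("n", "*") else int(high))


def validate_record(record: dict[str, list], rules: dict[str, str]) -> list[str]:
    """List every property whose number of values violates its cardinality."""
    errors = []
    for prop, cardinality in rules.items():
        low, high = parse_cardinality(cardinality)
        count = len(record.get(prop, []))
        if count < low:
            errors.append(f"{prop}: expected at least {low} value(s), found {count}")
        elif high is not None and count > high:
            errors.append(f"{prop}: expected at most {high} value(s), found {count}")
    return errors
```

A record that lacks `dct:description`, for instance, would be reported as having fewer values than the required minimum of one. In a production setting, the same constraints would more naturally be expressed as SHACL shapes over the RDF graph; the dictionary form is used here only to keep the example self-contained.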

Table 2: EUCAIM DCAT-AP domain-specific metadata

EUCAIM DCAT-AP

| Property Type | Property | Description | Property IRI | Range | Cardinality | Example |
| --- | --- | --- | --- | --- | --- | --- |
| Mandatory | age low | The minimum age of subjects within the dataset. | eucaim:ageLow | xsd:int | 1..1 | eucaim:ageLow "18"^^xsd:int ; |
| Mandatory | age high | The maximum age of subjects within the dataset. | eucaim:ageHigh | xsd:int | 1..1 | eucaim:ageHigh "18"^^xsd:int ; |
| Recommended | age median | The median age of subjects within the dataset. | eucaim:ageMedian | xsd:int | 0..1 | eucaim:ageMedian "45"^^xsd:int ; |
| Mandatory | birthsex | Birth sex of subjects in the dataset. | eucaim:birthsex | skos:Concept | 1..n | eucaim:birthsex <https://cancerimage.eu/ontology/EUCAIM#COM1000177> |
| Mandatory | number of studies | Total count of DICOM studies. | eucaim:nbrOfStudies | xsd:int | 1..1 | eucaim:nbrOfStudies "8789"^^xsd:int ; |
| Mandatory | number of subjects | Total count of unique individuals in the dataset. | eucaim:nbrOfSubjects | xsd:int | 1..1 | eucaim:nbrOfSubjects "8237"^^xsd:int ; |
| Recommended | number of series | Total count of DICOM series. | eucaim:nbrOfSeries | xsd:int | 1..1 | eucaim:nbrOfSeries "24567"^^xsd:int ; |
| Mandatory | intended purpose | The primary objective for which the dataset was created. | eucaim:intendedPurpose | dpv:Purpose | 1..n | eucaim:intendedPurpose [ a dpv:Purpose ; dct:description "The primary objective of this dataset is the detection of prostate cancer with high accuracy both in peripheral and transitional zones to identify which men have cancer and those with no cancer."@en; ] ; |
| Mandatory | collection method | Defines the scope of data aggregation within the dataset. It specifies how data records are organized based on different criteria, allowing users to understand the context in which the data was collected. | eucaim:collectionMethod | subproperty of dct:subject skos:Concept (fixed to a predefined set of values presented in D5.1) | 1..n | eucaim:collectionMethod [ a skos:Concept ; skos:prefLabel "Only-Image"@en ]; |
| Mandatory | quality label | A statement related to the quality of the Dataset, including a rating or quality certificate as per the EHDS requirements. | dqv:hasQualityAnnotation | dqv:QualityCertificate | 1..1 | dqv:hasQualityAnnotation [ a dqv:QualityCertificate ; oa:hasTarget <https://…/dataset/123>; oa:hasBody <https://…/certificate>; oa:motivatedBy dqv:qualityAssessment ]; |
| Mandatory | legal basis | Legal basis used to justify processing of data or use of technology in accordance with a law. | dpv:hasLegalBasis | dpv:LegalBasis | 1..n | dpv:hasLegalBasis [ a dpv:LegalBasis ; dct:description "Deliberation no. 21/028 of february 18, 2021, last amended on june 18, 2021, relating to the communication of data to pseudonymized personal character relating to the healthdata of.. , as part of the EUCAIM project and the subsequent processing of personal data pseudonymised by…"@en; dct:source <https://cancerimage.eu/file/view/AXkNfdPml9vUUfvGGfJr?filename=21-028-f212-AFMPS-dataset-modifi%C3%A9e%20le%2018%20juin%202021.pdf> ; ] ; |
| Mandatory | condition | The primary cancer condition of individuals in the dataset. | eucaim:hasCondition | skos:Concept (EUCAIM controlled vocabulary based on ICD-O-3 and SNOMED) | 1..1 | |
| Mandatory | image modality | The set of modalities for the images in the dataset. | eucaim:hasImageModality | skos:Concept (EUCAIM controlled vocabulary based on DICOM and RadLex) | 1..n | eucaim:hasImageModality <https://cancerimage.eu/ontology/EUCAIM#IMG1000022> (Magnetic Resonance Imaging) |
| Mandatory | image vendor | Manufacturer of the imaging device as defined in DICOM tag (0008,0070). | eucaim:hasImageVendor | skos:Concept (EUCAIM controlled vocabulary) | 1..n | |
| Mandatory | image body part | Anatomical areas captured in the images. | eucaim:hasImageBodyPart | skos:Concept (EUCAIM controlled vocabulary) | 1..n | eucaim:hasImageBodyPart <https://cancerimage.eu/ontology/EUCAIM#BP1000233> (Neck and chest) |
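Because the numeric properties in Table 2 are aggregated values, simple plausibility checks can catch data-entry mistakes before a dataset is published. The following is a minimal sketch, not a EUCAIM component: the function name and the dictionary representation are hypothetical, and the two count checks assume that every subject has at least one DICOM study and every study at least one series, which the table itself does not state.

```python
def check_domain_metadata(meta: dict) -> list[str]:
    """Return plausibility issues for the numeric properties of Table 2."""
    issues = []

    low, high = meta["eucaim:ageLow"], meta["eucaim:ageHigh"]
    if low > high:
        issues.append("ageLow is greater than ageHigh")

    median = meta.get("eucaim:ageMedian")  # recommended, so it may be absent
    if median is not None and not (low <= median <= high):
        issues.append("ageMedian lies outside [ageLow, ageHigh]")

    subjects = meta["eucaim:nbrOfSubjects"]
    studies = meta["eucaim:nbrOfStudies"]
    series = meta.get("eucaim:nbrOfSeries")  # recommended, so it may be absent

    # Assumption: each subject contributes at least one study.
    if studies < subjects:
        issues.append("fewer studies than subjects")
    # Assumption: each study contains at least one series.
    if series is not None and series < studies:
        issues.append("fewer series than studies")
    return issues
```

With the example counts from Table 2 (8237 subjects, 8789 studies, 24567 series), neither count check raises an issue.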

The full mappings between DCAT-AP v3.0.0, the current HealthDCAT-AP, and the EUCAIM DCAT-AP are described in: Mappings of DCAT application profiles. In the worksheet, some properties have been highlighted in different colors to denote stricter or less strict semantics relative to the current HealthDCAT-AP specification, as discussed in the related EUCAIM WP5 working group.

3.2 FAIR principles compliance

To support dataset metadata interoperability, it is also crucial to consider the FAIR principles. These principles guide the development of metadata to ensure that datasets are findable, accessible, interoperable, and reusable across diverse environments. DCAT-AP, serving as the standard framework for dataset descriptions, aligns well with the FAIR principles, which introduce another set of requirements that must be met. The Research Data Alliance introduced the FAIR Data Maturity Model, which assesses the level of adherence to the FAIR principles through a set of indicators defined for each principle (F1, F2, F3, F4, etc.). These indicators can be seen as requirements for the “FAIRification” of the datasets, and they should be analyzed and verified with respect to the specific restrictions and requirements of the EUCAIM project.

Finally, interoperability at the dataset metadata level involves facilitating automatic metadata exchange between systems. Here the FAIR Data Point (FDP)[4] comes into play: a metadata service that adheres to the FAIR principles and offers a reference implementation (an API) enabling data owners to expose data and metadata in a FAIR manner based on the DCAT metadata standard. Data holders are not required to run an FDP to expose their datasets in a machine-readable format for Tier 1 or 2, although EUCAIM will recommend and facilitate its adoption on these tiers as well. EUCAIM will, however, adopt an FDP in the central EUCAIM metadata catalogue for two purposes: to expose its dataset metadata automatically to other dataset catalogues, increasing the visibility and discoverability of the datasets, and to harvest dataset metadata from already established infrastructures that offer an FDP service on their catalogues.
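As an illustration of what harvesting from an FDP could look like on the client side, the sketch below prepares an HTTP request that asks for a metadata record in Turtle via content negotiation. It is a sketch under assumptions, not the EUCAIM harvester: the function names are hypothetical, `https://fdp.example.org` is a placeholder host, and it is assumed that the target FDP serves RDF in the requested serialization.

```python
from urllib.request import Request, urlopen


def build_metadata_request(resource_url: str, rdf_format: str = "text/turtle") -> Request:
    """Prepare a GET request asking for an RDF serialization of a metadata record."""
    # The Accept header drives content negotiation on the server side.
    return Request(resource_url, headers={"Accept": rdf_format})


def fetch_metadata(resource_url: str, timeout: float = 10.0) -> str:
    """Download the metadata record as text (performs a network call)."""
    with urlopen(build_metadata_request(resource_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder URL; a real harvester would start from the FDP root and follow
# the links to its catalogues and datasets.
request = build_metadata_request("https://fdp.example.org/catalog/1")
```

The returned Turtle would then be parsed with an RDF library and mapped onto the properties of Tables 1 and 2 before being added to the central catalogue.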
