3. Data interoperability framework for dataset cataloguing
The first step in conducting research in the health domain is finding and requesting access to datasets that meet criteria derived from the clinical use cases to be addressed. To achieve this, the information held by the various data sources must be catalogued appropriately and the resulting catalogues made accessible for browsing. Typically, these catalogues include metadata outlining the fundamental, high-level characteristics of the datasets.
In the context of EUCAIM, Tier 1 focuses on achieving interoperability at a dataset metadata level. This entails standardizing the definition, documentation, and exchange of aggregated dataset metadata across diverse systems. Key steps for achieving interoperability at this level include adopting widely recognized metadata standards, using controlled vocabularies to prevent ambiguity, and facilitating automatic metadata exchange between systems. By achieving interoperability at the dataset metadata level and standardizing key characteristics of the EUCAIM cancer imaging datasets, we simplify the process for users and applications to find and assess whether a specific EUCAIM dataset meets their needs.
Within the European context, various activities and regulations, notably the EU regulation on the European Health Data Space (Article 55), aim to enhance and promote data sharing. The regulation emphasizes the need for health data access bodies to maintain a systematically arranged dataset catalogue that is accessible online. To fulfill this requirement, a common generic framework is necessary.
As already described and analyzed in D5.1 (section 3.5)[2], DCAT-AP v3.0.0, along with an extension, has been adopted as the metadata standard for dataset cataloguing and for describing the cancer imaging datasets to be registered in the EUCAIM public catalogue. The extension is necessary to incorporate the domain-specific imaging and clinical metadata required for discovering the EUCAIM cancer imaging datasets. A parallel effort at the European level, specifically for the health domain, is the HealthDCAT-AP specification, which extends the general DCAT-AP to describe health-related datasets in a way that also complies with the European Health Data Space regulation. EUCAIM has built upon the currently available unofficial HealthDCAT-AP specification[3] and has analyzed its alignment with the EUCAIM DCAT-AP extension defined within this project for cancer imaging-related information. The following sections describe an updated version of the metadata model presented in D5.1, section 3.5, and provide detailed mappings between the generic DCAT-AP, HealthDCAT-AP, and the EUCAIM DCAT-AP.
3.1 EUCAIM DCAT-AP
Extending DCAT, that is, creating a so-called DCAT Application Profile for a specific domain, comes with a set of requirements that should be met:
The mandatory requirements defined in the DCAT-AP should be respected.
The controlled vocabularies of the DCAT-AP specification must be respected.
Recommended and optional properties could become mandatory (have stricter semantics).
Recommended attributes could become optional (less strict semantics).
New domain-specific controlled vocabularies could be defined for newly added properties.
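The requirements listed above can be expressed as a simple automated check. The following is a minimal sketch, not the procedure used by EUCAIM: it assumes that a profile is represented as a mapping from property IRI to obligation level, and the function name `check_extension` and the two profile excerpts are hypothetical, illustrative values. It only verifies the first rule (mandatory properties stay mandatory); controlled vocabularies are not checked.

```python
def check_extension(base, extension):
    """Return a list of rule violations for an application-profile extension.

    Both arguments map a property IRI to its obligation level
    ("mandatory", "recommended" or "optional").
    """
    problems = []
    for prop, base_level in base.items():
        ext_level = extension.get(prop)
        # A property that is mandatory in the base profile must stay mandatory.
        if base_level == "mandatory" and ext_level != "mandatory":
            problems.append(
                f"{prop}: mandatory in the base profile, "
                f"{ext_level or 'missing'} in the extension"
            )
    return problems


# Hypothetical excerpts, for illustration only (not the actual profiles).
base_profile = {
    "dct:title": "mandatory",
    "dct:description": "mandatory",
    "dcat:contactPoint": "recommended",
    "dcat:distribution": "recommended",
}
extension_profile = {
    "dct:title": "mandatory",
    "dct:description": "mandatory",
    "dcat:contactPoint": "mandatory",  # recommended -> mandatory (stricter)
    "dcat:distribution": "optional",   # recommended -> optional (less strict)
    "eucaim:ageLow": "mandatory",      # newly added domain-specific property
}

violations = check_extension(base_profile, extension_profile)
print(violations)  # → []
```

Making a property stricter, relaxing a recommended property, and adding a new property all pass the check; only dropping or weakening a mandatory property is reported.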
Our methodology for extending DCAT-AP, as described in D5.1, section 3.5.2, was first to establish the minimum (mandatory) information that should accompany the medical images and describe the datasets to be registered in the EUCAIM public catalogue. As a reminder, we adopted a bottom-up approach, gathering the obligatory information mandated by the AI4HI projects for the various cancer types considered within these projects. Additionally, we explored the initiatives undertaken by the European Network of Cancer Registries (ENCR), with a focused examination of the Standard Dataset specifications document and the cancer data quality checks proposal. At the same time, we sought to leverage and build upon the work of other European initiatives, such as the BBMRI-ERIC biobank metadata catalogue, the AI4HI project metadata catalogues, and the AI interoperability in imaging White Paper, which includes a set of required data elements useful for AI model development. Finally, to specify the semantics and mappings of the clinical terms, and therefore to define the controlled vocabularies for the newly added properties, we use the EUCAIM Hyper-ontology specification (described in section 4). All the details of the approach and an initial outcome are described in deliverable D5.1, section “3.5. Public Catalogue-Metadata Model”. An updated metadata model that aims to comply with the work on HealthDCAT-AP is given below in Tables 1 and 2, which outline the general dataset metadata and the domain-specific EUCAIM dataset metadata respectively. For the general metadata defined in DCAT-AP v3.0.0, only the mandatory and recommended properties are listed; the optional ones are excluded for conciseness.
Table 1: General dataset metadata (DCAT-AP specification with stricter semantics in some cases)
EUCAIM DCAT-AP

| Property Type | Property | Description | Property IRI | Range | Cardinality | Example |
| --- | --- | --- | --- | --- | --- | --- |
| Mandatory | Title | A clear and concise name for the dataset. | dct:title | rdfs:Literal | 1..n | dct:title "Open Challenge Prostate Cancer V1"@en; |
| Mandatory | Description | A detailed description of the dataset. | dct:description | rdfs:Literal | 1..n | dct:description "This ProCAncer-I project imaging dataset contains a collection of patients with mpMRI examinations (T2ax, DWI, DCE) who have confirmed PCa at biopsy and/or prostatectomy."@en |
| Recommended | Acronym | An acronym that identifies the dataset. | dct:alternative | rdfs:Literal | 0..n | dct:alternative "TCGA"@en |
| Recommended | keyword | A keyword describing the dataset. | dcat:keyword | rdfs:Literal | 0..n | dcat:keyword "prostate cancer"@en, "MRI performed"@en, "positive histology"@en; |
| Recommended | images creation year | A temporal period that the dataset covers. This corresponds to the year range in which the actual (DICOM) images were created/acquired (if this has not been changed in the anonymization process). If this is not available, an estimation can be added. | dct:temporal | dct:PeriodOfTime | 0..n | dct:temporal [ a dct:PeriodOfTime; dcat:endDate "2023-12-31"^^<http://www.w3.org/2001/XMLSchema#date>; dcat:startDate "2021-01-01"^^<http://www.w3.org/2001/XMLSchema#date> ]; |
| Mandatory | contact point | Contact information of the individual/managing organization of the Dataset. | dcat:contactPoint | vcard:Kind | 1..n | dcat:contactPoint [ a vcard:Organization; vcard:hasEmail <mailto:access-commitee@procancer-i.com> ]; |
| Recommended | geographical coverage | A geographic region that is covered by the Dataset. | dct:spatial | http://publications.europa.eu/resource/authority/country/ OR dct:Location | 1..n | dct:spatial <http://publications.europa.eu/resource/authority/country/GRC>; |
| Mandatory | Publisher | An entity (organisation) responsible for making the Dataset available. | dct:publisher | foaf:Organization | 1..1 | dct:publisher [ a foaf:Organization; locn:address [ a locn:Address; foaf:name "FORTH"; foaf:mbox <mailto:access-commitee@procancer-i.com>; foaf:homepage <https://forth.ics.gr>; ];]; |
| Mandatory | Theme | A category of the dataset. | dcat:theme | fixed to: http://publications.europa.eu/resource/authority/data-theme OR subproperty of dct:subject skos:Concept | 1..n | dcat:theme <http://publications.europa.eu/resource/authority/data-theme/HEAL>; |
| Mandatory | Identifier | A unique persistent identifier of the dataset (in compliance with the findability aspect of the FAIR principles). | dct:identifier | rdfs:Literal | 1..n | dct:identifier "https://catalogue.eucaim.cancerimage.eu/api/fdp/fdp_Dataset/2081ac523632f434cd5bc4056a30ad5b"^^<http://www.w3.org/2001/XMLSchema#anyURI>; |
| Mandatory (NSIP) | accessRights | The access rights of the dataset. | dct:accessRights | | 1..n | dct:accessRights <http://publications.europa.eu/resource/authority/access-right/RESTRICTED> ; |
| Mandatory | rights | A statement about the conditions of access and usage of the dataset. | dct:rights | dct:RightsStatement (fixed to a predefined set of values presented in D5.1) | | dct:rights [ a dct:RightsStatement; rdfs:label "Authorisation to access, view and process in-situ the datasets"@en ]; |
| Mandatory | applicableLegislation | The legislation that mandates the creation or management of the Dataset. | dcatap:applicableLegislation | rdfs:Resource | 1..n | dcatap:applicableLegislation <http://data.europa.eu/eli/reg/2022/868/oj>; |
| Recommended | modification date | The most recent date on which the Dataset was changed or modified. | dct:modified | rdfs:Literal typed as xsd:date, xsd:dateTime, xsd:gYear or xsd:gYearMonth | 0..1 | dct:modified "2024-02-05T18:47:54Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>; |
| Recommended | sample | A sample distribution of the dataset. | adms:sample | dcat:Distribution | 0..n | adms:sample [ a dcat:Distribution ; dct:description "Synthetic data of the HealthPilot Use Case"@en; dcat:downloadURL <https://github.com/CAVDgit/EHDS2_UC_Sciensano/blob/main/use_case_1_synthetic_data_10K_individuals.csv>; dcat:mediaType <http://www.iana.org/assignments/media-types/text/tab-separated-values> ; ]; |
| Mandatory | provenance | A statement about the lineage of a Dataset, including information about how the data was created or processed, including the methodologies, tools, and protocols used. | dct:provenance | dct:ProvenanceStatement | | dct:provenance [ a dct:ProvenanceStatement; rdfs:label "This data is sourced from several existing datasets, including the Duke dataset, ParcTauli and TCGA datasets. These datasets collectively provide comprehensive demographic and clinical data relevant to the project's objectives"@en ]; |
| Mandatory | Type | A type of the Dataset. | dct:type | skos:Concept (there is a predefined set of values presented in D5.1) | 1..n | dct:type [ a skos:Concept ; skos:prefLabel "Annotated Dataset"@en ] ; |
| Mandatory | Version | The version of the dataset. | dcat:version | rdfs:Literal (in SemVer or CalVer format) | 1..1 | dcat:version "20231122" |
| Mandatory | accessURL* | A URL that gives information about accessing the dataset. In EUCAIM, this is the URL of the negotiator service. | dcat:accessURL | rdfs:Resource | 1..1 | dcat:accessURL <https://negotiator.eucaim.cancerimage.eu/collection/a96b56cd-59d4-444a-8e59-32a7fb0d7dea> ; |
| Recommended | license* | A license under which the Dataset is made available, assuming there is one license for all Dataset Distributions. If the Distributions have different licenses, these should be included at the Distribution level with a 1..1 relationship. | dct:license | dct:LicenseDocument (ideally under CC licenses for interoperability) | 0..n | dct:license [ a dct:LicenseDocument; dct:identifier <http://creativecommons.org/licenses/by/4.0/> ; ]; |
| Recommended | imageSize (in GB)* | The total size of all Distributions in the dataset, which is mainly the image size. | dcat:byteSize | xsd:decimal | 0..1 | dcat:byteSize "325"^^xsd:decimal |
| Recommended | format* | The file format of the Distributions included in the Dataset. | dct:format | dct:MediaTypeOrExtent (IANA Media Types) | 0..n | dct:format <https://www.iana.org/assignments/media-types/application/dicom>; |
*These are properties of the “Distribution” entity. However, they will be included in the metadata catalogue at the dataset level as multivalued attributes.
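The cardinalities in Table 1 lend themselves to an automated completeness check before a dataset record is registered. The sketch below is an illustration under stated assumptions, not part of the EUCAIM tooling: the record is assumed to be a plain mapping from property IRI to a list of values, the names `parse_cardinality`, `validate` and `PROFILE_EXCERPT` are hypothetical, and only a handful of the Table 1 properties are included.

```python
def parse_cardinality(spec):
    """Turn '1..n' or '0..1' into (minimum, maximum); None means unbounded."""
    low, high = spec.split("..")
    maximum = None if high in ("n", "*") else int(high)
    return int(low), maximum


def validate(record, profile):
    """Compare the number of values per property with the allowed cardinality."""
    errors = []
    for prop, spec in profile.items():
        minimum, maximum = parse_cardinality(spec)
        count = len(record.get(prop, []))
        if count < minimum:
            errors.append(f"{prop}: expected at least {minimum} value(s), found {count}")
        elif maximum is not None and count > maximum:
            errors.append(f"{prop}: expected at most {maximum} value(s), found {count}")
    return errors


# Excerpt of Table 1: property IRI -> cardinality.
PROFILE_EXCERPT = {
    "dct:title": "1..n",
    "dct:description": "1..n",
    "dct:publisher": "1..1",
    "dcat:version": "1..1",
    "dct:modified": "0..1",
    "dcat:keyword": "0..n",
}

# Hypothetical record with two deliberate problems:
# two publishers and no version.
record = {
    "dct:title": ["Open Challenge Prostate Cancer V1"],
    "dct:description": ["Imaging dataset of patients with mpMRI examinations."],
    "dct:publisher": ["FORTH", "Another organisation"],
    "dcat:keyword": ["prostate cancer", "MRI performed"],
}

for message in validate(record, PROFILE_EXCERPT):
    print(message)
# prints:
# dct:publisher: expected at most 1 value(s), found 2
# dcat:version: expected at least 1 value(s), found 0
```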
Table 2: EUCAIM DCAT-AP domain-specific metadata
EUCAIM DCAT-AP

| Property Type | Property | Description | Property IRI | Range | Cardinality | Example |
| --- | --- | --- | --- | --- | --- | --- |
| Mandatory | age low | The minimum age of subjects within the dataset. | eucaim:ageLow | xsd:int | 1..1 | eucaim:ageLow "18"^^xsd:int ; |
| Mandatory | age high | The maximum age of subjects within the dataset. | eucaim:ageHigh | xsd:int | 1..1 | eucaim:ageHigh "18"^^xsd:int ; |
| Recommended | age median | The median age of subjects within the dataset. | eucaim:ageMedian | xsd:int | 0..1 | eucaim:ageMedian "45"^^xsd:int ; |
| Mandatory | birthsex | Birth sex of the subjects in the dataset. | eucaim:birthsex | skos:Concept | 1..n | eucaim:birthsex <https://cancerimage.eu/ontology/EUCAIM#COM1000177> |
| Mandatory | number of studies | Total count of DICOM studies. | eucaim:nbrOfStudies | xsd:int | 1..1 | eucaim:nbrOfStudies "8789"^^xsd:int ; |
| Mandatory | number of subjects | Total count of unique individuals in the dataset. | eucaim:nbrOfSubjects | xsd:int | 1..1 | eucaim:nbrOfSubjects "8237"^^xsd:int ; |
| Recommended | number of series | Total count of DICOM series. | eucaim:nbrOfSeries | xsd:int | 1..1 | eucaim:nbrOfSeries "24567"^^xsd:int ; |
| Mandatory | intended purpose | The primary objective for which the dataset was created. | eucaim:intendedPurpose | dpv:Purpose | 1..n | eucaim:intendedPurpose [ a dpv:Purpose ; dct:description "The primary objective of this dataset is the detection of prostate cancer with high accuracy both in peripheral and transitional zones to identify which men have cancer and those with no cancer."@en; ] ; |
| Mandatory | collection method | Defines the scope of data aggregation within the dataset. It specifies how data records are organized based on different criteria, allowing users to understand the context in which the data was collected. | eucaim:collectionMethod | subproperty of dct:subject skos:Concept (fixed to a predefined set of values presented in D5.1) | 1..n | eucaim:collectionMethod [ a skos:Concept ; skos:prefLabel "Only-Image"@en ] ; |
| Mandatory | quality label | A statement related to the quality of the Dataset, including rating and quality certificate as per the EHDS requirements. | dqv:hasQualityAnnotation | dqv:QualityCertificate | 1..1 | dqv:hasQualityAnnotation [ a dqv:QualityCertificate ; oa:hasTarget <https://…/dataset/123>; oa:hasBody <https://…/certificate>; oa:motivatedBy dqv:qualityAssessment ]; |
| Mandatory | legal basis | Legal basis used to justify processing of data or use of technology in accordance with a law. | dpv:hasLegalBasis | dpv:LegalBasis | 1..n | dpv:hasLegalBasis [ a dpv:LegalBasis ; dct:description "Deliberation no. 21/028 of february 18, 2021, last amended on june 18, 2021, relating to the communication of data to pseudonymized personal character relating to the healthdata of.. , as part of the EUCAIM project and the subsequent processing of personal data pseudonymised by…"@en; dct:source <https://cancerimage.eu/file/view/AXkNfdPml9vUUfvGGfJr?filename=21-028-f212-AFMPS-dataset-modifi%C3%A9e%20le%2018%20juin%202021.pdf> ; ] ; |
| Mandatory | condition | The primary cancer condition of individuals in the dataset. | eucaim:hasCondition | skos:Concept (EUCAIM controlled vocabulary based on ICD-O-3 and SNOMED) | 1..1 | eucaim:hasCondition <https://cancerimage.eu/ontology/EUCAIM#CLIN1000075> (Cancer of prostate) |
| Mandatory | image modality | The set of modalities for the images in the dataset. | eucaim:hasImageModality | skos:Concept (EUCAIM controlled vocabulary based on DICOM and RadLex) | 1..n | eucaim:hasImageModality <https://cancerimage.eu/ontology/EUCAIM#IMG1000022> (Magnetic Resonance Imaging) |
| Mandatory | image vendor | Manufacturer of the imaging device as defined in DICOM tag (0008,0070). | eucaim:hasImageVendor | skos:Concept (EUCAIM controlled vocabulary) | 1..n | |
| Mandatory | image body part | Anatomical areas captured in the images. | eucaim:hasImageBodyPart | skos:Concept (EUCAIM controlled vocabulary) | 1..n | eucaim:hasImageBodyPart <https://cancerimage.eu/ontology/EUCAIM#BP1000233> (Neck and chest) |
The full mappings between DCAT-AP v3.0.0, the current HealthDCAT-AP, and the EUCAIM DCAT-AP are described in: Mappings of DCAT application profiles. In the worksheet, some properties are highlighted in different colors to denote semantics that are either stricter or less strict than those of the current HealthDCAT-AP specification, as discussed in the related EUCAIM WP5 working group.
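To illustrate how the domain-specific properties of Table 2 combine into one dataset description, the following sketch renders a few of them as a Turtle fragment. It is an assumption-laden example rather than the EUCAIM implementation: the function `to_turtle`, the dataset IRI and the value used for `eucaim:ageHigh` are hypothetical, the property names follow the Property IRI column, and prefix declarations are left out, as they are in the table examples.

```python
INTEGER_PROPERTIES = (
    "eucaim:ageLow",
    "eucaim:ageHigh",
    "eucaim:nbrOfStudies",
    "eucaim:nbrOfSubjects",
)
CONCEPT_PROPERTIES = (
    "eucaim:hasCondition",
    "eucaim:hasImageModality",
    "eucaim:hasImageBodyPart",
)


def to_turtle(dataset_iri, record):
    """Render a subset of the Table 2 properties as a Turtle fragment."""
    # Basic plausibility check before anything is serialized.
    if record["eucaim:ageLow"] > record["eucaim:ageHigh"]:
        raise ValueError("eucaim:ageLow must not exceed eucaim:ageHigh")
    statements = ["a dcat:Dataset"]
    for prop in INTEGER_PROPERTIES:
        statements.append(f'{prop} "{record[prop]}"^^xsd:int')
    for prop in CONCEPT_PROPERTIES:
        # Controlled-vocabulary values are written as IRIs, comma-separated.
        iris = ", ".join(f"<{iri}>" for iri in record[prop])
        statements.append(f"{prop} {iris}")
    return f"<{dataset_iri}>\n    " + " ;\n    ".join(statements) + " ."


# Values taken from the Table 2 examples, except ageHigh (hypothetical).
example_record = {
    "eucaim:ageLow": 18,
    "eucaim:ageHigh": 85,
    "eucaim:nbrOfStudies": 8789,
    "eucaim:nbrOfSubjects": 8237,
    "eucaim:hasCondition": ["https://cancerimage.eu/ontology/EUCAIM#CLIN1000075"],
    "eucaim:hasImageModality": ["https://cancerimage.eu/ontology/EUCAIM#IMG1000022"],
    "eucaim:hasImageBodyPart": ["https://cancerimage.eu/ontology/EUCAIM#BP1000233"],
}

# The dataset IRI below is a placeholder.
print(to_turtle("https://example.org/dataset/1", example_record))
```

Keeping the integer and concept properties in separate tuples mirrors the two kinds of ranges in Table 2 (typed literals and controlled-vocabulary concepts).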
3.2 FAIR principles compliance
To support dataset metadata interoperability, it is also crucial to consider the FAIR principles. These principles guide the development of metadata so that datasets are findable, accessible, interoperable, and reusable across diverse environments. DCAT-AP, serving as the standard framework for dataset descriptions, aligns well with the FAIR principles, which introduce another set of requirements that must be met. The Research Data Alliance introduced the FAIR Data Maturity Model, which assesses the level of adherence to the FAIR principles through indicators defined for each principle (F1, F2, F3, F4, and so on). These indicators can be seen as requirements for the “FAIRification” of the datasets, and they should be analyzed and verified with respect to the specific restrictions and requirements of the EUCAIM project.
Finally, interoperability at the dataset metadata level involves facilitating automatic metadata exchange between systems. Here the FAIR Data Point (FDP)[4] comes into play: a metadata service that adheres to the FAIR principles and offers a reference implementation (an API) enabling data owners to expose data and metadata in a FAIR manner based on the DCAT metadata standard. Data holders are not required to run an FDP to expose their datasets in a machine-readable format for Tier 1 or 2, although EUCAIM will recommend and facilitate its adoption even on these tiers. EUCAIM itself will adopt the FDP in the central EUCAIM metadata catalogue for two purposes: to expose its dataset metadata automatically to other dataset catalogues, increasing the visibility and discoverability of the datasets, and to harvest dataset metadata from already established infrastructures that offer an FDP service on their catalogues.
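As a rough illustration of what harvesting from an FDP could look like on the client side, the sketch below prepares an HTTP request that asks for RDF metadata in Turtle. It assumes that the FDP serves its metadata over plain HTTP GET with content negotiation; the helper names `build_metadata_request` and `harvest` and the URL `https://fdp.example.org/` are hypothetical and do not refer to an actual EUCAIM endpoint.

```python
from urllib.request import Request, urlopen


def build_metadata_request(fdp_url, rdf_format="text/turtle"):
    """Prepare a GET request that asks an FDP for RDF metadata."""
    # The Accept header requests the RDF serialization (assumed to be supported).
    return Request(fdp_url, headers={"Accept": rdf_format}, method="GET")


def harvest(fdp_url, timeout=10):
    """Download the metadata document served at fdp_url and return it as text."""
    with urlopen(build_metadata_request(fdp_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder URL; no network call is made when only building the request.
request = build_metadata_request("https://fdp.example.org/")
print(request.get_method(), request.full_url, request.get_header("Accept"))
# prints: GET https://fdp.example.org/ text/turtle
```

A harvester would call `harvest` for each registered FDP and then parse the returned Turtle with an RDF library before merging it into the central catalogue.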