5. Interoperability framework for federated processing

To enable federated processing, data holders should implement both a semantic and a syntactic interoperability layer across their datasets. Semantic interoperability concerns whether the meaning of the data is consistent across datasets (this layer should also be implemented in Tier 2), while syntactic interoperability concerns how the data is structurally persisted within a database.

Syntactic interoperability matters at this tier because any tool or AI/ML model processing the data must know the format and structure of the local dataset, aspects that the conceptual specifications (entities, relationships, terminologies) of the hyper-ontology do not address.

5.1 CDM business requirements

Prior to selecting a CDM, we conducted an initial analysis of the main requirements, expectations, and constraints from various stakeholders. Our approach involved engaging with representatives from the AI4HI projects and requesting specific information, as follows:

The specific cancer types that each project focused on.
The clinical questions/use cases addressed by each project.
The clinical and imaging data used to answer these questions, including mandatory and optional information.
The format of the raw data available and whether standardized terminologies were used for different data types, along with the versions of these terminologies.
The anonymization techniques/profiles employed by each project to ensure compliance with GDPR and national data privacy laws.
Details about the modalities of radiological images collected and the imaging metadata associated with them, or extracted, if applicable.
Information regarding the format of segmentation masks, if they exist.
The chosen common data model and whether it covers all data types, with a straightforward mapping from the raw data.

This information was collected and documented in the ORSD document described in the previous section, and the outcome of the analysis was outlined in D5.1 (section 3). The analysis revealed many challenges. The AI4HI projects deal with different cancer types, with only three out of five projects dealing with a common type of cancer (i.e. breast and prostate cancer). They address different use cases and therefore rely on different clinical and imaging data, and they use different terminologies, different anonymization profiles, and different formats for the segmentations. Although all of them have adopted standardized data models (OMOP-CDM or FHIR resources), these models also differ between projects. Most importantly, as some of the AI4HI projects are being finalized, they have no plans to transform their datasets to another specific standard, since each has already selected and adopted the data model that serves its own needs. In addition to the AI4HI projects, we need to consider constraints that might arise from new data holders willing to join the EUCAIM federation, which might have either standardized or entirely ad-hoc data models and might also differ in their capabilities, in terms of technical facilities and resources in general.

Following the collection of information from the AI4HI projects, several group meetings were conducted with different domain experts within the consortium, including AI experts, data holders, software engineers, and legal teams, to define the data model business requirements for the project. The most critical requirements are presented below:

EUCAIM should support as many input formats as possible for raw clinical and imaging data, which may or may not comply with interoperability standards.
The data model should be terminology-agnostic, accommodating different terminologies seamlessly.
The effort required from clinical data managers to prepare data for federated processing and analysis through the platform should be minimized.
The data model must fully comply with GDPR and national privacy laws.
The data model should comprehensively represent all target data types at their intended level of detail, including clinical, demographic, radiomic, and laboratory data.
The data model should be extensible to allow for additional/new data to be represented.
The data model must provide an interface for accessing and querying data for the purpose of training federated AI models.
Data transformations from the raw source to the AI training dataset should be as straightforward as possible.
The data model should be structured in a way (usually in a tabular format) that simplifies the retrieval of records in the training dataset, regardless of the training plan of an AI algorithm.

Within EUCAIM, two potential frameworks for data harmonization and standardization are being explored, as mentioned in the TEHDAS recommendations on a Data Quality Framework document^[16]. One approach involves transforming all datasets held by a data holder to comply with a specific internationally adopted standard (e.g., OMOP-CDM). The other approach entails preparing the dataset for delivery based on a specific data schema that includes the necessary harmonization rules, controlled vocabularies, and standards.

In the first approach, harmonization is driven by a standard design, resulting in a dataset that is comprehensible to the community and can be used for federated analysis and to support interoperability with other research infrastructures and networks (e.g., OHDSI, Darwin EU, EHDEN). However, this method requires significant upfront effort (although only done once per dataset) and is only accessible after extracting, semantically mapping, and transforming all data sources to the standard data model. This ties the research question specification to the semantic constraints of the standard model specification.

In the second approach, harmonization is driven by the materialization of specific information in a bespoke data model, where each transformation is limited to specific entities and variables of interest. This, however, limits the reuse of the data in other contexts and introduces an additional data model for specific purposes. It is important to note that preparing datasets for secondary use should not be limited to mapping concepts. It also requires developing data models that provide a logical harmonized schema, integrating different health data sources among data holders.

In the context of EUCAIM, we explored different approaches to be considered for Tier 3 (federated processing/analysis and AI model development), which is the maximum level of interoperability to be achieved in EUCAIM, based on the two aforementioned harmonization frameworks. These approaches, which guided many decisions regarding the CDM (e.g. structure, format), are analyzed in the following section.

5.2 Data harmonization approaches for the federated processing/analysis

5.2.1 Scenario 1: EUCAIM Hyper-Ontology Based CDM for Analysis

The architecture for this scenario is shown in Figure 20. This case outlines two distinct pathways for integrating data from AI4HI repositories or already established repositories adopting standards (OMOP, FHIR) and new data holders with ad-hoc models.

Established repositories (e.g. AI4HI projects): implement a mediator/data access service that dynamically transforms and structures data according to the hyper-ontology and CDM specification.
Other data holders (e.g. hospitals): undergo an Extract Transform Load (ETL) process, directly converting their local data into an EUCAIM hyper-ontology based CDM.

Figure 20: EUCAIM CDM for analysis & OMOP, FHIR, EUCAIM local data models. For OMOP and FHIR a mediator and mapping component is necessary.

In this scenario, researchers access a Data Access Service to request the specific information needed to create their model’s input dataset (cohort) in a tabular form (e.g. csv). Established repositories (e.g. AI4HI repositories) use a mediator service and a mapping component to translate queries expressed in hyper-ontology concepts (e.g., age at diagnosis, modality) into the local CDM query language and the local CDM concepts. This is essentially the same mapping component/service as the mediator in Tier 2, except that here the mediator returns specific hyper-ontology based attributes (e.g. age at diagnosis, modality, PSA) rather than aggregated information. The requested information can subsequently be stored in a tabular file (e.g. csv, parquet), along with the corresponding images, in a POSIX path that the federated processing service is able to access. For new data holders, an ETL process aligns datasets directly with the EUCAIM hyper-ontology based CDM specification.
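The mediator's role described above can be illustrated with a minimal Python sketch. It assumes a toy local repository held in memory; the mapping dictionary, the local column names (`dx_age`, `img_modality`, `psa_ng_ml`), and the function `materialize_cohort` are hypothetical and are not part of any EUCAIM specification.

```python
import csv
import io

# Hypothetical mapping from hyper-ontology attribute names to the column
# names of one local repository; real mappings would be repository-specific.
LOCAL_MAPPING = {
    "age_at_diagnosis": "dx_age",
    "modality": "img_modality",
    "psa": "psa_ng_ml",
}

# Toy stand-in for the local data store (invented values).
LOCAL_RECORDS = [
    {"pid": "P001", "dx_age": 63, "img_modality": "MR", "psa_ng_ml": 7.2},
    {"pid": "P002", "dx_age": 71, "img_modality": "CT", "psa_ng_ml": 11.5},
]


def materialize_cohort(requested, records, mapping):
    """Return CSV text holding only the requested hyper-ontology attributes."""
    unknown = [name for name in requested if name not in mapping]
    if unknown:
        raise KeyError(f"no local mapping for: {unknown}")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # The header uses hyper-ontology names, not local column names.
    writer.writerow(["patient_id", *requested])
    for record in records:
        writer.writerow([record["pid"], *(record[mapping[name]] for name in requested)])
    return buffer.getvalue()


cohort_csv = materialize_cohort(
    ["age_at_diagnosis", "modality"], LOCAL_RECORDS, LOCAL_MAPPING
)
print(cohort_csv)
```

In this sketch the researcher only ever sees hyper-ontology names; the translation to local names stays inside the node, which is the property the mediator is meant to provide.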

The advantages of this approach are:

The researchers are able to slice and dice the information available according to the needs of their analysis/use case and the inputs of their respective models in an easy and user-friendly way through the data access service.
Federated Learning scenarios are easier for the researchers, since they can specify what type of data (and in which format) they want to be available on each federated node.
Eliminates the need for AI4HI repositories to go through an ETL process to transform their data; instead, they create a mapping component that transforms only the requested information on the fly and on demand.
Streamlines data transformation for new data holders through an ETL process, without implementing any mediator/mapping component.

The disadvantages of this approach are:

A model registry or a UI is required so that researchers can specify the “granularity” of the input their models/tools expect (e.g. which variables).
A data access service is needed to accept specifications of the needed dataset and create (materialize) dynamic cohorts based on these, which increases complexity.
The mediator component's on-the-fly data transformation (materialization) is technically challenging.
Adopts a bespoke data model for new providers (based on the hyper-ontology), limiting its utility outside EUCAIM.

5.2.2 Scenario 2: Integration with OMOP-FHIR for Wider Compatibility

In this scenario, new data holders can opt to convert their data into either OMOP-CDM or FHIR based standards. This facilitates easier integration with EUCAIM, in a similar way to the AI4HI projects and enhances data utility beyond the EUCAIM ecosystem. Therefore:

Figure 21: OMOP-FHIR locally adopted standards with an EUCAIM-based CDM for analysis; mediator and mapping components are necessary for all nodes in the federation.

Established (AI4HI) repositories and data holders compliant with OMOP/FHIR standards use a mediator service as in Scenario 1. EUCAIM will need to provide mediator components (OMOP/FHIR) to the new data holders, i.e. customized versions of them, as even the same CDM has differences in the way the information is structured, as described in section 4.
Data holders not compliant with OMOP/FHIR standards undergo an ETL process to comply with either OMOP or FHIR standards.

Figure 21 shows the architectural design of this approach. The advantage of this approach compared to Scenario 1 is that new data holders align with well-established, generic standard data models, enhancing interoperability and impact beyond EUCAIM. The disadvantage is that a mediator service and a mapping component must be implemented in this case as well, so that all OMOP- and FHIR-based repositories are harmonized for data analysis, with all the disadvantages this mediator service entails, as described in Scenario 1.

5.2.3 Scenario 3: Simplifying Integration Through ETL process

This approach mandates all participating repositories to undergo a one-time ETL process, conforming to the EUCAIM hyper-ontology based CDM, thereby reducing technical complexities associated with mediator services. In this case all federated nodes can use the same (simpler) Data Access Service implementation that exports data from the CDM into a common format. Figure 22 shows the architectural design of this approach.

Figure 22: EUCAIM based CDM for all nodes participating in the federation. This would require a one-time transformation and no mediator/mapping component is necessary.

5.2.4 Scenario 4: EUCAIM hyper-ontology only for federated query purposes, OMOP-CDM for analysis

In this scenario, the EUCAIM hyper-ontology is only applicable for Tier 2 for the federated query purposes and is not used for federated processing. The architectural design of this approach is outlined in Figure 23.

All participating repositories should conform to the OMOP-CDM standard data model and go through an ETL process (apart from those already based on OMOP-CDM, although some adaptation will be needed to address specific issues, as described in section 4.1). The federated processing service could directly access, for example, an SQLite^[17] file containing the whole OMOP-CDM relational schema, perform any desired query, and transform the result to any tabular format for input to the AI model or for analysis.
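As a rough illustration of this access pattern, the following Python sketch queries an SQLite database and flattens the result into CSV text. It assumes a heavily reduced schema: only a few columns of the OMOP-CDM `person` and `condition_occurrence` tables are created, an in-memory database stands in for the node's SQLite file, and all inserted values (including the concept identifiers) are placeholders.

```python
import csv
import io
import sqlite3

# In-memory stand-in for the SQLite file; a real node would open its own file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        gender_concept_id INTEGER,
        year_of_birth INTEGER
    );
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        condition_concept_id INTEGER
    );
    """
)
# Placeholder rows; the concept identifiers are invented for illustration.
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [(1, 501, 1955), (2, 502, 1962)])
conn.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)", [(10, 1, 1001), (11, 2, 1002)]
)

# Join the relational tables into one row per patient condition.
QUERY = """
    SELECT p.person_id, p.year_of_birth, c.condition_concept_id
    FROM person AS p
    JOIN condition_occurrence AS c ON c.person_id = p.person_id
    ORDER BY p.person_id
"""
cursor = conn.execute(QUERY)
header = [column[0] for column in cursor.description]
rows = cursor.fetchall()
conn.close()

# Flatten the query result into CSV text for downstream tooling.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
table_csv = buffer.getvalue()
print(table_csv)
```

Even in this reduced form, the researcher has to know which tables to join and on which keys, which reflects the SQL and OMOP-CDM knowledge this scenario requires.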

Figure 23: OMOP-CDM as the EUCAIM CDM for federated processing and analysis. Hyper-ontology only for federated queries.

The approach of not having a data access service in this case, but rather providing the whole dataset for researchers to slice and dice, could also be applied to the previous scenarios, regardless of the chosen CDM for analysis. However, the disadvantage of this approach is that all nodes need both to go through an ETL process and to have a mediator for Tier 2, as Tier 2 conforms to the hyper-ontology concepts and terms (for bridging the gaps between the OMOP and FHIR standards). This approach could also be used with a FHIR-based standard; however, as described and analyzed in D5.1, OMOP-CDM is more appropriate as a CDM for analysis and AI-related operations. Another drawback is that researchers are given an SQLite file/relational database to work with, which requires knowledge of both OMOP-CDM and the SQL query language, rather than the tabular format that AI experts are usually accustomed to and that can be dynamically formed for their purposes. To mitigate this, another access service could be added on top of the OMOP-CDM databases for more user-friendly access to the underlying data.

5.3. The EUCAIM Common Data Model

5.3.1. CDM Selection Rationale

Based on the aforementioned analysis and the requirements from various stakeholders, i.e., AI experts, data model experts and AI4HI project representatives, Scenario 1 and Scenario 3 were deemed the most appropriate for supporting all the necessary processes for querying and transforming information required by the AI model algorithms and frameworks. Consequently, the EUCAIM CDM for analysis and federated processing/learning will be based on the hyper-ontology specification, which underpins the EUCAIM logical data model.

It is important to note that EUCAIM will not mandate one of these scenarios over the other: data holders may adopt either Scenario 1 (a mediator implementation) or Scenario 3 (a one-time ETL process). However, the EUCAIM partners agreed that a one-time transformation to the EUCAIM CDM is more straightforward and easier to implement, so this will be the recommended approach.

As we initially described in Section 4.4.3, the mCODE conceptual model was identified as the most appropriate basis for grounding the hyper-ontology in the oncology domain, especially to build the core layer of the hyper-ontology model by ontologically analyzing and explicitly and semantically representing the mCODE basic specifications. The rationale behind this decision is multifold.

Although the OMOP-CDM and FHIR standards are widely used for standardizing and exchanging healthcare data, they have limitations when it comes to AI-related tasks, especially those requiring tabular data for model training and analysis. OMOP-CDM excels in transforming and standardizing data from diverse healthcare sources into a common format, which is beneficial for interoperability and large-scale observational studies. However, because of its generic nature and the fact that it is an observational model, querying oncology-related information is not straightforward for AI experts. For example, in its oncology extension most of the cancer modifiers, as defined in the OMOP-CDM specification, are represented as “Measurements”, which limits the semantics of cancer stages, cancer grades, extensions, invasions, etc. Similarly, the basic FHIR (Fast Healthcare Interoperability Resources) specification is designed to facilitate real-time data exchange between healthcare systems, with its primary focus on ensuring that different systems can communicate effectively. However, FHIR’s hierarchical and often complex data structures are not inherently suited to the tabular data formats required by many AI algorithms and frameworks. For reference, all tools currently available in EUCAIM, which are thoroughly described and analyzed in D5.4, require clinical and imaging metadata in a tabular format.

Due to the aforementioned reasons, EUCAIM explored the two most prominent data models in oncology: mCODE (Minimal Common Oncology Data Elements)^[18] and OSIRIS^[19] (Interoperability and data sharing of clinical and biological data in oncology) which are both event-based models. mCODE, introduced by the ASCO and a group of collaborators, provides a standardized set of essential oncology data elements, ensuring interoperability and data consistency, which is critical for building reliable AI models. Although mCODE is based on FHIR, it narrows down the scope to oncology-specific data elements, making it easier to extract and query relevant information for cancer research and AI applications. On the other hand, OSIRIS, developed by INCa, offers a minimum data set for the sharing of clinico-biological data in oncology. Its relational model makes it easier to represent and manipulate as tabular data, which is ideal for AI model training. This structure allows for efficient querying, aggregation, and analysis of large datasets.

All options considered, the EUCAIM CDM will build upon the conceptual model of the mCODE specification and the OSIRIS data framework, leveraging the strengths of each while accounting for the specific constraints imposed by the secondary use of data and by the AI4HI projects. For example, both models contain mandatory attributes that cannot be populated from the information available in the AI4HI projects. This is due to the GDPR and to the anonymization strategies each project follows to reduce the risk of patient re-identification, and to the fact that the clinical information collected by the projects accompanies the imaging data. In particular, none of the date-related attributes included in the OSIRIS and mCODE specifications are part of the knowledge collected from the AI4HI projects, because the clinical information is anonymized. Instead, relative relations based on events such as diagnosis or treatment (e.g., events that happened X months after baseline/diagnosis/treatment) are included.
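To make the idea of relative, event-based time concrete, the sketch below shows one way a data holder could derive "months after diagnosis" from absolute dates before sharing. The function names, the record layout, and the choice to round down to whole months are assumptions made for illustration; the source does not prescribe how the AI4HI projects computed these offsets.

```python
from datetime import date


def months_between(baseline: date, event: date) -> int:
    """Whole months from baseline to event, rounded down."""
    months = (event.year - baseline.year) * 12 + (event.month - baseline.month)
    if event.day < baseline.day:
        months -= 1  # the last month has not fully elapsed yet
    return months


def relativize(events, baseline_event="diagnosis"):
    """Replace absolute dates with month offsets from the baseline event."""
    baseline = next(e["date"] for e in events if e["event"] == baseline_event)
    return [
        {"event": e["event"], "months_after_baseline": months_between(baseline, e["date"])}
        for e in events
    ]


# Invented example timeline for a single patient.
events = [
    {"event": "diagnosis", "date": date(2020, 3, 15)},
    {"event": "surgery", "date": date(2020, 5, 2)},
    {"event": "follow_up_mri", "date": date(2021, 3, 15)},
]
relative_events = relativize(events)
for item in relative_events:
    print(item)
```

The output keeps the ordering and spacing of events that an AI model needs while dropping the calendar dates themselves.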

In summary, in the context of EUCAIM, mCODE will be the base conceptual model for representing the various cancer types, cancer stages, performance status metrics and scales, as well as assessments (e.g. radiological assessments such as the ACR Reporting and Data Systems (RADS)^[20]). It is also generally more advantageous because it is built on the FHIR standard, which can be exploited, if necessary, in other contexts for data exchange. In addition, the relational nature of OSIRIS and its approach of creating pivot tables (.csv files) for AI-related processes support efficient data selection for data preprocessing, feature extraction, and model training, ultimately enhancing the development of AI applications in oncology. EUCAIM will therefore follow the same approach, using pivot tables to help AI experts select specific cohorts as input to their models.
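The pivot-table idea can be sketched as follows: event-style records (one row per patient and variable) are reshaped into one row per patient with one column per variable. This is a generic illustration using only the Python standard library; the variable names and values are invented, and the actual OSIRIS or EUCAIM pivot files may be organized differently.

```python
# Long-form records: (patient_id, variable, value). Values are invented.
observations = [
    ("P001", "cancer_stage", "II"),
    ("P001", "ecog", "1"),
    ("P002", "cancer_stage", "III"),
]


def pivot(rows):
    """Reshape long-form rows into a wide table with a header row."""
    columns = sorted({variable for _, variable, _ in rows})
    by_patient = {}
    for patient_id, variable, value in rows:
        by_patient.setdefault(patient_id, {})[variable] = value
    table = [["patient_id", *columns]]
    for patient_id in sorted(by_patient):
        values = by_patient[patient_id]
        # Missing variables become empty cells rather than being dropped.
        table.append([patient_id, *(values.get(column, "") for column in columns)])
    return table


wide_table = pivot(observations)
for row in wide_table:
    print(",".join(row))
```

A table in this shape can be written directly to a .csv file and filtered by column to select a cohort.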

A first version of the EUCAIM Data Dictionary is described in the following section. A more detailed version is also available at: EUCAIM_CDM_mCODE_based_v1.0.xlsx

5.3.2. EUCAIM Data Dictionary

The EUCAIM CDM classifies all the clinical patient data into 6 different domains according to the mCODE specification:

5.3.2.1 Patient

The patient information group allows for general information about the patient including demographics, and the patient's managing organization.

Table 7: The EUCAIM CDM: Patient group

| Group | Entity | Data Element | Definition | EUCAIM Required | Occurrences Allowed | Data Type |
|---|---|---|---|---|---|---|
| Patient | Patient | Identifier | Anonymized patient identifier which is unique within the context of the system. | Required | 1..1 | string |
| | Patient | Gender | Administrative Gender - the gender that the patient is considered to have for administration and record keeping purposes. | Optional | 0..1 | CodeableConcept |
| | Patient | Ethnicity | Concepts classifying the person into a named category of humans sharing common history, traits, geographical origin or nationality. | Optional | 0..1 | CodeableConcept |
| | Patient | Race | Concepts classifying the person into groups based on their physical appearance | Optional | 0..1 | CodeableConcept |
| | Patient | Birth Year | The year of birth for the individual. | Optional (required if diagnosis age is not available) | 0..1 | Integer (>1900, <current year) |
| | Patient | Managing Organization | Organization that is the custodian of the patient record. Need to know who recognizes this patient record, manages and updates it. | Required | 0..1 | Organization |
| | Patient | Care Provider | Patient's primary care provider organization. | Optional | 0..1 | Organization |
| | Patient | Birth Sex | A code classifying the person's sex assigned at birth. | Required | 1..1 | CodeableConcept |
| | Cancer Patient | Deceased | Indicates if the individual is deceased or not. | Optional | 0..1 | boolean |
| | Cancer Patient | Cause of death | Main cause of death of the patient | Optional (conditional on deceased) | 0..1 | CodeableConcept |
| | Cancer Patient | Date of last contact | Date of last contact if not deceased, or date of death if deceased. | Optional (conditional on deceased) | 0..1 | Date |
| | Organization | Identifier | Identifies this organization across multiple systems | Optional | 1..1 | String |
| | Organization | Name | Name used for the organization | Optional | 1..1 | String |
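The required/optional rules of the Patient group table above lend themselves to a simple automated check. The following Python sketch covers only a subset of the data elements and makes several simplifying assumptions: CodeableConcept values are reduced to plain strings, the class and function names are hypothetical, and whether a diagnosis age is available is passed in as a flag because that element is defined outside this table.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Patient:
    # Elements marked "Required" with 1..1 occurrences.
    identifier: str
    birth_sex: str  # simplified: a CodeableConcept in the data dictionary
    # A subset of the optional elements.
    gender: Optional[str] = None
    birth_year: Optional[int] = None


def validate_patient(patient: Patient, diagnosis_age_available: bool) -> list:
    """Return a list of rule violations (an empty list means the record passes)."""
    errors = []
    if not patient.identifier:
        errors.append("Identifier is required")
    if not patient.birth_sex:
        errors.append("Birth Sex is required")
    if patient.birth_year is None:
        # Birth Year becomes required when diagnosis age is not available.
        if not diagnosis_age_available:
            errors.append("Birth Year is required when diagnosis age is not available")
    elif not (1900 < patient.birth_year < date.today().year):
        errors.append("Birth Year must be >1900 and <current year")
    return errors


valid = validate_patient(
    Patient(identifier="P001", birth_sex="female", birth_year=1960),
    diagnosis_age_available=False,
)
invalid = validate_patient(
    Patient(identifier="", birth_sex="male"),
    diagnosis_age_available=False,
)
print(valid)
print(invalid)
```

A check of this kind could be run by a data holder at the end of an ETL process, before the dataset is exposed for federated processing.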

5.3.2.2 Health Assessment

The health assessment group contains information related to the patient’s general health before and after treatment. This includes Comorbidities, Laboratory Tests, Performance Assessments (ECOG), Vital Signs, Family Member History, and Patient History of Metastatic Cancer.

Table 8: The EUCAIM CDM: Health assessment group

| Group | Entity | Data Element Name | Definition | EUCAIM Required | Occurrences Allowed | Data Type |
|---|---|---|---|---|---|---|
| Health Assessment | Family Member History | Subject | The patient that the family history is about | Required | 1..1 | Reference: Patient |
| | Family Member History | Relationship | Relationship to the subject | Required | 1..1 | CodeableConcept |
| | Family Member History | Condition Code | Condition that the related person had | Required | 1..1 | CodeableConcept |
| | Family Member History | Onset Age | When condition first manifested on the relative. | Optional | 0..1 | Age |
| | History of Metastatic Cancer | Code | Type of observation | Optional | 0..1 | CodeableConcept |
| | History of Metastatic Cancer | Value | The information determined as a result of making the observation, if the information has a simple value. | Optional | 0..1 | boolean |
| | Comorbidities | Focus | Comorbid conditions are typically defined with respect to a specific 'index' condition. For example, comorbid condition categories would be those specified by CDC, namely obesity, renal disease, respiratory disease, etc. | Optional | 0..* | Reference: PrimaryCancerCondition |
| | Comorbidities | Comorbid Condition Present | A comorbid condition that is known to be present | Required (conditional) | 0..* | CodeableConcept |
| | Comorbidities | Comorbid Condition Absent | A condition that is NOT present, related to the patient. | Required (conditional) | 0..* | CodeableConcept |
| | Comorbidities | Code | Describes what was observed. Sometimes this is called the observation "name". | Required | 1..1 | CodeableConcept |
| | Comorbidities | Subject | The patient whose comorbidities are recorded. | Optional | 0..1 | Reference: CancerPatient |
| | ECOG Performance Status | Category | A code that classifies the general type of observation being made. | Required | 1..1 | CodeableConcept |
| | Histologic Grade | Subject | Patient whose test result is recorded. | Required | 1..1 | Reference: CancerPatient |
| | Histologic Grade | ValueAsConcept | The Laboratory result value. If a coded value, the value CodeableConcept.code should be selected from SNOMED CT, if the concept exists. | Required | 1..1 | CodeableConcept |
| | Histologic Grade | ValueAsNumber | The Laboratory result value. If a numeric value, value Quantity.code shall be selected from [UCUM](http://unitsofmeasure.org). | Required | 1..1 | Quantity |
| | Histologic Grade | Method | Indicates the mechanism used to perform the observation. | Optional | 0..1 | CodeableConcept |

5.3.2.4 Cancer Treatments

The cancer treatment group includes treatment techniques used to treat cancer patients, categorized as: medications, surgery, and radiotherapy.

Table 10: The EUCAIM CDM: Cancer treatment group

Group

Entity

Data Element Name

Definition

EUCAIM Required

Occurrences Allowed

Data Type

Treatment