4. Detailed Architecture

This section describes the components of the architecture in detail.

4.1. Dashboard

The users of the dashboard are mainly data requesters (e.g. researcher) who want to use EUCAIM data in the context of research and innovation, e.g., for analysis, AI training, AI validation, but also data and tool providers that want to get information and submit applications. Therefore, the Dashboard will also guide data providers to the documentation page explaining the general process required for becoming a data provider in EUCAIM and listing all the supporting tools provided by EUCAIM, e.g. for data preparation, data quality checking, etc., including information on how to access these supporting tools.

4.1.1. High-level architecture

The dashboard will integrate a number of applications under a coherent interface. The main set of applications is depicted in figure 3. This figure shows 6 different workflows:

In blue, workflows that could be done by an anonymous data requester.
In purple, workflows that could be done by a data provider.
In green, workflows that could be done by an authenticated data requester.
In yellow, workflows related to the access and provisioning requests.
In orange, workflows related to the access to data.
In black, workflows related to the processing of data.

The figure also considers three realms:

The anonymous public area, accessible by anonymous users.
The authenticated area, accessible only to authenticated users.
The provider’s area, accessible only to users authorised to access the provider’s data.

figure 3: Dashboard workflows.

4.1.2. User interactions

This section describes the interactions of the different user profiles through the Dashboard. Not all the user stories are implemented through the Dashboard. Actions related to the preparation of the local node and the data should be performed on premises.

The interactions are grouped into four main profiles (data provider, tool provider, data requester, and governing body member).

A colour coding is used to reference the corresponding workflow in figure 3. Actions (verbs) are underlined.

As a data provider, I want to go to a website to understand the processes involved in data preparation and provision, find instructions and description of as well as links to supporting tools for data preparation and uploading.
- I want to be able to register and log into the platform.
- I want to upload my already prepared data to the Central Repository and create a collection.
- I want to see information about the usage of my collections.
As a tool provider, I want to go to a website to understand the processes involved in the interface, documentation, maintenance, support, coding and security requirements and find instructions and description as well as links to supporting tools for interfacing to the repositories.
- I want to be able to register and log into the platform.
- I want to upload my already prepared tool to the Marketplace.
- I want to see information about the usage of my tool.
As a data requester (researcher), I want to go to a website and understand the terms of usage and the access conditions, as well as an overview of the available data.
- I want to see metadata of datasets in a public catalogue.
- I want to be able to register and log into the platform.
- I want to query the datasets based on specific search criteria (disease, imaging modalities, age groups …).
- I want to request/negotiate access to that data (if needed and possible).
- I want to have an overview of the dataset I selected and my authorisation in the user view of the catalogue
- I want to be able to create a selection of samples from the data I am authorised to access.
- I want to apply processing tools to the data that is in a federated or central repository
- I want to get my results.
- I may like to inform the providers of interesting results obtained thanks to the use of their data for their consideration.

4.1.3. Envisioned Technologies

The Dashboard will be a website that will integrate the Graphical User Interfaces of the different components in a seamless environment with a common design. The website main page will be made using React framework and deployed on a Node.js server. A MongoDB database will be used to store data. All the components will be embedded in Docker containers and will run on a managed platform based on Kubernetes. This does not preclude other components integrated into the dashboard (such as the public catalogue or the federated query) from using other software stacks.

4.1.4. Actions

This section describes the actions that need to be implemented to achieve the MM2 (early prototype) milestone.

Develop a proposal of the pages in an editable document (link).
Develop a common cooperative design for the Dashboard.
Make a landing page with links to all components.

4.2. Authentication and Authorisation Infrastructure (AAI)

Identification of the people accessing the platform, based on the performed authentication is a critical task for later executing authorization decisions. These tasks are a way to enable users’ access only to the relevant parts of the platform based on their roles and external claims made about their identity. A typical example is the ability to control access to the available metadata by including and/or excluding metadata sets when performing queries based on the users’ level of identity vetting, affiliation with the research entity or controlled access grants obtained via registered access applications.

4.2.1. High-level architecture

The services of the central hub will directly rely on the Life Science AAI (LS AAI) for Authentication and Authorisation, which will use external institutional IdPs for authentication. The architecture should be interoperable with external federated providers which should trust on the LS AAI for authentication, but could manage their own AAI instances. This way, a user registered in the LS AAI could browse and explore the dataset’s metadata available in the central hub, whose services will trust on the LS AAI tokens and will use the group entitlements as authorisation information. The use of GA4GH f token sets will also be considered.

figure 4: Architecture Diagram for the AAI services

Figure 4 shows the interactions among the main components. A user will access the Dashboard (1) and get authenticated through the LS AAI (2), which will trust on the Institutional IdP (3), providing a token back to the atlas. Other services of the Atlas (4) can verify the token in the LS AAI for verifying the identity.

When a user gets the authorization to access a specific provider (I), the policy enforcement component of the Atlas could trigger the inclusion of the user in a specific group (II) (this process could be human-based). This will allow the user accessing the dashboard to access the provider services. The Atlas Dashboard will directly forward to the provider services (A), which will request the login through the provider’s AAI (B). If the user is registered with LS AAI as IdP, the authentication (C) will be automatic. Then, if the user is the first time it logs in, the Provider’s AAI instance can create the user automatically if the proper group in LS AAI is given (D). However, this process could be more restrictive if the provider does not accept this approach (e.g. by requesting a full registration in the system). This approach will only request that the Federated Providers trust LS AAI as IdP and the configuration of the AAI provider’s instance to create users automatically (e.g. Keycloak can do it).

This approach will require:

Creation of an EUCAIM group with different roles for fine-grain differentiation of access rights.
Implement the dashboard with the LS AAI as major AAI, including registration.
Support the LS AAI from the provider’s side, potentially supporting the management of groups/roles for the authorisation.

The AAI should manage different groups/roles so the permissions can be defined. Depending on the granularity, we could aim for individual group permissions for providers or a coarse level. An example of roles could be:

EUCAIM General role. This is the minimum role given to the registered users, and allows a user to browse the catalogue and search for aggregated metadata. It will be granted automatically when a user registers the platform.
EUCAIM Federation Requester. This is a catch-all role given to all the users that have access granted. Providers can decide to authorise this role for a coarse-grain access permission.
EUCAIM Provider X Requester. To implement fine-grained access to a provider by a user, we could create a separate group/role per provider. In this way, a user may be authorised to selectively access the providers. Providers can simply implement the authorisation rules for its own group.
EUCAIM Provider. Providers will need this role to access the provider’s page and link their services to the federation.

4.2.2. User interactions for the AAI

This section describes the interactions of the users with the AAI.

Sign up into the platform via AAI - with an existing AAI account. A user will register the platform accepting the Terms of Usage and will authenticate using his/her own institutional IdP. The “EUCAIM external role” will be given.
Sign up into the platform via AAI - sign up for AAI account as well. Similar to the previous case, but for those who do not have an institutional IdP accepted by the Life Sciences AAI.
Login into the platform via AAI.
Log out of the platform.
Approve/reject out-of-band data transfer from AAI into the platform.
Link AAI account with the local account on the platform. A user can link an account created with an external IdP and a local IdP.

4.2.2.1. User’s Registration

First interaction of a user with respect to the AAI will be related to the registration.

A user accesses the Dashboard of EUCAIM.
A user signs up in the platform
1. Through the Life Sciences AAI
2. Using his/her Institutional IdP
3. Accepts the User’s & Privacy Policy.
4. User’s data is registered in the Platform.
5. The User receives an “EUCAIM” basic role which will allow him/her just to browse metadata (no access to actual data).

4.2.2.2. User’s Log-in

A user accesses the Dashboard of EUCAIM.
A user logs in the platform

2.1. Through the Life Sciences AAI

2.2. Using his/her Institutional IdP

4.2.2.3. Service AA validation

The validation of a token can be performed through the Identity service, or off-line checking the validity of the signature. The former procedure is recommendable, as a user may have the credentials revoked.

4.2.3. Envisioned Technologies

Life Science AAI. See documentation in https://lifescience-ri.eu/ls-login.html.
Keycloak (https://www.keycloak.org/). Although the objective is to rely directly on the managing of users and groups on Life Sciences AAI, external providers may prefer to have their own Identity and User Management service.

4.2.4. Actions

The following actions should be implemented to prepare the integration.

Prepare a mock-up with a Dashboard landing page, LS AAI and a Keycloak instance.
Agree on the authentication and authorisation model for the services.
Define the roles to be implemented.

4.3. Repository Federation

The federation of a repository deals with the preparatory actions required at the provider side to join the federation. This case has several actions in common with the request of storing the data in the central repository.

The workflow goes through four stages, already described as user stories in section 3.

Application of a data provider to join the federation (usDP1). The information about the rights and obligations and the procedures to be followed by a provider to contribute with data to the EUCAIM federation. This will be implemented through the Dashboard.
Quality control, both from the legal and technical perspectives (usDP2) and data preparation from a provider. This should be done on the provider’s side, supported by tools provided by the EUCAIM consortium. This will be clearly described in the agreement to be signed by the provider and the EUCAIM platform. This case will be slightly different from the case of a provider which would contribute to uploading the data in the central storage, as part of the quality control could be done at the central node resources.
Setup of a data node to connect to the federation (usDP3). Providers joining the federation have to install, operate and maintain a node to connect to the central services (at least AAI, Federated Catalogue and Federated Query, and optionally Processing Services, Monitoring, Traceability and Data Transfer Services).
Curated data uploaded into the federated local data node (usDP4). The data selected and prepared in usDP2 has to be made available in the local data node so it is linked to the EUCAIM federation.

4.3.1. High-level architecture

Figure 5 depicts the interactions of the Local Node with the services of the central node. These interactions are described in detail in the following subsections.

Figure 5: Interactions of a local provider and the central node.

The data provider should at least connect with the public catalogue, the federated query services and either the access end-point or the processing service. AAI services should trust the AAI authorisation schema of EUCAIM. Details on how these services interact are given in the rest of the subsections.

More detail is requested for the organization of a data node at the hospital level. The organization will depend on the specific structure of the services at the provider level, but it should consider a data warehouse where the data is extracted prior to being shared in the federation (or uploaded into the central storage).

Figure 6 shows the architecture of the structure of the local data node at La Fe University hospital.

figure 6: Architecture of La Fe University Hospital Data Warehouse.

In this architecture, three different areas are differentiated:

Identifiable data sources, which refer to the data storages that host the clinical real world data.
Pseudonymised data storages, which store data extracted from the RWD storages and goes through the process of standardisation and linkage.
Anonymised data storages, which host the fully anonynised data that is shared in the federation.

EUCAIM will provide tools for the anonymisation, quality analysis, harmonisation and FAIR compliance. Data will be organized into collections. The local node should provide connectivity with the central hub services as described before.

4.3.2. Interactions to join the federation

The procedures to join the federation are clearly described in D2.1 Onboarding invitation package and on the operational procedures in D4.1 First EUCAIM Operational Platform

4.3.3. Envisioned Technologies

XXX

https://github.com/samply/bridgehead

4.4. Public Catalogue

Data discovery is the essential first step for data re-use. The catalogue stores the metadata, offering the researchers the descriptive information about the available datasets. It will show data access conditions. At the same time, the catalogue offers a platform for data owners to display their datasets.

The Dashboard will present a public catalogue with the metadata of the federated datasets or collections available

Dynamically obtained through the collection of the metadata from the providers registered in EUCAIM, which will register the collections in the public catalogue.
The specification of the metadata of the collections will be a subset of the Hyperontology, defining the fields and variables to be used.
The catalogue will include some operational information from the providers, such as the access conditions. Providers may offer different access conditions: Authorisation to download the datasets, authorisation to access, view and process in-situ the datasets and authorisation to remotely process the datasets without the ability to access and visualise data, even remotely.
The catalogue metadata may be available even to anonymous users but searching should be available only to authenticated users (see Federated Query).

4.4.1. High-level architecture

The catalogue is the browsing interface where relevant metadata can be exposed by data providers and found by researchers (federated catalogue explorer). The catalogue is intended to:

Allow data providers to expose the metadata of their digital objects in a way that fulfils the FAIR (Findability, Accessibility, Interoperability and Reusability) Data Principles.
Allow data requesters to discover information about digital objects they are interested in.
Provide meaningful information about digital objects for both humans and software agents.

The metadata elements displayed in the catalogue are a limited list of standardised options, including access conditions; these constraints are configured by specifying a metadata schema. Metadata will be gathered from federated repositories (e.g., repositories of partner institutions, existing AI4HI projects and data warehouse catalogues) and mapped to the EUCAIM metadata model by the creation of data dictionaries. Standard vocabularies will be used to harmonise and facilitate interoperability. Ideally, the process of uploading the metadata entries to the EUCAIM catalogue will be automated. Figure 7 shows the architecture of the federated catalogue, following the same colouring convention of the previous figures.

Ffigure 7: Architecture of the Federated Catalogue.

4.4.2. User interactions

The user interactions for this service in EUCAIM are listed here.

As a data requester
- I want to explore the catalogue to find a dataset of interest accessing a limited set of variables.
- I want to know conditions for data use and data access instructions.
As a data provider
- I want to make my data discoverable, by pushing the metadata of my collections into the Federated Catalogue.

4.4.2.1. Explore the catalogue

Anonymous users can explore the catalogue and retrieve basic information. Other functionalities such as searching for collections fulfilling specific criteria will be limited to registered users only.

An anonymous user can access the federated catalogue through the Dashboard, which will display the collections registered in EUCAIM.
An anonymous user will be able to retrieve basic aggregated information of a specific collection (data model to be defined in the Hyperontology).
An anonymous user can retrieve the access condition for a specific collection.

4.4.2.2. Register collections into the catalogue

The registration of a collection will be triggered by the providers, which will define the procedure. Providers may consider to periodically register their collections into the federated catalogue, either through an automatic or manual procedure, or to even register immediately upon their creation at the provider’s side.

A data provider would like to register a collection in the Federated Catalogue. The data provider should prepare the metadata of the collection following EUCAIM’s specification.
A data provider is authorised to register collections in the Federated Catalogue. The authorisation requires the submission of an application and the signature of a Service Level Agreement. This will be implemented in the resource management section.
A data provider service registers the collection in the Federated Catalogue.

4.4.3. Envisioned Technologies

Molgenis (i.e. https://molgenis.eibir-edc.org/#/ developed in EUCanImage)
https://data.europa.eu/en (Official data portal of the European Commision, based on DCAT, seems good to connect to this).
Molgenis Fair API (https://molgenis.gitbook.io/molgenis/interoperability/guide-fair) with DCAT model.

4.4.4. Actions

Make inventory of catalogues currently used by AI4HI projects and other possible useful catalogues and make decision on catalogue for M9
Define the Metadata model for the catalogue (link to WP5).
Define a reduced set of metadata that is available publicly in the catalogue without registration.
Map repositories metadata to EUCAIM metadata model.
Define standards for the Metadata catalogue service (FAIR Data Point)
Define the API for the Metadata catalogue Service
- To query and retrieve data from providers.
- To interact with the explorer.
Define the look and feel for the Federated Catalogue Explorer (low priority).

4.5. Federated Query

In order to enable the Federated Query and Federated Access interactions, the EUCAIM Federated Data Query Service is designed to fulfil a core functionality of sending Federated Queries within the EUCAIM data infrastructure and aggregating the results. These queries are designed to either return aggregated information of the cases fulfilling the filters of the request. Once the access is granted, other searching mechanisms could be considered to return data indexes for each part of the data infrastructure, or to return the federated data itself, as is, or in segments.

The EUCAIM data infrastructure is comprised of EUCAIM Data Nodes (either Data Provider Nodes or the EUCAIM Central Node) and is extended through integration with the infrastructure of Repository Projects (i.e. AI4HI). Because these extended infrastructures hold their own Data Querying and Data access mechanisms, and their data models can be heterogeneous, each of these “Data Junctions” acts as a technical walled garden for creating a common Federated Query API to access them.

To overcome this and other resulting challenges, the Mediator component has been established. The Mediator, acting as a sort of middleware, is deployed at the site of each Data Junction, and:

Is provided with access to each Data Junction’s API.
Receives Federated Queries coming from the EUCAIM platform.
Transforms them to be compatible with the API.
Forwards the transformed request and receives the result.
Transforms the result to be compatible with the EUCAIM Federated Data Query Service
Returns the result to the EUCAIM Federated Data Query Service.

These steps describe the core functionality of the Mediator component, as it is created to support the EUCAIM Federated Data Query Service.

The result of a query will be the set of collections that fulfil the searching criteria, including the number of cases in each collection that match the filter. This will give an appraisal of the size of data of interest in each specific provider. If the number of cases matching the query is below a specific threshold, the service will not provide results.

4.5.1. High-level architecture

The Federated Query architecture is shown in figure 8 component will construct the query following the Hyperontology Common Data Model and forward it to the mediator endpoints of the providers. Each provider’s mediator endpoint will adapt the query to the specific syntax and structure of the local provider and query its endpoint. The results will be transformed and sent back to the Federated Query service, which will store them on a temporary database which will be used by the Federated Data Explorer to display them.

Figure 8: Architecture of the Federated Query service.

4.5.2. User interactions

There is only one user interaction with the federated search: Search for collections/datasets fulfilling a specific searching filter.

4.5.2.1. Query datasets fulfilling a specific criteria

An Authenticated user searches for data through the Dashboard by providing a query expression. An example could be to accept SQL syntax (it can be easily transformed in a query to a Dataframe for example) or directly other specifications such as GraphQL.
The Federated Query service sends a search expression to the providers registered in the platform according to the common data model.
1. The Query is transformed to the specific API of each provider by the mediator.
2. The output of the query is transformed to the EUCAIM data model and filtered according to the privacy criteria.
The provider returns the aggregated information, which is stored in a database.
The results are shown in the Federated Data Explorer.

4.5.3. Envisioned Technologies

This needs to be completed.

GA4GH Beacon v2 (https://github.com/ga4gh-beacon/beacon-v2/)
BBMRI-Sample Locator (https://github.com/sample/lens), a technical solution for generating search queries in multiple query languages (Beacon v2, CQL, AQL, ...) already using the Lifescience AAI and integrated with Negotiator.

4.5.4. Actions

Define the EUCAIM data schema.
Define the API for the Federated Query Service,
Agree on the query criteria (Sex, Cancer type, Image Modality, Age Group, data access conditions).
Define the output (count, collection / dataset identifiers and access point, access policies).
Define the searching language.
Define the functionalities available in the Federated Data Explorer
Define the look and feel for the Federated Data Explorer.

4.6. Resource Negotiation

In this section, we will discuss the tools and services required for submitting, refining and approving or rejecting both access requests and tool or data provisioning.

In order to integrate and apply the Data Access Committee's decisions with EUCAIM’s platform services, the Federated Data Explorer will also request access control metadata for its user in order to enable or disable access to specific functionalities based on the user’s role, access level, and depending on each provider’s access policy.

4.6.1. High-level architecture

Figure 9 shows the high-level architecture of the negotiation service. Three actors will interact with the services: the data requester, who will initiate the process and provide the necessary information for evaluating the request; the evaluation committee boards which will evaluate the technical, ethical, legal and scientific part of the proposal; and the Security Manager who implements the access permissions.

Figure 9. High Level interaction of the components in the negotiation.

There are three scenarios considered in the authorisation of the data access:

A user identifies useful datasets in the Catalogue, from the central storage, and requests access to them. In this case, the Access Committee can take the decision based on the application.
A user identifies useful datasets in the Catalogue, from the federation, and requests access to them. In this case, additionally and depending on the agreement the negotiator system will inform the providers who will have to positively confirm the request.
A user defines inclusion and exclusion criteria for their needed datasets, and requests the federation to make data newly available. In this case, special agreements should be signed with the data providers.

4.6.2. User interactions

An access negotiation process similar to the one needed in EUCAIM is shown in the figure 10. After having found a set of interesting resources a user can define requests for data access. Such access requests can be targeted towards one or several data providers at the same time. A minimum set of information needs to be provided with each request. Particular data providers may ask for specific additional information to be provided according to their needs, thus extending this minimum set. A first principal check on the validity of the request will be done before being sent to the Access committee. After that step negotiation starts with one or several data providers which means communication either on a bilateral channel with a singular provider or with all included/selected data providers. Within this process additional information may be questioned by the data providers or the requesting researcher to figure out if request and potential data offers are matching. If there is a match between requester and data provider the provider may flag availability of what was requested, otherwise the provider may step back.

figure 10: Flow of interactions for requesting access to a collection.

In summary, the steps are the following:

A data requester selects the collections that she wants to access through the federated query and adds them to the MyLibrary section.
The data requester submits an application to access the data, including a project with the purpose of the study (including hypothesis and methodology), the selection of cases (from the output of the filter of the federated query), the data fields needed, the ethical approval, the planned tools to use, the approval and endorsement of her institution, and eventually the need for processing resources).
The EUCAIM Access committee evaluates the completeness of the application and requests technical, ethical & legal and scientific assessment of the application.
If the application fulfils the eligibility criteria, data providers are queried about the availability of such data (a user may request variables that are not fully available in the providers’ data sets).
The Data requester receives the result of the application and the data access conditions, and decides if she will have to refine the request. For example, a requester may find that the access condition for one of the providers does not fit her working procedure or that the real availability of data from a provider is insufficient.
When the negotiation is completed, and depending on the conditions of the providers, EUCAIM generates a Data Sharing, Data Processing or Data Transfer Agreement that the Data Requester has to sign.
When signed, EUCAIM security manager assigns the proper roles and permissions to the user.
The providers either trust on the permissions in Life Science AAI or implement the permissions on their premises.
The results produced by the requester may be pulled to the provider, therefore becoming part of EUCAIM. The adoption of the resulting data will require the verification of the quality by the provider. The access conditions may even oblige the data requester to make the resulting data available in the platform.

4.6.3. Envisioned Technologies

The service may be based on the negotiator (https://negotiator.bbmri-eric.eu), described in https://www.liebertpub.com/doi/10.1089/bio.2020.0144.

4.6.4. Actions

This are the actions envisaged for the short time:

Define the documentation requested by the Data Access Committee in collaboration by WP2
Define the look and feel for the Data Access Committee Platform.

4.7. Access to Data

Once a data requester user has access granted to a specific set of collections, the user would be able to access such data according to the access conditions of the provider. At this moment, EUCAIM identifies three access models:

Full access to data, including downloading. A provider may register fully public collections which could be downloaded to the data requester premises.
Full on-premises access to data, which will enable a user to view and process data inside the resources of a provider, without the capability of downloading data (including processed data) from the platform. Access could be for a limited period of time.
Restricted on-premises access to data, which will enable a user to process the data using a set of authorised tools without the ability of displaying data (and obviously without the capability to download them). In this case the provider may release a reduced sample of significant data to facilitate the data requester to understand the nature of the data and its organisation. Again, access could be granted for a limited period of time only.

In any case, the providers should include a detailed description of how the data is made available, considering:

The format of the files with the data (e.g. DICOM, OMOP JSON files)
The structure of the files in the system (such as a hierarchical organisation of case / study / series / image file).
A data dictionary for the fields in hyperontology.

The collections that a user has access granted will be displayed on a personal area of the user in the Dashboard. This personal area will include the endpoints for accessing the data, in case of the first two access models. The access will forward the user to the provider’s services to access and explore the data.

4.7.1. High-level architecture

The architecture for the data access in cases 1 and 2 is depicted in figure 11. Data access for case 3 will be closely linked to the data processing case and described in the following section.

Figure 11: Data Access architecture, cases 1 and 2. Case 3 will be described along with the processing

4.7.2. User interactions

4.7.2.1. Data Access in case 1 (full access)

An Authenticated user has the access endpoint for the collections requested in their MyLibrary area of the Dashboard.
The user accesses the provider’s services to access the data, so she can download it to their premises or to a processing provider storage.

4.7.2.2. Data Access in case 2 (Full on-premises access to data)

An Authenticated user has the access endpoint for the collections requested in their MyLibrary area of the Dashboard.
The user explores the data and performs data analytics on the provider’s premises.

4.7.2.3. Data Access in case 3 (Restricted on-premises access to data)

An Authenticated user will choose the data collection for a processing job through the processing interface of the Dashboard.
The user will be able to run processing jobs to refine the subset of data of interest (e.g. filtering by modality, protocols, orientation, etc.).
This filtering will be used as an input for the processing jobs.

4.7.3. Envisioned Technologies

Data access will depend on the provider’s resources.

4.8. Federated Processing

In this section, we will discuss Federated Processing, which is a method of processing data that involves multiple entities collaborating on a task without sharing their data directly. This is achieved by using a decentralised architecture that allows each entity to maintain control over its data while still contributing to the task at hand.

4.8.1. High-level architecture

The high-level architecture of our Federated Processing system involves a central platform, called EUCAIM, that coordinates the collaboration between the entities. The platform is built using state-of-the-art technologies that enable secure data sharing and processing, while ensuring the privacy and confidentiality of the data.

As part of our Federated Processing system, we plan a distributed analysis engine that can deploy and support the entire life cycle of Federated Analysis or Federated Learning experiments.

Figure 12: Layers of the Distributed and Federated processing component.

As depicted in figure 12, using a federated learning use-case as example, we would propose the following architecture:

Analysis Platform REST API: user interface where the user triggers an experiment and retrieves the results (The API can be called from the platform’s dashboard, its command line tools, or from the EUCAIM general dashboard).
Message broker: based on RabbitMQ, with which we establish/trigger the communication between the client nodes and central node..
Management framework: In order to trigger the experiment at each node, whatever framework is instantiated (e.g. a docker-based local implementation of the Flower client, at the local node, and Flower server, at the master node, are instantiated).
Federated Learning models: these are materialised using git and mounted into the FL framework client docker image.

A simplified version of the architecture is presented in figure 13.

Figure 13: Architecture of the distributed processing environment.

In a more general case, Flower would be replaced by containers providing the desired server and client functionalities and adapted to be compatible with the Management framework. We envision three types of analysis tools to be integrated in the system:

Tool

Type

Description

Site- specific

Data Access Tools?

Interactive Remote Tool

Interactive session (i.e. Jupyter notebook) remotely deployed in the federated site accessing the local data storage

Data Preprocessing Tools

Batch Remote Tool

Preprocessing pipeline accessing the local storage and producing datasets ready to be consumed by FL

FL analysis tools

FL Tool (server+clients)

Currently, the prototype integrates Flower, supporting both ML and DL analyses

4.8.1.1. Requirements

Functional requirements:

Service should provide OIDC-compliant APIs and with appropriate documentation (ideally machine readable) for inter-module communication.
Service should be able to interoperate with other components of the platforms, i.e. main dashboard, data query and access services, etc.
The federated processing platform should be able to automatically start the process at the local sites.
Service should allow for user identification through the EUCAIM AAI system.
Service should provide the capability to load and choose a specific model/analysis
Service should provide example aggregator methods for the supported ML and DL models
Service should provide an interface for the user to start and monitor the process
Service should support a secure communication system between the different nodes

Non functional requirements:

Hardware: GPUs that support CUDA and have at least 12GBs of VRAM and 32GBs RAM, recommended: GPUs with more 24GBs VRAM, 64GBs RAM
Allowed outgoing network connection to be able to connect to EUCAIM core services and Federated processing management.
Allowed software installation (docker or singularity containers)
Allowed access to Data APIs

4.8.2. User interactions

To access the EUCAIM platform, a user must first register through the Life Sciences Login (LS Login) using their institutional identity provider (IdP). Once registered, the user can access the dashboard and sign up for the platform. During the registration process, the user must accept the User's & Privacy Policy, and their data is registered in the platform. Depending on their role, the user may have access to certain features and functionalities.

4.8.2.1. User Registration

A user accesses the Dashboard of EUCAIM.
A user signs up in the platform
1. Through the Life Sciences AAI
2. Using his/her Institutional IdP
3. Accepts the User’s & Privacy Policy.
4. User’s data is registered in the Platform.
5. The User receives an “external user” role or even no role.

4.8.2.2. User flow for a data scientist

From here, depending on their role, the flow would be different. For instance, for a data scientist the flow would be the following:

Access to the “Data Processing Service” from the services catalogue at the EUCAIM Dashboard
Configure the federated analysis experiment run
Select Tool (i.e. AI-model, data pre-processing workflow, etc)
Set up Tool (i.e training parameters)
Select among available (remote) datasets/nodes, filtered according to the authorization credentials.
Set min technical requirements (CPUs/RAM/GPU) (optional)
Launch the federated analysis workflow as an unattended job
Monitor job status
Retrieve and analyse results

4.8.3. Envisioned Technologies

ELIXIR Tools platform: https://elixir-europe.org/platforms/tools
RabbitMQ Message broker
openVRE based dashboard
Building Blocks Strategy for packaging
Docker/singularity containers for software deployment
OIDC based AAI (KeyCloak))

4.8.4. Actions

Design and coordination of the data loaders (mediators) associated with the different local repositories that would materialise the data ready to be analysed.
Definition and implementation of a REST API supporting OIDC to manage federated analysis tasks.
Design and implementation of a dashboard (web-based user interface) for composing, launching and monitoring federated analysis experiments to the underlying distributed platform.
Agree and define requirements for integrating any tool (or “FL analysis tools”) into the FL-framework

4.9. Central Repository

4.9.1. Overview

EUCAIM needs a Central Repository, which is used for storing data:

that cannot be stored inside a local organisation because the data needs to be seen by others (e.g. annotated by external parties not having access to the local organisation);
the local organisation is not able to store this data, either technically, financially or legally;
where the local organisation is not able to participate in a federated/distributed analysis on premise (due to any reason) and the data holder/controller wants to participate in the federated analysis experiment;
that is donated to the research community.

There are roughly two types of data to be stored centrally: imaging data and non-imaging data, where the non-imaging data consists of clinical (observational) variables and genetic descriptions.

4.9.2. Requirements

4.9.2.1. User Requirements

The user requirements are formatted as user stories and are grouped by role/profile. The following user profiles are considered in this section: data providers, data requesters,

As a data provider,

I want to store my data for long term use and collaborations (not related to the runtime of EUCAIM).
I want to be supported with preparing my data for sharing with the central repository, by being able to access detailed guidelines, training material and/or tools recommendations (e.g., for data de-identification, annotation, quality checking, etc.).
I want to be informed of the conditions that apply to storing the data at the Central Repository.
I need to be able to control access to my data stored in the repository on collection/project level.
I want to be able to update the meta-data of a collection I previously uploaded when some information in the meta-data gets outdated.
I want my imaging data to be stored as DICOM on the Central Repository.
I want to upload any other clinical data related to the imaging cases in the Central repository in a way I can link the data belonging to the same entity.
I want to have a staging area where I can curate data, assure the quality and prepare it for being released as a collection, including the possibility to have it reviewed by someone.
I want to be able to store non-imaging data (clinical variables and genetic descriptions) through interactive forms.
As a consortium, we need to have a staging area for doing final curation and Quality Assurance (de-id, quality, etc.) as part of the Central Repository. (So we can meet a minimal quality standard we need to define in the Data Management Plan and “Rules of Contributing” if something like that is planned for in this consortium).

As a data requester,

I want to inspect the imaging data that is stored on the Central Repository.
And I am logged in, I want to be able to see a full list of cases I am authorised to have access to, that have been uploaded to the Central Repository using a case explorer.
And I am logged in, I want to be able to visualise the imaging data I am authorised to access, in an interactive environment.
As an authorised data requester, I want to be able to preprocess imaging data I have access to in an interactive environment.
As a data requester, I want to be able to find the data stored on the Central Repository through a query mechanism in a EUCAIM catalogue or portal (or one of the catalogues/portals part of the network)
As a data user, I want to be able to create datasets composed of subgroups of cases of the Central Repository.

Remaining questions to distil requirements from:

Who are the users of the central storage? Two types of users are considered: Data controllers contributing with data and data requesters accessing it.
What is the “decision tree” on when to store data on the central storage: This will be the result of submitting an application.
Does the central storage need to be linked to the storages of the partner projects? In what directions? Does it need to do data federation? The central storage is another element of the federation and acts as an infrastructure for those that cannot host local nodes.
How do requesters get access? (Through Dashboard & DAC)
How do providers get access? (Depends on the underlying technology and rules of participation)
Types of data: imaging and non-imaging (clinical). Do we store these together, or do we use separate storages (e.g., EGA as clinical data storage)? It will depend on the technology.
Data model and federated query: It will support the Federated Query service of the Platform.
Authorised data providers will be able to upload anonymised data to the central repository. The request for having the rights to upload data will include the availability of a Data Management Plan, the assessment of the quality and the compliance to the anonymisation guidelines, among other considerations.
Data requesters (researchers) will also have a personal data space to store temporary or own additional data which will not be part of the federation.

4.9.3. Envisioned Technologies

The model will support more than one instance of a Central Repository. In the envisaged deployment we foresee two instances, which should be able to cater for the same requirements as discussed in the section above. Technically, all instances of the Central Repository should be accessible through the same kind of programmatic interfaces to be compatible with the dashboard and the other components in EUCAIM it needs interfacing with. A Central Repository instance needs to be able to take part in a Federated Processing experiment and as such needs to be able to provide interfacing that is required for that, this follows from the Federated Processing section in this document.

The choice for a specific instance of the Central Repository can be based on local regulatory or legal requirements, proximity/affinity to specialised or specific facilities or equipment, etc.

4.9.3.1. QP-Insights® (Quibim)

The QP-Insights system allows the management, storage and analysis of large volumes of multi-omic data and medical images in clinical studies and research projects like Primage, ChAImeleon, ProCancer-I or RadioVal. It will act as a case explorer being the central node interface for the storage of the repository located at the UPV and also the interface in the federated nodes for the clinical validation of the image tools and algorithms.

4.9.3.2. Health-RI XNAT (https://xnat.bmia.nl/) (EMC)

Used in e.g. EuCanImage, euCanSHare, RadioVal, EOSC4Cancer.

Data ingestion: multiple flavours, basically anything that can talk to the XNAT DICOM Receiver

Should support the proposed Data Model applicable for Imaging Data from TaskForce One and WP5 tasks.

4.9.3.3. CHAIMELEON Repository technology

CHAIMELEON uses a central platform that combines processing and storage nodes, providing in-situ processing capabilities for the data stored. The platform uses a K8s deployment for the management of the platform services and for running the processing tasks. Data is stored in a Data Lake implemented through Ceph. Figure 14 shows a diagram of the architecture. The code of the deployments and the services is available in github[2].

Figure 14: Architecture of CHAIMELEON Platform.

4.9.4. High-level architecture

The high-level architecture of the central storage is depicted in figure 15. The components described are:

Q API: Query API
Catalogue: Place where to store metadata about the collections contained in the Repository
AAI: Authentication & Authorization Infrastructure
FDP: Fair Data Point (means of exposing the metadata of a repository: https://www.fairdatapoint.org/)
Web UI: User interface for data management, application management, data browsing and inspection, use of a case explorer and data annotation, etc.
A/IO API: Access and Data I/O API.
Annotation and analysis: data resulting from annotation and analysis of raw data.

figure 15: Architecture of the Central repository.

4.9.5. User interactions

4.9.5.1. User’s Registration

A user accesses the Dashboard of EUCAIM.
A user signs up in the platform
1. Through the Life Sciences AAI
2. Using his/her Institutional IdP
3. Accepts the User’s & Privacy Policy.
4. User’s data is registered in the Platform.
5. The User receives an “EUCAIM general role”.

4.9.6. Actions

Inviting the T4.4 participants for formulating this architecture as a group
Make sure the schematic shows the relations the repository has with it’s connected components and services in the full architecture model

4.10. Helpdesk

Providing technical support and assistance to the users accessing the EUCAIM platform is another fundamental aspect to be considered. To assist the main stakeholders of the EUCAIM platform and receive the proper level of support, the project leverages on the long-running experience of the EGI.eu in providing IT support to the users of the distributed and federated e-Infrastructure.

The EUCAIM Helpdesk will act as a single point of contact for all users of the EUCAIM platform for requesting help, support and other requests. It provides ticket management and allows to track the inquiries related to EUCAIM services, resources, projects and general questions.

4.10.1. High-level architecture

The EUCAIM Helpdesk is based on the open-source Zammad Ticketing System[3]. The high-level architecture of the Helpdesk service is shown in the Figure 16.

Figure 16. High-level architecture of the EUCAIM Helpdesk (based on the EOSC Helpdesk). EOSC Stands for the European Open Science Cloud (https://eosc.eu/). Jira is a proprietary issue tracking product from Atlassian (https://www.atlassian.com/software/jira).

The major features offered by the the EUCAIM Helpdesk are:

Open-source technology: The helpdesk is built on open-source technology, ensuring transparency, flexibility, and the ability to leverage community contributions.
Modern UI: The helpdesk features a modern and user-friendly interface, making it easy for users to navigate and access the assistance they need. The interface is designed to enhance the overall user experience and simplify interaction with the helpdesk.
Smart Search: The helpdesk is equipped with a powerful search functionality that allows users to quickly find relevant information. By leveraging advanced search algorithms, users can easily locate solutions, knowledge articles, and other resources to resolve their queries efficiently.
Knowledge Base (KB): The helpdesk incorporates a comprehensive knowledge base that contains a repository of valuable information, such as FAQs, troubleshooting guides, and best practices. Users can access this extensive knowledge base to find self-help resources and solutions before reaching out to the Support Team.
Custom workflows: The helpdesk enables the creation of custom workflows tailored to specific community requirements. This includes filters, automatic ticket assignment, automatic escalation procedures, and notifications. These workflows streamline the support process, ensuring that tickets are properly categorised, assigned to the appropriate agents, and escalated when necessary.
Multiple access channels: The helpdesk provides users with various channels to access support. Whether it's through a web portal, email, webform, messenger or other communication channels, users can choose the most convenient way to interact with the helpdesk and receive timely assistance.
Easy integration with other services: The Helpdesk is a scalable solution which can be easily integrated with other community helpdesks, messengers like Slack, rocket chat etc.

The service is already connected to the EOSC (European Open Science Cloud) AAI. In the context of the EUCAIM project, the Authentication and Authorization framework will be further enhanced in order to support the EOSC Life AAI.

4.10.2. User interactions

There are several ways to submit ticket to EOSC Helpdesk using different interfaces:

Via the EUCAIM Helpdesk dashboard (requires login via EOSC AAI).
Via e-mail. A dedicated mailing-list will be set-up for serving the requests of the project.

4.10.3. Envisioned technology

Life Science AAI. See documentation in https://lifescience-ri.eu/ls-login.html.
Zammad Ticketing System. See here: https://zammad.com/en

4.10.4. Actions

The following actions should be implemented to prepare the integration:

Define the specific Support Units (SU) to be configured in the EUCAIM Helpdesk
Integrate the EUCAIM Helpdesk with the EOSC Life AAI.
Define a Technical Support Teams and how it articulates with the helpdesk structuring

Previous3. High-level Architecture Next5. MM2 Early Prototype

Last updated 1 year ago