Home
Hands-on example for implementing the federated data analysis infrastructure using DataSHIELD for two epidemiological studies with harmonised data
NFDI4Health (National Research Data Infrastructure for personal health data)
T5.2 Epidemiology of chronic diseases, T3.7 Distributed data analysis infrastructure
DFG project number: 442326535
Authors: Carolina Schwedhelm, Sofia Maria Siampani, Florian Schwarz, Katharina Nimptsch, Matthias Schulze, Hajo Zeeb, Tobias Pischon
Contact people: Katharina Nimptsch ([email protected]), Sofia Maria Siampani ([email protected])
Affiliations: Max Delbrück Center for Molecular Medicine (MDC), German Institute of Human Nutrition Potsdam-Rehbrücke (DIfE), Leibniz Institute for Prevention Research and Epidemiology (BIPS)
Version: 2.0
Last updated: 15.07.2024
Federated data analysis is a major opportunity to analyse data distributed across different Data Holding Organisations without physically sharing them, in accordance with data protection regulations and intellectual property rights. DataSHIELD is a software solution for secure analysis of personal health data in the programming language R. Data holders keep their data behind a firewall on dedicated servers (Opal servers), while researchers analyse the data remotely under tight control: they send analysis requests and receive summary statistics back.[1]
Data access is granted to the researcher on the basis of a research project. Within the consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health), there are currently two ongoing projects that serve as pilot studies to implement the necessary infrastructure to conduct federated data analysis:
- Systematic investigation of methodological limitations in the derivation of exploratory dietary patterns
- Association of dietary sugar intake, sugar sweetened beverages and related foods with prospective changes in body fatness and chronic disease risk
These pilot studies are being carried out with the participation of 16 German epidemiological studies. As a prerequisite for joint data analysis with DataSHIELD, study data have to be harmonised. Within NFDI4Health, we developed and published a harmonisation protocol (https://github.com/nfdi4health/data-harmonisation-protocol/wiki)[2] and a standard operating procedure (SOP) for implementing the necessary data access infrastructure (https://github.com/nfdi4health/opal-datashield-sop/wiki)[3] to support our partners in the process. The data analysis infrastructure has already been successfully implemented at 7 institutes (a further 6 installations are in progress), including the ActivE Study at MDC and EPIC-Potsdam at DIfE.
The following summarises the necessary steps for the execution of the pilot studies (represented also in Figure 1):
1a. Collecting metadata: The Data Holding Organisations (DHOs) provide the necessary metadata to the researcher, to give insight into the variables that will be needed for the analysis.
1b. Setup of distributed data analysis infrastructure: The DHOs implement the necessary infrastructure, including installation and configuration of Opal, an R server with DataSHIELD packages, and a database engine (e.g. MySQL, MongoDB).
2. Data harmonisation: The DHOs harmonise their datasets following the harmonisation protocol, using the R package “Rmonize”.
3. Preparation of datasets and data upload: The DHOs upload the data dictionary and the harmonised dataset to their local Opal server. They give the researcher permission to analyse the data remotely.
4. Data analyses: The researcher can start the federated analysis with DataSHIELD using the Central R Server, hosted by MDC.
Figure 1: Steps in the harmonisation and federated data analysis pilot studies in NFDI4Health
This protocol explains the process required to implement the data access infrastructure for federated analysis using DataSHIELD in the context of a specific scientific project, starting with the collection of metadata and ending with the data analyses. We show the steps in detail using two of the epidemiological studies participating in the pilot studies as examples, together with a subset of the data comprising a handful of the variables included in the pilot studies.
ActivE: The aim of this small cross-sectional study (N = 50) was to develop a prediction model to assess activity-related energy expenditure using accelerometry. ActivE was conducted by the Max Delbrück Center for Molecular Medicine (MDC), which participates as a co-applicant in the NFDI4Health consortium. ActivE included participants (male and female) aged 20-69 years old and was conducted from 2012-2014. While the focus of the ActivE study is on activity-related energy expenditure, collected data included diet (using a 7-day dietary record), anthropometry, and medical history. [4]
EPIC-Potsdam: The collaborative prospective cohort study European Prospective Investigation into Cancer and Nutrition (EPIC) includes two German study centres (Potsdam and Heidelberg). EPIC-Potsdam is conducted by the German Institute of Human Nutrition Potsdam-Rehbrücke (DIfE), which participates as a co-applicant in the NFDI4Health consortium. EPIC-Potsdam is a large cohort study (N > 20,000) that originally focused on cancer and nutrition but has since been expanded to investigate a wide range of health outcomes and lifestyle risk factors. Participants (male and female) aged 35-65 years were recruited in 1994-1998. [5]
We selected four of the variables included in the pilot studies to illustrate the process:
- sex (binary variable),
- age (continuous variable),
- smoking status (categorical variable), and
- sodium/potassium intake ratio (continuous variable)
Both the ActivE Study (https://csh.nfdi4health.de/resource/44) and the EPIC-Potsdam Study (https://csh.nfdi4health.de/resource/9) have been published on the Health Study Hub. To collect the variable-level metadata, a template modified from Maelstrom Research (https://www.maelstrom-research.org/) including a list of all the required variables for the pilot studies was filled out, providing all the details about how the information was obtained. A detailed description of this template is provided in the harmonisation protocol. [2]
While the metadata were being collected, the Opal/DataSHIELD infrastructure was set up in parallel at MDC and DIfE with the help of the IT departments. The SOP (https://github.com/nfdi4health/opal-datashield-sop/wiki) was used to guide the setup.
Following the data harmonisation protocol [2] (see step 2 of the harmonisation protocol), a harmonisation strategy was developed for each of the selected variables in the ActivE Study and EPIC-Potsdam (see Table 1).
Table 1: Harmonisation strategy for the selected variables in EPIC-Potsdam and ActivE Study.
Next, the harmonisation documents were prepared as per the harmonisation protocol [2] (steps 2-4 in the protocol, where they are also shown with screenshots) for performing the harmonisation with the R package “Rmonize” [6], developed by Maelstrom Research. The templates can be downloaded here: https://maelstrom-research.github.io/Rmonize-documentation/articles/a-Glossary-and-templates.html. As part of the harmonisation using “Rmonize”, an additional variable for participant ID is needed. Therefore, for our example, the complete list of variables in the harmonised datasets is:
- ID
- SEX
- AGE
- SMOKE_ST
- SOD_POT
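To make the idea of a harmonisation strategy concrete, the following is a minimal, language-neutral sketch in Python of mapping one study-specific record onto the five harmonised variables. It is not the Rmonize workflow used in the pilot studies, which is carried out in R. The source column names (`pid`, `gender`, `smoking`, `sodium_mg`, `potassium_mg`), the category codes, the use of a simple mass ratio for SOD_POT, and all data values are assumptions made for illustration; the actual rules are those defined in Table 1 and the harmonisation protocol.

```python
# Hypothetical study-specific records; none of these values come from
# ActivE or EPIC-Potsdam.
raw_records = [
    {"pid": "A001", "gender": "f", "age_years": 34.5, "smoking": "never",
     "sodium_mg": 2400.0, "potassium_mg": 3000.0},
    {"pid": "A002", "gender": "m", "age_years": 58.0, "smoking": "ex-smoker",
     "sodium_mg": 3300.0, "potassium_mg": 2750.0},
    {"pid": "A003", "gender": "m", "age_years": 41.2, "smoking": "unknown",
     "sodium_mg": 2900.0, "potassium_mg": 0.0},
]

# Assumed target coding (illustrative only).
SEX_CODES = {"m": 0, "f": 1}
SMOKE_CODES = {"never": 0, "ex-smoker": 1, "current": 2}


def harmonise(record):
    """Map one study-specific record onto the five harmonised variables."""
    sodium = record["sodium_mg"]
    potassium = record["potassium_mg"]
    return {
        "ID": record["pid"],
        # dict.get returns None for unmapped categories, which marks a missing value
        "SEX": SEX_CODES.get(record["gender"]),
        "AGE": record["age_years"],
        "SMOKE_ST": SMOKE_CODES.get(record["smoking"]),
        # Simple mass ratio (assumption); missing if potassium is zero or absent
        "SOD_POT": sodium / potassium if potassium else None,
    }


harmonised = [harmonise(r) for r in raw_records]
for row in harmonised:
    print(row)
```

Keeping the category mappings in plain lookup tables mirrors the idea of documenting each harmonisation rule separately from the code that applies it, so the rules can be reviewed per variable and per study.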
Based on the results of the harmonisation process using “Rmonize”, the harmonised (study-specific) dataset is saved in csv format. The data dictionary corresponding to this dataset is also available from the harmonisation process; it includes the following columns: variable name, variable label, variable value type (i.e., text, integer, decimal). Categorical variables are of value type integer.
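Before uploading, it can help to confirm that the dataset and its data dictionary agree. The sketch below, written in Python purely for illustration, builds a dictionary with the three columns described above and checks a CSV export against it. The header spellings (`name`, `label`, `valueType`), the labels, the helper names `to_csv` and `check_dataset`, and the example rows are all hypothetical; the files produced by “Rmonize” and expected by Opal may be laid out differently.

```python
import csv
import io

# Data dictionary with the three columns named in the text. Header spellings
# and labels are assumptions for illustration.
DICTIONARY = [
    {"name": "ID", "label": "Participant identifier", "valueType": "text"},
    {"name": "SEX", "label": "Sex", "valueType": "integer"},
    {"name": "AGE", "label": "Age", "valueType": "decimal"},
    {"name": "SMOKE_ST", "label": "Smoking status", "valueType": "integer"},
    {"name": "SOD_POT", "label": "Sodium/potassium intake ratio", "valueType": "decimal"},
]

# How each value type is parsed when checking a cell.
CASTS = {"text": str, "integer": int, "decimal": float}


def to_csv(rows, fieldnames):
    """Serialise a list of dicts as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def check_dataset(csv_text, dictionary):
    """Return a list of problems; an empty list means dataset and dictionary agree."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = [var["name"] for var in dictionary]
    if reader.fieldnames != expected:
        return [f"columns {reader.fieldnames} differ from dictionary {expected}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for var in dictionary:
            value = row[var["name"]]
            if value == "":
                continue  # an empty cell is treated as a missing value
            try:
                CASTS[var["valueType"]](value)
            except ValueError:
                problems.append(
                    f"line {line_no}: {var['name']}={value!r} is not {var['valueType']}"
                )
    return problems


# Made-up harmonised rows, not real study data.
rows = [
    {"ID": "A001", "SEX": 1, "AGE": 34.5, "SMOKE_ST": 0, "SOD_POT": 0.8},
    {"ID": "A002", "SEX": 0, "AGE": 58.0, "SMOKE_ST": 1, "SOD_POT": 1.2},
]
dataset_csv = to_csv(rows, [var["name"] for var in DICTIONARY])
dictionary_csv = to_csv(DICTIONARY, ["name", "label", "valueType"])
problems = check_dataset(dataset_csv, DICTIONARY)
print(problems)
```

Because categorical variables are stored with value type integer, a category accidentally exported as text (for example a label instead of its code) would be reported by this check.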
Both the harmonised dataset and the corresponding data dictionary described above are uploaded to the Opal server as per the Opal/DataSHIELD SOP [3]. Figure 2a shows what the uploaded data dictionary looks like for the ActivE Study and Figure 2b for EPIC-Potsdam.
Figure 2a: Data dictionary uploaded in Opal (ActivE dataset)
Figure 2b: Data dictionary uploaded in Opal (EPIC dataset)
The last step before data analysis is possible is to create users and set DataSHIELD permissions.
Access is granted based on the requirements of a research project after signing a bilateral contract between the analyst and the DHO, and a cooperation contract with MDC. To gain access and credentials to the Central R Server (Figure 3), the analyst should contact Sofia Siampani ([email protected]). Potential costs for using the Central R Server will be agreed upon in advance. In the future, this process will be streamlined by the Central Access Point, which is currently in development.
Figure 3: Central R Server (https://workbench.posit.mdc-berlin.de/), hosted by MDC
With the credentials, which are specific to the analyst, a connection can be opened to access the data remotely via DataSHIELD. Figure 4 shows the R code to connect to the servers of MDC and DIfE and obtain access to the prepared harmonised datasets.
Figure 4: Login code to connect to ActivE and EPIC datasets
Figure 5: Initial steps of analysis on ActivE and EPIC datasets, showing descriptive summary statistics for a continuous variable (AGE) and a categorical variable (SEX).
DataSHIELD provides many functions that are well suited to an initial analysis of an unknown dataset. In most cases, the analyst first gathers basic information on the dataset, such as the variable names, the length of the dataset, or the data type of a variable (Figure 5). Using summary statistic functions, the analyst can also perform quality checks to ensure that the dataset contains reasonable values (e.g. checking that the interquartile range is plausible given the nature of the variable). A list and description of available DataSHIELD functions can be found here: https://data2knowledge.atlassian.net/wiki/spaces/DSDEV/overview.
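The principle behind such summary statistics can be illustrated with a small sketch: each site reduces its individual-level data to aggregates, and only these aggregates are combined by the analyst. The Python code below is a conceptual illustration under stated assumptions, not DataSHIELD's implementation and not the R code shown in Figure 5. The function names, the threshold `MIN_N`, and the age values are hypothetical.

```python
import math

MIN_N = 3  # hypothetical minimum number of observations a site will summarise


def site_summary(values):
    """Run 'at the site': reduce individual-level values to aggregates."""
    observed = [v for v in values if v is not None]
    if len(observed) < MIN_N:
        raise ValueError("too few observations to release a summary")
    return {
        "n": len(observed),
        "sum": sum(observed),
        "sum_sq": sum(v * v for v in observed),
    }


def pooled_mean_sd(summaries):
    """Run 'at the analyst': combine site aggregates into a pooled mean and SD."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    # Sample variance computed from the sufficient statistics n, sum and sum of squares
    variance = (total_sq - n * mean * mean) / (n - 1)
    return n, mean, math.sqrt(variance)


# Made-up AGE values for two sites; None marks a missing value.
site_a_age = [25.0, 40.0, 55.0, None]
site_b_age = [36.0, 44.0, 50.0, 60.0, 62.0]

n, mean, sd = pooled_mean_sd([site_summary(site_a_age), site_summary(site_b_age)])
print(f"pooled n={n}, mean={mean:.2f}, sd={sd:.2f}")
```

The analyst's side only ever sees three numbers per site, yet the pooled mean and standard deviation are identical to what would be obtained from the combined individual-level data.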
This protocol gives a hands-on example of the practical steps to prepare and successfully connect to the data access infrastructure including data harmonisation for federated analysis (DataSHIELD) for two epidemiological studies participating in NFDI4Health. The protocol has been made public on the GitHub repository of NFDI4Health (https://github.com/nfdi4health/datashield-access-protocol/wiki), where Data Holding Organisations interested in the implementation of this federated analysis infrastructure can consult and follow the steps of this protocol and repeat the process in other studies.
[1] Marcon Y, Bishop T, Avraam D, Escriba-Montagut X, Ryser-Welch P, Wheater S, et al. Orchestrating privacy-protected big data analyses of data from different resources with R and DataSHIELD. PLOS Computational Biology. 2021;17(3):e1008880.
[2] Schwedhelm C, Nimptsch K, Pischon T, Jannasch F, Schulze MB, Perrar I, et al. Data harmonisation protocol for pilot studies in Use Case 5.1 ‘Nutritional Epidemiology’ and 5.2 ‘Epidemiology of Chronic diseases’, Version 1.0. 2023 [updated 2023/05/22]. Available from: https://github.com/nfdi4health/data-harmonisation-protocol/wiki.
[3] Siampani SM, Schwedhelm C, Nimptsch K, Pischon T. Standard Operating Procedure for Installation and Configuration of Opal DataSHIELD in NFDI4Health, Version 2.0. 2023 [updated 2023/07/17]. Available from: https://github.com/nfdi4health/opal-datashield-sop/wiki.
[4] Jeran S, Steinbrecher A, Haas V, Mähler A, Boschmann M, Westerterp KR, et al. Prediction of activity-related energy expenditure under free-living conditions using accelerometer-derived physical activity. Scientific Reports. 2022;12(1):16578.
[5] Boeing H, Wahrendorf J, Becker N. EPIC-Germany – A Source for Studies into Diet and Risk of Chronic Diseases. Annals of Nutrition and Metabolism. 1999;43(4):195-204.
[6] Fabre G (2023). Rmonize: Support Retrospective Harmonization of Data. R package version 1.0.1, https://github.com/maelstrom-research/Rmonize/