Skip to content

Latest commit

 

History

History
147 lines (127 loc) · 22.4 KB

README.md

File metadata and controls

147 lines (127 loc) · 22.4 KB

Plasma cell-free DNA methylomes for hepatocellular carcinoma (HCC) detection and monitoring of recurrence after liver resection or transplantation

Introduction

The detection of hepatocellular carcinoma (HCC) via mutational profiling of cell-free DNA (cfDNA) in blood plasma is hampered by mutational diversity and the challenge of acquiring tumor tissue for targeted gene panels. Alternatively, DNA methylation patterns can illuminate biological processes without requiring prior mutation-specific knowledge. Our investigation evaluated the effectiveness of cell-free methylated DNA immunoprecipitation and sequencing (cfMeDIP-Seq) in identifying HCC and monitoring its recurrence after surgery. This research represents the first application of cfMeDIP-Seq to HCC for both its identification and the tracking of residual disease post-operatively. We collected baseline plasma samples from HCC patients undergoing curative interventions or liver transplants, as well as from healthy individuals, including living liver donors. Utilizing cfMeDIP-Seq and machine learning, we identified HCC-specific differentially methylated regions (DMRs) and developed a HCC methylation score (HMS). This score accurately distinguished HCC with high specificity and sensitivity, and importantly, predicted the likelihood of postoperative recurrence. Analysis of post-surgery follow-up samples using cfMeDIP-Seq revealed that changes in HMS were indicative of future recurrence, demonstrating cfMeDIP-Seq’s utility for both precise HCC detection and recurrence prediction without the need for tumor-specific mutations.
Our findings demonstrate the potential of cfMeDIP-Seq for sensitive and specific HCC detection irrespective of tumor type, as well as for predicting recurrent disease post-surgery. The comprehensive analysis of tumor-agnostic cfDNA methylomes accurately detected HCC and forecasted recurrence following liver resection or transplantation. This innovative approach may address current challenges in minimally invasive HCC diagnosis using liquid biopsy, offering significant implications for screening, early detection, treatment decisions, and disease progression monitoring and therapy response assessment.

Contents

The repository contains bash scripts, data processing scripts, and R scripts designed to reproduce figures for the manuscript "Plasma cell-free DNA methylomes for hepatocellular carcinoma detection and monitoring after liver resection or transplantation"

HCC cfMeDIP Study Workflow

Graphical_abstract

HCC cfMeDIP Data Analysis Scheme

HCC_cfMeDIP_data_analysis_shceme_v2

Setup

R scripts were executed in either R version 3.5.0 or 4.0.1, depending on the specific package dependencies (listeded in Table 1), utilizing RStudio versions (2022.12.0+353 ~ 2023.12.0+369).

The R packages required are detailed within each script.

Dependencies

Tool Function Version Running_Platform Language Alternative_tool Link
FastQC FASTQ QC 0.11.5 H4H_shell Java http://www.bioinformatics.babraham.ac.uk/projects/fastqc
MultiQC ❶ FASTQ QC
❷ count reads
1.7 H4H_shell Python https://multiqc.info
fastp ❶ ISD QC3
❷ trim
0.23.1 H4H_shell C++ ❶ Trim Galore
❷ Trimmomatic
❸ Cutadapt
https://github.com/OpenGene/fastp
bowtie2 align 2.4.5 H4H_shell C++ BWA-mem https://bowtie-bio.sourceforge.net/bowtie2/index.shtml
qualimap BAM QC 2.2 H4H_shell Java http://qualimap.conesalab.org
SAMtools sort - index 1.14 H4H_shell C Picard http://www.htslib.org
sambamba deduplicate 0.7.0 H4H_shell D ❶ SAMtools
❷ Picard
https://lomereiter.github.io/sambamba
MEDIPS ❶ QC2
❷ export wig
❸ DMRs/ROIs(edgeR)
1.50.0 H4H_R/3.5.0 R DESeq2 https://doi.org/doi:10.18129/B9.bioc.MEDIPS
MeDEStrand absolute_methylation_level 0.0.0.9000 H4H_R/3.5.0 R https://github.com/jxu1234/MeDEStrand
sva/ComBat_seq reduce batch effect 3.46.0 H4H_R/4.0.1 R https://github.com/zhangyuqing/ComBat-seq
edgeR ❶ TMM normalization
❷ CPM counts
3.28.0 H4H_R/4.0.1 R https://bioconductor.org/packages/release/bioc/html/edgeR.html
limma ❶ vCounts
❷ DMRs(CPM-trend)
3.42.0 ❶ H4H_R/4.0.1
❷ RStudio
R https://bioconductor.org/packages/release/bioc/html/limma.html
FactoMineR/factoextra PCA 2.8/1.0.7 ❶ H4H_R/4.0.1
❷ RStudio
R https://rpkgs.datanovia.com/factoextra/index.html
randomForest ❶ build & train models
❷ classification prediction
4.7.1.1 ❶ H4H_R/3.5.0
❷ RStudio
R https://www.stat.berkeley.edu/users/breiman/RandomForests
glmnet regularized linear modeling 4.1.7 ❶ H4H_R/3.5.0
❷ RStudio
R https://glmnet.stanford.edu/index.html
caret model training configuration 6.0 ❶ H4H_R/3.5.0
❷ RStudio
R https://topepo.github.io/caret
pROC AUROC 1.18.0 RStudio R https://xrobin.github.io/pROC
ggplot2 data visualization 3.4.2 RStudio R https://ggplot2.tidyverse.org

Scripts List

Step Script Analysis Panels Generated
1 step_1_fastq_to_bam.sh
1s step_1s_ISD.sh QC Figure S2B/2C
2 step_2_bam_to_wig.R
3 step_3_wig_to_txt.sh
4 step_4_matrices_processing.R
5 step_5_medips_qc.R QC Figure S2A
6 step_6_hcc_classifiy_hms.R kFold modeling
AUROC
HMS
PCA
Confusion matrix
Simply moving average
Postoperative HMS trajectory
Figure 2
Figure 3
Figure 4
Figure 5
Figure S3C
Figure S4
7 step_7_hcc_subtyping.R HCC subgrouping Figure S3A/B
7s step_7s_one_vs_each_hcc.R OnevsEach.hcc() required by hcc_subgroup.R

Supporting Files List

Name Size Script using this file Description
hg19_chr1_22_m_coord.rds 90 MB step_4_matrices_processing.R genomic coordinates for 4801145 300bp bins (hg19, chr:1-22 and M)
hg19_chr1_22_coord.rds 94.4 MB - optional if only chr:1-22 needed
hg19_chr1_22_x_y_coord.rds
101.4 MB - optional if only chr:X&Y needed
hg19_chr1_22_m_x_y_coord.rds 101.4 MB - optional if only chr:M needed
black_bin_v2.RData 7.8 MB step_6_hcc_classifiy_hms.R 109120 bins falling in ENCODE blacklist regions
n236_cpm.rds 1.31 GB - cfMeDIP signals (summed CPM) analysis
n236_lcpm.rds 1.33 GB step_6_hcc_classifiy_hms.R matrix for kFold cross-validation limma-trend
vCount_n236.RData 1.33 GB - option for limma-voom
sample.rds 1KB step_6_hcc_classifiy_hms.R Clinical metadata
sample_sq.rds 1KB step_6_hcc_classifiy_hms.R Clinical metadata
sample_124.rds 1KB step_7_hcc_subtyping.R Clinical metadata
kFold_ML_res.RData 1.5 MB step_6_hcc_classifiy_hms.R machine learning result of discovery cohort
Val.prob.list.rds 11.1 MB step_6_hcc_classifiy_hms.R machine learning result of validation cohort

FastQ to BAM Processing Summary

File Type Storage Path Operation/Function
Merged FastQ Files ../bam cat for merging L001 and L002
FastQC Pre-Processing ../fastqc_pre fastqc for initial QC
Fastp Reports ../fastqc_pre/fastp_report fastp for quality and adapter trimming
FastQC Post-Processing ../fastqc_post fastqc for QC after fastp
Aligned SAM Files Temporary storage bowtie2 for alignment
Unsorted BAM Files Temporary storage samtools view for SAM to BAM conversion
Sorted BAM Files ../bam samtools sort for sorting BAM
BAM Index Files ../bam samtools index for indexing BAM
Pre-QC Reports ../fastq_bam_QC samtools flagstat and qualimap
Deduplicated BAM Files ../bam sambamba markdup for deduplication
Post-QC Reports ../fastq_bam_QC samtools flagstat and qualimap

Please note that temporary files, such as unsorted BAM and aligned SAM files, are removed after they are no longer needed to save storage space.

MEDIPS_QC Results List

This table clearly delineates the output files from the MEDIPS QC analysis, where to find them, and which specific function in the MEDIPS package or custom script is responsible for generating each file. This comprehensive overview aids in understanding the workflow and accessing specific results for further analysis or reporting.

Result File Description File Pattern/Type Storage Path Generating Function
Saturation Plots _SaturationPlot.pdf outdir/pdf/ MEDIPS.plotSaturation()
Sequence Coverage Pie Charts _SeqCoveragePlot_Pie.pdf outdir/pdf/ MEDIPS.plotSeqCoverage() (pie)
Sequence Coverage Histograms _SeqCoveragePlot_Hist.pdf outdir/pdf/ MEDIPS.plotSeqCoverage() (hist)
Sequence Coverage Pie Charts (Including Non-Unique Reads) _SeqCoveragePlot_Pie_uniqF.pdf outdir/pdf/ MEDIPS.plotSeqCoverage() (pie)
Sequence Coverage Histograms (Including Non-Unique Reads) _SeqCoveragePlotHist_uniqF.pdf outdir/pdf/ MEDIPS.plotSeqCoverage() (hist)
Saturation Metrics (FigureS2A, BAM QC) _saturation.txt outdir/QCstats/ MEDIPS.saturation()
Coverage Metrics _coverage.txt outdir/QCstats/ MEDIPS.seqCoverage()
CpG Enrichment Analysis (FigureS2A, BAM QC) _enrichment.txt outdir/QCstats/ MEDIPS.CpGenrich()
Comprehensive QC Statistics _QCstats.txt outdir/QCstats/ Custom aggregation of MEDIPS data
Window Coordinates per Chromosome _window_per_chr.csv outdir/ or specified subdirectory Custom script using MEDIPS data
PNG versions of Saturation, Coverage, and Calibration Plots Various PNG files outdir/png/ MEDIPS.plotSaturation()
MEDIPS.plotSeqCoverage()
MEDIPS.plotCalibrationPlot()

H4H - R Compatibility Chart

Package Version H4H/R3.5 H4H/R4.0 H4H/R4.2 --mem Running Time
MEDIPS 1.50.0 ≥ 180G ~1 day
MeDEStrand 0.0.0.9000 ≥ 180G 1~2 days
ChAMP 2.28.0 NA ≥ 300G 1~2 days
DESeq2 1.38.2 ≥ 180G ~1 day
limma 3.42.0 1~2 days
edgeR 3.28.0 1~2 days
FactoMineR 2.8
factoextra 1.0.7
dplyr 1.1.1
tidyr 1.3.0
sva 3.46.0 ≥ 500G ~1 day
caret 6.0 ≥ 80G ~2 days
randomForest 4.7.1.1 ≥ 80G ~2 days
glmnet 4.1.7 ≥ 80G ~2 days

Key Arguments

Tools Arguments Support Data Dependencies
fastp -w 10
-g
-f 5
-F 5
--adapor_F==
--adapor_R==
--correction
--html=$.html
Adaptor Sequences Table
Bowtie2 -p 10
--minins 50
--maxins 500
-x hg19
hg19.1.bt2
hg19.2.bt2
hg19.3.bt2
hg19.4.bt2
hg19.rev.1.bt2
hg19.rev.2.bt2
samtools view -@ 10
-bS
-f 2
-F 4
qualimap -c
--java-mem-size=30G
-gff hg19.gtf
-outformat pdf
-nt 10
hg19.gtf
sambamba markdup -r
-p
-t 10
--overflow-list-size 600000
MEDIPS/QC - BSgenome="BSgenome.Hsapiens.UCSC.hg19"
- uniq <- 1
- extend <- 300
- shift <- 0
-ws <- 300
- paired <- TRUE
- chr.select <- paste0("chr",c(1:22,"X","Y","M"))
NA BSgenome.Hsapiens.UCSC.hg19
MEDIPS/wig_to_counts - BSgenome="BSgenome.Hsapiens.UCSC.hg19"
- uniq <- 1e-3
- extend <- 300
- shift <- 0
- ws <- 300
- paired <- TRUE
- chr.select <- paste0("chr",c(1:22,"M"))
hg19_chr1_22_m_coord.rds
hg19_all_chr_coord.rds
hg19_chr1_22_x_coord.rds
hg19_chr1_22_x_y_coord.rds
BSgenome.Hsapiens.UCSC.hg19
edgeR
MeDEStrand - BSgenome="BSgenome.Hsapiens.UCSC.hg19"
- uniq <- 1
- extend <- 300
- shift <- 0
- ws <- 300
- paired <- TRUE
- chr.select <- paste0("chr",c(1:22,"M"))
BSgenome.Hsapiens.UCSC.hg19
BSgenome
GenomicRanges
MEDIPSData

Data availability

Raw sequencing data for all 203 HCC samples and 23 healthy controls are deposited at the European Genome-phenome Archive (EGA) under the accession number EGAS50000000450. Raw sequencing files for another 12 healthy normal controls can be requested at https://www.ontariohealthstudy.ca/for-researchers/data-access-formsand-templates. Processed intermediate data and objects used to replicate the results in this study are available at https://doi.org/10.5281/zenodo.11251606.

Contact

Kui Chen, [email protected]