---
title: "DolphinSuite"
---
<style>
table, th, td {
border: 1px solid black;
border-collapse: collapse;
}
body {
text-align: justify}
</style>
All raw sequencing data (scRNA-Seq, scATAC-Seq and CAGE-Seq) were processed by pipelines developed within the interactive pipeline manager [DolphinNext](./dolphinsuite.html) ([Yukselen et al. 2020](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-020-6714-x)). The metadata and the processed datasets are hosted in [DolphinSuite](./dolphinsuite.html), an end-to-end bioinformatics analysis platform.
## DolphinSuite Platforms
<a href="https://www.umassmed.edu/biocore/" target="_blank">UMass Chan Medical School BioCore</a> has developed **DolphinSuite**, a bioinformatics platform to support the analysis of high-throughput data. DolphinSuite tracks samples from sample collection to data processing (sequencing, proteomics, metabolomics) and provides interactive analysis through an intuitive web interface. DolphinSuite is built to ensure secure access to the processed data from third-party applications for tailor-made analysis and data sharing.
DolphinSuite comprises three major components that support distinct aspects of high-throughput data analysis.
<table>
<tbody>
<tr>
<td style="text-align:center;width:33%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com" target="_blank" style="font-size: 1.5em;">DolphinNext (Dnext)</a> </p>
<img width="300" height="220" src="./images/dolphinnext.png" class="center">
<br>
<p> **Dnext**: a distributed data processing platform for high throughput genomics </p>
</td>
<td style="text-align:center;width:33%; vertical-align: text-top;">
<p> <a href="https://dmeta.dolphinnext.com" target="_blank" style="font-size: 1.5em;">DolphinMeta (Dmeta)</a> </p>
<img width="300" height="200" src="./images/dmeta.png" class="center">
<br>
<p> **Dmeta**: automated data processing and management using an ontology-based metadata tracking system </p>
</td>
<td style="text-align:center; width:33%; vertical-align: text-top;">
<p> <a href="https://dportal.dolphinnext.com" target="_blank" style="font-size: 1.5em;">DolphinPortal (Dportal)</a> </p>
<img width="300" height="200" src="./images/dportal.png" class="center">
<br>
<p> **Dportal**: a user-friendly and interactive data browser for filtering and querying</p>
</td>
</tr>
</tbody>
</table>
<br>
<br>
## Preprocessing Pipelines
DolphinNext is a pipeline management system that allows users to <strong>interactively customize data pre-processing</strong> workflows and ensures <strong>reproducibility</strong> of analysis results. Below, we summarize the pipelines used to process the scRNA-Seq, scATAC-Seq and CAGE-Seq reads incorporated in this project.
<table>
<tbody>
<tr>
<td style="text-align:center;width:40%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=12" target="_blank" style="font-size: 1.2em;">scRNA-Seq (Tasic et al. 2018) Pipeline</a> </p>
<img width="400" height="240" src="./images/scRNA-Seq pipeline.png" class="center">
<br>
</td>
<td style="width:60%; vertical-align: text-top; padding: 10px">
<p> The <strong>scRNA-Seq (Tasic et al. 2018) pipeline</strong> includes quality control, rRNA filtering, and genome alignment using STAR, and estimates gene expression levels with the GenomicAlignments package in R. </p>
<p> <strong>Steps</strong>: </p>
<ol>
<li>For quality control, FastQC generates QC reports. Optional processes are available for read quality filtering (Trimmomatic), read quality trimming (Trimmomatic), and adapter removal (Cutadapt).</li>
<li>STAR is used to count or filter out common RNAs (e.g., rRNA, miRNA, tRNA, piRNA).</li>
<li>STAR is used to align RNA-Seq reads to a genome.</li>
<li>summarizeOverlaps (GenomicAlignments) is used to calculate gene counts.</li>
</ol>
<p style="margin : 0; padding-top:0;">**Docker**: <a href="https://hub.docker.com/r/dolphinnext/tasic2018_scrnaseq_pipeline" target="_blank">https://hub.docker.com/r/dolphinnext/tasic2018_scrnaseq_pipeline</a></p>
<p>**GitHub**: <a href="https://github.com/dolphinnext/tasic2018_scrnaseq_pipeline" target="_blank">https://github.com/dolphinnext/tasic2018_scrnaseq_pipeline</a></p>
</td>
</tr>
<tr>
<td style="text-align:center;width:40%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=18" target="_blank" style="font-size: 1.2em;">scATAC-Seq (Graybuck et al. 2021) Pipeline</a> </p>
<img width="400" height="180" src="./images/scATAC-Seq pipeline.png" class="center">
<br>
</td>
<td style="width:60%; vertical-align: text-top; padding: 10px">
<p> The <strong>scATAC-Seq (Graybuck et al. 2021) pipeline</strong> includes the set of tools necessary for pre-processing ATAC-Seq libraries prepared as in Graybuck et al. 2021.</p>
<p> <strong>Steps</strong>: </p>
<ol>
<li>Trim Galore is used to remove adapters from reads.</li>
<li>Trimmed and untrimmed reads are aligned to the reference using Bowtie.</li>
<li>Duplicate reads are removed using Samtools.</li>
</ol>
<p style="margin : 0; padding-top:0;">**Docker**: <a href="https://hub.docker.com/r/dolphinnext/graybuck2021_scatacseq_pipeline" target="_blank">https://hub.docker.com/r/dolphinnext/graybuck2021_scatacseq_pipeline</a></p>
<p>**GitHub**: <a href="https://github.com/dolphinnext/graybuck2021_scatacseq_pipeline" target="_blank">https://github.com/dolphinnext/graybuck2021_scatacseq_pipeline</a></p>
</td>
</tr>
<tr>
<td style="text-align:center;width:40%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=38" target="_blank" style="font-size: 1.2em;">scATAC-Seq (Mich et al. 2021) Pipeline</a> </p>
<img width="400" height="180" src="./images/mich pipeline.png" class="center">
<br>
</td>
<td style="width:60%; vertical-align: text-top; padding: 10px">
<p> The <strong>scATAC-Seq (Mich et al. 2021) pipeline</strong> includes the set of tools necessary for pre-processing ATAC-Seq libraries prepared as in Mich et al. 2021.</p>
<p> <strong>Steps</strong>: </p>
<ol>
<li>Reads are aligned to the reference using Bowtie2.</li>
<li>Low-quality, secondary, and unmapped reads are removed using Samtools.</li>
</ol>
<p style="margin : 0; padding-top:0;">**Docker**: <a href="https://hub.docker.com/r/dolphinnext/mich2021_scatacseq_pipeline" target="_blank">https://hub.docker.com/r/dolphinnext/mich2021_scatacseq_pipeline</a></p>
<p>**GitHub**: <a href="https://github.com/dolphinnext/mich2021_scatacseq_pipeline" target="_blank">https://github.com/dolphinnext/mich2021_scatacseq_pipeline</a></p>
</td>
</tr>
<tr>
<td style="text-align:center;width:40%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=21" target="_blank" style="font-size: 1.2em;">CAGE-Seq Pipeline</a> </p>
<img width="400" height="160" src="./images/CAGE-Seq pipeline.png" class="center">
<br>
</td>
<td style="width:60%; vertical-align: text-top; padding: 10px">
<p> The <strong>CAGE-Seq pipeline</strong> includes rRNA filtering, genome alignment using STAR, and peak calling using MACS2. </p>
<p> <strong>Steps</strong>: </p>
<ol>
<li>STAR is used to count or filter out common RNAs (e.g., rRNA, miRNA, tRNA, piRNA).</li>
<li>STAR is used to align RNA-Seq reads to a genome.</li>
<li>MACS2 is used to call peaks in the aligned BAM files.</li>
</ol>
<p style="margin : 0; padding-top:0;">**Docker**: <a href="https://hub.docker.com/r/dolphinnext/cageseq_pipeline" target="_blank">https://hub.docker.com/r/dolphinnext/cageseq_pipeline</a></p>
<p>**GitHub**: <a href="https://github.com/dolphinnext/cageseq_pipeline" target="_blank">https://github.com/dolphinnext/cageseq_pipeline</a></p>
</td>
</tr>
</tbody>
</table>
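
The read-filtering step listed for the scATAC-Seq (Mich et al. 2021) pipeline can be illustrated with a small, self-contained sketch. The pipeline itself uses Samtools; the Python below only mirrors the underlying logic on SAM flag bits. The MAPQ cutoff of 30, the `Alignment` record, and the function names are assumptions made for illustration and are not taken from the pipeline's configuration.

```python
from typing import NamedTuple

# SAM flag bits as defined in the SAM format specification
FLAG_UNMAPPED = 0x4     # read is unmapped
FLAG_SECONDARY = 0x100  # secondary alignment

# Assumed cutoff for illustration only; the pipeline's actual threshold is not stated here
MIN_MAPQ = 30


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record (name, SAM flag, mapping quality)."""
    name: str
    flag: int
    mapq: int


def keep_alignment(aln: Alignment, min_mapq: int = MIN_MAPQ) -> bool:
    """Return True if the alignment is mapped, primary, and meets the MAPQ cutoff."""
    if aln.flag & (FLAG_UNMAPPED | FLAG_SECONDARY):
        return False
    return aln.mapq >= min_mapq


def filter_alignments(alignments, min_mapq: int = MIN_MAPQ):
    """Keep only the alignments that pass keep_alignment."""
    return [a for a in alignments if keep_alignment(a, min_mapq)]


reads = [
    Alignment("read1", 0, 42),    # mapped, primary, high quality: kept
    Alignment("read2", 4, 0),     # unmapped: removed
    Alignment("read3", 256, 40),  # secondary alignment: removed
    Alignment("read4", 16, 10),   # reverse strand, low MAPQ: removed
]
kept = [a.name for a in filter_alignments(reads)]
print(kept)
```

Under these assumptions, the same filter would roughly correspond to `samtools view -b -q 30 -F 260`, where 260 is the sum of the unmapped (4) and secondary (256) flag bits.
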
<br>
## Downstream Analysis Pipelines
DolphinNext can also be used to design downstream analysis pipelines that take processed sequencing datasets as input to further investigate cell types of interest and build the files required by online applications such as [Cellxgene](./cellxgenebrowser.html) and [UCSC Track Hub](./ucscbrowser.html).
All **Aggregate Analysis** pipelines share a small number of common processes and add steps geared towards each individual sequencing technology or dataset.
<p> <strong>Steps</strong>: </p>
<ol>
<li>Aggregating BAM files with uniquely mapped reads by cell type.</li>
<li>Peak Calling with MACS2.</li>
<li>Creating BigWig files for UCSC Genome Browser.</li>
<li>Creating auxiliary files for building custom UCSC track hubs.</li>
<li>Single cell RNA downstream analysis using Seurat (only for scRNA-Seq pipeline)</li>
<li>Making h5ad files for visualization in Cellxgene (only for scRNA-Seq pipeline)</li>
</ol>
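
The first step above, aggregating per-cell BAM files by cell type, might be sketched as follows. This is a minimal illustration and not the pipeline's own code: the metadata layout (pairs of BAM path and cell type), the example cell-type labels, the output directory, and the helper names are all hypothetical. The sketch only builds `samtools merge` command lines and does not run them.

```python
from collections import defaultdict


def group_bams_by_cell_type(cell_metadata):
    """Map each cell type to the sorted list of BAM files of its cells.

    cell_metadata is an iterable of (bam_path, cell_type) pairs (assumed layout).
    """
    groups = defaultdict(list)
    for bam_path, cell_type in cell_metadata:
        groups[cell_type].append(bam_path)
    return {cell_type: sorted(paths) for cell_type, paths in groups.items()}


def merge_command(cell_type, bam_paths, out_dir="merged"):
    """Build (but do not run) a `samtools merge` command for one cell type."""
    out_bam = f"{out_dir}/{cell_type}.bam"
    return ["samtools", "merge", out_bam, *bam_paths]


# Hypothetical per-cell BAM files and cell-type labels
metadata = [
    ("cells/c1.bam", "Pvalb"),
    ("cells/c2.bam", "Sst"),
    ("cells/c3.bam", "Pvalb"),
]
groups = group_bams_by_cell_type(metadata)
commands = {ct: merge_command(ct, paths) for ct, paths in groups.items()}
for cell_type in sorted(commands):
    print(" ".join(commands[cell_type]))
```

Each resulting per-cell-type BAM file could then be passed to peak calling and bigWig generation, as in the later steps of the list.
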
We have designed an additional downstream analysis pipeline (see <strong>Integrate-MultiOmics-Tracks</strong>) to build a large custom UCSC Track Hub of combined scRNA-Seq, scATAC-Seq and CAGE-Seq analysis. The resulting UCSC track hub can be accessed from here: [mouse](./ucscbrowser_mouse.html) and [human](./ucscbrowser_human.html).
<br>
<table>
<tbody>
<tr>
<td style="text-align:center;width:33%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=20" target="_blank" style="font-size: 1.2em;">Aggregate Analysis (Tasic 2018)</a> </p>
<img width="300" height="180" src="./images/aggregate Tasic.png" class="center">
</td>
<td style="text-align:center; width:33%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=19" target="_blank" style="font-size: 1.2em;">scATAC-Seq Aggregate Analysis</a> </p>
<img width="300" height="180" src="./images/aggregate Graybuck.png" class="center">
</td>
<td style="text-align:center;width:33%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=22" target="_blank" style="font-size: 1.2em;">CAGE-Seq Aggregate Analysis</a> </p>
<img width="300" height="130" src="./images/aggregate CAGE-Seq.png" class="center">
</td>
</tr>
<tr>
<td style="text-align:center;width:33%; vertical-align: text-top;">
<p> <a href="https://dnext.dolphinnext.com/index.php?np=1&id=23" target="_blank" style="font-size: 1.2em;">Integrate-MultiOmics-Tracks</a> </p>
<img width="300" height="280" src="./images/multi-omics pipeline.png" class="center">
<br>
</td>
<td colspan=2 style="width:66%; vertical-align: text-top; padding: 10px">
<p> The <strong>Integrate-MultiOmics-Tracks</strong> pipeline generates a custom <strong>UCSC Track Hub</strong> and auxiliary tables for interrogating RNA-Seq, ATAC-Seq and CAGE-Seq reads from the studies described above. </p>
<!-- <ol> -->
<!-- <li>ATAC-seq (Graybuck et. al 2021): Enhancer viruses for combinatorial cell-subclass- specific labeling.</li> -->
<!-- <li>RNA-seq (Tasic et. al 2018): Shared and distinct transcriptomic cell types across neocortical areas.</li> -->
<!-- <li>CAGE-seq: FANTOM5 CAGE profiles of human and mouse samples.</li> -->
<!-- </ol> -->
<p> <strong>Steps</strong>: </p>
<ol>
<li>Collect bigWig files from multiple DolphinNext runs.</li>
<li>Build a custom UCSC track hub for all tracks of scRNA-Seq, scATAC-Seq and CAGE-Seq reads.</li>
<li>Aggregate counts from scRNA-seq data for each cell type.</li>
</ol>
</td>
</tr>
</tbody>
</table>
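
To make the track hub step more concrete, the sketch below generates the three plain-text files that a UCSC track hub is built from (`hub.txt`, `genomes.txt`, and a `trackDb.txt` stanza per bigWig track). It is a hedged illustration, not the Integrate-MultiOmics-Tracks implementation: the hub name, labels, e-mail address, URLs, and function names are hypothetical, and only a minimal set of standard track hub settings is shown.

```python
def hub_txt(hub_name, short_label, long_label, email):
    """Return the contents of hub.txt, the entry point of a track hub."""
    return "\n".join([
        f"hub {hub_name}",
        f"shortLabel {short_label}",
        f"longLabel {long_label}",
        "genomesFile genomes.txt",
        f"email {email}",
    ]) + "\n"


def genomes_txt(assemblies):
    """Return genomes.txt with one stanza per assembly, separated by blank lines."""
    stanzas = [f"genome {a}\ntrackDb {a}/trackDb.txt\n" for a in assemblies]
    return "\n".join(stanzas)


def bigwig_stanza(track, url, label):
    """Return one trackDb.txt stanza describing a bigWig signal track."""
    return "\n".join([
        f"track {track}",
        f"bigDataUrl {url}",
        f"shortLabel {label}",
        f"longLabel {label}",
        "type bigWig",
        "visibility full",
    ]) + "\n"


# Hypothetical tracks: (track name, bigWig URL, display label)
tracks = [
    ("scRNA_Pvalb", "https://example.org/bw/scRNA_Pvalb.bw", "scRNA Pvalb"),
    ("scATAC_Pvalb", "https://example.org/bw/scATAC_Pvalb.bw", "scATAC Pvalb"),
]
track_db = "\n".join(bigwig_stanza(*t) for t in tracks)

print(hub_txt("multiomics_demo", "MultiOmics demo", "Demo multi-omics hub", "user@example.org"))
print(genomes_txt(["mm10", "hg38"]))
print(track_db)
```

Keeping one `trackDb.txt` per assembly, as `genomes.txt` does here, matches the separate mouse and human hubs linked above.
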
<br>