This quick start guide uses an iCLIP dataset as an example. If you want to apply UMI-tools in a single cell RNA-Seq analysis, please see the Single cell tutorial <Single_cell_tutorial.md>
_
The following steps will guide you through a short example of how to
use the UMI-tools
package to process data with UMIs added to them. The
data used comes from one of the control replicates from Mueller-Mcnicoll et al,
Genes Dev (2016) 30: 553. We
have adaptor trimmed and filtered the data to reduce its size.
The general pipeline is:
extract UMI from raw reads -> map reads -> deduplicate reads based on UMIs
The most computationally intensive part of this is the middle part -
mapping the reads. It is also the least interesting for us here. To aid
the ability of folks to follow along without having to worry if they
have the correct indexs install etc, we have provided a BAM file of
the mapped reads from this example. You can download it here. It
will need indexing with samtools index
before use. You can then skip
straight to Step 5 below.
The easiest way to install UMI-Tools
is using your favorite python
package manager, either pip
or conda
. If you don't know which you
have installed, we recommend trying both, starting with conda
:
$ conda install -c https://conda.anaconda.org/toms umi_tools
Alternatively, pip
:
$ pip install umi_tools
If neither of these work, see our installation guide.
The test data we are going to use is a control sample from a recent iCLIP experiment. We have processed this data to remove the various adaptors that you find in iCLIP data. You can download the trimmed file here:
$ wget https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.fastq.gz
If you're using macOS use:
$ curl -L "https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.fastq.gz" -o "example.fastq.gz"
The file is about 100Mb, and takes a couple of minutes to download on our system.
UMIs are strings of random nucleotides attached to the start of
reads. Before the reads are mapped the random nucleotides must be
removed from the reads, but the sequence must be kept. The extract
command of UMI-Tools
moves the UMI from the read to the read name.
Several techniques that use UMIs mix the UMI sequence in with a
library barcode. In this case we want to remove the random part of the
barcodes, but leave the library part so that the reads can be
de-multiplexed. We specify this using the --bc-pattern
parameter to
extract
. Ns represent the random part of the barcode and Xs the
fixed part. For example, in a standard iCLIP experiment, the barcode
is made of 3 random bases, followed by a 4 base library barcode,
followed by 2 more random bases. Thus the --bc-pattern
would be
"NNNXXXXNN".
Read: TAGCCGGCTTTGCCCAATTGCCAAATTTTGGGGCCCCTATGAGCTAG
Barcode: NNNXXXXNN
|
v
TAGCCGGCT
|
V
random-> TAG CCGG CT <- random
^
|
library
Processed read: CCGGTTGCCCAATTGCCAAATTTTGGGGCCCCTATGAGCTAG
^^^^
The processed reads could then be passed to a demultiplexing tool to deal with the library part of the barcode.
Since the file we have downloaded contains only one library, here we will treat the whole barcode as a UMI, and so the pattern will contain only Ns.
$ umi_tools extract --stdin=example.fastq.gz --bc-pattern=NNNNNNNNN --log=processed.log --stdout processed.fastq.gz
Note that extract
can output to a gzipped or uncompressed file
depending on the file extension. It can also output to stdout
if no
output is specified. A log file is saved containing the parameters
with which extract
was run and the frequency of each UMI
encountered. This can be redirected with --log
or suppressed with
--supress-stats
(run parameters are still output).
The next step is to map the reads (in real life, you might also want
to demultiplex, trim and quality filter the reads). Below we will use
bowtie
to map the reads to the mouse genome and samtools
to create
a BAM file from the results. If you don't wish to spend the time doing
this, or don't have access to bowtie
or samtools
(or suitable
alternatives), we provide a premapped BAM file (see command at the end of this step).
First map the reads with your favorite read mapper, here bowtie
,
using parameters from the paper which we stole the sample from. This
assumes the mm9 bowtie
index and fasta are in your current
directory.
$ bowtie --threads 4 -v 2 -m 10 -a mm9 <( gunzip < processed.fastq.gz ) --sam > mapped.sam
Next we need to convert the SAM file to BAM (actually dedup
will use
SAM files for single ended analysis, but it's much slower).
$ samtools import mm9.fa mapped.sam mapped.bam
The BAM now needs to be sorted and indexed:
$ samtools sort mapped.bam -o example.bam
$ samtools index example.bam
If you want to skip the mapping, you can get the file here. It will still need indexing (see "samtools index" command above):
$ wget https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam
Again for macOS use the following to download:
$ curl -L "https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam" -o "example.bam"
Now that we have a mapped, sorted, indexed file, we can proceed to run the deduplication procedure on it:
$ umi_tools dedup -I example.bam --output-stats=deduplicated -S deduplicated.bam
The --output-stats
option is optional, but selecting it will provide a range
of statistics about the run. One of the most interesting is the
distribution of edit distances (here named deduplicated_edit_distance.tsv).
The content of this file after running the above will look something like:
directional-adjacency | directional-adjacency_null | edit_distance | unique | unique_null |
---|---|---|---|---|
10465 | 10465 | Single_UMI | 8976 | 8976 |
0 | 1 | 0 | 0 | 2 |
2 | 8 | 1 | 1167 | 33 |
164 | 73 | 2 | 496 | 183 |
211 | 258 | 3 | 281 | 566 |
700 | 916 | 4 | 730 | 1663 |
395 | 320 | 5 | 317 | 617 |
89 | 4 | 6 | 59 | 5 |
21 | 2 | 7 | 21 | 2 |
0 | 0 | 8 | 0 | 0 |
The first two columns show the distribution of average edit distances
between UMIs found at a single base in the genome after deduplication
with the directional-adjacency method (the default). Thus in the third
line we see that there are 2 bases in the genome where the average
edit distance between the UMIs found at that base is 1. The second
column is what we would expect to see if UMIs were randomly
distributed between mapping locations (taking into account any biases
in the overall usage of particular UMI sequences). The last two
columns the same, but for the naive unique
deduplication method
where every UMI is assumed to represent an independent molecule in the
biological sample. Looking at the third row, we see that there are
1167 positions where the average edit distance between
UMIs is 1, whereas in the random null (in the final column) we would
only expect to see 33 such bases.
The statistics options signficantly reduce the speed at which
deduplication is performed and increase the memory usage. If time or
memory usage is an issue, try running without the --output-stats
option.
If paired-end sequencing has been performed, it is necessary to make sure that the UMI sequence is added to both reads. When processing, provide the second read like so:
$ umi_tools extract -I pair.1.fastq.gz --bc-pattern=NNNXXXXNN \
--read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz \
--read2-out=processed.2.fastq.gz
This assumes the UMI is on the 5' end of read1. For other
possibilities (such as a UMI on both reads) see umi_tools extract --help
. After paired-end mapping, paired end deduplication can be
achieved by adding the --paired
option to the call to dedup
:
$ umi_tools dedup -I mapped.bam --paired -S deduplicated.bam
Paired deduplicating is signficantly slower and more memory intensive than single-ended.
For some applications it may be neccessary to mark the duplicates but retain all reads, for example,
where the PCR duplicates are used to correct sequence errors by generating a consensus sequence. In
these cases, the group command can be used to mark each read with its read group. Optionally a flatfile
detailing the read groups and read identifiers can also be output using the --group-out
option:
$ umi_tools group -I mapped.bam --paired --group-out=groups.tsv --output-bam -S mapped_grouped.bam
The output bam will contain two tags: UG = read group id, BX = read group UMI.
The tag containing the read group UMI can be modified with the --umi-group-tag
option.
The groups flatfile contains the following columns:
- read_id
- contig
- position
- umi = raw umi
- umi_count = how many times was this umi observed at the same alignment coordinates
- final_umi = the error corrected umi
- final_umi_count = how many times was the umi observed at the same alignment coordinates, inc. error correction
- unique_id = the unique identifier for this group
The case below could be handled by umi_tools dedup by specifiying --bc-pattern=NNXXXXNN but we include it here as an example of how the fastq should be formatted following extraction:
UMI is bases 3-7, bases 1-2 and 7-8 are the sample barcode and need to be removed
@HISEQ:87:00000000 read1
AAGGTTGCTGATTGGATGGGCTAG
DA1AEBFGGCG01DFH00B1FF0B
+
will become:
@HISEQ:87:00000000_GGTT read1
TGATTGGATGGGCTAG
1AFGGCG01DFH00B1FF0B
+
See umi_tools extract --help
and umi_tools dedup --help
for details of
futher possibilities.