Helwak, Aleksandra, et al. 'Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding.' Cell 153.3 (2013): 654-665. http://dx.doi.org/10.1016/j.cell.2013.03.043
- Biomart queries include:
- mRNA : transcript_biotype=protein_coding; cdna; (limited to where a RefSeq Protein Identifier Exists)
- lncRNA : transcript_biotype=lncrna; cdna
- rRNA : transcript_biotype=[Mt_rRNA, rRNA, rRNA_pseudogene]; transcript_exon_intron
- tRNA : transcript_biotype=Mt_tRNA; transcript_exon_intron
- other : transcript_biotype=[all remaining]; transcript_exon_intron
For a detailed description of the current sequences and queries utilized, see "Current Reference Details" below.
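
The query list above can be summarized as a small lookup table. The following is a minimal sketch only: it does not call the Biomart module, and the names `QUERY_PLAN` and `sequence_type_for` are hypothetical, introduced here for illustration. Matching biotypes case-insensitively is also an assumption.

```python
# Illustrative summary of the query list above; not the pipeline's own code.
# QUERY_PLAN and sequence_type_for are hypothetical names.
CDNA = "cdna"
UNSPLICED = "transcript_exon_intron"

# (label, transcript_biotype values, sequence type requested)
QUERY_PLAN = [
    ("mRNA", ["protein_coding"], CDNA),
    ("lncRNA", ["lncRNA"], CDNA),
    ("rRNA", ["Mt_rRNA", "rRNA", "rRNA_pseudogene"], UNSPLICED),
    ("tRNA", ["Mt_tRNA"], UNSPLICED),
]


def sequence_type_for(biotype):
    """Return the sequence type requested for a given transcript_biotype."""
    wanted = biotype.lower()  # assumption: compare case-insensitively
    for _label, biotypes, seq_type in QUERY_PLAN:
        if wanted in (b.lower() for b in biotypes):
            return seq_type
    # "other": all remaining biotypes are fetched as unspliced transcripts.
    return UNSPLICED
```

With this table, `sequence_type_for("protein_coding")` selects spliced cDNA, while any biotype not listed falls through to the unspliced `transcript_exon_intron` sequence.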
The reference pipeline is designed using Nextflow, and has been tested on Nextflow/23.04.1. Dependency handling is performed with conda modules (containerized implementation in development).
- Required program dependencies are:
- Required Python Packages:
The scripts can be run by executing the first script, "00_run_all.sh", using the supplied conda configuration, or by making all required resources (seqkit, python3) available on the system path.
The Hyb program has requirements about the formatting of the FASTA file used for the reference.
- Currently identified requirements include:
- No description in FASTA sequence header (no whitespace characters)
- Sequence identifier must be of the form "{1}_{2}_{seqid}_{biotype}":
  - {1}: Arbitrary identifier (ENSG... for Ensembl sequences)
  - {2}: Arbitrary identifier (ENST... for Ensembl sequences)
  - {seqid}: Name of gene/miRNA
  - {biotype}: Ensembl-style transcript_biotype (note: "microRNA" must be used in place of "miRNA" for recognition by Hyb)
- {1}, {2}, {seqid}, and {biotype} should contain only [a-z], [A-Z], [0-9], "-", and "|" characters. "_", ".", and "," characters are specifically excluded from identifiers.
Examples:
>ENSG00000003137_ENST00000001146_CYP26B1_mRNA
.....
>MIMAT0000062_MirBase_let-7a_microRNA
TGAGGTAGTAGGTTGTATAGTT
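
The header rules above can be checked mechanically. The following is a minimal sketch, not the pipeline's own implementation: the function names `clean_field`, `make_header`, and `is_valid_header` are hypothetical. Deleting excluded characters outright, rather than substituting them, is an assumption.

```python
import re

# Allowed field characters per the rules above: letters, digits, "-" and "|".
_FIELD_RE = re.compile(r"[A-Za-z0-9|-]+")


def clean_field(text):
    """Delete excluded characters ("_", ".", ",") and any whitespace."""
    return re.sub(r"[_.,\s]", "", text)


def make_header(id1, id2, seqid, biotype):
    """Build a '>{1}_{2}_{seqid}_{biotype}' FASTA header line."""
    if biotype == "miRNA":
        biotype = "microRNA"  # Hyb recognizes "microRNA", not "miRNA"
    fields = [clean_field(f) for f in (id1, id2, seqid, biotype)]
    for field in fields:
        if not _FIELD_RE.fullmatch(field):
            raise ValueError("invalid header field: %r" % field)
    return ">" + "_".join(fields)


def is_valid_header(line):
    """Return True if a header line satisfies the rules listed above."""
    if not line.startswith(">") or any(ch.isspace() for ch in line):
        return False
    fields = line[1:].split("_")
    return len(fields) == 4 and all(_FIELD_RE.fullmatch(f) for f in fields)
```

For instance, `make_header("MIMAT0000062", "MirBase", "let-7a", "miRNA")` reproduces the second example header above, and `is_valid_header` rejects any header that contains a description after whitespace.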
Thanks to Grzegorz Kudla ( https://github.com/gkudla ) for providing information on Hyb reference creation.
Text of: ./01_notes.sh
Download a reference sequence library for the Hyb program from Ensembl
using the Biomart python module.
Library construction is based on the protocol provided in the supplemental methods of:
Helwak, Aleksandra, et al. 'Mapping the human miRNA interactome by CLASH reveals
frequent noncanonical binding.' Cell 153.3 (2013): 654-665.
http://dx.doi.org/10.1016/j.cell.2013.03.043
( Supplemental methods section found only in PDF-fulltext )
Biomart queries include:
protein_coding (as cDNA)
lncRNA (as cDNA)
All remaining gene_biotypes
as unspliced transcripts ('transcript_exon_intron')
tRNAs: genomic tRNA database (http://gtrnadb.ucsc.edu/)
rRNAs: NCBI GenBank database, rRNA sequences (NR_003287.4, NR_003286.4)
miRNAs: miRBase release 22.1 (http://www.mirbase.org): mature human miRNAs.
These sequences are then formatted in the required {}_{}_{name}_{biotype} header
format for Hyb, and all extra '.' and '_' symbols are removed.
Original biotypes from the hOH7 Hyb database are:
Ig, lincRNA, microRNA, miscRNA, mRNA, mtrRNA, pr-tr, pseudo, rRNA, snoRNA, snRNA, Trec, tRNA
In this version, biotypes are passed through unchanged from the Ensembl 'transcript_biotype' field.
In order to facilitate unambiguous miRNA alignment, mature miRNA sequences are aligned to the
reference transcriptome, and any alignments within transcripts are masked. This is performed both
to ensure that each miRNA sequence has only a single reference alignment and
to allow miRNA precursor transcripts to be identified as hybrid targets.
"""
echo "${NOTES}"