GitHub - RenneLab/hkref: Hybkit-Ref: Up-to-date Genomic Sequence Reference Database for Hyb

hkref (Hybkit-Reference)


This repository is a part of the hybkit project.

Full hybkit project documentation is available at hybkit's ReadTheDocs.

Description:

This repository includes an up-to-date human genomic sequence reference designed to be compatible with the Hyb program for chimeric (hybrid) read calling in ribonomics experiments.

The method for reference library construction is based on the protocol provided in the supplemental methods of:

Helwak, Aleksandra, et al. 'Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding.' Cell 153.3 (2013): 654-665. http://dx.doi.org/10.1016/j.cell.2013.03.043

The reference library is primarily based on sequences downloaded from Ensembl via the Biomart API, with the use of miRBase for mature miRNA sequences and a few other sequence sources.

Biomart queries include:

mRNA : transcript_biotype=protein_coding; cdna; (limited to where a RefSeq Protein Identifier Exists)
lncRNA : transcript_biotype=lncrna; cdna
rRNA : transcript_biotype=[Mt_rRNA, rRNA, rRNA_pseudogene]; transcript_exon_intron
tRNA : transcript_biotype=Mt_tRNA; transcript_exon_intron
other : transcript_biotype=[all remaining]; transcript_exon_intron
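
The mapping from biotype to sequence type in the list above can be written as a small lookup. This is a minimal sketch only: the names `CDNA_BIOTYPES` and `sequence_type_for` are hypothetical and are not taken from the hkref pipeline code, and the RefSeq-protein restriction on mRNA is not modelled.

```python
# Biotypes that the list above retrieves as spliced cDNA (compared case-insensitively).
CDNA_BIOTYPES = {"protein_coding", "lncrna"}


def sequence_type_for(transcript_biotype: str) -> str:
    """Return the sequence type requested for a given transcript_biotype.

    Hypothetical helper: protein_coding and lncRNA transcripts are fetched as
    'cdna'; every other biotype is fetched unspliced ('transcript_exon_intron').
    """
    if transcript_biotype.lower() in CDNA_BIOTYPES:
        return "cdna"
    return "transcript_exon_intron"


for biotype in ("protein_coding", "lncRNA", "Mt_tRNA", "rRNA_pseudogene"):
    print(biotype, "->", sequence_type_for(biotype))
```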

For a detailed description of the current sequences and queries utilized, see "Current Reference Details" below.

Run Reference Creation Pipeline:

The reference pipeline is designed using Nextflow, and has been tested on Nextflow/23.04.1. Dependency handling is performed with conda modules (containerized implementation in development).

Required program dependencies are:

nextflow (tested v23.04.1)
seqkit (tested v2.5.1)
bedops (tested v2.4.39)

Required Python Packages:

pandas (tested pandas=1.3.5)
natsort (tested natsort=7.1.1)
pybiomart (tested pybiomart=0.2.0)
biothings-client (tested biothings_client=0.2.6)
biopython (tested biopython=1.79)
pyyaml (tested pyyaml=6.0)

The pipeline can be run by executing the first script, "00_run_all.sh", either with the supplied conda configuration or after making all required resources (seqkit, python3, and the other dependencies listed above) available on the system path.
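
When running without the conda configuration, it can help to confirm beforehand that the listed dependencies are visible. The following is a minimal sketch and not part of the hkref repository: the function names are hypothetical, the executable names are assumed to match the program names above, and the Python import names (`Bio` for biopython, `yaml` for pyyaml) are the usual ones for those packages. Versions are not checked.

```python
import importlib.util
import shutil

# Executables and Python import names corresponding to the dependency lists above.
REQUIRED_TOOLS = ["nextflow", "seqkit", "bedops"]
REQUIRED_MODULES = ["pandas", "natsort", "pybiomart", "biothings_client", "Bio", "yaml"]


def missing_tools(tools):
    """Return the executables from `tools` that are not found on the system path."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules):
    """Return the top-level module names from `modules` that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


print("Missing tools:", missing_tools(REQUIRED_TOOLS))
print("Missing Python modules:", missing_modules(REQUIRED_MODULES))
```

An empty list for both lines suggests the environment is ready for "00_run_all.sh".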

Hyb Reference Specification:

The Hyb program has requirements about the formatting of the FASTA file used for the reference.

Currently identified requirements include:

No description in FASTA sequence header (no whitespace characters)
Sequence identifier must be of the form "{1}_{2}_{seqid}_{biotype}"

{1}: Arbitrary Identifier (ENSG... for Ensembl Sequences)

{2}: Arbitrary Identifier (ENST... for Ensembl Sequences)

{seqid}: Name of gene/miRNA

{biotype}: Ensembl-style transcript_biotype.

(Note: "microRNA" must be used in place of "miRNA" for recognition by Hyb)
{1}, {2}, {seqid}, and {biotype} should contain only [a-z], [A-Z], [0-9], "-", and "|" characters.
"_", ".", and "," characters are specifically excluded from identifiers.

Examples:

>ENSG00000003137_ENST00000001146_CYP26B1_mRNA
.....
>MIMAT0000062_MirBase_let-7a_microRNA
TGAGGTAGTAGGTTGTATAGTT
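
The requirements above can be expressed as a small builder and validator for header identifiers (without the leading ">"). This is a minimal sketch under the rules as listed here; `make_hyb_header` and `is_valid_hyb_header` are hypothetical names, not functions from hkref or Hyb, and Hyb itself may enforce further constraints.

```python
import re

# Characters permitted within each of the four fields.
FIELD_RE = re.compile(r"[A-Za-z0-9|-]+")
# Characters excluded from identifiers ("_", ".", ","), plus any whitespace.
EXCLUDED_RE = re.compile(r"[_.,\s]")


def make_hyb_header(id1, id2, seqid, biotype):
    """Join four fields into '{1}_{2}_{seqid}_{biotype}', dropping excluded characters."""
    if biotype == "miRNA":
        biotype = "microRNA"  # Hyb recognises "microRNA", not "miRNA"
    fields = [EXCLUDED_RE.sub("", field) for field in (id1, id2, seqid, biotype)]
    return "_".join(fields)


def is_valid_hyb_header(header):
    """Check for exactly four non-empty fields made only of permitted characters."""
    fields = header.split("_")
    return len(fields) == 4 and all(FIELD_RE.fullmatch(f) for f in fields)


print(make_hyb_header("MIMAT0000062", "MirBase", "let-7a", "miRNA"))
print(is_valid_hyb_header("ENSG00000003137_ENST00000001146_CYP26B1_mRNA"))
```

Splitting on "_" works only because that character is excluded from the fields themselves, which is why the builder strips it before joining.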

Thanks to Grzegorz Kudla ( https://github.com/gkudla ) for providing information on Hyb reference creation.

Current Reference Details:

Text of: ./01_notes.sh

Download a reference sequence library for the Hyb program from Ensembl
using the Biomart python module.

Library construction is based on the protocol provided in the supplemental methods of:
Helwak, Aleksandra, et al. 'Mapping the human miRNA interactome by CLASH reveals
frequent noncanonical binding.' Cell 153.3 (2013): 654-665.
http://dx.doi.org/10.1016/j.cell.2013.03.043
( Supplemental methods section found only in PDF-fulltext )

Biomart queries include:
  protein_coding (as cDNA)
  lncRNA (as cDNA)
  All remaining gene_biotypes
      as unspliced transcripts ('transcript_exon_intron')

tRNAs:  Genomic tRNA Database (http://gtrnadb.ucsc.edu/)
rRNAs:  NCBI GenBank database, rRNA sequences (NR_003287.4, NR_003286.4)
miRNAs: miRBase release 22.1 (http://www.mirbase.org): mature human miRNAs.

These sequences are then formatted in the required {}_{}_{name}_{biotype} header
format for Hyb, and all extra '.' and '_' symbols are removed.

Original biotypes from the hOH7 Hyb database are:
Ig, lincRNA, microRNA, miscRNA, mRNA, mtrRNA, pr-tr, pseudo, rRNA, snoRNA, snRNA, Trec, tRNA
In this version, biotypes are passed through as given in the Ensembl 'transcript_biotype' field.

In order to facilitate unambiguous miRNA alignment, mature miRNA sequences are aligned to the
reference transcriptome, and any alignments within transcripts are masked. This is performed both to
ensure that each given miRNA sequence has only a single reference alignment and
to allow miRNA precursor transcripts to be identified as hybrid targets.

"""
echo "${NOTES}"
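
The masking step described in the notes above can be illustrated with a simplified example. This is a sketch under stated assumptions, not the pipeline's implementation: it uses exact forward-strand substring search in place of a real alignment, and the function name `mask_exact_matches` and the choice of "N" as the mask character are hypothetical. It also upper-cases the transcript.

```python
def mask_exact_matches(transcript, mirnas, mask_char="N"):
    """Replace every exact occurrence of each mature miRNA in `transcript` with mask_char.

    Simplification: exact forward-strand matches only, standing in for alignment.
    """
    seq = transcript.upper()
    masked = list(seq)
    for mirna in mirnas:
        query = mirna.upper().replace("U", "T")  # accept RNA or DNA alphabet
        if not query:
            continue
        # Search the unmasked sequence so overlapping occurrences are all found.
        start = seq.find(query)
        while start != -1:
            masked[start:start + len(query)] = mask_char * len(query)
            start = seq.find(query, start + 1)
    return "".join(masked)


let7a = "TGAGGTAGTAGGTTGTATAGTT"
precursor_like = "AAA" + let7a + "CCC"  # toy transcript containing the mature sequence
print(mask_exact_matches(precursor_like, [let7a]))
```

After masking, the mature sequence no longer matches inside the toy transcript, so a read carrying it would have only the stand-alone miRNA entry as an exact target, while the flanking transcript sequence remains available.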

