Alignment

By subject	By project
Alignment
Articles and resources
Annotation
bam, cram, bed, fastq
Coverage
Genome assembly
Phylogenetics
MISC
RNA-seq
Variant analysis
Visualization

Alignment

Jim Kent's utils UCSC
Synteny maps
alignment.avr_pw_dist.pl [alignment.fasta] [1=print distances] prints average pairwise distance for multiple alignment, and all pairwise distances if $2==1
alignment.check_3x.sh [alignment.fasta] prints how many sequences in an alignment are not 3x in length
alignment.check_conservative_site.sh
alignment.consensus.2seq.pl [alignment.fasta] prints a consensus of two DNA sequences
alignment.consensus.pl
alignment.count_gaps.sh counts % of positions with gaps in an alignment
alignment.detect_stops.pl [alignment.fasta] [--print_pos] prints internal stops and their positions
alignment.fa2fasta.pl converts fixed line lengths fasta to long line format.
alignment.fasta2slice.pl
alignment.filter_not3x.sh
alignment.fix_ends.sh
alignment.muscle.sh.
alignment.nucleotide_diversity.pl
alignment.occupancy.pl
alignment.protein2dna.sh [name.aln.fasta] reverse translates amino acid alignment into DNA alignment.
alignment.remove_ambiguity.pl
alignment.remove_cons_n_gaps.pl
alignment.remove_gaps.pl
alignment.remove_seqs_having_stops.pl
alignment.remove_sp4occupancy.pl
alignment.remove_stops_n_gaps2.pl removes stops, gaps, and sites around indels.
alignment.remove_stops_n_gaps.pl removes triplets with stops or gaps.
alignment.reverse_complement.pl
alignments.concatenate1.pl
alignments.concatenate.pl
alignment.sort_by_id.pl
alignment.substitution_profile.pl
alignment.tr_trv.pl
alignment.UCSC.dm6droSim1.sh builds a pairwise alignment of D.melanogaster and D.simulans.

Annotation

genes.R - retrieves ENSEMBL annotations from biomaRt, run Rscript genes.R --help

bam, cram, bed, fastq

jvarkit: bamstats04, bamstats05.
bedtools - genome arithmetics.
fastqc
trimmomatic
PRINSEQ
bam2fq.sh converts bam to fastq, uses samtools and bedtools
bam.reads_number.sh reports the number of paired reads in a bam file, uses samtools
bam.remove_region.sh removes reads from a bam file located at regions specified by a bed file. Badly filtered rRNA-depleted RNA-seq samples may have huge coverage of low complexity regions. It is better to filter those with prinseq, however sometimes it is necessary to remove a particular region.
bam.sort.sh - sorts a bam file before calculating coverage.
basespace-cli. New Illumina sequencers upload data to basespace cloud. bs utility copies data from cloud to HPC. To copy bcl files: bs cp //./Runs/<project_name>/Data .
bcl2fastq.sh converts raw bcl Illumina files to fastq.
cram2fq.sh converts cram to fastq, uses cramtools wrapper from bcbio
samtools quickcheck -vvvv [file.bam] checks the integrity of a bam file.

Coverage

bam.coverage.bamstats05.sh - very fast coverage calculation for a bam and a bed file with bamstats05 (2 min for WES bam).
bam.coverage.sh - slow base-wise coverage calculation with bedtools + median coverage statistics.

Genome assembly

The wisdom here is to avoid large genomes, polyploid genomes, and creating your own genome assembler. See my lecture (in Russian). Better to have multiple libraries for large genomes, with the good amount of mate pairs with 5k,10k,20k insert size. For a serious work a special computing node is necessary (1-2T RAM). Surprisingly, such a node is not that expensive: just buy a cheap big 4CPU SuperMicro server capable to carry up to 4-8T RAM, buy RAM, and insert it into the server. Avoid vendors and sales persons. Look for engineers to cooperate. For smaller genomes I prefer spades, for larger ones velvet + platanus. Remember to clean up reads (check for contamination, quality trimming).

genome_assembly.spades.sh runs spades assembler. Spades is the best assembler for genomes up to 100G.
genome_assembly.n50.sh [contigs.fasta] calculates N50 metrics.

Phylogenetics

Calculating and plotting phylogenetic trees is very rewarding. However, the tree may be misleading. Briefly, steps are:

build a good(!) alignment
concatenate many genes
fit the model with modeltest (GTR+Г+I is usually the winner for protein coding genes)
run RAXML and MrBayes and compare two trees
visualize with Dendroscope.
tree.raxml_boot.pbs infers a phylogenetic tree for cdna alignment with 100 bootstrap replicates. For details, read The Phylogenetic handbook.

MISC

hla.athlates.sh HLA typing with Athlates.

RNA-seq

bcbio rna-seq pipeline does STAR alignment, variant calling, expression measurements, isoform reconstruction, and quality metrics. Run sample-wise to speed up.
rnaseq.star.sh - 2pass on the fly STAR alignment for a single sample. Two passes are recommended to enhance alignment and calculation of counts for novel splice junctions (1st pass discovers junctions, 2nd pass makes an alignment).
rnaseq.feature_counts.sh [file.bam] calculates features (reads) for RPKM calculation in R, outputs length of the genes. TPMs are generally better (http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/), because you can compare expression values across samples, and they are calculated by default in bcbio rnaseq pipeline, but GTEX values unlike Protein Atlas values are in RPKMs, and still many people think in terms of RPKMs. GTEX has much more samples than HPA.
rnaseq.load_rpkm_counts.R calculates RPKM counts from feature_counts output.

De novo transcriptome assembly

Differential expression (DE)

edgeR.
rnaseq.dexpression.R - common utils for DE.
rnaseq.dexpression.katie.R - edgeR DE, batch effect correction, pheatmap, GO, pathways.
Gene Set Enrichment Analysis (GSEA). Run with all protein-coding genes, expression in cpm.

Micro-RNA

I did a couple of analysis in A.thaliana and S.tuberosum (potato). First I run smallRNA-seq pipeline from bcbio. It is much better tuned for human miRNA, where it does discovery of novel miRNA (because primary tools are developed mostly for human). For non-human species it does a good job of mapping to mirBase and has an extensive report in R.
For novel miRNA discovery I tried shortstack.

Splicing analysis

qorts and junctionseq do the job. Finally they produce html report for all differentially expressed genes, plotting exon usage, junction counts, novel juctions, as a nice plots. The problem is that RNA is quite noisy by its biological nature, so expect to have many events and many false positives. Additionaly you have tracks for IGV. They are useful, because default exon usage plots are not for exons from the canonical isoform, but for a set of chunks from the flattened annotation of many isoforms, and people are asking questions, why here are 125 exons not 108. It is hard to see the correspondence to exons of the canonical isoforms for long genes from those plots. Loading tracks in IGV helps: you see junctions, counts, and the alignment, people begin to understand you.

Additionaly I use sailfish relative isoform expression levels which bcbio rna-seq pipeline outputs by default.

qorts

rnaseq.qorts.makeflatgff.sh
rnaseq.qorts.qc.sh [file.bam] calculates counts from bam file
rnaseq.qorts.merge_novel_splices.sh merges junctions from all samples
rnaseq.qorts.get_novel_exons.sh [sample] [max_novel_exon_lengths] discovers novel exons using qorts files as input. A novel exon is reported when two novel junctions are found at the distance less then max_novel_exon_lengths.
rnaseq.splicing.junction_seq.R runs junction seq analysis in R
rnaseq.splicing.junction_seq.sh runs R script in the queue

Variant analysis, vcf files

For variant calling I use bcbio ensemble approach on per-family basis. In brief, 2 out of 4 (gatk-haplotype, samtools, freebayes, and platypus) algorithms should be voting for a variant to be called. This allows to achieve increased sensitivity required for research, compared to conservative strategy of the genetic testing laboratory.

VT toolkit,VT toolkit source:biallelic sites decomposition, info fields removal
bcftools
RTG: accurate vcf comparison, vcfstats.
vcf.rtg.validate.sh - how to run rtg vcfeval validation.
bcbio.pbs [project] submits bcbio project to the queue.
bcbio.cleanup.sh [family] cleans up after bcbio and prepares necessary tables for excel report generator.
bcbio.prepare_families.sh creates symlinks, folders, config files to run variant calling for multiple families.
bcbio.prepare_samples.sh creates symlinks, folder structure, config files to run multiple projects for bam file generation.
vcf.split_multisample.sh splits multisample vcfs.
vcf.kinship.R calculates kinship for a family using SNPRelate and plots a pedigree. Sometimes it is useful when studying cohorts of families. When samples are mislabeled (it happens), the whole study makes no sense, at least until relatives are placed together to call and interpret variants.

Gemini staff - for CHEO, MH, and Muscular projects

Gemini is a database and framework for variant analysis. Bcbio variant calling pipeline outputs variants in gemini format.

gemini.decompose.sh decomposes and normalizes variants with vt.
gemini.vep.sh annotates vcf file with VEP.
gemini.vep2gemini.sh loads VEP annotated vcf to the GEMINI database.
gemini.gemini2txt.sh dumps gemini database into txt file with decompressed genotypes.
cre.R creates report for import to excel from the gemini.txt dump.
gemini.from_rnaseq.sh creates gemini database and rare harmful variants report from bcbio's rna-seq pipeline output.
gemini.refseq.sh from vcf to refseq table for excel report generator.
gemini.vep.parse.pl parses VEP annotation.
gemini.vep.refseq.sh annotates variants with RefSeq transcripts (NMs) using VEP.
gemini.variant_impacts.sh extracts variant impacts for all transcripts.

Articles and resources

https://precision.fda.gov is an effort to standardize computational pipelines.
https://www.clinicalgenome.org/ is created by authors of ACMG guidelines.
Viral Genomics & Bioinformatics - Glasgow
Gene set enrichment analysis
Pedigree chart designer draws pedigree diagrams.
List of journals for the journal/article club

By project

6. RNA-seq in Mendelian muscle diseases [2016-2018]

Developed into crt
Published in AJHG

5. Malignant hyperthermia [2016]

4. CHEO [2016-2019]

Developed into cre

bcbio.prepare_families.sh creates symlinks, folders, config files to run variant calling for multiple families, or use bcbio.prepare_samples.sh if some samples need to re-generate bam files.
bcbio.array.pbs runs multiple bcbio projects as a job array. I have tried a parallel execution of BCBIO, with IPython, quite successfully, but finally with enigmatic faults,maybe because of the cluster issues.
cheo.check_if_done.sh [bcbio_job.output] check which bcbio jobs are done, useful when running 100x families
bcbio.cleanup.sh [family] cleans up after bcbio and prepares necessary tables for excel report generator
gemini.gemini2report.R generates reports for excel import
cheo.c4r_database.sh prepares variants from a family to be merged in a database seen_in_c4r
cheo.c4r_database_merge.pl merges variant evidence from many samples
gemini.gemini2report.R again, to add information about C4R frequencies.
cheo.filtered_vcf.sh prepares filtered vcf file corresponding to excel report.

3. Gammarus [2012-2016]

gam.filter.sh - filters reads
gam.blastp.pbs, gam.blastp.sh - blastp for othomcl
gam.ortho.sh - orthologization of transcript with orthomcl
gam.orthomcl.make_alignments.pl - writes fasta alignments for N species (1:1 orthologs) after orthomcl

2. Spectrum [2010-2012]

1. Reversals [2009-2012]

PAML

Name		Name	Last commit message	Last commit date
Latest commit History 878 Commits
assembly		assembly
bam		bam
barcodes		barcodes
bcbio		bcbio
blast		blast
clusters		clusters
cram		cram
crispr		crispr
dragen		dragen
fastq		fastq
fusions		fusions
gatkcnv		gatkcnv
gene_panels		gene_panels
paml		paml
phylogeny		phylogeny
phylogicndt		phylogicndt
purecn		purecn
quality_control		quality_control
scripts		scripts
variants		variants
vep		vep
visualization		visualization
LICENSE		LICENSE
README.md		README.md
journals.md		journals.md
variants.R		variants.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Alignment

Annotation

bam, cram, bed, fastq

Coverage

Genome assembly

Phylogenetics

MISC

RNA-seq

De novo transcriptome assembly

Differential expression (DE)

Micro-RNA

Splicing analysis

qorts

Variant analysis, vcf files

Gemini staff - for CHEO, MH, and Muscular projects

Articles and resources

By project

6. RNA-seq in Mendelian muscle diseases [2016-2018]

5. Malignant hyperthermia [2016]

4. CHEO [2016-2019]

3. Gammarus [2012-2016]

2. Spectrum [2010-2012]

1. Reversals [2009-2012]

About

Releases

Packages

Contributors 2

Languages

License

naumenko-sa/bioscripts

Folders and files

Latest commit

History

Repository files navigation

Alignment

Annotation

bam, cram, bed, fastq

Coverage

Genome assembly

Phylogenetics

MISC

RNA-seq

De novo transcriptome assembly

Differential expression (DE)

Micro-RNA

Splicing analysis

qorts

Variant analysis, vcf files

Gemini staff - for CHEO, MH, and Muscular projects

Articles and resources

By project

6. RNA-seq in Mendelian muscle diseases [2016-2018]

5. Malignant hyperthermia [2016]

4. CHEO [2016-2019]

3. Gammarus [2012-2016]

2. Spectrum [2010-2012]

1. Reversals [2009-2012]

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages