By subject | By project |
---|---|
Alignment | |
Articles and resources | |
Annotation | |
bam, cram, bed, fastq | |
Coverage | |
Genome assembly | |
Phylogenetics | |
MISC | |
RNA-seq | |
Variant analysis | |
Visualization |
- Jim Kent's utils UCSC
- Synteny maps
- alignment.avr_pw_dist.pl [alignment.fasta] [1=print distances] prints average pairwise distance for multiple alignment, and all pairwise distances if $2==1
- alignment.check_3x.sh [alignment.fasta] prints how many sequences in an alignment are not 3x in length
- alignment.check_conservative_site.sh
- alignment.consensus.2seq.pl [alignment.fasta] prints a consensus of two DNA sequences
- alignment.consensus.pl
- alignment.count_gaps.sh counts % of positions with gaps in an alignment
- alignment.detect_stops.pl [alignment.fasta] [--print_pos] prints internal stops and their positions
- alignment.fa2fasta.pl converts fixed line lengths fasta to long line format.
- alignment.fasta2slice.pl
- alignment.filter_not3x.sh
- alignment.fix_ends.sh
- alignment.muscle.sh.
- alignment.nucleotide_diversity.pl
- alignment.occupancy.pl
- alignment.protein2dna.sh [name.aln.fasta] reverse translates amino acid alignment into DNA alignment.
- alignment.remove_ambiguity.pl
- alignment.remove_cons_n_gaps.pl
- alignment.remove_gaps.pl
- alignment.remove_seqs_having_stops.pl
- alignment.remove_sp4occupancy.pl
- alignment.remove_stops_n_gaps2.pl removes stops, gaps, and sites around indels.
- alignment.remove_stops_n_gaps.pl removes triplets with stops or gaps.
- alignment.reverse_complement.pl
- alignments.concatenate1.pl
- alignments.concatenate.pl
- alignment.sort_by_id.pl
- alignment.substitution_profile.pl
- alignment.tr_trv.pl
- alignment.UCSC.dm6droSim1.sh builds a pairwise alignment of D.melanogaster and D.simulans.
- genes.R - retrieves ENSEMBL annotations from biomaRt, run
Rscript genes.R --help
- jvarkit: bamstats04, bamstats05.
- bedtools - genome arithmetics.
- fastqc
- trimmomatic
- PRINSEQ
- bam2fq.sh converts bam to fastq, uses samtools and bedtools
- bam.reads_number.sh reports the number of paired reads in a bam file, uses samtools
- bam.remove_region.sh removes reads from a bam file located at regions specified by a bed file. Badly filtered rRNA-depleted RNA-seq samples may have huge coverage of low complexity regions. It is better to filter those with prinseq, however sometimes it is necessary to remove a particular region.
- bam.sort.sh - sorts a bam file before calculating coverage.
- basespace-cli. New Illumina sequencers upload data to basespace cloud.
bs utility copies data from cloud to HPC. To copy bcl files:
bs cp //./Runs/<project_name>/Data .
- bcl2fastq.sh converts raw bcl Illumina files to fastq.
- cram2fq.sh converts cram to fastq, uses cramtools wrapper from bcbio
samtools quickcheck -vvvv [file.bam]
checks the integrity of a bam file.
- bam.coverage.bamstats05.sh - very fast coverage calculation for a bam and a bed file with bamstats05 (2 min for WES bam).
- bam.coverage.sh - slow base-wise coverage calculation with bedtools + median coverage statistics.
The wisdom here is to avoid large genomes, polyploid genomes, and creating your own genome assembler. See my lecture (in Russian). Better to have multiple libraries for large genomes, with the good amount of mate pairs with 5k,10k,20k insert size. For a serious work a special computing node is necessary (1-2T RAM). Surprisingly, such a node is not that expensive: just buy a cheap big 4CPU SuperMicro server capable to carry up to 4-8T RAM, buy RAM, and insert it into the server. Avoid vendors and sales persons. Look for engineers to cooperate. For smaller genomes I prefer spades, for larger ones velvet + platanus. Remember to clean up reads (check for contamination, quality trimming).
- genome_assembly.spades.sh runs spades assembler. Spades is the best assembler for genomes up to 100G.
- genome_assembly.n50.sh [contigs.fasta] calculates N50 metrics.
Calculating and plotting phylogenetic trees is very rewarding. However, the tree may be misleading. Briefly, steps are:
- build a good(!) alignment
- concatenate many genes
- fit the model with modeltest (GTR+Г+I is usually the winner for protein coding genes)
- run RAXML and MrBayes and compare two trees
- visualize with Dendroscope.
- tree.raxml_boot.pbs infers a phylogenetic tree for cdna alignment with 100 bootstrap replicates. For details, read The Phylogenetic handbook.
- hla.athlates.sh HLA typing with Athlates.
- bcbio rna-seq pipeline does STAR alignment, variant calling, expression measurements, isoform reconstruction, and quality metrics. Run sample-wise to speed up.
- rnaseq.star.sh - 2pass on the fly STAR alignment for a single sample. Two passes are recommended to enhance alignment and calculation of counts for novel splice junctions (1st pass discovers junctions, 2nd pass makes an alignment).
- rnaseq.feature_counts.sh [file.bam] calculates features (reads) for RPKM calculation in R, outputs length of the genes. TPMs are generally better (http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/), because you can compare expression values across samples, and they are calculated by default in bcbio rnaseq pipeline, but GTEX values unlike Protein Atlas values are in RPKMs, and still many people think in terms of RPKMs. GTEX has much more samples than HPA.
- rnaseq.load_rpkm_counts.R calculates RPKM counts from feature_counts output.
- edgeR.
- rnaseq.dexpression.R - common utils for DE.
- rnaseq.dexpression.katie.R - edgeR DE, batch effect correction, pheatmap, GO, pathways.
- Gene Set Enrichment Analysis (GSEA). Run with all protein-coding genes, expression in cpm.
I did a couple of analysis in A.thaliana and S.tuberosum (potato). First I run smallRNA-seq pipeline from bcbio.
It is much better tuned for human miRNA, where it does discovery of novel miRNA (because primary tools are developed mostly for human).
For non-human species it does a good job of mapping to mirBase and has an extensive report in R.
For novel miRNA discovery I tried shortstack.
qorts and junctionseq do the job. Finally they produce html report for all differentially expressed genes, plotting exon usage, junction counts, novel juctions, as a nice plots. The problem is that RNA is quite noisy by its biological nature, so expect to have many events and many false positives. Additionaly you have tracks for IGV. They are useful, because default exon usage plots are not for exons from the canonical isoform, but for a set of chunks from the flattened annotation of many isoforms, and people are asking questions, why here are 125 exons not 108. It is hard to see the correspondence to exons of the canonical isoforms for long genes from those plots. Loading tracks in IGV helps: you see junctions, counts, and the alignment, people begin to understand you.
Additionaly I use sailfish relative isoform expression levels which bcbio rna-seq pipeline outputs by default.
- rnaseq.qorts.makeflatgff.sh
- rnaseq.qorts.qc.sh [file.bam] calculates counts from bam file
- rnaseq.qorts.merge_novel_splices.sh merges junctions from all samples
- rnaseq.qorts.get_novel_exons.sh [sample] [max_novel_exon_lengths] discovers novel exons using qorts files as input. A novel exon is reported when two novel junctions are found at the distance less then max_novel_exon_lengths.
- rnaseq.splicing.junction_seq.R runs junction seq analysis in R
- rnaseq.splicing.junction_seq.sh runs R script in the queue
For variant calling I use bcbio ensemble approach on per-family basis. In brief, 2 out of 4 (gatk-haplotype, samtools, freebayes, and platypus) algorithms should be voting for a variant to be called. This allows to achieve increased sensitivity required for research, compared to conservative strategy of the genetic testing laboratory.
- VT toolkit,VT toolkit source:biallelic sites decomposition, info fields removal
- bcftools
- RTG: accurate vcf comparison, vcfstats.
- vcf.rtg.validate.sh - how to run rtg vcfeval validation.
- bcbio.pbs [project] submits bcbio project to the queue.
- bcbio.cleanup.sh [family] cleans up after bcbio and prepares necessary tables for excel report generator.
- bcbio.prepare_families.sh creates symlinks, folders, config files to run variant calling for multiple families.
- bcbio.prepare_samples.sh creates symlinks, folder structure, config files to run multiple projects for bam file generation.
- vcf.split_multisample.sh splits multisample vcfs.
- vcf.kinship.R calculates kinship for a family using SNPRelate and plots a pedigree. Sometimes it is useful when studying cohorts of families. When samples are mislabeled (it happens), the whole study makes no sense, at least until relatives are placed together to call and interpret variants.
Gemini is a database and framework for variant analysis. Bcbio variant calling pipeline outputs variants in gemini format.
- gemini.decompose.sh decomposes and normalizes variants with vt.
- gemini.vep.sh annotates vcf file with VEP.
- gemini.vep2gemini.sh loads VEP annotated vcf to the GEMINI database.
- gemini.gemini2txt.sh dumps gemini database into txt file with decompressed genotypes.
- cre.R creates report for import to excel from the gemini.txt dump.
- gemini.from_rnaseq.sh creates gemini database and rare harmful variants report from bcbio's rna-seq pipeline output.
- gemini.refseq.sh from vcf to refseq table for excel report generator.
- gemini.vep.parse.pl parses VEP annotation.
- gemini.vep.refseq.sh annotates variants with RefSeq transcripts (NMs) using VEP.
- gemini.variant_impacts.sh extracts variant impacts for all transcripts.
- https://precision.fda.gov is an effort to standardize computational pipelines.
- https://www.clinicalgenome.org/ is created by authors of ACMG guidelines.
- Viral Genomics & Bioinformatics - Glasgow
- Gene set enrichment analysis
- Pedigree chart designer draws pedigree diagrams.
- List of journals for the journal/article club
5. Malignant hyperthermia [2016]
Developed into cre
- bcbio.prepare_families.sh creates symlinks, folders, config files to run variant calling for multiple families, or use bcbio.prepare_samples.sh if some samples need to re-generate bam files.
- bcbio.array.pbs runs multiple bcbio projects as a job array. I have tried a parallel execution of BCBIO, with IPython, quite successfully, but finally with enigmatic faults,maybe because of the cluster issues.
- cheo.check_if_done.sh [bcbio_job.output] check which bcbio jobs are done, useful when running 100x families
- bcbio.cleanup.sh [family] cleans up after bcbio and prepares necessary tables for excel report generator
- gemini.gemini2report.R generates reports for excel import
- cheo.c4r_database.sh prepares variants from a family to be merged in a database seen_in_c4r
- cheo.c4r_database_merge.pl merges variant evidence from many samples
- gemini.gemini2report.R again, to add information about C4R frequencies.
- cheo.filtered_vcf.sh prepares filtered vcf file corresponding to excel report.
- gam.filter.sh - filters reads
- gam.blastp.pbs, gam.blastp.sh - blastp for othomcl
- gam.ortho.sh - orthologization of transcript with orthomcl
- gam.orthomcl.make_alignments.pl - writes fasta alignments for N species (1:1 orthologs) after orthomcl
- PAML