
4. basic usage examples

Rauf Salamzade edited this page Oct 26, 2024 · 20 revisions

prepTG (preparing target genomes database)

prepTG parses and formats the information in provided GenBank files. If FASTA files are provided instead, it can run prodigal for gene calling (for bacteria only!) and then create the GenBank files.

Create a target genomes database from user-provided genomes (in FASTA or GenBank format) located in a folder:

prepTG -i Folder_with_Genomes_to_Search/ -o prepTG_DB/

Create a target genomes database from user-provided genomes (in FASTA or GenBank format) located in a folder, and also include all genomes assigned to a given bacterial genus or species in GTDB (e.g. Cutibacterium acnes):

prepTG -i Folder_with_User_Provided_Genomes/ -g "Cutibacterium acnes" -o prepTG_DB/

Caution

BE CAREFUL, WELL-SEQUENCED TAXA CAN RESULT IN LARGE PREPTG DATABASES AND LARGE FILES IN THE FAI RESULTS!!!

Download a pre-made target genomes database based on distinct representative genomes for a variety of taxa:

prepTG -d Cutibacterium -o prepTG_DB/

For additional details on prepTG (e.g. how to download genomes from NCBI), please check out the 1. more info on prepTG wiki page.

fai (finding homologous instances of query gene clusters)

1. Provide GenBank(s) of known instance(s) of gene cluster

fai -i Known_GeneCluster.gbk -tg prepTG_Database/ -o fai_Results/

Here, Known_GeneCluster.gbk is a GenBank file for a single reference gene cluster of interest. Multiple reference gene-cluster GenBanks can be provided. If multiple GenBanks are provided, homolog groups are identified between them to simplify the DIAMOND search operation.

MIBiG users, rejoice: you can download a GenBank for any entry using the "Download Cluster GenBank file" link. This input format was designed with users of BGC prediction software, such as antiSMASH or GECCO, in mind.

2. Provide gene-cluster coordinates along a FASTA reference genome

fai -r Reference.fasta -rc scaffold01 -rs 40201 -re 45043 -tg prepTG_Database/ -o fai_Results/

Provide the coordinates of a gene-cluster along a reference genome. This option is likely the most compatible with sources of gene-clusters from various websites such as ICEberg, IslandViewer, and PHASTER.
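Checking the coordinates before launching fai can catch a mistyped scaffold name or swapped start/end values. The sketch below is not part of fai: scaffold_length and coords_ok are hypothetical helper names, and it assumes 1-based inclusive coordinates and that the scaffold identifier is the first whitespace-delimited token of the FASTA header line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not provided by fai) for checking -rc/-rs/-re values.

# Print the length in bp of one FASTA record, matched on the first
# whitespace-delimited token of its header line. Prints 0 if not found.
scaffold_length() {
    local fasta="$1" scaffold="$2"
    awk -v id="$scaffold" '
        /^>/ { name = substr($1, 2); next }
        name == id { gsub(/[ \t\r]/, ""); len += length($0) }
        END { print len + 0 }
    ' "$fasta"
}

# Succeed only if the scaffold exists and 1 <= start < end <= scaffold length.
# Assumption: coordinates are 1-based and inclusive.
coords_ok() {
    local fasta="$1" scaffold="$2" start="$3" end="$4" len
    [ -r "$fasta" ] || return 1
    len=$(scaffold_length "$fasta" "$scaffold")
    [ "$len" -gt 0 ] && [ "$start" -ge 1 ] &&
        [ "$start" -lt "$end" ] && [ "$end" -le "$len" ]
}
```

With these defined, the fai command shown above could be guarded by running coords_ok Reference.fasta scaffold01 40201 45043 first and only starting fai if it succeeds.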

3. Provide a set of proteins from the gene cluster that should be co-clustered (similar to cblaster!)

fai -pq Gene-Cluster_Query_Proteins.faa -tg prepTG_Database/ -o fai_Results/

In this format a FASTA file with protein sequences belonging to the gene-cluster is used for searching in target genomes. This is the same format as what cblaster uses. Note, this input format does not allow for assessment of syntenic similarity between the query gene-cluster(s) and homologous instances identified in target genomes.

4. Provide a single query protein and use it to extract the surrounding +/-20 kb of homologs in target genomes (inspired by CORASON)

Note, this option is still experimental. The concept of examining variability in the genomic context of a focal gene stems from CORASON, but fai does not use reciprocal best hits (RBH); it relies only on an adjustable E-value threshold to identify homologs in target genomes. Unlike the other three ways of running fai, where syntenic support can be used to better infer orthology, this mode is more limited and can only infer homology. We might eventually pair the -sq argument with another argument that provides a reference genome for the single query protein.

fai -sq Single_Query_Protein.faa -tg prepTG_Database/ -o fai_Results/ -f 20000

For additional details on fai (e.g. how it relates to cblaster and lsaBGC-Expansion, and the plots it can create to assess the homologous gene clusters detected), please check out the 2. more info on fai wiki page. For alternative tools, in particular webservers, which can be used in place of fai, check out the 5.1 tutorial for using zol with output from fast.genomics and CAGECAT wiki page.

zol (summarize information across homologous instances of a gene cluster)

zol -i Genbanks_Directory/ -o zol_Results/

If running after fai, the input directory would be the Homologous_Gene_Cluster_GenBanks/ subdirectory of fai's Final_Results/ directory. A typical run through the workflow would therefore involve a command similar to the following:

zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -o zol_Results/

By default, zol scales to around 100 to 300 distinct gene clusters. If you have more and suspect some redundancy, you can request dereplication via the -d option: very similar gene-cluster instances are collapsed, only representative gene clusters are used to determine ortholog groups, and the results are then expanded back out to compute evolutionary stats! More recently, CD-HIT was also introduced as a way to determine protein clusters in place of ortholog groups if memory limits are a significant issue.
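One way to decide whether to add -d is to count the GenBank files in the input directory first. The sketch below only prints a suggested command and never runs zol. count_genbanks and zol_command are hypothetical helper names, the .gbk extension is an assumption about how the files are named, and the default limit of 300 is simply the upper end of the range mentioned above, not a hard cutoff enforced by zol.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not provided by zol) for choosing whether to pass -d.

# Count files ending in .gbk directly inside a directory (assumed extension).
count_genbanks() {
    local dir="$1"
    find "$dir" -maxdepth 1 -type f -name '*.gbk' 2>/dev/null | wc -l | tr -d ' '
}

# Print a suggested zol command; append -d when the number of GenBanks
# exceeds the limit (third argument, defaulting to 300).
zol_command() {
    local dir="$1" out="$2" limit="${3:-300}" n
    n=$(count_genbanks "$dir")
    if [ "$n" -eq 0 ]; then
        echo "no .gbk files found in $dir" >&2
        return 1
    elif [ "$n" -gt "$limit" ]; then
        echo "zol -i $dir -o $out -d"
    else
        echo "zol -i $dir -o $out"
    fi
}
```

For example, zol_command fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ zol_Results/ prints the command with or without -d depending on how many gene clusters fai reported.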

zol produces an XLSX spreadsheet report (within the sub-directory Final_Results/) where each row corresponds to an individual ortholog group/homolog group and columns provide basic stats, consensus order, annotation information from multiple databases, and evolutionary/selection-inference statistics. Coloring is automatically applied to select quantitative fields so that users can more easily assess trends.

Important

We recommend providing a custom annotation database via the -cd argument, as a FASTA file of protein sequences whose headers are unique identifiers. This makes it easier to link the ortholog groups to known genes from a well-studied instance of the gene cluster, if one exists!

Annotation databases include: KEGG, NCBI's PGAP, PaperBLAST, VOGs (phage related genes), MIBiG (genes from characterized BGCs), VFDB (virulence factors), CARD (antibiotic resistance), ISfinder (transposons/insertion-sequences).

For details on the stats/annotations zol infers, please refer to the zol wiki page.


Use for dereplication of gene cluster GenBanks to ease visualization with clinker or CORASON

Another application of zol is to use it for preliminary dereplication for visualization with clinker, CORASON, etc.

zol uses skani to perform dereplication with adjustable options (see zol --help).

Note, skani estimates of ANI (average nucleotide identity) and AF (aligned fraction) become less reliable when working with contigs <10 kb, so zol-based dereplication should only be used for gene clusters 10 kb or larger.
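To spot gene clusters that fall under the 10 kb guideline before requesting dereplication, the sequence length can be read from the LOCUS line of each GenBank file. This is a minimal sketch, not part of zol: genbank_length and list_short_genbanks are hypothetical helper names, files are assumed to end in .gbk, the length is taken as the value preceding "bp" on each LOCUS line, and lengths are summed when a file holds several records.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not provided by zol) for flagging short gene clusters.

# Print the total sequence length of a GenBank file, taken from the value
# preceding "bp" on each LOCUS line and summed across records.
genbank_length() {
    awk '
        $1 == "LOCUS" {
            for (i = 3; i <= NF; i++)
                if ($i == "bp") { total += $(i - 1); break }
        }
        END { print total + 0 }
    ' "$1"
}

# List (tab-separated path and length) every .gbk file in a directory that
# is shorter than the minimum length (second argument, default 10000 bp).
list_short_genbanks() {
    local dir="$1" min_bp="${2:-10000}" f len
    for f in "$dir"/*.gbk; do
        [ -e "$f" ] || continue   # no matching files
        len=$(genbank_length "$f")
        if [ "$len" -lt "$min_bp" ]; then
            printf '%s\t%s\n' "$f" "$len"
        fi
    done
}
```

Running list_short_genbanks GenBanks_Directory before the dereplication command gives a list of inputs for which the skani-based estimates may be less trustworthy.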

# Run zol with dereplication requested
zol -i GenBanks_Directory/ -o zol_Results/ -d 

# Reference dereplicated representative GenBanks/gene clusters as input for clinker analysis
clinker zol_Results/Dereplicated_GenBanks/*.gbk -p clinker_visualization.html

cgc and cgcg (visualize zol results for 100s to 1000s of gene cluster instances)

If tables are not your thing or you want to incorporate the zol data into a figure, you can use cgc or cgcg. In particular, cgcg will create an HTML file with an interactive network of ortholog groups, in which edges represent gene-order information and nodes are colored by a quantitative evolutionary statistic:

cgcg -i zol_Results/ -o cgcg_Results/
