
5.1 tutorial for using zol with output from fast.genomics and CAGECAT

Rauf Salamzade edited this page May 16, 2024 · 1 revision

While prepTG and fai offer convenient options for finding homologous/orthologous instances of a gene cluster in a target database of genomes, searching large databases of genomes requires a server or access to a considerable amount of disk space.

CAGECAT and fast.genomics are great server-based alternatives for determining sets of homologous gene clusters/neighborhoods which can then be investigated in more depth using zol.

Investigating gene clusters from fast.genomics

fast.genomics (https://fast.genomics.lbl.gov/cgi/search.cgi) is a great web application by Price and Arkin for finding highly divergent homologs and gaining a phylogenetic perspective on their conservation.

Step 1: Open an example gene neighborhood

Go to the fast.genomics web application and click the "Gene neighborhoods" link to go to an example neighborhood.

Step 2: Export neighborhood information

Click on the "table_of_genes" link to download information for the genomes/proteins in the gene neighborhood.

Step 3: Transform neighborhood information into individual GenBank files and run zol

You could also use the gene neighborhood GenBank files as input to create manual visualizations via clinker or pyGenomeViz.

# with zol conda environment activated
# browse_ING2E5A_RS06865.tsv is the file downloaded in Step 2.
fastgenomicsNeighborhoodToGenBanks.py -i browse_ING2E5A_RS06865.tsv -o fast.genomics_neighborhood_output/

# run zol with 4 threads
zol -i fast.genomics_neighborhood_output/Gene_Cluster_GenBank_Files/ -o zol_Results/ -c 4
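
Between the two commands above, it can be worth confirming that the conversion step actually wrote files into the directory that zol will read. The following is a minimal sketch, not part of the zol suite: `count_files` is a hypothetical helper name, and it counts every regular file because the exact extension of the generated GenBank files is not assumed here.

```shell
# Hypothetical helper (not part of zol): count regular files in a directory.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Example, using the directory name from the commands above:
# count_files fast.genomics_neighborhood_output/Gene_Cluster_GenBank_Files/
```

A count of zero suggests the conversion step did not produce any gene cluster files, in which case running zol on that directory is unlikely to be useful.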


Investigating gene clusters from CAGECAT

This section will be expanded in the near future. For now, users can download the clusters identified by CAGECAT in GenBank format and either:

  1. run zol directly on them by providing the uncompressed directory as input, with the -r option to rename locus tags (needed because CAGECAT gene cluster GenBank files feature protein_id identifiers instead of locus_tag identifiers)

or

  2. use the cagecatProcess.py script to automatically recreate the GenBank files with the value of each protein_id qualifier copied over and assigned to a locus_tag qualifier
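
To make the two options more concrete, here is a hedged sketch. The commented zol command reuses the -i, -o and -c options shown earlier plus the -r option mentioned in the first option; the directory name `cagecat_clusters/` is an assumption. The function `copy_protein_id_to_locus_tag` is a hypothetical illustration of the kind of qualifier copy described in the second option, not the actual implementation of cagecatProcess.py. It assumes each protein_id qualifier sits on a single line and that the CDS features carry no locus_tag already.

```shell
# Option 1 (assumed directory name): let zol rename locus tags itself.
# zol -i cagecat_clusters/ -o zol_Results/ -r -c 4

# Option 2, illustrated: hypothetical helper that prints a GenBank file,
# adding a /locus_tag line with the same value after each /protein_id line.
copy_protein_id_to_locus_tag() {
  awk '
    { print }
    /^ +\/protein_id="[^"]+"/ {
      line = $0
      sub(/\/protein_id=/, "/locus_tag=", line)
      print line
    }
  ' "$1"
}

# Example (file names are assumptions):
# copy_protein_id_to_locus_tag cluster1.gbk > cluster1.locus_tagged.gbk
```

In practice the provided cagecatProcess.py script is the safer route, since it is maintained alongside zol; the sketch only shows what copying the qualifier value amounts to.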

Note: zol uses codon alignments for some statistics, and CDS features in gene cluster GenBank files exported from CAGECAT do not specify exon coordinates. As a result, going from CAGECAT to zol is not currently viable for fungal/eukaryotic investigations. The zol suite itself does support eukaryotic investigations; if you are interested in this, please see the prepTG and fai pages for further information.