5.3 visualization of 1000s of gene clusters using cgc or cgcg
cgc (pronounced see( 👀 )-gc; but standing for Collapsed Gene Cluster) and cgcg (pronounced kinda like squeegee; standing for Collapsed Gene Cluster Graph) are programs that create visuals from zol results, summarizing data from thousands of gene cluster instances. They feature multiple options to help users create highly customized visuals.
Note
Many great tools exist for creating micro-synteny visualizations (see this page for recommendations), but these approaches are inherently difficult to scale. zol can still be useful if those types of plots are of interest, in particular because it integrates skani for gene cluster dereplication.
We will continue with the example from the tutorial on using salt to demonstrate first cgcg and then cgc.
Important
cgcg uses the amazing gravis framework for network visualization by Robert Haas under the hood, which we recommend citing as well given the effort put into developing it. cgcg draws inspiration from the STRING web-application by the Bork lab and various pan-genome visualization software.
Note
cgc is inspired in part by Figure 3 from this study by Wang et al. 2018.
Picking up from having run fai in the salt tutorial to identify crt in Staphylococcus genomes, we can run zol as follows.
To showcase cgc's usefulness for comparative genomics (uh genetics), we run zol with all instances of the crt operon from S. aureus selected as a focal set and the remainder of instances of the operon from the genomes of other species as the comparator set.
# write a list of all S. aureus crt operon instances to a file to provide to zol next
ls -lht fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ | awk '{print $9}' | grep '_aureus_' > staph_aureus_instances.txt
# run zol - note, the "-q" option really speeds things up by using MUSCLE super5 mode for
# protein alignments
zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -o zol_Results/ -c 50 -f staph_aureus_instances.txt -q
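Since the focal list is produced by a grep on file names, a quick count of focal versus comparator instances can catch an empty or mis-specified list before zol is run. The following is a minimal sketch and not part of zol: the helper name count_sets is hypothetical, and it assumes the list holds one GenBank file name per line, as written by the command above.

```shell
# count_sets: hypothetical helper (not part of zol) that reports how many
# GenBank files are named in the focal list versus the rest of the directory.
count_sets() {
    cs_dir="$1"
    cs_list="$2"
    # total number of gene cluster GenBank files in the directory
    cs_total=$(ls "$cs_dir" | wc -l)
    # number of non-empty lines in the focal list
    cs_focal=$(grep -c . "$cs_list")
    echo "focal=$((cs_focal)) comparator=$((cs_total - cs_focal))"
}

# Only report counts if the fai results and the focal list are present.
gbk_dir="fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks"
if [ -d "$gbk_dir" ] && [ -f staph_aureus_instances.txt ]; then
    count_sets "$gbk_dir" staph_aureus_instances.txt
fi
```

If the focal count is 0, or equals the total, the comparative tracks would have nothing to compare against, so it is worth rechecking the grep pattern first.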
We can first try running cgc with default options, plus -p, which just requests producing plots in PNG format instead of PDF:
cgc -i zol_Results/ -o cgc_Results/ -p
This will give us a pretty bland plot:
Note, ortholog groups are scaled by their median length in basepairs and shown in the consensus order/direction across all gene cluster instances.
We can apply other options to create a more insightful figure as such:
cgc -i zol_Results/ -o cgc_Results/ -t conservation conservation focal_conservation tajimas_d entropy \
-c grey white_to_black light_to_dark_purple red_white_blue light_to_dark_green \
-rh 1.7 1 1 1 1 -sl -p -sc -ld 0.04 -lts 3.0 -b 0.5 -w 8 -nl \
-fl "Conservation amongst crt gene clusters in S. aureus genomes"
-t/--tracks: specifies which tracks/information from the zol analysis to include in the figure. Note, the first one is used for coloring the gene schematic at the bottom of the figure.
-c/--color_schemes: specifies color palettes for the tracks, in the same order as the tracks. Preset defaults are listed in the usage for cgc. More custom color schemes can also be provided through a simple configuration file.
-rh/--relative_heights: specifies the heights of the individual tracks relative to each other.
-sl/--show_labels: requests showing labels of ortholog groups.
-sc/--show_copy_count_status: indicates which ortholog groups are not found in single-copy across gene cluster instances. This results in the red dot above the gene schematic of one of the ortholog groups (OG_26).
-ld/--label_distance: controls the distance of the labels from the gene schematics.
-lts/--label_text_size: controls the label text size.
-b/--bottom_spacing: controls the white space at the bottom of the plot (to help prevent legends from being cut off).
-w/--width: sets the width of the plot in inches (default is 7).
-nl/--no_legend: requests that no legend be shown for the gene schematic coloring.
-fl/--focal_set_title: requests a custom title for the focal conservation track.
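The configuration file for custom color schemes is passed with -cc/--custom_color_scheme_file and, per the cgc usage, is tab-delimited with four columns: track name, low-value color, mid-value color and high-value color (hex codes). A minimal sketch follows. The file name custom_colors.tsv, the output directory and the hex codes are arbitrary choices for illustration; writing the file without a header row is an assumption, as is leaving out -c when -cc is given.

```shell
# Write a hypothetical custom color scheme file: one line per track,
# four tab-separated columns (track_name, low, mid, high hex colors).
# Assumption: no header row is expected.
printf 'conservation\t#FFFFFF\t#9E9E9E\t#000000\n'  > custom_colors.tsv
printf 'tajimas_d\t#B2182B\t#FFFFFF\t#2166AC\n'    >> custom_colors.tsv
printf 'entropy\t#E5F5E0\t#74C476\t#00441B\n'      >> custom_colors.tsv

# Run cgc with the custom file only if cgc and the zol results are available.
if command -v cgc >/dev/null 2>&1 && [ -d zol_Results ]; then
    cgc -i zol_Results/ -o cgc_Results_custom_colors/ \
        -t conservation tajimas_d entropy \
        -cc custom_colors.tsv -rh 1 1 1 -p
fi
```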
You will notice the ortholog group labels are sort of arbitrary. While you can go into the zol report and look at the functional annotations from the variety of databases to figure out what is what, you can also provide a FASTA file of proteins with labels as headers. cgc will match those sequences to the consensus sequences of ortholog groups and assign the corresponding labels in the figure. For instance, we can use the query FASTA for fai from the salt tutorial to see which ortholog groups correspond to the 5 crt genes:
cgc -i zol_Results/ -o cgc_Results/ -t conservation conservation focal_conservation tajimas_d entropy \
-c grey white_to_black light_to_dark_purple red_white_blue light_to_dark_green \
-rh 1.7 1 1 1 1 -sl -p -sc -ld 0.04 -lts 3.0 -b 0.5 -w 8 -nl \
-fl "Conservation amongst crt gene clusters in S. aureus genomes" \
-d crt_proteins.faa
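If you already know which label each ortholog group should carry, the cgc usage also lists -ol/--og_label_file, a tab-delimited file with two columns: ortholog group ID and desired label. A minimal sketch follows. The file name og_labels.tsv, the output directory, the OG_1 identifier and both label strings are placeholders (only OG_26 appears in the example above), and the absence of a header row is an assumption.

```shell
# Hypothetical label file: column 1 = ortholog group ID, column 2 = label.
# The IDs and labels are placeholders; take real IDs from the zol report.
printf 'OG_26\tlabel_A\n'  > og_labels.tsv
printf 'OG_1\tlabel_B\n'  >> og_labels.tsv

# Run cgc with the label file only if cgc and the zol results are available.
if command -v cgc >/dev/null 2>&1 && [ -d zol_Results ]; then
    cgc -i zol_Results/ -o cgc_Results_custom_labels/ -sl -p -ol og_labels.tsv
fi
```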
cgc writes an R script to the output directory and then runs it to create the visualization.
This allows users familiar with R, specifically ggplot2, to customize the script directly and simply rerun it to recreate the visual. With the zol conda env active, this can be done as follows:
Rscript cgc_Results/cgc_script.R
When -p is omitted, the resulting PDF can be opened in either Inkscape (free!!!) or Adobe Illustrator to further customize the figure, since it is a vector graphic.
cgc does not visualize all ortholog groups by default, only those found in greater than 10% of the gene cluster instances input into zol (the threshold is set with -m/--min_conservation). This could lead to misinterpretation of results and is something to be cautious of. For instance, depending on the orthology cutoffs for coverage and identity, ortholog groups might be split up more or less than users desire. It is thus important to assess the zol spreadsheet, where information on all gene clusters should be reported.
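To make the effect of this filter concrete, the sketch below applies a 0.1 cutoff to a toy table of ortholog groups and the proportion of gene clusters they occur in, then reruns cgc with a lower -m/--min_conservation so that rarer groups are drawn too. The table values, the file name toy_og_conservation.tsv and the output directory are made up for illustration. The real proportions are in the zol spreadsheet, and how cgc treats a value exactly at the cutoff is not covered here.

```shell
# Toy data (hypothetical): ortholog group ID and the proportion of gene
# cluster instances it is found in.
printf 'OG_1\t0.95\nOG_2\t0.40\nOG_3\t0.05\n' > toy_og_conservation.tsv

# Ortholog groups in the toy table that exceed the default cutoff of 0.1
shown=$(awk -F'\t' -v min=0.1 '$2 > min { print $1 }' toy_og_conservation.tsv)
echo "$shown"

# Lower the cutoff so that rarer ortholog groups are also visualized.
if command -v cgc >/dev/null 2>&1 && [ -d zol_Results ]; then
    cgc -i zol_Results/ -o cgc_Results_min_0.01/ -m 0.01 -p
fi
```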
In addition, zol features an option -d/--custom_database that works differently than the -d/--annotation_db_faa option in cgc. Specifically, the option in zol annotates comprehensively: the spreadsheet report shows, for all ortholog groups, the best hits (which meet a default E-value threshold) to protein sequences in the provided custom database, along with alignment E-values in parentheses. It does not just look for the best match between the custom database of proteins and the ortholog group consensus sequences. This option in zol should let you make assessments of ortholog group partitioning more easily!
usage: cgc [-h] -i ZOL_RESULTS_DIR -o OUTDIR [-t TRACKS [TRACKS ...]] [-c COLOR_SCHEMES [COLOR_SCHEMES ...]] [-cc CUSTOM_COLOR_SCHEME_FILE]
[-d ANNOTATION_DB_FAA] [-ol OG_LABEL_FILE] [-m MIN_CONSERVATION] [-rh RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...]] [-sl] [-ld LABEL_DISTANCE]
[-lts LABEL_TEXT_SIZE] [-sc] [-q SQUISH_FACTOR] [-b BOTTOM_SPACING] [-fl FOCAL_SET_TITLE] [-cl COMPARATOR_SET_TITLE]
[-lgs LEGEND_TITLE_SIZE] [-nl] [-p] [-l LENGTH] [-w WIDTH]
Program: cgc
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
collapsed gene clusters (cgc; pun: see(ಠಠ)-gc)
cgc is a visualization tool that takes as input zol results and uses R libraries for
plotting to generate a customizable figure that collapses information across hundreds
to thousands of gene cluster instances in a "collapsed" figure.
---------------------------------------------------------------------------------------------
Accepted tracks:
---------------------------------------------------------------------------------------------
- conservation, tajimas_d, entropy, upstream_entropy, median_beta_rd, max_beta_rd, median_gc
If comparative analysis was performed in zol, then additional tracks include:
- focal_conservation, alternate_conservation, fst
---------------------------------------------------------------------------------------------
Color scheme options:
---------------------------------------------------------------------------------------------
- white_to_black, red_white_blue, light_to_dark_gold, light_to_dark_blue, light_to_dark_green,
light_to_dark_purple, light_to_dark_red, grey, black, blue, red, purple, green, gold.
Color scheme detailed customization can be performed by providing a tab delimited file with
four columns: (1) track_name, (2) low-value color (hex code), (3) mid-value color (hex code),
(4) high-value color (hex-code).
---------------------------------------------------------------------------------------------
Example command:
---------------------------------------------------------------------------------------------
cgc -i zol_Results_Directory/ -t conservation tajimas_d -c white_to_black blue_white_red
options:
-h, --help show this help message and exit
-i ZOL_RESULTS_DIR, --zol_results_dir ZOL_RESULTS_DIR
Path to zol results directory.
-o OUTDIR, --outdir OUTDIR
Output directory for cgc analysis.
-t TRACKS [TRACKS ...], --tracks TRACKS [TRACKS ...]
Tracks to use, note the first track is used to color the gene
schematic directly. Should be provided in desired order (bottom
to top) [Default is "conservation tajimas_d"]
-c COLOR_SCHEMES [COLOR_SCHEMES ...], --color_schemes COLOR_SCHEMES [COLOR_SCHEMES ...]
Specify default color schemes available by name in desired
order matching --tracks [Default is "white_to_black blue_white_red"].
-cc CUSTOM_COLOR_SCHEME_FILE, --custom_color_scheme_file CUSTOM_COLOR_SCHEME_FILE
Tab-delimited file with 4 columns: (1) track_name, (2) low-value
color hexcode, (3) mid-value color hexcode, (4) high-value color hexcode.
-d ANNOTATION_DB_FAA, --annotation_db_faa ANNOTATION_DB_FAA
A FASTA file of proteins with headers to match to ortholog group
consensus sequences. Will then use headers of matches (up until first
space) as ortholog labels.
-ol OG_LABEL_FILE, --og_label_file OG_LABEL_FILE
Tab-delimited file with 2 columns: (1) ortholog group ID,
(2) desired label.
-m MIN_CONSERVATION, --min_conservation MIN_CONSERVATION
Minimum proportion of gene clusters an ortholog group needs to be found
in to be shown in the figure [Default is 0.1].
-rh RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...], --relative_heights RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...]
Relative heights of individual tracks [Default is "1 1"]
-sl, --show_labels Show labels which by default are OG identifiers, if --og_label_file
or --annotation_db_faa are not specified.
-ld LABEL_DISTANCE, --label_distance LABEL_DISTANCE
The distance between genes and labels [Default is 0.025].
-lts LABEL_TEXT_SIZE, --label_text_size LABEL_TEXT_SIZE
Label text size [Default is 2.0].
-sc, --show_copy_count_status
Show whether genes are single-copy or not.
-q SQUISH_FACTOR, --squish_factor SQUISH_FACTOR
A numeric value controlling vertical squishing of tracks in the plot - this
value should be negative to squish or positive to give more space and
proportional to the values provided for relative heights [Default is 0]
-b BOTTOM_SPACING, --bottom_spacing BOTTOM_SPACING
Extra space for bottom of the plot to make sure legend does not get cut off.
Should be proportional to values for relative heights [Default is 0.1].
-fl FOCAL_SET_TITLE, --focal_set_title FOCAL_SET_TITLE
Title to use for the focal set conservation track if requested.
If spaces present, surround by quotes [Default is "Focal Set Conservation"].
-cl COMPARATOR_SET_TITLE, --comparator_set_title COMPARATOR_SET_TITLE
Title to use for the alternate set conservation track if requested. If spaces
present, surround by quotes [Default is "Comparator Set Conservation"].
-lgs LEGEND_TITLE_SIZE, --legend_title_size LEGEND_TITLE_SIZE
Change legend title size [Default is 10].
-nl, --no_legend Do not show legend for coloring of the gene schematic.
-p, --png Create plot as PNG, default is PDF.
-l LENGTH, --length LENGTH
Specify the height/length of the heatmap plot in inches [Default is 7].
-w WIDTH, --width WIDTH
Specify the width of the heatmap plot in inches [Default is 7].