5.3 visualization of 1000s of gene clusters using cgc or cgcg

Rauf Salamzade edited this page Oct 26, 2024 · 16 revisions

cgc (pronounced see( 👀 )-gc, but standing for Collapsed Gene Cluster) and cgcg (pronounced kinda like squeegee; standing for Collapsed Gene Cluster Graph) are programs that create visuals from zol results, summarizing data from thousands of gene cluster instances. They feature multiple options to help users create highly customized visuals.

Note

Many great tools exist for creating micro-synteny visualizations (see this page for recommendations), but these approaches are inherently difficult to scale. zol can still be useful if those types of plots are of interest, in particular because it integrates skani for gene cluster dereplication.

Example: Visualizing over 300 instances of the crt operon across the Staphylococcus genus

We will continue with the example from the tutorial on using salt to demonstrate first cgcg and then cgc.

cgcg demonstration

Important

cgcg uses the amazing gravis framework for network visualization by Robert Haas under the hood, which we recommend citing as well given the effort put into developing it. cgcg draws inspiration from the STRING web-application by the Bork lab and various pan-genome visualization software.

cgc demonstration

Note

cgc is inspired in part by Figure 3 from this study by Wang et al. 2018.

1. Run zol with comparative analysis requested

Picking up from having run fai in the salt tutorial to identify crt in Staphylococcus genomes, we can run zol as follows.

To showcase cgc's usefulness for comparative genomics (uh genetics), we run zol with all instances of the crt operon from S. aureus selected as a focal set and the remainder of instances of the operon from the genomes of other species as the comparator set.

# write a list of all S. aureus crt operon instances to a file to provide to zol next
ls -lht fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ | awk '{print $9}' | grep '_aureus_'  > staph_aureus_instances.txt

# run zol - note, the "-q" option really speeds things up by using MUSCLE super5 mode for
# protein alignments
zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -o zol_Results/ -c 50 -f staph_aureus_instances.txt -q
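
The listing step can also be written without parsing `ls -l` columns, which depends on the exact layout of the `ls` output. The following is a minimal sketch, assuming (as the command above does) that bare file names containing `_aureus_` are what should end up in the list. The helper name `list_focal_instances` is made up for illustration and is not part of zol.

```shell
# Hypothetical helper (not part of zol): print the names of files in a
# directory whose name contains a given substring, one per line and
# without the directory prefix, like the ls/awk/grep version above.
list_focal_instances() {
    dir="$1"
    pattern="$2"
    for f in "$dir"/*"$pattern"*; do
        # skip the unexpanded glob when nothing matches
        [ -e "$f" ] || continue
        basename "$f"
    done
}

# Only run against the fai results if they exist in the working directory.
gbk_dir="fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks"
if [ -d "$gbk_dir" ]; then
    list_focal_instances "$gbk_dir" "_aureus_" > staph_aureus_instances.txt
    # sanity check: number of focal instances selected
    wc -l < staph_aureus_instances.txt
fi
```

Checking the line count before running zol is a cheap way to catch an empty or unexpectedly short focal set.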

2. Run cgc in default mode

We can first try running cgc with default options (plus -p, which just requests that the plot be produced in PNG format instead of PDF):

cgc -i zol_Results/ -o cgc_Results/ -p

This will give us a pretty bland plot:

cgc_plot

Note, ortholog groups are scaled by their median length in basepairs and shown in the consensus order/direction across all gene cluster instances.

3. Spruce it up

We can apply other options to create a more insightful figure as such:

cgc -i zol_Results/ -o cgc_Results/ -t conservation conservation focal_conservation tajimas_d entropy \
    -c grey white_to_black light_to_dark_purple red_white_blue light_to_dark_green \
    -rh 1.7 1 1 1 1  -sl -p -sc -ld 0.04 -lts 3.0 -b 0.5 -w 8 -nl \
    -fl "Conservation amongst crt gene clusters in S. aureus genomes"

Argument breakdown:

  • -t/--tracks: is used to specify which tracks/information from zol analysis you want to include in the figure. Note, the first one is used for coloring the gene schematic at the bottom of the figure.
  • -c/--color_schemes: is used to specify color palettes for the tracks, in the same order as the tracks. There are preset defaults available that are listed in the usage for cgc. More custom color schemes can also be provided through a simple configuration file (-cc/--custom_color_scheme_file).
  • -rh/--relative_heights: is used to specify the relative heights of the individual tracks to each other.
  • -sl/--show_labels is used to request showing labels of ortholog groups.
  • -sc/--show_copy_count_status is used to indicate which ortholog groups are not found in single-copy across gene cluster instances. This results in the red dot above the gene schematic of one of the ortholog groups (OG_26).
  • -ld/--label_distance is used to control the distance of the labels to the gene schematics.
  • -lts/--label_text_size is used to control the label text size.
  • -b/--bottom_spacing is used to control the white space at the bottom of the plot (to help prevent legends from being cut off).
  • -w/--width is the width of the plot in inches (default is 7).
  • -nl/--no_legend is to request no legend for the gene schematic coloring to be shown.
  • -fl/--focal_set_title is to request a custom title for the focal conservation track.

cgc_plot
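
For colors beyond the presets, the usage text describes the -cc file as tab-delimited with four columns: track name, low-value color, mid-value color, and high-value color (hex codes). A minimal sketch of writing such a file follows. The hex codes and the file name `custom_colors.tsv` are arbitrary choices for illustration, and since the usage text does not say whether a header row is expected, none is written here.

```shell
# Write a tab-delimited custom color scheme file, one row per track:
# track_name, low-value color, mid-value color, high-value color.
# The hex codes and the file name are arbitrary illustrative choices.
printf '%s\t%s\t%s\t%s\n' \
    conservation '#FFFFFF' '#999999' '#000000' \
    tajimas_d    '#B2182B' '#F7F7F7' '#2166AC' \
    > custom_colors.tsv

# Confirm every row has exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad + 0 }' custom_colors.tsv && echo "ok"
```

The file would then be passed along the lines of `cgc -i zol_Results/ -o cgc_Results/ -cc custom_colors.tsv`. How -cc interacts with -c for the same track is not covered on this page, so check the resulting figure.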

4. Add labels with meaning

You will notice the ortholog group labels are sort of arbitrary. While you can go in the zol report and look at the functional annotations from the variety of databases to figure out what is what, you can also provide a FASTA file of proteins with labels as headers. cgc will perform orthology matching of those sequences to consensus sequences of ortholog groups to then assign appropriate labels in the figure. For instance, we can use the query FASTA for fai from the salt tutorial to see which of the genes correspond to the 5 crt genes as such:

cgc -i zol_Results/ -o cgc_Results/ -t conservation conservation focal_conservation tajimas_d entropy \
    -c grey white_to_black light_to_dark_purple red_white_blue light_to_dark_green \
    -rh 1.7 1 1 1 1  -sl -p -sc -ld 0.04 -lts 3.0 -b 0.5 -w 8 -nl \
    -fl "Conservation amongst crt gene clusters in S. aureus genomes" \
    -d crt_proteins.faa 

cgc_plot
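
If you already know which label each ortholog group should get, the usage text also lists -ol/--og_label_file, a tab-delimited file with two columns: ortholog group ID and desired label. A minimal sketch of preparing such a file is shown here. OG_26 is the group mentioned earlier, while OG_27, both labels, and the file name `og_labels.tsv` are placeholders to be replaced with values from your own zol report.

```shell
# Write a two-column, tab-delimited label file: ortholog group ID, desired label.
# OG_27 and both labels are placeholders; take real IDs from your zol report.
printf '%s\t%s\n' \
    OG_26 geneA \
    OG_27 geneB \
    > og_labels.tsv

# Show the result for a quick visual check.
cat og_labels.tsv
```

This would be supplied as `-ol og_labels.tsv` together with `-sl`, in place of `-d crt_proteins.faa`.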

More manual editing of the figure:

1. Edit the R script directly and rerun it (requires familiarity with R)

cgc writes an R script in the resulting output directory that it then runs to create the visualization.

This allows users familiar with R, specifically ggplot2, to customize the script directly and simply rerun it to recreate the visual. With the zol conda env active, this can be done as follows:

Rscript cgc_Results/cgc_script.R
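
Before editing, it can help to keep an untouched copy of the generated script so the original figure can always be regenerated. This is a minimal sketch. The helper `backup_script` and the `.orig.R` naming are made up for illustration, and whether rerunning cgc itself overwrites the script is not stated here.

```shell
# Hypothetical helper (not part of zol): copy a script to <name>.orig.R
# once, never overwriting an existing backup, so the generated version
# is preserved across repeated edits.
backup_script() {
    src="$1"
    bak="${src%.R}.orig.R"
    [ -f "$src" ] || return 1
    [ -e "$bak" ] || cp "$src" "$bak"
}

# Back up the generated script, then (after editing it) rerun it.
if backup_script cgc_Results/cgc_script.R; then
    Rscript cgc_Results/cgc_script.R
fi
```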

2. Edit the PDF as a vector graphic using Inkscape or Adobe Illustrator

The resulting PDF (the default output format when -p is not given) can be opened in either Inkscape (free!!!) or Adobe Illustrator to further customize the figure, since it is a vector graphic.

Caution: Assessing orthology quality via direct assessment of zol's spreadsheet still advised

cgc does not visualize all ortholog groups by default, only those found in greater than 10% of the gene cluster instances input into zol (this cutoff can be changed with -m/--min_conservation). This could lead to misinterpretation of results and is something to be cautious of. For instance, depending on the orthology cutoffs for coverage and identity, ortholog groups might be more or less split up than users desire. It is thus important to assess the zol spreadsheet, where information on all ortholog groups should be reported.

In addition, zol features an option -d/--custom_database that works differently than the -d/--annotation_db_faa option in cgc. Specifically, the option in zol will annotate comprehensively and the spreadsheet report will show the best hits (which meet a default E-value threshold) for all ortholog groups to protein sequences in the provided custom database (along with alignment E-values in parentheses). It will not just look for the best match between the custom database of proteins and the ortholog group consensus sequences. This option in zol should let you make assessments of ortholog group partitioning more easily!

Usage:

usage: cgc [-h] -i ZOL_RESULTS_DIR -o OUTDIR [-t TRACKS [TRACKS ...]] [-c COLOR_SCHEMES [COLOR_SCHEMES ...]] [-cc CUSTOM_COLOR_SCHEME_FILE]
           [-d ANNOTATION_DB_FAA] [-ol OG_LABEL_FILE] [-m MIN_CONSERVATION] [-rh RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...]] [-sl] [-ld LABEL_DISTANCE]
           [-lts LABEL_TEXT_SIZE] [-sc] [-q SQUISH_FACTOR] [-b BOTTOM_SPACING] [-fl FOCAL_SET_TITLE] [-cl COMPARATOR_SET_TITLE]
           [-lgs LEGEND_TITLE_SIZE] [-nl] [-p] [-l LENGTH] [-w WIDTH]

        Program: cgc
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        collapsed gene clusters (cgc; pun: see(ಠಠ)-gc)

        cgc is a visualization tool that takes as input zol results and uses R libraries for
        plotting to generate a customizable figure that collapses information across hundreds
        to thousands of gene cluster instances in a "collapsed" figure.

        ---------------------------------------------------------------------------------------------
        Accepted tracks:
        ---------------------------------------------------------------------------------------------
    - conservation, tajimas_d, entropy, upstream_entropy, median_beta_rd, max_beta_rd, median_gc
                                                                  
        If comparative analysis was performed in zol, then addtional tracks include:
        - focal_conservation, alternate_conservation, fst

        ---------------------------------------------------------------------------------------------
    Color scheme options:
        ---------------------------------------------------------------------------------------------
        - white_to_black, red_white_blue, light_to_dark_gold, light_to_dark_blue, light_to_dark_green, 
        light_to_dark_purple, light_to_dark_red, grey, black, blue, red, purple, green, gold.
                                                                  
        Color scheme detailed customization can be performed by providing a tab delimited file with
        four columns: (1) track_name, (2) low-value color (hex code), (3) mid-value color (hex code),
        (4) high-value color (hex-code). 
                                                                  
        ---------------------------------------------------------------------------------------------
        Example command:
        ---------------------------------------------------------------------------------------------
        cgc -i zol_Results_Directory/ -t conservation tajimas_d -c white_to_black blue_white_red
                                                                  


options:
  -h, --help            show this help message and exit
  -i ZOL_RESULTS_DIR, --zol_results_dir ZOL_RESULTS_DIR
                        Path to zol results directory.
  -o OUTDIR, --outdir OUTDIR
                        Output directory for cgc analysis.
  -t TRACKS [TRACKS ...], --tracks TRACKS [TRACKS ...]
                        Tracks to use, note the first track is used to color the gene
                        schematic directly. Should be provided in desired order (bottom
                        to top) [Default is "conservation tajimas_d"]
  -c COLOR_SCHEMES [COLOR_SCHEMES ...], --color_schemes COLOR_SCHEMES [COLOR_SCHEMES ...]
                        Specify default color schemes avaibility by name in desired
                        order matching --tracks [Default is "white_to_black blue_white_red"].
  -cc CUSTOM_COLOR_SCHEME_FILE, --custom_color_scheme_file CUSTOM_COLOR_SCHEME_FILE
                        Tab-delimited file with 4 columns: (1) track_name, (2) low-value
                        color hexcode, (3) mid-value color hexcode, (4) high-value color hexcode.
  -d ANNOTATION_DB_FAA, --annotation_db_faa ANNOTATION_DB_FAA
                        A FASTA file of proteins with headers to match to ortholog group
                        consensus sequences. Will then use headers of matches (up until first
                        space) as ortholog labels.
  -ol OG_LABEL_FILE, --og_label_file OG_LABEL_FILE
                        Tab-delimited file with 2 columns: (1) ortholog group ID,
                        (2) desired label.
  -m MIN_CONSERVATION, --min_conservation MIN_CONSERVATION
                        Minimum proportion of gene clusters an ortholog group needs to be found
                        in to be shown in the figure [Default is 0.1].
  -rh RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...], --relative_heights RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...]
                        Relative heights of individual tracks [Default is "1 1"]
  -sl, --show_labels    Show labels which by default are OG identifiers, if --og_label_file
                        or --annotatIon_db_faa are not specified.
  -ld LABEL_DISTANCE, --label_distance LABEL_DISTANCE
                        The distance between genes and labels [Default is 0.025].
  -lts LABEL_TEXT_SIZE, --label_text_size LABEL_TEXT_SIZE
                        Label text size [Default is 2.0].
  -sc, --show_copy_count_status
                        Show whether genes are single-copy or not.
  -q SQUISH_FACTOR, --squish_factor SQUISH_FACTOR
                        A numeric value controlling vertical squishing of tracks in the plot - this
                        value should be negative to squish or positive to give more space and
                        proportional to the values provided for relative heights [Default is 0]
  -b BOTTOM_SPACING, --bottom_spacing BOTTOM_SPACING
                        Extra space for bottom of the plot to make sure legend does not get cut off.
                        Should be proportional to values for relative heights [Default is 0.1].
  -fl FOCAL_SET_TITLE, --focal_set_title FOCAL_SET_TITLE
                        Title to use for the focal set conservation track if requested.
                        If spaces present, surround by quotes [Default is "Focal Set Conservation"].
  -cl COMPARATOR_SET_TITLE, --comparator_set_title COMPARATOR_SET_TITLE
                        Title to use for the alternate set conservation track if requested. If spaces
                        present, surround by quotes [Default is "Comparator Set Conservation"].
  -lgs LEGEND_TITLE_SIZE, --legend_title_size LEGEND_TITLE_SIZE
                        Change legend title size [Default is 10].
  -nl, --no_legend      Do not show legend for coloring of the gene schematic.
  -p, --png             Create plot as PNG, default is PDF.
  -l LENGTH, --length LENGTH
                        Specify the height/length of the heatmap plot in inches [Default is 7].
  -w WIDTH, --width WIDTH
                        Specify the width of the heatmap plot in inches [Default is 7].