
5.3 visualization of 1000s of gene clusters using cgc or cgcg

Rauf Salamzade edited this page Dec 5, 2024 · 16 revisions

cgc (pronounced see( 👀 )-gc, but standing for Collapsed Gene Cluster) and cgcg (pronounced kinda like squeegee, standing for Collapsed Gene Cluster Graph) are programs that create visuals from zol results and can summarize data from thousands of gene cluster instances. Both offer multiple options to help users create highly customized visuals.

Note

Many great tools exist for creating micro-synteny visualizations (see this page for recommendations), but these approaches are inherently difficult to scale. zol can still be useful if you want to create those types of plots, in particular because it integrates skani for gene cluster dereplication.

Example: Visualizing over 300 instances of the crt operon across the Staphylococcus genus

We will continue with the example from the tutorial on using salt to demonstrate first cgcg and then cgc. For both demonstrations, we need to run zol first.

Run zol with comparative analysis requested

Picking up from having run fai in the salt tutorial to identify crt in Staphylococcus genomes, we can run zol as follows.

To showcase cgc's usefulness for comparative genomics (uh genetics), we run zol with all instances of the crt operon from S. aureus selected as a focal set and the remainder of instances of the operon from the genomes of other species as the comparator set.

# write a list of all S. aureus crt operon instances to a file to provide to zol next
ls fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ | grep '_aureus_' > staph_aureus_instances.txt

# run zol 
zol -i fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/ -o zol_Results/ -c 4 -f staph_aureus_instances.txt
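Because the comparative analysis needs both a focal and a comparator set, it can help to check the split before running zol. The following is a minimal sketch, not part of zol: `summarize_split` is a hypothetical helper, and it assumes the listing has one file name per line and that every entry in the GenBank directory is a gene cluster instance.

```shell
# summarize_split DIR LIST (hypothetical helper, not part of zol)
# Counts non-empty lines in LIST as focal instances and treats the remaining
# entries of DIR as comparator instances.
summarize_split() {
    total=$(ls "$1" | wc -l)
    focal=$(grep -c . "$2")
    echo "focal=$((focal)) comparator=$((total - focal)) total=$((total))"
}

gbk_dir="fai_Results/Final_Results/Homologous_Gene_Cluster_GenBanks/"
listing="staph_aureus_instances.txt"

# Only report when both inputs from the tutorial exist.
if [ -d "$gbk_dir" ] && [ -f "$listing" ]; then
    summarize_split "$gbk_dir" "$listing"
fi
```

If either count comes out as zero, the grep pattern used to build the listing probably needs adjusting.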

cgcg demonstration

cgcg creates a collapsed pan-gene-cluster network graph in which orthogroups (OGs) are nodes and edges carry order information (gathered during zol's inference of OG consensus order). The border color of each node corresponds to the directionality of the ortholog group, which is needed to properly interpret ortholog group order from the edges: black corresponds to the sense direction and red to the antisense direction. The node coloring can be set to a variety of quantitative evolutionary statistics computed by zol. cgcg has also been tested with domain ortholog groups from zol's domain mode, but less thoroughly, so please let us know through a GitHub issue if you run into difficulties. The final plot is an interactive HTML file that can be further tailored to a user's preference and easily made into a publishable figure.

Important

Under the hood, cgcg uses the amazing gravis framework for network visualization by Robert Haas, which we recommend citing as well given the effort that went into developing it. cgcg draws inspiration from the STRING web-application by the Bork lab and various pan-genome visualization software.

1. Simple run of cgcg

We can run cgcg quite simply as such:

cgcg -i zol_Results/ -o cgcg_Results/

This will produce an HTML report and a PNG (for the coloring legend) in a subdirectory called Final_Results/ within the cgcg output directory.

If we open up the HTML we will see the following:

2. Command-line or GUI based adjustment of visuals

Within the HTML we can add node labels, which are encoded as orthogroup identifiers, and adjust other features of the plot to get a more publishable and informative representation of the plot:

We can also adjust some properties of the plot via the command line. Some options include:

  • -mc / --min-conservation : Minimum proportion of gene clusters an ortholog group must be found in for it to be shown in the plot [Default is 0.25].
  • -mer / --min-edge-ratio : Minimum ratio of the weight between two ortholog groups to the maximum weight observed for the edge to be shown [Default is 0.05]. This hides rare edges that would otherwise be distracting, especially because arrows do not scale with edge width.
  • -c / --color-track : The statistic for node coloring - options include: conservation [default], tajimas_d, entropy, upstream_entropy, median_beta_rd, fst, alternative_conservation. Check out the options --low-value, --high-value, --low-value-color, and --high-value-color to control the scales and colors.
  • -sl / --show-labels : Show orthogroup labels. This can also just be turned on, with label size adjusted in the interface of the HTML file.
  • -cul / --custom-labels : Path to a tab-separated file with OG identifiers as first column and labels to use as the second column.
  • -cuc / --custom-colors : Path to a tab-separated file with OG identifiers as first column and hex-code for colors to use as the second column.
  • -sm / --show-major-path : Show the most supported path between ortholog groups by coloring such edges in gold instead of the default grey.
  • -as / --arrow-size : Control the size of the arrow (unfortunately not possible to scale via the HTML or automatically with edge width).
  • -u / --undirected-graph : Flag which can be used to hide arrows.
  • -nbs / --node-border-size : Control the size of the node borders depicting information on ortholog group directionality: ⚫ = sense, 🔴 = antisense.
  • -bc / --background-color : Color of the background - default is white.
  • -f / --flip : Flag to flip ortholog group direction and direction of arrows on edges. Should function like reverse-complementing the graph.
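As an illustration of the two-column format expected by --custom-colors, the sketch below writes a small tab-separated file. The file name `OG_Colors.txt`, the hex codes, and the choice of orthogroup identifiers are placeholders for illustration only, and the absence of a header line is an assumption; use the identifiers from your own zol report.

```shell
# Hypothetical custom color file: OG identifier <TAB> hex color.
# printf reuses the format string for each identifier/color pair.
printf '%s\t%s\n' \
    OG_22 '#1b9e77' \
    OG_23 '#d95f02' \
    OG_24 '#7570b3' \
    OG_25 '#e7298a' \
    OG_26 '#66a61e' > OG_Colors.txt

# Confirm every line has exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad + 0 }' OG_Colors.txt \
    && echo "OG_Colors.txt looks well-formed"
```

The file could then be supplied as `cgcg -i zol_Results/ -o cgcg_Results_Custom/ -cuc OG_Colors.txt -sl`, where the output directory name is again just an example.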

3. Visualize sequence entropy instead of orthogroup conservation

We can also change the quantitative statistic being plotted:

cgcg -i zol_Results/ -o cgcg_Results_Entropy/ -c entropy -f -vl 0.0 \
     -vh 1.0 -cl "#81c785" -ch "#284f2a" -cul OG_Labels.txt

where OG_Labels.txt contains the following mapping:

OG_22   crtQ
OG_23   crtP
OG_24   crtM
OG_25   crtN
OG_26   crtO
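Since the label file needs to be tab-separated and copy-pasting the mapping above may turn tabs into spaces, one way to create it is with printf and explicit tab characters. This is a small sketch that reuses the mapping shown above.

```shell
# Write the orthogroup-to-label mapping with explicit tabs (\t).
# printf reuses the format string for each identifier/label pair.
printf '%s\t%s\n' \
    OG_22 crtQ \
    OG_23 crtP \
    OG_24 crtM \
    OG_25 crtN \
    OG_26 crtO > OG_Labels.txt

# Print the number of tab-separated fields per line, followed by the fields.
awk -F'\t' '{ print NF, $1, $2 }' OG_Labels.txt
```

Each line of the check should start with 2; any other count means a line is not split by a single tab.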

Exploring zol results using the cgcg network view

Beyond creating publishable figures, the cgcg visual report can also be used as a primary means to explore zol results. Clicking a node shows details, including select annotations and evolutionary statistics from the zol report:

cgc demonstration

Note

cgc is inspired in part by Figure 3 from this study by Wang et al. 2018.

1. Run cgc in default mode

We can first try running cgc with default options, adding only -p to request plots in PNG format instead of PDF:

cgc -i zol_Results/ -o cgc_Results/ -p

This will give us a pretty bland plot:

Note, ortholog groups are scaled by their median length in basepairs and shown in the consensus order/direction across all gene cluster instances.

2. Spruce it up

We can apply other options to create a more insightful figure as such:

cgc -i zol_Results/ -o cgc_Results/ -t conservation conservation focal_conservation tajimas_d entropy \
    -c grey white_to_black light_to_dark_purple red_white_blue light_to_dark_green \
    -rh 1.7 1 1 1 1  -sl -p -sc -ld 0.04 -lts 3.0 -b 0.5 -w 8 -nl \
    -fl "Conservation amongst crt gene clusters in S. aureus genomes"

Argument breakdown:

  • -t/--tracks is used to specify which tracks/information from the zol analysis to include in the figure. Note, the first one is used for coloring the gene schematic at the bottom of the figure.
  • -c/--color_schemes is used to specify color palettes for the tracks, in the same order as the tracks. Preset palettes are listed in the usage for cgc. More custom color schemes can also be provided through a simple configuration file.
  • -rh/--relative_heights is used to specify the heights of the individual tracks relative to each other.
  • -sl/--show_labels is used to request showing labels of ortholog groups.
  • -sc/--show_copy_count_status is used to indicate which ortholog groups are not found in single copy across gene cluster instances. This results in the red dot above the gene schematic of one of the ortholog groups (OG_26).
  • -ld/--label_distance is used to control the distance of the labels to the gene schematics.
  • -lts/--label_text_size is used to control the label text size.
  • -b/--bottom_spacing is used to control the white space at the bottom of the plot (to help prevent legends from being cut off).
  • -w/--width is the width of the plot in inches (default is 7).
  • -nl/--no_legend is used to request that no legend be shown for the gene schematic coloring.
  • -fl/--focal_set_title is used to request a custom title for the focal conservation track.

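The configuration file for custom color schemes is described in the cgc usage as a tab-delimited file with four columns: track name, low-value color, mid-value color, and high-value color. The sketch below writes such a file for two tracks. The file name and hex codes are placeholders, and omitting a header line is an assumption.

```shell
# Hypothetical custom color schemes for two cgc tracks.
# Columns: track_name, low-value color, mid-value color, high-value color.
printf '%s\t%s\t%s\t%s\n' \
    tajimas_d '#2166ac' '#f7f7f7' '#b2182b' \
    entropy   '#f7fcf5' '#74c476' '#00441b' > custom_color_schemes.txt

# Confirm every line has exactly four tab-separated fields.
awk -F'\t' 'NF != 4 { bad = 1 } END { exit bad + 0 }' custom_color_schemes.txt \
    && echo "custom_color_schemes.txt looks well-formed"
```

The file would be passed to cgc with -cc/--custom_color_scheme_file. How it combines with palettes given via -c is not covered here, so check the resulting figure.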
3. Add labels with meaning

You will notice the ortholog group labels are sort of arbitrary. While you can go in the zol report and look at the functional annotations from the variety of databases to figure out what is what, you can also provide a FASTA file of proteins with labels as headers. cgc will perform orthology matching of those sequences to consensus sequences of ortholog groups to then assign appropriate labels in the figure. For instance, we can use the query FASTA for fai from the salt tutorial to see which of the genes correspond to the 5 crt genes as such:

cgc -i zol_Results/ -o cgc_Results/ -t conservation conservation focal_conservation tajimas_d entropy \
    -c grey white_to_black light_to_dark_purple red_white_blue light_to_dark_green \
    -rh 1.7 1 1 1 1  -sl -p -sc -ld 0.04 -lts 3.0 -b 0.5 -w 8 -nl \
    -fl "Conservation amongst crt gene clusters in S. aureus genomes" \
    -d crt_proteins.faa 
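According to the cgc usage, labels are taken from the FASTA headers up until the first space. To preview which labels a FASTA file would yield, something like the following sketch can be used; `preview_labels` is a hypothetical helper and not part of cgc.

```shell
# preview_labels FASTA (hypothetical helper, not part of cgc)
# Print each header without the leading '>' and truncated at the first space.
preview_labels() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/ .*$//'
}

fasta="crt_proteins.faa"

# Only run the preview when the tutorial's query FASTA is present.
if [ -f "$fasta" ]; then
    preview_labels "$fasta"
fi
```

Headers that share the same first word would produce identical labels, so it may be worth renaming them before running cgc.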

More manual editing of the figure:

1. Edit the R script directly and rerun it (requires familiarity with R)

cgc writes an R script in the resulting output directory that it then runs to create the visualization.

This allows users familiar with R, specifically ggplot2, to customize the script directly and simply rerun it to recreate the visual. With the zol conda environment active, this can be done as follows:

Rscript cgc_Results/cgc_script.R

2. Edit the PDF as a vector graphic using Inkscape or Adobe Illustrator

The resulting PDF can be opened in either Inkscape (free!!!) or Adobe Illustrator to further customize the figure since it is a vector graphic.

Caution: Assessing orthology quality via direct assessment of zol's spreadsheet still advised

cgc does not visualize all ortholog groups by default, only those which are found in greater than 10% of gene cluster instances input into zol. This could lead to misinterpretation of results and is something to be cautious of. For instance, based on orthology cutoffs for coverage and identity, ortholog groups might be more or less split up than users might desire. It is thus important to assess the zol spreadsheet where information on all gene clusters should be reported.

In addition, zol features an option -d/--custom_database that works differently than the -d/--annotation_db_faa option in cgc. Specifically, the option in zol will annotate comprehensively and the spreadsheet report will show the best hits (which meet a default E-value threshold) for all ortholog groups to protein sequences in the provided custom database (along with alignment E-values in parentheses). It will not just look for the best match between the custom database of proteins and the ortholog group consensus sequences. This option in zol should let you make assessments of ortholog group partitioning more easily!

Usage:

cgcg

Usage: cgcg [-h] -i ZOL_RESULTS_DIR -o OUTDIR [-c COLOR_TRACK] [-vl LOW_VALUE] [-vh HIGH_VALUE] [-cl LOW_VALUE_COLOR] [-ch HIGH_VALUE_COLOR] [-nc NA_VALUE_COLOR]
            [-mc MIN_CONSERVATION] [-mer MIN_EDGE_RATIO] [-sl] [-cul CUSTOM_LABELS] [-cuc CUSTOM_COLORS] [-as ARROW_SIZE] [-nbs NODE_BORDER_SIZE] [-f] [-u] [-bc BACKGROUND_COLOR]
            [-sm]

        Program: cgcg
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        collapsed gene clusters graph (cgcg; pun: see(ಠಠ)-gcg)

        cgcg takes as input a zol results directory and uses the gravis library to summarize
        gene clusters into a graphical representation where nodes are ortholog groups and edges
        indicate the Markovian information determined by zol as part of its algorithm for
        determining consensus order/direction. The coloring of the node corresponds to a
        quantitative evolutionary statistic. The size of the node is the median length in 100 bp
        units. The border color of the node indicates whether it is in the sense or antisense
        direction. It is inspired by various pangenome graph visualization software and the STRING
        web application by the Bork lab.

        Note, if you want to remove border colors - you can do so by setting --node-border-size to 0,
        but then you should probably also switch to an undirected graph to avoid misinterpretation
        of gene order/directionality.

        Also, the "major path" which can be shown in gold is simply the path which is most supported
        across gene cluster instances, it is not all inclusive of every ortholog group like the
        consensus path calculated in zol.
        ---------------------------------------------------------------------------------------------
        Example command:
        ---------------------------------------------------------------------------------------------
        $ cgcg -i zol_Results_Directory/ -o cgcg_Results/ --color-track conservation
        ---------------------------------------------------------------------------------------------
        Citation notice:
        ---------------------------------------------------------------------------------------------
        - gravis: Interactive graph visualizations with Python and HTML/CSS/JS.
          https://github.com/robert-haas/gravis by Haas 2021.

        - zol & fai: large-scale targeted detection and evolutionary investigation of gene clusters
          https://www.biorxiv.org/content/10.1101/2023.06.07.544063v3 by Salamzade et al. 2023

Options:
  -h, --help            show this help message and exit
  -i, --zol-results-dir ZOL_RESULTS_DIR
                        Path to zol results directory.
  -o, --outdir OUTDIR   Output directory for cgc analysis.
  -c, --color-track COLOR_TRACK
                        The track from the zol spreadsheet to use for coloring.
                        Options include: conservation, tajimas_d, entropy,
                        upstream_entropy, median_beta_rd, max_beta_rd, fst,
                        alternative_conservation. [Default is conservation].
  -vl, --low-value LOW_VALUE
                        Low value [Default is 0.0].
  -vh, --high-value HIGH_VALUE
                        High value [Default is 1.0].
  -cl, --low-value-color LOW_VALUE_COLOR
                        Hex code for color for low-value. Surround by quotes
                        [Default is "#a2b7e8"].
  -ch, --high-value-color HIGH_VALUE_COLOR
                        Hex code for color for high-value. Surround by quotes
                        [Default is "#102f75"].
  -nc, --na-value-color NA_VALUE_COLOR
                        The hex code for setting color of NA/non-numeric values.
                        Surround by quotes. [Default is "#adadad"].
  -mc, --min-conservation MIN_CONSERVATION
                        Minimum conservation of ortholog group to be shown
                        [Default is 0.25].
  -mer, --min-edge-ratio MIN_EDGE_RATIO
                        Minimum ratio of weight between two ortholog groups to the maximum weight observed to be shown [Default is 0.05].
  -sl, --show-labels    Show orthogroup labels.
  -cul, --custom-labels CUSTOM_LABELS
                        Tab-separated file with OG identifiers as first column
                        and labels to use as the second column.
  -cuc, --custom-colors CUSTOM_COLORS
                        Tab-separated file with OG identifiers as first column
                        and hex-code for colors to use as the second column.
  -as, --arrow-size ARROW_SIZE
                        Arrow size [Default is 10].
  -nbs, --node-border-size NODE_BORDER_SIZE
                        Node border size [Default is 1].
  -f, --flip            Flag to flip the direction of arrows and the border
                        coloring for orthogroup consensus direction.
  -u, --undirected-graph
                        Flag for hiding arrows.
  -bc, --background-color BACKGROUND_COLOR
                        The background color. Surround by quotes [Default is
                        "#FFFFFF"].
  -sm, --show-major-path
                        Flag to show edges which belong to the major path in
                        gold.

cgc

usage: cgc [-h] -i ZOL_RESULTS_DIR -o OUTDIR [-t TRACKS [TRACKS ...]] [-c COLOR_SCHEMES [COLOR_SCHEMES ...]] [-cc CUSTOM_COLOR_SCHEME_FILE]
           [-d ANNOTATION_DB_FAA] [-ol OG_LABEL_FILE] [-m MIN_CONSERVATION] [-rh RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...]] [-sl] [-ld LABEL_DISTANCE]
           [-lts LABEL_TEXT_SIZE] [-sc] [-q SQUISH_FACTOR] [-b BOTTOM_SPACING] [-fl FOCAL_SET_TITLE] [-cl COMPARATOR_SET_TITLE]
           [-lgs LEGEND_TITLE_SIZE] [-nl] [-p] [-l LENGTH] [-w WIDTH]

        Program: cgc
        Author: Rauf Salamzade
        Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

        collapsed gene clusters (cgc; pun: see(ಠಠ)-gc)

        cgc is a visualization tool that takes as input zol results and uses R libraries for
        plotting to generate a customizable figure that collapses information across hundreds
        to thousands of gene cluster instances in a "collapsed" figure.

        ---------------------------------------------------------------------------------------------
        Accepted tracks:
        ---------------------------------------------------------------------------------------------
        - conservation, tajimas_d, entropy, upstream_entropy, median_beta_rd, max_beta_rd, median_gc

        If comparative analysis was performed in zol, then additional tracks include:
        - focal_conservation, alternate_conservation, fst

        ---------------------------------------------------------------------------------------------
        Color scheme options:
        ---------------------------------------------------------------------------------------------
        - white_to_black, red_white_blue, light_to_dark_gold, light_to_dark_blue, light_to_dark_green, 
        light_to_dark_purple, light_to_dark_red, grey, black, blue, red, purple, green, gold.
                                                                  
        Color scheme detailed customization can be performed by providing a tab delimited file with
        four columns: (1) track_name, (2) low-value color (hex code), (3) mid-value color (hex code),
        (4) high-value color (hex-code). 
                                                                  
        ---------------------------------------------------------------------------------------------
        Example command:
        ---------------------------------------------------------------------------------------------
        cgc -i zol_Results_Directory/ -t conservation tajimas_d -c white_to_black blue_white_red
                                                                  


options:
  -h, --help            show this help message and exit
  -i ZOL_RESULTS_DIR, --zol_results_dir ZOL_RESULTS_DIR
                        Path to zol results directory.
  -o OUTDIR, --outdir OUTDIR
                        Output directory for cgc analysis.
  -t TRACKS [TRACKS ...], --tracks TRACKS [TRACKS ...]
                        Tracks to use, note the first track is used to color the gene
                        schematic directly. Should be provided in desired order (bottom
                        to top) [Default is "conservation tajimas_d"]
  -c COLOR_SCHEMES [COLOR_SCHEMES ...], --color_schemes COLOR_SCHEMES [COLOR_SCHEMES ...]
                        Specify available default color schemes by name in desired
                        order matching --tracks [Default is "white_to_black blue_white_red"].
  -cc CUSTOM_COLOR_SCHEME_FILE, --custom_color_scheme_file CUSTOM_COLOR_SCHEME_FILE
                        Tab-delimited file with 4 columns: (1) track_name, (2) low-value
                        color hexcode, (3) mid-value color hexcode, (4) high-value color hexcode.
  -d ANNOTATION_DB_FAA, --annotation_db_faa ANNOTATION_DB_FAA
                        A FASTA file of proteins with headers to match to ortholog group
                        consensus sequences. Will then use headers of matches (up until first
                        space) as ortholog labels.
  -ol OG_LABEL_FILE, --og_label_file OG_LABEL_FILE
                        Tab-delimited file with 2 columns: (1) ortholog group ID,
                        (2) desired label.
  -m MIN_CONSERVATION, --min_conservation MIN_CONSERVATION
                        Minimum proportion of gene clusters an ortholog group needs to be found
                        in to be shown in the figure [Default is 0.1].
  -rh RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...], --relative_heights RELATIVE_HEIGHTS [RELATIVE_HEIGHTS ...]
                        Relative heights of individual tracks [Default is "1 1"]
  -sl, --show_labels    Show labels which by default are OG identifiers, if --og_label_file
                        or --annotation_db_faa are not specified.
  -ld LABEL_DISTANCE, --label_distance LABEL_DISTANCE
                        The distance between genes and labels [Default is 0.025].
  -lts LABEL_TEXT_SIZE, --label_text_size LABEL_TEXT_SIZE
                        Label text size [Default is 2.0].
  -sc, --show_copy_count_status
                        Show whether genes are single-copy or not.
  -q SQUISH_FACTOR, --squish_factor SQUISH_FACTOR
                        A numeric value controlling vertical squishing of tracks in the plot - this
                        value should be negative to squish or positive to give more space and
                        proportional to the values provided for relative heights [Default is 0]
  -b BOTTOM_SPACING, --bottom_spacing BOTTOM_SPACING
                        Extra space for bottom of the plot to make sure legend does not get cut off.
                        Should be proportional to values for relative heights [Default is 0.1].
  -fl FOCAL_SET_TITLE, --focal_set_title FOCAL_SET_TITLE
                        Title to use for the focal set conservation track if requested.
                        If spaces present, surround by quotes [Default is "Focal Set Conservation"].
  -cl COMPARATOR_SET_TITLE, --comparator_set_title COMPARATOR_SET_TITLE
                        Title to use for the alternate set conservation track if requested. If spaces
                        present, surround by quotes [Default is "Comparator Set Conservation"].
  -lgs LEGEND_TITLE_SIZE, --legend_title_size LEGEND_TITLE_SIZE
                        Change legend title size [Default is 10].
  -nl, --no_legend      Do not show legend for coloring of the gene schematic.
  -p, --png             Create plot as PNG, default is PDF.
  -l LENGTH, --length LENGTH
                        Specify the height/length of the heatmap plot in inches [Default is 7].
  -w WIDTH, --width WIDTH
                        Specify the width of the heatmap plot in inches [Default is 7].