8. overview of prior updates

Rauf Salamzade edited this page Oct 26, 2024 · 11 revisions

Updates

version 1.5.0

  • Added --domain-mode in zol
  • Introduced cgcg visualization program
  • Introduced zol-scape

versions 1.4.1-1.4.12

  • Updated defaults for options in zol.
  • Added cgc and salt.

version 1.4.0

  • Correct fai cutoff for minimum proportion of distinct query genes to properly account for background genes. The value shown in the spreadsheet reports was already being calculated correctly.
  • Set default syntenic correlation threshold to 0.0 (off) from previous default of 0.6
  • Update prepTG to reference data from GTDB R220.
  • Update splitting of DIAMOND alignment in zol to allow for parallelization
  • Introduce CD-HIT protein clustering option/mode instead of InParanoid-like approach
  • Add option to control re-inflation CD-HIT parameters
  • New scripts and tutorial on using fast.genomics and CAGECAT to get sets of homologous gene clusters to investigate in zol

version 1.3.20

  • Update extraction of gene cluster GenBank files from full GenBank files in fai to be much more efficient. The time difference is mostly noticeable for metagenomic applications, where full GenBank files can be quite large.

version 1.3.19

Major Updates:

  • Slight update to the core ortholog group determination algorithm in zol to reduce memory consumption and aid scalability without a dereplication/re-inflation approach.
  • Slight update to processing of miniprot protein mappings to account for overlap in exon coordinates (best scoring exon is selected in such cases) in prepTG.
  • New mode in zol where users can provide known instances of a gene cluster and determine appropriate parameters for searching for additional instances using fai.

Minor Updates:

  • New script to extract proteins from GenBank files into FASTA format, extractProteinsFromGenBank.py.
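
The handling of overlapping miniprot exon coordinates noted for prepTG above can be pictured roughly as follows. This is a minimal sketch, not the prepTG code: the `Exon` tuple and `select_non_overlapping` function are hypothetical names, and the greedy highest-score-first strategy is an assumption about how "best scoring exon is selected" might be realized.

```python
from typing import List, NamedTuple


class Exon(NamedTuple):
    """Hypothetical exon record; coordinates are assumed inclusive."""
    start: int
    end: int
    score: float


def select_non_overlapping(exons: List[Exon]) -> List[Exon]:
    """Keep the best-scoring exon wherever exon coordinates overlap."""
    kept: List[Exon] = []
    # Visit exons from highest to lowest score so the best one wins a conflict.
    for exon in sorted(exons, key=lambda e: e.score, reverse=True):
        # An exon is retained only if it overlaps none of those already kept.
        if all(exon.end < k.start or exon.start > k.end for k in kept):
            kept.append(exon)
    # Return the survivors in coordinate order.
    return sorted(kept, key=lambda e: e.start)
```

For example, given exons spanning 1-100 (score 50), 90-200 (score 80) and 300-400 (score 10), the first is dropped because it overlaps the higher-scoring second exon.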

version 1.3.18

  • Add option to prepTG to easily/automatically create databases for any bacterial genus/species in GTDB.

version 1.3.17

  • Update fai catching of cases when no homologous BGC instance is found among target genomes.
  • Round metrics in fai's report.
  • Temporarily remove abon, atpoc, apos from Docker wrapper as these are not yet working - will need to update the bash script for simplifying docker usage at some point later.

versions 1.3.12-1.3.16

  • Introduce apos (assess plasmid-ome similarity) and atpoc (assess temperate phage-ome conservation) to assess conservation of a focal sample's plasmids and phages across some set of target/database genomes (e.g. all other genomes in the same species/genus as the sample)
  • Add prodigal-gv option in prepTG and fai.
  • Add simple BLASTp search option in place of fai for abon.
  • Make minor corrections for newly introduced programs.

version 1.3.11

  • Introduce abon!
  • Update links to newer versions of precompiled prepTG databases for select bacterial taxa.
  • Update wiki documentation.

version 1.3.10

  • Introduce clean up option in fai.
  • Reorganize fai's results directory.
  • Generate Tiny AAI figure and an XLSX spreadsheet in fai to allow for manual curation of homologous gene clusters detected.

version 1.3.8

  • Update script for downloading annotation databases to account for changes in naming structure in the tar.gz directory with PGAP HMMs.

version 1.3.7

  • Add option to prepTG to download premade databases for certain bacterial taxa/genera hosted on Zenodo.
  • Add option to prepTG to construct a species tree based on skani ANI + neighbor-joining.
  • Add option to provide species tree in fai and generate a phylo-heatmap of gene cluster searching results.
  • Loosen restrictions around the need for a core ortholog group in zol analysis.

version 1.3.6

  • Fix the conditional statement used to determine 'consensus directionality' in zol, which was inverted.

version 1.3.5

  • Fix misspelling of "Oomolog Group" to "Ortholog Group" in consolidated zol report.

version 1.3.4

  • Fix mismapping of parameter names and arguments in the provenance file for fai (introduced in 1.3.3 after incorporation of single query mode).
  • Add consideration point for dereplication in zol help and README to only be used when working with gene-clusters >10kb.

version 1.3.3

  • Correct and clarify usage of "key protein" filters in fai.
  • Introduce single query mode in fai, whereby users can use a single gene as a query to look at differences in surrounding context CORASON style.
  • Add miniprot (v0.7) dependency to conda yaml file (with plans to add it to bioconda as well).

version 1.3.2

  • Allow for failures of specific databases (i.e., if hosting server goes down) in setup_annotation_dbs.py.
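
The idea of tolerating individual database failures can be sketched as below. This is only an illustration under assumptions, not the logic of setup_annotation_dbs.py: `setup_all` and the callables passed to it are hypothetical, and each callable stands in for the download and processing of one database.

```python
import sys
from typing import Callable, Dict, List, Tuple


def setup_all(setups: Dict[str, Callable[[], None]]) -> Tuple[List[str], List[str]]:
    """Run each database setup step, skipping (not aborting on) failures."""
    succeeded: List[str] = []
    failed: List[str] = []
    for name, setup in setups.items():
        try:
            setup()
            succeeded.append(name)
        except Exception as exc:
            # A single unreachable server should not block the other databases.
            print(f"Warning: skipping {name}: {exc}", file=sys.stderr)
            failed.append(name)
    return succeeded, failed
```

The caller can then report which databases were skipped and carry on with those that were set up successfully.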

version 1.3.1

  • Update for release.

version 1.3.0

  • Add better support in fai for query GenBanks without locus tags for CDS features, and a clearer message in zol advising use of the -r/--rename_lt flag to automatically rename locus tags when input GenBanks lack them.
  • Switch to pyhmmer for faster annotation in zol.

version 1.2.10

  • Update CITATION.cff

version 1.2.9

  • Minor changes to code documentation and updates to citation references in the README.
  • Added reporting on steps to console for prepTG.
  • Slight updates to plotting function in fai to allow more robust parsing of GenBanks.

version 1.2.8

  • Update README to add Bioconda installation guide.
  • Add more comprehensive comments to python modules with the bulk of the code.
  • Add traceback statement to all functions to generate detailed reports of what might be causing issues if they arise.
  • Switch to consistently using the term ortholog groups (instead of homolog groups) in the code/messages/results/comments.
  • Updated to more flexible inputting of query GenBanks in fai.
  • Corrected processing of cases where GenBanks with CDS features are provided as ready to go in prepTG.

versions 1.2.6 & 1.2.7

  • Additional changes to allow for better incorporation into bioconda.

version 1.2.5

  • Additional safety for when statistics are unavailable to incorporate into the consolidated report.

version 1.2.4

  • Docker set up should now work.
  • fixed bug introduced in 1.2.3 related to new names for arguments in prepTG.
  • note, will update bioconda recipe after release to get size of release tar.gz.

version 1.2.3

  • updated argument names to prepTG.
  • updated the way version information was being reported in programs to make more compatible with bioconda.
  • added initial attempt at Dockerfile for creating Docker image and auxiliary scripts to ease usage.
  • will likely make another update or two in the near future to get Docker and bioconda options working.

version 1.2.2

  • added initial attempt at bioconda recipe - no changes to core programs.
  • introduced ZOL (all capitals) - wrapper of the 3 main programs - for use as entrypoint in Docker image.

Version 1.2.1

  • add line in beginning of fai to request "fork" method for multiprocessing to work on macOS with python >=v3.8.
  • clean up unused functions and simplify yaml file for specifying conda environment.
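
For context on the "fork" change: since Python 3.8, macOS defaults to the "spawn" start method for multiprocessing, so code that relies on forked workers inheriting state has to request "fork" explicitly. The sketch below shows one way to do that with the standard library. It is an illustration rather than the line used in fai; `square` and `run_pool` are hypothetical names. The global equivalent is `multiprocessing.set_start_method("fork")`, called once at program start.

```python
import multiprocessing


def square(x: int) -> int:
    return x * x


def run_pool(values, workers: int = 2):
    """Map `square` over `values` using workers created with "fork"."""
    # get_context requests "fork" for this pool only, without changing the
    # process-wide default. "fork" is available on POSIX systems only.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(run_pool([1, 2, 3]))
```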

Minor Update - 05/05/2023

  • update parsing of PGAP HMMs directory after extracting with tar.

Version 1.2/1.02

  • prepTG sample to GenBank relations now specified locally so creation of database is not locked into one location.
  • Individual pickle files produced by prepTG per genome/metagenome for lower memory use with fai.
  • New "Gene-Clumper" mode for gene-cluster discovery in fai, which is now the default.
  • Fixed bug pertaining to overlap between merged gene-clusters based on --max_gene_disconnect parameter when using "HMM" mode.
  • Improved filtering and retention of GenBanks in zol.
  • Fixed bug in re-inflation method in zol.

Version 1.1/1.01

  • Remove unused individual proteome files in prepTG database directory.
  • Store only gene-location information for scaffolds with hits by query proteins in fai to keep memory use low.
  • Introduce parallelization to HMM step of fai and use global variables to access common data without duplicating in memory.
  • Improve parsing of different input formats for fai and generate new PDF at end mapping individual protein names to non-redundantified protein queries.
  • Declare "< 3 segregating sites found" as the reason for inability to calculate Tajima's D instead of just "NA", which could also arise from not enough sequences or the sequence length threshold not being met.
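
To illustrate why an explicit reason is more useful than a bare "NA", here is a minimal sketch of Tajima's D that returns the statistic together with a reason string. It is not the zol implementation: the function name `tajimas_d`, the `min_seqs` default of 4, and the rule of skipping columns containing anything other than A/C/G/T are assumptions. Only the three-segregating-site cutoff is taken from the note above. Input is assumed to be aligned, upper-case sequences of equal length.

```python
from collections import Counter
from math import sqrt
from typing import List, Optional, Tuple


def tajimas_d(seqs: List[str], min_seqs: int = 4,
              min_sites: int = 3) -> Tuple[Optional[float], str]:
    """Return (Tajima's D, "ok") or (None, reason) for aligned sequences."""
    n = len(seqs)
    if n < min_seqs:
        return None, f"< {min_seqs} sequences provided"
    seg_sites = 0
    diff_pairs = 0.0
    for column in zip(*seqs):
        if any(base not in "ACGT" for base in column):
            continue  # skip gapped or ambiguous columns (assumed rule)
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs that differ at this column.
            diff_pairs += (n * n - sum(c * c for c in counts.values())) / 2
    if seg_sites < min_sites:
        return None, f"< {min_sites} segregating sites found"
    pi = diff_pairs / (n * (n - 1) / 2)  # mean pairwise differences
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    s = seg_sites
    d = (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
    return d, "ok"
```

Calling `tajimas_d(["AAAA", "TAAA", "ATAA", "AAAA"])` yields `None` with the reason "< 3 segregating sites found", since only two columns vary.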