1. more info on prepTG

Rauf Salamzade edited this page Jan 31, 2024 · 12 revisions

prepTG creates a database directory of target genomes, which fai can then search for homologous instances of reference/query gene clusters.

The input is a directory of FASTA or GenBank files (the latter with CDS features) representing bacterial genomes or metagenomes.

For eukaryotic genomes, full GenBank files with CDS features are expected; however, FASTA-formatted assemblies may be provided instead if a reference proteome is supplied via --reference_proteome.

Check out example commands for prepTG on the 4. basic usage examples wiki page.

Gene-calling bacterial genomes using prodigal and pyrodigal

For bacterial genomes or metagenomes, users can perform de novo gene calling with pyrodigal (default) or prodigal. prodigal-gv is also available as a gene-calling option for cases where phages are the gene clusters of interest.
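
As a minimal sketch of how these gene-calling choices translate to the command line, the snippet below assembles three alternative prepTG commands using only flags listed in the prepTG usage (`-gcm`, `-m`, `-c`). The directory names and the CPU count are placeholders, and the commands are printed rather than executed so they can be reviewed first.

```shell
# Placeholder paths; substitute your own directories.
GENOMES_DIR="User_Genomes_Directory/"
DB_DIR="prepTG_Database/"

# Alternative 1: default gene caller (pyrodigal) in single-genome mode.
CMD_DEFAULT="prepTG -i $GENOMES_DIR -o $DB_DIR -c 4"

# Alternative 2: meta mode, intended for metagenomic assemblies.
CMD_META="prepTG -i $GENOMES_DIR -o $DB_DIR -m -c 4"

# Alternative 3: prodigal-gv, for when phages are the gene clusters of interest.
CMD_PHAGE="prepTG -i $GENOMES_DIR -o $DB_DIR -gcm prodigal-gv -c 4"

# Print the alternatives; run only the one that matches your data.
echo "$CMD_DEFAULT"
echo "$CMD_META"
echo "$CMD_PHAGE"
```

These are alternatives for the same input, not steps to run in sequence; pick one per output directory.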

Gene-mapping in eukaryotic genomes using miniprot

For eukaryotic genomes, users can map high-quality gene predictions from a reference genome onto the remaining genomes. This approach is generally recommended only for single-species investigations and has only been tested on microbial eukaryotic genomes of modest size (e.g. fungi, not very large genomes such as those of plants).
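
A hedged sketch of what such a run might look like is shown below. It uses the `-rp`/`--reference_proteome` flag from the prepTG usage; the directory and file names (`Fungal_Assemblies/`, `reference_proteome.faa`) are hypothetical placeholders, and the command is only printed, not executed.

```shell
# Hypothetical inputs; replace with your own paths.
EUK_GENOMES_DIR="Fungal_Assemblies/"    # FASTA assemblies, ideally all from one species
REF_PROTEOME="reference_proteome.faa"   # protein FASTA from a well-annotated reference genome
DB_DIR="prepTG_Database/"

# Map the reference proteome onto every target assembly instead of calling genes de novo.
CMD_EUK="prepTG -i $EUK_GENOMES_DIR -rp $REF_PROTEOME -o $DB_DIR -c 4"

echo "$CMD_EUK"
```

Because genes are mapped from the reference rather than predicted de novo, genes absent from the reference proteome will not appear in the resulting database.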

prepTG usage

usage: prepTG [-h] [-d DOWNLOAD_PREMADE] [-i INPUT_DIR] [-g GTDB_TAXON] -o OUTPUT_DIR [-l LOCUS_TAG_LENGTH] [-r] [-gcm GENE_CALLING_METHOD] [-m] [-rp REFERENCE_PROTEOME] [-cst] [-c CPUS]
              [-mm MAX_MEMORY] [-v]

	Program: prepTG
	Author: Rauf Salamzade
	Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology

	Prepares a directory of target genomes for being searched for query gene clusters using fai.
	
	Premade databases of representative genomes are available for the following genera:

	Acinetobacter (n=1,643), Bacillales (n=3,150), Corynebacterium (n=726), Enterobacter (n=878), 
	Enterococcus (n=937), Escherichia (n=2,436), Klebsiella (n=1,022), Listeria (n=353), 
	Mycobacterium (n=744), Pseudomonas (n=2,666), Salmonella (n=308), Staphylococus (n=496), 
	Streptomyces (n=1,555), Streptococcus (n=2,452), Cutibacterium (n=27), Neisseria (n=414),
	Lactobacillus (n=541), and Micromonospora (n=211).

	In addition, users can simply request all genomes belonging to a specific species/genus 
	in GTDB R214 to be downloaded.
	
        ----------------------------------------------------------------------------------------
	> Example commands:

	1. Setup a prepTG database which includes some local genomes and all Cutibacterium granulosum
	   genomes: 
	
	    prepTG -i User_Genomes_Directory/ -g "Cutibacterium granulosum" -o prepTG_Database/
	
	2. Setup local prepTG database by downloading a premade one of representative 
	   Cutibacterium genomes:

	    prepTG -d Cutibacterium -o prepTG_Database/

	----------------------------------------------------------------------------------------
	> Considerations
	If FASTA format is provided, assumption is that genomes are prokaryotic and 
	pyrodigal/prodigal will be used to perform gene-calling. Eukaryotic genomes can 
	be provided as FASTA format but the --reference_proteome file should be used in 
	such case to map proteins from a reference proteome (from the same species ideally) 
	on to the target genomes. This will prevent detection of new genes in gene-clusters 
	detected by fai but synchronize gene-calling and allow for better similarity 
	assessment between orthologous genes.
	

options:
  -h, --help            show this help message and exit
  -d DOWNLOAD_PREMADE, --download_premade DOWNLOAD_PREMADE
                        Download and setup pre-made databases of representative genomes for specific taxon/genus.
                        Provide name of the taxon, e.g. "Escherichia"
  -i INPUT_DIR, --input_dir INPUT_DIR
                        Directory with target genomes (either featuring GenBanks or FASTAs).
  -g GTDB_TAXON, --gtdb_taxon GTDB_TAXON
                        Name of a GTDB-R214 valid genus or species, should be surrounded by
                        quotes (e.g. "Escherichia coli").
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory, which can then be provided as input for the
                        "-tg" argument in fai.
  -l LOCUS_TAG_LENGTH, --locus_tag_length LOCUS_TAG_LENGTH
                        Length of locus tags to set. Default is 3, allows for <~18k genomes.
  -r, --rename_locus_tags
                        Whether to rename locus tags if provided for CDS features in GenBanks.
  -gcm GENE_CALLING_METHOD, --gene_calling_method GENE_CALLING_METHOD
                        Method to use for gene calling. Options are: pyrodigal, prodigal,
                        or prodigal-gv. [Default is pyrodigal].
  -m, --meta_mode       Flag to use meta mode instead of single for pyrodigal/prodigal.
  -rp REFERENCE_PROTEOME, --reference_proteome REFERENCE_PROTEOME
                        Provide path to a reference proteome to use for protein/gene-calling
                        in target genomes - which should be in FASTA format.
  -cst, --create_species_tree
                        Use skani to infer a neighbor-joining based species tree for the genomes.
  -c CPUS, --cpus CPUS  Total number of cpus/threads to use for running OrthoFinder2/prodigal.
                        [Default is 1].
  -mm MAX_MEMORY, --max_memory MAX_MEMORY
                        Uses resource module to set soft memory limit. Provide in Giga-bytes.
                        Generally memory shouldn't be a major concern unless working
                        with hundreds of large metagenomes. [currently
                        experimental; default is None].
  -v, --version         Get version and exit.
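
The `-l` help text above states that the default locus tag length of 3 allows for fewer than roughly 18k genomes. That figure is consistent with one unique tag per genome drawn from a 26-letter alphabet, since 26^3 = 17,576. The alphabet size is an assumption made here for illustration, not something confirmed from the prepTG source. A small sketch for estimating the capacity at other lengths:

```shell
# Number of distinct tags of a given length over a 26-letter alphabet.
# NOTE: the 26-letter alphabet is an assumption, chosen because it matches
# the "~18k" figure quoted for the default length of 3.
locus_tag_capacity() {
    length=$1
    capacity=1
    i=0
    while [ "$i" -lt "$length" ]; do
        capacity=$((capacity * 26))
        i=$((i + 1))
    done
    echo "$capacity"
}

locus_tag_capacity 3   # prints 17576
locus_tag_capacity 4   # prints 456976
```

Under this assumption, a collection larger than about 17.5k genomes would need a longer tag, for example `-l 4`.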