1. more info on prepTG
prepTG creates a database directory of target genomes that fai can then search for homologous instances of reference/query gene clusters.
The input is simply a directory of either FASTA or GenBank formatted files - with CDS features for the latter - representing bacterial genomes or metagenomes.
For eukaryotic genomes, full GenBank files with CDS features are expected; however, FASTA-formatted assemblies may be provided instead if a "reference proteome" is also supplied.
Check out example commands for prepTG on the 4. basic usage examples wiki page.
For bacterial genomes or bacterial metagenomes, users can use pyrodigal (default) or prodigal to perform de novo gene calling. More recently, prodigal-gv has also become available as a gene-calling option for cases where phages are the gene clusters of interest.
For eukaryotic genomes, users can map a high-quality gene-calling prediction for a reference genome onto the remaining genomes. This approach is generally recommended only for single-species investigations and has only been tested with microbial eukaryotic genomes of modest size (e.g. fungi, not gigantic genomes such as those of plants).
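The gene-calling choices described above correspond to a handful of command-line flags. As a rough sketch, assuming prepTG is installed and available on the PATH, the helper below assembles an argument list using only flags that appear in the help text on this page. The function name `build_preptg_command` and all paths are hypothetical placeholders, not part of zol. Leaving `-gcm` out when a reference proteome is given is a choice made for this sketch, not documented prepTG behavior.

```python
import shlex


def build_preptg_command(input_dir, output_dir, gene_caller="pyrodigal",
                         meta_mode=False, reference_proteome=None, threads=1):
    """Assemble a prepTG argument list (hypothetical helper, not part of zol)."""
    allowed = {"pyrodigal", "prodigal", "prodigal-gv"}
    if gene_caller not in allowed:
        raise ValueError(f"gene_caller must be one of {sorted(allowed)}")

    cmd = ["prepTG", "-i", input_dir, "-o", output_dir, "-c", str(threads)]
    if reference_proteome is not None:
        # Eukaryotic FASTA input: map proteins from a reference proteome.
        # This sketch leaves the gene-calling method at its default here.
        cmd += ["-rp", reference_proteome]
    else:
        # Prokaryotic FASTA input: de novo gene calling.
        cmd += ["-gcm", gene_caller]
        if meta_mode:
            cmd.append("-m")
    return cmd


# Placeholder paths, for illustration only.
phage_cmd = build_preptg_command("User_Genomes_Directory/", "prepTG_Database/",
                                 gene_caller="prodigal-gv", threads=4)
print(shlex.join(phage_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.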
Usage: prepTG [-h] [-i INPUT_DIR] [-g GTDB_TAXON] [-gr GTDB_RELEASE] [-d DOWNLOAD_PREMADE] -o OUTPUT_DIR [-r] [-ro] [-l LOCUS_TAG_LENGTH]
[-gcm GENE_CALLING_METHOD] [-m] [-rp REFERENCE_PROTEOME] [-cst] [-ma] [-c THREADS] [-mm MAX_MEMORY] [-v]
Program: prepTG
Author: Rauf Salamzade
Affiliation: Kalan Lab, UW Madison, Department of Medical Microbiology and Immunology
Prepares a directory of target genomes for being searched for query gene clusters using fai.
Premade databases of representative genomes are available for the following genera:
Acinetobacter (n=1,643), Bacillales (n=3,150), Corynebacterium (n=726), Enterobacter (n=878),
Enterococcus (n=937), Escherichia (n=2,436), Klebsiella (n=1,022), Listeria (n=353),
Mycobacterium (n=744), Pseudomonas (n=2,666), Salmonella (n=308), Staphylococcus (n=496),
Streptomyces (n=1,555), Streptococcus (n=2,452), Cutibacterium (n=27), Neisseria (n=414),
Lactobacillus (n=541), and Micromonospora (n=211).
In addition, users can simply request all genomes belonging to a specific species/genus
in GTDB R214 to be downloaded.
----------------------------------------------------------------------------------------
> Example commands:
1. Setup a prepTG database which includes some local genomes in FASTA format
$ prepTG -i User_Genomes_Directory/ -o prepTG_Database/
2. Setup a prepTG database which includes some local genomes and all Cutibacterium granulosum
genomes in GTDB R214:
$ prepTG -i User_Genomes_Directory/ -g "Cutibacterium granulosum" -o prepTG_Database/
3. Setup local prepTG database by downloading a premade one of representative
Cutibacterium genomes:
$ prepTG -d Cutibacterium -o prepTG_Database/
----------------------------------------------------------------------------------------
> Considerations
If FASTA format is provided, assumption is that genomes are prokaryotic and
pyrodigal/prodigal will be used to perform gene-calling. Eukaryotic genomes can
be provided as FASTA format but the --reference-proteome file should be used in
such case to map proteins from a reference proteome (from the same species ideally)
on to the target genomes. This will prevent detection of new genes in gene-clusters
detected by fai but synchronize gene-calling and allow for better similarity
assessment between orthologous genes.
If you are interested in inferring horizontal gene transfer using salt downstream
and are working with bacterial genomes - consider issuing the "--mge-annotation"
flag to annotate phage, plasmid and IS element associated proteins.
If GenBank files are provided, CDS features are expected and further each CDS
feature must contain a "translation" qualifier which features the protein sequence
and optionally a "locus_tag" qualifier. Options to consider if not every CDS has a
"locus_tag" include --rename-locus-tags and --rename-problem-gbk-lts.
Options:
-h, --help show this help message and exit
-i, --input-dir INPUT_DIR
Directory with target genomes (either featuring GenBanks or FASTAs).
-g, --gtdb-taxon GTDB_TAXON
Name of a GTDB valid genus or species to incorporate genomes from.
Should be surrounded by
quotes (e.g. "Escherichia coli").
-gr, --gtdb-release GTDB_RELEASE
GTDB release to use. [Current default is R220].
-d, --download-premade DOWNLOAD_PREMADE
Download and setup pre-made databases of representative genomes
for specific taxon/genus. Provide name of the taxon,
e.g. "Escherichia"
-o, --output-dir OUTPUT_DIR
Output directory, which can then be provided as input for the
"-tg" argument in fai.
-r, --rename-locus-tags
Whether to rename locus tags if provided for CDS features in
GenBanks.
-ro, --rename-problem-gbk-lts
Whether to rename locus tags of only problem GenBank
files which have CDS but no locus_tag qualifier. By default
such GenBank files are skipped.
-l, --locus-tag-length LOCUS_TAG_LENGTH
Length of locus tags to set. Default is 3, allows for <~18k
genomes.
-gcm, --gene-calling-method GENE_CALLING_METHOD
Method to use for gene calling. Options are: pyrodigal, prodigal,
or prodigal-gv. [Default is pyrodigal].
-m, --meta-mode Flag to use meta mode instead of single for pyrodigal/prodigal.
-rp, --reference-proteome REFERENCE_PROTEOME
Provide path to a reference proteome to use for protein/
gene-calling in target genomes - which should be in FASTA
format.
-cst, --create-species-tree
Use skani to infer a neighbor-joining based species
tree for the genomes.
-ma, --mge-annotation
Perform MGE annotation of proteins - for
bacterial genomes only.
-c, --threads THREADS
The number of threads to use.
[Default is 1].
-mm, --max-memory MAX_MEMORY
Uses resource module to set soft memory limit. Provide
in Giga-bytes. Configured in the shell environment
[Default is None; experimental].
-v, --version Get version and exit.
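The considerations above state that each CDS feature in a GenBank file must carry a "translation" qualifier and may carry a "locus_tag" qualifier. Before running prepTG, it can help to check which input files fall short. The following is a minimal sketch that scans the feature table as plain text. The function name `summarize_cds_qualifiers` is hypothetical, and the scan assumes standard GenBank column layout (feature keys starting in column 6). A full parser such as Biopython would be more robust, for example against free-text qualifier values whose continuation lines begin with a slash.

```python
def summarize_cds_qualifiers(genbank_text):
    """Count CDS features and those lacking translation / locus_tag qualifiers."""
    cds_qualifiers = []  # one set of qualifier names per CDS feature
    in_features = False
    in_cds = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            # A new top-level section (e.g. ORIGIN, CONTIG) ends the feature table.
            in_features = False
            in_cds = False
            continue
        if len(line) > 5 and line[5] != " ":
            # Feature keys start in column 6; qualifiers are indented further.
            in_cds = line[5:].split()[0] == "CDS"
            if in_cds:
                cds_qualifiers.append(set())
            continue
        stripped = line.strip()
        if in_cds and stripped.startswith("/"):
            cds_qualifiers[-1].add(stripped[1:].split("=", 1)[0])
    return {
        "cds": len(cds_qualifiers),
        "missing_translation": sum("translation" not in q for q in cds_qualifiers),
        "missing_locus_tag": sum("locus_tag" not in q for q in cds_qualifiers),
    }


# Tiny made-up record, for illustration only.
example = """\
LOCUS       demo                     300 bp    DNA     linear   BCT 01-JAN-2000
FEATURES             Location/Qualifiers
     source          1..300
                     /organism="Example organism"
     CDS             1..90
                     /locus_tag="DEMO_0001"
                     /translation="MKTAYIAKQR"
     CDS             100..189
                     /translation="MSLNEQ"
ORIGIN
//
"""
summary = summarize_cds_qualifiers(example)
print(summary)
```

A file reporting a non-zero `missing_locus_tag` count is a candidate for `--rename-locus-tags` or `--rename-problem-gbk-lts`, while a non-zero `missing_translation` count means the file does not meet the stated input requirements.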
| Option (Short) | Option (Long) | Description |
|---|---|---|
| -i | --input-dir | Directory with target genomes (either featuring GenBanks or FASTAs). If GenBank files are provided - they must feature CDS features with gene predictions. They should preferably also include a "locus_tag" for each CDS feature, but this can be overcome using the --rename-locus-tags or --rename-problem-gbk-lts options. |
| -d | --download-premade | Download and setup pre-made databases of representative genomes for specific taxon/genus selected by skDER. For details on available databases, see this wiki page. Provide the name of the taxon, e.g. "Escherichia". |
| -g | --gtdb-taxon | Name of a GTDB valid genus or species to incorporate genomes from. Should be surrounded by quotes (e.g. "Escherichia coli"). |
| -gr | --gtdb-release | GTDB release to use. [Current default is R220]. |
| -o | --output-dir | Output directory, which can then be provided as input for the -tg argument in fai. |
| -l | --locus-tag-length | Length of locus tags to set. Default is 3, allows for <~18k genomes. |
| -r | --rename-locus-tags | Whether to rename locus tags for CDS features if GenBank files are provided as input. |
| -ro | --rename-problem-gbk-lts | Whether to rename locus tags of only problem GenBank files which have CDS but no locus_tag qualifier. By default such GenBank files are skipped. |
| -gcm | --gene-calling-method | Method to use for gene calling. Options are: pyrodigal, prodigal, or prodigal-gv. [Default is pyrodigal]. |
| -m | --meta-mode | Flag to use metagenomic mode for pyrodigal/prodigal. |
| -rp | --reference-proteome | Provide path to a reference proteome to use for miniprot-based protein mapping in target genomes. The input proteome should be in FASTA format. |
| -cst | --create-species-tree | Use skani ANI predictions to infer a neighbor-joining based species tree for the genomes. Not recommended if working with diverse genome sets spanning multiple species. |
| -ma | --mge-annotation | Perform MGE annotation of proteins - for bacterial genomes only. This is only used for the salt program downstream, be careful when using for large datasets as it increases prepTG runtime significantly. |
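
The note on `--locus-tag-length` says that the default length of 3 allows for fewer than ~18k genomes. That figure is consistent with each tag position being drawn from a 26-letter alphabet (26^3 = 17,576). The 26-letter alphabet is an assumption made here to illustrate the arithmetic; the help text does not state how prepTG generates tags. Under that assumption, a short sketch for picking a length for larger genome sets (both function names are hypothetical):

```python
def locus_tag_capacity(length, alphabet_size=26):
    """Number of distinct tags of a given length (assumes a 26-letter alphabet)."""
    return alphabet_size ** length


def min_locus_tag_length(n_genomes, alphabet_size=26):
    """Smallest tag length whose capacity covers n_genomes."""
    length = 1
    while locus_tag_capacity(length, alphabet_size) < n_genomes:
        length += 1
    return length


print(locus_tag_capacity(3))         # 17576
print(min_locus_tag_length(50_000))  # 4, since 26**4 = 456976
```

If the assumption holds, a database of roughly 50,000 genomes would call for `-l 4`.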