prodigal to pyrodigal #2306

Open
wants to merge 17 commits into
base: master
23 changes: 15 additions & 8 deletions anvio/__init__.py
@@ -295,13 +295,12 @@ def TABULATE(table, header, numalign="right", max_width=0):
['--prodigal-translation-table'],
{'metavar': 'INT',
'default': None,
'help': "This is a parameter to pass to the Prodigal for a specific translation table. This parameter "
'help': "This is a parameter to pass to the Pyrodigal-gv for a specific translation table. This parameter "
"corresponds to the parameter `-g` in Prodigal, the default value of which is 11 (so if you do "
"not set anything, it will be set to 11 in Prodigal runtime. Please refer to the Prodigal "
"not set anything, it will be set to 11 in Pyrodigal-gv runtime. Please refer to the Prodigal "
"documentation to determine what is the right translation table for you if you think you need "
"it.)"}
),

'skip-gene-calling': (
['--skip-gene-calling'],
{'default': False,
@@ -314,14 +313,22 @@ def TABULATE(table, header, numalign="right", max_width=0):
['--prodigal-single-mode'],
{'default': False,
'action': 'store_true',
'help': "By default, anvi'o will use prodigal for gene calling (unless you skipped gene calling, or provided "
"anvi'o with external gene calls). One of the flags anvi'o includes in prodigal run is `-p meta`, which "
"optimizes prodigal's ability to identify genes in metagenomic assemblies. In some rare cases, for a "
"given set of contigs prodigal will yield a segmentation fault error due to one or more genes in your "
'help': "By default, anvi'o will use pyrodigal-gv for gene calling (unless you skipped gene calling, or provided "
"anvi'o with external gene calls). One of the flags anvi'o includes in pyrodigal-gv run is `-p meta`, which "
"optimizes pyrodigal-gv's ability to identify genes in metagenomic assemblies. In some rare cases, for a "
"given set of contigs pyrodigal-gv will yield a segmentation fault error due to one or more genes in your "
"collections will confuse the program when it is used with the `-p meta` flag. While anvi'o developers "
"are not quite sure under what circumstances this happens, we realized that removal of this flag often "
"solves this issue. If you are dealing with such cryptic errors, the inclusion of `--skip-prodigal-meta-flag` "
"will instruct anvi'o to run prodigal without the `-meta` flag, and may resolve this issue for you."}
"will instruct anvi'o to run pyrodigal-gv without the `-meta` flag, and may resolve this issue for you."}
),
'full-gene-calling-report': (
['--full-gene-calling-report'],
{'metavar': 'FILE',
'default': None,
'help': "When anvi'o is done with gene calling using pyrodigal, it only stores some data about individual gene "
"calls. Using this parameter you can pass an output file to report more comprehensive data on gene calls "
"as a TAB-delimited text file with gene caller ids matching to those that are stored in the contigs-db."}
),
'remove-partial-hits': (
['--remove-partial-hits'],
2 changes: 1 addition & 1 deletion anvio/constants.py
@@ -114,7 +114,7 @@
pan_default = "presence-absence"
trnaseq_default = "cov"

default_gene_caller = "prodigal"
default_gene_caller = "pyrodigal-gv"

# see https://github.com/merenlab/anvio/issues/1358
gene_call_types = {'CODING': 1,
4 changes: 2 additions & 2 deletions anvio/dbops.py
@@ -4443,7 +4443,7 @@ def create(self, args):
"skipped. Please make up your mind.")

if (external_gene_calls_file_path or skip_gene_calling) and prodigal_translation_table:
raise ConfigError("You asked anvi'o to %s, yet you set a specific translation table for prodigal. These "
raise ConfigError("You asked anvi'o to %s, yet you set a specific translation table for pyrodigal-gv. These "
"parameters do not make much sense and anvi'o is kindly asking you to make up your "
"mind." % ('skip gene calling' if skip_gene_calling else 'use external gene calls'))

@@ -4636,7 +4636,7 @@ def create(self, args):
self.run.info('External gene calls file have AA sequences?', external_gene_calls_include_amino_acid_sequences, mc='green')
self.run.info('Proper frames will be predicted?', (not skip_predict_frame), mc='green')
else:
self.run.info('Is prodigal run in single mode?', ('YES' if prodigal_single_mode else 'NO'), mc='green')
self.run.info('Is pyrodigal-gv run in single mode?', ('YES' if prodigal_single_mode else 'NO'), mc='green')

self.run.info('Ignoring internal stop codons?', ignore_internal_stop_codons)
self.run.info('Splitting pays attention to gene calls?', (not skip_mindful_splitting))
6 changes: 3 additions & 3 deletions anvio/docs/__init__.py
@@ -14,7 +14,7 @@
"artifacts_accepted": ['fasta-txt'],
"anvio_workflows_inherited": [],
"third_party_programs_used": [
('Gene calling', ['prodigal']),
('Gene calling', ['pyrodigal-gv']),
('HMM search', ['HMMER']),
('Gene taxonomy', ['krakenuniq', 'centrifuge']),
('Sequence search against various databases', ['DIAMOND'])
@@ -35,7 +35,7 @@
('Quality control of short reads', ['illumina-utils']),
('Assembly', ['IDBA-UD', 'metaSPAdes', 'MEGAHIT']),
('BAM file manipulations', ['samtools']),
('Gene calling', ['prodigal']),
('Gene calling', ['pyrodigal-gv']),
('HMM search', ['HMMER']),
('Gene taxonomy', ['krakenuniq', 'centrifuge']),
('Read recruitment', ['Bowtie2'])
@@ -116,7 +116,7 @@
'metaSPAdes': {'link': "https://cab.spbu.ru/software/meta-spades/"},
'MEGAHIT': {'link': 'https://github.com/voutcn/megahit'},
'samtools': {'link': 'http://www.htslib.org/'},
'prodigal': {'link': 'https://github.com/hyattpd/Prodigal'},
'pyrodigal-gv': {'link': 'https://github.com/althonos/pyrodigal-gv'},
'HMMER': {'link': 'http://hmmer.org/'},
'Bowtie2': {'link': 'https://github.com/BenLangmead/bowtie2'},
'krakenuniq': {'link': 'https://github.com/fbreitwieser/krakenuniq'},
2 changes: 1 addition & 1 deletion anvio/docs/programs/anvi-export-gene-calls.md
@@ -7,7 +7,7 @@ anvi-export-gene-calls -c %(contigs-db)s \
--list-gene-callers
{{ codestop }}

Running this will export all of your gene calls identified by the gene caller [prodigal](https://github.com/hyattpd/Prodigal) (assuming it is in your %(contigs-db)s:
Running this will export all of your gene calls identified by the gene caller [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) (assuming it is in your %(contigs-db)s):

{{ codestart }}
anvi-export-gene-calls -c %(contigs-db)s \
28 changes: 27 additions & 1 deletion anvio/docs/programs/anvi-gen-contigs-database.md
@@ -10,7 +10,8 @@ When run on a %(contigs-fasta)s this program will,

* **Soft-split contigs** longer than 20,000 bp into smaller ones (you can change the split size using the `--split-length` flag). When the gene calling step is not skipped, the process of splitting contigs will consider where genes are and avoid cutting genes in the middle. For very, very large assemblies this process can take a while, and you can skip it with `--skip-mindful-splitting` flag.

* **Identify open reading frames** using [Prodigal](http://prodigal.ornl.gov/), UNLESS, (1) you have used the flag `--skip-gene-calling` (no gene calls will be made) or (2) you have provided %(external-gene-calls)s.
* **Identify open reading frames** using [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv), an extension of [pyrodigal](https://github.com/althonos/pyrodigal) ([doi:10.21105/joss.04296](https://doi.org/10.21105/joss.04296)), which in turn builds upon [prodigal](http://prodigal.ornl.gov/), the approach originally implemented by Hyatt et al. ([doi:10.1186/1471-2105-11-119](https://doi.org/10.1186/1471-2105-11-119)).
Additionally, pyrodigal-gv includes metagenomic models for giant viruses and viruses with alternative genetic codes by Camargo et al. ([doi:10.1038/s41587-023-01953-y](https://doi.org/10.1038/s41587-023-01953-y)). Gene calling is performed **UNLESS** (1) you have used the flag `--skip-gene-calling` (no gene calls will be made) or (2) you have provided %(external-gene-calls)s. See other details related to gene calling below.

{:.notice}
This program can work with compressed input FASTA files (i.e., the file name ends with a `.gz` extension).
@@ -61,6 +62,31 @@ anvi-gen-contigs-database -f %(contigs-fasta)s \
--ignore-internal-stop-codons
{{ codestop }}

### Gene calling

By default, this program will use [pyrodigal-gv](https://github.com/althonos/pyrodigal-gv) for gene calling, and it stores the key information about the genes found in the input sequences in the resulting %(contigs-db)s. That said, gene calls include much more information than what is stored in the database. If you need a more comprehensive report on gene calls, you can run %(anvi-gen-contigs-database)s with the following parameter:

{{ codestart }}
anvi-gen-contigs-database -f %(contigs-fasta)s \
-o %(contigs-db)s \
--full-gene-calling-report OUTPUT.txt
{{ codestop }}

In this case `OUTPUT.txt` will contain additional data, and its gene caller ids will match those stored in the database, so every gene call remains traceable through all downstream analyses that stem from the resulting %(contigs-db)s. Here is a snippet from an example file:

|**gene_callers_id**|**contig**|**start**|**stop**|**direction**|**partial**|**partial_begin**|**partial_end**|**confidence**|**gc_cont**|**rbs_motif**|**rbs_spacer**|**score**|**cscore**|**rscore**|**sscore**|**start_type**|**translation_table**|**tscore**|**uscore**|**sequence**|**translated_sequence**|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|0|NC_009091|169|1327|f|False|False|False|99.99|0.3013|||153.96808116167358|151.14841116167358|-0.9874499999999999|2.8196700000000003|ATG|11|2.8971|0.9100200000000003|ATGGAAATTATTTGTAATCAAAATGAATTA(...)|MEIICNQNELNNAIQLVSKAVASRPTHPIL(...)|
|1|NC_009091|1388|2036|f|False|False|False|99.99|0.2885|GGA/GAG/AGG|5-10bp|81.87259573141999|71.58136573142|2.71875|10.291229999999999|ATG|11|2.8971|4.67538|ATGGTCCTTAATTATGGAAATGGTGAAAAT(...)|MVLNYGNGENVWMHPPVHRILGWYSRPSNL(...)|
|2|NC_009091|2039|4379|f|False|False|False|99.99|0.3260|GGA/GAG/AGG|5-10bp|352.2303246874159|347.0894946874159|2.71875|5.14083|ATG|11|2.8971|-0.47502|ATGATAAATCAAGAAAATAATGATCTATAT(...)|MINQENNDLYDLNEALQVENLTLNDYEEIC(...)|
|3|NC_009091|4426|5887|f|False|False|False|99.99|0.3347|||194.19481790304437|188.61028790304437|-0.9874499999999999|5.584530000000002|ATG|11|2.8971|3.6748800000000017|ATGTGCGGAATAGTTGGAATCGTTTCTTCA(...)|MCGIVGIVSSDDVNQQIYDSLLLLQHRGQD(...)|
|4|NC_009091|5883|8325|r|False|False|False|99.99|0.2731|||318.93104438253084|315.33533438253085|-0.9874499999999999|3.5957100000000004|ATG|11|2.8971|1.6860600000000003|ATGGATAAGAAAAACTTCACTTCTATCTCA(...)|MDKKNFTSISLQEEMQRSYLEYAMSVIVGR(...)|
|5|NC_009091|8402|9266|r|False|False|False|99.99|0.2719|||99.52927145207937|93.12346145207937|-0.9874499999999999|6.405809999999998|ATG|11|2.8971|4.496159999999998|ATGAAAAAGTTTTTACAAAGAATACTCTGG(...)|MKKFLQRILWISLISFYFLQIKKVQAIVPY(...)|
|6|NC_009091|9262|10219|r|False|False|False|99.99|0.3239|||107.44082411540528|104.45237411540528|-0.9874499999999999|2.9884500000000003|ATG|11|2.8971|1.0788000000000002|ATGATTAATAGGATTCAAGACAAAAAAGAA(...)|MINRIQDKKEISKKLKERAIFEGFTIAGIA(...)|
|7|NC_009091|10365|11100|f|False|False|False|99.99|0.3319|||71.66956065705634|79.71184065705633|-0.9874499999999999|-8.042279999999998|TTG|11|-9.60915|2.55432|TTGGTTGAATCTAATCAAAATCAAGATTCC(...)|MVESNQNQDSNLGSRLQQDLKNDLIAGLLV(...)|
|8|NC_009091|11103|11721|f|False|False|False|99.99|0.2993|||69.10757680331382|69.03014680331381|-0.9874499999999999|0.07743000000000055|ATG|11|2.8971|-1.8322199999999995|ATGCATAATAGATCTCTTTCTAGAGAATTA(...)|MHNRSLSRELSLISLGLIKDKGDLKLNKFQ(...)|
|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|(...)|
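
Since the report is a plain TAB-delimited file, it can be read with the Python standard library alone. The following is a minimal sketch rather than anything anvi'o ships: the helper name `read_gene_calling_report` is hypothetical, the code assumes the header row uses the column names shown in the snippet above, and the in-memory excerpt (with only a subset of the columns) stands in for a real `OUTPUT.txt`.

```python
import csv
import io


def read_gene_calling_report(handle):
    """Yield one dict per gene call from a TAB-delimited report.

    Hypothetical helper, not part of anvi'o. It assumes the header row
    uses the column names shown in the example snippet.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        # Convert the fields known to be numeric; everything else stays a string.
        row["gene_callers_id"] = int(row["gene_callers_id"])
        row["start"] = int(row["start"])
        row["stop"] = int(row["stop"])
        row["confidence"] = float(row["confidence"])
        yield row


# A two-gene excerpt with a subset of the columns, standing in for OUTPUT.txt.
example = io.StringIO(
    "gene_callers_id\tcontig\tstart\tstop\tdirection\tconfidence\tstart_type\n"
    "0\tNC_009091\t169\t1327\tf\t99.99\tATG\n"
    "7\tNC_009091\t10365\t11100\tf\t99.99\tTTG\n"
)

genes = list(read_gene_calling_report(example))

# Ids of gene calls that do not start with the canonical ATG codon.
non_atg = [g["gene_callers_id"] for g in genes if g["start_type"] != "ATG"]
print(non_atg)  # prints [7]
```

With a real report, the same helper could be given a file handle instead, e.g. `open("OUTPUT.txt", newline="")`. Because the gene caller ids match those in the database, the resulting ids can be used to look up the same genes in downstream outputs.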

### Changing k-mer size

You can change the k-mer size by modifying the `--kmer-size` parameter:
2 changes: 1 addition & 1 deletion anvio/docs/programs/anvi-report-inversions.md
@@ -133,7 +133,7 @@ For every inversion, anvi'o can report the surrounding genes and their function

You can use the flag `--num-genes-to-consider-in-context` to choose how many genes to consider upstream/downstream of the inversion. By default, anvi'o reports three genes downstream and three genes upstream.

To select a specific gene caller, you can use `--gene-caller`. The default is prodigal.
To select a specific gene caller, you can use `--gene-caller`. The default is pyrodigal-gv.

If you want to skip this step, you can use the flag `--skip-recovering-genomic-context`.
