Skip to content

Commit

Permalink
a few fixes
Browse files Browse the repository at this point in the history
  • Loading branch information
mschecht committed Dec 11, 2024
1 parent 947f3a3 commit 083ed02
Showing 1 changed file with 36 additions and 34 deletions.
70 changes: 36 additions & 34 deletions anvio/docs/workflows/ecophylo.md
Original file line number Diff line number Diff line change
Expand Up @@ -156,7 +156,7 @@ Genes predicted from genomes and metagenomes can be partial or complete dependin
To remove partial genes from the ecophylo analysis, the user can assign `true` for `--filter-out-partial-gene-calls` parameter so that only complete open-reading frames are processed.

{:.notice}
What is below is the default settings in the ecophylo %(workflow-config)s file.
Below is the default settings in the ecophylo %(workflow-config)s file.

```bash
{
Expand All @@ -169,27 +169,12 @@ What is below is the default settings in the ecophylo %(workflow-config)s file.
}
```

### Multiple sequence alignment step with MUSCLE

One step of ecophylo is to perform a multiple sequence alignment of the recruited homologs and depending on your application, this could recruit thousands of ORFs which make the MSA a challenging feat. By default, the ecophylo is designed for quick insights, and thus the %(workflow-config)s file uses MUSCLE parameters to perform a large MSA, swiftly:

```bash
"align_sequences": {
"threads": 5,
"additional_params": "-maxiters 1 -diags -sv -distance1 kbit20_3"
},
```

However, these parameters may not be optimal for your use case. For example, maybe you are trying to explore branches patterns of a specific protein family and would prefer to have mulitple interations of the MSA. Please explore the MUSCLE documentation to [documentation](https://www.drive5.com/muscle/muscle.html) customize the MSA step for your needs. You can replace the `additional_params` with whatever MUSCLE parameters that are best for you.

### discovery-mode: ALL open-reading frames

However, maybe you're a risk taker, a maverick explorer of metagenomes. Complete or partial you accept all genes and their potential tree bending shortcomings! In this case, set `--filter-out-partial-gene-calls false` in the %(workflow-config)s.

{:.notice}
Simultaneously exploring complete and partial ORFs will increase the distribution of sequence lengths and thus impact sequence clustering. We recommend adjusting `cluster_X_percent_sim_mmseqs` to `"--cov-mode": 1` to help insure ORFs of all length properly cluster together. Please refer to the [MMseqs2 user guide description of --cov-mode](https://mmseqs.com/latest/userguide.pdf) for more details.

#FIXME: we ALWAYS recommend --cov-mode 1 to group protein fragments as well as overextended ORFs caused by early or late stop codons.
Simultaneously exploring complete and partial ORFs will increase the distribution of sequence lengths and thus impact sequence clustering. By default, we used mmseqs `"--cov-mode": 1` in the rule `cluster_X_percent_sim_mmseqs` to help insure ORFs of all lengths properly cluster together. Please refer to the [MMseqs2 user guide description of --cov-mode](https://mmseqs.com/latest/userguide.pdf) for more details and adjust the parameter to suite your scientific endeavors.

```bash
{
Expand Down Expand Up @@ -218,6 +203,19 @@ Now that you have fine tuned the gene family input into the ecophylo workflow, i
{:.notice}
It's common that not all genomes or metagenomes will have the gene family of interest either due to it not being detect by the input HMM or filtered out during the QC steps. Please check this log file for %(contigs-db)s that did not contain your gene family of interest: `00_LOGS/contigDBs_with_no_hmm_hit_*.log`

### Multiple sequence alignment step with MUSCLE

One step of ecophylo is to perform a multiple sequence alignment of the recruited homologs and depending on your application, this could recruit thousands of ORFs which make the MSA a challenging feat. By default, the ecophylo is designed for quick insights, and thus the %(workflow-config)s file uses MUSCLE parameters to perform a large MSA, swiftly:

```bash
"align_sequences": {
"threads": 5,
"additional_params": "-maxiters 1 -diags -sv -distance1 kbit20_3"
},
```

However, these parameters may not be optimal for your use case. For example, maybe you are trying to explore branches patterns of a specific protein family and would prefer to have mulitple interations of the MSA. Please explore the MUSCLE documentation to [documentation](https://www.drive5.com/muscle/muscle.html) customize the MSA step for your needs. You can replace the `additional_params` with whatever MUSCLE parameters that are best for you.

## tree-mode: Insights into the evolutionary patterns of target genes

This is the simplest implementation of ecophylo where only an amino acid based phylogenetic tree is calculated. The workflow will extract the target gene from input assemblies, cluster and pick representatives, then calculate a phylogenetic tree based on the amino acid representative sequences. There are two sub-modes of [tree-mode](#tree-mode-insights-into-the-evolutionary-patterns-of-target-genes) which depend on how you pick representative sequences, [NT-mode](#nt-mode) or [AA-mode](#aa-mode) where extracted genes associated nucleotide version (NT) or the amino acid (AA) can be used to cluster the dataset and pick representatives, respectively.
Expand Down Expand Up @@ -265,7 +263,7 @@ Be sure to change the `--min-seq-id` of the `cluster_X_percent_sim_mmseqs` rule
### Visualize the output

```bash
PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -t 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.nwk \
-p 05_TREES/"${PROTEIN}"/"${PROTEIN}"-PROFILE.db \
--fasta 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.faa \
Expand Down Expand Up @@ -295,7 +293,7 @@ To initialize [profile-mode](#profile-mode-insights-into-the-ecological-and-evol
To visualize the output of [profile-mode](#profile-mode-insights-into-the-ecological-and-evolutionary-patterns-of-target-genes-and-environments), run %(anvi-interactive)s on the %(contigs-db)s and %(profile-db)s located in the `METAGENOMICS_WORKFLOW` directory.

```bash
PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
-c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db \
--manual
Expand All @@ -305,7 +303,7 @@ anvi-interactive -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
Just want a quick look at the tree without read recruitment results?

```bash
PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -t 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.nwk \
-p 05_TREES/"${PROTEIN}"/"${PROTEIN}"-PROFILE.db \
--fasta 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.faa \
Expand All @@ -323,13 +321,13 @@ This is just a code outline. Please adjust parameters for the various steps to m

**Step 1.** Make collection of bad branches

Make a collection of branches you would like to remove and safe it!
Make a collection of branches you would like to remove and save it!

**Step 2.** Export collection and remove those sequences from the protein fasta file

```bash
HOME_DIR="ECOPHYLO"
PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt
cd $HOME_DIR

mkdir SUBSET_TREE
Expand Down Expand Up @@ -359,7 +357,7 @@ clusterize "FastTree SUBSET_TREE/"${ALIGNMENT_PREFIX}"_trimmed_filtered.faa > SU
**Step 3.** Use `anvi-split` to remove bad branches from the ecophylo interface

```bash
PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt

grep -v -f SUBSET_TREE/bad_branches_headers.txt SUBSET_TREE/collection-DEFAULT.txt | sed 's|EVERYTHING|EVERYTHING_curated|' > SUBSET_TREE/my_bins.txt

Expand Down Expand Up @@ -399,7 +397,7 @@ add_split_string_to_tree <- function(IN_PATH, OUT_PATH) {
write.tree(tree, file = OUT_PATH)
}

PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt
add_split_string_to_tree(IN_PATH = glue("{PROTEIN}_trimmed_filtered_FastTree.nwk"),
OUT_PATH = glue("{PROTEIN}_trimmed_filtered_FastTree_ed.nwk"))
{{ codestop }}
Expand Down Expand Up @@ -436,7 +434,6 @@ The ecophylo workflow by default uses [FastTree](http://www.microbesonline.org/f
}
```


## Common questions

### The ecophylo workflow died at the run_metagenomics_workflow rule and printed this message in the log file, what should I do?
Expand All @@ -457,10 +454,10 @@ Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
```

One explanation for this error is none of the metagenomes in the %(samples-txt)s. you provided mapped any reads to extracted target proteins. To test this run the following command and see if you get this error:
One explanation for this error is none of the metagenomes in the %(samples-txt)s you provided mapped any reads to extracted target proteins. To test this run the following command and see if you get this error:

```bash
$ grep -B 10 "Nothing to merge" ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/00_LOGS/run_metagenomics_workflow.log
$ grep "Nothing to merge" ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/00_LOGS/run_metagenomics_workflow.log
Command 'set -euo pipefail; echo Nothing to merge for Ribosomal_S11. This should only happen if all profiles were empty (you can check the log file: 00_LOGS/Ribosomal_S11-anvi_merge.log to see if that is indeed the case). This file was created just so that your workflow would continue with no error (snakemake expects to find these output files and if we don't create them, then it will be upset). As we see it, there is no reason to throw an error here, since you mapped your metagenome to some fasta files and you got your answer: whatever you have in your fasta file is not represented in your metagenomes. Feel free to contact us if you think that this is our fault. sincerely, Meren Lab >> 00_LOGS/Ribosomal_S11-anvi_merge.log' returned non-zero exit status 1.
```
Expand Down Expand Up @@ -525,7 +522,7 @@ ECOPHYLO_WORKFLOW/
To visualize the results of the different ecophylo runs, just change the paths to include the different proteins:

```bash
PROTEIN=""
PROTEIN="" # Replace with name of protein from hmm_list.txt
anvi-interactive -c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db
```

Expand Down Expand Up @@ -557,22 +554,27 @@ Finally, edit the `"HOME"` string to a new path to ensure you make a new directo

### Can I add more genomes and metagenomes to my analysis?

Yes you can add more genomes and metagenomes in your %(metagenomes)s, %(external-genomes)s, and %(samples-txt)s
Yes you can add more genomes and metagenomes in your %(metagenomes)s, %(external-genomes)s, and %(samples-txt)s.

BUT, you need to do a couple of steps first so that Snakemake can restart all the processes and maintain as much data as possible:

If you are just adding more metagenomes to your %(samples-txt)s, we luckily can preserve the majority of files in the `METAGENOMICS_WORKFLOW/`. First, edit your %(samples-txt)s files to add more metagenomes, then delete these files below, and finally restart the workflow.
```bash
HOME_DIR="ECOPHYLO_WORKFLOW_asdf"
HOME_DIR="ECOPHYLO_WORKFLOW"
PROTEIN="Ribosomal_S11"
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/03_CONTIGS/
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/05_ANVIO_PROFILE/
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/06_MERGED/
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/07_SUMMARY/
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_ECOPHYLO_WORKFLOW_state.json
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_add_default_collection.done
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_state_imported_profile.done
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/fasta.txt
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/metagenomics_config.json
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/metagenomics_workflow.done
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/samples.txt
```
```

If you want to add more assemblies to your %(metagenomes)s or %(external-genomes)s, you'll sadly need to delete the whole `METAGENOMICS_WORKFLOW/` and restart the workflow.
```bash
HOME_DIR="ECOPHYLO_WORKFLOW"
PROTEIN="Ribosomal_S11" # Replace with name of protein from hmm_list.txt
rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/
```

0 comments on commit 083ed02

Please sign in to comment.