a few fixes

merenlab · Dec 11, 2024 · 083ed02 · 083ed02
1 parent 947f3a3
commit 083ed02
Showing 1 changed file with 36 additions and 34 deletions.
diff --git a/anvio/docs/workflows/ecophylo.md b/anvio/docs/workflows/ecophylo.md
@@ -156,7 +156,7 @@ Genes predicted from genomes and metagenomes can be partial or complete dependin
 To remove partial genes from the ecophylo analysis, the user can assign `true` for `--filter-out-partial-gene-calls` parameter so that only complete open-reading frames are processed.
 
 {:.notice}
-What is below is the default settings in the ecophylo %(workflow-config)s file.
+Below is the default settings in the ecophylo %(workflow-config)s file.
 
 ```bash
 {
@@ -169,27 +169,12 @@ What is below is the default settings in the ecophylo %(workflow-config)s file.
 }
 ```
 
-### Multiple sequence alignment step with MUSCLE
-
-One step of ecophylo is to perform a multiple sequence alignment of the recruited homologs and depending on your application, this could recruit thousands of ORFs which make the MSA a challenging feat. By default, the ecophylo is designed for quick insights, and thus the %(workflow-config)s file uses MUSCLE parameters to perform a large MSA, swiftly: 
-
-```bash
-"align_sequences": {
-    "threads": 5,
-    "additional_params": "-maxiters 1 -diags -sv -distance1 kbit20_3"
-},
-```
-
-However, these parameters may not be optimal for your use case. For example, maybe you are trying to explore branches patterns of a specific protein family and would prefer to have mulitple interations of the MSA. Please explore the MUSCLE documentation to [documentation](https://www.drive5.com/muscle/muscle.html) customize the MSA step for your needs. You can replace the `additional_params` with whatever MUSCLE parameters that are best for you. 
-
 ### discovery-mode: ALL open-reading frames
 
 However, maybe you're a risk taker, a maverick explorer of metagenomes. Complete or partial you accept all genes and their potential tree bending shortcomings! In this case, set `--filter-out-partial-gene-calls false` in the %(workflow-config)s.
 
 {:.notice}
-Simultaneously exploring complete and partial ORFs will increase the distribution of sequence lengths and thus impact sequence clustering. We recommend adjusting `cluster_X_percent_sim_mmseqs` to `"--cov-mode": 1` to help insure ORFs of all length properly cluster together. Please refer to the [MMseqs2 user guide description of --cov-mode](https://mmseqs.com/latest/userguide.pdf) for more details.
-
-#FIXME: we ALWAYS recommend --cov-mode 1 to group protein fragments as well as overextended ORFs caused by early or late stop codons.
+Simultaneously exploring complete and partial ORFs will increase the distribution of sequence lengths and thus impact sequence clustering. By default, we used mmseqs `"--cov-mode": 1` in the rule  `cluster_X_percent_sim_mmseqs` to help insure ORFs of all lengths properly cluster together. Please refer to the [MMseqs2 user guide description of --cov-mode](https://mmseqs.com/latest/userguide.pdf) for more details and adjust the parameter to suite your scientific endeavors. 
 
 ```bash
 {
@@ -218,6 +203,19 @@ Now that you have fine tuned the gene family input into the ecophylo workflow, i
 {:.notice}
 It's common that not all genomes or metagenomes will have the gene family of interest either due to it not being detect by the input HMM or filtered out during the QC steps. Please check this log file for %(contigs-db)s that did not contain your gene family of interest: `00_LOGS/contigDBs_with_no_hmm_hit_*.log`
 
+### Multiple sequence alignment step with MUSCLE
+
+One step of ecophylo is to perform a multiple sequence alignment of the recruited homologs and depending on your application, this could recruit thousands of ORFs which make the MSA a challenging feat. By default, the ecophylo is designed for quick insights, and thus the %(workflow-config)s file uses MUSCLE parameters to perform a large MSA, swiftly: 
+
+```bash
+"align_sequences": {
+    "threads": 5,
+    "additional_params": "-maxiters 1 -diags -sv -distance1 kbit20_3"
+},
+```
+
+However, these parameters may not be optimal for your use case. For example, maybe you are trying to explore branches patterns of a specific protein family and would prefer to have mulitple interations of the MSA. Please explore the MUSCLE documentation to [documentation](https://www.drive5.com/muscle/muscle.html) customize the MSA step for your needs. You can replace the `additional_params` with whatever MUSCLE parameters that are best for you. 
+
 ## tree-mode: Insights into the evolutionary patterns of target genes
 
 This is the simplest implementation of ecophylo where only an amino acid based phylogenetic tree is calculated. The workflow will extract the target gene from input assemblies, cluster and pick representatives, then calculate a phylogenetic tree based on the amino acid representative sequences. There are two sub-modes of [tree-mode](#tree-mode-insights-into-the-evolutionary-patterns-of-target-genes) which depend on how you pick representative sequences, [NT-mode](#nt-mode) or [AA-mode](#aa-mode) where extracted genes associated nucleotide version (NT) or the amino acid (AA) can be used to cluster the dataset and pick representatives, respectively.
@@ -265,7 +263,7 @@ Be sure to change the `--min-seq-id` of the `cluster_X_percent_sim_mmseqs` rule
 ### Visualize the output
 
 ```bash
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 anvi-interactive -t 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.nwk \
                  -p 05_TREES/"${PROTEIN}"/"${PROTEIN}"-PROFILE.db \
                  --fasta 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.faa \
@@ -295,7 +293,7 @@ To initialize [profile-mode](#profile-mode-insights-into-the-ecological-and-evol
 To visualize the output of [profile-mode](#profile-mode-insights-into-the-ecological-and-evolutionary-patterns-of-target-genes-and-environments), run %(anvi-interactive)s on the %(contigs-db)s and %(profile-db)s located in the `METAGENOMICS_WORKFLOW` directory.
 
 ```bash
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 anvi-interactive -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
                  -c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db \
                  --manual
@@ -305,7 +303,7 @@ anvi-interactive -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db \
 Just want a quick look at the tree without read recruitment results?
 
 ```bash
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 anvi-interactive -t 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.nwk \
                  -p 05_TREES/"${PROTEIN}"/"${PROTEIN}"-PROFILE.db \
                  --fasta 05_TREES/"${PROTEIN}"/"${PROTEIN}"_renamed.faa \
@@ -323,13 +321,13 @@ This is just a code outline. Please adjust parameters for the various steps to m
 
 **Step 1.** Make collection of bad branches
 
-Make a collection of branches you would like to remove and safe it!
+Make a collection of branches you would like to remove and save it!
 
 **Step 2.** Export collection and remove those sequences from the protein fasta file
 
 ```bash
 HOME_DIR="ECOPHYLO"
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 cd $HOME_DIR
 
 mkdir SUBSET_TREE
@@ -359,7 +357,7 @@ clusterize "FastTree SUBSET_TREE/"${ALIGNMENT_PREFIX}"_trimmed_filtered.faa > SU
 **Step 3.** Use `anvi-split` to remove bad branches from the ecophylo interface
 
 ```bash
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 
 grep -v -f SUBSET_TREE/bad_branches_headers.txt SUBSET_TREE/collection-DEFAULT.txt | sed 's|EVERYTHING|EVERYTHING_curated|' > SUBSET_TREE/my_bins.txt
 
@@ -399,7 +397,7 @@ add_split_string_to_tree <- function(IN_PATH, OUT_PATH) {
   write.tree(tree, file = OUT_PATH)
 }
 
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 add_split_string_to_tree(IN_PATH = glue("{PROTEIN}_trimmed_filtered_FastTree.nwk"),
                          OUT_PATH = glue("{PROTEIN}_trimmed_filtered_FastTree_ed.nwk"))
 {{ codestop }}
@@ -436,7 +434,6 @@ The ecophylo workflow by default uses [FastTree](http://www.microbesonline.org/f
 }
 ```
 
-
 ## Common questions
 
 ### The ecophylo workflow died at the run_metagenomics_workflow rule and printed this message in the log file, what should I do?
@@ -457,10 +454,10 @@ Shutting down, this might take some time.
 Exiting because a job execution failed. Look above for error message
 ```
 
-One explanation for this error is none of the metagenomes in the %(samples-txt)s. you provided mapped any reads to extracted target proteins. To test this run the following command and see if you get this error:
+One explanation for this error is none of the metagenomes in the %(samples-txt)s you provided mapped any reads to extracted target proteins. To test this run the following command and see if you get this error:
 
 ```bash
-$ grep -B 10 "Nothing to merge" ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/00_LOGS/run_metagenomics_workflow.log
+$ grep "Nothing to merge" ECOPHYLO_WORKFLOW/METAGENOMICS_WORKFLOW/00_LOGS/run_metagenomics_workflow.log
 Command 'set -euo pipefail;  echo Nothing to merge for Ribosomal_S11. This should only happen if all profiles were empty (you can check the log file: 00_LOGS/Ribosomal_S11-anvi_merge.log to see if that is indeed the case). This file was created just so that your workflow would continue with no error (snakemake expects to find these output files and if we don't create them, then it will be upset). As we see it, there is no reason to throw an error here, since you mapped your metagenome to some fasta files and you got your answer: whatever you have in your fasta file is not represented in your  metagenomes. Feel free to contact us if you think that this is our fault. sincerely, Meren Lab >> 00_LOGS/Ribosomal_S11-anvi_merge.log' returned non-zero exit status 1.
 ```
 
@@ -525,7 +522,7 @@ ECOPHYLO_WORKFLOW/
 To visualize the results of the different ecophylo runs, just change the paths to include the different proteins:
 
 ```bash
-PROTEIN=""
+PROTEIN="" # Replace with name of protein from hmm_list.txt
 anvi-interactive -c METAGENOMICS_WORKFLOW/03_CONTIGS/"${PROTEIN}"-contigs.db -p METAGENOMICS_WORKFLOW/06_MERGED/"${PROTEIN}"/PROFILE.db
 ```
 
@@ -557,22 +554,27 @@ Finally, edit the `"HOME"` string to a new path to ensure you make a new directo
 
 ### Can I add more genomes and metagenomes to my analysis?
 
-Yes you can add more genomes and metagenomes in your %(metagenomes)s, %(external-genomes)s, and %(samples-txt)s
+Yes you can add more genomes and metagenomes in your %(metagenomes)s, %(external-genomes)s, and %(samples-txt)s.
 
 BUT, you need to do a couple of steps first so that Snakemake can restart all the processes and maintain as much data as possible:
 
+If you are just adding more metagenomes to your %(samples-txt)s, we luckily can preserve the majority of files in the `METAGENOMICS_WORKFLOW/`. First, edit your %(samples-txt)s files to add more metagenomes, then delete these files below, and finally restart the workflow.
 ```bash
-HOME_DIR="ECOPHYLO_WORKFLOW_asdf"
+HOME_DIR="ECOPHYLO_WORKFLOW"
 PROTEIN="Ribosomal_S11"
-rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/03_CONTIGS/
-rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/05_ANVIO_PROFILE/
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/06_MERGED/
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/07_SUMMARY/
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_ECOPHYLO_WORKFLOW_state.json
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_add_default_collection.done
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/"${PROTEIN}"_state_imported_profile.done
-rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/fasta.txt
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/metagenomics_config.json
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/metagenomics_workflow.done
 rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/samples.txt
-```
+```
+
+If you want to add more assemblies to your %(metagenomes)s or %(external-genomes)s, you'll sadly need to delete the whole `METAGENOMICS_WORKFLOW/` and restart the workflow.
+```bash
+HOME_DIR="ECOPHYLO_WORKFLOW"
+PROTEIN="Ribosomal_S11" # Replace with name of protein from hmm_list.txt
+rm -rf "${HOME_DIR}"/METAGENOMICS_WORKFLOW/
+```