Summary

Note

This repository contains a legacy version of a workflow for finding SARS-CoV-2 cryptic lineages in the Sequence Read Archive (SRA), originally developed by David A. Baker and Devon Gregory. It is no longer actively maintained, but is pinned here for reproducibility along with Suarez et al. 2025 (manuscript in prep).

Please note that while this version is no longer maintained, a newer alternative is under active development at dholab. Users looking to search SRA for cryptic lineages themselves should reach out to the O'Connor group for updates on when the project will be open-sourced.

This workflow queries SRA for the latest samples, aligns the reads, and determines variants phased to unique reads.
Unlike a typical VCF, the output reports the unique reads in the sample. If a sample has multiple variants on the same read, they are treated as a collection of "phased variants" rather than as individual variants.
The workflow has been automated and has successfully processed ~140,000 wastewater samples from SRA.
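To illustrate how this differs from a per-site VCF, the following minimal sketch counts each combination of variants that occurs together on a read. The input format (one list of variant labels per read), the example labels, and the function name `count_phased_variants` are assumptions made for illustration; they are not taken from the pipeline's scripts.

```python
from collections import Counter


def count_phased_variants(reads):
    """Count how often each combination of variants occurs on a single read.

    `reads` is an iterable with one collection of variant labels per read
    (hypothetical input format).
    """
    counts = Counter()
    for variants in reads:
        # Sort so the same set of variants always produces the same key.
        key = tuple(sorted(variants))
        counts[key] += 1
    return counts


reads = [
    ["A23063T", "C23604A"],
    ["C23604A", "A23063T"],  # same variants, listed in a different order
    ["A23063T"],
]
phased = count_phased_variants(reads)
for variants, n in phased.items():
    print(" ".join(variants), n)
```

Here the two-variant combination is reported once with a count of 2, instead of being split into two independent per-site records.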

Workflow overview.

sra_query_daily.py gets the latest SRA entries and compares them to already completed files
- This ensures that if the script is still running when you kick off another query, files are not downloaded twice.
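The comparison against completed files can be sketched as a simple set difference. This is an illustrative assumption about how such a check might look, not the logic of sra_query_daily.py itself; the function name `select_new_accessions` and the accession IDs are hypothetical.

```python
def select_new_accessions(latest, completed, in_progress=()):
    """Return accessions that are neither completed nor in progress.

    The order of `latest` is preserved and repeated IDs are dropped.
    """
    seen = set(completed) | set(in_progress)
    new = []
    for accession in latest:
        if accession not in seen:
            new.append(accession)
            seen.add(accession)  # also removes duplicates within `latest`
    return new


todo = select_new_accessions(
    latest=["SRR1", "SRR2", "SRR2", "SRR3"],
    completed=["SRR1"],
    in_progress=["SRR3"],
)
print(todo)
```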
stage_sra_loop_daily.py
- This submits the list of files that need to be queried to CHTC in batches, based on the projected file sizes.
- It sends up to a set number of files per node (30 by default), provided the total input file size per SRA query is under 5 GB.
- If an entry is larger than 5 GB, only one file is sent per node.
- This is because large files can time out or take very long to process, exceeding the time limits of the nodes.
- Limiting file sizes keeps this source of error from affecting multiple SRA entries.
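The batching rule described above might be sketched as follows. This is a minimal sketch under stated assumptions: the function name `make_batches`, the `(accession, size_in_bytes)` input format, and the treatment of an entry of exactly 5 GB as "small" are all hypothetical, and only the per-entry size is checked here.

```python
GB = 1024 ** 3


def make_batches(entries, max_files=30, size_limit=5 * GB):
    """Group (accession, size_in_bytes) pairs into per-node batches.

    Entries larger than `size_limit` get a node to themselves; the rest
    are grouped into chunks of at most `max_files` accessions.
    """
    small = [acc for acc, size in entries if size <= size_limit]
    large = [acc for acc, size in entries if size > size_limit]

    # Chunk the small entries, then give each large entry its own batch.
    batches = [small[i:i + max_files] for i in range(0, len(small), max_files)]
    batches.extend([acc] for acc in large)
    return batches


entries = [("SRR1", 1 * GB), ("SRR2", 6 * GB), ("SRR3", 2 * GB), ("SRR4", 3 * GB)]
batches = make_batches(entries, max_files=2)
print(batches)
```

Isolating the large entry means a timeout on that node costs only one SRA entry rather than a whole batch.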
sra_cryptic_loop.sh
- This runs on CHTC as a single-node workflow that downloads, aligns, and aggregates the unique sequences/phased variants.
- This runs a hardcoded pipeline of the following scripts: derep.py, multivariant.poy, SAM_Refiner.py, SRA_fetch.py and Variant_extractor.py
- This loops over all SRA entries that were sent to the node.
extract_results_aggregate.py
- Files are packaged as a tar or compressed tar (tar.gz)
- These files need to be sorted by type and
Aggregate Results
- This then takes the extracted results and condenses them into fewer files that are easier to back up and save.
Aggregate Unique Sequences
- This aggregates the unique sequences that are above the 1.4% allele fraction threshold into a single file.
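As a rough illustration of the threshold step, the sketch below keeps only records above a 1.4% allele fraction. The record keys (`sample`, `sequence`, `count`, `depth`), the definition of allele fraction as `count / depth`, and the function name are assumptions for this example and may not match how the aggregation scripts compute it.

```python
def filter_by_allele_fraction(records, threshold=0.014):
    """Keep records whose allele fraction exceeds `threshold`.

    Each record is a dict with the hypothetical keys "sample", "sequence",
    "count" and "depth"; allele fraction is assumed to be count / depth.
    """
    kept = []
    for rec in records:
        if rec["depth"] <= 0:
            continue  # skip positions with no coverage to avoid dividing by zero
        fraction = rec["count"] / rec["depth"]
        if fraction > threshold:
            kept.append({**rec, "allele_fraction": fraction})
    return kept


records = [
    {"sample": "SRR1", "sequence": "A23063T", "count": 20, "depth": 1000},
    {"sample": "SRR1", "sequence": "C23604A", "count": 10, "depth": 1000},
    {"sample": "SRR2", "sequence": "A23063T", "count": 5, "depth": 0},
]
kept = filter_by_allele_fraction(records)
for rec in kept:
    print(rec["sample"], rec["sequence"], rec["allele_fraction"])
```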

Running the workflow

If you are not using an orchestrator to run your pipeline, you only need to run the modules/sra_cryptic_loop.sh script:

sra_name=SRA12345678
/path/to/repository/modules/sra_cryptic_loop.sh -s ${sra_name}
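To process several accessions in sequence without an orchestrator, the call above could be wrapped in a small driver. This is a hedged sketch: the helper names `build_command` and `run_all` and the second accession are placeholders, and only the `-s` flag shown above is used. By default it performs a dry run that returns the commands without executing them.

```python
import subprocess


def build_command(script_path, sra_name):
    """Build the argument list for one run of sra_cryptic_loop.sh."""
    return [script_path, "-s", sra_name]


def run_all(script_path, sra_names, dry_run=True):
    """Run the script once per accession, or only return the commands."""
    commands = [build_command(script_path, name) for name in sra_names]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the script exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


commands = run_all(
    "/path/to/repository/modules/sra_cryptic_loop.sh",
    ["SRA12345678", "SRA12345679"],  # placeholder accessions
)
for cmd in commands:
    print(" ".join(cmd))
```

Pass `dry_run=False` to actually execute the script for each accession.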

You can also use an orchestrator such as CHTC Serpent
- https://github.com/DABAKER165/chtc_serpent_v2
To run the orchestrator you will need access to an HTCondor HPC platform.

References

BBMap
- BBMap – Bushnell B. – sourceforge.net/projects/bbmap/
- Deployed version 39.01
Minimap2
- Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191
- Deployed version minimap2-2.17
Samtools
- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2):giab008. doi:10.1093/gigascience/giab008
- Deployed version 1.9

Contributors

David A. Baker
- Automation
  - sra_query_daily.py
  - sra_cryptic_loop.sh
  - stage_sra_loop_daily.py poller
  - chtc_serpent for HTC computing.
- Aggregation
  - aggregate_results
  - extract_results_aggregate.py
Devon Gregory
- Download from SRA and create unique sequence files.
  - derep.py, multivariant.poy, SAM_Refiner.py, SRA_fetch.py and Variant_extractor.py were written by Devon Gregory.

dholab/SRA_wastewater_lineages
