Note
This repository contains a legacy version of a workflow for finding SARS-CoV-2 cryptic lineages in the Sequence Read Archive (SRA), originally developed by David A. Baker and Devon Gregory. It is no longer actively maintained, but is pinned here for reproducibility along with Suarez et al. 2025 (manuscript in prep).
Please note that while this version is no longer maintained, a newer alternative is under active development at dholab. Users looking to search SRA for cryptic lineages themselves should reach out to the O'Connor group for updates on when the project will be open-sourced.
- This workflow queries SRA for the latest samples, aligns them, and determines variants phased to unique reads.
- Unlike a typical VCF, the output reports the unique reads in each sample. If a sample has multiple variants on the same read, they are treated as a collection of "phased variants" rather than as individual variants.
- The workflow has been automated and has successfully processed ~140,000 wastewater SRA samples.
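
To illustrate the "phased variants" idea, here is a minimal Python sketch. It is not the code used by SAM_Refiner.py or the other pipeline scripts; the function name, the input layout (one tuple of variant labels per read), and the variant labels themselves are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical input: one tuple of variant labels per aligned read.
# An empty tuple stands for a read that matches the reference.
reads = [
    ("A23403G", "C23604A"),
    ("C23604A", "A23403G"),  # same combination, different order
    ("A23403G",),
    (),
]


def count_phased_variants(reads):
    """Count reads per unique combination of co-occurring variants.

    Each distinct combination is kept together as one "phased variant"
    entry instead of being split into individual variant calls.
    """
    return Counter(tuple(sorted(variants)) for variants in reads if variants)


phased = count_phased_variants(reads)
for combination, read_count in phased.most_common():
    print(" + ".join(combination), read_count)
```

In this sketch, a read carrying both variants is counted under the combined entry only, so it does not add to the count of either single variant.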
- sra_query_daily.py gets the latest SRA entries and compares them to already completed files
- This ensures that if the script is still running and you kick off another query, files are not downloaded twice.
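
The comparison against completed files can be sketched as a set difference. This is an illustrative sketch only, not the logic in sra_query_daily.py: the function name is hypothetical, and the `in_progress` argument is an assumption about how entries that are still being processed might be tracked.

```python
def select_new_accessions(latest, completed, in_progress=()):
    """Return accessions that still need to be queued, in their original order.

    latest      : accessions returned by the newest SRA query
    completed   : accessions whose results already exist
    in_progress : accessions already queued or running (assumed bookkeeping)
    """
    seen = set(completed) | set(in_progress)
    new = []
    for accession in latest:
        if accession not in seen:
            new.append(accession)
            seen.add(accession)  # also drops repeats within `latest`
    return new


# Placeholder accession names, for illustration only.
todo = select_new_accessions(
    latest=["SRR0001", "SRR0002", "SRR0003", "SRR0002"],
    completed=["SRR0001"],
    in_progress=["SRR0003"],
)
print(todo)
```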
- stage_sra_loop_daily.py
- This queues the list of files that still need to be queried and submits them to CHTC in batches, based on the projected file sizes
- It sends up to a set number of files per node (30 by default), provided the total input file size per SRA query is under 5GB.
- If an entry is >5GB, it sends only one file per node.
- This is because large files can time out or take very long to process, exceeding the time limits of the nodes.
- Limiting file sizes keeps this source of error from affecting multiple SRA entries.
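
The batching rule above can be sketched roughly as follows. This is one possible reading of the rule, not the code in stage_sra_loop_daily.py: the function name is hypothetical, the 5GB limit is assumed to apply to the summed size of a batch, and GB is assumed to mean 1024^3 bytes.

```python
GB = 1024 ** 3  # assumption: binary gigabytes


def batch_entries(entries, max_files=30, size_limit=5 * GB):
    """Group (accession, size_in_bytes) pairs into per-node batches.

    Entries larger than `size_limit` each get a node to themselves.
    Smaller entries are grouped until a batch reaches `max_files` entries
    or adding another entry would push its total size over `size_limit`.
    """
    batches = []
    current, current_size = [], 0
    for accession, size in entries:
        if size > size_limit:
            batches.append([accession])  # oversized entry runs alone
            continue
        batch_full = len(current) >= max_files
        too_large = current_size + size > size_limit
        if current and (batch_full or too_large):
            batches.append(current)
            current, current_size = [], 0
        current.append(accession)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Placeholder accessions and sizes, for illustration only.
example = [("A", 1 * GB), ("B", 6 * GB), ("C", 3 * GB), ("D", 2 * GB)]
print(batch_entries(example))
```

Isolating the oversized entry means that a timeout on that node fails only that one entry rather than a whole batch.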
- sra_cryptic_loop.sh
- This runs in CHTC as a single-node workflow that downloads, aligns, and aggregates the unique sequences/phased variants
- This runs a hardcoded pipeline of the following scripts: derep.py, multivariant.py, SAM_Refiner.py, SRA_fetch.py and Variant_extractor.py
- This loops over all SRA entries that were sent to the node.
- extract_results_aggregate.py
- Files are packaged as a tar or compressed tar (tar.gz)
- These files need to be sorted by type and
- Aggregate Results
- This then takes the extracted results and condenses them into fewer files that are easier to back up and save.
- Aggregate Unique Sequences
- This aggregates the unique sequences that are above the 1.4% allele fraction threshold into a single file
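
The thresholding step might look like the sketch below. It is not the repository's aggregation code: the function names, the input layout, the output column names, and the choice of a strict `>` comparison are all assumptions for illustration; only the 1.4% threshold comes from the description above.

```python
import csv
import io


def aggregate_unique_sequences(per_sample, threshold=0.014):
    """Collect unique sequences whose allele fraction exceeds `threshold`.

    per_sample maps a sample name to a list of
    (phased_variants, allele_fraction) pairs.
    """
    rows = []
    for sample, sequences in per_sample.items():
        for phased_variants, fraction in sequences:
            if fraction > threshold:  # assumption: strictly above 1.4%
                rows.append((sample, phased_variants, fraction))
    return rows


def write_tsv(rows, handle):
    """Write the aggregated rows to one tab-separated file handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "phased_variants", "allele_fraction"])
    writer.writerows(rows)


# Placeholder sample names and values, for illustration only.
per_sample = {
    "sample_A": [("A23403G C23604A", 0.25), ("T100C", 0.005)],
    "sample_B": [("G200A", 0.02)],
}
rows = aggregate_unique_sequences(per_sample)
buffer = io.StringIO()
write_tsv(rows, buffer)
print(buffer.getvalue())
```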
- If you are not using an orchestrator to run your pipeline, you only need to run the modules/sra_cryptic_loop.sh script
sra_name=SRA12345678
/path/to/repository/modules/sra_cryptic_loop.sh -s ${sra_name}
- You can also use an orchestrator such as CHTC Serpent
- To run the orchestrator you will need access to an HTCondor HPC platform.
- BBMap
- BBMap – Bushnell B. – sourceforge.net/projects/bbmap/
- Deployed Version 39.01
- Minimap2
- Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191
- Deployed version minimap2-2.17
- Samtools
- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10(2):giab008. https://doi.org/10.1093/gigascience/giab008
- Deployed version 1.9
-
David A. Baker
- Automation
- sra_query_daily.py
- sra_cryptic_loop.sh
- stage_sra_loop_daily.py poller
- chtc_serpent for HTC computing.
- Aggregation
- aggregate_results
- extract_results_aggregate.py
-
Devon Gregory
- Download from SRA and Create unique sequencing files.
- derep.py, multivariant.py, SAM_Refiner.py, SRA_fetch.py and Variant_extractor.py were written by Devon Gregory.