-
Notifications
You must be signed in to change notification settings - Fork 1
4 Usage
Vclust provides four commands: prefilter
, align
, cluster
, and deduplicate
. Calls to these commands follow the structure:
vclust command -i <input> -o <output> [options]
Vclust accepts a single FASTA file containing viral genomic sequences (example) or a directory of FASTA files (one genome per file) (example). The input file(s) can be gzipped.
The prefilter
command creates a pre-alignment filter that reduces the number of genome pairs to be aligned by filtering out dissimilar sequences before the alignment step. This process siginificantly lowers the computational demands by keeping only genome pairs with sufficient k-mer-based sequence similarity.
Two main options control the prefiltering process:
-
--min-kmers
: Minimum number of common k-mers required between two sequences (default: 20). -
--min-ident
: Minimum sequence identity, calculated over the shorter sequence [0-1] (default: 0.7).
Vclust computes these similarities using the entire set of k-mers with Kmer-db 2, overcoming the sampling limitations of sketching methods like FastANI and Mash. It also uses sparse distance matrices to efficiently process large datasets.
By default, the prefilter
command selects genome pairs with at least 20 common 25-mers and 70% identity over the shorter sequence:
# Create pre-alignment filter with 20 common 25-mers
# and 70% identity over the shorter sequence.
vclust prefilter -i genomes.fna -o fltr.txt
The sequence identity computed during prefiltering is higher than the ANI value from the alignment step. Thus, you can safely set the --min-ident
value close to the final ANI threshold without losing relevant genome pairs. For example, if you're targeting an ANI threshold of 95% or higher, set --min-ident
to around 0.95:
# Filter genome pairs with at least 20 common 25-mers and 95% identity
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 20 --min-ident 0.95
To optimize memory usage and reduce run time, large genomic datasets can be processed in smaller batches (--batch_size
) and a fraction of k-mers (--kmers_fraction
), as detailed in 5. Optimizing sensitivity and resource usage.
# Process genomes in batches of 2 million sequences each, and analyze
# 20% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 2000000 --kmers-fraction 0.2
The align
command performs pairwise sequence alignments of viral genomes and provides similarity measures like ANI and coverage (alignment fraction). It uses the LZ-ANI aligner, which, like BLAST-based methods, finds multiple local alignments (similar to HSPs in BLAST) between two genomic sequences to estimate ANI and other sequence similarity measures for that genome pair.
# Align genome pairs filtered by the prefiltering command.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt
The align
command allows for further filtering based on filter's minimum threshold.
# Align only the genome pairs with sequence identity greater than or equal to 80%.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --filter-threshold 0.8
Without --filter
Vclust aligns all possible genome pairs.
vclust align -i genomes.fna -o ani.tsv
The align
command creates two TSV files: one with ANI values for genome pairs and another with genome identifiers sorted by decreasing sequence length. Both TSV files can be used as input for Vclust's clustering.
The following command will create two TSV files: ani.tsv and ani.ids.tsv:
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt
Sample output:
qidx ridx query reference tani gani ani qcov rcov num_alns len_ratio
9 8 NC_010807.alt3 NC_010807.alt2 0.972839 0.960192 0.986657 0.973177 0.997608 60 0.9836
8 9 NC_010807.alt2 NC_010807.alt3 0.972839 0.985279 0.987642 0.997608 0.973177 67 0.9836
10 8 NC_010807.alt1 NC_010807.alt2 0.987250 0.987041 0.987117 0.999923 0.999901 34 0.9571
8 10 NC_010807.alt2 NC_010807.alt1 0.987250 0.987449 0.987547 0.999901 0.999923 36 0.9571
11 8 NC_010807.ref NC_010807.alt2 0.989807 0.989540 0.989617 0.999923 1.000000 14 0.9571
8 11 NC_010807.alt2 NC_010807.ref 0.989807 0.990063 0.990063 1.000000 0.999923 14 0.9571
10 9 NC_010807.alt1 NC_010807.alt3 0.979963 0.993250 0.994557 0.998686 0.972575 71 0.9730
9 10 NC_010807.alt3 NC_010807.alt1 0.979963 0.967035 0.994304 0.972575 0.998686 70 0.9730
11 9 NC_010807.ref NC_010807.alt3 0.983839 0.997166 0.997217 0.999948 0.974230 52 0.9730
9 11 NC_010807.alt3 NC_010807.ref 0.983839 0.970871 0.996552 0.974230 0.999948 52 0.9730
11 10 NC_010807.ref NC_010807.alt1 0.997462 0.997475 0.997475 1.000000 1.000000 23 1.0000
10 11 NC_010807.alt1 NC_010807.ref 0.997462 0.997449 0.997449 1.000000 1.000000 23 1.0000
To manage the output TSV file size, Vclust offers three formats: standard
, lite
, complete
.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite
Column | Standard | Lite | Complete | Description |
---|---|---|---|---|
qidx | + | + | + | Index of query sequence |
ridx | + | + | + | Index of reference sequence |
query | + | - | + | Identifier (name) of query sequence |
reference | + | - | + | Identifier (name) of reference sequence |
tani | + | + | + | total ANI [0-1] |
gani | + | + | + | global ANI [0-1] |
ani | + | + | + | ANI [0-1] |
qcov | + | + | + | Query coverage (aligned fraction) [0-1] |
rcov | + | + | + | Reference coverage (aligned fraction) [0-1] |
num_alns | + | + | + | Number of local alignments |
len_ratio | + | + | + | Length ratio between shorter and longer sequence [0-1] |
qlen | - | - | + | Query sequence length |
rlen | - | - | + | Reference sequence length |
nt_match | - | - | + | Number of matching nucleotides across alignments |
nt_mismatch | - | - | + | Number of mismatching nucleotides across alignments |
The align
command enables filtering output by setting minimum thresholds for similarity measures. This ensures only genome pairs meeting the thresholds are reported, significantly reducing the output TSV file size.
# Output only genome pairs with ANI ≥ 0.95 and query coverage ≥ 0.85.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.95 --out-qcov 0.85
Vclust can output alignment details in a separate TSV file. This output format is similar to the BLASTn tabular output and includes information on each local alignment between two genomes, such as the coordinates in both the query and reference sequences, strand orientation, the number of matched and mismatched nucleotides, and the percentage of sequence identity.
vclust align -i genomes.fna -o ani.tsv --aln ani.aln.tsv
Sample output:
query reference pident alnlen qstart qend rstart rend nt_match nt_mismatch
NC_025457.alt2 NC_025457.ref 89.2893 999 22119 23117 14207 15163 892 107
NC_025457.alt2 NC_025457.ref 89.8305 826 3373 4198 2202 3020 742 84
NC_025457.alt2 NC_025457.ref 91.0804 796 41697 42492 27680 28475 725 71
NC_025457.alt2 NC_025457.ref 87.2483 745 38039 38783 24969 25688 650 95
NC_025457.alt2 NC_025457.ref 89.8860 702 7269 7970 5077 5778 631 71
NC_025457.alt2 NC_025457.ref 93.2081 692 62572 63263 41329 42020 645 47
NC_025457.alt2 NC_025457.ref 90.9565 575 31121 31695 20438 21003 523 52
NC_025457.alt2 NC_025457.ref 90.6195 565 11476 12040 7999 8563 512 53
NC_025457.alt2 NC_025457.ref 91.6211 549 10905 11453 7455 8003 503 46
NC_025457.alt2 NC_025457.ref 86.7041 534 29624 30157 19067 19586 463 71
NC_025457.alt2 NC_025457.ref 93.5673 513 10149 10661 6915 7427 480 33
NC_025457.alt2 NC_025457.ref 89.3701 508 34017 34524 22188 22695 454 54
NC_025457.alt2 NC_025457.ref 88.0240 501 18330 18830 11549 12049 441 60
Column | Description |
---|---|
query | Identifier (name) of query sequence |
reference | Identifier (name) of reference sequence |
pident | Percent identity of local alignment |
alnlen | Alignment length |
qstart | Start of alignment in query |
qend | End of alignment in query |
rstart | Start of alignment in reference |
rend | End of alignment in reference |
nt_match | Number of matched (identical) nucleotides |
nt_mismatch | Number of mismatching nucleotides |
The cluster
command takes as input two TSV files (outputs of vclust align
) and clusters viral genomes using a user-selected clustering algorithm and similarity measures. Internally, Vclust uses Clusty, enabling large-scale clustering of millions of genome sequences by leveraging sparse distance matrices.
Vclust supports six clustering algorithms (--algorithm
):
-
single
: Single-linkage clustering -
complete
: Complete-linkage clustering -
uclust
: UCLUST-like clustering -
cd-hit
: Greedy incremental clustering -
set-cover
: Set-Cover (greedy) -
leiden
: the Leiden algorithm
Vclust performs threshold-based clustering by assigning a genome sequence to a cluster if its similarity (e.g., ANI) to the cluster meets or exceeds a user-defined threshold. The specific similarity criterion depends on the clustering algorithm and may refer to the nearest genome, the furthest genome, or the centroid within the cluster.
Users can choose from three similarity measures for clustering using the --metric
option: tani
, gani
, or ani
. The threshold provided for the selected measure determines the minimum similarity required to connect two genomes. For asymmetric measures like ANI and gANI (where the similarity between genome A and B differs from that between B and A), Vclust uses the maximum value as the weight for clustering.
# Cluster genomes based on tANI and connect genomes if their ANI ≥ 95%.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm single \
--metric ani --ani 0.95
Additionally, Vclust may use extra threshold values for various combinations of similarity measures (--ani
, --gani
, --tani
, --qcov
, --rcov
, --num_alns
, --len_ratio
). If a genome pair fails to meet the specified thresholds, it is excluded from clustering.
# Cluster genomes based on ANI similarity measure, but connect them only
# if ANI ≥ 95% and query coverage ≥ 85% and reference coverage ≥ 85%.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm single \
--metric ani --ani 0.95 --qcov 0.85 --rcov 0.85
Vclust leverages the Leiden algorithm via the igraph implementation integrated into Clusty. As with other Vclust algorithms, you can select from three similarity measures: tANI, gANI, or ANI. These measures define the edge weights used by the Leiden algorithm.
# Cluster genomes based on ANI similarity measure, but connect them only
# if ANI ≥ 95% and query coverage ≥ 85%.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85
If you wish to include coverage information in the edge weight, following the IMG/PR's approach (where edge weight = ANI x max(qcov, rcov)), use the gani
measure for clustering.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric gani --gani 0.5 --ani 0.95 --qcov 0.85
The granularity of clustering can be adjusted with the --leiden-resolution
parameter [range: 0-1, default: 0.7
]. Higher resolutions produce more, smaller clusters, while lower resolutions result in fewer, larger clusters.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85 --leiden-resolution 1
The clustering can be further refined using two additional parameters. The --leiden-beta
parameter (default: 0.01
) controls the randomness in the Leiden algorithm, allowing for fine-tuning of the clustering process. The --leiden-iterations
parameter (default: 2
) specifies the number of iterations for the Leiden algorithm, where increasing the number may lead to improved clustering quality.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85 --leiden-resolution 0.8 --leiden-beta 0.02 \
--leiden-iterations 3
The cluster command generates a TSV file with genome identifiers followed by 0-based cluster identifiers. The file includes all input genomes, both clustered and singletons, listed in the same order as the IDs file (sorted by decreasing sequence length). The first genome in each cluster is the representative. For example, cluster 0
has NC_005091.alt2
as its representative and includes NC_005091.alt1
and NC_005091.ref
.
object cluster
NC_025457.alt2 3
NC_005091.alt2 0
NC_005091.alt1 0
NC_005091.ref 0
NC_002486.alt 1
NC_002486.ref 1
NC_025457.ref 4
NC_025457.alt1 5
NC_010807.alt2 2
NC_010807.alt3 2
NC_010807.alt1 2
NC_010807.ref 2
Alternatively, instead of numerical cluster identifiers, Vclust can output a representative genome for each cluster using the --out-repr
option. The representative genome within each cluster is determined as the one with the longest sequence.
object cluster
NC_025457.alt2 NC_025457.alt2
NC_005091.alt2 NC_005091.alt2
NC_005091.alt1 NC_005091.alt2
NC_005091.ref NC_005091.alt2
NC_002486.alt NC_002486.alt
NC_002486.ref NC_002486.alt
NC_025457.ref NC_025457.ref
NC_025457.alt1 NC_025457.alt1
NC_010807.alt2 NC_010807.alt2
NC_010807.alt3 NC_010807.alt2
NC_010807.alt1 NC_010807.alt2
NC_010807.ref NC_010807.alt2
The deduplicate
command removes duplicate (identical) sequences within a single FASTA file or across multiple FASTA files, creating a single non-redundant FASTA file. Identical sequences are identified using SHA-256 hashes, considering reverse complements. If two identical sequences are found in different orientations, they are treated as duplicates. The first sequence encountered (based on the input file order) is retained as the representative sequence.
# Remove duplicate sequences within and across refseq.fna and genbank.fna,
# prioritizing RefSeq sequences over GenBank if duplicates exist.
vclust deduplicate -i refseq.fna genbank.fna -o nr.fna
The command supports gzipped input and output files. Gzipped input files are automatically recognized, and the --gzip-output
option enables gzipped output:
vclust deduplicate -i refseq.fna.gz genbank.fna.gz --gzip-output -o nr.fna.gz
The deduplicate
command generates two output files:
- A FASTA file containing non-redundant sequences (e.g.,
nr.fna
). - A text file listing duplicate sequences (e.g.,
nr.fna.duplicates.txt
).
In the duplicate file, each line lists sequence identifiers of identical sequences. The first identifier represents the retained sequence, while other identifiers start with either + or -, indicating the orientation of duplicate sequences relative to the retained sequence:
NC_025457.1 +KJ473423.1
NC_010807.1 +EU547803.1
NC_005091.2 +AY357582.2
NC_002486.1 +AB044554.1
Here, NC_025457.1
is the retained sequence, while its duplicate, KJ473423.1
, is on the same strand and excluded from the FASTA output.
To prevent identifier collisions between input files, you can add file-based prefixes using the --add-prefixes
option:
vclust deduplicate -i refseq.fna genbank.fna other.fna -o nr.fna --add-prefixes
Example output:
other|Mushuvirus +other|Mushuvirus_copy
refseq|NC_025457.1 +genbank|KJ473423.1
refseq|NC_010807.1 +genbank|EU547803.1 +other|NC_010807.1_duplicate
refseq|NC_005091.2 -genbank|AY357582.2 +other|AY357582.2_duplicate
genbank|MN428048.1 -other|MN428048.1_revcomp
refseq|NC_002486.1 +genbank|AB044554.1
Alternatively, you can provide custom prefixes:
vclust deduplicate -i refseq.fna genbank.fna other.fna -o nr.fna --add-prefixes r_ g_ o_
Example output:
o_Mushuvirus +o_Mushuvirus_copy
r_NC_025457.1 +g_KJ473423.1
r_NC_010807.1 +g_EU547803.1 +o_NC_010807.1_duplicate
r_NC_005091.2 +g_AY357582.2 +o_AY357582.2_duplicate
g_MN428048.1 -o_MN428048.1_revcomp
r_NC_002486.1 +g_AB044554.1
- Features
- Installation
- Quick Start
- Usage
- Optimizing sensitivity and resource usage
-
Use cases
- Classify viruses into species and genera following ICTV standards
- Assign viral contigs into vOTUs following MIUViG standards
- Dereplicate viral contigs into representative genomes
- Calculate pairwise similarities between all-versus-all genomes
- Deduplicate (remove identical sequences) across multiple datasets
- Process large dataset of diverse virus genomes (IMG/VR)
- Process large dataset of highly redundant virus genomes
- Cluster plasmid genomes into pOTUs
- FAQ: Frequently Asked Questions