4 Usage

Vclust provides four commands: prefilter, align, cluster, and deduplicate. Calls to these commands follow the structure:

vclust command -i <input> -o <output> [options]

4.1. Input data

Vclust accepts a single FASTA file containing viral genomic sequences (example) or a directory of FASTA files (one genome per file) (example). The input file(s) can be gzipped.

4.2. Prefilter

The prefilter command creates a pre-alignment filter that reduces the number of genome pairs to be aligned by filtering out dissimilar sequences before the alignment step. This process siginificantly lowers the computational demands by keeping only genome pairs with sufficient k-mer-based sequence similarity.

Two main options control the prefiltering process:

--min-kmers: Minimum number of common k-mers required between two sequences (default: 20).
--min-ident: Minimum sequence identity, calculated over the shorter sequence [0-1] (default: 0.7).

Vclust computes these similarities using the entire set of k-mers with Kmer-db 2, overcoming the sampling limitations of sketching methods like FastANI and Mash. It also uses sparse distance matrices to efficiently process large datasets.

By default, the prefilter command selects genome pairs with at least 20 common 25-mers and 70% identity over the shorter sequence:

# Create pre-alignment filter with 20 common 25-mers
# and 70% identity over the shorter sequence.
vclust prefilter -i genomes.fna -o fltr.txt

The sequence identity computed during prefiltering is higher than the ANI value from the alignment step. Thus, you can safely set the --min-ident value close to the final ANI threshold without losing relevant genome pairs. For example, if you're targeting an ANI threshold of 95% or higher, set --min-ident to around 0.95:

# Filter genome pairs with at least 20 common 25-mers and 95% identity
vclust prefilter -i genomes.fna -o fltr.txt --min-kmers 20 --min-ident 0.95

To optimize memory usage and reduce run time, large genomic datasets can be processed in smaller batches (--batch_size) and a fraction of k-mers (--kmers_fraction), as detailed in 5. Optimizing sensitivity and resource usage.

# Process genomes in batches of 2 million sequences each, and analyze 
# 20% of k-mers in each genome sequence.
./vclust.py prefilter -i genomes.fna -o fltr.txt --batch-size 2000000 --kmers-fraction 0.2

4.3. Align

The align command performs pairwise sequence alignments of viral genomes and provides similarity measures like ANI and coverage (alignment fraction). It uses the LZ-ANI aligner, which, like BLAST-based methods, finds multiple local alignments (similar to HSPs in BLAST) between two genomic sequences to estimate ANI and other sequence similarity measures for that genome pair.

# Align genome pairs filtered by the prefiltering command.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt

The align command allows for further filtering based on filter's minimum threshold.

# Align only the genome pairs with sequence identity greater than or equal to 80%.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --filter-threshold 0.8

Without --filter Vclust aligns all possible genome pairs.

vclust align -i genomes.fna -o ani.tsv

4.3.1. ANI output

The align command creates two TSV files: one with ANI values for genome pairs and another with genome identifiers sorted by decreasing sequence length. Both TSV files can be used as input for Vclust's clustering.

The following command will create two TSV files: ani.tsv and ani.ids.tsv:

vclust align -i genomes.fna -o ani.tsv --filter fltr.txt

Sample output:

qidx  ridx  query reference   tani  gani  ani   qcov  rcov  num_alns len_ratio
9  8  NC_010807.alt3 NC_010807.alt2 0.972839 0.960192 0.986657 0.973177 0.997608 60 0.9836
8  9  NC_010807.alt2 NC_010807.alt3 0.972839 0.985279 0.987642 0.997608 0.973177 67 0.9836
10 8  NC_010807.alt1 NC_010807.alt2 0.987250 0.987041 0.987117 0.999923 0.999901 34 0.9571
8  10 NC_010807.alt2 NC_010807.alt1 0.987250 0.987449 0.987547 0.999901 0.999923 36 0.9571
11 8  NC_010807.ref  NC_010807.alt2 0.989807 0.989540 0.989617 0.999923 1.000000 14 0.9571
8  11 NC_010807.alt2 NC_010807.ref  0.989807 0.990063 0.990063 1.000000 0.999923 14 0.9571
10 9  NC_010807.alt1 NC_010807.alt3 0.979963 0.993250 0.994557 0.998686 0.972575 71 0.9730
9  10 NC_010807.alt3 NC_010807.alt1 0.979963 0.967035 0.994304 0.972575 0.998686 70 0.9730
11 9  NC_010807.ref  NC_010807.alt3 0.983839 0.997166 0.997217 0.999948 0.974230 52 0.9730
9  11 NC_010807.alt3 NC_010807.ref  0.983839 0.970871 0.996552 0.974230 0.999948 52 0.9730
11 10 NC_010807.ref  NC_010807.alt1 0.997462 0.997475 0.997475 1.000000 1.000000 23 1.0000
10 11 NC_010807.alt1 NC_010807.ref  0.997462 0.997449 0.997449 1.000000 1.000000 23 1.0000

To manage the output TSV file size, Vclust offers three formats: standard, lite, complete.

vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --outfmt lite

Column	Standard	Lite	Complete	Description
qidx	+	+	+	Index of query sequence
ridx	+	+	+	Index of reference sequence
query	+	-	+	Identifier (name) of query sequence
reference	+	-	+	Identifier (name) of reference sequence
tani	+	+	+	total ANI [0-1]
gani	+	+	+	global ANI [0-1]
ani	+	+	+	ANI [0-1]
qcov	+	+	+	Query coverage (aligned fraction) [0-1]
rcov	+	+	+	Reference coverage (aligned fraction) [0-1]
num_alns	+	+	+	Number of local alignments
len_ratio	+	+	+	Length ratio between shorter and longer sequence [0-1]
qlen	-	-	+	Query sequence length
rlen	-	-	+	Reference sequence length
nt_match	-	-	+	Number of matching nucleotides across alignments
nt_mismatch	-	-	+	Number of mismatching nucleotides across alignments

4.3.2. ANI output filtering

The align command enables filtering output by setting minimum thresholds for similarity measures. This ensures only genome pairs meeting the thresholds are reported, significantly reducing the output TSV file size.

# Output only genome pairs with ANI ≥ 0.95 and query coverage ≥ 0.85.
vclust align -i genomes.fna -o ani.tsv --filter fltr.txt --out-ani 0.95 --out-qcov 0.85

4.3.3. Alignment output

Vclust can output alignment details in a separate TSV file. This output format is similar to the BLASTn tabular output and includes information on each local alignment between two genomes, such as the coordinates in both the query and reference sequences, strand orientation, the number of matched and mismatched nucleotides, and the percentage of sequence identity.

vclust align -i genomes.fna -o ani.tsv --aln ani.aln.tsv

Sample output:

query reference   pident   alnlen   qstart   qend  rstart   rend  nt_match nt_mismatch
NC_025457.alt2 NC_025457.ref  89.2893  999   22119 23117 14207 15163 892   107
NC_025457.alt2 NC_025457.ref  89.8305  826   3373  4198  2202  3020  742   84
NC_025457.alt2 NC_025457.ref  91.0804  796   41697 42492 27680 28475 725   71
NC_025457.alt2 NC_025457.ref  87.2483  745   38039 38783 24969 25688 650   95
NC_025457.alt2 NC_025457.ref  89.8860  702   7269  7970  5077  5778  631   71
NC_025457.alt2 NC_025457.ref  93.2081  692   62572 63263 41329 42020 645   47
NC_025457.alt2 NC_025457.ref  90.9565  575   31121 31695 20438 21003 523   52
NC_025457.alt2 NC_025457.ref  90.6195  565   11476 12040 7999  8563  512   53
NC_025457.alt2 NC_025457.ref  91.6211  549   10905 11453 7455  8003  503   46
NC_025457.alt2 NC_025457.ref  86.7041  534   29624 30157 19067 19586 463   71
NC_025457.alt2 NC_025457.ref  93.5673  513   10149 10661 6915  7427  480   33
NC_025457.alt2 NC_025457.ref  89.3701  508   34017 34524 22188 22695 454   54
NC_025457.alt2 NC_025457.ref  88.0240  501   18330 18830 11549 12049 441   60

Column	Description
query	Identifier (name) of query sequence
reference	Identifier (name) of reference sequence
pident	Percent identity of local alignment
alnlen	Alignment length
qstart	Start of alignment in query
qend	End of alignment in query
rstart	Start of alignment in reference
rend	End of alignment in reference
nt_match	Number of matched (identical) nucleotides
nt_mismatch	Number of mismatching nucleotides

4.4. Cluster

The cluster command takes as input two TSV files (outputs of vclust align) and clusters viral genomes using a user-selected clustering algorithm and similarity measures. Internally, Vclust uses Clusty, enabling large-scale clustering of millions of genome sequences by leveraging sparse distance matrices.

Vclust supports six clustering algorithms (--algorithm):

single: Single-linkage clustering
complete: Complete-linkage clustering
uclust: UCLUST-like clustering
cd-hit: Greedy incremental clustering
set-cover: Set-Cover (greedy)
leiden: the Leiden algorithm

4.4.1. Clustering with similarity measures and thresholds

Vclust performs threshold-based clustering by assigning a genome sequence to a cluster if its similarity (e.g., ANI) to the cluster meets or exceeds a user-defined threshold. The specific similarity criterion depends on the clustering algorithm and may refer to the nearest genome, the furthest genome, or the centroid within the cluster.

Users can choose from three similarity measures for clustering using the --metric option: tani, gani, or ani. The threshold provided for the selected measure determines the minimum similarity required to connect two genomes. For asymmetric measures like ANI and gANI (where the similarity between genome A and B differs from that between B and A), Vclust uses the maximum value as the weight for clustering.

# Cluster genomes based on tANI and connect genomes if their ANI ≥ 95%.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm single \
--metric ani --ani 0.95

Additionally, Vclust may use extra threshold values for various combinations of similarity measures (--ani, --gani, --tani, --qcov, --rcov, --num_alns, --len_ratio). If a genome pair fails to meet the specified thresholds, it is excluded from clustering.

# Cluster genomes based on ANI similarity measure, but connect them only
# if ANI ≥ 95% and query coverage ≥ 85% and reference coverage ≥ 85%.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm single \
--metric ani --ani 0.95 --qcov 0.85 --rcov 0.85

4.4.2. Clustering with the Leiden algorithm

Vclust leverages the Leiden algorithm via the igraph implementation integrated into Clusty. As with other Vclust algorithms, you can select from three similarity measures: tANI, gANI, or ANI. These measures define the edge weights used by the Leiden algorithm.

# Cluster genomes based on ANI similarity measure, but connect them only
# if ANI ≥ 95% and query coverage ≥ 85%.
vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85

If you wish to include coverage information in the edge weight, following the IMG/PR's approach (where edge weight = ANI x max(qcov, rcov)), use the gani measure for clustering.

vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric gani --gani 0.5 --ani 0.95 --qcov 0.85

The granularity of clustering can be adjusted with the --leiden-resolution parameter [range: 0-1, default: 0.7]. Higher resolutions produce more, smaller clusters, while lower resolutions result in fewer, larger clusters.

vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85 --leiden-resolution 1

The clustering can be further refined using two additional parameters. The --leiden-beta parameter (default: 0.01) controls the randomness in the Leiden algorithm, allowing for fine-tuning of the clustering process. The --leiden-iterations parameter (default: 2) specifies the number of iterations for the Leiden algorithm, where increasing the number may lead to improved clustering quality.

vclust cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --algorithm leiden \
--metric ani --ani 0.95 --qcov 0.85 --leiden-resolution 0.8 --leiden-beta 0.02 \
--leiden-iterations 3

4.4.3. Cluster output

The cluster command generates a TSV file with genome identifiers followed by 0-based cluster identifiers. The file includes all input genomes, both clustered and singletons, listed in the same order as the IDs file (sorted by decreasing sequence length). The first genome in each cluster is the representative. For example, cluster 0 has NC_005091.alt2 as its representative and includes NC_005091.alt1 and NC_005091.ref.

object	cluster
NC_025457.alt2	3
NC_005091.alt2	0
NC_005091.alt1	0
NC_005091.ref	0
NC_002486.alt	1
NC_002486.ref	1
NC_025457.ref	4
NC_025457.alt1	5
NC_010807.alt2	2
NC_010807.alt3	2
NC_010807.alt1	2
NC_010807.ref	2

Alternatively, instead of numerical cluster identifiers, Vclust can output a representative genome for each cluster using the --out-repr option. The representative genome within each cluster is determined as the one with the longest sequence.

object	cluster
NC_025457.alt2	NC_025457.alt2
NC_005091.alt2	NC_005091.alt2
NC_005091.alt1	NC_005091.alt2
NC_005091.ref	NC_005091.alt2
NC_002486.alt	NC_002486.alt
NC_002486.ref	NC_002486.alt
NC_025457.ref	NC_025457.ref
NC_025457.alt1	NC_025457.alt1
NC_010807.alt2	NC_010807.alt2
NC_010807.alt3	NC_010807.alt2
NC_010807.alt1	NC_010807.alt2
NC_010807.ref	NC_010807.alt2

4.5. Deduplicate

The deduplicate command removes duplicate (identical) sequences within a single FASTA file or across multiple FASTA files, creating a single non-redundant FASTA file. Identical sequences are identified using SHA-256 hashes, considering reverse complements. If two identical sequences are found in different orientations, they are treated as duplicates. The first sequence encountered (based on the input file order) is retained as the representative sequence.

# Remove duplicate sequences within and across refseq.fna and genbank.fna,
# prioritizing RefSeq sequences over GenBank if duplicates exist.
vclust deduplicate -i refseq.fna genbank.fna -o nr.fna

The command supports gzipped input and output files. Gzipped input files are automatically recognized, and the --gzip-output option enables gzipped output:

vclust deduplicate -i refseq.fna.gz genbank.fna.gz --gzip-output -o nr.fna.gz

4.5.1. Deduplicate output

The deduplicate command generates two output files:

A FASTA file containing non-redundant sequences (e.g., nr.fna).
A text file listing duplicate sequences (e.g., nr.fna.duplicates.txt).

In the duplicate file, each line lists sequence identifiers of identical sequences. The first identifier represents the retained sequence, while other identifiers start with either + or -, indicating the orientation of duplicate sequences relative to the retained sequence:

NC_025457.1 +KJ473423.1
NC_010807.1 +EU547803.1
NC_005091.2 +AY357582.2
NC_002486.1 +AB044554.1

Here, NC_025457.1 is the retained sequence, while its duplicate, KJ473423.1, is on the same strand and excluded from the FASTA output.

Adding prefixes to sequence identifiers

To prevent identifier collisions between input files, you can add file-based prefixes using the --add-prefixes option:

vclust deduplicate -i refseq.fna genbank.fna other.fna -o nr.fna --add-prefixes

Example output:

other|Mushuvirus +other|Mushuvirus_copy
refseq|NC_025457.1 +genbank|KJ473423.1
refseq|NC_010807.1 +genbank|EU547803.1 +other|NC_010807.1_duplicate
refseq|NC_005091.2 -genbank|AY357582.2 +other|AY357582.2_duplicate
genbank|MN428048.1 -other|MN428048.1_revcomp
refseq|NC_002486.1 +genbank|AB044554.1

Alternatively, you can provide custom prefixes:

vclust deduplicate -i refseq.fna genbank.fna other.fna -o nr.fna --add-prefixes r_ g_ o_