- added the `--small-genomes` preset. This is just an alias for `-c 30 -m 200 --faster-small`. This makes skani much faster when comparing hundreds of thousands of small genomes.
- fixed a bug where `skani triangle --full-matrix` gave different results between STDOUT and `-o` (thanks to Florian Plaza Onate)
- added a `--diagonal` option (suggested by Antonio Camargo) to print diagonal entries for sparse and lower-triangular distance matrices
- added a warning to use `--faster-small` when comparing too many contigs (e.g. viruses, plasmids).
More consistent support for small contigs and sequences.
- --faster-small option included in dist and triangle.
Genomes (and contigs with the --i, --ri, --qi options) with fewer than 20 marker k-mers are not screened according to the -s option. This makes skani more sensitive for small sequences, but can hamper performance on very large datasets with lots of small genomes/contigs.
This heuristic can now be disabled with the --faster-small option.
- skani's version is now displayed properly
- Added some error messages for degenerate cases (and more testing)
- We found that the statically built binary can be a lot slower in certain cases. File i/o may be an issue for the binary version. A note is now added in the README.
- The --learned-ani feature was buggy and has been removed.
- Major bug found: debiasing for ANI was turned off if there were > 5000 queries present in skani search and skani dist. This bug is fixed now.
- The Rust API is changing in this version. Not published to Cargo yet (waiting on DDOtten/partitions#3 to be published to crates...)
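The small-sequence rule described in these notes (sequences with fewer than 20 marker k-mers bypass `-s` screening unless `--faster-small` is given) can be written as a small decision function. This is only a sketch of the rule as stated here, not skani's actual code; the function name and its arguments are hypothetical.

```shell
# Decide whether a sequence goes through the -s screening step.
#   $1: number of marker k-mers in the sequence
#   $2: 1 if --faster-small was given, 0 otherwise
# Prints "screen" or "bypass". Hypothetical helper, not part of skani.
screening_decision() {
  markers=$1
  faster_small=$2
  if [ "$faster_small" -eq 1 ] || [ "$markers" -ge 20 ]; then
    echo "screen"
  else
    # Fewer than 20 markers and the heuristic is active: skip screening.
    echo "bypass"
  fi
}

screening_decision 12 0    # small sequence, default behaviour
screening_decision 12 1    # small sequence with --faster-small
screening_decision 500 0   # large genome
```

With the heuristic active, only the first call bypasses screening, which is the sensitivity-for-speed trade-off the note describes.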
Improved "N" character support:
- Changed the query-reference selection method slightly via a small hack: marker seeds are now used to estimate reference length, so runs of N characters are not counted.
- Seeds containing "N" characters are no longer indexed.
- --robust now uses the learned ANI debiasing procedure by default.
- skani triangle had a bug where, if more than 5000 queries were present and neither --sparse nor -E was specified, the intermediate batch of 5000 queries would be written in sparse mode.
- skani triangle -o was writing an upper-triangular matrix instead of a lower-triangular one (skani triangle > res gives lower triangle). Matrices are consistently lower triangle now.
- Changed to lto = true for release mode. I see a 5-10% speedup from this.
- Updated some dependencies so skani no longer depends on old crates that will be deprecated.
- Fixed a bug where memory was blowing up in `dist` and `triangle` when the marker-index was activated. For big datasets, there could be > 100 GBs of wasted memory.
- skani now outputs intermediate results after processing each batch of 5000 queries. This will mean that outputs may no longer be deterministically ordered if there are > 5000 genomes, but you can sort the output file to get deterministic outputs, i.e. `skani triangle *.fa | sort -k 3 -n > sorted_skani_result.txt` will guarantee deterministic output order.
- Changed the marker index hash table population method. Used to overestimate memory usage slightly.
- New help message for marker parameters. Turns out that for small genomes, having more markers may make filtering significantly better.
- Added -i option to sketch so you can sketch individual records in multifastas -- does not work for search yet though, only for sketching.
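The sorting workaround for batched output can be checked without running skani. The sketch below uses two made-up three-column tables (hypothetical file names and ANI values, not real skani output) that contain the same rows in different batch orders, and shows that `sort -k 3 -n` brings both to the same order. `LC_ALL=C` is an addition here so that tie-breaking does not depend on the locale.

```shell
# Hypothetical rows: reference, query, ANI. All values are invented.
run_a() {
  printf '%s\t%s\t%s\n' \
    g1.fa g2.fa 99.10 \
    g1.fa g3.fa 95.42 \
    g2.fa g3.fa 97.85
}

# Same rows, as if the batches had been written in a different order.
run_b() {
  printf '%s\t%s\t%s\n' \
    g2.fa g3.fa 97.85 \
    g1.fa g2.fa 99.10 \
    g1.fa g3.fa 95.42
}

# Numeric sort starting at field 3, as in the note above.
sorted_a=$(run_a | LC_ALL=C sort -k 3 -n)
sorted_b=$(run_b | LC_ALL=C sort -k 3 -n)

if [ "$sorted_a" = "$sorted_b" ]; then
  echo "identical after sorting"
fi
```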
Small fixes.
- Added `--medium` pre-set, which is just `-c 70`. Seems to work okay for comparing fragmented genomes.
- BREAKING: Changed `--marker-index` to `--no-marker-index` as a more sane option.
- Added `--distance` option to `skani triangle` to output distance matrix (i.e. 100 - ANI) instead of similarity matrix.
- Misc. help message fixes
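The relation behind `--distance` is simply distance = 100 - ANI. As a rough illustration, the snippet below applies that conversion to a few invented ANI values with `awk`; the function name `ani_to_distance` is hypothetical and not part of skani.

```shell
# Convert ANI percentages (one per line on stdin) to distances, 100 - ANI.
# Hypothetical helper for illustration only.
ani_to_distance() {
  awk '{ printf "%.2f\n", 100 - $1 }'
}

# Invented ANI values; prints 0.00, 0.90 and 4.58 on separate lines.
printf '%s\n' 100 99.10 95.42 | ani_to_distance
```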
Small fixes.
- Made aligned fraction in `triangle` mode a full matrix by default. This is not a symmetric matrix since AF is not symmetric.
- Misc. help message fixes
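To see why the aligned-fraction matrix is not symmetric, assume AF is the aligned length divided by the length of the genome it is reported for (an assumption for this sketch; these notes do not spell out the formula). The same alignment then gives a different AF for each genome whenever the genome lengths differ. All numbers below are invented, and `aligned_fraction` is a hypothetical helper.

```shell
# Hypothetical helper: AF (%) = aligned bases / genome length * 100.
aligned_fraction() {
  awk -v aligned="$1" -v len="$2" 'BEGIN { printf "%.1f\n", 100 * aligned / len }'
}

aligned=2000000      # bases covered by alignments (invented)
len_a=4000000        # length of genome A (invented)
len_b=2500000        # length of genome B (invented)

echo "AF relative to A: $(aligned_fraction "$aligned" "$len_a")"
echo "AF relative to B: $(aligned_fraction "$aligned" "$len_b")"
```

Under this assumption the A-versus-B entry is 50.0 while the B-versus-A entry is 80.0, so both halves of the matrix carry information.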
We added new experiments on the revised version of our preprint (Extended Data Figs 11-14). We show skani has quite good AF correlation with MUMmer, and that it works decently on simple eukaryotic MAGs, especially with the `--slow` option (see below).
- ANI debiasing added - skani now uses a debiasing step with a regression model trained on MAGs to give more accurate ANIs. Old version gave robust, but slightly overestimated ANIs, especially around 95-97% range. Debiasing is enabled by default, but can be turned off with `--no-learned-ani`.
- More accurate aligned fraction - chaining algorithm changed to give a more accurate aligned fraction (AF) estimate. The previous version had more variance and underestimated AF for certain assemblies.
- Small contig/genome defaults made better - should be more sensitive so that they don't get filtered by default.
- Repetitive k-mer masking made better - smarter settings and should work better for eukaryotic genomes; shouldn't affect prokaryotic genomes much.
- `--fast` and `--slow` mode added - alias for `-c 200` and `-c 30` respectively.
- More non x86_64 builds should work - there was a bug before where skani would be dysfunctional on non x86_64 architectures. It seems to at least build on ARM64 architectures successfully now.
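The presets mentioned across these notes are described as plain aliases, so each can be expanded to the flags it stands for. The lookup below collects them in one place as they are stated here; `expand_preset` is a hypothetical helper for illustration and does not call skani.

```shell
# Map a preset flag to the options these notes say it aliases.
# Hypothetical helper; later skani versions may define presets differently.
expand_preset() {
  case "$1" in
    --fast)          printf '%s\n' "-c 200" ;;
    --medium)        printf '%s\n' "-c 70" ;;
    --slow)          printf '%s\n' "-c 30" ;;
    --small-genomes) printf '%s\n' "-c 30 -m 200 --faster-small" ;;
    *)               printf 'unknown preset: %s\n' "$1" >&2; return 1 ;;
  esac
}

for preset in --fast --medium --slow --small-genomes; do
  printf '%s => %s\n' "$preset" "$(expand_preset "$preset")"
done
```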