The input data format #22

bingyinglee · 2024-09-10T08:47:31Z

Hi, I tried to use bindash to process my FASTA data. I first ran the following command:
./bindash sketch mydata.fas --outfname=genomeA.sketch
The mydata.fas file is about 50 MB and contains more than 20,000 nucleotide sequences, but the generated .sketch file is only 1 KB. Something seems wrong, but I don't know what to change.
Are there any requirements for the input data format?

jianshu93 · 2024-09-10T10:27:32Z

Hi @bingyinglee,

The output file size depends only on the sketch size (the --sketchsize64 M and --bbits N options), not on the size of the input, if your purpose is to compute genomic distances among your files. A sketch is just the first N bits of M 64-bit integers, so it is not that big. You can increase --sketchsize64 to 200, or even several thousand, if you need accuracy at 99% or 99.99% ANI and above (ANI, average nucleotide identity, is a widely used metric for genomic distance). This tool is only for genomic distance estimation, not for FASTQ/FASTA quality control or similar tasks.

Let me know if anything is unclear.

Best,

Jianshu
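
To illustrate why a sketch stays small regardless of input size, here is a minimal toy sketch in Python. It is not bindash's actual algorithm or file format: the function name `toy_sketch`, the choice of hash (BLAKE2b), the k-mer length, the bottom-M selection, and keeping the lowest `b` bits are all assumptions made for illustration. The parameters `m` and `b` only loosely mirror the roles of M and N described above.

```python
import hashlib
import heapq
import random


def toy_sketch(seq, k=21, m=64, b=14):
    """Hypothetical helper: keep b bits from each of the m smallest k-mer hashes."""
    # Hash every k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice here).
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }
    # Keep only the m smallest distinct hashes, however many k-mers there were.
    smallest = heapq.nsmallest(m, hashes)
    # Truncate each kept hash to b bits (lowest bits, as an assumption).
    mask = (1 << b) - 1
    return [h & mask for h in smallest]


rng = random.Random(0)
small = "".join(rng.choice("ACGT") for _ in range(500))
large = "".join(rng.choice("ACGT") for _ in range(50_000))

sketch_small = toy_sketch(small)
sketch_large = toy_sketch(large)
print(len(sketch_small), len(sketch_large))
```

Both inputs yield a sketch with the same number of entries, even though one sequence is 100 times longer than the other. In this toy model, only `m` and `b` change the amount of data stored.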
