
kind duplicacy and other tools benchmark report #635

deajan opened this issue Sep 6, 2022 · 4 comments

deajan commented Sep 6, 2022

Hello,

I'm currently doing benchmarks for deduplication backup tools, including duplicacy.
I decided to write a script that would:

  • Install the backup programs
  • Prepare the source server
  • Prepare local targets / remote targets
  • Run backup and restore benchmarks
  • Use publicly available data (linux kernel sources as a git repo) and check out various git tags to simulate user changes in the dataset (see the sketch after this list)
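
A minimal sketch of that last step, assuming a local clone and a hand-picked tag list (the path and tags are illustrative, not the actual backup-bench values):

```sh
#!/usr/bin/env bash
# Simulate dataset evolution by checking out successive kernel tags,
# running every backup tool after each checkout.
SRC=/opt/backup_bench/linux                      # illustrative path
git clone https://github.com/torvalds/linux.git "$SRC"
for tag in v5.15 v5.16 v5.17 v5.18; do           # illustrative tag list
    git -C "$SRC" checkout -q "$tag"
    # ... run each tool's backup against $SRC and record timings ...
done
```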

The idea of the script is to produce reproducible results, the only changing factors being the machine specs and the network link between sources and targets.

So far, I've run two sets of benchmarks, each done locally and remotely.
You can find the results at https://github.com/deajan/backup-bench

I'd love for you to review the recipe I used for duplicacy, and perhaps guide me on what parameters to use to get maximum performance.
Any remarks / ideas / PRs are welcome.

I've also made a comparison table of some features of the backup solutions I'm benchmarking.
I'm still missing some information for some of the backup programs.
Would you mind having a look at the comparison table and filling in the question marks related to duplicacy's features?
Also, if duplicacy has an interesting feature I didn't list, I'll be happy to extend the comparison.

PS: I'm trying to be as unbiased as possible with these benchmarks, so please forgive me if I didn't treat your program with the parameters it deserves.

Also, I've created the same issue in every git repo of the backup tools I'm testing, so every author / team / community member can judge / improve the instructions for better benchmarking.

@blackbit47

Hi @deajan,
Awesome work!

I took a look at your script and I have a few suggestions:

1- For backup and restore commands, please use the -threads option with 8 threads for your setup. It will significantly increase speed.

Increase -threads beyond 8 until you saturate the network link or see speeds drop.
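
For example (assuming the repository is already initialized; the revision number is illustrative):

```sh
duplicacy backup -stats -threads 8
duplicacy restore -r 1 -threads 8
```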

2- During init, please play with the chunk size:

-chunk-size, -c: the average size of chunks (default is 4M)
-max-chunk-size, -max: the maximum size of chunks (default is chunk-size*4)
-min-chunk-size, -min: the minimum size of chunks (default is chunk-size/4)

With homogeneous data, you should see smaller backups and better deduplication. See Chunk size details.
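
A hedged example of setting these at init time (the sizes, snapshot id and storage URL are illustrative, not a recommendation):

```sh
duplicacy init -e -c 1M -min 256K -max 4M benchmark-repo sftp://user@target//backup/duplicacy
```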

3- Some clarifications for your shopping list on Duplicacy:

1- Redundant index copies: duplicacy doesn't use indexes.
2- Continue restore on bad blocks in repository: yes, plus Erasure Coding
3- Data checksumming: yes
4- Backup mounting as filesystem: no (fuse implementation PR)
5- File includes / excludes based on regexes: yes
6- Automatically excludes CACHEDIR.TAG(3) directories: no
7- Is metadata encrypted too?: yes
8- Can encrypted / compressed data be guessed (CRIME/BREACH style attacks)?: no
9- Can a compromised client delete backups?: no (with pub key and immutable target; requires target setup)
10- Can a compromised client restore encrypted data?: no (with pub key)
11- Does the backup software support pre/post execution hooks?: yes, see Pre Command and Post Command Scripts (sketch after this list)
12- Does the backup software provide a crypto benchmark?: yes, there is a benchmark command (also sketched below).
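
A minimal sketch of points 11 and 12, assuming duplicacy's documented .duplicacy/scripts convention (the script body is a placeholder):

```sh
# Pre/post hooks: duplicacy looks for .duplicacy/scripts/pre-<command> and
# .duplicacy/scripts/post-<command> inside the repository, e.g. for backup:
mkdir -p .duplicacy/scripts
printf '#!/bin/sh\necho "about to run backup"\n' > .duplicacy/scripts/pre-backup
chmod +x .duplicacy/scripts/pre-backup

# Built-in benchmark: measures disk, chunk splitting/encryption and storage speed
duplicacy benchmark
```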

Important:

13- Duplicacy is serverless: less cost, less maintenance, less attack surface.
14- Duplicacy works with a ton of storage backends: infinitely scalable and more secure.
15- No indexes or databases.

Hope this helps a bit. Feel free to join the Forum.

Keep up the good work.


deajan commented Oct 3, 2022

Thanks for your time. The table has been updated with your answers.
I've seen very bad restore speeds using duplicacy. Anything in mind I could try?

Also, would you have a link to something explaining why CRIME/BREACH style attacks are not feasible, perhaps?


deajan commented Oct 3, 2022

Thinking of it, it seems that duplicacy produces bigger repository sizes than its contenders.
What's duplicacy's default compression algorithm and level, and how does one change it? All other programs use zstd:3.

@stevesbrain

> Thanks for your time. The table has been updated with your answers. I've seen very bad restore speeds using duplicacy. Anything in mind I could try?
>
> Also, would you have a link to something explaining why CRIME/BREACH style attacks are not feasible, perhaps?

Could I suggest trying out something like Backblaze's B2 with Duplicacy? Just today I experimented with restore times on SSH vs. B2, and B2 was 10x faster than SSH (and that was SSH to multiple remote hosts, just to confirm).
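
If you want to reproduce that, a minimal sketch of attaching a B2 storage to an existing repository (storage name, snapshot id and bucket are placeholders):

```sh
duplicacy add b2 benchmark-repo b2://my-benchmark-bucket   # prompts for B2 credentials
duplicacy backup -storage b2 -stats -threads 8
```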
