1.5.0-rc1 discussions #333

Open
zevv opened this issue Sep 5, 2024 · 19 comments

@zevv
Owner

zevv commented Sep 5, 2024

Hi @l8gravely,

Lacking a mailing list or forum, I took the liberty of opening an issue for discussing the 1.5.0-rc1 release. I'll use this to jot down some notes in no particular order; feel free to ignore or answer as you please :)

New database format

The default db format moved from tokyocabinet to tkrzw. It's high time we leave tokyocabinet behind, as it's just not stable and very much unmaintained. Some comments though:

  • After upgrading, duc gets confused by the new db format and throws the message "Error opening: /home/ico/.cache/duc/duc.db - Database corrupt and not usable". Maybe we could add a hint here that the database format might not match the current version and that the user should clean it up and re-index.
  • Apart from the performance, the small size of the database files was the main reason I went with tokyocabinet years ago, as it manages to create considerably smaller databases than the other engines. On my little test setup, the new duc ends up with a database more than 8 times larger than before - 71M to 590M just to index my home dir. I have not tested on humongous file systems since I do not have any, so I have no clue what this looks like when scaling up.
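
One way the hint suggested in the first bullet could work is to peek at the file header before reporting corruption. The sketch below is only an illustration in Python, not duc's code: the helper names are hypothetical, and the magic prefixes are assumptions that should be checked against each library's file format before relying on them.

```python
# Hypothetical helper: guess which backend wrote a database file by looking
# at its first bytes. The magic prefixes are assumptions, not verified
# against the tokyocabinet, tkrzw or sqlite3 sources.
KNOWN_MAGIC = {
    b"ToKyO CaBiNeT": "tokyocabinet",
    b"TkrzwHDB": "tkrzw",
    b"SQLite format 3\x00": "sqlite3",
}


def guess_db_format(path):
    """Return the backend name suggested by the header, or None if unknown."""
    with open(path, "rb") as f:
        header = f.read(32)
    for magic, name in KNOWN_MAGIC.items():
        if header.startswith(magic):
            return name
    return None


def open_error_hint(path, expected):
    """Build a friendlier message for a failed open (wording is illustrative)."""
    found = guess_db_format(path)
    if found is not None and found != expected:
        return (f"Error opening: {path} - looks like a {found} database, "
                f"but this build uses {expected}; remove it and re-index")
    return f"Error opening: {path} - Database corrupt and not usable"
```

With something like this, a tokyocabinet file opened by a tkrzw build would produce the re-index hint instead of the generic corruption message.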

topn command

No comments - very useful and a welcome addition!

histogram support

Still early work on the reporting side I see, but at least the info is already there in the db, nice.

@l8gravely
Collaborator

I'll fix the DB checking issues. I thought I had already gotten that right, but obviously I missed something. Was the old DB in tokyocabinet format? I'll run some tests and see what I need to do.

@l8gravely l8gravely self-assigned this Sep 5, 2024
@bougui

bougui commented Oct 9, 2024

Hello @zevv, I will compile the new version and test it next week on our setup, where we index ~120 TB.

Questions:

  • Will the new version also include the user option to get the usage of only a specific user?

  • Also, a colleague told me (I have not tested it yet) that if we run duc for a specific user, say bob, and a subfolder is not owned by bob, duc will not descend into that directory, even though bob could have files in it without owning the subfolder itself.
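
To make the second question concrete, here is a minimal Python sketch of per-user accounting that descends into every directory regardless of who owns it. It only illustrates the behaviour being asked for; it says nothing about how duc implements its user option, and `usage_by_uid` is a hypothetical name.

```python
import os
from collections import defaultdict


def usage_by_uid(root):
    """Sum apparent file sizes per owning uid below root.

    Every directory is descended regardless of its owner, so files that
    belong to one user inside another user's directory are still counted.
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            totals[st.st_uid] += st.st_size
    return dict(totals)
```

The trade-off is that the whole tree has to be walked even when only one user's total is wanted.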

TIA

@l8gravely
Collaborator

l8gravely commented Oct 9, 2024 via email

@bougui

bougui commented Oct 9, 2024

Hello John,

Since the DB size is increasing with the new version, and since duc must be used on large filesystems with more than one user, I would definitely add an option to keep per-user sizes if requested as an argument ;-) And since we have lots of space to check, we should have room to keep a larger DB.

@l8gravely
Collaborator

l8gravely commented Oct 10, 2024 via email

@stuartthebruce

Is there an option in version 1.5 to specify the different compressors supported by tkrzw?

And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file?

@stuartthebruce

FYI, I was able to use 1.5.0-rc1 to index 1.2B files in a large backup zpool,

[root@origin-staging ~]# duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw

[root@origin-staging ~]# time duc index -vp /backup2 -d /dev/shm/duc.db                                                 
Writing to database "/dev/shm/duc.db"
Indexed 1205911647 files and 44090454 directories, (814.8TB apparent, 599.2TB actual) in 8 hours, 9 minutes, and 11.89 seconds.


real    489m11.917s
user    14m2.310s
sys     327m10.748s

It would be nice if there was inline compression for such large database files; e.g., post facto zstd is able to reduce the database file from 25GB to 16GB,

[root@origin-staging ~]# ls -lh /dev/shm/duc.db 
-rw-r--r-- 1 root root 25G Oct 19 21:19 /dev/shm/duc.db

[root@origin-staging ~]# zstd --verbose -T0 /dev/shm/duc.db 
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Note: 24 physical core(s) detected 
/dev/shm/duc.db      : 64.69%   (26053963960 => 16853415278 bytes, /dev/shm/duc.db.zst) 

[root@origin-staging ~]# ls -lh /dev/shm/duc.db.zst 
-rw-r--r-- 1 root root 16G Oct 19 21:19 /dev/shm/duc.db.zst
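
As a quick check of the zstd figure, the percentage can be recomputed from the two byte counts in the transcript. This is just arithmetic in Python; `compressed_percent` is a hypothetical helper name.

```python
def compressed_percent(original_bytes, compressed_bytes):
    """Compressed size as a percentage of the original size."""
    return 100.0 * compressed_bytes / original_bytes


# Byte counts copied from the zstd output above.
original = 26053963960
compressed = 16853415278

pct = compressed_percent(original, compressed)
saved_gib = (original - compressed) / 2**30
print(f"{pct:.2f}% of original, {saved_gib:.1f} GiB saved")
```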

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@stuartthebruce

> > Is there an option in version 1.5 to specify the different compressors supported by tkrzw?
>
> Currently this is not an option. Do you have a need?

Only to help test if a choice other than the current default helps with compressibility.

> > And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file?
>
> That's a good point, I'll have to look into adding that. John

Thanks.

@stuartthebruce

> That is an impressive space reduction. Or depressing, depending on how you look at it. I'll see what I can come up with. I assume you're willing to run tests on proposed patches?

Yes.

@stuartthebruce

> Do you happen to have the tkrzw utils installed?

I do now.

> Can you run the following and send me the results? I'm trying to pick better tuning defaults if I can.
>
> $ tkrzw_dbm_util inspect /dev/shm/duc.db

[root@origin-staging ~]# tkrzw_dbm_util inspect /dev/shm/duc.db
APPLICATION_ERROR: Unknown DBM implementation: db

With what I think are the right additional arguments?

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=3
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=1048583
  num_records=44090459
  eff_data_size=25474436170
  file_size=26053963960
  timestamp=1729397989.120004
  db_type=0
  max_file_size=8796093022208
  record_base=5246976
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 26053963960
Number of Records: 44090459
Healthy: true
Should be Rebuilt: true

> and if you're feeling happy, please do:
>
> $ time tkrzw_dbm_util rebuild /dev/shm/duc.db

[root@origin-staging ~]# time tkrzw_dbm_util rebuild --dbm hash /dev/shm/duc.db                                                            
Old Number of Records: 44090459
Old File Size: 26053963960
Old Effective Data Size: 25474436170
Old Number of Buckets: 1048583
Optimizing the database: ... ok (elapsed=183.065716)
New Number of Records: 44090459
New File Size: 26489626808
New Effective Data Size: 25474436170
New Number of Buckets: 88180927

real    3m3.069s
user    2m31.424s
sys     0m30.468s

> $ tkrzw_dbm_util inspect /dev/shm/duc.db

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=7
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=88180927
  num_records=44090459
  eff_data_size=25474436170
  file_size=26489626808
  timestamp=1729531678.856718
  db_type=0
  max_file_size=8796093022208
  record_base=440909824
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 26489626808
Number of Records: 44090459
Healthy: true
Should be Rebuilt: false
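
The two inspect outputs can be compared with a little arithmetic. The Python sketch below computes the records-per-bucket ratio before and after the rebuild and checks where the extra file size went. Reading `record_base` as the end of the bucket table is my assumption, and the threshold tkrzw uses for "Should be Rebuilt" is not stated in this thread.

```python
def load_factor(num_records, num_buckets):
    """Average number of records per hash bucket."""
    return num_records / num_buckets


# Values copied from the inspect outputs before and after the rebuild.
before = dict(num_records=44090459, num_buckets=1048583,
              record_base=5246976, file_size=26053963960)
after = dict(num_records=44090459, num_buckets=88180927,
             record_base=440909824, file_size=26489626808)

for label, db in (("before", before), ("after", after)):
    lf = load_factor(db["num_records"], db["num_buckets"])
    print(f"{label}: {lf:.2f} records per bucket")

# Compare the file growth with the growth of the region that precedes
# the first record.
file_growth = after["file_size"] - before["file_size"]
base_growth = after["record_base"] - before["record_base"]
print(f"file grew by {file_growth} bytes, record_base by {base_growth} bytes")
```

The rebuild moved the table from roughly 42 records per bucket to about 0.5, and the file grew by the same number of bytes as `record_base`, which suggests the size increase is the larger bucket table rather than the data.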

@l8gravely
Collaborator

l8gravely commented Oct 21, 2024 via email

@stuartthebruce

Running with the above patch significantly reduces the db size for a large index, with an acceptable amount of increased CPU time; with the patch,

[root@origin-staging duc-1.5.0-rc1]# ./duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw (zstd)

[root@origin-staging duc-1.5.0-rc1]# time ./duc index -vp /backup2 -d /dev/shm/duc.db
Writing to database "/dev/shm/duc.db"
opening tkzrw DB with compression
Indexed 1211640321 files and 44450920 directories, (821.7TB apparent, 603.4TB actual) in 7 hours, 27 minutes, and 18.96 seconds.


real    447m18.983s
user    22m9.975s
sys     254m30.657s

[root@origin-staging duc-1.5.0-rc1]# ls -lh /dev/shm/duc.db
-rw-r--r-- 1 root root 17G Nov 24 03:34 /dev/shm/duc.db

and a subsequent manual compression run with zstd is able to find a bit more to compress,

[root@origin-staging duc-1.5.0-rc1]# time zstd --verbose /dev/shm/duc.db
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
/dev/shm/duc.db      : 83.08%   (17810632376 => 14797378480 bytes, /dev/shm/duc.db.zst) 

real    0m59.624s
user    1m1.009s
sys     0m7.904s

[root@origin-staging duc-1.5.0-rc1]# ls -lh /dev/shm/duc.db.zst 
-rw-r--r-- 1 root root 14G Nov 24 03:34 /dev/shm/duc.db.zst

For comparison, here is a run with the RC1 version available on github,

[root@origin-staging ~]# ./duc-1.5.0-rc1-rpm --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw

[root@origin-staging ~]# time ./duc-1.5.0-rc1-rpm index -vp /backup2 -d /dev/shm/duc.rpm.db
Writing to database "/dev/shm/duc.rpm.db"
Indexed 1211640321 files and 44450920 directories, (821.7TB apparent, 603.4TB actual) in 7 hours, 22 minutes, and 2.30 seconds.            


real    442m2.320s
user    16m11.189s
sys     269m7.602s

[root@origin-staging ~]# ls -lh /dev/shm/duc.rpm.db
-rw-r--r-- 1 root root 25G Nov 24 03:34 /dev/shm/duc.rpm.db

which has the following additional manual zstd compressibility,

[root@origin-staging ~]# time zstd --verbose /dev/shm/duc.rpm.db 
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
/dev/shm/duc.rpm.db  : 64.66%   (26172539240 => 16924320232 bytes, /dev/shm/duc.rpm.db.zst) 

real    2m50.898s
user    2m51.898s
sys     0m10.628s

[root@origin-staging ~]# ls -lh /dev/shm/duc.rpm.db.zst 
-rw-r--r-- 1 root root 16G Nov 24 03:34 /dev/shm/duc.rpm.db.zst
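
To put the two runs side by side, the `time` values and database sizes from the transcripts can be turned into percentage changes. A small Python sketch with hypothetical helper names; the numbers are copied from the output above.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` value such as '447m18.983s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time value: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


rc1 = {"real": "442m2.320s", "user": "16m11.189s", "bytes": 26172539240}
patched = {"real": "447m18.983s", "user": "22m9.975s", "bytes": 17810632376}

for key in ("real", "user"):
    change = percent_change(to_seconds(rc1[key]), to_seconds(patched[key]))
    print(f"{key} time: {change:+.1f}%")
print(f"db size: {percent_change(rc1['bytes'], patched['bytes']):+.1f}%")
```

User CPU time rises by a bit more than a third, while wall-clock time barely moves and the database shrinks by nearly a third.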

@l8gravely
Collaborator

l8gravely commented Nov 27, 2024 via email

@stuartthebruce

> > Running with the above patch significantly reduces the db size for a large index, with an acceptable amount of increased CPU time; with the patch,
>
> Great! Glad to hear this!
>
> So going from 17G to 14G is a nice savings, but honestly, you've got so much disk space and so much data I wonder if that extra step is worth it? *grin*

Agreed. The win here is going from 25GB (RC1) to 17GB (with your patch). The further reduction from 17GB to 14GB is just a measure of how much further tuning could possibly be done, but I am happy with 17GB.

> Can you run the tkrzw tools again on the DB file (before you compressed it again) to report on the bucket sizes and such? It would be interesting to know if there's more tuning we can do in tkrzw to make things better:
>
> tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
>
> All I really do in the setup is tweak some bucket sizes, so maybe there's something else I can do to make it better.

[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.db
  cyclic_magic=3
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=33
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=1048583
  num_records=44450925
  eff_data_size=17225775288
  file_size=17810632376
  timestamp=1732448080.589690
  db_type=0
  max_file_size=8796093022208
  record_base=5246976
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=zstd
Actual File Size: 17810632376
Number of Records: 44450925
Healthy: true
Should be Rebuilt: true

@Guillaume-ATZ

Hello all,

I was able to compile v1.5.0-rc1 and v1.5.0-rc2 on AlmaLinux 8.x. I will share the results for a small 511G volume (I did not have enough time tonight to test on our larger server).

./duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw

However, while testing I found that there is no option to exclude specific directories like .zfs, where we keep all our snapshots.

I see this option

-e, --exclude=VAL exclude files matching VAL

But this is for files? I tested it like this, but the index went through the /import/common/.zfs snapshots, which is not what I want.

time ./duc index -vp /import/common/ -e .zfs/ -d /dev/shm/duc.common.db

Did I use the -e option correctly? Or should we have a separate option to ignore some directories?

If I don't put the trailing slash it seems to work as I need; maybe we should add this to the docs?

time /import/poi/home/poigui04/code/duc/duc index -vp /import/common --exclude=.zfs -d /dev/shm/duc.common.db
Found big filesystem
Writing to database "/dev/shm/duc.common.db"
skipping /import/common/.zfs: Excluded by user
Indexed 18833092 files and 1666616 directories, (1.3TB apparent, 797.0GB actual) in 41 minutes, and 38.40 seconds.

real    41m38.819s
user    0m47.900s
sys     5m25.642s
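
The trailing-slash behaviour would be explained if the exclude pattern is glob-matched against the bare directory entry name, which never carries a slash. That is an assumption consistent with the runs above, not something confirmed in this thread. A minimal Python sketch of the idea, with a hypothetical `is_excluded` helper:

```python
from fnmatch import fnmatchcase


def is_excluded(entry_name, patterns):
    """True if a bare directory-entry name matches any exclude pattern."""
    return any(fnmatchcase(entry_name, p) for p in patterns)


# The entry name seen while scanning /import/common is ".zfs", without a slash.
print(is_excluded(".zfs", [".zfs/"]))  # False: the pattern requires a "/"
print(is_excluded(".zfs", [".zfs"]))   # True: exact match on the entry name
```

Under that assumption, `-e .zfs/` can never match, while `--exclude=.zfs` matches both files and directories with that name.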

I know ZFS is doing some compression, but those numbers don't match what I see using df

[root@poisux103 ~]# df -h /import/common/
Filesystem                 Size  Used Avail Use% Mounted on
zfs:/storage/common        694G  511G  183G  74% /import/common

Here is the db details

 tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.common.db
Inspection:
  class=HashDBM
  healthy=true
  auto_restored=false
  path=/dev/shm/duc.common.db
  cyclic_magic=3
  pkg_major_version=1
  pkg_minor_version=0
  static_flags=49
  offset_width=5
  align_pow=3
  closure_flags=1
  num_buckets=100000007
  num_records=1666621
  eff_data_size=384338666
  file_size=905454768
  timestamp=1733978796.671083
  db_type=0
  max_file_size=8796093022208
  record_base=500002816
  update_mode=in-place
  record_crc_mode=none
  record_comp_mode=lz4
Actual File Size: 905454768
Number of Records: 1666621
Healthy: true
Should be Rebuilt: false
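
For this smaller index the inspect numbers suggest that most of the file is not record data. The Python sketch below splits the file size into rough regions. It assumes that `num_buckets * offset_width` approximates the bucket table and that `record_base` marks where records start; both are my reading of the inspect output rather than documented facts.

```python
def layout(num_buckets, offset_width, record_base, eff_data_size, file_size):
    """Split a HashDBM file size into rough regions, in bytes."""
    return {
        "bucket_table_estimate": num_buckets * offset_width,
        "before_first_record": record_base,
        "effective_data": eff_data_size,
        "other": file_size - record_base - eff_data_size,
    }


# Values copied from the inspect output above.
file_size = 905454768
parts = layout(100000007, 5, 500002816, 384338666, file_size)
share = parts["before_first_record"] / file_size

for name, size in parts.items():
    print(f"{name}: {size}")
print(f"{share:.1%} of the file precedes the first record")
```

With 100000007 buckets for about 1.7 million records, more than half of the 905 MB file sits before the first record, which may be worth considering when picking default bucket counts.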

My actual setup is an ESX VM running AlmaLinux 8.x, mounting /import/common (compress=on) from our ZFS fileserver via NFS.

I will do more tests from our fileserver later on and report back.

@Guillaume-ATZ

Ha, I just figured this out...

"I know ZFS is doing some compression, but those numbers don't match what I see using df"

We have lots of hard links in this volume for several conda envs.

So that could be the explanation?
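
The effect of hard links on a total can be shown with a short Python sketch that sums sizes either per directory entry or once per inode. This illustrates the idea only and is not duc's implementation. The duc manual lists a `--check-hard-links` option on `duc index` for counting hard links only once; the runs in this thread did not use it.

```python
import os


def tree_size(root, count_hard_links_once=False):
    """Sum apparent sizes of non-directory entries below root."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if count_hard_links_once and st.st_nlink > 1:
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue  # another name for an inode already counted
                seen.add(key)
            total += st.st_size
    return total
```

A volume where the same files are linked into many environments will report a much larger total in the first mode than `df` shows.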

And here is the output for our older duc 1.4.4

/import/poi/it/duc/current/bin/duc --version
duc version: 1.4.4
options: cairo x11 ui tokyocabinet
time /import/poi/it/duc/current/bin/duc index -vp /import/common --exclude=.zfs -d /dev/shm/duc.1.4.common.db
Writing to database "/dev/shm/duc.1.4.common.db"
skipping /import/common/.zfs: Excluded by user
Indexed 18833092 files and 1666616 directories, (1.3TB apparent, 797.0GB actual) in 34 minutes, and 54.83 seconds.


real    34m55.584s
user    1m21.891s
sys     5m24.614s
