-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat use tdigest for metrics #546
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks like a positive step forward, but a few concerns:
- in order to merge it in, we'll need a patch against the latest development release (yes, this temporarily does not have gaggle support)
- are there any concerns with tdigest not having commits for a couple of years?
- we'll need to run a 2-3 day load test and confirm that this new data structure continues to scale well without slowing anything down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to be able to merge this PR, it will need to be against the latest development version of the codebase. Please rebase. (I understand you desire the use of Gaggles, there is work underway to add them back in 0.18 (more likely to be called 1.0).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rebase the work doesn't look to be too much work.
serde_cbor = "0.11" | ||
serde_json = "1.0" | ||
simplelog = "0.10" | ||
strum = "0.24" | ||
strum_macros = "0.24" | ||
tdigest = { version = "0.2.3", features = ["serde", "use_serde"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this library hasn't been updated in 2 years, is there anything that has replaced it? Does it matter that it's not being updated?
https://github.com/MnO2/t-digest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, there is another implementation here https://docs.rs/pdatastructs/latest/pdatastructs/tdigest/struct.TDigest.html however that doesn't implement serde serialize/deserialize which we would need.
I am not concerned about the crate not getting updates as this is the kind of things that once implemented correctly doesn't need ongoing maintanence. However if you would prefer I can change out the implementation.
Or as we discussed I could make the implementation generic and then users can bring their own solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, with your effort toward making it Generic, I agree there's no huge risk using the library you have chosen.
src/manager.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is not in the current development branch ...
This sounds great, there is a parameter we can tune to adjust performance. When creating the tdigest you provide it with a max size, I use 100 a reasonable default, but we could potentially change it as needed. A lower value will mean less allocations an likely better performance as a result. |
Thanks for the quick feedback I have update the PR to be off main branch. Additionally I updated all the tests and introduced a This change might have introduced some new dead code. I am going to do another pass to see if that is so and delete it accordingly. |
By uncommenting line 236 of |
Thanks for the hints on the failing tests. I still haven't fixed them yet but I have learned a few things:
I see a few ways forward.
My vote is option 2 as it seems like the best long term solution since that crate is more active and more likely to accept the change. Why is the format so large? Under the hood a t-digest is a set of centroids, each centroid is two numeric values (i.e. the mean and the weight). With the max_size set to 100 the tdigest will store at most 100 centroids. However each centroid is serialized as The trade off with tdigest is that it is guaranteed to never use more than 100 centroids where as the previous implementation was technically unbounded, however tdigest is much more likely to use all 100 centroids. Meaning if we can get the logic working for 100 centroids we know it will always work. Additionally the value 100 was choosen somewhat arbitrarily, we could choose a smaller value if needed. Thanks for working through these details with me. Happy to go in whichever direction. |
The size of the tdigest serialization format shouldn't matter if we remove To remove from the tests, I was thinking something like this patch:
By moving That said, now there's an (apparently) unrelated problem in which users aren't starting quickly enough and so we instead fail with the following error:
At a quick glance I'm not sure why this would happen, as |
Thanks for the diff, I did something similar and got past that part of the test failures.
Yes, and I ran into the same issue with |
This change replaces this existing metrics collections with a tdigest implementation. This is beneficial for a few reasons:
This means we can get more accurate quantiles without having to use more memory that we currently do.
The previous logic would round values in order to save space, tdigest does something similar but does so in a way that minimizes the accumulated error introduced from the rounding. Instead of rounding values to predetermined values it rounds values to the closest existing point already in the distribution and only rounds when there is a need (i.e. its reached its space limit).
This PR is still WIP but wanted to start the conversation around the contribution before continuing:
TODO:
This PR addresses part of the issue #366
As an side we are using this branch successfully already getting good quantiles from our scenarios runs.