Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement slab level cache for remote frees #634

Closed
3 tasks
mjp41 opened this issue Sep 14, 2023 · 6 comments
Closed
3 tasks

Implement slab level cache for remote frees #634

mjp41 opened this issue Sep 14, 2023 · 6 comments

Comments

@mjp41
Copy link
Member

mjp41 commented Sep 14, 2023

@nwf-msr observed that we could improve the performance of remote deallocation if the producer does more work on building lists for each slab before returning to the original allocator. This could improve producer/consumer scenarios further.

Tasks

Preview Give feedback
@mjp41
Copy link
Member Author

mjp41 commented Sep 14, 2023

@Licenser @darach are you still using snmalloc. If so do you have any benchmarks that represent your workload? We have some ideas that would benefit your kind of workload for Tremor.

@darach
Copy link

darach commented Sep 14, 2023 via email

@mjp41
Copy link
Member Author

mjp41 commented Sep 14, 2023

Awesome thanks. Any benchmarks we can run would really help us in justifying the engineering work.

@darach
Copy link

darach commented Sep 14, 2023

Our CI benchmarks are snmalloc based https://www.tremor.rs/benchmarks/ - relatively boring UI there.
The benchmarks we ran a year or two ago for tremor ( and mimalloc before it, and jemalloc before that ) are here:
https://github.com/tremor-rs/tremor-runtime/blob/main/bench/README.md

They have changed enough though in themselves and we rewrote the runtime and connectors so the benchmarking
code works differently so YMMV

@nwf-msr nwf-msr mentioned this issue Sep 21, 2023
@nwf-msr
Copy link
Contributor

nwf-msr commented Sep 25, 2023

I've spent a while playing with various cache strategies (though nothing very sophisticated, just sort of seeing what the low-hanging fruit was like). For the two workloads I've tried, for three interesting choices of hashes, I currently have these message counts:

Workload No caching Perfect assembly 4-way direct hash
msgpass 7,205,380 552,781; 35 rings 1,076,843
xmalloc-test 2,388,551 317,057; 5 rings 317,065

The hash here is inspired by https://github.com/skeeto/hash-prospector but is sort of "a third" of that:

  hash = slab;
  hash *= 0x7feb352d;
  return (hash >> 16) & 3;

I'm sure it's possible to do better, but that seems to work alright, though it's not sensitive to the upper bits of the slab! Of note, however, changing the shift to hash >> 30 or performing a prospector-esque xor-shift prior to multiplication performs significantly worse on msgpass (and a little worse on xmalloc-test). The full prospector hash also does a little worse than the numbers above and is, obviously, a fair bit more expensive.

nwf-msr added a commit that referenced this issue Jun 13, 2024
* NFC: split freelist_queue from remoteallocator

This lets us use freelists as message queues in contexts other than
the remoteallocator.  No functional change indended.

* freelist_queue: add and use destroy_and_iterate

* freelist: make backptr obfuscation key "tweakable"

* freelist: tweakable keys in forward direction, too

* test/perf/msgpass: ubench a producer-consumer app

Approximate a message-passing application as a set of producers, a set of
consumers, and a set of proxies that do both.  We'll use this for some initial
insight for #634 but it seems worth
having in general.
nwf-msr added a commit that referenced this issue Sep 12, 2024
* msvc: set __cplusplus to the actual value in use

* ds_core/bits: add mask_bits; convert one_at_bit-s

* remotecache: enable reserve_space multiple objects

* nits

* Small changes to tracing

- Trace "Handling remote" once per batch, rather than per element

- Remote queue events also log the associated metaslab; we'll use this
  to assess the efficacy of #634

* freelist builder: allow forcibly tracking length

* Try forward declaring freelist::Builder to appease macos-14

* freelist: tweak intra-slab obfuscation keys by meta address

* NFC: freelist: allow `next` to be arbitrary value

* Switch to a central, tweaked key for all free lists

* allocconfig: introduce some properties of slabs

We'll use these to pack values in message queues.

- Maximum distance between two objects in a single slab
- Maximum number of objects in a slab

* NFC: Templatize LocalCache on Config

* NFC: split dealloc_local_object_slow

We'll use the _slower form when we're just stepping a slab through
multiple rounds of state transition (to come), which can't involve
the actual memory object in question.

* NFC: make freelist::Object::T-s by placement new

* NFC: CoreAlloc: split dealloc_local_object

The pattern of `if (!fast()) { slow() }` occurs in a few places, including in
contexts where we already know the entry and so don't need to look it up.
@mjp41
Copy link
Member Author

mjp41 commented Sep 25, 2024

This is now completed. #677 landed in main, and the paper appeared at ISMM'24.

@mjp41 mjp41 closed this as completed Sep 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants