Reduce memory usage by using the body directly instead of copying in BasicPublishAsync #1445

GerardSmit · 2023-12-15T14:56:42Z

Proposed Changes

This PR adds the ability to use the body in BasicPublishAsync directly when sending it to RabbitMQ, instead of copying it to a temporary byte-array.

Technical information

Currently BasicPublishAsync rents an array from the ArrayPool, with the frame + body length:

rabbitmq-dotnet-client/projects/RabbitMQ.Client/client/impl/Frame.cs

Lines 173 to 178 in 3794b4c

    
           int size = Method.FrameSize + Header.FrameSize + 
        
                      method.GetRequiredBufferSize() + header.GetRequiredBufferSize() + 
        
                      BodySegment.FrameSize * GetBodyFrameCount(maxBodyPayloadBytes, remainingBodyBytes) + remainingBodyBytes; 
        
           // Will be returned by SocketFrameWriter.WriteLoop 
        
           byte[] array = ClientArrayPool.Rent(size);

Then the body is being copied to this array.

rabbitmq-dotnet-client/projects/RabbitMQ.Client/client/impl/Frame.cs

Lines 183 to 188 in 3794b4c

    
           while (remainingBodyBytes > 0) 
        
           { 
        
               int frameSize = remainingBodyBytes > maxBodyPayloadBytes ? maxBodyPayloadBytes : remainingBodyBytes; 
        
               offset += BodySegment.WriteTo(array.AsSpan(offset), channelNumber, bodySpan.Slice(bodySpan.Length - remainingBodyBytes, frameSize)); 
        
               remainingBodyBytes -= frameSize; 
        
           }

In our application, we're sometimes forwarding large bodies to RabbitMQ. The problem is that the ArrayPool<byte>.Shared is using a separate buckets in each thread, causing a lot of new byte arrays to be generated.

In addition, the BasicPublishAsync writes the rented array in a channel:

rabbitmq-dotnet-client/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs

Lines 308 to 309 in 3794b4c

    
           await _channelWriter.WriteAsync(frames) 
        
               .ConfigureAwait(false);

Which is later written to the pipe:

rabbitmq-dotnet-client/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs

Lines 333 to 342 in 3794b4c

    
           while (await _channelReader.WaitToReadAsync().ConfigureAwait(false)) 
        
           { 
        
               while (_channelReader.TryRead(out RentedMemory frames)) 
        
               { 
        
                   try 
        
                   { 
        
                       await _pipeWriter.WriteAsync(frames.Memory) 
        
                           .ConfigureAwait(false); 
        
                       RabbitMqClientEventSource.Log.CommandSent(frames.Size); 
        
                   }

If RabbitMQ is has a slow response rate or we're trying to send a lot of data at once, the channel can fill up quickly. Causing it to hold 128 rented arrays.

New API

ConnectionFactory

In the factory there is a new property called CopyBodyToMemoryThreshold:

ConnectionFactory factory = new ConnectionFactory
{
    CopyBodyToMemoryThreshold = 4096
}

When this value is set, every data that is larger than the provided value (in this instance, 4096) will be used directly instead of being copied to a new array.
If the data is smaller than the provided value, the body is still copied to a new array.

When we use the data directly, we cannot send the data in the background like we're currently doing. We have to wait until the bytes are written to the pipe: otherwise, the application can modify the buffer. So, whenever we're using the buffer directly, the BasicPublishAsync wait until the bytes are written to the pipe.

This solution is a middle ground: smaller bodies are sent in the background but allocating, larger bodies are sent directly but are blocking.

The default value of CopyBodyToMemoryThreshold is int.MaxValue, so there is no breaking change.

IChannel

BasicPublishAsync has a new parameter: bool? copyBody = null:

ValueTask BasicPublishAsync<TProperties>(string exchange, string routingKey, in TProperties basicProperties, ReadOnlyMemory<byte> body = default, bool mandatory = false, bool? copyBody = null)
    where TProperties : IReadOnlyBasicProperties, IAmqpHeader;

When the value is:

true: the body is copied to a new array and sent in the background.
false: the body is written directly but blocking.
null: the length is being compared to the config CopyBodyToMemoryThreshold in the connection

There are two new overloads that allows ReadOnlySequence<byte> to be used as body:

ValueTask BasicPublishAsync<TProperties>(string exchange, string routingKey, in TProperties basicProperties, ReadOnlySequence<byte> body = default, bool mandatory = false, bool? copyBody = null)
    where TProperties : IReadOnlyBasicProperties, IAmqpHeader;

This way you can use System.IO.Pipelines without allocating to a temporary ReadOnlyMemory.

Types of Changes

What types of changes does your code introduce to this project?

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)

Checklist

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
All tests pass locally with my changes
Some tests timed out.
I have added tests that prove my fix is effective or that my feature works
I have added necessary documentation (if appropriate)
I'll wait until this API is approved/changes have been made.
~~Any dependent changes have been merged and published in related repositories~~

Further Comments

Maybe the naming of the properties and/or parameters are confusing and have to be renamed.
I tried my best to make unit tests to check if "CopyBodyToMemoryThreshold" works as indented but doing so I had to make some stuff internal instead of private. I also had to track the number of allocating bytes in the connection. Is this OK?

lukebakken · 2023-12-15T16:03:39Z

Hello, thanks for this contribution. Could you please provide a test application that clearly demonstrates the issue that this PR solves?

michaelklishin · 2023-12-15T16:07:58Z

@GerardSmit do you have any benchmarking data that would help the maintainers (and users alike) reason about the scale of the improvements?

paulomorgado · 2023-12-15T16:08:01Z

Would it be feasible to never copy and being the user of the library the one that copies if needed?

lukebakken · 2023-12-15T16:09:16Z

Would it be feasible to never copy and being the user of the library the one that copies if needed?

Probably! In my head I have a task to check out how other .NET client libraries work in this regard. I will add a task for the 7.0 release to do that.

paulomorgado · 2023-12-15T16:16:13Z

Another nice to have would be support Stream and ReadOnlySequence<byte>.

lukebakken · 2023-12-15T16:24:27Z

@paulomorgado please feel free to add comments to #1446 so they aren't lost!

GerardSmit · 2023-12-15T16:31:19Z

Hello, thanks for this contribution. Could you please provide a test application that clearly demonstrates the issue that this PR solves?

do you have any benchmarking data that would help the maintainers (and users alike) reason about the scale of the improvements?

I'll see if I can create test application / benchmark tomorrow.

Would it be feasible to never copy and being the user of the library the one that copies if needed?

This would mean the background channel (in the code below) should always be blocking (so the buffer doesn't get modified by the application). In this case a Channel<T> would be unnecessary. Maybe it's better to use an async lock if we go this way.

rabbitmq-dotnet-client/projects/RabbitMQ.Client/client/impl/SocketFrameHandler.cs

Lines 299 to 311 in 3794b4c

    
           public async ValueTask WriteAsync(RentedMemory frames) 
        
           { 
        
               if (_closed) 
        
               { 
        
                   frames.Dispose(); 
        
                   await Task.Yield(); 
        
               } 
        
               else 
        
               { 
        
                   await _channelWriter.WriteAsync(frames) 
        
                       .ConfigureAwait(false); 
        
               } 
        
           }

Another nice to have would be support Stream and ReadOnlySequence.

As stated in the PR, ReadOnlySequence<byte> support has been added.
Stream would be a nice addition 😄

I'll wait with any changes until the research in #1446 is done.

lukebakken · 2023-12-15T16:44:09Z

The more I think about this PR, optimizations like this feel like version 8 sorts of changes, not version 7.

Thoughts?

paulomorgado · 2023-12-15T18:58:37Z

The more I think about this PR, optimizations like this feel like version 8 sorts of changes, not version 7.

Thoughts?

Probably. Unless the changes can be reduced by forcing the user to copy the message if it needs to be isolated.

GerardSmit · 2023-12-15T21:46:02Z

@lukebakken @michaelklishin the test application can be found here: https://github.com/GerardSmit/RabbitMQ.MemoryTest

Results

Official client (copying)

$ dotnet run -c Release -- --mb=1 --iterations=100 --tasks=16
Official version
Body size       : 1 MB
Iterations      : 100
Tasks           : 16
Non-copying     : False
Startup memory  : 20 MB

---  Start  ---
Memory usage: 22 MB
Memory usage: 1765 MB
Memory usage: 2167 MB
Memory usage: 2632 MB

--- Results ---
Avg time        : 3737 ms
Min time        : 3679 ms
Max time        : 3759 ms
Memory          : 2728 MB
Queue length    : 1600 / 1600
Valid messages  : 100 / 100 (first 100 of 1600)

Fork (copying)

$ dotnet run -c Release -p:Fork=true -- --mb=1 --iterations=100 --tasks=16
Forked version
Body size       : 1 MB
Iterations      : 100
Tasks           : 16
Non-copying     : False
Startup memory  : 20 MB

---  Start  ---
Memory usage: 22 MB
Memory usage: 1934 MB
Memory usage: 2398 MB
Memory usage: 2819 MB

--- Results ---
Avg time        : 3447 ms
Min time        : 3384 ms
Max time        : 3489 ms
Memory          : 2854 MB
Queue length    : 1600 / 1600
Valid messages  : 100 / 100 (first 100 of 1600)

Fork (non-copying)

$ dotnet run -c Release -p:Fork=true -- --mb=1 --iterations=100 --tasks=16 --nc
Forked version
Body size       : 1 MB
Iterations      : 100
Tasks           : 16
Non-copying     : True
Startup memory  : 20 MB

---  Start  ---
Memory usage: 22 MB
Memory usage: 61 MB
Memory usage: 61 MB
Memory usage: 61 MB

--- Results ---
Avg time        : 3252 ms
Min time        : 3211 ms
Max time        : 3277 ms
Memory          : 61 MB
Queue length    : 1600 / 1600
Valid messages  : 100 / 100 (first 100 of 1600)

GerardSmit · 2023-12-15T21:59:04Z

The more I think about this PR, optimizations like this feel like version 8 sorts of changes, not version 7.

Understandable. I'll keep the fork up-to-date with 7.0 in the meantime and publish it to NuGet; since without it we're getting a lot of OutOfMemory-exceptions in our application.

We're running the application in a restricted container in Kubernetes, at first we thought it was cgroups not being supported in .NET. It does since .NET 9.0 (dotnet/runtime#93611) so we tried a daily build, without success. In the end we were seeing a lot of memory allocation in the ArrayPool coming from RabbitMQ, hence this PR.

This test was disabled before I've implemented TrackRentedBytesAsync correctly. This should pass now.

This complicates the tests. After returning the rented object back to the pool, the data is resetted.

PauloHMattos · 2024-01-30T19:52:38Z

IMO the library should not make any copy and simply use the data that it received in BasicPublishAsync.
Once the data is passed to the library, the user should await the completion of BasicPublishAsync and only then the data is safe to be modified/reused.

My view is that BasicPublishAsync would only complete once the data is published (or written to the Pipe) and is no longer being utilized by the client. If a user from the library does not wish to wait for the data to be published, he should not await the call to BasicPublishAsync, and any polling or reuse of this data should be make safe by the user and not by the library copying data and doing unnecessary work.

That way each application can make use of this client in a way that better suits itself.
In my application is very common to publish large payloads (10 to 50 MB), and the copying as described in this PR is a major problem to our memory usage.
For us it would be beneficial for these changes to be incorporated in v7, but if not I guess we can use the GerardSmit fork for a while.

paulomorgado · 2024-01-30T20:12:36Z

IMO the library should not make any copy and simply use the data that it received in BasicPublishAsync. Once the data is passed to the library, the user should await the completion of BasicPublishAsync and only then the data is safe to be modified/reused.

My view is that BasicPublishAsync would only complete once the data is published (or written to the Pipe) and is no longer being utilized by the client. If a user from the library does not wish to wait for the data to be published, he should not await the call to BasicPublishAsync, and any polling or reuse of this data should be make safe by the user and not by the library copying data and doing unnecessary work.

That way each application can make use of this client in a way that better suits itself. In my application is very common to publish large payloads (10 to 50 MB), and the copying as described in this PR is a major problem to our memory usage. For us it would be beneficial for these changes to be incorporated in v7, but if not I guess we can use the GerardSmit fork for a while.

I agree with most of this, except that it's never a good idea to not await in any way async calls.

lukebakken · 2024-06-03T17:04:39Z

In the interest of shipping v7 of this library this month, I have moved this PR to version 8.

Thanks!

SpiritDogstar · 2024-06-21T16:19:38Z

I'm hitting this problem, with memory at high freq grows to gigs and gigs... and slowly, if pausing the stream, begins to drop. What I don't quite understand is, the data I'm sending gets to the backend quickly, so is it a cache not being cleaned out on my C# local publisher? Why is it being held still?

lukebakken · 2024-06-21T16:22:55Z

@SpiritDogstar please start a discussion here - https://github.com/rabbitmq/rabbitmq-dotnet-client/discussions

If you can share code, that would be ideal.

Also important - are you publishing / consuming large messages? What is their average size?

danielmarbach · 2024-09-13T11:54:17Z

Have you considered setting https://learn.microsoft.com/en-us/dotnet/core/runtime-config/garbage-collector#high-memory-percent? This would clear the TLS store of the pool too. That supported since net 6 dotnet/runtime#56316

Another alternative would be to expose setting the pool implementation instead of directly using arraypool shared

Most libraries that apply pooling of calls against the underlying protocol apply some form of copying the input. Changing the API surface of basic publish and all the underlying mechanics that come with it for scenarios that can be solved with for example slightly assigning more runtime memory on the pod that is closer to the usage scenario or providing default more opinionated pool implementations seems quite a bit of burden in terms of complexity to carry forward over the lifetime of the project

I'm not trying to shoot this down per se. I want to give a different perspective that hasn't really been discussed yet here

danielmarbach · 2024-10-30T19:39:50Z

To the point I wrote above there is another approach that might significantly reduce complexity. We could have an overload the allows passing IMemoryOwner<byte>

https://learn.microsoft.com/en-us/dotnet/api/system.buffers.imemoryowner-1?view=net-8.0

Rule #7: If you have an IMemoryOwner reference, you must at some point dispose of it or transfer its ownership (but not both).

Rule #8: If you have an IMemoryOwner parameter in your API surface, you are accepting ownership of that instance.

https://learn.microsoft.com/en-us/dotnet/standard/memory-and-spans/memory-t-usage-guidelines

Then the memory owner could be passed to the channel worker and eventually when the channel sent it, it will release the owner.

Then the calling code can use their own pooling, the internals can bypass copying but leave the rest as is.

Also related dotnet/aspnetcore#38153

lukebakken self-assigned this Dec 15, 2023

lukebakken added this to the 7.0.0 milestone Dec 15, 2023

lukebakken mentioned this pull request Dec 15, 2023

Investigate how other .NET client libraries deal with data from applications #1446

Open

lukebakken self-requested a review December 15, 2023 16:29

GerardSmit and others added 11 commits December 22, 2023 06:02

Reduce memory usage by using the body directly in the pipe

892088d

Merge sequential memories

5e68abb

Add configuration to set the copy threshold

ae7ca17

Added tests for non-copying body

8a59fc6

Fix incorrect test

f59728c

Fix formatting

1f0a4be

Re-enable test

51d6c66

This test was disabled before I've implemented TrackRentedBytesAsync correctly. This should pass now.

Remove ObjectPool

75b8424

This complicates the tests. After returning the rented object back to the pool, the data is resetted.

Improved disposing and fixed tests

70e9376

Fix blocking warning

1f0b20e

Update PublicAPI.Unshipped

7df74eb

lukebakken force-pushed the prototype/reduce-memory-usage branch from dfbd822 to 7df74eb Compare December 22, 2023 14:04

This was referenced Dec 28, 2023

Reduce allocations by using pooled memory and recycling memory streams #694

Closed

Heap size grows when publishing a very large batch of messages #1106

Closed

lukebakken force-pushed the main branch from 722aa36 to bf86447 Compare April 29, 2024 13:46

lukebakken modified the milestones: 7.0.0, 8.0.0 Jun 3, 2024

lukebakken added the next-gen-todo If a rewrite happens, address this issue. label Jun 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce memory usage by using the body directly instead of copying in BasicPublishAsync #1445

Reduce memory usage by using the body directly instead of copying in BasicPublishAsync #1445

GerardSmit commented Dec 15, 2023 •

edited

Loading

lukebakken commented Dec 15, 2023

michaelklishin commented Dec 15, 2023

paulomorgado commented Dec 15, 2023

lukebakken commented Dec 15, 2023

paulomorgado commented Dec 15, 2023

lukebakken commented Dec 15, 2023

GerardSmit commented Dec 15, 2023 •

edited

Loading

lukebakken commented Dec 15, 2023

paulomorgado commented Dec 15, 2023

GerardSmit commented Dec 15, 2023 •

edited

Loading

Official client (copying)

Fork (copying)

Fork (non-copying)

GerardSmit commented Dec 15, 2023

PauloHMattos commented Jan 30, 2024

paulomorgado commented Jan 30, 2024

lukebakken commented Jun 3, 2024

SpiritDogstar commented Jun 21, 2024

lukebakken commented Jun 21, 2024 •

edited

Loading

danielmarbach commented Sep 13, 2024 •

edited

Loading

danielmarbach commented Oct 30, 2024 •

edited

Loading

	int size = Method.FrameSize + Header.FrameSize +
	method.GetRequiredBufferSize() + header.GetRequiredBufferSize() +
	BodySegment.FrameSize * GetBodyFrameCount(maxBodyPayloadBytes, remainingBodyBytes) + remainingBodyBytes;

	// Will be returned by SocketFrameWriter.WriteLoop
	byte[] array = ClientArrayPool.Rent(size);

	while (remainingBodyBytes > 0)
	{
	int frameSize = remainingBodyBytes > maxBodyPayloadBytes ? maxBodyPayloadBytes : remainingBodyBytes;
	offset += BodySegment.WriteTo(array.AsSpan(offset), channelNumber, bodySpan.Slice(bodySpan.Length - remainingBodyBytes, frameSize));
	remainingBodyBytes -= frameSize;
	}

	await _channelWriter.WriteAsync(frames)
	.ConfigureAwait(false);

	while (await _channelReader.WaitToReadAsync().ConfigureAwait(false))
	{
	while (_channelReader.TryRead(out RentedMemory frames))
	{
	try
	{
	await _pipeWriter.WriteAsync(frames.Memory)
	.ConfigureAwait(false);
	RabbitMqClientEventSource.Log.CommandSent(frames.Size);
	}

Reduce memory usage by using the body directly instead of copying in BasicPublishAsync #1445

Are you sure you want to change the base?

Reduce memory usage by using the body directly instead of copying in BasicPublishAsync #1445

Conversation

GerardSmit commented Dec 15, 2023 • edited Loading

Proposed Changes

Technical information

New API

ConnectionFactory

IChannel

Types of Changes

Checklist

Further Comments

lukebakken commented Dec 15, 2023

michaelklishin commented Dec 15, 2023

paulomorgado commented Dec 15, 2023

lukebakken commented Dec 15, 2023

paulomorgado commented Dec 15, 2023

lukebakken commented Dec 15, 2023

GerardSmit commented Dec 15, 2023 • edited Loading

lukebakken commented Dec 15, 2023

paulomorgado commented Dec 15, 2023

GerardSmit commented Dec 15, 2023 • edited Loading

Official client (copying)

Fork (copying)

Fork (non-copying)

GerardSmit commented Dec 15, 2023

PauloHMattos commented Jan 30, 2024

paulomorgado commented Jan 30, 2024

lukebakken commented Jun 3, 2024

SpiritDogstar commented Jun 21, 2024

lukebakken commented Jun 21, 2024 • edited Loading

danielmarbach commented Sep 13, 2024 • edited Loading

danielmarbach commented Oct 30, 2024 • edited Loading

GerardSmit commented Dec 15, 2023 •

edited

Loading

GerardSmit commented Dec 15, 2023 •

edited

Loading

GerardSmit commented Dec 15, 2023 •

edited

Loading

lukebakken commented Jun 21, 2024 •

edited

Loading

danielmarbach commented Sep 13, 2024 •

edited

Loading

danielmarbach commented Oct 30, 2024 •

edited

Loading