overhead and workload invocation sequences diverge #2305

AndyAyersMS · 2023-05-13T16:31:15Z

If these don't match, benchmark times will be either underestimated or overestimated.

This is the "good" case where the jit isn't messing up the workload invoke codegen (which just makes the problem worse).

        public delegate System.Single OverheadDelegate();

        private void OverheadActionUnroll(System.Int64 invokeCount)
        {
            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate());

        public delegate System.Numerics.Vector3 WorkloadDelegate();

        private void WorkloadActionUnroll(System.Int64 invokeCount)
        {
            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(workloadDelegate().X);

;; overhead

       mov      rbp, gword ptr [rsi+38H]
       mov      rax, gword ptr [rsi+28H]
       mov      rcx, gword ptr [rax+08H]
       call     [rax+18H]BenchmarkDotNet.Autogenerated.Runnable_0+OverheadDelegate:Invoke():float:this
       vmovss   dword ptr [rbp+48H], xmm0

;; workload

       mov      rbp, gword ptr [rsi+38H]
       mov      rax, gword ptr [rsi+30H]
       mov      rcx, gword ptr [rax+08H]
       lea      rdx, [rsp+110H]
       call     [rax+18H]BenchmarkDotNet.Autogenerated.Runnable_0+WorkloadDelegate:Invoke():System.Numerics.Vector3:this
       vmovss   xmm0, dword ptr [rsp+110H]
       vmovss   dword ptr [rbp+48H], xmm0

This may be restricted to certain return types like Vector; I haven't seen this elsewhere.
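To make the consequence concrete, here is a minimal sketch of the arithmetic, assuming the reported time is the measured workload loop time minus the measured overhead loop time. The `OverheadMath` helper and every number in it are hypothetical and chosen only for illustration; they are not measurements.

```csharp
public static class OverheadMath
{
    // Hypothetical per-iteration costs in nanoseconds (illustrative, not measured).
    public const double CallAndStoreNs = 1.0;   // delegate call + store, present in both loops
    public const double ExtraSequenceNs = 0.5;  // lea + reload, present only in the workload loop
    public const double BenchmarkBodyNs = 2.0;  // the code the user actually wants to time

    // Assumed model: reported time = workload loop time - overhead loop time.
    public static double Reported(double workloadLoopNs, double overheadLoopNs)
        => workloadLoopNs - overheadLoopNs;

    // The workload loop pays for the shared sequence, the extra sequence, and the body.
    public static double WorkloadLoopNs => CallAndStoreNs + ExtraSequenceNs + BenchmarkBodyNs;

    // The overhead loop pays only for the shared sequence, so the extra
    // instructions are never subtracted and get attributed to the benchmark.
    public static double OverheadLoopNs => CallAndStoreNs;
}
```

Under these assumed numbers the benchmark body costs 2.0 ns, but the subtraction yields 2.5 ns: the 0.5 ns of invocation code that exists only on the workload side is counted as benchmark time.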

AndreyAkinshin · 2023-05-14T13:59:25Z

@AndyAyersMS Thanks for the bug report! I can confirm that it is a severe issue that affects all the benchmarks that return structs (except IntPtr, UIntPtr). The minimal repro:

public struct MyStruct
{
    public int Value;
}

public class Benchmarks
{
    [Benchmark]
    public MyStruct Foo() => new MyStruct();
}

However, it's not clear to me how to properly fix this problem. Below are some of my thoughts on this.

Firstly, we have two primary ways to consume the return value of the workload method:

consumer.Consume(workloadDelegate().Value); // Workload:1
consumer.Consume(workloadDelegate()); // Workload:2

The first one (Workload:1) is the current default. We don't use the second one (Workload:2) right now because it has object wrapping as a side-effect (Consume can natively consume only primitive types, IntPtr, UIntPtr, object).
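The difference between the two forms can be sketched with a stand-in consumer. `SketchConsumer` below is hypothetical and is not BenchmarkDotNet's actual Consumer type; it only assumes that the consumer has an overload for a primitive and a catch-all `object` overload, so that passing a whole struct binds to the `object` overload and boxes.

```csharp
using System;

public struct MyStruct
{
    public int Value;
}

// Hypothetical stand-in for the consumer, reduced to two overloads.
public sealed class SketchConsumer
{
    public int IntHolder;
    public object ObjectHolder = new object();
    public string LastOverload = "";

    // Natively consumed: no allocation, the int is stored as-is.
    public void Consume(int value)
    {
        IntHolder = value;
        LastOverload = "int";
    }

    // Catch-all: a struct argument is boxed to fit the object parameter.
    public void Consume(object value)
    {
        ObjectHolder = value;
        LastOverload = "object";
    }
}

public static class WorkloadForms
{
    // Workload:1 style: consume a single field of the returned struct.
    public static string ConsumeField(SketchConsumer consumer, MyStruct result)
    {
        consumer.Consume(result.Value);
        return consumer.LastOverload;
    }

    // Workload:2 style: consume the struct as a whole, which boxes it here.
    public static string ConsumeWhole(SketchConsumer consumer, MyStruct result)
    {
        consumer.Consume(result);
        return consumer.LastOverload;
    }
}
```

With this stand-in, `ConsumeField` binds to the `int` overload and `ConsumeWhole` binds to the `object` overload, which is where the wrapping cost comes from.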

Secondly, we have several ways to define the overhead delegate:

// Overhead:1
private System.Int32 __Overhead()
{
    return default(System.Int32);
}

// Overhead:2
private MyStruct __Overhead()
{
    return new MyStruct();
}

// Overhead:3
private MyStruct value;

private MyStruct __Overhead()
{
    return value;
}

Here are some comments about all of these options:

Overhead:1 is the current default: it always prefers a type that the Consumer can consume natively. Unfortunately, it matches neither Workload:1 (which has an additional .Value accessor) nor Workload:2.
Overhead:2 matches Workload:2, but it calls the MyStruct constructor as a side effect. If the workload doesn't create a new struct (e.g., it reads a value from a field), such an Overhead implementation is incorrect.
Overhead:3 matches Workload:2, but it reads a field as a side effect. If the workload doesn't read a field (e.g., it creates a new MyStruct instance), such an Overhead implementation is incorrect.

Since we don't know in advance how the given benchmark obtains the struct value, it's quite hard to provide a proper Overhead implementation that matches the Workload implementation. Therefore, BenchmarkDotNet uses Overhead:1 to reduce the risk of a mismatched implementation. Unfortunately, this leads to other issues, as described in dotnet/runtime#86033.

At the moment, I have only one idea of how to provide a proper baseline for benchmarks that return structs:

We introduce our own single-field struct in the BenchmarkDotNet template:

public struct OverheadStruct
{
    public int Value;
}

The overhead method returns a new instance of this struct:

private OverheadStruct __Overhead()
{
    return new OverheadStruct();
}

Since OverheadStruct contains a single field, it should provide a nice baseline for struct creation (at least for structs that contain at least one field).

For both overheadDelegate and workloadDelegate, we pass a field value to the consumer (whenever field consumption is applicable):

        public delegate OverheadStruct OverheadDelegate();

        private void OverheadActionUnroll(System.Int64 invokeCount)
        {

            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(overheadDelegate().Value);


        public delegate System.Numerics.Vector3 WorkloadDelegate();

        private void WorkloadActionUnroll(System.Int64 invokeCount)
        {
            for (System.Int64 i = 0; i < invokeCount; i++)
            {
                consumer.Consume(workloadDelegate().X);

I don't feel like it is a perfect solution, but it should (seemingly) provide a more accurate baseline for benchmarks that return structs.

@AndyAyersMS @adamsitnik What do you think?

timcassell · 2023-05-14T14:16:33Z

@AndreyAkinshin What about adding a Consumer<T> type to consume the struct as a whole? (Obviously would not work with ref struct, but neither does the current consumer.)

timcassell · 2023-05-14T14:21:32Z

> it has object wrapping as a side-effect

Also, if you use Consume<T>(in T value), it does not box the value (nullables fixed in #2191).
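A rough way to observe the difference is to compare managed allocations on the two paths. The sketch below is an assumption-laden illustration, not BenchmarkDotNet code: `BoxingSketch`, `ConsumeObject`, and `ConsumeGeneric` are hypothetical names, and `GC.GetAllocatedBytesForCurrentThread` is used only as a convenient allocation counter.

```csharp
using System;
using System.Runtime.CompilerServices;

public struct MyStruct
{
    public int Value;
}

public static class BoxingSketch
{
    public static object ObjectSink = new object();
    public static int SizeSink;

    // Object-typed parameter: a struct argument has to be boxed at the call site.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ConsumeObject(object value) => ObjectSink = value;

    // Generic 'in' parameter: T is the struct type itself, so no box is needed.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ConsumeGeneric<T>(in T value) => SizeSink = Unsafe.SizeOf<T>();

    public static long BytesAllocatedByObjectPath(int iterations)
    {
        var s = new MyStruct { Value = 1 };
        ConsumeObject(s); // warm-up call so one-time costs are not counted
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            ConsumeObject(s);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    public static long BytesAllocatedByGenericPath(int iterations)
    {
        var s = new MyStruct { Value = 1 };
        ConsumeGeneric(in s); // warm-up call so one-time costs are not counted
        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            ConsumeGeneric(in s);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

The object path is expected to allocate on every iteration, while the generic path should allocate little or nothing; exact byte counts depend on the runtime and are deliberately not stated here.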

AndreyAkinshin · 2023-05-14T14:33:20Z

@timcassell With such an approach, Consume will spend some time copying the fields of the given value to the holder struct instance. It would be another unpleasant side effect that has to be reproduced in Overhead.

timcassell · 2023-05-14T14:37:45Z

> @timcassell With such an approach, Consume will spend some time copying the fields of the given value to the holder struct instance. It would be another unpleasant side effect that has to be reproduced in Overhead.

I would assume that any consume code would be replicated in overhead regardless, no?

Anyway, I was thinking of simplifying it to this.

public unsafe void Consume<T>(in T value)
    => ptrHolder = (IntPtr) Unsafe.AsPointer(ref Unsafe.AsRef(in value));

Then it won't matter how large the struct is, it's only grabbing its reference. What do you think? (The current implementation passes it to DeadCodeEliminationHelper.KeepAliveWithoutBoxingReadonly.)
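The claim that an `in` parameter only carries a reference can be checked without `unsafe` code. The sketch below swaps `Unsafe.AsPointer` for `Unsafe.AreSame` purely for illustration; `BigStruct`, `Holder`, and `InParameterSketch` are hypothetical names, not part of the proposal.

```csharp
using System.Runtime.CompilerServices;

// A deliberately large struct (eight 8-byte fields).
public struct BigStruct
{
    public long A, B, C, D, E, F, G, H;
}

public sealed class Holder
{
    public BigStruct Big;
}

public static class InParameterSketch
{
    // True when 'value' aliases 'original': the callee received a reference, not a copy.
    public static bool IsSameStorage<T>(in T value, ref T original)
        => Unsafe.AreSame(ref Unsafe.AsRef(in value), ref original);

    // By-value parameter for contrast: 'value' is a fresh copy with its own storage.
    public static bool CopyIsSameStorage<T>(T value, ref T original)
        => Unsafe.AreSame(ref value, ref original);

    public static bool PassedByReference()
    {
        var holder = new Holder();
        return IsSameStorage(in holder.Big, ref holder.Big);
    }

    public static bool PassedByValue()
    {
        var holder = new Holder();
        return CopyIsSameStorage(holder.Big, ref holder.Big);
    }
}
```

`PassedByReference` returns true because the `in` argument refers to the field itself, while `PassedByValue` returns false because the struct is copied. This is why the size of the struct should not affect the cost of an `in`-based Consume.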

timcassell · 2023-05-14T14:54:19Z

And I think the overhead method can be changed to this:

private MyStruct __Overhead()
{
    Unsafe.SkipInit(out MyStruct value);
    return value;
}

That matches the workload type, and avoids the cost of field reading and constructor call and zero-initializing.
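The same idea can be written once as a generic helper. This is only a sketch: in the proposal the overhead method is emitted per benchmark with the concrete return type, and the generic `OverheadSketch.Overhead<T>` below is a hypothetical variation introduced here for illustration.

```csharp
using System.Runtime.CompilerServices;

public struct MyStruct
{
    public int Value;
}

public static class OverheadSketch
{
    // Returns a T without calling a constructor, reading a field,
    // or explicitly zero-initializing the local.
    // The contents of the returned value are unspecified and must not be relied on.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static T Overhead<T>() where T : struct
    {
        Unsafe.SkipInit(out T value);
        return value;
    }
}
```

A caller would use it as `MyStruct s = OverheadSketch.Overhead<MyStruct>();` and treat `s` as an opaque placeholder that has the workload's return type.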

AndreyAkinshin · 2023-05-14T15:04:11Z

> Then it won't matter how large the struct is, it's only grabbing its reference.

Sounds good.

> That matches the workload type, and avoids the cost of field reading and constructor call and zero-initializing.

It's a great idea, I like it!

Do you want to send a PR?

timcassell · 2023-05-14T15:06:33Z

> Do you want to send a PR?

Sure thing.

MichalPetryka · 2023-05-14T15:08:00Z

Why is a consumer even needed here? Isn't it enough to just do a NoInline method call?

AndyAyersMS · 2023-05-14T15:09:57Z

Another thought is to not optimize these methods; that way the return value can be unconsumed but presumably would always still be produced. But it would make overhead higher, which is probably less desirable.
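One way to read this suggestion is to mark the invocation loop with `MethodImplOptions.NoOptimization` and drop the consumer. That reading is an assumption, and `UnoptimizedLoopSketch` below is a hypothetical sketch rather than the actual template code.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class UnoptimizedLoopSketch
{
    public delegate Vector3 WorkloadDelegate();

    // The JIT is asked not to optimize this method, so the unused
    // result of each call is not a candidate for elimination here.
    [MethodImpl(MethodImplOptions.NoOptimization)]
    public static long RunUnconsumed(WorkloadDelegate workload, long invokeCount)
    {
        long calls = 0;
        for (long i = 0; i < invokeCount; i++)
        {
            workload(); // result intentionally discarded
            calls++;
        }
        return calls;
    }
}
```

The trade-off noted above still applies: unoptimized loop code is slower, so the overhead being subtracted would be larger.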

timcassell · 2023-05-14T15:16:55Z

> Why is a consumer even needed here? Isn't it enough to just do a NoInline method call?

I thought the same thing. There was a short discussion about that in #2173.
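For reference, the NoInline alternative could look roughly like the sketch below. It assumes that passing the result to a method the JIT is not allowed to inline is enough to keep the value alive; `NoInlineSketch` and `Sink` are hypothetical names, and whether this is sufficient in practice is the open question discussed in #2173.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

public static class NoInlineSketch
{
    public static int SinkCalls;

    // Not inlined, so the caller cannot see that the argument is unused
    // and has to materialize it for every call.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Sink<T>(T value) => SinkCalls++;

    public static void Run(Func<Vector3> workload, long invokeCount)
    {
        for (long i = 0; i < invokeCount; i++)
        {
            Sink(workload());
        }
    }
}
```

The overhead loop would call the same `Sink` with the overhead delegate's result, so both loops share one invocation shape.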

AndyAyersMS mentioned this issue May 13, 2023

Regressions in System.Numerics.Tests.Perf_Vector3 dotnet/runtime#86033

Closed

timcassell mentioned this issue May 17, 2023

Overhead match workload #2309

Closed

timcassell linked a pull request Jun 21, 2023 that will close this issue

Fair Return Types #2336

Open

timcassell mentioned this issue Aug 4, 2023

[Perf] Linux/arm64: 268 Regressions on 7/29/2023 7:04:01 PM dotnet/runtime#89940

Closed

timcassell self-assigned this Aug 10, 2023
