Remove inlined function PC resolution for stack traces, remove DWARF caching #32166

grantseltzer · 2024-12-13T18:13:17Z

What does this PR do?

Removes the map of inlined function PCs and every use of it, so inlined functions no longer appear in stack traces. This eliminates a large amount of memory usage.
Removes caching of DWARF information for each inspected binary. This fixes a large memory leak.
Filters return parameters out of instrumentation for now.

Motivation

We noticed Go DI using a large amount of memory within the system probe and determined that the program counter map was a primary culprit.

We were also caching the DWARF information for each binary. That data was only used when a new probe was created, which is infrequent enough that it is not worth storing, and the cache was never cleared.
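As a rough illustration of the no-cache approach described above, the sketch below opens a binary, reads its DWARF data once, and closes the file before returning. It uses only the Go standard library (`debug/elf`, `debug/dwarf`); the function name `functionNames` and the choice to collect subprogram names are assumptions made for this example, not code from this PR.

```go
package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"os"
)

// functionNames is a hypothetical helper: it opens the ELF binary at path,
// walks its DWARF entries once, and returns the names of all subprograms.
// Nothing is cached. The file is closed on return, and the parsed DWARF
// data can be garbage collected once the caller is done with the result.
func functionNames(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open %s: %w", path, err)
	}
	defer f.Close()

	data, err := f.DWARF()
	if err != nil {
		return nil, fmt.Errorf("load DWARF from %s: %w", path, err)
	}

	var names []string
	r := data.Reader()
	for {
		entry, err := r.Next()
		if err != nil {
			return nil, fmt.Errorf("read DWARF entry: %w", err)
		}
		if entry == nil {
			break // end of the DWARF info section
		}
		if entry.Tag != dwarf.TagSubprogram {
			continue
		}
		if name, ok := entry.Val(dwarf.AttrName).(string); ok {
			names = append(names, name)
		}
	}
	return names, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: dwarfnames <elf-binary>")
		os.Exit(1)
	}
	names, err := functionNames(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("found %d subprograms\n", len(names))
}
```

The trade-off is that DWARF is parsed again each time a probe is created, which is acceptable only because probe creation is infrequent.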

Describe how you validated your changes

Working on producing a local profiling comparison.

Here is an example of the events emitted for the sample not_inlined() function:

After:

{
  "service": "go-di-sample-service",
  "message": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.not_inlined {\"entry\":{}}",
  "ddsource": "dd_debugger",
  "ddtags": "",
  "logger": {
    "name": "",
    "method": ""
  },
  "debugger": {
    "snapshot": {
      "id": "8aad8538-b8ee-11ef-b1a5-001c42659120",
      "timestamp": 1734052014013,
      "language": "go",
      "probe": {
        "id": "e504163d-f367-4522-8905-fe8bc34eb975",
        "location": {
          "method": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.not_inlined"
        }
      },
      "captures": {
        "entry": {}
      },
      "stack": [
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.not_inlined",
          "lineNumber": 51
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.call_inlined_func_chain",
          "lineNumber": 30
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.ExecuteStackAndInlining",
          "lineNumber": 56
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/sample_service/sample_service.go",
          "function": "main.main",
          "lineNumber": 19
        }
      ]
    }
  },
  "duration": 0
}

Before:

{
  "service": "go-di-sample-service",
  "message": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.not_inlined {\"entry\":{}}",
  "ddsource": "dd_debugger",
  "ddtags": "",
  "logger": {
    "name": "",
    "method": ""
  },
  "debugger": {
    "snapshot": {
      "id": "c517407a-b8ee-11ef-a5da-001c42659120",
      "timestamp": 1734052112014,
      "language": "go",
      "probe": {
        "id": "e504163d-f367-4522-8905-fe8bc34eb975",
        "location": {
          "method": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.not_inlined"
        }
      },
      "captures": {
        "entry": {}
      },
      "stack": [
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.not_inlined",
          "lineNumber": 51
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.inline_me_3 [inlined in github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.call_inlined_func_chain]",
          "lineNumber": 41
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.inline_me_2 [inlined in github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.call_inlined_func_chain]",
          "lineNumber": 36
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.inline_me_1 [inlined in github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.call_inlined_func_chain]",
          "lineNumber": 31
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.call_inlined_func_chain",
          "lineNumber": 30
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/stacktraces.go",
          "function": "github.com/DataDog/datadog-agent/pkg/dynamicinstrumentation/testutil/sample.ExecuteStackAndInlining",
          "lineNumber": 56
        },
        {
          "fileName": "/home/vagrant/datadog-agent/pkg/dynamicinstrumentation/testutil/sample/sample_service/sample_service.go",
          "function": "main.main",
          "lineNumber": 19
        }
      ]
    }
  },
  "duration": 0
}
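One way to read the two payloads: the After stack is the Before stack without the frames whose function name carries the "[inlined in ...]" suffix. The sketch below only illustrates that relationship. The `StackFrame` type and `dropInlined` helper are hypothetical, and the function names are abbreviated. The PR itself removes the resolution step rather than filtering frames after the fact.

```go
package main

import (
	"fmt"
	"strings"
)

// StackFrame mirrors one entry of the "stack" array in the events above.
// The Go type is hypothetical and exists only for this illustration.
type StackFrame struct {
	FileName   string
	Function   string
	LineNumber int
}

// dropInlined returns the frames that are not labeled "[inlined in ...]".
func dropInlined(frames []StackFrame) []StackFrame {
	kept := make([]StackFrame, 0, len(frames))
	for _, f := range frames {
		if strings.Contains(f.Function, " [inlined in ") {
			continue
		}
		kept = append(kept, f)
	}
	return kept
}

func main() {
	// Abbreviated copy of the Before stack (package paths shortened).
	before := []StackFrame{
		{"stacktraces.go", "sample.not_inlined", 51},
		{"stacktraces.go", "sample.inline_me_3 [inlined in sample.call_inlined_func_chain]", 41},
		{"stacktraces.go", "sample.inline_me_2 [inlined in sample.call_inlined_func_chain]", 36},
		{"stacktraces.go", "sample.inline_me_1 [inlined in sample.call_inlined_func_chain]", 31},
		{"stacktraces.go", "sample.call_inlined_func_chain", 30},
		{"stacktraces.go", "sample.ExecuteStackAndInlining", 56},
		{"sample_service.go", "main.main", 19},
	}
	for _, f := range dropInlined(before) {
		fmt.Printf("%s:%d %s\n", f.FileName, f.LineNumber, f.Function)
	}
}
```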

agent-platform-auto-pr · 2024-12-13T18:45:11Z

Package size comparison

Comparison with ancestor f483cc4c0dffec6f7bcd4824dc8df65c15957e27

Diff per package

package	diff	status	size	ancestor	threshold
datadog-agent-amd64-deb	0.00MB	✅	1265.94MB	1265.94MB	140.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.28MB	113.28MB	10.00MB
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.52MB	78.52MB	10.00MB
datadog-heroku-agent-amd64-deb	0.00MB	✅	502.50MB	502.50MB	70.00MB
datadog-agent-x86_64-rpm	0.00MB	⚠️	1275.18MB	1275.18MB	140.00MB
datadog-agent-x86_64-suse	0.00MB	⚠️	1275.18MB	1275.18MB	140.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	⚠️	113.35MB	113.35MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	⚠️	113.35MB	113.35MB	10.00MB
datadog-dogstatsd-x86_64-rpm	-0.00MB	✅	78.59MB	78.59MB	10.00MB
datadog-dogstatsd-x86_64-suse	-0.00MB	✅	78.59MB	78.59MB	10.00MB
datadog-agent-arm64-deb	-0.00MB	✅	1001.02MB	1001.02MB	140.00MB
datadog-iot-agent-arm64-deb	-0.00MB	✅	108.76MB	108.76MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.74MB	55.74MB	10.00MB
datadog-agent-aarch64-rpm	-0.00MB	✅	1010.23MB	1010.24MB	140.00MB
datadog-iot-agent-aarch64-rpm	-0.00MB	✅	108.83MB	108.83MB	10.00MB

Decision

⚠️ Warning

agent-platform-auto-pr · 2024-12-13T18:45:41Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv aws.create-vm --pipeline-id=51775098 --os-family=ubuntu

Note: This applies to commit 580cbf7

cit-pr-commenter · 2024-12-13T19:14:22Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: e8e9ef55-71b9-4141-88b3-031e78379439

Baseline: f93202d
Comparison: 580cbf7
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	tcp_syslog_to_blackhole	ingress throughput	+0.26	[+0.20, +0.32]	1	Logs
➖	quality_gate_idle	memory utilization	+0.25	[+0.22, +0.29]	1	Logs bounds checks dashboard
➖	file_tree	memory utilization	+0.18	[+0.04, +0.31]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.05	[-0.81, +0.91]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.03	[-0.75, +0.81]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.01	[-0.88, +0.91]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	+0.01	[-0.46, +0.48]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	-0.00	[-0.83, +0.83]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.01	[-0.13, +0.11]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.04	[-0.69, +0.61]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	-0.05	[-0.68, +0.59]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.05	[-0.84, +0.73]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.46	[-0.54, -0.38]	1	Logs bounds checks dashboard
➖	quality_gate_logs	% cpu utilization	-0.82	[-4.02, +2.38]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-1.01	[-1.69, -0.33]	1	Logs

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	links
❌	file_to_blackhole_0ms_latency_http2	lost_bytes	7/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
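The three criteria above can be written as a single predicate. This is a minimal sketch under the thresholds stated in the explanation (5.00% effect size tolerance, confidence interval excluding zero, not erratic). The `Experiment` type and `isRegression` function are hypothetical names and are not part of the Regression Detector.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment holds the values reported for one row of the results table.
// The type is hypothetical and used only for this illustration.
type Experiment struct {
	Name      string
	DeltaMean float64 // Δ mean %
	CILow     float64 // lower bound of Δ mean % CI
	CIHigh    float64 // upper bound of Δ mean % CI
	Erratic   bool    // whether the configuration marks it "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold from the explanation.
const effectSizeTolerance = 5.0

// isRegression reports whether all three criteria hold.
func isRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMean) >= effectSizeTolerance
	excludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && excludesZero && !e.Erratic
}

func main() {
	// Two rows taken from the table above.
	rows := []Experiment{
		{"tcp_syslog_to_blackhole", 0.26, 0.20, 0.32, false},
		{"quality_gate_logs", -0.82, -4.02, 2.38, false},
	}
	for _, e := range rows {
		fmt.Printf("%s: regression=%v\n", e.Name, isRegression(e))
	}
}
```

For example, tcp_syslog_to_blackhole has a confidence interval that excludes zero, but its |Δ mean %| of 0.26 is far below the 5.00% tolerance, so it is not flagged.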

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.

agent-platform-auto-pr · 2024-12-24T16:57:20Z

[Fast Unit Tests Report]

On pipeline 51775098 (CI Visibility). The following jobs did not run any unit tests:

Jobs:

tests_windows-x64

If you modified Go files and expected unit tests to run in these jobs, please double-check the job logs. If you think tests should have been executed, reach out to #agent-devx-help.

grantseltzer · 2024-12-24T17:04:04Z

I've removed inline function resolution altogether.

agent-platform-auto-pr · 2024-12-24T17:31:49Z

Uncompressed package size comparison

Comparison with ancestor f93202d5b1c2065d3b9ab0b50ba833c30467466c

Diff per package

package	diff	status	size	ancestor	threshold
datadog-dogstatsd-amd64-deb	0.00MB	✅	78.57MB	78.57MB	10.00MB
datadog-dogstatsd-x86_64-rpm	0.00MB	✅	78.65MB	78.65MB	10.00MB
datadog-dogstatsd-x86_64-suse	0.00MB	✅	78.65MB	78.65MB	10.00MB
datadog-dogstatsd-arm64-deb	0.00MB	✅	55.77MB	55.77MB	10.00MB
datadog-heroku-agent-amd64-deb	0.00MB	✅	505.17MB	505.17MB	70.00MB
datadog-iot-agent-amd64-deb	0.00MB	✅	113.34MB	113.34MB	10.00MB
datadog-iot-agent-x86_64-rpm	0.00MB	✅	113.41MB	113.41MB	10.00MB
datadog-iot-agent-x86_64-suse	0.00MB	✅	113.41MB	113.41MB	10.00MB
datadog-iot-agent-arm64-deb	0.00MB	✅	108.81MB	108.81MB	10.00MB
datadog-iot-agent-aarch64-rpm	0.00MB	✅	108.88MB	108.88MB	10.00MB
datadog-agent-amd64-deb	-0.00MB	✅	1190.77MB	1190.77MB	140.00MB
datadog-agent-x86_64-rpm	-0.00MB	✅	1200.06MB	1200.06MB	140.00MB
datadog-agent-x86_64-suse	-0.00MB	✅	1200.06MB	1200.06MB	140.00MB
datadog-agent-arm64-deb	-0.01MB	✅	935.06MB	935.06MB	140.00MB
datadog-agent-aarch64-rpm	-0.01MB	✅	944.33MB	944.33MB	140.00MB

Decision

✅ Passed

cimi

👍 Thanks!

grantseltzer · 2024-12-24T20:32:03Z

/merge

dd-devflow · 2024-12-24T20:32:11Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2024-12-24 20:32:11 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 34m.

2024-12-24 21:27:16 UTC ℹ️ MergeQueue: This merge request was merged

grantseltzer added the component/system-probe, team/dynamic-instrumentation (Dynamic Instrumentation), and qa/done (QA done before merge and regressions are covered by tests) labels Dec 13, 2024

grantseltzer requested review from a team as code owners December 13, 2024 18:13

github-actions bot added the medium review (PR review might take time) label and removed the component/system-probe label Dec 13, 2024

grantseltzer changed the title ~~Add config for inlined functions in program counter resolution for Go DI~~ Add config for inlined functions in program counter resolution for Go DI, remove DWARF caching Dec 13, 2024

brycekahle reviewed Dec 13, 2024

pkg/config/setup/system_probe.go (outdated, resolved)

brycekahle reviewed Dec 13, 2024

pkg/dynamicinstrumentation/diconfig/dwarf.go (resolved)

brycekahle reviewed Dec 13, 2024

pkg/dynamicinstrumentation/testutil/sample/stacktraces.go (outdated, resolved)

cimi reviewed Dec 20, 2024

pkg/dynamicinstrumentation/ditypes/config.go (outdated, resolved)

pkg/dynamicinstrumentation/di.go (outdated, resolved)

github-actions bot added the long review (PR is complex, plan time to review it) label and removed the medium review (PR review might take time) label Dec 21, 2024

grantseltzer changed the title ~~Add config for inlined functions in program counter resolution for Go DI, remove DWARF caching~~ Remove inlined function PC resolution for stack traces, remove DWARF caching Dec 24, 2024

grantseltzer added the changelog/no-changelog label Dec 24, 2024

grantseltzer added 7 commits December 24, 2024 11:04

Add config variable for whether or not to collect program counters that correspond with inlined subroutines

8b1223a

Signed-off-by: grantseltzer <[email protected]>

Filter out return values for now

06cf643

Signed-off-by: grantseltzer <[email protected]>

Linting fix

d3b76b7

Signed-off-by: grantseltzer <[email protected]>

Removing caching of DWARF info completely

824436a

Signed-off-by: grantseltzer <[email protected]>

Add back return to sample function

4604f33

Signed-off-by: grantseltzer <[email protected]>

Close elf file

d558ac7

Signed-off-by: grantseltzer <[email protected]>

Completely remove inlined function program counter resolution

580cbf7

Signed-off-by: grantseltzer <[email protected]>

grantseltzer force-pushed the grantseltzer/DEBUG-3207-add-config-for-inlined-functions-in-program-counter-resolution branch from 5c42da2 to 580cbf7 Compare December 24, 2024 17:04

cimi approved these changes Dec 24, 2024

grantseltzer requested review from brycekahle and removed request for brycekahle December 24, 2024 20:31

dd-mergequeue bot merged commit 272716f into main Dec 24, 2024
298 checks passed

dd-mergequeue bot deleted the grantseltzer/DEBUG-3207-add-config-for-inlined-functions-in-program-counter-resolution branch December 24, 2024 21:27

github-actions bot added this to the 7.62.0 milestone Dec 24, 2024
