
[BUG] Nginx access log was manipulated incorrectly by multi-line concatenation behavior #32111

Closed
toan-hf opened this issue Dec 12, 2024 · 10 comments

Comments

@toan-hf

toan-hf commented Dec 12, 2024

Agent Environment

agent version 7.59.1
helm version 3.81.0

Describe what happened:

Since we upgraded the Datadog Agent from 7.52.0 to the latest version, 7.59.1, the Nginx access log has been manipulated incorrectly by the multi-line concatenation behavior.

[screenshots of the incorrectly combined Nginx access log lines]

Using agent status, we can quickly confirm that log line combination is actively happening.

[screenshot of the agent status output]

I think this issue comes from the recent refactoring that was introduced here

Describe what you expected:

The log lines should not be combined as shown above; each line should remain separate.

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

AWS EKS 1.29.0

@toan-hf changed the title from "[BUG]" to "[BUG] Nginx access log was manipulated incorrectly by multi-line concatenation behavior" on Dec 12, 2024
@toan-hf
Author

toan-hf commented Dec 13, 2024

@gh123man since this was recently implemented in your PR, I would love to hear your thoughts here :)

@gh123man
Member

gh123man commented Dec 13, 2024

Hi @toan-hf, thanks for the ping!
Can you share any details around your logs configuration?

  • Are you using logs_config.auto_multi_line_detection?
  • Do you have a custom regex pattern defined for this log source?

From looking at the screenshot - I can take a guess:

It looks like you may be hitting a case of "mixed format logs", e.g. your log file contains both JSON and non-JSON logs.
If you are using logs_config.auto_multi_line_detection, this can happen when the agent starts up and detects the wrong pattern, and it can appear to happen somewhat randomly since it is influenced by the application startup behavior. In this case, it looks like it detected the non-JSON log with a timestamp as the "start pattern", and the rest of the JSON logs get combined together incorrectly.
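
For illustration, here is a hypothetical mixed-format file (made-up lines, not from your environment) and the kind of output incorrect aggregation can produce once the timestamped line is chosen as the start pattern:

  Raw log file:
    2024/12/12 10:15:32 [error] 31#31: *1 connect() failed while connecting to upstream
    {"status": 200, "request": "GET /healthz", "body_bytes_sent": 612}
    {"status": 404, "request": "GET /missing", "body_bytes_sent": 153}

  Possible aggregated output (the JSON lines appended to the timestamped line):
    2024/12/12 10:15:32 [error] 31#31: *1 connect() failed while connecting to upstream {"status": 200, "request": "GET /healthz", "body_bytes_sent": 612} {"status": 404, "request": "GET /missing", "body_bytes_sent": 153}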

The code you linked to is actually designed to fix this problem, among others (however, it is disabled at the moment while we test it).

Let me know if my assessment is accurate. If you can confirm the agent configuration settings you are using, I can make some suggestions on how to resolve it.

@toan-hf
Author

toan-hf commented Dec 14, 2024

@gh123man Thanks for your swift response. I can quickly confirm:

  1. Yes, we are using logs_config.auto_multi_line_detection across all our environments; this config is set via the environment variable shown below.
    [screenshot of the environment variable]

  2. We don't have any regex pattern defined for the nginx log today.

However, I am aware that the Nginx access_log format can be customized, while the error_log format cannot.

Here is an example snippet for the access log format:

 log_format upstreaminfo '{{ if $cfg.useProxyProtocol }}$proxy_protocol_addr{{ else }}$remote_addr{{ end }} - '
        '[$proxy_add_x_forwarded_for] - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" '
        '$request_length $request_time [$proxy_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status';
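
For illustration only, a hypothetical line rendered by the format above (values invented) would look like this, one line per request:

  203.0.113.10 - [203.0.113.10] - - [12/Dec/2024:10:15:32 +0000] "GET /healthz HTTP/1.1" 200 612 "-" "kube-probe/1.29" 132 0.004 [default-my-service-80] 10.0.1.23:8080 612 0.004 200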

So because of this, I think the error log is in syslog (non-JSON) format rather than JSON, as you said.

  1. We rolled back to version 7.52.0 and it seems pretty OK, which is why I was a bit unsure about your recent change; I thought it was the cause, and I am sorry if my assumption is incorrect.

  2. Regarding the Nginx log itself and how it is collected: we emit "source": "nginx-ingress-controller" on all logs from the Nginx workload.

[screenshot]

In the Datadog Web UI, the default (standard) pipeline keys off the source we emit above.

[screenshot of the Datadog pipeline]

I am looking forward to hearing your suggestion here. Thank you very much

@gh123man
Member

Thanks for confirming the configuration!

  1. We rolled back to version 7.52.0 and it seems pretty OK, which is why I was a bit unsure about your recent change; I thought it was the cause, and I am sorry if my assumption is incorrect.

No worries! If you do not expect multi-line logs from the nginx source, I would suggest disabling auto multi-line detection just for this source. You can do this by setting auto_multi_line_detection: false in the logs integration config (docs). This will prevent the incorrect aggregation from happening in the future, and you should be able to upgrade to the newer agent version.

  2. Regarding the Nginx log itself and how it is collected: we emit "source": "nginx-ingress-controller" on all logs from the Nginx workload.
    In the Datadog Web UI, the default (standard) pipeline keys off the source we emit above.

This sounds reasonable to me. Let me know if my above suggestion works.
Happy to help!

@toan-hf
Author

toan-hf commented Dec 16, 2024

@gh123man thanks for your hint.

The key thing I am trying to understand a bit further is what exactly the difference in auto_multi_line_detection is between 7.52.0 and 7.59.1.

As far as I understand, auto_multi_line_detection was enabled on version 7.52.0; we set it globally in the Datadog Agent configuration. In short, auto_multi_line_detection has been applied broadly to Nginx Ingress as well as other log sources.

logs_config:
  auto_multi_line_detection: true
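
(A minimal sketch of the equivalent environment-variable form, assuming the standard logs_config to DD_ env var mapping:)

  # assumed equivalent of logs_config.auto_multi_line_detection: true
  DD_LOGS_CONFIG_AUTO_MULTI_LINE_DETECTION: "true"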

Now, with the new version 7.59.1, disabling it specifically for Ingress Nginx only is fine, but I am still a bit unclear about how this feature actually works.

I appreciate your insight.

@gh123man
Member

The key thing I am trying to understand a bit further is what exactly the difference in auto_multi_line_detection is between 7.52.0 and 7.59.1.

There should be no difference. The changes for auto_multi_line_detection V2 are independent and don't change the existing behavior. I think the problem is - auto_multi_line_detection (v1) is nondeterministic in how it detects a pattern. When the log source restarts or first appears, depending on the distribution of log formats, the agent could choose a different pattern and as a result, change the aggregation behavior.

NOTE that this particular behavior is fixed in V2 (which we are working on rolling out). Our goal is to remove this ambiguity around how the feature works.

@toan-hf
Author

toan-hf commented Dec 19, 2024

Thank you for your explanation @gh123man. We have had a chance to review the code, and what we found is quite similar to what you explained.
We followed your PR here and were able to use the Datadog Agent telemetry to identify the log combination behavior while the incident was occurring.

[screenshot of the agent telemetry]

Since that change introduces a few new patterns that can match the Nginx log formats (including the JSON access log and the syslog-format error log), auto_multi_line_detection was triggered and concatenated many Nginx log lines.

My last question, and it would be awesome if you could clarify: do we have any option to disable the log combination feature individually, without disabling the auto_multi_line_detection feature entirely?
If not, when will this feature (auto_multi_line_detection v2) be officially rolled out?

Thank you very much for your guidance.

@gh123man
Member

My last question, and it would be awesome if you could clarify: do we have any option to disable the log combination feature individually, without disabling the auto_multi_line_detection feature entirely?

I hope I am understanding correctly: Yes.
If you have this (in datadog.yaml):

logs_config:
  auto_multi_line_detection: true

you may disable it per source with:

logs:
  - type: file
    path: /my/file.log
    service: testApp
    source: java
    auto_multi_line_detection: false

(or, if in Kubernetes, the same way via the pod annotation).
Here are the docs that cover this

This will ensure all sources still use auto_multi_line_detection unless it is explicitly disabled via the logs integration config.
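
As a rough sketch of the pod-annotation form (the container name "controller" and the service value here are placeholders; the source matches the one you emit):

  metadata:
    annotations:
      # "controller" and "nginx-ingress" are hypothetical values for this example
      ad.datadoghq.com/controller.logs: '[{"source": "nginx-ingress-controller", "service": "nginx-ingress", "auto_multi_line_detection": false}]'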

If not, when will this feature (auto_multi_line_detection v2) be officially rolled out?

I don't have a firm date, but the feature is complete, and we are just waiting to enable it in the config while we do some extra testing on our end. I'd expect it to be fully released a few agent versions from now (sometime in early 2025).

Just a heads up, I'll be away until the new year starting tomorrow, so my responses may be delayed.
I'll hand this thread over to one of my teammates if you have any other questions in the meantime, or always feel free to escalate through Datadog support.

@toan-hf
Author

toan-hf commented Dec 19, 2024

Hmm, I understand we can disable multi-line detection per service/source, but what I am asking is whether I can disable the log combination feature individually, since it is part of the multi-line detection feature.

I guess we don't have that for now

Thank you, we can close this case

@gh123man
Member

Ah, yes, sorry I misunderstood.
Today, detection and aggregation (combination) are one and the same.
However, thanks for the feedback and the idea. In V2, these concepts are more decoupled, so conceptually we could perform detection without actually combining the logs.

I'll evaluate this and see if it fits into our design as a future feature 👍
