Use async client for delete blob or path in S3 Blob Container #16788

Open
wants to merge 6 commits into
base: main

ashking94 (Member) commented Dec 5, 2024

Description

This PR addresses the port exhaustion issue (issue #16883) causing indexing failures and partial snapshots in OpenSearch clusters with high indexing loads. The problem manifests as periodic spikes in 5xx HTTP status codes during indexing operations and "Cannot assign requested address" exceptions in logs, particularly during stale segment deletion.

While an async client already exists, this PR extends its use to cover all S3 blob delete operations. This change aims to significantly reduce port exhaustion by minimizing the creation of new sockets for every delete request under high load.
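
As a rough illustration of the approach described above, the sketch below routes every delete through one shared asynchronous client and groups keys into batches. It is not the code from this PR: `AsyncBulkDeleter`, `partition`, and `deleteAll` are hypothetical names, the interface is only a stand-in for the real S3 async client, the batch size of 1000 is assumed from the S3 multi-object delete limit, and chaining the batches sequentially is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch only: none of these names come from the PR or the AWS SDK.
class AsyncDeleteSketch {

    /** Hypothetical stand-in for a shared async client that deletes one batch of keys. */
    interface AsyncBulkDeleter {
        CompletableFuture<Void> deleteBatch(List<String> keys);
    }

    /** Splits keys into consecutive batches of at most batchSize entries. */
    static List<List<String>> partition(List<String> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            batches.add(new ArrayList<>(keys.subList(i, Math.min(i + batchSize, keys.size()))));
        }
        return batches;
    }

    /**
     * Sends the batches one after another through the same client.
     * The returned future completes after the last batch, or fails with the first error.
     */
    static CompletableFuture<Void> deleteAll(AsyncBulkDeleter client, List<String> keys, int batchSize) {
        CompletableFuture<Void> chain = CompletableFuture.completedFuture(null);
        for (List<String> batch : partition(keys, batchSize)) {
            chain = chain.thenCompose(ignored -> client.deleteBatch(batch));
        }
        return chain;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            keys.add("segments/blob-" + i);
        }
        // Fake client that only counts how many batched requests it receives.
        AtomicInteger requests = new AtomicInteger();
        AsyncBulkDeleter fake = batch -> {
            requests.incrementAndGet();
            return CompletableFuture.completedFuture(null);
        };
        deleteAll(fake, keys, 1000).join();
        System.out.println("delete requests issued: " + requests.get());
    }
}
```

With 2500 keys and a batch size of 1000, the chain issues three batched requests through the same client instance instead of one request per blob.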

Key changes:

  1. S3BlobContainer.java:

    • Refactored delete operations to exclusively use the async client
    • Removed synchronous delete methods, replacing them with async versions
    • Updated error handling and logging for async operations
      • Added the metric publisher hook for List, which was missing in one place.
  2. S3AsyncService.java:

    • Created the retry policy within SocketAccess.doPrivileged to fix access issues. This also brings it in line with the sync client.
    • Removed redundant code
  3. S3RepositoryPlugin.java:

    • Closed the event loop group when the S3RepositoryPlugin is closed; otherwise threads are leaked due to their daemon nature.
  4. BlobStoreRepository.java:

    • Removed SNAPSHOT_ASYNC_DELETION_ENABLE_SETTING as async deletion is now the default
    • Updated deleteContainer and deleteFromContainer methods to use async operations exclusively
  5. Updated test classes to reflect the changes:

    • S3BlobStoreRepositoryTests.java
    • S3RepositoryThirdPartyTests.java
    • S3BlobStoreContainerTests.java
    • S3RepositoryPluginTests.java
  6. Removed references to the now obsolete async deletion setting in ClusterSettings.java

These changes should significantly improve the handling of delete operations in high-load scenarios, preventing port exhaustion and related issues by leveraging the existing async client more extensively.
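
The S3RepositoryPlugin change listed above can be pictured with the following minimal sketch. It is an assumption-laden stand-in, not the plugin's code: a plain `ExecutorService` plays the role of the async client's event loop group, `PluginSketch` is a hypothetical name, and the five-second wait is an arbitrary choice.

```java
import java.io.Closeable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: an ExecutorService stands in for the event loop group.
class PluginSketch implements Closeable {
    private final ExecutorService eventLoop;

    PluginSketch(int threads) {
        // Daemon threads do not block JVM exit, but they stay alive inside a
        // running JVM until the pool that owns them is shut down.
        this.eventLoop = Executors.newFixedThreadPool(threads, runnable -> {
            Thread thread = new Thread(runnable, "sketch-event-loop");
            thread.setDaemon(true);
            return thread;
        });
    }

    ExecutorService eventLoop() {
        return eventLoop;
    }

    /** Shuts the pool down when the owning component closes, so no threads are left behind. */
    @Override
    public void close() {
        eventLoop.shutdown();
        try {
            // Arbitrary timeout chosen for the sketch.
            if (!eventLoop.awaitTermination(5, TimeUnit.SECONDS)) {
                eventLoop.shutdownNow();
            }
        } catch (InterruptedException e) {
            eventLoop.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    boolean isStopped() {
        return eventLoop.isTerminated();
    }

    public static void main(String[] args) throws Exception {
        PluginSketch plugin = new PluginSketch(2);
        plugin.eventLoop().submit(() -> {}).get();
        plugin.close();
        System.out.println("event loop stopped: " + plugin.isStopped());
    }
}
```

Tying the shutdown to `close()` means a test or node that creates and closes the component repeatedly does not accumulate idle threads.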

Related Issues

Resolves #16883 (Port Exhaustion Causing Indexing Failures and Partial Snapshots)

Check List

  • [x] Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot commented Dec 5, 2024

❌ Gradle check result for 384b63a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for 1c58299: FAILURE

❌ Gradle check result for 49d893f: FAILURE

❌ Gradle check result for 81e356d: FAILURE

❌ Gradle check result for de40809: FAILURE

❌ Gradle check result for d9b306e: FAILURE

❌ Gradle check result for d9b306e: FAILURE

❌ Gradle check result for d9b306e: FAILURE

❌ Gradle check result for 1db7150: FAILURE

❌ Gradle check result for 1db7150: FAILURE

✅ Gradle check result for 1db7150: SUCCESS

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 70.83333% with 7 lines in your changes missing coverage. Please review.

Project coverage is 72.18%. Comparing base (b5f651f) to head (1db7150).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rg/opensearch/repositories/s3/S3BlobContainer.java | 45.45% | 6 Missing ⚠️ |
| ...org/opensearch/repositories/s3/S3AsyncService.java | 87.50% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #16788      +/-   ##
============================================
- Coverage     72.21%   72.18%   -0.03%     
+ Complexity    65335    65273      -62     
============================================
  Files          5318     5318              
  Lines        304081   303991      -90     
  Branches      43995    43982      -13     
============================================
- Hits         219578   219425     -153     
- Misses        66541    66576      +35     
- Partials      17962    17990      +28     

github-actions bot added the bug and Storage:Snapshots labels Dec 19, 2024
ashking94 marked this pull request as ready for review December 19, 2024 09:35
ashking94 (Member, Author) commented:

Working on increasing the unit test coverage.

Successfully merging this pull request may close these issues.

[BUG] Port Exhaustion Causing Indexing Failures and Partial Snapshots