Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multipart uploads with checksums on object locked buckets #6794

Merged
merged 4 commits into from
Dec 9, 2024

Conversation

avantgardnerio
Copy link
Contributor

Which issue does this PR close?

Closes #6793.

Rationale for this change

Described in issue

What changes are included in this PR?

A failing test

Are there any user-facing changes?

Multipart uploads fail

@github-actions github-actions bot added the object-store Object Store Interface label Nov 25, 2024
@avantgardnerio avantgardnerio changed the title Failing test for multipart uploads with signatures Failing test for multipart uploads with digests Nov 25, 2024
@avantgardnerio avantgardnerio changed the title Failing test for multipart uploads with digests Failing test for multipart uploads with checksums Nov 25, 2024
@thinkharderdev
Copy link
Contributor

thinkharderdev commented Nov 25, 2024

So there are a couple of issues with how multi-part upload is implemented:

  1. The initial PUT request to CreateMultipartUpload must specify the checksum algo or else calling PutPart with the checksum will fail (eg set x-amz-checksum-algorithm: SHA256 on the request)
  2. When calling CompleteMultipartUpload we need to send all the part checksums in the request (see `https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html)

So this means that we need to keep the checksums in the PartId for when we complete the upload

@tustvold
Copy link
Contributor

I'm not sure about this, an AWS specific detail now leaks into the public trait for all stores... It also is not immediately obvious when the checksums are needed or not...

I think this needs a bit more thought, I'll try to find some time in the next few weeks

@tustvold tustvold added the api-change Changes to the arrow API label Nov 27, 2024
@avantgardnerio
Copy link
Contributor Author

this, an AWS specific detail now leaks into the public trait for all stores... It also is not immediately obvious when the checksums are needed or not...

FWIW: I verified in the new tests we are now writing the checksums at the correct times for when the digest functionality is enabled.

Now I am just trying to get the original tests working for when it isn't.

@thinkharderdev
Copy link
Contributor

It also is not immediately obvious when the checksums are needed or not...

The AWS docs are not very helpful here but we pieced together that to write to an S3 bucket using multi-part with object locking enabled you need to:

  1. Specify the checksum algo in the initial API call to CreateMultiPartUpload
  2. Send the checksum for each part while calling PutPart
  3. When calling CompleteMultipartUpload you need to send the checksum again for each part that was uploaded

The current implementation does 2 but not 1 or 3.

@avantgardnerio
Copy link
Contributor Author

The current implementation does 2 but not 1 or 3.

I'd like to re-iterate that this PR is functionally complete. Although I am unsure about the docs, I am sure it works.

Now I'm just trying to fix the existing s3_test.

@thinkharderdev
Copy link
Contributor

The current implementation does 2 but not 1 or 3.

I'd like to re-iterate that this PR is functionally complete. Although I am unsure about the docs, I am sure it works.

Now I'm just trying to fix the existing s3_test.

Sorry, by current implementation I meant what is on master not what is in this PR (which does it correctly) 😄

Was just trying to clarify for @tustvold based on his earlier comment

@tustvold
Copy link
Contributor

Was just trying to clarify for @tustvold based on his earlier comment

Right, my comment pertained to the changes to the public API, we now have a somewhat peculiar quirk of AWS leaking out into the public API in a manner that also makes it very unclear when or if I need to specify the checksum or not. I need to find some time to sit down and think about how we could support this, ideally without making a breaking API change.

@tustvold
Copy link
Contributor

tustvold commented Nov 30, 2024

So I think there is an less intrusive way to support this.

UploadPart is not an issue as the per-part meta is never exposed.

The issue solely concerns MultipartStore, in particular how to carry the checksum through https://docs.rs/object_store/latest/object_store/multipart/struct.PartId.html.

Now we could add a checksum field to this, but this would be a breaking change. However, the content id is intentionally opaque. We could therefore just use the XML encoded https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompletedPart.html instead of an ETag when necessary.

There would need to still be logic to handle an ETag based PartId, as these may have been stored in a database, but detecting this should be trivial e.g. by attempting to decode the XML.

This would avoid needing to make any breaking changes, whilst also being future proof for any of the other checksum types

@avantgardnerio avantgardnerio force-pushed the bg_multipart_sig branch 3 times, most recently from fc51b10 to 6757413 Compare December 3, 2024 18:43
@avantgardnerio
Copy link
Contributor Author

We could therefore just use the XML encoded

@tustvold done

@avantgardnerio avantgardnerio marked this pull request as ready for review December 3, 2024 19:59
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important we don't add implicit checksum enablement, not only because this could break workloads (e.g. those that aren't actually S3), but also because S3 recently added a load of new checksum algorithms

@avantgardnerio avantgardnerio changed the title Failing test for multipart uploads with checksums Fix multipart uploads with checksums on object locked buckets Dec 4, 2024
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nits

object_store/src/client/s3.rs Outdated Show resolved Hide resolved
object_store/src/aws/client.rs Outdated Show resolved Hide resolved
@tustvold tustvold removed the api-change Changes to the arrow API label Dec 8, 2024
@avantgardnerio avantgardnerio force-pushed the bg_multipart_sig branch 2 times, most recently from d655774 to 0605e23 Compare December 8, 2024 12:56
@avantgardnerio
Copy link
Contributor Author

@tustvold some unrelated tests started failing. Do you know what's up?

https://github.com/apache/arrow-rs/actions/runs/12221975779/job/34091689971?pr=6794

@avantgardnerio avantgardnerio merged commit fcf4aa4 into apache:main Dec 9, 2024
66 checks passed
@avantgardnerio avantgardnerio deleted the bg_multipart_sig branch December 9, 2024 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi-part s3 uploads fail when using checksum
4 participants