Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Google Storage Requester pays feature to Golang SDK. #33236

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

LeoCBS
Copy link

@LeoCBS LeoCBS commented Nov 27, 2024

Setting UserProject on Google Storage Bucket operations to enable requester pays feature.

Requester pays project ID will come from environment variable named BILLING_PROJECT_ID

More information about Google storage requester pays feature here https://cloud.google.com/storage/docs/requester-pays

fixes #30747

Setting UserProject on Google Storage Bucket operations to enable requester pays
feature.

Requester pays project ID will come from environment variable named
`BILLING_PROJECT_ID`

More information about Google storage requester pays feature here https://cloud.google.com/storage/docs/requester-pays
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck for label go.
R: @chamikaramj for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Contributor

github-actions bot commented Dec 5, 2024

Reminder, please take a look at this pr: @lostluck @chamikaramj

Copy link
Contributor

github-actions bot commented Dec 9, 2024

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @jrmccluskey for label go.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

Copy link

codecov bot commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 93.75000% with 1 line in your changes missing coverage. Please review.

Project coverage is 57.41%. Comparing base (ad8545c) to head (bfc9077).
Report is 152 commits behind head on master.

Files with missing lines Patch % Lines
sdks/go/pkg/beam/io/filesystem/gcs/gcs.go 93.75% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #33236      +/-   ##
============================================
- Coverage     58.92%   57.41%   -1.51%     
+ Complexity     3112     1475    -1637     
============================================
  Files          1133      970     -163     
  Lines        174889   154431   -20458     
  Branches       3343     1076    -2267     
============================================
- Hits         103045    88661   -14384     
+ Misses        68495    63569    -4926     
+ Partials       3349     2201    -1148     
Flag Coverage Δ
go 34.65% <93.75%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@lostluck lostluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This largely looks good, and it does seem that the storage package code handles an empty string there properly, despite not being in the documentation.

https://pkg.go.dev/cloud.google.com/go/storage#BucketHandle.UserProject

The main concern I have is whether the environment variable is automatically set in GCE VMs or not.

Have you tried this on Google Cloud Dataflow? (it's OK if you haven't, or your usage of Beam doesn't require it, but it's something I'll need to verify to ensure the feature works well...).

Once you get back to me on that, I'm happy to merge this in.

@lostluck
Copy link
Contributor

I think the only thing I don't really love is using an environment variables for this, as it's not necessarily going to be set or not, and there doesn't seem to be any standardization around it within Beam or GCP (outside of a few scattered unrelated examples).

It would probably be better to do something as a configurable hook at Pipeline submission time, that is triggered when a worker is initialized.

@LeoCBS
Copy link
Author

LeoCBS commented Dec 12, 2024

Hi @lostluck, thank you for your feedback. I will change the implementation to get the billing project from the pipeline variable instead of using the environment variable.

@lostluck
Copy link
Contributor

If it were me, I'd add a function to the GCS package that sets the value when the fs is constructed. eg. SetRequesterBillingProject(project string)

Then do something like the harness opts do for the side input cache:

Defines and installs the hook URN.

hooks.RegisterHook("beam:go:hook:sideinputcache:capacity", hf)

Has a method to be called at before pipeline submission to enable the hook, and call SetRequesterBillingProject with the value stored in the hook. Note, it just needs to be an arbitrary string (which could then be any serialized data that you like). This gets stored as a pipeline option, so it can be retrieved worker side.

https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/util/harnessopts/cache.go

This would affect the whole pipeline (just as this change does), but if a pipeline author wants to do trickier bits where the billing requester is changing, then we can figure out other options.
Ultimately, the goal is to enable most uses, but not all uses.

Copy link
Contributor

Reminder, please take a look at this pr: @jrmccluskey @ahmedabu98

@LeoCBS
Copy link
Author

LeoCBS commented Dec 22, 2024

@lostluck check if i understand your suggestion:

in file sdks/go/pkg/beam/io/filesystem/gcs/gcs.go i will register a hook to get billing project information:

var (
    billingProject string = ""
)

func init() {
	hf := func(opts []string) hooks.Hook {
		return hooks.Hook{
			Init: func(ctx context.Context) (context.Context, error) {
				if len(opts) == 0 {
					return ctx, nil
				}
				if len(opts) > 1 {
					return ctx, fmt.Errorf("expected 1 option, got %v: %v", len(opts), opts)
				}

				billingProject = opts[0]
				return ctx, nil
			},
		}
	}
	hooks.RegisterHook("beam:go:hook:gcstorage:requesterbillingproject", hf)
}

and on beam/sdks/go/pkg/beam/util/harnessopts/gcs.go i will create a function like this?

const (
	billingProjectHook = "beam:go:hook:requesterbillingproject:capacity"
)

func SetRequesterBillingProject(billingProject string) error {
	return hooks.EnableHook(billingProjectHook, billingProject)
}

I have some questions:

* who will call function SetRequesterBillingProject?
* How will i test if hook was called correcty?
* Will i passing billing project by CLI args? like the runner value, `--runner dataflow`?

Copy link
Contributor

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @lostluck for label go.
R: @damondouglas for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@lostluck
Copy link
Contributor

@LeoCBS Thanks for your patience over this holiday season.

  1. That looks about right!
  2. There's no need to put the enabling function in the harnessopts package. The harness options were split up for package privacy reasons. Due to how tightly bound this feature is to GCS, please keep enabling function in the GCS filesystem package.
  3. Pipelines that would need to have that set will need to call it in the before pipeline submission in the binary's 'main'. It must be before pipeline submission, but doesn't have to be before beam.Init. Note that whatever is doing the call for this is allowed to use whatever environment variable you like. You can automate it that way if that's how you prefer. (eg. A final user specific package to call it at package init time, that looks for your environment variable, and then calls the enabler method.) (or as you suggest, a flag, that gets set by the binary, and then is used when calling the enabling function).
  4. You can add a test in the GCS filesystem package, that calls hooks.RunInitHooks -> https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/core/util/hooks/hooks.go#L115 That will execute the enabled hooks. You should be able to verify that the billingProject package variable was set as expected there.

Copy link
Contributor

@lostluck lostluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just swapping this to requestChanges from approved since the change is being made anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request]: Access GCS bucket with requester pays
2 participants