
Implementation of Google Dataflow using Cloud Build #1

Open
Shriyut opened this issue Jul 1, 2021 · 31 comments

Comments

@Shriyut

Shriyut commented Jul 1, 2021

Hi Marco,
Just opening this issue so we can communicate further

@mmistroni
Owner

Sure, you can ping me at [email protected]

@mmistroni mmistroni reopened this Jul 1, 2021
@Shriyut
Author

Shriyut commented Jul 2, 2021

Hey Marco,
I'm going through your repository.
Don't we need to copy the metadata JSON file for parameters in the spec directory into the Docker image as well?
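
For context, the kind of metadata file I mean looks roughly like this (a sketch; the parameter names are just illustrative, not the ones in your spec directory):

```json
{
  "name": "my-streaming-pipeline",
  "description": "Flex template parameter metadata (illustrative).",
  "parameters": [
    {
      "name": "input_subscription",
      "label": "Input Pub/Sub subscription",
      "helpText": "Subscription to read messages from.",
      "regexes": ["projects/[^/]+/subscriptions/.+"]
    },
    {
      "name": "output_table",
      "label": "BigQuery output table",
      "helpText": "Table to write results to, as PROJECT:DATASET.TABLE.",
      "isOptional": true
    }
  ]
}
```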

@mmistroni
Owner

mmistroni commented Jul 2, 2021 via email

@Shriyut
Author

Shriyut commented Jul 2, 2021

Okay, I get it now.
This brings me to my next question: the Dataflow commands in run_flow_templates.txt. Do we need to execute them beforehand? Can't they be automated as well, as part of the cloudbuild.yaml file (run.yaml in your repo)?
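
For reference, this is the kind of command sequence I mean (a sketch; the bucket, image name, and region are placeholders):

```sh
# Build and push the pipeline container image (image name is a placeholder).
gcloud builds submit --tag "gcr.io/$PROJECT_ID/dataflow/my-pipeline:latest" .

# Build the flex template spec file in GCS, pointing at that image.
gcloud dataflow flex-template build "gs://my-bucket/templates/my-pipeline.json" \
  --image "gcr.io/$PROJECT_ID/dataflow/my-pipeline:latest" \
  --sdk-language "PYTHON" \
  --metadata-file "spec/metadata.json"

# Launch a Dataflow job from that template spec.
gcloud dataflow flex-template run "my-pipeline-job" \
  --template-file-gcs-location "gs://my-bucket/templates/my-pipeline.json" \
  --region "us-central1"
```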

@mmistroni
Owner

mmistroni commented Jul 2, 2021 via email

@Shriyut
Author

Shriyut commented Jul 4, 2021

Hi Marco,
I think I have some more insight into my issue now, after going through your repo and experimenting a bit on my own.

In my cloudbuild.yaml file, I have included steps for building the image and the flex template as well, before running the job.

I'm guessing you're doing it beforehand rather than automating it via Cloud Build.

Is my approach to provisioning them wrong? Do we need to build the image and flex template manually beforehand?
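
Roughly, the steps in my cloudbuild.yaml look like this (a sketch; the bucket, image name, and region are placeholders, not my exact file):

```yaml
steps:
  # 1. Build and push the pipeline container image.
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'gcr.io/$PROJECT_ID/dataflow/my-pipeline:latest', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'gcr.io/$PROJECT_ID/dataflow/my-pipeline:latest']

  # 2. Build the flex template spec from that image.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: 'gcloud'
    args: ['dataflow', 'flex-template', 'build',
           'gs://my-bucket/templates/my-pipeline.json',
           '--image=gcr.io/$PROJECT_ID/dataflow/my-pipeline:latest',
           '--sdk-language=PYTHON',
           '--metadata-file=spec/metadata.json']

  # 3. Launch a Dataflow job from the template.
  - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
    entrypoint: 'gcloud'
    args: ['dataflow', 'flex-template', 'run', 'my-pipeline-job',
           '--template-file-gcs-location=gs://my-bucket/templates/my-pipeline.json',
           '--region=us-central1']
```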

@mmistroni
Owner

mmistroni commented Jul 4, 2021 via email

@Shriyut
Author

Shriyut commented Jul 4, 2021

I'm trying it with the Dataflow runner each time. Here's the repo where I'm trying to reproduce the quickstart example using Cloud Build: https://github.com/Shriyut/stream-dataflow

@mmistroni
Owner

mmistroni commented Jul 4, 2021 via email

@mmistroni
Owner

mmistroni commented Jul 5, 2021 via email

@Shriyut
Author

Shriyut commented Jul 17, 2021

Hey Marco,
I was a bit preoccupied with other stuff for the past few days.
Will try it out this week.
Thanks a lot for your help

Regards,
Shriyut

@Shriyut
Author

Shriyut commented Jul 31, 2021

Hi Marco,
Hope you're doing well.
I finally started working on this after some time, but I'm getting an error in the Dataflow job:
textPayload: "Failed to read the job file : gs://dataflow-staging-us-central1-86485829124/staging/template_launches/2021-07-31_03_48_03-244546569923956193/job_object with error message: (f6a41fa524f65419): Unable to open template file: gs://dataflow-staging-us-central1-86485829124/staging/template_launches/2021-07-31_03_48_03-244546569923956193/job_object.."

Before this error log I get an info log which says: "py options not set in env", "setup file not set in env", "extra package not set in env".

Have you seen any of these errors before?

@mmistroni
Owner

mmistroni commented Jul 31, 2021 via email

@Shriyut
Author

Shriyut commented Jul 31, 2021

Hi Marco,
This isn't from your project; I tried implementing a flex template for my own use case. Just to have a working demo ready, I didn't keep any input parameters. Here's the GitHub link: https://github.com/Shriyut/checkdataflow.

I have a few print statements in my code and I can see them in the Dataflow logs, so I think the image builds successfully and the code executes as well, but in the Dataflow console the job fails and I don't get any dependency graph. I'm really not sure how to proceed on this.

@Shriyut
Author

Shriyut commented Jul 31, 2021

Also, I tried to run the Python code using another image with all the dependencies installed, using the python command to run the script, but the program runs and does the processing on the Cloud Build machine instead of on Dataflow. Is the --runner=DataflowRunner flag not enough to trigger a Dataflow job via the python command?

@mmistroni
Owner

mmistroni commented Jul 31, 2021 via email

@Shriyut
Author

Shriyut commented Jul 31, 2021

Yeah, I hope so. Locally it runs fine on my machine; it doesn't work on Dataflow in GCP, and that's the tricky part. I tried adding --runner=DataflowRunner along with the project ID and temp location, but it still only runs locally.
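
Roughly what I'm passing (placeholders for the project, bucket, and region, not my exact values):

```sh
python main.py \
  --runner=DataflowRunner \
  --project=my-project \
  --region=us-central1 \
  --temp_location=gs://my-bucket/temp
```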

@mmistroni
Owner

mmistroni commented Jul 31, 2021 via email

@mmistroni
Owner

mmistroni commented Aug 1, 2021 via email

@Shriyut
Author

Shriyut commented Aug 1, 2021

I'm testing it in a test environment where I have the project owner permission, so I don't think it could be because of that.
The flex template is still giving me the same issue. Meanwhile, using the python command via Cloud Build I can run the job with the Dataflow runner (the issue was missing pipeline options configuration in the code), but I'm not sure why the DoFn class throws another error because of the http.client library I'm using.
I receive a NameError for http.

Error message from worker: Traceback (most recent call last): File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process File "main.py", line 112, in process NameError: name 'http' is not defined During handling of the above exception, another exception occurred: Traceback (most recent call last)

Further down the error log it mentions:

line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process File "main.py", line 112, in process NameError: name 'http' is not defined [while running 'Make API call & Perform CDC'] Note: imports, functions and other variables defined in the global context of your main file of your Dataflow pipeline are, by default, not available in the worker execution environment, and such references will cause a NameError, unless the --save_main_session pipeline option is set to True. Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors for additional documentation on configuring your worker execution environment.

Have you ever encountered anything like this?
Next, I'll try the --save_main_session flag to see whether that resolves the error.
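
For reference, the two usual fixes look roughly like this (a sketch; CallApiDoFn and the endpoint are made up, not my actual code):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions


class CallApiDoFn(beam.DoFn):
    """Hypothetical DoFn that makes an HTTP call per element."""

    def process(self, element):
        # Fix 1: import inside the DoFn so the module exists on the workers.
        import http.client
        conn = http.client.HTTPSConnection("example.com")
        conn.request("GET", "/")
        yield conn.getresponse().status


def run(argv=None):
    options = PipelineOptions(argv)
    # Fix 2: equivalent to passing --save_main_session on the command line;
    # it pickles the main session so global imports reach the workers.
    options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=options) as p:
        (p
         | "Create" >> beam.Create(["one element"])
         | "Make API call" >> beam.ParDo(CallApiDoFn())
         | "Print status" >> beam.Map(print))


if __name__ == "__main__":
    run()
```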

Again, thanks a lot for your help; I don't know where I would be without your guidance. You've guided me more than my superiors have.

@mmistroni
Owner

mmistroni commented Aug 1, 2021 via email

@mmistroni
Owner

mmistroni commented Aug 6, 2021 via email

@mmistroni
Owner

mmistroni commented Aug 7, 2021 via email

@Shriyut
Author

Shriyut commented Aug 8, 2021

Hey Marco,
Thanks for sharing this link. I'm going through it and will try to implement it tomorrow. The client is being a real pain in the ass; I need to write half of the code in Java now.

@mmistroni
Owner

mmistroni commented Aug 8, 2021 via email

@Shriyut
Author

Shriyut commented Aug 9, 2021

Hi Marco,
Where have you shared the zip file?

@mmistroni
Owner

mmistroni commented Aug 9, 2021 via email

@mmistroni
Owner

mmistroni commented Aug 9, 2021 via email

@Shriyut
Author

Shriyut commented Aug 9, 2021

Yeah, it's [email protected]

@mmistroni
Owner

mmistroni commented Aug 9, 2021 via email

@Shriyut
Author

Shriyut commented Aug 13, 2021

Hey Marco,
Thanks a lot for guiding me, I was able to execute my use case based on the template that you shared with me.
You can delete the zip file from your repo.

Thanks,
Shriyut
