Implementation of Google Dataflow using Cloud Build #1
Comments
Sure, you can ping me at [email protected] |
Hey Marco, |
Hi
Hmm, I have checked all my Dockerfiles and I never copied it, for some reason.
I think you don't need to, because the metadata file is passed to the command you launch as part of the flex template setup:
gcloud dataflow flex-template build $TEMPLATE_PATH --image "$TEMPLATE_IMAGE" --sdk-language "PYTHON" --metadata-file spec/template_metadata
Please check out all my run_flow_templates.txt files.
Hope that helps, good luck
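For context, here is a minimal sketch of what a flex-template metadata file can look like, generated with Python. The template name, description, and the two parameters (year and quarter, borrowed from the run.yaml later in this thread) are assumptions for illustration, not the contents of the actual spec/template_metadata file. Since the file is handed to gcloud dataflow flex-template build through --metadata-file, that would explain why it does not need to be copied into the Docker image.

```python
import json


def build_template_metadata():
    """Return a flex-template metadata dict. All values are illustrative."""
    return {
        "name": "Example pipeline",  # hypothetical template name
        "description": "Illustrative metadata for a flex template.",
        "parameters": [
            {
                "name": "year",  # assumed parameter, mirrors --year in run.yaml
                "label": "Year",
                "helpText": "Year to process, e.g. 2018.",
                "isOptional": True,
                "regexes": ["^[0-9]{4}$"],
            },
            {
                "name": "quarter",  # assumed parameter, mirrors --quarter
                "label": "Quarter",
                "helpText": "Quarter to process, e.g. QTR1.",
                "isOptional": True,
                "regexes": ["^QTR[1-4]$"],
            },
        ],
    }


if __name__ == "__main__":
    # Write this output to the file you pass via --metadata-file.
    print(json.dumps(build_template_metadata(), indent=2))
```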
…On Fri, Jul 2, 2021, 7:18 PM Shriyut Jha ***@***.***> wrote:
Hey Marco,
I'm going through your repository.
The metadata json file for parameters in spec directory, don't we need to
copy it to the docker image as well?
|
Okay, I get it now. |
That's what I posted to David, as I'm not sure. The image needs to be built. I'm guessing that if you run with the direct runner you won't need it, but what about the Dataflow runner? The workers would need to know the image.
I will check David's samples and try.
Right now I have another issue: I cannot open an external URL when using flex / Docker. I've got to find out why before I spam the list.
Good luck
…On Fri, Jul 2, 2021, 8:03 PM Shriyut Jha ***@***.***> wrote:
Okay, I get it now.
This brings me to my next question, the dataflow commands in
run_flow_templates.txt
Do we need to execute them beforehand? Can't they be automated as well, as
part of cloudbuild.yaml file run.yaml in your repo
|
Hi Marco, In my cloudbuild.yaml file, I have included steps for building the image and flex-template as well, before running the job. I'm guessing you're doing it beforehand and not automating it via cloud build. Is my approach to provision them wrong? Do we need to build the image and flex template manually beforehand ? |
Hi, I would speculate that if you run using the direct runner you will be fine.
I'm not sure, though, what happens if you use the Dataflow runner, as everything will be distributed. Please give it a try and let me know.
Regards
…On Sun, Jul 4, 2021, 12:32 PM Shriyut Jha ***@***.***> wrote:
Hi Marco,
I think I have some more insight now for my issue after going through your
repo and experimenting a bit on my own.
In my cloudbuild.yaml file, I have included steps for building the image
and flex-template as well, before running the job.
I'm guessing you're doing it beforehand and not automating it via cloud
build.
Is my approach to provision them wrong? Do we need to build the image and
flex template manually beforehand ?
|
I'm trying it with dataflow runner each time, here's the repo where I'm trying to repeat the quickstart example using cloud build https://github.com/Shriyut/stream-dataflow |
Hmm, if it works then great. I started by following the Google samples, where these steps were prerequisites, but I didn't see such steps in David's samples.
…On Sun, Jul 4, 2021, 1:15 PM Shriyut Jha ***@***.***> wrote:
I'm trying it with dataflow runner each time, here's the repo where I'm
trying to repeat the quickstart example using cloud build
https://github.com/Shriyut/stream-dataflow
|
Hi
OK, I have done some more experiments on my pipeline. I have moved to an approach where I am building my image using a run.yaml file, similar to David's example.
This is how I launch it:
gcloud beta builds submit --config run.yaml --substitutions _REGION=$REGION --no-source
And here's run.yaml (I haven't committed it to my repo yet, as I need to clear up some hardcoded stuff):
substitutions:
  _IMAGE: my_logic:latest2
  _JOB_NAME: 'pipelinerunner'
  _TEMP_LOCATION: ''
  _REGION: us-central1
steps:
- name: gcr.io/$PROJECT_ID/$_IMAGE
  entrypoint: python
  args:
  - /dataflow/template/main.py
  - --runner=DirectRunner
  - --project=$PROJECT_ID
  - --region=$_REGION
  - --job_name=$_JOB_NAME
  - --temp_location=$_TEMP_LOCATION
  - --sdk_container_image=gcr.io/$PROJECT_ID/$_IMAGE
  - --disk_size_gb=50
  - --year=2018
  - --quarter=QTR1
  - --setup_file=/dataflow/template/setup.py
options:
  logging: CLOUD_LOGGING_ONLY
# Use the Compute Engine default service account to launch the job.
serviceAccount: projects/$PROJECT_ID/serviceAccounts/$***@***.***
Now, it appears that the image I am declaring here
substitutions:
  _IMAGE: my_logic:latest2
must already exist, because as soon as you kick off the build, Cloud Build will try to fetch the image. So if I don't build the image beforehand, my command fails.
Unless you have found a different way to get it working: as far as I know, the Dockerfile is only used after you launch
gcloud beta builds submit --config run.yaml --substitutions _REGION=$REGION --no-source
So I end up in a catch-22 situation, where the Dockerfile is supposed to build the image, but gcloud beta builds submit expects the image to be there already.
Let me know what you find.
Regards
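One possible way out of the catch-22 described above is to put the image build in the same Cloud Build config, as an earlier step than the one that runs it. The sketch below builds such a config as a Python dict and prints it as JSON (Cloud Build accepts JSON config files as well as YAML). It rests on several assumptions and is not the config from either repo: the step layout is hypothetical, the pipeline arguments are a subset of those in run.yaml, and the build would have to be submitted with the source directory containing the Dockerfile, i.e. without --no-source.

```python
import json


def build_then_run_config(image="my_logic:latest2", region="us-central1"):
    """Return a Cloud Build config that builds the image before using it."""
    image_uri = "gcr.io/$PROJECT_ID/$_IMAGE"
    return {
        "substitutions": {"_IMAGE": image, "_REGION": region},
        "steps": [
            # Step 1: build the image from the Dockerfile in the submitted source.
            {
                "name": "gcr.io/cloud-builders/docker",
                "args": ["build", "-t", image_uri, "."],
            },
            # Step 2: push it, so that anything outside this build can pull it.
            {
                "name": "gcr.io/cloud-builders/docker",
                "args": ["push", image_uri],
            },
            # Step 3: only now use the freshly built image as the step image.
            {
                "name": image_uri,
                "entrypoint": "python",
                "args": [
                    "/dataflow/template/main.py",
                    "--project=$PROJECT_ID",
                    "--region=$_REGION",
                    f"--sdk_container_image={image_uri}",
                    # Remaining pipeline arguments from run.yaml omitted here.
                ],
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_then_run_config(), indent=2))
```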
|
Hey Marco, Regards, |
Hi Marco, Before this error log I get an info log which says: "py options not set in env", "setup file not set in env", "extra package not set in env". Have you seen any of these errors before? |
Hmm, is that from my GitHub project? Last time I tried it, it worked, but I have improved it since by moving all variables to Cloud Build and kicking off the job via a Cloud Build trigger.
At which step do you have the issue? You might want to pay attention to the logs while building the image, as they might give you hints on what is wrong.
From the logs it looks like some of your input params are confusing Cloud Build.
I will try my pipeline again tomorrow and let you know.
Hope that helps
…On Sat, Jul 31, 2021, 11:57 AM Shriyut Jha ***@***.***> wrote:
Hi Marco,
Hope you're doing well.
I finally started working on this after some time, but I'm getting an error in the Dataflow job:
textPayload: "Failed to read the job file :
gs://dataflow-staging-us-central1-86485829124/staging/template_launches/2021-07-31_03_48_03-244546569923956193/job_object
with error message: (f6a41fa524f65419): Unable to open template file:
gs://dataflow-staging-us-central1-86485829124/staging/template_launches/2021-07-31_03_48_03-244546569923956193/job_object.."
Before this error log I get an info log which says: "py options not set in env", "setup file not set in env", "extra package not set in env".
Have you seen any of these errors before?
|
Hi Marco, I have a few print statements in my code and I can see them in the Dataflow logs, so I think the image builds successfully and the code executes as well, but in the Dataflow console the job fails and I don't get any dependency graph. I'm really not sure how to proceed on this. |
Also, I tried to run the Python code using another image with all the dependencies installed, using the python command to run the script, but the program runs and does the processing on the Cloud Build machine instead of on Dataflow. Is the --runner=DataflowRunner flag not enough to trigger a Dataflow job via the python command? |
You need to use DirectRunner to run locally. I will download your project and have a go tomorrow morning. Don't worry, it'll be something very small to fix.
Regards
…On Sat, Jul 31, 2021, 1:29 PM Shriyut Jha ***@***.***> wrote:
Also, I tried to run the python code using another image with all the
dependency installed and using the python command to run the script but the
program runs and does the processing in cloud build machine instead of
dataflow, is --runner=DataflowRunner flag not enough to trigger a dataflow
job via python command?
|
Yeah, I hope so. Locally it runs fine on my machine, but it doesn't work on Dataflow in GCP; that's the tricky part. I tried adding --runner=DataflowRunner along with the project ID and temp location, but it still only runs locally. |
Thanks, I'll check it out.
…On Sat, Jul 31, 2021, 3:13 PM Shriyut Jha ***@***.***> wrote:
Yeah I hope so, locally it runs fine on my machine, it doesnt work on
dataflow in gcp, thats the tricky part, tried adding runner=DataflowRunner
along with project id and temp location but still it only runs locally
|
OK, got your project; still debugging.
I have tried this:
1 - Run the template from the Dataflow console (you can create a new job from there). It fails with the same error log you mention.
2 - Run the template from the console, using the list of commands below (I had to do it this way because I don't have access to your project, so I had to use some environment variables). The created job still fails with the same error.
I'm going to inspect the Dockerfile and see what's going on.
export PROJECT_ID="$(gcloud config get-value project)"
export TEMPLATE_BUCKET=mm_dataflow_bucket/templates
export TEMPLATE_PATH=gs://$TEMPLATE_BUCKET/newbeam.json
export TEMPLATE_IMAGE=gcr.io/$PROJECT_ID/new-beam:latest
gcloud builds submit --project=$PROJECT_ID --tag $TEMPLATE_IMAGE .
gcloud dataflow flex-template build $TEMPLATE_PATH --image "$TEMPLATE_IMAGE" --sdk-language "PYTHON" --metadata-file metadata.json
gcloud dataflow flex-template run "testeragain" \
  --template-file-gcs-location "$TEMPLATE_PATH" \
  --region "$REGION"
I was stuck with similar problems when I started with flex templates. We'll get to a solution, don't worry.
By the way, if you say it's working locally but not on the Dataflow runner, could you check that you have the following permissions in your IAM?
***@***.***
Hope that helps,
Marco
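The command sequence above could also be scripted. The following is a sketch only: it assembles the same three gcloud invocations as argument lists and prints them, using only flags that appear in this thread. The function name and the example project ID are hypothetical, and nothing is executed unless you pass the lists to subprocess.run yourself.

```python
import shlex


def flex_template_commands(project_id, template_path, template_image, region,
                           job_name="testeragain",
                           metadata_file="metadata.json"):
    """Return the build-image, build-template and run-template commands."""
    build_image = [
        "gcloud", "builds", "submit",
        f"--project={project_id}", "--tag", template_image, ".",
    ]
    build_template = [
        "gcloud", "dataflow", "flex-template", "build", template_path,
        "--image", template_image,
        "--sdk-language", "PYTHON",
        "--metadata-file", metadata_file,
    ]
    run_template = [
        "gcloud", "dataflow", "flex-template", "run", job_name,
        "--template-file-gcs-location", template_path,
        "--region", region,
    ]
    # Order matters: the image must exist before the template references it.
    return [build_image, build_template, run_template]


if __name__ == "__main__":
    commands = flex_template_commands(
        project_id="my-project",  # hypothetical project ID
        template_path="gs://mm_dataflow_bucket/templates/newbeam.json",
        template_image="gcr.io/my-project/new-beam:latest",
        region="us-central1",
    )
    for command in commands:
        print(shlex.join(command))
```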
|
I'm testing it in a test environment where I have project owner permission so I don't think it could be because of that.
Error message from worker:
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "main.py", line 112, in process
NameError: name 'http' is not defined
During handling of the above exception, another exception occurred:
Traceback (most recent call last)
Further down the error log it mentions:
line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "main.py", line 112, in process
NameError: name 'http' is not defined [while running 'Make API call & Perform CDC']
Note: imports, functions and other variables defined in the global context of your main file of your Dataflow pipeline are, by default, not available in the worker execution environment, and such references will cause a NameError, unless the --save_main_session pipeline option is set to True. Please see https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors for additional documentation on configuring your worker execution environment.
Have you ever encountered anything like this?
Again, thanks a lot for your help, I don't know where I would be without your guidance. You've guided me more than my superiors |
Thanks. At the moment I can't get the code to run via Cloud Build. I will try to simplify the code and perhaps write a test to see if I can spot anything.
Will keep you posted
…On Sun, Aug 1, 2021, 3:02 PM Shriyut Jha ***@***.***> wrote:
I'm testing it in a test environment where I have project owner permission
so I don't think it could be because of that.
Flex template is still giving me the same issue, meanwhile using python
command via cloud build I can run the job with dataflow runner ( the issue
was missing pipeline options configuration in code) but not sure why DoFn
class throws another error because of the http.client library that I'm
using.
I receive a name error for HTTP
Error message from worker: Traceback (most recent call last): File
"apache_beam/runners/common.py", line 1233, in
apache_beam.runners.common.DoFnRunner.process File
"apache_beam/runners/common.py", line 582, in
apache_beam.runners.common.SimpleInvoker.invoke_process File "main.py",
line 112, in process NameError: name 'http' is not defined During handling
of the above exception, another exception occurred: Traceback (most recent
call last)
Further down the error log it mentions:
line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process File
"main.py", line 112, in process NameError: name 'http' is not defined
[while running 'Make API call & Perform CDC'] Note: imports, functions and
other variables defined in the global context of your *main* file of your
Dataflow pipeline are, by default, not available in the worker execution
environment, and such references will cause a NameError, unless the
--save_main_session pipeline option is set to True. Please see
https://cloud.google.com/dataflow/faq#how-do-i-handle-nameerrors for
additional documentation on configuring your worker execution environment.
Have you ever encountered anything like this?
I'll try to use --save_main_session flag to see if that resolves the error
or not next
Again, thanks a lot for your help, I don't know where I would be without
your guidance. You've guided me more than my superiors
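The NameError quoted above is the kind the Dataflow note describes: a module imported at the top of main.py is not automatically available inside a DoFn on the workers. Besides setting --save_main_session, one common workaround is to import the module inside the method that uses it. The sketch below is an assumption about how that could look, not the code from the repo: ApiCallFn is a hypothetical stand-in that in a real pipeline would subclass apache_beam.DoFn, and it only constructs a connection object rather than calling any real API.

```python
class ApiCallFn:
    """Hypothetical stand-in for a Beam DoFn (would subclass apache_beam.DoFn)."""

    def process(self, host):
        # Import inside the method, so the name 'http' is bound when the
        # method runs on the worker instead of relying on main.py globals.
        import http.client

        # Constructing the connection does not open a socket; a real DoFn
        # would go on to call conn.request(...) and read the response.
        conn = http.client.HTTPSConnection(host, timeout=10)
        yield host, conn.port


if __name__ == "__main__":
    print(list(ApiCallFn().process("example.com")))
```

The same idea applies to any other top-level import the DoFn relies on; keeping the import local trades a little repetition for not depending on the pickled main session.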
|
Hey Shriyut,
I am debugging your repo again.
There was a flex template presentation recently at Beam Summit. Please have a look at this repo and see if it can help you out:
https://github.com/apichick/beam-summit-2021-flex-template
In the meantime I will keep you posted on my progress.
Regards
|
Hey,
Good news: I managed to get your code to run on the Dataflow runner.
I had to make a few modifications to the code:
1 - Simplify the main. You don't need complex code to test your flex template; once you have it running you can then enhance it.
2 - Apparently you are not passing any PipelineOptions, so the pipeline runs with the default DirectRunner.
3 - It seems your syntax for running the template via the Cloud SDK is not quite right.
I attach the zip file which I used.
Please note that I have split the build and the run into two separate YAML files.
Let me know if this helps at all
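Point 2 above matches a common pattern in Beam's Python examples: parse your own arguments with argparse, then hand the remaining ones (such as --runner, --project, --temp_location) to PipelineOptions, otherwise they never reach the pipeline and it falls back to the default runner. The sketch below is a guess at the shape of the fix, not the contents of the zip: the --year and --quarter arguments are taken from run.yaml earlier in the thread, the pipeline body is a placeholder, and the Beam import is kept inside run() so the argument-splitting part can be tried without Beam installed.

```python
import argparse


def split_args(argv=None):
    """Split the pipeline's own arguments from the Beam/Dataflow ones."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--year", default="2018")      # assumed, from run.yaml
    parser.add_argument("--quarter", default="QTR1")   # assumed, from run.yaml
    # Unrecognised flags (e.g. --runner) are returned untouched in a list.
    return parser.parse_known_args(argv)


def run(argv=None):
    # Imported here so split_args() can be exercised without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    known_args, pipeline_args = split_args(argv)
    # Without this step, --runner=DataflowRunner never reaches the pipeline.
    options = PipelineOptions(pipeline_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([(known_args.year, known_args.quarter)])
            | "Print" >> beam.Map(print)  # placeholder for the real transforms
        )


if __name__ == "__main__":
    known, rest = split_args(
        ["--year=2018", "--runner=DataflowRunner", "--project=my-project"]
    )
    print(known.year, rest)
```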
|
Hey Marco, |
OK, good luck. I sent you a zip with a working example. I simplified the code; it should give you a start.
Good luck
…On Sun, Aug 8, 2021, 5:02 PM Shriyut Jha ***@***.***> wrote:
Hey Marco,
Thanks for sharing this link, I'm going through it will try to implement
it tomorrow, client's being a real pain in the ass, need to write half of
the code in java now
|
Hi Marco, |
It was attached as a zip. It might have been rejected by your mail, I guess?
…On Mon, Aug 9, 2021 at 1:02 PM Shriyut Jha ***@***.***> wrote:
Hi Marco,
Where have you shared the zip file
|
Do you have a private email account, so I can send the zip? I am guessing that replying to this thread goes through GitHub, so no attachments are allowed.
|
Yeah, it's [email protected] |
Hey,
Sorry, I am working entirely in the cloud nowadays, so it's problematic to download files and send emails with attachments. I have added the zip file here instead:
https://github.com/mmistroni/GCP_Experiments/tree/master/dataflow
Please note that your main.py has been greatly simplified. Also, for future reference, I have added substitution variables to the cloudbuild and other YAML scripts, so next time we share code I can just update them with my own paths.
I'll keep the zip around until you're comfortable, then I will delete it from my repo.
Regards
…On Mon, Aug 9, 2021 at 2:07 PM Shriyut Jha ***@***.***> wrote:
Yeah its, ***@***.***
|
Hey Marco, Thanks, |
Hi Marco,
Just opening this issue so we can communicate further