Ch10 - can't run model.py - error with hypertune #166

jgammerman · 2023-02-14T17:17:52Z

Hello,

So on p.338 of the book it says:

But when I run this I get the following error:

Traceback (most recent call last):
  File "/home/jgammerman/data-science-on-gcp/10_mlops/model.py", line 331, in <module>
    train_and_evaluate(TRAIN_DATA_PATTERN, EVAL_DATA_PATTERN, TEST_DATA_PATTERN, OUTPUT_MODEL_DIR, OUTPUT_DIR)
  File "/home/jgammerman/data-science-on-gcp/10_mlops/model.py", line 180, in train_and_evaluate
    hpt = hypertune.HyperTune()
AttributeError: module 'hypertune' has no attribute 'HyperTune'

I have pip installed hypertune on my VM so I know it's there:

jgammerman@cloudshell:~/data-science-on-gcp/.......$ pip show hypertune
Name: hypertune
Version: 1.0.3
Summary: A library for performing hyperparameter optimization with Polyaxon.
Home-page: https://github.com/polyaxon/hypertune
Author: Polyaxon, Inc.
Author-email: [email protected]
License: Apache 2.0
Location: /home/jgammerman/.local/lib/python3.9/site-packages
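The `pip show` output above identifies the Polyaxon package. Both `hypertune` (Polyaxon) and `cloudml-hypertune` install a top-level module named `hypertune`, so `import hypertune` succeeds with either one, but only one of them provides `HyperTune`. The sketch below is one quick way to see which of the two distributions is installed for the running interpreter. It assumes Python 3.8+ for `importlib.metadata`, and the helper name is hypothetical.

```python
from importlib import metadata

# PyPI distribution names; both provide a top-level `hypertune` module.
CANDIDATES = ("hypertune", "cloudml-hypertune")

def installed_hypertune_distributions():
    """Map each candidate distribution installed for this interpreter to its version."""
    found = {}
    for name in CANDIDATES:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed for this interpreter
    return found

if __name__ == "__main__":
    dists = installed_hypertune_distributions()
    print(dists or "neither distribution is installed for this interpreter")
```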

lakshmanok · 2023-02-14T17:32:09Z

Make sure you are pip installing to the right Python installation:

python3 -m pip install hypertune
python3 -m pip show hypertune

Thanks, Lak
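Running pip as `python3 -m pip` ensures it installs into the same interpreter that `python3` runs. To confirm from inside Python which interpreter is executing the script and which file `import hypertune` would load, a small diagnostic such as the following can help. It is a sketch, and the function name is hypothetical. If the printed path does not match the `Location:` line from `pip show`, pip and the script are using different installations.

```python
import importlib.util
import sys

def describe_hypertune():
    """Return the running interpreter and where it would import `hypertune` from."""
    spec = importlib.util.find_spec("hypertune")  # None if not importable
    return {
        "python": sys.executable,
        "hypertune_origin": spec.origin if spec is not None else None,
    }

if __name__ == "__main__":
    info = describe_hypertune()
    print("Interpreter:       ", info["python"])
    print("hypertune imported:", info["hypertune_origin"] or "not found for this interpreter")
```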

jgammerman · 2023-02-14T17:36:16Z

Tried that - same error as before

lakshmanok · 2023-02-14T17:40:10Z

Ah, wrong hypertune! Please uninstall that one; this is the one you need to pip install: https://pypi.org/project/cloudml-hypertune/ Thanks, Lak
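For context, the failing line in `model.py` constructs `hypertune.HyperTune()`, a class that only `cloudml-hypertune` provides. Vertex AI hyperparameter tuning reads the metric the trainer reports through that object. Below is a minimal sketch of a defensive wrapper. It assumes `cloudml-hypertune`'s `report_hyperparameter_tuning_metric(hyperparameter_metric_tag=..., metric_value=..., global_step=...)` method. The helper names and the `"rmse"` tag are illustrative assumptions, not the book's code.

```python
import importlib

def load_cloudml_hypertune():
    """Return the `hypertune` module if it is cloudml-hypertune, else None."""
    try:
        module = importlib.import_module("hypertune")
    except ImportError:
        return None  # no `hypertune` module for this interpreter
    # Polyaxon's hypertune also imports as `hypertune` but has no HyperTune
    # class, which is what raised the AttributeError in model.py.
    return module if hasattr(module, "HyperTune") else None

def report_metric(tag, value, step):
    """Report one metric value for Vertex AI tuning; return True if it was reported."""
    hypertune = load_cloudml_hypertune()
    if hypertune is None:
        return False
    hpt = hypertune.HyperTune()
    hpt.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=tag,
        metric_value=value,
        global_step=step,
    )
    return True

if __name__ == "__main__":
    # "rmse" is a hypothetical tag; it must match the metric the tuning job optimizes.
    print("reported:", report_metric("rmse", 0.25, 1))
```

Because both packages write into the same `hypertune` directory, it is probably safest to run `python3 -m pip uninstall hypertune` first and then `python3 -m pip install cloudml-hypertune`. Doing it in the opposite order could remove files the newly installed package needs.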

jgammerman · 2023-02-15T16:53:41Z

That made it work, thanks Lak.

(By the way for anyone else at this stage, you might need to run export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python to get it to work.)
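The `export` needs to take effect before any library imports protobuf, because protobuf picks its implementation when it is first loaded. If setting the variable in the shell is inconvenient, an in-Python equivalent might look like the sketch below. It assumes the function is called at the very top of the entry script, before TensorFlow or `google.cloud.aiplatform` is imported. The helper name is hypothetical. The pure-Python implementation is slower, so treat this as a workaround rather than a fix.

```python
import os
import sys

# Module in which protobuf decides which backend (python/cpp/upb) to use.
_PROTOBUF_IMPL_MODULE = "google.protobuf.internal.api_implementation"

def force_pure_python_protobuf():
    """Select protobuf's pure-Python backend; return False if it is already too late."""
    # protobuf reads this variable when it first loads its implementation, so
    # setting it afterwards does not change the already-loaded backend.
    already_loaded = _PROTOBUF_IMPL_MODULE in sys.modules
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    return not already_loaded

if __name__ == "__main__":
    if not force_pure_python_protobuf():
        print("protobuf was already imported; set the variable in the shell instead")
```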

I then tried running the pipeline train_on_vertexai.py and it spent about 10 mins training before failing due to a memory error:

RuntimeError: Training failed with: code: 8 message: "The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus"

I'm currently running the same pipeline using AutoML and it's been training for 2 hours so far - should it take that long?

lakshmanok · 2023-02-15T17:38:09Z

(1) Could you add the "export" instruction to the README instructions on GitHub? https://github.com/GoogleCloudPlatform/data-science-on-gcp/tree/main/10_mlops/README.md

(2) I believe that's a quota error you are getting. You don't have the quota for one NVIDIA T4 GPU, so you may need to request that quota. If you have a different GPU available, change this line appropriately:

accelerator_type=aip.AcceleratorType.NVIDIA_TESLA_T4.name, accelerator_count=1,

(3) It should finish in a little over 2 hours:

budget_milli_node_hours=(300 if develop_mode else 2000),
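On point (3): `budget_milli_node_hours` is expressed in thousandths of a node hour, so the two settings in that line correspond to 0.3 node hours (develop mode) and 2 node hours. A tiny sketch of the conversion follows; the function name is hypothetical. The budget covers training node hours, not wall-clock time for the whole pipeline, so an end-to-end AutoML run can take noticeably longer than the budget alone suggests.

```python
MILLI_NODE_HOURS_PER_NODE_HOUR = 1000

def budget_node_hours(develop_mode):
    """Training budget in node hours for the setting quoted above."""
    budget_milli_node_hours = 300 if develop_mode else 2000
    return budget_milli_node_hours / MILLI_NODE_HOURS_PER_NODE_HOUR

if __name__ == "__main__":
    print(budget_node_hours(develop_mode=True))   # 0.3
    print(budget_node_hours(develop_mode=False))  # 2.0
```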

jgammerman · 2023-02-16T14:01:54Z

Done, I've submitted a pull request.
I do have an NVIDIA T4 GPU attached, but it's still failing with the same error.

The AutoML pipeline ended up completing after 3 hours 40 mins.

jgammerman · 2023-02-16T15:22:05Z

I'm getting the same error when trying to run hyperparameter tuning:

google.api_core.exceptions.ResourceExhausted: 429 The following quota metrics exceed quota limits: aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus

lakshmanok · 2023-02-16T15:30:22Z

Note the name of the quota that you need increased: aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus. Please look in console.cloud.google.com/quotas for a quota with this name. Thanks, Lak
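Since the quota to search for appears verbatim in the error text, a small helper can extract the metric name(s) from a failure message such as the `ResourceExhausted: 429` error above. This is a hypothetical convenience based only on the message format seen in this thread (`<service>.googleapis.com/<metric>`). It is not an official API.

```python
import re

# Quota metric names in the errors above look like <service>.googleapis.com/<metric>.
QUOTA_METRIC = re.compile(r"[a-z0-9.-]+\.googleapis\.com/[A-Za-z0-9_]+")

def quota_metrics(error_message):
    """Return the distinct quota metric names mentioned in an error message, in order."""
    seen = []
    for name in QUOTA_METRIC.findall(error_message):
        if name not in seen:
            seen.append(name)
    return seen

if __name__ == "__main__":
    msg = ("google.api_core.exceptions.ResourceExhausted: 429 The following quota "
           "metrics exceed quota limits: "
           "aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus")
    print(quota_metrics(msg))
    # ['aiplatform.googleapis.com/custom_model_training_nvidia_t4_gpus']
```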

lakshmanok · 2023-02-16T15:31:42Z

Also, remember that if your quota is 1 GPU and you already have a VM created with this GPU, you have exhausted the quota, so the Vertex AI managed service won't have a GPU available. Similarly, if you are doing hyperparameter tuning with 4 parallel workers, you need 4 GPUs in your quota.
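Lak's rule of thumb can be written as a quick pre-flight check. The GPUs a job needs at once (parallel workers times GPUs per worker), plus any GPUs already in use (per his note, including one attached to an existing VM), must fit within the quota. This is a minimal sketch. The function and parameter names are hypothetical, and it does not query any real quota API.

```python
def gpus_required(parallel_workers, gpus_per_worker=1):
    """GPUs a job needs at once, e.g. a tuning job with several parallel trials."""
    return parallel_workers * gpus_per_worker

def fits_quota(quota, gpus_in_use, parallel_workers, gpus_per_worker=1):
    """True if the new job's GPUs fit in what is left of the quota."""
    return gpus_in_use + gpus_required(parallel_workers, gpus_per_worker) <= quota

if __name__ == "__main__":
    # Quota of 1 with a GPU already attached to a VM: nothing left for Vertex AI.
    print(fits_quota(quota=1, gpus_in_use=1, parallel_workers=1))  # False
    # Hyperparameter tuning with 4 parallel workers needs 4 GPUs of quota.
    print(fits_quota(quota=4, gpus_in_use=0, parallel_workers=4))  # True
```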

jgammerman · 2023-02-16T15:52:03Z

When I navigate to console.cloud.google.com/quotas, it says that I need to upgrade to a paid account.

I'm still using a managed notebook. (I tried creating a user-managed one a few days ago, but it didn't work; the error said something about not enough GPUs currently being available. I put it down to bad timing and decided to try again later.) I guess that's the root of the problem. Will try again.

jgammerman · 2023-02-16T16:36:35Z

So I still can't create a user-managed notebook with a GPU:

I have tried US-west, US-east and US-central. Sometimes I also get this error:
