Problem resuming training in Google Colab (Continued) #674

sebasmej · 2024-05-02T07:43:38Z

Search before asking

I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I am training a model in Google Colab. When I try to resume training by running these commands:

%pip install ultralytics  # install
from ultralytics import YOLO, checks, hub
checks()  # checks

hub.login('my_API_KEY')
model = YOLO('my_MODEL_ID')
results = model.train()

the following error message appears:

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt to 'epoch-32.pt'...
⚠️ Download failure, retrying 1/3 https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-jsjt9%40ultralytics-hub.iam.gserviceaccount.com%2F20240502%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240502T073319Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=1ec7aecb22b2a6b13261b8b0fc2ac3e0a11e1077edf9483536cedc8921e7a0ee992473b79b8e59a9ae1d8cdc46d6d6c4c78b0aeab0f7b5707f7cf3d797c5cab7f43e91666670df484c8a871b893088dd59ed8b5769a46b9b32670425c473a352033766e62ce291a547b5d597eab4b9bbfdcafd298eb1ed8d54f7c09a8b4e0e6979fe039d702a9606bb1c27bebfc9e41f67c4bb34de8efd35cfabc08630a390b73caa5b43c1fb663446fafc0971f404ab0f001a0c4f3b7ab128f44413d977e428423e438d7612a2bd310d1640cb87acd8f63aa327f781500564f56970abd173091acdd698191cbfc38f1980d36f2e32ac7da17d2fe3862447d3685070082273ea...
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-8-1b077e47cb44> in <cell line: 5>()
      3 
      4 # Load your model from HUB (replace 'YOUR_MODEL_ID' with your model ID)
----> 5 model = YOLO('https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp')
      6 
      7 # Train the model

6 frames
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1256             "functionality.")
   1257 
-> 1258     magic_number = pickle_module.load(f, **pickle_load_args)
   1259     if magic_number != MAGIC_NUMBER:
   1260         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

I followed your recommendations to solve this issue: I reran the training cell, checked the internet connection, and cleared the Colab environment, but the issue persists. For further investigation, here are the details of the error after the rerun:

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp 🚀
Found https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt locally at epoch-32.pt
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-9-1b077e47cb44> in <cell line: 5>()
      3 
      4 # Load your model from HUB (replace 'YOUR_MODEL_ID' with your model ID)
----> 5 model = YOLO('https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp')
      6 
      7 # Train the model

6 frames
/usr/local/lib/python3.10/dist-packages/ultralytics/models/yolo/model.py in __init__(self, model, task, verbose)
     21         else:
     22             # Continue with default YOLO initialization
---> 23             super().__init__(model=model, task=task, verbose=verbose)
     24 
     25     @property

/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py in __init__(self, model, task, verbose)
    149             self._new(model, task=task, verbose=verbose)
    150         else:
--> 151             self._load(model, task=task)
    152 
    153     def __call__(

/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py in _load(self, weights, task)
    238 
    239         if Path(weights).suffix == ".pt":
--> 240             self.model, self.ckpt = attempt_load_one_weight(weights)
    241             self.task = self.model.args["task"]
    242             self.overrides = self.model.args = self._reset_ckpt_args(self.model.args)

/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py in attempt_load_one_weight(weight, device, inplace, fuse)
    804 def attempt_load_one_weight(weight, device=None, inplace=True, fuse=False):
    805     """Loads a single model weights."""
--> 806     ckpt, weight = torch_safe_load(weight)  # load ckpt
    807     args = {**DEFAULT_CFG_DICT, **(ckpt.get("train_args", {}))}  # combine model and default args, preferring model args
    808     model = (ckpt.get("ema") or ckpt["model"]).to(device).float()  # FP32 model

/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py in torch_safe_load(weight)
    730             }
    731         ):  # for legacy 8.0 Classify and Pose models
--> 732             ckpt = torch.load(file, map_location="cpu")
    733 
    734     except ModuleNotFoundError as e:  # e.name is missing module name

/usr/local/lib/python3.10/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
   1038             except RuntimeError as e:
   1039                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
-> 1040         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
   1041 
   1042 

/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1256             "functionality.")
   1257 
-> 1258     magic_number = pickle_module.load(f, **pickle_load_args)
   1259     if magic_number != MAGIC_NUMBER:
   1260         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

Environment

Google Colab

Minimal Reproducible Example

Login to hub
Search the model to train
Click to copy the Colab code
Run First Google Colab cell
Run Second Google Colab cell
Error appears
Rerun Second Google Colab cell
Second Error appears

Additional

No response

pderrenger · 2024-05-02T10:26:58Z

Hello! 👋 It seems the issue you're encountering is related to the download or loading of the model's checkpoint file. The error message you're seeing (UnpicklingError: invalid load key, '<'.) suggests that the downloaded file might be corrupted or not a valid .pt file. This can sometimes occur due to incomplete downloads or network issues.
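
The '<' in the message is the first byte that pickle read from the file, which points to an XML or HTML body rather than a checkpoint. A minimal sketch for checking this in Colab, assuming the downloaded file is in the working directory; sniff_checkpoint is a hypothetical helper written for this thread, not part of ultralytics or torch:

```python
def sniff_checkpoint(path):
    """Classify a file by its first bytes (hypothetical helper)."""
    with open(path, "rb") as f:
        head = f.read(64).lstrip()  # ignore leading whitespace, if any
    if head.startswith(b"<"):
        return "markup"   # XML/HTML body, e.g. a storage error response
    if head.startswith(b"PK\x03\x04"):
        return "zip"      # zip archive, the default torch.save container
    if head.startswith(b"\x80"):
        return "pickle"   # raw pickle stream (legacy torch.save format)
    return "unknown"
```

Calling sniff_checkpoint('epoch-32.pt') and getting "markup" would suggest that the file holds an error response instead of model weights.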

As you've already tried the recommended steps (re-running the cell, checking internet connection, and clearing the Colab environment), you could try the following additional step to ensure the .pt file is fully and correctly downloaded:

Manually download the checkpoint file: Use the link provided in the error message or locate the direct download link for the .pt file from the Ultralytics HUB website. You can do this in a browser or through a programmatic method in Colab. Once downloaded, make sure the file size looks correct (not significantly smaller than expected).
Upload the .pt file to your Colab environment: You can use the Colab file upload feature to upload the .pt file directly into the Colab file system.
Directly load the uploaded .pt file in your script: Instead of using the model ID or download URL, point to the locally uploaded .pt file when loading the model.

If the problem persists even after these steps, it's possible there may be an issue with the .pt file itself. For further assistance, providing detailed information about the file size and exact steps you've taken could help in diagnosing the issue.
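
The second traceback in the report starts with "Found ... locally at epoch-32.pt", so the rerun reused the file left behind by the failed download. A rough sketch for clearing such a leftover before retrying; discard_if_truncated is a hypothetical helper and the 100 kB threshold is an assumption, not a documented minimum checkpoint size. This only helps if the bad copy came from a failed download rather than from the stored checkpoint itself:

```python
from pathlib import Path


def discard_if_truncated(path, min_bytes=100_000):
    """Delete `path` if it is a file smaller than `min_bytes`.

    Returns True if the file was deleted, False otherwise.
    The default threshold is an assumed value for illustration.
    """
    p = Path(path)
    if p.is_file() and p.stat().st_size < min_bytes:
        p.unlink()  # remove the stale copy so the next run downloads again
        return True
    return False
```

For example, discard_if_truncated('epoch-32.pt') could be run in a cell before model = YOLO(...) is called again.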

Remember to check the Ultralytics HUB Docs at https://docs.ultralytics.com/hub for more detailed instructions and troubleshooting tips. Your feedback is valuable, and the Ultralytics team appreciates your community involvement. Let's work together to solve this issue! 🚀

sebasmej · 2024-05-03T08:55:25Z

Thanks for the answer. I tried to download the file manually as you recommended, but when I try to download it with the link in the error message (https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt), the following message appears:

<Error>
<Code>AccessDenied</Code>
<Message>Access denied.</Message>
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
</Error>

I am running into permission problems and I am not sure how to authenticate. Could you please help me access the resources properly so that I can download the .pt file manually?

This also happens if I try downloading it with the second link provided in the error message (https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-jsjt9%40ultralytics-hub.iam.gserviceaccount.com%2F20240503%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240503T082307Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=61d4b243926debc55bdba0bfc6e849b6eea5e290fdb2cf3b0a8965f92c6fe9e8b76932ddff03ac95e779de05746b764a33274653d40fa967677e4d64cc167751d360b739df1c4ca19286956364c4b104f898ce2c36d79cbe699a9899f18b2ee968381e2cb85c1a48acd3d8f998fc60cfebd5ade7e210f4d053335b43e39fdfa5e77e70087160a066f9323a61aa962c90bf419a854a8bb7deccfc9ab5cdc3a78c9f2d5518edd24a4bab5864d0d64e26049820359fe910aae8aa1e8f5c6c2c5172c7067057e0d40375a64c1efa84d733f7c582e30996fe154c6c65e08cbd3eed28ebe8fcd43c77de1471cefcab353b7a2b729047789716b0abe5bca6e6107effea...). In that case the following message appears:

<Error>
<Code>ExpiredToken</Code>
<Message>Invalid argument.</Message>
<Details>The provided token has expired. Request signature expired at: 2024-05-03T08:38:07+00:00</Details>
</Error>

pderrenger · 2024-05-03T10:46:57Z

Hey there! 👋 It looks like you're encountering access issues due to permission settings or an expired token for the .pt file. For security reasons, direct access to the download URLs typically requires authentication that matches the credentials permitted in our system.

For downloading model checkpoints from the Ultralytics HUB, I recommend ensuring that you're logged into the hub using the hub.login('your_API_KEY') method within your script. This should prevent the 'Access Denied' error by authenticating your access.

Regarding the expired token in the second URL, this usually occurs because URLs with embedded credentials have a short validity period for security reasons. To resolve this, it's best to generate a fresh download URL by re-initiating your session or request immediately before you plan to download the file.
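
The validity window can be read from the signed URL itself: X-Goog-Date is the signing time and X-Goog-Expires is the lifetime in seconds. A small sketch that computes the expiry, assuming a URL with both query parameters present; signed_url_expiry is a hypothetical helper, and the example URL is shortened to the two relevant parameters with a placeholder bucket name:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit


def signed_url_expiry(url):
    """Return the UTC time at which a signed URL stops being valid."""
    query = parse_qs(urlsplit(url).query)
    issued = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    issued = issued.replace(tzinfo=timezone.utc)  # the trailing Z means UTC
    lifetime = timedelta(seconds=int(query["X-Goog-Expires"][0]))
    return issued + lifetime


# Values taken from the second link in the previous comment
url = (
    "https://storage.googleapis.com/example-bucket/epoch-32.pt"
    "?X-Goog-Date=20240503T082307Z&X-Goog-Expires=900"
)
print(signed_url_expiry(url).isoformat())  # 2024-05-03T08:38:07+00:00
```

With 900 seconds the link is usable for 15 minutes after it is generated, which matches the "Request signature expired at" time in the ExpiredToken response.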

If these steps don't resolve the issue, I'd suggest reaching out through our support channels with specifics (while avoiding sharing sensitive information like API keys publicly), so we can ensure proper access on your account.

Let's get this sorted! 🚀

sergiuwaxmann · 2024-05-08T10:05:52Z

@sebasmej We just checked and everything is working fine when starting/resuming training in Google Colab. Do you still have the issues above?

sebasmej · 2024-05-13T07:21:45Z

Yes, the problem persists. I have not been able to resume training, I am encountering the same errors I mentioned before.

sergiuwaxmann · 2024-05-13T08:07:50Z

@sebasmej

I’ve reviewed your model, and it appears there was indeed a hiccup with uploading the checkpoint for epoch 32. As a temporary measure, I’ve reverted the checkpoint to epoch 31 (previous successful checkpoint upload), which should allow you to resume training immediately. Could you please confirm if everything is back on track on your end?

Additionally, I’ve documented this incident with our development team to investigate further and ensure a permanent fix is implemented. This will help prevent such issues from recurring in the future.

P.S. If the error still occurs, consider starting the training again with a new model.

sebasmej · 2024-05-13T10:06:56Z

Thank you for your prompt reply. Yes everything is working fine now. I was able to continue the model training from epoch 31 without any problem.

sergiuwaxmann · 2024-05-13T10:14:59Z

@sebasmej I am glad your issue was solved. Thank you for your patience!
Hopefully, our team can implement a permanent fix soon as well.

vwyLss · 2024-05-31T10:57:42Z

@sergiuwaxmann I am having the same issue. I tried to manually download the checkpoint and its size is just 1 KB. Could you revert my checkpoint to a previous successful checkpoint? Thanks in advance.

My model ID: https://hub.ultralytics.com/models/ung87rRVHYHU5Wrhmq8p?tab=train

sergiuwaxmann · 2024-05-31T11:19:51Z

@vwyLss Can you check now? Last checkpoint should be epoch 125.

vwyLss · 2024-05-31T12:03:43Z

@sergiuwaxmann It is working now, thanks!

sergiuwaxmann · 2024-05-31T12:37:05Z

@vwyLss You're welcome! 🚀

marshaniswah · 2024-11-04T00:07:32Z

@sergiuwaxmann Hello! I'm still encountering the same issue. Could you please revert my model to a previous successful checkpoint? Thank you very much!

My model: https://hub.ultralytics.com/models/LMEhtucmCZk4XUTeUjWD

sergiuwaxmann · 2024-11-04T10:21:28Z

@marshaniswah I’ve reverted the checkpoint to epoch 77 (previous successful checkpoint upload), which should allow you to resume training immediately. Could you please confirm if training is working again?

marshaniswah · 2024-11-04T11:01:04Z

@sergiuwaxmann Yeah, it's working now. I'm training my model right now. Thanks!

joseabraham · 2024-11-29T21:37:05Z

@sergiuwaxmann Hello, I'm encountering the same issue. Could you please revert my model to a previous successful checkpoint? Thanks in advance:

Model: https://hub.ultralytics.com/models/9nVppEnRgfYxE9aGROjl

sergiuwaxmann · 2024-12-03T09:36:20Z

@joseabraham Hello!
Sure, I replied to your issue: #940.

bcastagna1 · 2024-12-09T05:58:34Z

Hello @sergiuwaxmann. I'm running into the same issue as above. The Hub shows 100% with the last checkpoint saved for epoch 299 (of 300). Would you be able to perform the same fix for me?

I dug into the locally saved weights/epoch-299.pt and found that the file contains a "NoSuchKey" error instead of weights. I'm running everything on a custom agent.

Model URL: https://hub.ultralytics.com/models/ZgZqmopBn22vCOEzBrUS

Thank you!!

sergiuwaxmann · 2024-12-09T08:49:29Z

@bcastagna1 I’ve reverted the checkpoint to epoch 137 (previous successful checkpoint upload), which should allow you to resume training immediately.

bcastagna1 · 2024-12-09T17:53:17Z

Thank you @sergiuwaxmann !

KonDan2310 · 2024-12-15T13:53:57Z

@sergiuwaxmann Hi, I also have the same problem. Could you please revert my model to the previous checkpoint? Thanks in advance for your help:
My model: https://hub.ultralytics.com/models/g1kGF9foy7xBjFHMU0gz

sergiuwaxmann · 2024-12-16T10:45:20Z

@KonDan2310 I’ve reverted the checkpoint to epoch 75 (previous successful checkpoint upload), which should allow you to resume training immediately.

KonDan2310 · 2024-12-17T11:04:22Z

Thanks for your help

pderrenger · 2024-12-17T16:40:10Z

You're welcome! 😊 If you encounter any further issues or have additional questions, feel free to ask. Happy training and best of luck with your project! 🚀

sebasmej added the bug Something isn't working label May 2, 2024

sergiuwaxmann mentioned this issue May 6, 2024

Problem resuming training in Google Colab #671

Closed

1 task

sergiuwaxmann assigned yogendrasinghx May 8, 2024

sergiuwaxmann added the fixed Bug has been resolved label May 13, 2024

sergiuwaxmann closed this as completed May 13, 2024
