Problem resuming training in Google Colab (Continued) #674

Closed
1 task done
sebasmej opened this issue May 2, 2024 · 24 comments

Labels
bug Something isn't working fixed Bug has been resolved

Comments

@sebasmej

sebasmej commented May 2, 2024

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I am training a model in Google Colab. When I try to resume training by running the following commands:

%pip install ultralytics  # install
from ultralytics import YOLO, checks, hub
checks()  # checks

hub.login('my_API_KEY')
model = YOLO('my_MODEL_ID')
results = model.train()

the following error message appears:

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp 🚀
Downloading https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt to 'epoch-32.pt'...
⚠️ Download failure, retrying 1/3 https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-jsjt9%40ultralytics-hub.iam.gserviceaccount.com%2F20240502%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240502T073319Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=1ec7aecb22b2a6b13261b8b0fc2ac3e0a11e1077edf9483536cedc8921e7a0ee992473b79b8e59a9ae1d8cdc46d6d6c4c78b0aeab0f7b5707f7cf3d797c5cab7f43e91666670df484c8a871b893088dd59ed8b5769a46b9b32670425c473a352033766e62ce291a547b5d597eab4b9bbfdcafd298eb1ed8d54f7c09a8b4e0e6979fe039d702a9606bb1c27bebfc9e41f67c4bb34de8efd35cfabc08630a390b73caa5b43c1fb663446fafc0971f404ab0f001a0c4f3b7ab128f44413d977e428423e438d7612a2bd310d1640cb87acd8f63aa327f781500564f56970abd173091acdd698191cbfc38f1980d36f2e32ac7da17d2fe3862447d3685070082273ea...
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-8-1b077e47cb44> in <cell line: 5>()
      3 
      4 # Load your model from HUB (replace 'YOUR_MODEL_ID' with your model ID)
----> 5 model = YOLO('https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp')
      6 
      7 # Train the model

6 frames
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1256             "functionality.")
   1257 
-> 1258     magic_number = pickle_module.load(f, **pickle_load_args)
   1259     if magic_number != MAGIC_NUMBER:
   1260         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

I followed your recommendations to solve this issue: I reran the training cell, checked the internet connection, and cleared the Colab environment, but the issue persists. For further investigation, here are the details of the error after the rerun:

Ultralytics HUB: New authentication successful ✅
Ultralytics HUB: View model at https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp 🚀
Found https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt locally at epoch-32.pt
---------------------------------------------------------------------------
UnpicklingError                           Traceback (most recent call last)
<ipython-input-9-1b077e47cb44> in <cell line: 5>()
      3 
      4 # Load your model from HUB (replace 'YOUR_MODEL_ID' with your model ID)
----> 5 model = YOLO('https://hub.ultralytics.com/models/6SUZnsAo0z0y6gld7lpp')
      6 
      7 # Train the model

6 frames
/usr/local/lib/python3.10/dist-packages/ultralytics/models/yolo/model.py in __init__(self, model, task, verbose)
     21         else:
     22             # Continue with default YOLO initialization
---> 23             super().__init__(model=model, task=task, verbose=verbose)
     24 
     25     @property

/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py in __init__(self, model, task, verbose)
    149             self._new(model, task=task, verbose=verbose)
    150         else:
--> 151             self._load(model, task=task)
    152 
    153     def __call__(

/usr/local/lib/python3.10/dist-packages/ultralytics/engine/model.py in _load(self, weights, task)
    238 
    239         if Path(weights).suffix == ".pt":
--> 240             self.model, self.ckpt = attempt_load_one_weight(weights)
    241             self.task = self.model.args["task"]
    242             self.overrides = self.model.args = self._reset_ckpt_args(self.model.args)

/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py in attempt_load_one_weight(weight, device, inplace, fuse)
    804 def attempt_load_one_weight(weight, device=None, inplace=True, fuse=False):
    805     """Loads a single model weights."""
--> 806     ckpt, weight = torch_safe_load(weight)  # load ckpt
    807     args = {**DEFAULT_CFG_DICT, **(ckpt.get("train_args", {}))}  # combine model and default args, preferring model args
    808     model = (ckpt.get("ema") or ckpt["model"]).to(device).float()  # FP32 model

/usr/local/lib/python3.10/dist-packages/ultralytics/nn/tasks.py in torch_safe_load(weight)
    730             }
    731         ):  # for legacy 8.0 Classify and Pose models
--> 732             ckpt = torch.load(file, map_location="cpu")
    733 
    734     except ModuleNotFoundError as e:  # e.name is missing module name

/usr/local/lib/python3.10/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, weights_only, mmap, **pickle_load_args)
   1038             except RuntimeError as e:
   1039                 raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
-> 1040         return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
   1041 
   1042 

/usr/local/lib/python3.10/dist-packages/torch/serialization.py in _legacy_load(f, map_location, pickle_module, **pickle_load_args)
   1256             "functionality.")
   1257 
-> 1258     magic_number = pickle_module.load(f, **pickle_load_args)
   1259     if magic_number != MAGIC_NUMBER:
   1260         raise RuntimeError("Invalid magic number; corrupt file?")

UnpicklingError: invalid load key, '<'.

Environment

Google Colab

Minimal Reproducible Example

  1. Log in to HUB
  2. Search for the model to train
  3. Click to copy the Colab code
  4. Run First Google Colab cell
  5. Run Second Google Colab cell
  6. Error appears
  7. Rerun Second Google Colab cell
  8. Second Error appears

Additional

No response

@sebasmej sebasmej added the bug Something isn't working label May 2, 2024
@pderrenger
Member

Hello! 👋 It seems the issue you're encountering is related to the download or loading of the model's checkpoint file. The error message you're seeing (UnpicklingError: invalid load key, '<'.) suggests that the downloaded file might be corrupted or not a valid .pt file. This can sometimes occur due to incomplete downloads or network issues.
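
The load key '<' in the traceback is the first byte the unpickler read, which is consistent with the file containing markup (for example an XML or HTML error response) rather than a PyTorch checkpoint. Below is a minimal sketch for checking this before loading, assuming the file is available at a local path. The helper name `looks_like_markup` is hypothetical and not part of Ultralytics.

```python
from pathlib import Path

UTF8_BOM = b"\xef\xbb\xbf"


def looks_like_markup(path, probe_bytes=64):
    """Return True if the file starts with '<', i.e. it is probably an
    XML/HTML error body rather than a serialized checkpoint."""
    with Path(path).open("rb") as f:
        head = f.read(probe_bytes)
    # Skip a UTF-8 byte-order mark and leading whitespace before testing.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    return head.lstrip().startswith(b"<")
```

If `looks_like_markup("epoch-32.pt")` returns True, deleting the local file keeps it from being reused on the next run (as in the "Found ... locally" log line). This will not help if the object stored on HUB is itself bad.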

As you've already tried the recommended steps (re-running the cell, checking internet connection, and clearing the Colab environment), you could try the following additional step to ensure the .pt file is fully and correctly downloaded:

  1. Manually download the checkpoint file: Use the link provided in the error message or locate the direct download link for the .pt file from the Ultralytics HUB website. You can do this in a browser or through a programmatic method in Colab. Once downloaded, make sure the file size looks correct (not significantly smaller than expected).

  2. Upload the .pt file to your Colab environment: You can use the Colab file upload feature to upload the .pt file directly into the Colab file system.

  3. Directly load the uploaded .pt file in your script: Instead of using the model ID or download URL, point to the locally uploaded .pt file when loading the model.
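
Steps 1 to 3 can be sketched as follows. This is only an illustration: the `min_bytes` threshold and both helper names are assumptions, and `YOLO(path)` is used as step 3 describes, pointing at a local .pt file. Whether HUB keeps tracking a run started this way is not confirmed here.

```python
from pathlib import Path


def checkpoint_size_ok(path, min_bytes=1_000_000):
    """Check that the file exists and is not suspiciously small.

    min_bytes is an assumed threshold; pick one that fits your model.
    """
    p = Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes


def load_local_checkpoint(path, min_bytes=1_000_000):
    """Load a manually uploaded .pt file instead of the HUB model ID."""
    if not checkpoint_size_ok(path, min_bytes):
        raise ValueError(f"{path} is missing or too small to be a checkpoint")
    # Imported here so the size check works without ultralytics installed.
    from ultralytics import YOLO

    return YOLO(str(path))
```

After uploading the file through the Colab file panel, `model = load_local_checkpoint("epoch-32.pt")` would replace the `YOLO('my_MODEL_ID')` line.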

If the problem persists even after these steps, it's possible there may be an issue with the .pt file itself. For further assistance, providing detailed information about the file size and exact steps you've taken could help in diagnosing the issue.

Remember to check the Ultralytics HUB Docs at https://docs.ultralytics.com/hub for more detailed instructions and troubleshooting tips. Your feedback is valuable, and the Ultralytics team appreciates your community involvement. Let's work together to solve this issue! 🚀

@sebasmej
Author

sebasmej commented May 3, 2024

Thanks for the answer. I tried to download the file manually as you recommended, but when I use the link from the error message (https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt), the following message appears:

<Error>
<Code>AccessDenied</Code>
<Message>Access denied.</Message>
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
</Error>

I am running into permission problems and am not sure how to authenticate. Could you please help me access the resources properly so that I can download the .pt file manually?

Downloading with the second link from the error message (https://storage.googleapis.com/ultralytics-hub.appspot.com/users/gR39oPibZKaU7n6mUI0WE1H1CQH2/models/6SUZnsAo0z0y6gld7lpp/epoch-32.pt?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=firebase-adminsdk-jsjt9%40ultralytics-hub.iam.gserviceaccount.com%2F20240503%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240503T082307Z&X-Goog-Expires=900&X-Goog-SignedHeaders=host&X-Goog-Signature=61d4b243926debc55bdba0bfc6e849b6eea5e290fdb2cf3b0a8965f92c6fe9e8b76932ddff03ac95e779de05746b764a33274653d40fa967677e4d64cc167751d360b739df1c4ca19286956364c4b104f898ce2c36d79cbe699a9899f18b2ee968381e2cb85c1a48acd3d8f998fc60cfebd5ade7e210f4d053335b43e39fdfa5e77e70087160a066f9323a61aa962c90bf419a854a8bb7deccfc9ab5cdc3a78c9f2d5518edd24a4bab5864d0d64e26049820359fe910aae8aa1e8f5c6c2c5172c7067057e0d40375a64c1efa84d733f7c582e30996fe154c6c65e08cbd3eed28ebe8fcd43c77de1471cefcab353b7a2b729047789716b0abe5bca6e6107effea...) also fails, with the following message:

<Error>
<Code>ExpiredToken</Code>
<Message>Invalid argument.</Message>
<Details>The provided token has expired. Request signature expired at: 2024-05-03T08:38:07+00:00</Details>
</Error>
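
Both responses are small XML documents with `Code`, `Message`, and `Details` elements. Below is a minimal sketch for pulling those fields out of such a body with Python's standard library. The function name `parse_storage_error` is hypothetical, and the example string is the second response above.

```python
import xml.etree.ElementTree as ET


def parse_storage_error(body):
    """Extract Code, Message and Details from an <Error> XML body.

    Returns None if the body is not XML or its root is not <Error>.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return None
    if root.tag != "Error":
        return None
    return {
        "code": root.findtext("Code"),
        "message": root.findtext("Message"),
        "details": root.findtext("Details"),
    }


example = (
    "<Error><Code>ExpiredToken</Code><Message>Invalid argument.</Message>"
    "<Details>The provided token has expired. Request signature expired at: "
    "2024-05-03T08:38:07+00:00</Details></Error>"
)
print(parse_storage_error(example)["code"])  # ExpiredToken
```

The same function could be pointed at the contents of a suspiciously small .pt file to see which storage error, if any, was saved in place of the checkpoint.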

@pderrenger
Member

Hey there! 👋 It looks like you're encountering access issues due to permission settings or an expired token for the .pt file. For security reasons, direct access to the download URLs typically requires authentication that matches the credentials permitted in our system.

For downloading model checkpoints from the Ultralytics HUB, I recommend ensuring that you're logged into the hub using the hub.login('your_API_KEY') method within your script. This should prevent the 'Access Denied' error by authenticating your access.

Regarding the expired token in the second URL, this usually occurs because URLs with embedded credentials have a short validity period for security reasons. To resolve this, it's best to generate a fresh download URL by re-initiating your session or request immediately before you plan to download the file.
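
The expiry of such a URL can be read from its query string: `X-Goog-Date` is the signing time and `X-Goog-Expires` is the lifetime in seconds. Below is a minimal sketch. The function name and the shortened example URL are illustrative, while the two parameter values are the ones from the URL posted above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url):
    """Return the UTC expiry time of a signed URL that carries
    X-Goog-Date and X-Goog-Expires query parameters."""
    query = parse_qs(urlparse(url).query)
    signed_at = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    lifetime = timedelta(seconds=int(query["X-Goog-Expires"][0]))
    return signed_at + lifetime


# Shortened, illustrative URL; only the two relevant parameters are kept.
url = (
    "https://storage.googleapis.com/example-bucket/epoch-32.pt"
    "?X-Goog-Date=20240503T082307Z&X-Goog-Expires=900"
)
expiry = signed_url_expiry(url)
print(expiry.isoformat())  # 2024-05-03T08:38:07+00:00
print(datetime.now(timezone.utc) > expiry)  # True once the URL has expired
```

With a 900-second lifetime, the URL has to be used within 15 minutes of being generated, which matches the expiry time reported in the ExpiredToken message.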

If these steps don't resolve the issue, I'd suggest reaching out through our support channels with specifics (while avoiding sharing sensitive information like API keys publicly), so we can ensure proper access on your account.

Let's get this sorted! 🚀

@sergiuwaxmann
Member

@sebasmej We just checked and everything is working fine when starting/resuming training in Google Colab. Do you still have the issues above?

@sebasmej
Author

Yes, the problem persists. I have not been able to resume training, and I am encountering the same errors I mentioned before.

@sergiuwaxmann
Member

@sebasmej

I’ve reviewed your model, and it appears there was indeed a hiccup with uploading the checkpoint for epoch 32. As a temporary measure, I’ve reverted the checkpoint to epoch 31 (previous successful checkpoint upload), which should allow you to resume training immediately. Could you please confirm if everything is back on track on your end?

Additionally, I’ve documented this incident with our development team to investigate further and ensure a permanent fix is implemented. This will help prevent such issues from recurring in the future.

P.S. If the error still occurs, consider starting the training again with a new model.

@sebasmej
Author

Thank you for your prompt reply. Yes, everything is working fine now. I was able to continue the model training from epoch 31 without any problem.

@sergiuwaxmann
Member

@sebasmej I am glad your issue was solved. Thank you for your patience!
Hopefully, our team can implement a permanent fix soon as well.

@sergiuwaxmann sergiuwaxmann added the fixed Bug has been resolved label May 13, 2024
@vwyLss

vwyLss commented May 31, 2024

@sergiuwaxmann I am having the same issue. I tried to manually download the checkpoint, and its size is just 1 KB. Could you revert my checkpoint to a previous successful checkpoint? Thanks in advance.

My model ID: https://hub.ultralytics.com/models/ung87rRVHYHU5Wrhmq8p?tab=train

@sergiuwaxmann
Member

@vwyLss Can you check now? Last checkpoint should be epoch 125.

@vwyLss

vwyLss commented May 31, 2024

@sergiuwaxmann It is working now, thanks!

@sergiuwaxmann
Member

@vwyLss You're welcome! 🚀

@marshaniswah

@sergiuwaxmann Hello! I'm still encountering the same issue. Could you please revert my model to a previous successful checkpoint? Thank you very much!

My model: https://hub.ultralytics.com/models/LMEhtucmCZk4XUTeUjWD

@sergiuwaxmann
Member

@marshaniswah I’ve reverted the checkpoint to epoch 77 (previous successful checkpoint upload), which should allow you to resume training immediately. Could you please confirm if training is working again?

@marshaniswah

marshaniswah commented Nov 4, 2024

@sergiuwaxmann Yeah, it's working now. I'm training my model right now. Thanks!

@joseabraham

@sergiuwaxmann Hello, I'm encountering the same issue. Could you please revert my model to a previous successful checkpoint? Thanks in advance:

Model: https://hub.ultralytics.com/models/9nVppEnRgfYxE9aGROjl

@sergiuwaxmann
Member

@joseabraham Hello!
Sure, I replied to your issue: #940.

@bcastagna1

Hello @sergiuwaxmann. I'm running into the same issue as above. The Hub shows 100% with the last checkpoint saved for epoch 299 (of 300). Would you be able to perform the same fix for me?

I dug into the locally saved weights/epoch-299.pt, and the file contains the "NoSuchKey" error. I'm running everything on a custom agent.

Model URL: https://hub.ultralytics.com/models/ZgZqmopBn22vCOEzBrUS

Thank you!!

@sergiuwaxmann
Member

@bcastagna1 I’ve reverted the checkpoint to epoch 137 (previous successful checkpoint upload), which should allow you to resume training immediately.

@bcastagna1

Thank you @sergiuwaxmann !

@KonDan2310

@sergiuwaxmann Hi, I also have the same problem. Could you please revert my model to the previous checkpoint? Thanks in advance for your help:
My model: https://hub.ultralytics.com/models/g1kGF9foy7xBjFHMU0gz

@sergiuwaxmann
Member

@KonDan2310 I’ve reverted the checkpoint to epoch 75 (previous successful checkpoint upload), which should allow you to resume training immediately.

@KonDan2310

KonDan2310 commented Dec 17, 2024 via email

@pderrenger
Member

You're welcome! 😊 If you encounter any further issues or have additional questions, feel free to ask. Happy training and best of luck with your project! 🚀


9 participants