Problem resuming training in Google Colab (Continued) #674
Comments
Hello! 👋 It seems the issue you're encountering is related to the download or loading of the model's checkpoint file. The error message you're seeing ( As you've already tried the recommended steps (re-running the cell, checking internet connection, and clearing the Colab environment), you could try the following additional step to ensure the
If the problem persists even after these steps, it's possible there may be an issue with the Remember to check the Ultralytics HUB Docs at https://docs.ultralytics.com/hub for more detailed instructions and troubleshooting tips. Your feedback is valuable, and the Ultralytics team appreciates your community involvement. Let's work together to solve this issue! 🚀 |
Hey there! 👋 It looks like you're encountering access issues due to permission settings or an expired token for the For downloading model checkpoints from the Ultralytics HUB, I recommend ensuring that you're logged into the hub using the Regarding the expired token in the second URL, this usually occurs because URLs with embedded credentials have a short validity period for security reasons. To resolve this, it's best to generate a fresh download URL by re-initiating your session or request immediately before you plan to download the file. If these steps don't resolve the issue, I'd suggest reaching out through our support channels with specifics (while avoiding sharing sensitive information like API keys publicly), so we can ensure proper access on your account. Let's get this sorted! 🚀 |
@sebasmej We just checked and everything is working fine when starting/resuming training in Google Colab. Do you still have the issues above? |
Yes, the problem persists. I have not been able to resume training, I am encountering the same errors I mentioned before. |
I’ve reviewed your model, and it appears there was indeed a hiccup with uploading the checkpoint for epoch 32. As a temporary measure, I’ve reverted the checkpoint to epoch 31 (previous successful checkpoint upload), which should allow you to resume training immediately. Could you please confirm if everything is back on track on your end? Additionally, I’ve documented this incident with our development team to investigate further and ensure a permanent fix is implemented. This will help prevent such issues from recurring in the future. PS If the error still occurs, maybe consider starting the training again (new model). |
Thank you for your prompt reply. Yes everything is working fine now. I was able to continue the model training from epoch 31 without any problem. |
@sebasmej I am glad your issue was solved. Thank you for your patience! |
@sergiuwaxmann I am having the same issue. I tried to manually download the checkpoint and its size is just 1 KB. Could you revert my checkpoint to a previous successful checkpoint? Thanks in advance. My model ID: https://hub.ultralytics.com/models/ung87rRVHYHU5Wrhmq8p?tab=train |
@vwyLss Can you check now? Last checkpoint should be epoch 125. |
@sergiuwaxmann It is working now, thanks! |
@vwyLss You're welcome! 🚀 |
@sergiuwaxmann Hello! I'm still encountering the same issue. Could you please revert my model to a previous successful checkpoint? Thank you very much! My model: https://hub.ultralytics.com/models/LMEhtucmCZk4XUTeUjWD |
@marshaniswah I’ve reverted the checkpoint to epoch 77 (previous successful checkpoint upload), which should allow you to resume training immediately. Could you please confirm if training is working again? |
@sergiuwaxmann Yeah, it's working now. I'm training my model right now. Thanks! |
@sergiuwaxmann Hello, I'm encountering the same issue. Could you please revert my model to a previous successful checkpoint? Thanks in advance. Model: https://hub.ultralytics.com/models/9nVppEnRgfYxE9aGROjl |
@joseabraham Hello! |
Hello @sergiuwaxmann. I'm running into the same issue as the above. In the Hub it's showing 100% with last checkpoint saved for epoch 299 (of 300). Would you be able to perform the same fix for me? I dug into the locally saved weights/epoch-299.pt and I'm seeing the file showing the "NoSuchKey" error. I'm running everything on a custom agent. Model URL: https://hub.ultralytics.com/models/ZgZqmopBn22vCOEzBrUS Thank you!! |
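For anyone who wants to check locally whether a saved checkpoint is actually a storage error response (as with the 1 KB download and the "NoSuchKey" file mentioned above), here is a minimal sketch. It is not part of the Ultralytics tooling: the function name, the size threshold, and the assumption that the error body is a small XML-like document are all illustrative guesses, so adjust them to what you see in your own file.

```python
from pathlib import Path

# Hypothetical threshold (an assumption, not an Ultralytics value): real
# checkpoints are usually megabytes, so a file under ~10 kB is suspicious.
MIN_PLAUSIBLE_BYTES = 10_000


def looks_like_error_body(path, min_bytes=MIN_PLAUSIBLE_BYTES):
    """Return True if the file looks like a storage error response
    (small, and XML-like or mentioning NoSuchKey) rather than weights."""
    p = Path(path)
    size = p.stat().st_size

    # Only the first few hundred bytes are needed to sniff the content.
    with p.open("rb") as f:
        head = f.read(512)

    stripped = head.lstrip()
    is_xml = stripped.startswith(b"<?xml") or stripped.startswith(b"<Error")
    mentions_error = b"NoSuchKey" in head

    # Require both signals: the file is tiny AND it reads like an error body.
    return size < min_bytes and (is_xml or mentions_error)
```

Calling `looks_like_error_body("weights/epoch-299.pt")` before resuming should flag a file like the one described above. If it returns `True`, the checkpoint upload most likely failed, and reverting to an earlier epoch (as done in this thread) is the fix rather than re-running the cell.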
@bcastagna1 I’ve reverted the checkpoint to epoch 137 (previous successful checkpoint upload), which should allow you to resume training immediately. |
Thank you @sergiuwaxmann ! |
@sergiuwaxmann Hi, I also have the same problem. Could you please revert my model to the previous checkpoint? Thanks in advance for your help: |
@KonDan2310 I’ve reverted the checkpoint to epoch 75 (previous successful checkpoint upload), which should allow you to resume training immediately. |
Thanks for your help |
You're welcome! 😊 If you encounter any further issues or have additional questions, feel free to ask. Happy training and best of luck with your project! 🚀 |
Search before asking
HUB Component
Training
Bug
I am training a model using Google Colab, and when I try to resume by executing the commands:
the following error message appears:
I followed your recommendations to solve this issue: I reran the training cell, checked the internet connection, and cleared the Colab environment, but the issue persists. For further investigation, I am appending details of the error after the rerun:
Environment
Google Colab
Minimal Reproducible Example
Additional
No response