Model training not getting completed/ Disconnected. Stuck at 100% #952

Open
1 task done
Sudhir1609 opened this issue Dec 13, 2024 · 16 comments
Labels: bug (Something isn't working), HUB (Ultralytics HUB issues), web (Related to web interface or web functionality)

Comments

@Sudhir1609

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Models

Bug

Training constantly gets stuck at 100% and never completes.

Model12

Environment

Ultralytics HUB Version: v0.1.79
Client User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
Operating System: Linux x86_64
Browser Window Size: 1848 x 932
Server Timestamp: 1734061093
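
For readers correlating this report with the training logs, the server timestamp can be converted to a readable date. This is a minimal sketch that assumes the value is a Unix epoch in seconds (the report does not state the unit); the variable names are illustrative only.

```python
from datetime import datetime, timezone

# Value copied from the Environment section; assumed to be Unix epoch seconds.
server_timestamp = 1734061093

# Convert to a timezone-aware UTC datetime so the result does not depend on the local zone.
reported_at = datetime.fromtimestamp(server_timestamp, tz=timezone.utc)
print(reported_at.isoformat())  # 2024-12-13T03:38:13+00:00
```

Under that assumption, the result falls on the same day the issue was opened.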

Minimal Reproducible Example

No response

Additional

No response

@Sudhir1609 added the "bug" label on Dec 13, 2024
@UltralyticsAssistant added the "HUB" label on Dec 13, 2024
@UltralyticsAssistant
Member

👋 Hello @Sudhir1609, thank you for reporting an issue about Ultralytics HUB 🚀! Please check out our HUB Docs for more information:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

It looks like you've reported a 🐛 bug where the model gets stuck at 100% completion and doesn't finalize. To help us investigate and resolve this, could you please provide a minimum reproducible example (MRE)? This includes:

  1. Detailed steps to reproduce the issue you're encountering.
  2. Screenshots or relevant logs that might give us more context.
  3. Information about any specific datasets, tasks, or customized configurations involved.

For guidance on creating an MRE, visit our Minimum Reproducible Example guide. 🛠️

An Ultralytics engineer will also review your issue and assist you shortly. Thank you for bringing this to our attention and for your patience! 😊

@Sudhir1609 changed the title from "Model not getting completed/ Disconnected" to "Model training not getting completed/ Disconnected. Stuck at 100%" on Dec 13, 2024
@sergiuwaxmann
Member

@Sudhir1609 Hello!
Can you share your model ID? You can find it in the URL of your model's page.

@Sudhir1609
Author

@sergiuwaxmann sergiuwaxmann self-assigned this Dec 13, 2024
@sergiuwaxmann
Member

@Sudhir1609 Yes, this URL points to your model. I can see your model is disconnected and the last epoch is 95.
You can try resuming the training while we investigate this issue further.

@yogendrasinghx added the "web" label on Dec 13, 2024
@Sudhir1609
Author

I've tried 'Resume Training' 5-6 times now, and every time it gets disconnected around the same epoch. I'm also worried about losing my funds.

@pderrenger
Member

Thank you for sharing the update, @Sudhir1609! I understand how frustrating this must be, especially with the concern about funds.

To address this, please try the following steps:

  1. Check Your Internet Stability: Cloud training sessions can sometimes disconnect if there are interruptions in your network stability, so ensure you're on a reliable connection.

  2. Inspect the Logs: From the model page, review the training logs to see if there's any specific error or indication of what's causing the disconnection.

  3. Resume Training: Since the issue persists around the same epoch, try reducing your batch size or tweaking your dataset settings to see if that resolves any potential resource constraints. You can adjust these settings when resuming training.

  4. Funds and Billing: Rest assured, the HUB deducts funds only for completed epochs. If the session disconnects before completing an epoch, the balance for that epoch should not be affected. You can verify this via the Billing tab in the HUB.

If the issue persists and you've already tried the above steps, please let us know. You can also share with us any specific error messages or logs that appear before the disconnection. We'll investigate further to ensure this gets resolved for you.

Thank you for your patience! 😊

@Sudhir1609
Author

Sudhir1609 commented Dec 13, 2024

I'm not able to change any configuration. Only the Resume Training option and the option to change the instance are enabled. How can I reduce the epoch size or tweak my dataset settings? @pderrenger

@sergiuwaxmann
Member

@Sudhir1609 Unfortunately, the number of epochs can't be changed after the model has started training.
Apologies for the inconvenience. We will refund the account balance you have used so far for this training, as we can see you tried resuming several times. Once we do this (you should see the balance back in your account in about 30 minutes), maybe you can try creating a new model and starting a fresh training?

@Sudhir1609
Author

@sergiuwaxmann Thanks, I was facing the same problem and tried the same steps for this model too.
https://hub.ultralytics.com/models/MD72j92nP9uX9fwShDIS

Thanks for your help!

@sergiuwaxmann
Member

@Sudhir1609 You should have your account balance back.
Maybe you can try choosing a different GPU? Which GPU did you use for the trainings that failed?

@sergiuwaxmann
Member

I believe the size of your dataset is causing an out-of-memory (OOM) issue, but we are still investigating this.
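
If the OOM hypothesis holds, a common mitigation in self-managed training (not something the HUB interface exposes, as noted earlier in this thread) is to retry with a smaller batch size after each out-of-memory failure. The sketch below only illustrates that idea: `OutOfMemory`, `train_with_batch_backoff`, and `fake_train` are hypothetical names, the trainer is simulated, and none of this is HUB or Ultralytics API.

```python
from typing import Callable, Tuple


class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for a framework's GPU out-of-memory error."""


def train_with_batch_backoff(
    train_once: Callable[[int], str],
    batch_size: int,
    min_batch_size: int = 1,
) -> Tuple[int, str]:
    """Retry training, halving the batch size after each out-of-memory failure."""
    while True:
        try:
            return batch_size, train_once(batch_size)
        except OutOfMemory:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further, so surface the error
            batch_size = max(min_batch_size, batch_size // 2)


def fake_train(batch_size: int) -> str:
    # Simulated trainer: pretend any batch above 16 images exhausts memory.
    if batch_size > 16:
        raise OutOfMemory(f"batch {batch_size} does not fit")
    return "completed"


result = train_with_batch_backoff(fake_train, batch_size=64)
print(result)  # (16, 'completed')
```

In this simulated run the batch size is halved from 64 to 32 to 16 before training succeeds. A real run would catch the framework's own out-of-memory exception instead of the placeholder class.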

@Sudhir1609
Author

@sergiuwaxmann I tried switching the instance between NVIDIA GeForce RTX 4090 and NVIDIA L40.

Thanks for the update. I'll try changing my dataset and try again.

@Sudhir1609
Author

Sudhir1609 commented Dec 16, 2024

@sergiuwaxmann I changed the dataset size and tried training the model again, but I faced the same problem:
https://hub.ultralytics.com/models/zEDjZlwIbNiMnrD1qVtT

Could you please let me know about this?

@yogendrasinghx
Member

@Sudhir1609 Thank you for your patience as we continue to investigate this issue. We're currently working to identify the root cause, but reproducing the problem has been challenging due to the large size of the dataset involved.

Please rest assured that we're actively working on this and will keep you updated as soon as we have more information. Apologies for the inconvenience, and thank you for your understanding! 🙏

@yogendrasinghx
Member

@Sudhir1609

Thank you for your patience and understanding as we looked into this issue. We have successfully reproduced the issue on our end and identified the root cause. The development team has been informed and is actively working on a fix.

We appreciate your cooperation and will update you as soon as the fix is deployed.

Thank you! 😊

@Sudhir1609
Author

@yogendrasinghx
Sure, thanks for the update. Please let me know once the problem is fixed.
I hope I can train the model seamlessly soon.

5 participants