Generating multiple cloud generations at the same time, even if 1 is on local runtime, makes the rest not work on Ultralytics HUB. #950

Open
1 task done
AntDX316 opened this issue Dec 11, 2024 · 10 comments
Labels
bug Something isn't working HUB Ultralytics HUB issues

Comments

@AntDX316

AntDX316 commented Dec 11, 2024

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

Starting multiple Ultralytics HUB cloud training runs at the same time, even when one of them is on a local runtime and was started first, makes the rest fail on Ultralytics HUB.

After that, the projects that failed cannot be deleted.

Environment

This appears to be an Ultralytics HUB cloud issue, because running two local Google Colab runtimes with JupyterLab at the same time does work.

Minimal Reproducible Example

Start multiple cloud training runs at the same time, even with one of them on a local runtime. The rest do not work on Ultralytics HUB.

After that, the projects that failed cannot be deleted.

Additional

Please fix.

@AntDX316 AntDX316 added the bug Something isn't working label Dec 11, 2024
@UltralyticsAssistant UltralyticsAssistant added the HUB Ultralytics HUB issues label Dec 11, 2024
@UltralyticsAssistant
Member

👋 Hello @AntDX316, thank you for reporting this issue with Ultralytics HUB 🚀! Your feedback is invaluable in helping us improve. Please check out our HUB Docs for additional information and guidance.


For your issue:

If this is a 🐛 Bug Report, could you please provide the following to help narrow down the cause and work towards a fix?

  1. Screenshots, if applicable, to highlight the behavior you're seeing.
  2. A more detailed minimum reproducible example (MRE) that specifically outlines steps starting from project creation to the generation issue. You can refer to our guide for creating an MRE.

An Ultralytics engineer will review your issue and assist shortly. Thank you for bringing this to our attention, and we truly appreciate your patience! 🙏✨

@AntDX316
Author

AntDX316 commented Dec 11, 2024

The cloud training still doesn't seem to work. Local runtime works great on the RTX 4090.

@AntDX316
Author

I'm not sure. Getting cloud training to work is unreliable, even when everything else has completed. The H100 HBM3 wasn't working earlier, then it worked. A lot of the other stuff works, perhaps everything that isn't local runtime.

@AntDX316
Author

It's completely broken. Not even the cloud 3080 instance that worked before works anymore.

@pderrenger
Member

Thank you for raising this issue. It sounds like you're encountering consistent problems with Ultralytics HUB's cloud training functionality across different instances. Let's address this step by step:

  1. Verify Cloud Training Status: Sometimes, cloud training issues can occur due to high demand or temporary server-side limitations. Please ensure you're using the latest version of Ultralytics HUB and try again after a short period.

  2. Single Cloud Training Limitation: As noted in the Cloud Training documentation, only one cloud training session can run at a time per user. Attempting to initialize multiple cloud sessions might cause conflicts. Can you confirm if any other training sessions were active when this issue occurred?

  3. Instance Availability: For cloud GPUs like the NVIDIA T4 or others, availability depends on demand. If the instance initialization fails, it could mean resources are temporarily unavailable. Please try selecting a different training duration or instance type if available.

  4. Account Balance and Billing Check: Ensure your account balance is sufficient if you're using "Epochs" training. For "Timed" training, confirm your payment method is set up correctly.

  5. Local Runtime as a Backup Option: Since you mentioned the local runtime works flawlessly on your RTX 4090, you can continue training locally as a temporary measure while we troubleshoot this issue.

Next Steps

Could you provide additional details, such as:

  • The exact error messages (if any) you are observing.
  • The steps leading up to the failure (e.g., instance selection, dataset upload, etc.).
  • Confirmation that you're using the latest version of HUB.

If nothing resolves the issue, I recommend submitting a support ticket via the HUB interface or directly sharing the logs through the GitHub Issues Tab for a deeper investigation. The Ultralytics team will look into this promptly.

Your feedback is invaluable, and we appreciate your patience while we work to improve the cloud training experience. Let us know if there's anything else we can assist you with! 😊

@yogendrasinghx
Member

Hi @AntDX316,

We sincerely apologize for the inconvenience you're experiencing and truly appreciate you bringing this to our attention. To better understand and investigate the issue, could you please provide additional details and steps to reproduce the problem? Specifically, it would be helpful if you could share:

  1. Dataset Information: Which dataset are you using for the training?
  2. GPU Selection: Which GPU are you selecting for Ultralytics Cloud training (e.g., NVIDIA GeForce RTX 4090, NVIDIA H100 PCIe)?
  3. Training Configuration:
    • Number of epochs
    • Image size
    • Any other relevant configuration options
  4. Screenshots: If possible, please share screenshots highlighting the issue, particularly when the problem occurs.
  5. Model ID: You can find it in the URL when you open the model in the HUB web app, for example: https://hub.ultralytics.com/models/7wzkDSKNMcwkPTs8ZVJC.
    Model ID: 7wzkDSKNMcwkPTs8ZVJC

Providing the Model ID will allow us to locate your account and gain a better understanding of the issue. We sincerely apologize for any inconvenience this has caused and appreciate your patience as we work on a resolution.
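
For anyone who wants to pull the Model ID out of a copied link programmatically, here is a minimal sketch. It assumes model pages follow the `/models/<id>` path layout shown in the example URL above; the helper name `extract_hub_model_id` is hypothetical and is not part of any Ultralytics package.

```python
from urllib.parse import urlparse


def extract_hub_model_id(url: str) -> str:
    """Return the Model ID from a HUB model URL (hypothetical helper)."""
    # Drop surrounding whitespace and a trailing sentence period, which
    # often comes along when a URL is copied out of a comment.
    cleaned = url.strip().rstrip(".")
    parsed = urlparse(cleaned)
    # Split the path and ignore empty pieces from leading/trailing slashes.
    parts = [p for p in parsed.path.split("/") if p]
    # Assumption: model pages use the /models/<id> layout seen in this thread.
    if len(parts) != 2 or parts[0] != "models":
        raise ValueError(f"Not a HUB model URL: {url!r}")
    return parts[1]


print(extract_hub_model_id("https://hub.ultralytics.com/models/7wzkDSKNMcwkPTs8ZVJC"))
```

Pasting the returned ID into a reply should be enough for the maintainers to find the model; anything that does not match the assumed layout raises a `ValueError` instead of returning a guess.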

Looking forward to your response.

@AntDX316
Author

AntDX316 commented Dec 12, 2024

All of them. It's all default. I was using the carparts one with 3833 photos as an example.

@pderrenger
Member

Hi @AntDX316,

Thank you for the update and for providing the dataset details. Based on your description, it appears you're using the default settings and the "carparts" dataset with 3,833 images. We'll need to investigate further to understand why the cloud training is failing across different GPU instances.

In the meantime, could you please help us with the following additional information to narrow down the issue?

  1. Error Messages: Are there any specific error messages displayed during the cloud training initialization or execution? If so, copying those here would be incredibly helpful.
  2. Training Instance: Which specific cloud GPU instance were you using for this training (e.g., NVIDIA T4, RTX 3080, RTX 4090, or H100)? If you tried multiple instances, let us know which ones were unsuccessful.
  3. Project ID/Model ID: If you can locate the Model ID or Project ID (available in the URL of your model or project page on the HUB), please share it here. This will allow us to directly access and debug the associated logs.

Temporary Workarounds:

  • Local Training: Since local runtime works well on your RTX 4090, you may want to proceed with that setup temporarily while we investigate the cloud issue.
  • Cloud Resource Availability: Cloud GPU resources may sometimes be in high demand, causing delays or initialization errors. If the issue persists, you can try initiating the training at a different time or opting for another GPU type (if available).

Your collaboration is invaluable in resolving this issue. Please provide the requested details, and we’ll work diligently to identify and address the root cause. Thank you for your patience and understanding! 😊

@AntDX316
Author

I assume it's fixed now?

I was just using your default dataset to test.

@pderrenger
Member

Hi @AntDX316,

Thanks for following up! If you're using the default dataset and still encountering issues, I recommend verifying the following:

  1. Ensure Latest Version: Make sure you're using the latest version of Ultralytics HUB. Updates often include critical fixes and improvements that could resolve your issue.

  2. Cloud Resource Availability: As a reminder, cloud GPU availability can sometimes fluctuate based on demand. If resource contention was causing issues earlier, it's possible that availability has improved now.

  3. Account Balance: If you're using the "Epochs" training option, confirm that your account balance is sufficient (minimum $5 required). Insufficient balance can cause training sessions to halt.

  4. Confirm with a Clean Test: Start a new training session using the default dataset and default configuration to confirm if the issue persists. If it works fine now, the problem may have been temporary.

If the problem recurs or you notice any unexpected behavior, please share:

  • Any error messages or logs you see.
  • The Model ID or Project ID (found in the URL) of the affected project for further investigation.

Your feedback is crucial to improving the platform, so please don’t hesitate to reach out with updates. Thanks for your patience and for testing out Ultralytics HUB! 😊

4 participants