No such file or directory: 'runs/detect/train/weights/best.pt' #485

Closed · Fistcar opened this issue Nov 29, 2023 · 10 comments
Labels: bug (Something isn't working)


Fistcar commented Nov 29, 2023

Search before asking

  • I have searched the HUB issues and found no similar bug report.

HUB Component

Training

Bug

I've been trying to train a model in the HUB to run on my phone. The training got to epoch 97 out of 100 and then simply errored out. I've tried using Firefox and Edge, but the same errors occur. The error was 'No such file or directory: 'runs/detect/train/weights/best.pt'' and I could find no way to move that file from the HUB to Google Colab. I created an empty best.pt file, and that just seemed to cause more errors, as now I see 'raise RuntimeError("Invalid magic number; corrupt file?") EOFError: Ran out of input'. How can I fix this? I've wasted over 5 compute hours on Google Colab simply trying to finish the training of this network. I do not want to start training all over again. The only reason I am using the HUB is to test the network out on my phone.

Environment

  • Ultralytics HUB Version: v0.1.31
  • Client User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0
  • Operating System: Win32
  • Browser Window Size: 2000 x 1038
  • Server Timestamp: 1701285411

Minimal Reproducible Example

[screenshot]

Additional

No response

Fistcar added the bug (Something isn't working) label Nov 29, 2023

👋 Hello @Fistcar, thank you for raising an issue about Ultralytics HUB 🚀! Please visit our HUB Docs to learn more:

  • Quickstart. Start training and deploying YOLO models with HUB in seconds.
  • Datasets: Preparing and Uploading. Learn how to prepare and upload your datasets to HUB in YOLO format.
  • Projects: Creating and Managing. Group your models into projects for improved organization.
  • Models: Training and Exporting. Train YOLOv5 and YOLOv8 models on your custom datasets and export them to various formats for deployment.
  • Integrations. Explore different integration options for your trained models, such as TensorFlow, ONNX, OpenVINO, CoreML, and PaddlePaddle.
  • Ultralytics HUB App. Learn about the Ultralytics App for iOS and Android, which allows you to run models directly on your mobile device.
    • iOS. Learn about YOLO CoreML models accelerated on Apple's Neural Engine on iPhones and iPads.
    • Android. Explore TFLite acceleration on mobile devices.
  • Inference API. Understand how to use the Inference API for running your trained models in the cloud to generate predictions.

If this is a 🐛 Bug Report, please provide screenshots and steps to reproduce your problem to help us get started working on a fix.

If this is a ❓ Question, please provide as much information as possible, including dataset, model, environment details etc. so that we might provide the most helpful response.

We try to respond to all issues as promptly as possible. Thank you for your patience!

UltralyticsAssistant (Member):

@Fistcar hello! I'm sorry to hear you're encountering issues with your model training on the HUB.

The error you're experiencing indicates the training script is unable to locate the 'best.pt' file which should contain the weights of your best-performing model. This file is typically saved automatically during the training process when a new best metric is achieved.

If you cannot find the 'best.pt' file in the specified directory, it is possible that either the file was not created due to an interruption in the training process or it may have been inadvertently moved or deleted.

Creating an empty 'best.pt' file is not a valid solution since the file needs to contain specific data serialized in a format that the PyTorch framework can understand. An empty file or a file with invalid contents will cause the 'Invalid magic number; corrupt file?' error you're seeing.

Here's what you can do:

  • First, please check your training output directory again to ensure the 'best.pt' file wasn't created in a different folder by accident.
  • Then, investigate if the training process was interrupted for some reason before it could save the 'best.pt' file.
  • If you are certain that the file has been lost, you may need to access the last saved checkpoint, if one exists, to avoid starting the training from scratch. Check for any checkpoints such as 'last.pt' that may have been saved during training (a minimal resume sketch follows this list).
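
If 'last.pt' does exist, a minimal sketch for picking the run back up with the Ultralytics Python API could look like this (the checkpoint path assumes the default output location; adjust it to wherever your run actually saved):

from ultralytics import YOLO

# Minimal sketch: resume an interrupted run from its last saved checkpoint.
# The path assumes the default output location; adjust it if your run saved elsewhere.
ckpt = "runs/detect/train/weights/last.pt"
model = YOLO(ckpt)
model.train(resume=True)  # continue the same run from the checkpoint's epoch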

If you continue to have difficulties, you may refer to the HUB documentation to ensure your training setup and process are configured correctly. Training checkpoints and the way to continue training from them should also be covered in the documentation, which may help prevent the need to start over.

I understand how frustrating it can be to encounter such issues, especially after a significant amount of training time. I hope this guidance is helpful, and we're here for any further assistance you may need. Good luck! 🤞

Fistcar (Author) commented Nov 30, 2023

I let an epoch run and then made a copy of last.pt and named it best.pt. The training finished, but then Colab locked up somehow and got stuck here:
[screenshot]

kalenmike (Contributor):

@Fistcar If you have been manually creating best.pt, there is probably not much we can do to save your model. You can try this notebook and see if you have any luck:

https://colab.research.google.com/drive/1vW8xNoNi89Y4yWratNVUpqPp3d-9bY45#scrollTo=-xtsX6NxdxHz

Fistcar (Author) commented Nov 30, 2023

I've not been manually creating it; I only tried that this morning because I wanted the HUB to work for me. I'll try that notebook. I don't know what happened to best.pt for it to be missing. The only other oddity I've noticed is that the HUB says the network is on epoch 97, but whenever I resume training (see previous images) the notebook starts training at epoch 99... I do not know what happened to epoch 98.
[screenshot]

Fistcar (Author) commented Nov 30, 2023

@kalenmike Your notebook allowed the model to be used in the hub and on my phone. Thank you.

kalenmike (Contributor):

@Fistcar This is a limitation we are trying to fix at the moment. The problem arises when training resumes on a fresh instance and never achieves a new best mAP. Because the run completes without outperforming an epoch from before the resume, and the fresh environment no longer has that previous best checkpoint saved, the final upload fails.

This can be avoided by ensuring that your environment does not get reset during resume.

kalenmike (Contributor):

@Fistcar Glad to hear. We are working on a way to avoid these issues but for the moment we can only fix them if they occur.

kalenmike self-assigned this Nov 30, 2023
Ray150789:

My training completed, but I can't find the best.pt file; it says results were saved to runs/detect/train. My folders include runs/detect/predict, which contains the output video files. I want to retrain because the outputs are low quality.

pderrenger (Member):

Hello @Ray150789!

It appears that your training process completed but the best.pt file is missing, which can sometimes happen if the training environment was interrupted or a checkpoint wasn't properly saved. Typically, the best.pt file (representing the weights with the best performance during training) and last.pt (weights from the final epoch) are stored in the runs/detect/train/weights/ directory. If this folder is missing or doesn't contain the expected files, here are some steps you can follow:

1. Check Saved Directories

Verify the exact location of your training output. By default, results are saved to runs/detect/train/, and the weights should be in runs/detect/train/weights/. If the folder is missing or empty:

  • Ensure the training process wasn't interrupted before saving the weights.
  • Check if the training log mentions any errors or interruptions.
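
If it isn't obvious where the weights went, a small sketch like the following can list every checkpoint under runs/ (this only assumes the default runs/ output root):

from pathlib import Path

# List every .pt checkpoint under runs/ in case the weights landed in an
# unexpected sub-folder such as runs/detect/train2/weights/.
for ckpt in sorted(Path("runs").rglob("*.pt")):
    print(ckpt, ckpt.stat().st_size, "bytes")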

2. Use last.pt for Retraining

If you cannot find best.pt, you can use last.pt (if available) for further training or evaluation. You can find it in the same weights/ directory. It may not have the absolute best metrics but will still allow you to continue from where the training left off or fine-tune the model.

Here’s how you can resume training using last.pt:

yolo detect train data=your_data.yaml model=runs/detect/train/weights/last.pt epochs=50

This will resume training from the last saved state.
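
If you are working in a notebook rather than the CLI, a rough Python-API equivalent of the command above might look like this (the data YAML and checkpoint path are the same placeholders as in the CLI command):

from ultralytics import YOLO

# Rough Python-API equivalent of the CLI command above; 'your_data.yaml' is a
# placeholder for your dataset config.
model = YOLO("runs/detect/train/weights/last.pt")  # start from the last saved weights
model.train(data="your_data.yaml", epochs=50)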

3. Retraining with Output Videos

If you're looking to retrain the model due to low-quality predictions, you can enhance your dataset by incorporating new training examples. For example:

  • Extract frames from your output videos showing poor predictions (see the frame-extraction sketch after this list).
  • Annotate these frames using tools like Roboflow or LabelImg.
  • Add these annotated images to your original dataset and retrain the model using the combined dataset.
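
As a rough example of the frame-extraction step, a sketch using OpenCV could look like the following (the video path and output folder are hypothetical placeholders; OpenCV is assumed to be installed):

import cv2
from pathlib import Path

# Hypothetical paths: point these at your actual prediction video and a folder
# where extracted frames should be written for annotation.
video_path = "runs/detect/predict/output.mp4"
out_dir = Path("frames")
out_dir.mkdir(exist_ok=True)

cap = cv2.VideoCapture(video_path)
fps = int(cap.get(cv2.CAP_PROP_FPS)) or 30  # fall back to 30 if FPS is unreported
idx = saved = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % fps == 0:  # keep roughly one frame per second
        cv2.imwrite(str(out_dir / f"frame_{saved:05d}.jpg"), frame)
        saved += 1
    idx += 1
cap.release()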

4. Debugging Missing best.pt

If this issue persists, it could be related to a bug in the training process or an environment-specific issue. Ensure you're using the latest version of Ultralytics software. You can upgrade to the latest version by running:

pip install ultralytics --upgrade

If you're still unable to locate the weights or run into issues, consider sharing your training logs or specific error messages so we can assist further. Additionally, you can refer to the HUB Quickstart guide to streamline your workflow.

I hope this helps, and feel free to share any updates or additional questions! 😊
