Hi, thanks for sharing your great work; it has helped me a lot.
I am trying to replicate the results of DSMIL and make some modifications based on it. However, I encountered some problems when training on the Camelyon16 dataset. I would greatly appreciate any guidance and advice, as my experience with training deep learning models and analyzing pathological images is limited.
Firstly, I randomly divided the official training set into 5 folds with an even distribution of labels and then ran 5-fold cross-validation: the model was trained on 4 folds, with the remaining fold serving as the validation set, and the best model was then used to predict on the official test set. The test results from the 5 folds were averaged. Is this approach correct?
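For clarity, here is a minimal sketch of the split I used (the slide ids and label counts below are placeholders, not my actual file list):

import numpy as np
from sklearn.model_selection import StratifiedKFold

# placeholder slide ids and bag labels (0 = normal, 1 = tumor)
slide_ids = np.array([f"slide_{i:03d}" for i in range(270)])
labels = np.array([0] * 160 + [1] * 110)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(slide_ids, labels)):
    train_slides, val_slides = slide_ids[train_idx], slide_ids[val_idx]
    # train on train_slides, pick the best checkpoint on val_slides,
    # then evaluate that checkpoint on the official Camelyon16 test set
    print(fold, len(train_slides), len(val_slides))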
Secondly, when training my own MIL model, I sometimes experience unstable training on certain folds, or even a steadily increasing validation loss, as shown in the figure below.
However, when I select the model with the highest AUC on the validation set, the final results on the test set are not too bad. I wonder whether these training issues arise from a flaw in my model design or from improper training parameter settings. I have adopted the training parameter settings from the DSMIL code:
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(milnet.parameters(), lr=0.0001, betas=(0.5, 0.9), weight_decay=5e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 200, 0.000005)
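For completeness, this is roughly how I wire those settings into the epoch loop; the `milnet` and the one-bag loader below are stand-ins, not my actual model or data:

import torch
import torch.nn as nn

# stand-in MIL network and a single dummy bag (1000 patches, 512-dim features)
milnet = nn.Linear(512, 1)
train_loader = [(torch.randn(1000, 512), torch.tensor([1.0]))]

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(milnet.parameters(), lr=1e-4, betas=(0.5, 0.9), weight_decay=5e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=5e-6)

for epoch in range(200):
    milnet.train()
    for feats, label in train_loader:                 # feats: [num_patches, feat_dim]
        optimizer.zero_grad()
        bag_logit = milnet(feats).max(dim=0).values   # pool patch logits into one bag logit
        loss = criterion(bag_logit, label)
        loss.backward()
        optimizer.step()
    scheduler.step()                                  # stepped once per epoch, so T_max=200 matches 200 epochs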
Third, when training with the Camelyon16 features downloaded via the script, I achieve an exceptionally high AUC of 0.97, as mentioned in Requests for multiscale features #49. Then I downloaded the 'c16-multiscale-features' provided by the authors and used the first 512 dimensions of the 1024-dimensional features as single-scale features for training my model, but encountered several issues:
(1) DSMIL achieved AUCs of 0.77 and 0.81 on multi-scale and single-scale features, respectively, far below the results reported in the paper.
(2) Regardless of the model, the AUC on single-scale features is always higher than on multi-scale features.
(3) My MIL model's AUC ranges from 0.84 to 0.87 on single-scale features and from 0.81 to 0.84 on multi-scale features.
I wonder if anyone else has encountered similar issues, or whether it is a problem on my end?
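Concretely, this is how I derive the single-scale features from the provided multiscale ones (assuming the first 512 of the 1024 dimensions correspond to one magnification; the array below is a stand-in for a real bag):

import numpy as np

# stand-in for one bag of the provided 1024-dim multiscale features
feats_ms = np.random.randn(1000, 1024)   # [num_patches, 1024]
feats_ss = feats_ms[:, :512]             # first 512 dims treated as the single-scale input
print(feats_ss.shape)                    # (1000, 512)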
I incorporated training/testing into the same pipeline in the latest commit. You can set --eval_scheme=5-fold-cv-standalone-test, which will perform a train/valid/test split like this:
A standalone test set consisting of 20% of the samples is reserved; the remaining 80% of the samples are used for 5-fold cross-validation.
For each fold, the best model and corresponding threshold are saved.
After the 5-fold cross-validation, the 5 best models along with their optimal thresholds are used to perform inference on the reserved test set. The final prediction for a test sample is the majority vote of the 5 models.
For a binary classification, accuracy and balanced accuracy scores are computed. For a multi-label classification, hamming loss (smaller the better) and subset accuracy are computed.
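To make the scheme concrete, here is a minimal sketch of the majority-vote step for the binary case (the probabilities, thresholds, and labels below are placeholders, not the repository's actual code):

import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# per-fold probabilities on the reserved test set and per-fold optimal thresholds
fold_probs = np.random.rand(5, 100)                       # stand-in for real model outputs
fold_thresholds = np.array([0.5, 0.45, 0.55, 0.5, 0.48])  # stand-in thresholds from validation
y_true = np.random.randint(0, 2, size=100)                # stand-in test labels

fold_preds = (fold_probs >= fold_thresholds[:, None]).astype(int)  # [5, num_test_samples]
majority_vote = (fold_preds.sum(axis=0) >= 3).astype(int)          # at least 3 of the 5 models agree

print("accuracy:", accuracy_score(y_true, majority_vote))
print("balanced accuracy:", balanced_accuracy_score(y_true, majority_vote))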
You can also simply run a plain 5-fold CV with --eval_scheme=5-fold-cv.
There were some issues with the testing script when loading pretrained weights (i.e., sometimes the weights are not fully loaded or some weights are missing; setting strict=False can reveal the problem). I will fix this in a couple of days.
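If you want to check your own checkpoints in the meantime, a generic PyTorch snippet like this shows which weights were actually matched (the model and checkpoint below are stand-ins, not the repository's testing script):

import torch
import torch.nn as nn

# stand-in model and a deliberately incomplete checkpoint; replace with the real milnet and .pth file
milnet = nn.Sequential(nn.Linear(512, 128), nn.Linear(128, 1))
checkpoint = {"0.weight": torch.randn(128, 512)}

result = milnet.load_state_dict(checkpoint, strict=False)
print("missing keys:", result.missing_keys)        # params the model has but the checkpoint lacks
print("unexpected keys:", result.unexpected_keys)  # entries in the checkpoint the model does not use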