This repository provides the official implementation of the following paper:
Learning to Discover and Detect Objects
Vladimir Fomenko,
Ismail Elezi,
Deva Ramanan,
Laura Leal-Taixé,
Aljoša Ošep
In Advances in Neural Information Processing Systems 35 (NeurIPS 2022).
Abstract: We tackle the problem of novel class discovery, detection, and localization (NCDL). In this setting, we assume a source dataset with labels for objects of commonly observed classes. Instances of other classes need to be discovered, classified, and localized automatically based on visual similarity, without human supervision. To this end, we propose a two-stage object detection network RNCDL, that uses a region proposal network to localize potential objects and classify them. We train our network to classify each proposal, either as one of the known classes, seen in the source dataset, or one of the extended set of novel classes with a constraint that the distribution of class assignments should follow natural long-tail distributions common in the real open-world. By training our detection network with this objective in an end-to-end manner, it learns to classify all region proposals for a large variety of classes, including those that are not part of the labeled object class vocabulary.
Our implementation is based on the Detectron2 v0.6 framework. For our experiments, we followed Detectron2's setup guide and picked the CUDA 11.3 and torch 1.10 versions (python -m pip install detectron2 -f may work). Other Detectron2 versions should work as well, but we did not test them. All additional dependencies are listed in requirements.txt. We used Python 3.8 for all experiments.
After installing the detectron2 package, several small modifications to the library are required to speed up the evaluation procedures. To identify detectron2's installation folder, you can use pip show detectron2 (example output: anaconda/envs/d2/lib/python3.8/site-packages/detectron2).
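As an alternative to pip show, the installation folder can also be located from Python. The following is a minimal sketch that uses only the standard library; the helper name find_package_dir is hypothetical and is not part of this repository or of Detectron2.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_package_dir(package_name: str) -> Optional[Path]:
    """Return the folder of an installed top-level package, or None if absent."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        return None  # the package is not importable in this environment
    if spec.submodule_search_locations:
        # Packages expose the folder(s) that hold their submodules here.
        return Path(list(spec.submodule_search_locations)[0])
    if spec.origin:
        # Fallback for single-file modules: the folder containing the file.
        return Path(spec.origin).parent
    return None


if __name__ == "__main__":
    # Prints the detectron2 folder, or None if detectron2 is not installed.
    print(find_package_dir("detectron2"))
```

Run it inside the same environment that is used for training, since the result depends on the active interpreter.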
Speeding up the evaluation by increasing batch size. Open the data/ file (e.g. $CONDA_ENV_PATH$/site-packages/detectron2/data/). Inside, modify the build_detection_test_loader function as follows:
- add a batch_size argument to the function definition (with a default value of 1 for backward compatibility)
- in the final statement, replace batch_size=1 with the new batch_size argument

This adds support for a variable batch size during evaluation.
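The idea behind this change, a batch size argument whose default keeps the old behaviour, can be illustrated on a pure-Python stand-in. This is only a sketch: build_test_loader below is a hypothetical toy function, not Detectron2's build_detection_test_loader, and it batches plain Python items rather than dataset dicts.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def build_test_loader(dataset: Iterable[T], batch_size: int = 1) -> Iterator[List[T]]:
    """Yield consecutive batches; batch_size=1 reproduces the original behaviour."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[T] = []
    for item in dataset:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # keep the incomplete final batch instead of dropping it


if __name__ == "__main__":
    # Existing callers that pass no batch_size still get one item per batch.
    print(list(build_test_loader(range(3))))
    # New callers can request larger batches.
    print(list(build_test_loader(range(5), batch_size=2)))
```

Because the default stays at 1, code that calls the function without the new argument behaves exactly as before.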
To run VisualGenome experiments, further modifications to the Detectron2 source code are required. In the LVIS + VisualGenome setup, we support a supervised training phase with mask annotations from LVIS (to train a class-agnostic mask head). However, VisualGenome annotations (used for evaluation) do not contain masks. This causes issues during data loading, as some images contain masks and some do not. To support this case, modify datasets/ (e.g. $CONDA_ENV_PATH$/site-packages/detectron2/datasets/) at line 150 by adding an if statement that checks whether a segmentation mask is present in a given annotation. Specifically, insert the following if statement at line 150 and place the segmentation-related logic under it:
if "segmentation" in anno:
    segm = anno["segmentation"]  # list[list[float]]
    obj["segmentation"] = segm
for extra_ann_key in extra_annotation_keys:  # <--- leave this line and the rest lines below untouched
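The effect of this guard can be checked in isolation with a simplified stand-in for the annotation-parsing loop. The sketch below is not the Detectron2 source: the function name annotation_to_obj and the use of a "bbox" field are assumptions made for illustration only.

```python
from typing import Any, Dict, Iterable


def annotation_to_obj(
    anno: Dict[str, Any], extra_annotation_keys: Iterable[str] = ()
) -> Dict[str, Any]:
    """Convert one raw annotation, copying the mask only when it exists."""
    obj: Dict[str, Any] = {"bbox": anno["bbox"]}
    if "segmentation" in anno:
        # LVIS-style annotation: polygons are present, so keep them.
        obj["segmentation"] = anno["segmentation"]
    # Box-only annotations (as in VisualGenome) simply skip the block above.
    for extra_ann_key in extra_annotation_keys:
        obj[extra_ann_key] = anno[extra_ann_key]
    return obj


if __name__ == "__main__":
    lvis_like = {"bbox": [0, 0, 10, 10], "segmentation": [[0, 0, 10, 0, 10, 10]]}
    vg_like = {"bbox": [5, 5, 20, 20]}
    print(annotation_to_obj(lvis_like))
    print(annotation_to_obj(vg_like))
```

Without the guard, the box-only annotation would raise a KeyError on the "segmentation" lookup.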
For COCO + LVIS experiments, first download COCO images + annotations and LVIS annotations as instructed in the Detectron2 data documentation. For COCO, please download COCO 2017 data with standard train/val annotations.
After downloading and extracting the standard datasets, please download the additional annotations for the COCOhalf dataset introduced in our paper: coco_half_train.json and coco_half_val.json. Place them in the coco/annotations/ folder. To see how this dataset is registered with Detectron2, see the configs/data/ script, which is called automatically during training.
As LVIS and VisualGenome datasets largely overlap (50K images) and VisualGenome does not provide a default train-val split, for our work we devised a specific split. We selected our split so that the validation images for LVIS + VisualGenome setup are a subset of LVIS validation images. Specifically, we use LVIS validation images that can also be found in VisualGenome. For more details, please see our paper (supplementary).
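The selection rule for the validation split can be sketched as a set intersection. This is a simplified illustration, not the script used for the paper: it assumes that both datasets refer to images by a shared identifier (for example, a COCO image id), and the function name select_val_images is hypothetical.

```python
from typing import Iterable, List


def select_val_images(
    lvis_val_ids: Iterable[int], visualgenome_ids: Iterable[int]
) -> List[int]:
    """Keep the LVIS validation images that also appear in VisualGenome."""
    vg_ids = set(visualgenome_ids)
    # Deduplicate the LVIS ids and keep only those shared with VisualGenome.
    return sorted(image_id for image_id in set(lvis_val_ids) if image_id in vg_ids)


if __name__ == "__main__":
    # Toy ids: only 2 and 3 occur in both lists.
    print(select_val_images([3, 1, 2, 2], [2, 3, 4]))
```

By construction, the result is always a subset of the LVIS validation ids, which matches the property stated above.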
First, download the COCO and LVIS datasets as instructed above. Then, download the VisualGenome v1.2 images from the official website and put all images in a visualgenome/ subfolder of the folder where the COCO images are located. By default, that corresponds to $DETECTRON2_DATASETS$/coco/visualgenome/, so an example image path would be $DETECTRON2_DATASETS$/coco/visualgenome/2386299.jpg.
For annotations, we pre-processed the official VisualGenome annotations to match the Detectron2 format. Please download our custom annotation files visualgenome_train.json and visualgenome_val.json and place them in the $DETECTRON2_DATASETS$/visualgenome/ folder (to use a different folder, modify the configs/data/ file accordingly).
The final structure of your DETECTRON2_DATASETS folder should be the following:
..original coco annotations (optional)..
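Before launching a job, it can help to verify that the files named in the instructions above are in place. The following is a minimal sketch: it checks only the custom annotation files and the VisualGenome image folder mentioned in this section, not the standard COCO and LVIS files, and the helper name missing_dataset_entries is hypothetical.

```python
import os
from pathlib import Path
from typing import List

# Paths relative to the DETECTRON2_DATASETS folder, as named in this section.
EXPECTED_FILES = (
    "coco/annotations/coco_half_train.json",
    "coco/annotations/coco_half_val.json",
    "visualgenome/visualgenome_train.json",
    "visualgenome/visualgenome_val.json",
)
EXPECTED_DIRS = ("coco/visualgenome",)


def missing_dataset_entries(root: str) -> List[str]:
    """Return the expected files and folders that are absent under root."""
    base = Path(root)
    missing = [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]
    missing += [rel + "/" for rel in EXPECTED_DIRS if not (base / rel).is_dir()]
    return missing


if __name__ == "__main__":
    # Detectron2 falls back to ./datasets when the variable is unset.
    root = os.environ.get("DETECTRON2_DATASETS", "datasets")
    for entry in missing_dataset_entries(root):
        print("missing:", entry)
```

An empty result means that all entries listed in the two tuples were found.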
During the first, supervised phase, we initialize R-CNN from weights pretrained in a self-supervised manner, specifically MoCo v2. The table in the next sub-section includes the weights that we downloaded in early 2022. If you would rather obtain the weights from the official MoCo v2 repository, follow the next paragraph.
To obtain the original MoCo v2 weights, download the 800-epoch weights listed in the table in the official MoCo repository. Then clone the official moco repository and run the conversion script described there to convert the weights to the Detectron2 format.
Below we provide weights for some of the trained models. For more details on the models' hyperparameters, please see our manuscript. Unfortunately, we cannot provide the weights for the original network described in the manuscript. Instead, to make our results reproducible, we trained another network for 7500 iterations and provide its weights. It surpasses the scores reported in the paper, achieving 7.35 mAP on all classes.
| Description | Link | Scores |
| --- | --- | --- |
| Pretrained MoCo v2 ResNet (official moco repository), used for backbone initialization | moco_v2_800ep_pretrain.pkl | N/A |
| Fully-supervised Mask-RCNN with FPN and Res50 backbone, trained on COCOhalf dataset, used as the initialization for the discovery phase | supervised_cocohalf_maskrcnn.pth | mAP (COCOhalf): 35.69 |
| Fully-supervised Mask-RCNN with FPN and Res50 backbone, trained on LVIS dataset, used as the initialization for the discovery phase | supervised_lvis_maskrcnn.pth | mAP (LVIS): 18.47 |
| RNCDL network trained for discovery mode on COCOhalf and the rest of unlabeled images with the number of unlabeled classes set to 3000 | discovery_cocohalf_lvis_maskrcnn_lr3e-2_iter7500.pth | mAP (COCOhalf): 24.46, mAP (LVIS): 5.94, mAP (all): 7.35 |
We use Detectron2's lazy configuration to define our framework. The configurations are located in the configs/ folder, and the root config files can be found in configs/train/.
For convenience, we provide example Slurm job scripts based on the scripts that we used to execute the jobs. They can be found in slurm_scripts/.
All our experiments were tested on 4 NVIDIA A40 GPUs with 48 GB of memory each.
To train our fully-supervised baselines, please use the scripts in the slurm_scripts/fully_supervised/ folder. E.g., to train a fully-supervised R-CNN on COCOhalf, use:
DETECTRON2_DATASETS=/path/to/datasets \
python tools/ \
--config-file ./configs/train/fully_supervised/ \
--num-gpus 4 \
--dist-url 'tcp://localhost:10042' \
train.init_checkpoint=./checkpoints/moco_v2_800ep_pretrain.pkl \
train.output_dir=./output \
train.exp_name="cocohalf-supervised"
To train our discovery networks, please use the scripts in the slurm_scripts/discovery/ folder. E.g., to run discovery training for the COCOhalf + LVIS setup, use:
DETECTRON2_DATASETS=/path/to/datasets \
python tools/ \
--config-file ./configs/train/discovery/ \
--num-gpus 4 \
--dist-url 'tcp://localhost:10042' \
train.exp_name="coco50pct_lvis-discovery" \
train.output_dir=./output \
train.eval_period=999999 \
discovery_evaluator.evaluator.output_dir=./output \
train.init_checkpoint=./checkpoints/supervised_cocohalf_maskrcnn.pth \
train.max_iter=15000 \
train.seed=42 \
model_proposals_extraction_param.test_nms_thresh=1.01 \
model_proposals_extraction_param.test_topk_per_image=50 \
train.supervised_loss_lambda=0.5 \
model_supervised.roi_heads.box_predictor.discovery_model.num_unlabeled=3000 \
model_supervised.roi_heads.box_predictor.discovery_model.sk_mode="lognormal" \
model_supervised.roi_heads.box_predictor.discovery_model.memory_batches=100
To run discovery training for LVIS + VisualGenome setup, modify the command above as follows:
python tools/ \
--config-file ./configs/train/discovery/ \
train.init_checkpoint=./checkpoints/supervised_lvis_maskrcnn.pth \
model_supervised.roi_heads.box_predictor.discovery_model.num_unlabeled=5000 \
Sometimes, after successfully finishing the discovery training phase and saving the weights, the script hangs or crashes during evaluation.
First, note that discovery evaluation may take up to 2-3 hours, so please make sure you give your script enough time to complete.
If you cannot obtain the results after several hours, you may cancel your job and re-use the training script above with the dumped weights. To do so, re-run the script above with the model weights set to your newly generated checkpoint (called model_final.pth by default in Detectron2), train.max_iter set to 1, learning rate and weight decay set to 0, and train.weights_mode="from_discovery". With this configuration, the script will load the saved weights and proceed to the evaluation. E.g.:
python tools/ \
train.init_checkpoint="discovery_checkpoint.pth" \
train.max_iter=1 \
optimizer.weight_decay=0.0 \
Even after this modification, evaluation for the discovery part may take several hours.
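The overrides for such an evaluation-only re-run can be collected programmatically, for example when generating job scripts. The sketch below only assembles the key=value strings named in this section; the function name eval_only_overrides is hypothetical. The config key for the learning rate is not shown in this section, so it is left as an optional parameter to be supplied by the user rather than guessed.

```python
from typing import List, Optional


def eval_only_overrides(checkpoint: str, lr_key: Optional[str] = None) -> List[str]:
    """Build config overrides that load saved weights and go to evaluation."""
    overrides = [
        f"train.init_checkpoint={checkpoint}",
        "train.max_iter=1",  # a single iteration, then evaluation
        "optimizer.weight_decay=0.0",
        # Shell quotes are omitted: each list item is already one argument.
        "train.weights_mode=from_discovery",
    ]
    if lr_key is not None:
        # Assumption: the caller passes the learning-rate key of their config.
        overrides.append(f"{lr_key}=0.0")
    return overrides


if __name__ == "__main__":
    # Append the printed overrides to the training command shown earlier.
    print(" ".join(eval_only_overrides("./output/model_final.pth")))
```

Setting the learning rate and weight decay to 0 ensures that the single training iteration does not change the loaded weights.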
If you find RNCDL useful in your research or reference it in your work, please star our repository and use the following citation:
author = {Vladimir Fomenko and Ismail Elezi and Deva Ramanan and Laura Leal-Taix{\'e} and Aljo\v{s}a O\v{s}ep},
title = {Learning to Discover and Detect Objects},
booktitle={Advances in Neural Information Processing Systems},
year = {2022}