The simplest way to use a custom dataset is to convert it to an existing dataset format, such as ImageNet.
For training, the ImageNet format distinguishes classes by folder. The training data directory is laid out as follows:
imagenet
├── ...
├── train
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ...
│ ├── ...
│ ├── n15075141
│ │ ├── n15075141_999.JPEG
│ │ ├── n15075141_9993.JPEG
│ │ ├── ...
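The folder layout above can be turned into (file, label) pairs with a few lines of Python. This is a minimal sketch, assuming labels are assigned by sorting the folder names alphabetically; the helpers `find_classes` and `list_samples` are hypothetical names for illustration and are not part of MMClassification.

```python
import os
import tempfile


def find_classes(train_dir):
    """Map each class sub-folder name to an integer label (sorted order)."""
    folders = sorted(
        d for d in os.listdir(train_dir)
        if os.path.isdir(os.path.join(train_dir, d)))
    return {name: idx for idx, name in enumerate(folders)}


def list_samples(train_dir):
    """Return (relative_path, label) pairs for every file under train_dir."""
    class_to_idx = find_classes(train_dir)
    samples = []
    for name, idx in class_to_idx.items():
        class_dir = os.path.join(train_dir, name)
        for filename in sorted(os.listdir(class_dir)):
            samples.append((os.path.join(name, filename), idx))
    return samples


# Build a tiny throwaway directory tree to demonstrate the helpers.
with tempfile.TemporaryDirectory() as root:
    layout = {'n01440764': ['a.JPEG'], 'n15075141': ['b.JPEG', 'c.JPEG']}
    for name, files in layout.items():
        os.makedirs(os.path.join(root, name))
        for filename in files:
            open(os.path.join(root, name, filename), 'w').close()
    class_to_idx = find_classes(root)
    samples = list_samples(root)
```

Sorting the folder names keeps the label assignment stable across runs, which matters because the same mapping must be used for training and evaluation.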
For validation, we provide an annotation list. Each line of the list contains a filename and its corresponding ground-truth label. The format is as follows:
ILSVRC2012_val_00000001.JPEG 65
ILSVRC2012_val_00000002.JPEG 970
ILSVRC2012_val_00000003.JPEG 230
ILSVRC2012_val_00000004.JPEG 809
ILSVRC2012_val_00000005.JPEG 516
Note: The value of ground-truth labels should fall in the range [0, num_classes - 1].
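Before training, it can help to check an annotation list against the label range stated above. The following is a rough sketch, not part of MMClassification: `parse_annotation_lines` is a hypothetical helper, and `num_classes=1000` is assumed here only because the sample lines come from ImageNet.

```python
def parse_annotation_lines(lines, num_classes):
    """Parse 'filename label' lines and check each label is in range."""
    samples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last whitespace so filenames may contain spaces.
        filename, label = line.rsplit(maxsplit=1)
        label = int(label)
        if not 0 <= label < num_classes:
            raise ValueError(
                f'line {line_no}: label {label} is outside '
                f'[0, {num_classes - 1}]')
        samples.append((filename, label))
    return samples


val_lines = [
    'ILSVRC2012_val_00000001.JPEG 65',
    'ILSVRC2012_val_00000002.JPEG 970',
]
samples = parse_annotation_lines(val_lines, num_classes=1000)
```

A label equal to `num_classes` (for example 1000 with 1000 classes) raises a `ValueError`, which catches the common off-by-one mistake of using 1-based labels.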
You can also write a new Dataset class inherited from BaseDataset and override load_annotations(self), like CIFAR10 and ImageNet.
Typically, this function returns a list, where each sample is a dict containing the necessary data information, e.g., img and gt_label.
Assume we are going to implement a Filelist dataset, which takes file lists for both training and testing. The format of the annotation list is as follows:
000001.jpg 0
000002.jpg 1
We can create a new dataset in mmcls/datasets/filelist.py to load the data.
import numpy as np

from .base_dataset import BaseDataset
from .builder import DATASETS


@DATASETS.register_module()
class Filelist(BaseDataset):

    def load_annotations(self):
        assert isinstance(self.ann_file, str)

        data_infos = []
        with open(self.ann_file) as f:
            # Each line has the form "<filename> <label>".
            samples = [x.strip().split(' ') for x in f.readlines()]
            for filename, gt_label in samples:
                info = {'img_prefix': self.data_prefix}
                info['img_info'] = {'filename': filename}
                info['gt_label'] = np.array(gt_label, dtype=np.int64)
                data_infos.append(info)
            return data_infos
Then add this dataset class to mmcls/datasets/__init__.py:
from .base_dataset import BaseDataset
...
from .filelist import Filelist
__all__ = [
    'BaseDataset', ..., 'Filelist'
]
Then, to use Filelist, modify the config as follows:
train = dict(
    type='Filelist',
    ann_file='image_list.txt',
    pipeline=train_pipeline
)
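For context, here is a sketch of how such a train entry might sit inside a fuller config. It assumes the usual layout where dataset settings live under a top-level data dict and where data_prefix points at the image folder. The paths, batch size, worker count, and the short pipeline are placeholders chosen for illustration, not values from this tutorial.

```python
# Placeholder pipeline; replace it with the transforms your model needs.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label']),
]

data = dict(
    samples_per_gpu=32,  # assumed batch size per GPU
    workers_per_gpu=2,   # assumed number of dataloader workers per GPU
    train=dict(
        type='Filelist',
        data_prefix='data/my_dataset/images',       # hypothetical path
        ann_file='data/my_dataset/image_list.txt',  # hypothetical path
        pipeline=train_pipeline,
    ),
)
```

The filenames in the annotation list are resolved relative to data_prefix, since load_annotations stores it as img_prefix for each sample.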
MMClassification also supports mixing datasets for training. Currently it supports concatenating and repeating datasets.
We use RepeatDataset as a wrapper to repeat the dataset. For example, suppose the original dataset is Dataset_A; to repeat it, the config looks like the following:
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
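To illustrate what repeating means, here is a small stand-alone sketch of a repeat wrapper. It is not the MMClassification implementation; `RepeatedView` is a hypothetical class that only shows the index arithmetic, assuming the wrapped dataset supports `len()` and integer indexing.

```python
class RepeatedView:
    """Present a dataset as if it were `times` copies placed end to end."""

    def __init__(self, dataset, times):
        self.dataset = dataset
        self.times = times
        self._ori_len = len(dataset)

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Wrap the index back into the range of the original dataset.
        return self.dataset[idx % self._ori_len]

    def __len__(self):
        return self.times * self._ori_len


base = ['img_a', 'img_b', 'img_c']
repeated = RepeatedView(base, times=2)
```

No data is copied: the wrapper only changes the reported length, so one epoch over the wrapped dataset visits every original sample `times` times.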
We use ClassBalancedDataset as a wrapper to repeat the dataset based on category frequency. The dataset to repeat needs to implement the function self.get_cat_ids(idx) to support ClassBalancedDataset.
For example, to repeat Dataset_A with oversample_thr=1e-3, the config looks like the following:
dataset_A_train = dict(
    type='ClassBalancedDataset',
    oversample_thr=1e-3,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
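The sketch below shows one way frequency-based repetition can work. It assumes a repeat-factor rule in which a category with frequency f(c) gets the factor max(1, sqrt(oversample_thr / f(c))) and each sample is repeated ceil(max factor over its categories) times. This rule is an assumption about the wrapper's behaviour, and `class_balanced_indices` is a hypothetical helper, so check the source code before relying on it.

```python
import math
from collections import Counter


def class_balanced_indices(cat_ids_per_sample, oversample_thr):
    """Return sample indices, with rare-category samples repeated."""
    num_samples = len(cat_ids_per_sample)

    # Count how many samples contain each category.
    counts = Counter()
    for cat_ids in cat_ids_per_sample:
        counts.update(set(cat_ids))

    # Category-level repeat factor: at least 1, larger for rare categories.
    category_repeat = {
        cat: max(1.0, math.sqrt(oversample_thr / (count / num_samples)))
        for cat, count in counts.items()
    }

    indices = []
    for idx, cat_ids in enumerate(cat_ids_per_sample):
        # A sample follows its rarest category; no categories means no repeat.
        repeat = max((category_repeat[c] for c in cat_ids), default=1.0)
        indices.extend([idx] * math.ceil(repeat))
    return indices


# Toy data: category 0 appears in 9 of 10 samples, category 1 in 1 of 10.
cat_ids_per_sample = [[0]] * 9 + [[1]]
indices = class_balanced_indices(cat_ids_per_sample, oversample_thr=0.5)
```

In this toy case category 0 is above the threshold and is left alone, while the single category-1 sample gets a factor of sqrt(0.5 / 0.1), about 2.24, and is therefore listed three times.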
You may refer to the source code for details.