diff --git a/docs/gitbook/how-tos/avalanchedataset/README.md b/docs/gitbook/how-tos/avalanchedataset/README.md
index 71f4dd456..9239ceec1 100644
--- a/docs/gitbook/how-tos/avalanchedataset/README.md
+++ b/docs/gitbook/how-tos/avalanchedataset/README.md
@@ -4,12 +4,14 @@ description: Dealing with AvalancheDatasets

# AvalancheDataset

-The `AvalancheDataset` is an implementation of the PyTorch `Dataset` class that comes with many useful out-of-the-box functionalities. For most users, the _AvalancheDataset_ can be used as a plain PyTorch Dataset. For classification problems, `AvalancheDataset` return `x, y, t` elements (input, target, task label). However, the `AvalancheDataset` can be easily extended for any custom needs.
+The `AvalancheDataset` is an implementation of the PyTorch `Dataset` class that comes with many useful out-of-the-box functionalities. For most users, the *AvalancheDataset* can be used as a plain PyTorch Dataset. For classification problems, `AvalancheDataset` returns `x, y, t` elements (input, target, task label). However, the `AvalancheDataset` can be easily extended for any custom needs.

-**A serie of **_**Mini How-Tos**_ will guide you through the functionalities of the _AvalancheDataset_ and its subclasses:
+**A series of _Mini How-Tos_** will guide you through the functionalities of the *AvalancheDataset* and its subclasses:
+
+- [AvalancheDatasets basics](https://avalanche.continualai.org/how-tos/avalanchedataset/avalanche-datasets)
+- [Advanced Transformations](https://avalanche.continualai.org/how-tos/avalanchedataset/advanced-transformations)

-* [AvalancheDatasets basics](https://avalanche.continualai.org/how-tos/avalanchedataset/avalanche-datasets)
-* [Advanced Transformations](https://avalanche.continualai.org/how-tos/avalanchedataset/advanced-transformations)

```python
+
```

diff --git a/docs/gitbook/how-tos/avalanchedataset/avalanche-datasets.md b/docs/gitbook/how-tos/avalanchedataset/avalanche-datasets.md
index 700e523e4..5a355db3b 100644
--- a/docs/gitbook/how-tos/avalanchedataset/avalanche-datasets.md
+++ b/docs/gitbook/how-tos/avalanchedataset/avalanche-datasets.md
@@ -2,8 +2,7 @@ description: Converting PyTorch Datasets to Avalanche Dataset
---

-# avalanche-datasets
-
+# Avalanche Datasets
Datasets are a fundamental data structure for continual learning. Unlike offline training, in continual learning we often need to manipulate datasets to create streams, benchmarks, or to manage replay buffers. High-level utilities and predefined benchmarks already take care of the details for you, but you can easily manipulate the data yourself if you need to. These how-tos will explain:

1. PyTorch datasets and data loading
@@ -11,38 +10,37 @@ Datasets are a fundamental data structure for continual learning. Unlike offline
3. AvalancheDataset features

In Avalanche, the `AvalancheDataset` is everywhere:
+- The dataset carried by the `experience.dataset` field is always an *AvalancheDataset*.
+- Many benchmark creation functions accept *AvalancheDataset*s to create benchmarks.
+- Avalanche benchmarks are created by manipulating *AvalancheDataset*s.
+- Replay buffers also use `AvalancheDataset` to easily concatenate data and handle transformations.

-* The dataset carried by the `experience.dataset` field is always an _AvalancheDataset_.
-* Many benchmark creation functions accept _AvalancheDataset_s to create benchmarks.
-* Avalanche benchmarks are created by manipulating _AvalancheDataset_s.
-* Replay buffers also use `AvalancheDataset` to easily concanate data and handle transformations.

## 📚 PyTorch Dataset: general definition

In PyTorch, **a `Dataset` is a class** exposing two methods:
-
-* `__len__()`, which returns the amount of instances in the dataset (as an `int`).
-* `__getitem__(idx)`, which returns the data point at index `idx`.
+- `__len__()`, which returns the number of instances in the dataset (as an `int`).
+- `__getitem__(idx)`, which returns the data point at index `idx`.

In other words, a Dataset instance is just an object for which, similarly to a list, one can simply:
-
-* Obtain its length using the Python `len(dataset)` function.
-* Obtain a single data point using the `x, y = dataset[idx]` syntax.
+- Obtain its length using the Python `len(dataset)` function.
+- Obtain a single data point using the `x, y = dataset[idx]` syntax.

The content of the dataset can be either loaded in memory when the dataset is instantiated (like the torchvision MNIST dataset does) or, for big datasets like ImageNet, the content is kept on disk, with the dataset keeping the list of files in an internal field. In this case, data is loaded from the storage on-the-fly when `__getitem__(idx)` is called. The way those things are managed is specific to each dataset implementation.
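To make this protocol concrete, here is a minimal sketch of a map-style dataset; the class and its toy data are invented for this example:

```python
from torch.utils.data.dataset import Dataset


class SquaresDataset(Dataset):
    """A toy map-style dataset: data point i is the pair (i, i**2)."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self):
        # The number of instances in the dataset
        return self.size

    def __getitem__(self, idx):
        # The data point at index `idx`, computed on-the-fly
        return idx, idx ** 2


dataset = SquaresDataset(100)
print(len(dataset))  # 100
x, y = dataset[3]
print(x, y)          # 3 9
```

Any object exposing these two methods can be used wherever a PyTorch `Dataset` is expected.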
### Quick note on the IterableDataset class
-
A variation of the standard `Dataset` exists in PyTorch: the [IterableDataset](https://pytorch.org/docs/stable/data.html#iterable-style-datasets). When using an `IterableDataset`, one can load the data points in a sequential way only (by using a tape-like approach). The `dataset[idx]` syntax and `len(dataset)` function are not allowed. **Avalanche does NOT support `IterableDataset`s.** You shouldn't worry about this because, realistically, you will never encounter such datasets (at least in torchvision). If you need `IterableDataset` let us know and we will consider adding support for them.

-## How to Create an AvalancheDataset

+## How to Create an AvalancheDataset
To create an `AvalancheDataset` from a PyTorch dataset, you only need to pass the original data to the constructor as follows
+

```python
!pip install avalanche-lib
```
+

```python
import torch
from torch.utils.data.dataset import TensorDataset
@@ -60,6 +58,7 @@ avl_data = AvalancheDataset(torch_data)

The dataset is equivalent to the original one:
+

```python
print(torch_data[0])
print(avl_data[0])
```

@@ -70,13 +69,13 @@ print(avl_data[0])

Most of the time, you can also use one of the utility functions in [benchmark utils](https://avalanche-api.continualai.org/en/latest/benchmarks.html#utils-data-loading-and-avalanchedataset) that also add attributes such as class and task labels to the dataset. For example, you can create a classification dataset using `make_classification_dataset`.

Classification datasets:
-
-* returns triplets of the form \<x, y, t\>, where t is the task label (which defaults to 0).
-* The wrapped dataset must contain a valid **targets** field.
+- return triplets of the form `<x, y, t>`, where t is the task label (which defaults to 0).
+- The wrapped dataset must contain a valid **targets** field.

Avalanche provides some utility functions to create supervised classification datasets such as:
+- `make_tensor_classification_dataset` for tensor datasets
+
+All of these will automatically create the `targets` and `targets_task_labels` attributes.

-* `make_tensor_classification_dataset` for tensor datasets all of these will automatically create the `targets` and `targets_task_labels` attributes.

```python
from avalanche.benchmarks.utils import make_classification_dataset

@@ -90,9 +89,9 @@ sup_data = make_classification_dataset(torch_data, task_labels=tls)
```

## DataLoader
-
Avalanche provides some [custom dataloaders](https://avalanche-api.continualai.org/en/latest/benchmarks.html#utils-data-loading-and-avalanchedataset) to sample in a task-balanced way or to balance the replay buffer and current data, but you can also use the standard PyTorch `DataLoader`.
+

```python
from torch.utils.data.dataloader import DataLoader

@@ -105,9 +104,9 @@ for x_minibatch, y_minibatch in my_dataloader:
```

## Dataset Operations: Concatenation and SubSampling
-
While PyTorch provides two different classes for concatenation and subsampling (`ConcatDataset` and `Subset`), Avalanche implements them as dataset methods. These operations return a new dataset, leaving the original one unchanged.
+

```python
cat_data = avl_data.concat(avl_data)
print(len(cat_data)) # 100 + 100 = 200
@@ -119,8 +118,9 @@ print(len(avl_data)) # 100, original data stays the same
```

## Dataset Attributes
+AvalancheDataset allows you to add attributes to datasets. Attributes are named arrays that carry some information that is propagated by concatenation and subsampling operations.
+For example, classification datasets use this functionality to manage class and task labels.

-AvalancheDataset allows to add attributes to datasets. Attributes are named arrays that carry some information that is propagated by concatenation and subsampling operations. For example, classification datasets use this functionality to manage class and task labels.

```python
tls = [0 for _ in range(100)] # one task label for each sample
@@ -142,16 +142,14 @@ print(cat_data.targets_task_labels.name, len(cat_data.targets_task_labels._data)

Thanks to `DataAttribute`s, you can freely operate on your data (e.g. to manage a replay buffer) without losing class or task labels. This makes it easy to manage multi-task datasets or to balance datasets by class.
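Subsampling propagates the same attributes. The sketch below reuses `sup_data` from the cells above and assumes the `subset` method mentioned in the operations section:

```python
# A sketch of attribute propagation under subsampling (mirrors the
# concat example above; `subset` is the subsampling counterpart of `concat`).
sub_data = sup_data.subset(list(range(10)))
print(len(sub_data))  # 10
print(sub_data.targets.name, len(sub_data.targets._data))  # targets 10
print(sub_data.targets_task_labels.name, len(sub_data.targets_task_labels._data))  # targets_task_labels 10
```

Class and task labels are sliced consistently with the data, so a replay buffer built this way keeps its annotations aligned.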
## Transformations
-
-Most datasets from the _torchvision_ libraries (as well as datasets found "in the wild") allow for a `transformation` function to be passed to the dataset constructor. The support for transformations is not mandatory for a dataset, but it is quite common to support them. The transformation is used to process the X value of a data point before returning it. This is used to normalize values, apply augmentations, etcetera.
+Most datasets from the *torchvision* libraries (as well as datasets found "in the wild") allow for a `transformation` function to be passed to the dataset constructor. The support for transformations is not mandatory for a dataset, but it is quite common to support them. The transformation is used to process the X value of a data point before returning it. This is used to normalize values, apply augmentations, etcetera.

`AvalancheDataset` implements a very rich and powerful set of functionalities for managing transformations. You can learn more about it in the [Advanced Transformations How-To](https://avalanche.continualai.org/how-tos/avalanchedataset/advanced-transformations).

## Next steps
+With these notions in mind, you can start your journey on understanding the functionalities offered by the AvalancheDatasets by going through the *Mini How-To*s.

-With these notions in mind, you can start start your journey on understanding the functionalities offered by the AvalancheDatasets by going through the _Mini How-To_s.
-
-Please refer to the [list of the _Mini How-To_s regarding AvalancheDatasets](https://avalanche.continualai.org/how-tos/avalanchedataset) for a complete list. It is recommended to start with the **"Creating AvalancheDatasets"** _Mini How-To_.
+Please refer to the [list of the *Mini How-To*s regarding AvalancheDatasets](https://avalanche.continualai.org/how-tos/avalanchedataset) for a complete list. It is recommended to start with the **"Creating AvalancheDatasets"** *Mini How-To*.

## 🤝 Run it on Google Colab

diff --git a/docs/gitbook/how-tos/avalanchedataset/avalanche-transformations.md b/docs/gitbook/how-tos/avalanchedataset/avalanche-transformations.md
index 6b79165c8..dce14acbe 100644
--- a/docs/gitbook/how-tos/avalanchedataset/avalanche-transformations.md
+++ b/docs/gitbook/how-tos/avalanchedataset/avalanche-transformations.md
@@ -2,28 +2,28 @@ description: Dealing with transformations (groups, appending, replacing, freezing).
---

-# avalanche-transformations
-
+# Advanced Transformations
While torchvision (and other) datasets typically have a fixed set of transformations, AvalancheDataset also provides some additional functionalities. `AvalancheDataset`s can:
-
1. Have multiple **transformation "groups"** in the same dataset (like separate train and eval transformations).
2. Manipulate transformations by **freezing, replacing and removing** them.

-The following sub-sections show examples on how to use these features. It is warmly recommended to **run this page as a notebook** using Colab (info at the bottom of this page).
+The following sub-sections show examples of how to use these features.
+It is warmly recommended to **run this page as a notebook** using Colab (info at the bottom of this page).

Let's start by installing Avalanche:
+

```python
!pip install avalanche-lib
```

## Transformation groups
-
AvalancheDatasets can contain multiple **transformation groups**. This can be useful to keep train and test transformations in the same dataset and to have different sets of transformations. For instance, you can easily add ad-hoc transformations to use for replay data.

For classification datasets, we follow torchvision conventions. Therefore, `make_classification_dataset` supports `transform`, which is applied to input (X) values, and `target_transform`, which is applied to class labels (Y). The latter is rarely used. This means that **a transformation group is a pair of transformations to be applied to the X and Y values** of each instance returned by the dataset. In both torchvision and Avalanche implementations, **a transformation must be a function (or other callable object)** that accepts one input (the X or Y value) and outputs its transformed version. A comprehensive guide on transformations can be found in the [torchvision documentation](https://pytorch.org/vision/stable/transforms.html).

-In the following example, a MNIST dataset is created and then wrapped in an AvalancheDataset. When creating the AvalancheDataset, we can set _train_ and _eval_ transformations by passing a _transform\_groups_ parameter. Train transformations usually include some form of random augmentation, while eval transformations usually include a sequence of deterministic transformations only. Here we define the sequence of train transformations as a random rotation followed by the ToTensor operation. The eval transformations only include the ToTensor operation.
+In the following example, a MNIST dataset is created and then wrapped in an AvalancheDataset.
When creating the AvalancheDataset, we can set *train* and *eval* transformations by passing a *transform\_groups* parameter. Train transformations usually include some form of random augmentation, while eval transformations usually include a sequence of deterministic transformations only. Here we define the sequence of train transformations as a random rotation followed by the ToTensor operation. The eval transformations only include the ToTensor operation. + ```python from torchvision import transforms @@ -53,7 +53,8 @@ transform_groups = { avl_mnist_transform = make_classification_dataset(mnist_dataset, transform_groups=transform_groups) ``` -Of course, one can also just use the `transform` and `target_transform` constructor parameters to set the transformations for both the _train_ and the _eval_ groups. However, it is recommended to use the approach based on _transform\_groups_ (shown in the code above) as it is much more flexible. +Of course, one can also just use the `transform` and `target_transform` constructor parameters to set the transformations for both the *train* and the *eval* groups. However, it is recommended to use the approach based on *transform\_groups* (shown in the code above) as it is much more flexible. + ```python # Not recommended: use transform_groups instead @@ -62,13 +63,14 @@ avl_mnist_same_transforms = make_classification_dataset(mnist_dataset, transfor ### Using `.train()` and `.eval()` -**The default behaviour of the AvalancheDataset is to use transformations from the **_**train**_** group.** However, one can easily obtain a version of the dataset where the _eval_ group is used. Note: when obtaining the dataset of experiences from the test stream, those datasets will already be using the _eval_ group of transformations so you don't need to switch to the eval group ;). +**The default behaviour of the AvalancheDataset is to use transformations from the _train_ group.** However, one can easily obtain a version of the dataset where the *eval* group is used. Note: when obtaining the dataset of experiences from the test stream, those datasets will already be using the *eval* group of transformations so you don't need to switch to the eval group ;). -You can switch between the _train_ and _eval_ groups using the `.train()` and `.eval()` methods to obtain a copy (view) of the dataset with the proper transformations enabled. As a general rule, **methods that manipulate the AvalancheDataset fields (and transformations) always create a view of the dataset. The original dataset is never changed.** +You can switch between the *train* and *eval* groups using the `.train()` and `.eval()` methods to obtain a copy (view) of the dataset with the proper transformations enabled. As a general rule, **methods that manipulate the AvalancheDataset fields (and transformations) always create a view of the dataset. The original dataset is never changed.** -In the following cell we use the _avl\_mnist\_transform_ dataset created in the cells above. We first obtain a view of it in which _eval_ transformations are enabled. Then, starting from this view, we obtain a version of it in which _train_ transformations are enabled. We want to double-stress that `.train()` and `.eval()` never change the group of the dataset on which they are called: they always create a view. +In the following cell we use the *avl\_mnist\_transform* dataset created in the cells above. We first obtain a view of it in which *eval* transformations are enabled. 
Then, starting from this view, we obtain a version of it in which *train* transformations are enabled. We want to stress again that `.train()` and `.eval()` never change the group of the dataset on which they are called: they always create a view.
+
+One can check that the correct transformation group is in use by looking at the content of the *transform/target_transform* fields.

-One can check that the correct transformation group is in use by looking at the content of the _transform/target\_transform_ fields.

```python
# Obtain a view of the dataset in which eval transformations are enabled
@@ -95,10 +97,10 @@ print(avl_mnist_train._transform_groups.transform_groups[cgroup])
```

### Custom transformation groups
+In *AvalancheDataset*s the **_train_ and _eval_ transformation groups are always available**. However, *AvalancheDataset* also supports **custom transformation groups**.

-In _AvalancheDataset_s the _**train**_** and **_**eval**_** transformation groups are always available**. However, _AvalancheDataset_ also supports **custom transformation groups**.
+The following example shows how to create an AvalancheDataset with an additional group named *replay*. We define the replay transformation as a random crop followed by the ToTensor operation.

-The following example shows how to create an AvalancheDataset with an additional group named _replay_. We define the replay transformation as a random crop followed by the ToTensor operation.

```python
from avalanche.benchmarks.utils import AvalancheDataset
@@ -119,7 +121,8 @@ AvalancheDataset(mnist_dataset, transform_groups=transform_groups_with_replay)
```

-However, once created the dataset will use the _train_ group. You can switch to the group using the `.with_transforms(group_name)` method. The `.with_transforms(group_name)` method behaves in the same way `.train()` and `.eval()` do by creating a view of the original dataset.
+However, once created, the dataset will use the *train* group. You can switch to the new group using the `.with_transforms(group_name)` method. The `.with_transforms(group_name)` method behaves in the same way `.train()` and `.eval()` do by creating a view of the original dataset.
+

```python
avl_mnist_custom_transform_not_enabled = AvalancheDataset(
@@ -139,12 +142,13 @@ print(avl_mnist_custom_transform_2._transform_groups.transform_groups[cgroup])
```

## Replacing transformations

-The replacement operation follows the same idea (and benefits) of the append one. By using `.replace_current_transform_group(transform, target_transform)` one can obtain a view of the original dataset in which the **transformaations for the current group** are replaced with the given ones. One may also change tranformations for other groups by passing the name of the group as the optional parameter `group`. As with any transform-related operation, the original dataset is not affected.
+The replacement operation follows the same idea (and benefits) as the append one. By using `.replace_current_transform_group(transform, target_transform)` one can obtain a view of the original dataset in which the **transformations for the current group** are replaced with the given ones. One may also change transformations for other groups by passing the name of the group as the optional parameter `group`. As with any transform-related operation, the original dataset is not affected.

Note: one can use `.replace_transforms(...)` to remove previous transformations (by passing `None` as the new transform).
The following cell shows how to use `.replace_transforms(...)` to replace the transformations of the current group:
+

```python
avl_mnist = make_classification_dataset(mnist_dataset, transform_groups=transform_groups)
new_transform = transforms.RandomCrop(size=(28, 28), padding=4)
@@ -171,7 +175,8 @@ One may wonder when this may come in handy... in fact, you will probably rarely

Transformations for all transform groups can be frozen at once by using `.freeze_transforms()`. As always, those methods return a view of the original dataset.

-The cell below shows a simplified excerpt from the [PermutedMNIST benchmark implementation](../../../../avalanche/benchmarks/classic/cmnist.py). First, a _PixelsPermutation_ instance is created. That instance is a transformation that will permute the pixels of the input image. We then create the train end test sets. Once created, transformations for those datasets are frozen using `.freeze_transforms()`.
+The cell below shows a simplified excerpt from the [PermutedMNIST benchmark implementation](https://github.com/ContinualAI/avalanche/blob/master/avalanche/benchmarks/classic/cmnist.py). First, a *PixelsPermutation* instance is created. That instance is a transformation that will permute the pixels of the input image. We then create the train and test sets. Once created, transformations for those datasets are frozen using `.freeze_transforms()`.
+

```python
from avalanche.benchmarks.classic.cmnist import PixelsPermutation
@@ -213,6 +218,7 @@ In this way, that transform can't be removed. However, remember that one can alw

The cell below shows that `replace_transforms` can't remove frozen transformations:
+

```python
# First, show that the image pixels are permuted
print('Before replace_transforms:')
@@ -227,12 +233,11 @@ display(with_removed_transforms[0][0].resize((192, 192), 0))
```

## Transformations wrap-up
-
-This completes the _Mini How-To_ for the functionalities of the _AvalancheDataset_ related to **transformations**.
+This completes the *Mini How-To* for the functionalities of the *AvalancheDataset* related to **transformations**.

Here you learned how to use **transformation groups** and how to **append/replace/freeze transformations** in a simple way.

-Other _Mini How-To_s will guide you through the other functionalities offered by the _AvalancheDataset_ class. The list of _Mini How-To_s can be found [here](https://avalanche.continualai.org/how-tos/avalanchedataset).
+Other *Mini How-To*s will guide you through the other functionalities offered by the *AvalancheDataset* class. The list of *Mini How-To*s can be found [here](https://avalanche.continualai.org/how-tos/avalanchedataset).

## 🤝 Run it on Google Colab

diff --git a/docs/gitbook/how-tos/checkpoints.md b/docs/gitbook/how-tos/checkpoints.md
index 08237beae..d9b9231e1 100644
--- a/docs/gitbook/how-tos/checkpoints.md
+++ b/docs/gitbook/how-tos/checkpoints.md
@@ -1,2 +1,241 @@
-# checkpoints
+---
+description: Save and load checkpoints
+---
+# Save and load checkpoints
+The ability to **save and resume experiments** may be very useful when running long experiments. Avalanche offers a checkpointing functionality that can be used to save and restore your strategy including plugins, metrics, and loggers.
+
+This guide will show how to plug the checkpointing functionality into the usual Avalanche main script. This only requires minor changes in the main script: no changes to the strategy/plugins/... code are required!
Also, make sure to check the [checkpointing.py](https://github.com/ContinualAI/avalanche/blob/master/examples/checkpointing.py) example in the repository for a ready-to-go template.
+
+## Continual learning vs classic deep learning
+**Resuming a continual learning experiment is not the same as resuming a classic deep learning training session.** In classic training setups, the elements needed to resume an experiment are limited to i) the model weights, ii) the optimizer state, and iii) additional info such as the number of epochs/iterations so far. On the contrary, continual learning experiments need far more info to be correctly resumed:
+
+- The state of **plugins**, such as:
+  - the examples saved in the replay buffer
+  - the importance of model weights (EwC, Synaptic Intelligence)
+  - a copy of the model (LwF)
+  - ... and many others, which are *specific to each technique*!
+- The state of **metrics**, as some are computed on the performance measured on previous experiences:
+  - AMCA (Average Mean Class Accuracy) metric
+  - Forgetting metric
+
+
+## Resuming experiments in Avalanche
+To handle all these elements, we opted to provide an easy-to-use plugin: the *CheckpointPlugin*. It will take care of loading:
+
+- Strategy, including the model
+- Plugins
+- Metrics
+- Loggers: this includes re-opening the logs for TensorBoard, Weights & Biases, ...
+- State of all random number generators
+  - In continual learning experiments, this affects the choice of replay examples and other critical elements. This is usually not needed in classic deep learning, but here it may be useful!
+
+
+Here, in a couple of cells, we'll show you how to use it. Remember that you can follow this guide by running it as a notebook (see below for a direct link to load it on Colab).
+
+Let's install Avalanche:
+
+
+```python
+!pip install avalanche-lib
+```
+
+And let us import the needed elements:
+
+
+```python
+import os
+from typing import Sequence
+
+import torch
+from torch.nn import CrossEntropyLoss
+from torch.optim import SGD
+
+from avalanche.benchmarks import CLExperience, SplitMNIST
+from avalanche.evaluation.metrics import accuracy_metrics, loss_metrics, \
+    class_accuracy_metrics
+from avalanche.logging import InteractiveLogger, TensorboardLogger, \
+    WandBLogger, TextLogger
+from avalanche.models import SimpleMLP, as_multitask
+from avalanche.training.determinism.rng_manager import RNGManager
+from avalanche.training.plugins import EvaluationPlugin, ReplayPlugin
+from avalanche.training.plugins.checkpoint import CheckpointPlugin, \
+    FileSystemCheckpointStorage
+from avalanche.training.supervised import Naive
+```
+
+Let's proceed by defining a very vanilla Avalanche main script. Simply put, this usually comes down to defining:
+
+0. Load any configuration, set seeds, etcetera
+1. The benchmark
+2. The model, optimizer, and loss function
+3. Evaluation components
+    - The list of metrics to track
+    - The loggers
+    - The evaluation plugin (that glues the metrics and loggers together)
+4. The training plugins
+5. The strategy
+6. The train-eval loop
+
+They do not have to be in this particular order, but this is the order followed in this guide.
+
+To enable checkpointing, the following changes are needed:
+1. In the very first part of the code, fix the seeds for reproducibility
+    - The **RNGManager** class is used, which may be useful even in experiments in which checkpointing is not needed ;)
+2. Instantiate the checkpointing plugin
+3. Check if a checkpoint exists and load it
+    - Only if not resuming from a checkpoint: create the Evaluation components, the plugins, and the strategy
+4. Change the train/eval loop to start from the correct experience
+
+Note that those changes are all properly annotated in the [checkpointing.py](https://github.com/ContinualAI/avalanche/blob/master/examples/checkpointing.py) example, which is the recommended template to follow when enabling checkpointing in a training script.
+
+### Step by step
+Let's start with the first change: defining a fixed seed. This is needed to correctly re-create the benchmark object and should be the same seed used to create the checkpoint.
+
+The **RNGManager** takes care of setting the seed for the following generators: Python *random*, NumPy, and PyTorch (both CPU and device-specific generators). In this way, you can be sure that any randomness-dependent elements in the benchmark creation procedure are identical across save/resume operations.
+
+
+```python
+# Set a fixed seed: must be kept the same across save/resume operations
+RNGManager.set_random_seeds(1234)
+```
+
+Let's then proceed with the usual Avalanche code. Note: nothing to change here to enable checkpointing. Here we create a SplitMNIST benchmark and instantiate a multi-task MLP model. Notice that checkpointing works fine with multi-task models wrapped using `as_multitask`.
+
+
+```python
+
+# Nothing new here...
+device = torch.device(
+    "cuda:0"
+    if torch.cuda.is_available()
+    else "cpu"
+)
+print('Using device', device)
+
+# CL Benchmark Creation
+n_experiences = 5
+benchmark = SplitMNIST(n_experiences=n_experiences,
+                       return_task_id=True)
+input_size = 28*28*1
+
+# Define the model (and load initial weights if necessary)
+# Again, not checkpoint-related
+model = SimpleMLP(input_size=input_size,
+                  num_classes=benchmark.n_classes // n_experiences)
+model = as_multitask(model, 'classifier')
+
+# Prepare for training & testing: not checkpoint-related
+optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)
+criterion = CrossEntropyLoss()
+```
+
+It's now time to instantiate the checkpointing plugin and load the checkpoint.
+
+
+```python
+checkpoint_plugin = CheckpointPlugin(
+    FileSystemCheckpointStorage(
+        directory='./checkpoints/task_incremental',
+    ),
+    map_location=device
+)
+
+# Load checkpoint (if exists in the given storage)
+# If it does not exist, strategy will be None and initial_exp will be 0
+strategy, initial_exp = checkpoint_plugin.load_checkpoint_if_exists()
+```
+
+Please notice the arguments passed to the *CheckpointPlugin* constructor:
+
+1. The first parameter is a **storage object**. We decided to allow the checkpointing plugin to load checkpoints from arbitrary storages. The simplest storage, `FileSystemCheckpointStorage`, will use a given directory to store the file for the current experiment (**do not point multiple experiments/runs to the same directory!**). However, we plan to expand this in the future to support network/cloud storages. Contributions on this are welcome :-)! Remember that the `CheckpointStorage` interface is quite simple to implement in a way that best fits your needs.
+2. The device used for training. This functionality may be particularly useful in some cases: the plugin will take care of *loading the checkpoint on the correct device, even if the checkpoint was created on a cuda device with a different id*. This means that it can also be used to resume a CUDA checkpoint on CPU.
The only caveat is that it cannot be used to load a CPU checkpoint to CUDA. In brief: CUDA -> CPU (OK), CUDA:0 -> CUDA:1 (OK), CPU -> CUDA (NO!). This will also take care of updating the *device* field of the strategy (and plugins) to point to the current device.
+
+The next change is in fact quite minimal. It only requires wrapping the creation of plugins, metrics, and loggers in an "if" that checks if a checkpoint was actually loaded. If a checkpoint is loaded, the resumed strategy already contains the properly restored plugins, metrics, and loggers: *it would be an error to create them*.
+
+
+```python
+
+# Create the CL strategy (if not already loaded from checkpoint)
+if strategy is None:
+    plugins = [
+        checkpoint_plugin,  # Add the checkpoint plugin to the list!
+        ReplayPlugin(mem_size=500),  # Other plugins you want to use
+        # ...
+    ]
+
+    # Create loggers (as usual)
+    # Note that the checkpoint plugin will automatically correctly
+    # resume loggers!
+    os.makedirs(f'./logs/checkpointing_example',
+                exist_ok=True)
+    loggers = [
+        TextLogger(
+            open(f'./logs/checkpointing_example/log.txt', 'w')),
+        InteractiveLogger(),
+        TensorboardLogger(f'./logs/checkpointing_example')
+    ]
+
+    # The W&B logger is correctly resumed without resulting in
+    # duplicated runs!
+    use_wandb = False
+    if use_wandb:
+        loggers.append(WandBLogger(
+            project_name='AvalancheCheckpointing',
+            run_name=f'checkpointing_example'
+        ))
+
+    # Create the evaluation plugin (as usual)
+    evaluation_plugin = EvaluationPlugin(
+        accuracy_metrics(minibatch=False, epoch=True,
+                         experience=True, stream=True),
+        loss_metrics(minibatch=False, epoch=True,
+                     experience=True, stream=True),
+        class_accuracy_metrics(
+            stream=True
+        ),
+        loggers=loggers
+    )
+
+    # Create the strategy (as usual)
+    strategy = Naive(
+        model=model,
+        optimizer=optimizer,
+        criterion=criterion,
+        train_mb_size=128,
+        train_epochs=2,
+        eval_mb_size=128,
+        device=device,
+        plugins=plugins,
+        evaluator=evaluation_plugin
+    )
+```
+
+Final change: adapt the for loop so that the training stream is iterated starting from `initial_exp`. This variable was created when loading the checkpoint and it indicates the next training experience to run. If no checkpoint was found, then its value will be 0.
+
+
+```python
+exit_early = False
+
+for train_task in benchmark.train_stream[initial_exp:]:
+    strategy.train(train_task)
+    strategy.eval(benchmark.test_stream)
+
+    if exit_early:
+        exit(0)
+```
+
+A new checkpoint is stored *at the end of each eval phase*! If the program is interrupted before that point, all progress since the last eval phase is lost.
+
+Here `exit_early` is a simple placeholder that you can use to experiment a bit. You may obtain a similar effect by stopping this notebook manually, restarting the kernel, and re-running all cells. You will notice that the last checkpoint will be loaded and training will resume as expected.
+
+Usually, `exit_early` should be implemented as a mechanism able to gracefully stop the process. When using SLURM or other schedulers (or even when terminating processes using Ctrl-C), you can catch termination signals and manage them properly so that the process exits after the next eval phase. However, don't worry if the process is killed abruptly: the last checkpoint will be loaded correctly once the experiment is restarted by the scheduler.
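+As an illustration of such a graceful-stop mechanism (a sketch based on Python's standard *signal* module, not part of the Avalanche API), the `exit_early` placeholder of the previous cell could be set from a signal handler:
+
+
+```python
+import signal
+
+exit_early = False
+
+
+def _request_exit(signum, frame):
+    # Ask the main loop to stop after the next eval phase; the
+    # checkpoint saved at the end of that phase will be used on resume.
+    global exit_early
+    exit_early = True
+
+
+# SIGTERM is what SLURM and most schedulers send before killing a job;
+# SIGINT covers manual Ctrl-C terminations.
+signal.signal(signal.SIGTERM, _request_exit)
+signal.signal(signal.SIGINT, _request_exit)
+```
+
+In this way, a termination signal received mid-training lets the process reach the next eval phase, store its checkpoint, and exit cleanly.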
+
+That's it for the checkpointing functionality! This is a relatively new mechanism, and feedback on it is warmly welcomed in our [discussions section](https://github.com/ContinualAI/avalanche/discussions) in the repository!
+
+## 🤝 Run it on Google Colab
+
+You can run _this guide_ and play with it on Google Colaboratory by clicking here: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContinualAI/avalanche/blob/master/notebooks/how-tos/checkpoints.ipynb)