Benchmark for sequence classification #874
-
I am trying to classify a sequence of tokens. The setup can be thought of as an NER task in NLP, where each token has multiple possible entity classes. The setup in question is similar to the discussion here, i.e., the point 1 requirement by @AndreaCossu is fulfilled, but it differs in point 2. Instead of whole-sequence classification, classification is required for each token (each timestep), but the number of timesteps (the maximum sentence length in tokens) is known and fixed. I tried to create an AvalancheDataset from a custom PyTorch dataset, but got the error - So is it possible to set up a benchmark for such scenarios in Avalanche? PS: If I understand correctly, this discussion suggests that it's not possible, or that it needs extra care when using metrics. Please correct me if wrong.
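For concreteness, here is a minimal sketch of the kind of custom dataset described above: each sample is a fixed-length token sequence with one label per token. This is a pure-Python stand-in (no torch), only to illustrate the shapes involved; the class name, padding token, and padding label are all made up for the example.

```python
class TokenTaggingDataset:
    """Hypothetical token-level classification dataset (NER-style):
    one class label per token, padded to a known, fixed max length."""

    def __init__(self, sequences, labels, max_len, pad_token=0, pad_label=-100):
        # sequences: list of lists of token ids; labels: parallel per-token labels
        assert len(sequences) == len(labels)
        self.sequences = sequences
        self.labels = labels
        self.max_len = max_len
        self.pad_token = pad_token
        self.pad_label = pad_label

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        # truncate to max_len, then pad both inputs and targets to the fixed length
        x = self.sequences[idx][: self.max_len]
        y = self.labels[idx][: self.max_len]
        pad = self.max_len - len(x)
        return x + [self.pad_token] * pad, y + [self.pad_label] * pad


ds = TokenTaggingDataset([[5, 7, 9]], [[1, 0, 2]], max_len=5)
x, y = ds[0]
print(x)  # [5, 7, 9, 0, 0]
print(y)  # [1, 0, 2, -100, -100]
```

In a real setup `x` and `y` would be tensors, but the key point is the same: the target is a fixed-length vector of per-token labels, not a single class.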
Replies: 1 comment
-
Hi @davians12! Thanks for reaching out.
One hack that should work exploits the fact that an Avalanche dataset (like a PyTorch dataset) can return a variable number of elements.
So, when looping over the dataloader you can have a variable number of tensors: `for x, y, a, b, ..., t in dataloader`. Consider that in Avalanche the `BaseStrategy` defines the input `mb_x` as the first element returned by the dataloader, the target `mb_y` as the second, and the (optional) task label `mb_task_id` as the last one (see here the properties I am mentioning). So, you can:
Let me know if this is clear and works (I haven't tried it yet) or if you need further help in coding this up. Of course, if you find a better solution feel free to share!
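The unpacking convention described above can be sketched as follows. This is a pure-Python stand-in, not actual Avalanche code; the `samples` list and its extra payload are invented for illustration. The point is only the tuple layout: input first, target second, task label last, with any extra tensors in between.

```python
# Each item mimics what the dataloader would yield per sample:
# (x, y, extra..., task_label), where y holds one class label per timestep.
samples = [
    ([5, 7, 9], [1, 0, 2], "meta-a", 0),
    ([4, 4, 8], [0, 0, 1], "meta-b", 0),
]

for item in samples:
    # first element -> mb_x, second -> mb_y, last -> mb_task_id;
    # anything in between is extra payload carried along untouched
    x, y, *extras, t = item
    assert len(x) == len(y)  # one target per token (token-level classification)
    print(x, y, extras, t)
```

Because the target is the *second* element regardless of how many extras follow, a per-token label sequence can ride in the `y` slot without changing anything else in the loop.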