This is a repository for the End-to-end Dialogue Transformer project for the Statistical Dialogue Systems course.
- Improve Sequicity comments
- Use PyTorch's `nn.Transformer` to implement a Sequicity-style dialogue system
- Try to run Sequicity as is - this should be quite easy.
- Rewrite the classes `SimpleDynamicEncoder`, `BSpanDecoder`, and `ResponseDecoder` from `tsd_net.py` to use a transformer instead of RNNs (a rough encoder sketch follows this list). This will probably also involve adjusting the `TSD` class.
- Compare it with existing dialogue systems (mainly Sequicity)
- Improve performance by utilizing a pre-trained LM.
- Implement it in TensorFlow
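The encoder rewrite could look roughly like the sketch below. This is illustrative only, assuming a plain `nn.TransformerEncoder` with sinusoidal positional encodings; the class name, dimensions, and interface are placeholders, not the actual `tsd_net.py` API.

```python
import math

import torch
import torch.nn as nn


class TransformerDialogueEncoder(nn.Module):
    """Sketch of a transformer replacement for the RNN-based SimpleDynamicEncoder."""

    def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=3, dropout=0.1, max_len=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        # Fixed sinusoidal positional encodings, as in "Attention Is All You Need".
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=4 * d_model, dropout=dropout
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, input_ids, pad_mask=None):
        # input_ids: (batch, seq_len); pad_mask: (batch, seq_len), True at padded positions.
        x = self.embedding(input_ids) + self.pe[:, : input_ids.size(1)]
        x = x.transpose(0, 1)  # nn.TransformerEncoder expects (seq_len, batch, d_model)
        out = self.encoder(x, src_key_padding_mask=pad_mask)
        return out.transpose(0, 1)  # (batch, seq_len, d_model)
```

The two decoders would presumably be rewritten analogously with `nn.TransformerDecoder`, with the copy mechanism applied on top of the decoder outputs.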
We evaluated our system on the CamRest676 dataset.
System | Success F1 | BLEU |
---|---|---|
Transformer | 0.770 | 0.327 |
Transformer without copy mechanism | 0.710 | 0.315 |
Sequicity | 0.854 | 0.253 |
We have shown that the transformer with a copy mechanism (sketched below) achieves performance comparable to Sequicity. We believe the system could be improved by utilizing a pre-trained language model (BERT, GPT-{2|3}, MASS, XLNet, ...).
Although the success F1 score did not surpass our baseline, our model's BLEU score on responses is higher than Sequicity's by 7.4 points (0.327 vs. 0.253). We think that the worse performance of the Transformer compared to recurrent neural networks may be caused by the small amount of data available, the relatively small batch size, and the generally lower training stability of transformers (see Training Tips for the Transformer Model).
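For reference, the copy mechanism mixes a generation distribution over the vocabulary with a copy distribution over the source tokens. The sketch below uses a pointer-generator style mixture as a simplified stand-in for the CopyNet formulation used in Sequicity; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F


def copy_augmented_distribution(gen_logits, attn_weights, src_ids, p_gen):
    """Mix a generation distribution with a copy distribution over source tokens.

    gen_logits:   (batch, vocab_size) decoder output logits
    attn_weights: (batch, src_len)    attention over the encoded source sequence
    src_ids:      (batch, src_len)    source token ids, used to map copy scores onto the vocabulary
    p_gen:        (batch, 1)          probability of generating vs. copying, e.g. from a sigmoid
    """
    gen_dist = F.softmax(gen_logits, dim=-1)          # P_vocab(w)
    copy_dist = torch.zeros_like(gen_dist)
    copy_dist.scatter_add_(1, src_ids, attn_weights)  # accumulate attention weights onto vocabulary ids
    return p_gen * gen_dist + (1.0 - p_gen) * copy_dist
```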
Papers related to this work
- Sequicity
- Incorporating Copying Mechanism in Sequence-to-Sequence Learning - the copy mechanism referenced by Sequicity; quite an interesting paper
- Attention Is All You Need - the transformer architecture
- Hello, It's GPT-2
- ALBERT: A Lite BERT - IMHO (Ondrej) the methods described in this paper might be easier to use with limited computational resources than other pretrained transformers (BERT, GPT-2, XLNet, Transformer-XL, ...)
- Training Tips for the Transformer Model - a nice paper from UFAL with practical tips for training transformers; might be useful
- On Layer Normalization in the Transformer Architecture - they stabilize the training by placing layer normalization inside the residual block, before the multi-head attention (Pre-LN). This lets them remove the warm-up and use a larger learning rate (a minimal Pre-LN block is sketched after this list).
- The transformer - the official TensorFlow implementation of the Transformer architecture
- Sequicity implementation from the authors' repository
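To make the Pre-LN idea from the paper above concrete: the only change is where layer normalization sits relative to the residual connection. The block below is a minimal, illustrative sketch (not the code used in our experiments).

```python
import torch.nn as nn


class PreLNEncoderBlock(nn.Module):
    """Pre-LN transformer encoder block: LayerNorm is applied before self-attention
    and before the feed-forward sublayer, i.e. inside each residual branch."""

    def __init__(self, d_model=128, nhead=4, dim_ff=512, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, dim_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(dim_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # x: (seq_len, batch, d_model), matching nn.MultiheadAttention's default layout.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + self.dropout(attn_out)                 # residual around the attention sublayer
        x = x + self.dropout(self.ff(self.norm2(x)))   # residual around the feed-forward sublayer
        return x
```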