This repo demonstrates lossless compression with a Transformer-based language model as the encoder and decoder.
Contributors: Shangmin Guo (@Shawn-Guo-CN), Ze Peng (@Raphaelhpze)
The modules are:
compress.py
: The script to compress a text file by the arithmetic encoding algorithm with a Transformer model
data_loader.py
: The DataLoader class for loading the text file
decompress.py
: The script to decompress a binary file by the arithmetic decoding algorithm with a Transformer model identical to the compression model
model.py
: The Transformer model class
trainer.py
: The Trainer class for updating the model parameters and predicting the next token with a Transformer model
tokenizer.py
: The Tokenizer class
utils.py
: Utility functions
Many features in the current version are for demonstration purposes only. The following are part of the future work:
- Implement I/O streams for large files; the current version reads the whole file into memory
- Update compression/decompression and the training of the LLM to work batch-wise; the current version assumes a batch size of 1
- Support tracking the progress of compression/decompression and the corresponding negative log-likelihood of the data (which determines the compressed size and hence the compression ratio)
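The link between negative log-likelihood and compression ratio can be illustrated with a small sketch. It is not part of this repo: the function names are hypothetical, and it assumes an ideal arithmetic coder, which spends about -log2(p) bits on a token the model predicted with probability p.

```python
import math


def ideal_code_length_bits(token_probs):
    """Total negative log-likelihood in bits: the sum of -log2(p) over the
    probabilities the model assigned to the tokens that actually occurred."""
    return sum(-math.log2(p) for p in token_probs)


def compression_ratio(num_input_bytes, token_probs):
    """Original size in bits divided by the ideal compressed size in bits."""
    return (8 * num_input_bytes) / ideal_code_length_bits(token_probs)
```

A real coder adds a few bits of overhead on top of this ideal length, so the ratio computed here is an upper bound on what the demo can achieve for a given model.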
python compress.py --input_file <input_file> --output_file <output_file> --config_file <config_file>
input_file
: The path to the input text file, e.g. data/demo.txt
output_file
: The path to the output compressed file, e.g. data/demo_encode_out.txt
config_file
: The path to the configuration file in the YAML format, e.g. config/global/demo.yaml
- Tokenize the input text
- Calculate the probability of each token given the previous tokens with a forward pass of the Transformer model
- Encode the token with the arithmetic coding algorithm, using that probability
- Output the arithmetic code to a text file for readability
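The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in compress.py: `toy_probs` is a hypothetical stand-in for the Transformer forward pass (smoothed counts over a three-token vocabulary), and exact `Fraction` arithmetic replaces the finite-precision arithmetic a practical coder would use.

```python
import math
from fractions import Fraction

VOCAB_SIZE = 3  # hypothetical tiny vocabulary, for illustration only


def toy_probs(prefix):
    """Stand-in for the model: add-one smoothed token counts of the prefix."""
    counts = [1] * VOCAB_SIZE
    for t in prefix:
        counts[t] += 1
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


def encode(tokens):
    """Narrow the interval [low, low + width) once per token."""
    low, width = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = toy_probs(tokens[:i])
        cum = sum(probs[:tok], Fraction(0))  # probability mass below tok
        low += width * cum
        width *= probs[tok]
    return low, width


def to_bits(low, width):
    """Shortest bit string b such that 0.b (binary) lies in [low, low + width)."""
    k = 0
    while True:
        k += 1
        n = math.ceil(low * 2**k)  # smallest k-bit fraction that is >= low
        if Fraction(n, 2**k) < low + width:
            return format(n, "b").zfill(k)
```

For example, `to_bits(*encode([0, 2, 0]))` yields a short bit string identifying the sequence. In this sketch the decoder would also need the number of tokens (or an end-of-sequence token) to know where to stop.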
python decompress.py --input_file <input_file> --output_file <output_file> --config_file <config_file>
input_file
: The path to the input compressed file, e.g. data/demo_encode_out.txt
output_file
: The path to the output decompressed file, e.g. data/demo_decode_out.txt
config_file
: The path to the configuration file in the YAML format, e.g. config/global/demo.yaml
- Read the arithmetic code from the input file
- Decode the arithmetic code into tokens, getting each token's probability from the Transformer model and updating the model's parameters along the way
- Detokenise the tokens
- Output the detokenised text to the output file
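The decoding loop above can be sketched in the same spirit. This is an assumption-laden illustration rather than the code in decompress.py: `toy_probs` is a hypothetical stand-in for the Transformer (add-one smoothed counts over a three-token vocabulary), the token count is passed in explicitly, and detokenisation is omitted. The point it shows is that the decoder recomputes exactly the probabilities the encoder used, which is why both sides need an identical model that evolves in the same way.

```python
from fractions import Fraction

VOCAB_SIZE = 3  # hypothetical tiny vocabulary, for illustration only


def toy_probs(prefix):
    """Stand-in for the model: add-one smoothed token counts of the prefix."""
    counts = [1] * VOCAB_SIZE
    for t in prefix:
        counts[t] += 1
    total = sum(counts)
    return [Fraction(c, total) for c in counts]


def decode(bits, num_tokens):
    """Recover num_tokens tokens from a bit string read as a binary fraction."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    tokens = []
    for _ in range(num_tokens):
        probs = toy_probs(tokens)  # same prediction the encoder made here
        scaled = (value - low) / width  # position inside the current interval
        cum = Fraction(0)
        for tok, p in enumerate(probs):
            if cum <= scaled < cum + p:
                break
            cum += p
        tokens.append(tok)
        low += width * cum
        width *= p
    return tokens
```

Under this toy model, `decode("01", 3)` recovers the token sequence `[0, 2, 0]`, because the binary fraction 0.01 (one quarter) falls inside the interval that sequence occupies.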