Lossless Text Compression with Transformer-based Language Model

This repository demonstrates lossless text compression in which a Transformer-based language model drives both the encoder and the decoder.

Contributors: Shangmin Guo (@Shawn-Guo-CN), Ze Peng (@Raphaelhpze)

The modules are:

compress.py: The script to compress a text file by the arithmetic encoding algorithm with a Transformer model
data_loader.py: The DataLoader class for loading the text file
decompress.py: The script to decompress a compressed file by the arithmetic decoding algorithm with a Transformer model identical to the compression model
model.py: The Transformer model class
trainer.py: The Trainer class for updating the model parameters and predicting the next token with a Transformer model
tokenizer.py: The Tokenizer class
utils.py: Utility functions

TODOs

Many features in the current version are for demonstration purposes only. The following are part of the future work:

Implement I/O streams for large files; the current version reads the whole file into memory
Make compression/decompression and the training of the LLM batch-wise; the current version assumes a batch size of 1
Support tracking the progress of compression/decompression and the corresponding negative log-likelihood of the data (which determines the compression ratio)
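
The link between negative log-likelihood and compression ratio mentioned in the last item can be sketched as follows. This is a minimal illustration, not code from this repo: the function names are hypothetical, and it assumes you already have the probability the model assigned to each token that actually occurred. An arithmetic coder gets close to this ideal length, with a small constant overhead.

```python
import math


def ideal_code_length_bits(token_probs):
    """Negative log-likelihood in bits: sum of -log2(p) over the observed tokens."""
    return sum(-math.log2(p) for p in token_probs)


def compression_ratio(token_probs, original_num_bytes):
    """Original size divided by the ideal compressed size (both in bytes)."""
    compressed_bytes = ideal_code_length_bits(token_probs) / 8
    return original_num_bytes / compressed_bytes


# Hypothetical probabilities assigned to four observed tokens.
probs = [0.5, 0.25, 0.25, 0.5]
print(ideal_code_length_bits(probs))  # 1 + 2 + 2 + 1 bits
```

The better the model predicts the text, the larger each probability and the fewer bits the coder needs.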

Usage

Compress

python compress.py --input_file <input_file> --output_file <output_file> --config_file <config_file>

Arguments

input_file: The path to the input text file, e.g. data/demo.txt
output_file: The path to the output compressed file, e.g. data/demo_encode_out.txt
config_file: The path to the configuration file in the YAML format, e.g. config/global/demo.yaml

Pipeline of the compression

Tokenize the input text
Compute the probability of each token given the previous tokens with a forward pass of the Transformer model
Encode the token with the arithmetic coding algorithm, using that probability
Write the arithmetic code to a text file (chosen for readability)
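
The compression steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in compress.py: `VOCAB` and `toy_next_token_probs` are hypothetical stand-ins for the tokenizer vocabulary and the Transformer forward pass, and exact fractions replace the finite-precision arithmetic a practical coder would use.

```python
import math
from fractions import Fraction

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def toy_next_token_probs(context):
    """Stand-in for the Transformer: Laplace-smoothed counts of the context."""
    counts = {tok: 1 for tok in VOCAB}
    for tok in context:
        counts[tok] += 1
    total = sum(counts.values())
    return {tok: Fraction(counts[tok], total) for tok in VOCAB}


def arithmetic_encode(tokens):
    """Narrow [low, low + width) once per token; return the final interval."""
    low, width = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        probs = toy_next_token_probs(tokens[:i])
        # Cumulative probability of all vocabulary entries before `tok`.
        cum = Fraction(0)
        for cand in VOCAB:
            if cand == tok:
                break
            cum += probs[cand]
        low += width * cum
        width *= probs[tok]
    return low, low + width


def shortest_bits_in_interval(low, high):
    """Shortest bit string b such that the binary fraction 0.b lies in [low, high)."""
    k = 0
    while True:
        m = math.ceil(low * 2**k)
        if Fraction(m, 2**k) < high:
            return format(m, "b").zfill(k)
        k += 1


low, high = arithmetic_encode(list("abac"))
print(shortest_bits_in_interval(low, high))
```

The bit string alone does not say where the message ends, so a decoder also needs the token count or an end-of-sequence token.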

Decompress

python decompress.py --input_file <input_file> --output_file <output_file> --config_file <config_file>

Arguments

input_file: The path to the input compressed file, e.g. data/demo_encode_out.txt
output_file: The path to the output decompressed text file, e.g. data/demo_decode_out.txt
config_file: The path to the configuration file in the YAML format, e.g. config/global/demo.yaml

Pipeline of the decompression

Read the arithmetic code from the input file
Decode the arithmetic code token by token, getting each token's probability from the Transformer model and updating the parameters of the model along the way
Detokenize the decoded tokens
Output the detokenised text to the output file
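
The decoding loop above can be sketched as follows, again as an illustration rather than the code in decompress.py. `VOCAB` and `toy_next_token_probs` are hypothetical stand-ins for the tokenizer vocabulary and the Transformer; the count-based model adapts as tokens are decoded, which loosely mirrors the parameter updates mentioned above. The sketch assumes the decoder is told how many tokens to recover.

```python
from fractions import Fraction

VOCAB = ["a", "b", "c"]  # hypothetical toy vocabulary


def toy_next_token_probs(context):
    """Stand-in for the Transformer: Laplace-smoothed counts of the context."""
    counts = {tok: 1 for tok in VOCAB}
    for tok in context:
        counts[tok] += 1
    total = sum(counts.values())
    return {tok: Fraction(counts[tok], total) for tok in VOCAB}


def arithmetic_decode(bits, num_tokens):
    """Recover tokens by locating the code value in each successive sub-interval."""
    code = Fraction(int(bits, 2), 2 ** len(bits))  # read bits as a binary fraction
    low, width = Fraction(0), Fraction(1)
    tokens = []
    for _ in range(num_tokens):
        # The model sees only already-decoded tokens, exactly as the encoder did.
        probs = toy_next_token_probs(tokens)
        cum = Fraction(0)
        for cand in VOCAB:
            cand_low = low + width * cum
            cand_width = width * probs[cand]
            if cand_low <= code < cand_low + cand_width:
                tokens.append(cand)
                low, width = cand_low, cand_width
                break
            cum += probs[cand]
    return tokens


print("".join(arithmetic_decode("0011001", 4)))
```

Decoding is only lossless if the decoder's model produces exactly the same probabilities as the encoder's at every step, which is why the decompression model has to be identical to the compression model.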

