This web application demonstrates Byte Pair Encoding (BPE) tokenization, a subword tokenization technique used in natural language processing and machine learning, notably by modern language models such as GPT.
- Tokenization: Convert text into tokens using Byte Pair Encoding
- Decoding: Reconstruct text from tokens
- Flexible Input: Support for different input formats
- Modern, Minimalist UI
- Dark Mode Design
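The README does not say which input formats are accepted. As one possible reading of "Flexible Input", the sketch below parses token IDs typed as a JSON-style list, a comma-separated string, or a space-separated string. The function name `parse_token_ids` is hypothetical and is not taken from `app.py`.

```python
import json
import re
from typing import List


def parse_token_ids(raw: str) -> List[int]:
    """Parse token IDs from a JSON-style list or a comma/space separated string."""
    raw = raw.strip()
    if raw.startswith("["):
        # Bracketed input is treated as a JSON array, e.g. "[104, 105]".
        values = json.loads(raw)
    else:
        # Otherwise split on commas and/or whitespace and drop empty pieces.
        values = [part for part in re.split(r"[,\s]+", raw) if part]
    return [int(v) for v in values]
```

Normalizing every accepted format to a plain list of integers keeps the decoding step independent of how the user typed the tokens.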
- Python 3.8+
- Flask
- NumPy
- Clone the repository:
git clone https://github.com/yourusername/bpe-tokenizer.git
cd bpe-tokenizer
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
- Install dependencies:
pip install flask numpy
- Run the application:
python app.py
- Open your browser and navigate to http://localhost:5000
- `app.py`: Flask backend application
- `tokernizer.py`: Custom tokenization implementation
- `templates/index.html`: Web application frontend
- `read.txt`: Sample training text for tokenizer (optional)
- Input text is analyzed
- Tokens are generated using Byte Pair Encoding
- Tokens can be decoded back to original text
- Input tokens
- Reconstruct original text
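The tokenization and decoding steps above can be sketched in a few lines of Python. This is a minimal byte-level illustration of BPE, not the code in `tokernizer.py`: the function names `train`, `encode`, and `decode` are hypothetical, and starting from the 256 possible byte values is an assumption about how the base vocabulary is built.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, vocab_size):
    """Learn merge rules until the vocabulary reaches `vocab_size` (assumed >= 256)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in the order learned
    for new_id in range(256, vocab_size):
        pair = most_frequent_pair(ids)
        if pair is None:  # nothing left to merge
            break
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Convert text to token IDs by replaying the merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Rebuild the byte string for each token ID and decode it as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

With merges learned from a training text, `decode(encode(text, merges), merges)` returns the original `text`, because every merged token expands back to exactly the bytes it replaced.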
- Modify `vocab_size` in `app.py` to change token vocabulary
- Update `read.txt` with your training corpus
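To make the effect of `vocab_size` concrete, here is a small sketch under one assumption the README does not confirm: that the tokenizer is byte-level, so the first 256 IDs are reserved for raw bytes and every additional vocabulary entry corresponds to one learned merge. The helper names `merges_for` and `load_corpus` are hypothetical.

```python
from pathlib import Path

BASE_VOCAB = 256  # assumed: one token per possible byte value


def merges_for(vocab_size: int) -> int:
    """Number of merge rules a byte-level BPE trainer would learn for a target size."""
    if vocab_size < BASE_VOCAB:
        raise ValueError(f"vocab_size must be at least {BASE_VOCAB}")
    return vocab_size - BASE_VOCAB


def load_corpus(path: str = "read.txt") -> str:
    """Read the training corpus as UTF-8 text."""
    return Path(path).read_text(encoding="utf-8")
```

Under this assumption, a larger `vocab_size` means more merges and therefore shorter token sequences for text that resembles the corpus, while a small corpus may run out of pairs to merge before the target size is reached.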
- Python
- Flask
- JavaScript
- Axios
- HTML5
- CSS3
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See `LICENSE` for more information.
- Your Name
- Email: [email protected]
- Project Link: https://github.com/yourusername/bpe-tokenizer