---
title: Bpe Tokenizer
emoji: 🔥
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Telugu BPE tokenizer with vocabulary of 4800 words.
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Telugu Text Tokenizer

A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.

## Features

- Encode Telugu text to token IDs
- View compression statistics and token visualization
- Decode token IDs back to Telugu text
- Interactive and user-friendly interface

## Usage

1. **Encoding Text**
   - Enter Telugu text in the encoder tab
   - Click "Encode" to get token IDs and statistics
   - View token segmentation with color visualization
2. **Decoding Text**
   - Paste encoded token IDs in the decoder tab
   - Click "Decode" to get back the original text

## Technical Details

- Uses the Byte Pair Encoding (BPE) algorithm
- Vocabulary size: 4800 tokens
- Compresses Telugu text by merging frequent symbol pairs into single tokens
- Lossless: decoding the token IDs reproduces the input text exactly
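
To make the points above concrete, here is a minimal sketch of BPE training, encoding, and decoding. It assumes a byte-level variant that operates on UTF-8 bytes; the tokenizer shipped with this Space may be implemented differently, and the function names (`train_bpe`, `encode`, `decode`) and the toy corpus are illustrative only.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merge rules on the UTF-8 bytes of `text` (assumed byte-level BPE)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in learning order
    next_id = 256  # IDs 0..255 are reserved for the raw bytes
    while next_id < vocab_size and len(ids) >= 2:
        pair, count = Counter(zip(ids, ids[1:])).most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so further merges would not compress
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges, in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand each token ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Toy corpus for illustration only; a real tokenizer needs far more text.
corpus = "తెలుగు భాష తెలుగు లిపి తెలుగు"
merges = train_bpe(corpus, vocab_size=300)
sample = "తెలుగు లిపి"
token_ids = encode(sample, merges)
assert decode(token_ids, merges) == sample  # lossless round trip
```

Because every token expands back to a fixed byte sequence, decoding the output of `encode` always reproduces the input, which is the property the "perfect reconstruction" bullet refers to.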

## Model Information

The tokenizer is trained on a diverse corpus of Telugu text with the following constraints:

- Maximum vocabulary size: 5000 tokens (the trained vocabulary contains 4800)
- Target compression ratio: ≥ 3.2x
- Perfect reconstruction guarantee: decoding an encoded text returns the original text
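
The README does not define how the compression ratio is measured. The sketch below assumes it is the number of UTF-8 bytes in the input divided by the number of tokens produced; a character-based definition is equally plausible. The helper name `compression_ratio` and the token IDs are hypothetical placeholders, not output from this tokenizer.

```python
def compression_ratio(text, token_ids):
    """UTF-8 bytes per token (an assumed definition, see the note above)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


text = "తెలుగు"  # 6 code points, 3 bytes each in UTF-8
hypothetical_ids = [301, 412, 287, 999, 1024]  # placeholder IDs for illustration
ratio = compression_ratio(text, hypothetical_ids)
print(f"{ratio:.2f}x")  # 18 bytes / 5 tokens
print("meets 3.2x target:", ratio >= 3.2)
```

Under this definition, a tokenizer meets the 3.2x target when it emits at most one token per 3.2 bytes of input on average.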