---
title: Bpe Tokenizer
emoji: 🔥
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Telugu BPE tokenizer with a vocabulary of 4800 tokens.
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Telugu Text Tokenizer
A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.
## Features
- Encode Telugu text to token IDs
- View compression statistics and token visualization
- Decode token IDs back to Telugu text
- Interactive and user-friendly interface
## Usage
1. **Encoding Text**
   - Enter Telugu text in the encoder tab
   - Click "Encode" to get token IDs and statistics
   - View token segmentation with color visualization
2. **Decoding Text**
   - Paste encoded token IDs in the decoder tab
   - Click "Decode" to get back the original text
## Technical Details
- Uses Byte Pair Encoding (BPE) algorithm
- Vocabulary size: 4800 tokens
- Supports efficient compression of Telugu text
- Maintains perfect reconstruction: decoding the token IDs returns the original text exactly
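
The BPE idea behind these points can be illustrated with a minimal sketch. It assumes a byte-level BPE over UTF-8 bytes, which may differ from how this Space's tokenizer is actually implemented; the function names `train_bpe`, `merge`, `encode`, and `decode` are hypothetical and are not taken from `app.py`.

```python
from collections import Counter


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Learn merges until the vocabulary reaches `vocab_size` (assumed byte-level start)."""
    ids = list(text.encode("utf-8"))  # IDs 0..255 are the raw bytes
    merges: dict[tuple[int, int], int] = {}
    next_id = 256
    while next_id < vocab_size and len(ids) >= 2:
        pair, count = Counter(zip(ids, ids[1:])).most_common(1)[0]
        if count < 2:  # no pair repeats, so further merges would not compress
            break
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply the learned merges in training order (dicts preserve insertion order)."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand each token ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "తెలుగు భాష తెలుగు భాష"
merges = train_bpe(sample, vocab_size=300)
token_ids = encode(sample, merges)
```

Because every merged token expands back to the exact bytes it replaced, `decode(encode(text, merges), merges)` returns `text` unchanged for any valid input string, which is what perfect reconstruction means in this sketch.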
## Model Information
The tokenizer is trained on a diverse corpus of Telugu text with:
- Maximum vocabulary size: 5000 tokens (the trained vocabulary contains 4800)
- Target compression ratio: ≥ 3.2x
- Perfect reconstruction guarantee
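
This README does not say how the compression ratio is measured. As an illustration only, the sketch below assumes it is the number of UTF-8 bytes in the input divided by the number of tokens produced; a character-based definition is also plausible and would give smaller values, since most Telugu characters occupy three bytes. The function name `compression_ratio` is hypothetical.

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """UTF-8 bytes per token (an assumed definition, not confirmed by the Space)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


def meets_target(text: str, token_ids: list[int], target: float = 3.2) -> bool:
    """Check a tokenization against the 3.2x target stated above."""
    return compression_ratio(text, token_ids) >= target
```

Under this definition, a text of 18 bytes encoded as 5 tokens has a ratio of 3.6 and meets the target, while the same text encoded as 6 tokens has a ratio of 3.0 and does not.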