Space: kishkath/bpe-tokenizer
kishkath committed on Jan 15

Commit ef36b5d (verified) · 1 parent: a746578

Update README.md

Files changed (1): README.md

@@ -1,14 +1,35 @@
----
-title: Bpe Tokenizer
-emoji: 🔥
-colorFrom: blue
-colorTo: yellow
-sdk: gradio
-sdk_version: 5.12.0
-app_file: app.py
-pinned: false
-license: apache-2.0
-short_description: Telugu BPE tokenizer with vocabulary of 4800 words.
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Telugu Text Tokenizer
+A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.
+## Features
+- Encode Telugu text to token IDs
+- View compression statistics and token visualization
+- Decode token IDs back to Telugu text
+- Interactive and user-friendly interface
+## Usage
+1. **Encoding Text**
+   - Enter Telugu text in the encoder tab
+   - Click "Encode" to get token IDs and statistics
+   - View token segmentation with color visualization
+2. **Decoding Text**
+   - Paste encoded token IDs in the decoder tab
+   - Click "Decode" to get back the original text
+## Technical Details
+- Uses the Byte Pair Encoding (BPE) algorithm
+- Vocabulary size: 4800 tokens
+- Compresses Telugu text by merging frequent symbol pairs into single tokens
+- Lossless: decoding the token IDs reproduces the input text exactly
+## Model Information
+The tokenizer is trained on a diverse corpus of Telugu text with:
+- Maximum vocabulary size: 5000 tokens (the trained vocabulary has 4800)
+- Target compression ratio: ≥ 3.2x
+- Perfect reconstruction guarantee
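
The README names the algorithm but does not show how encoding, decoding, or the compression ratio work. The following is a minimal sketch of byte-level BPE, not the Space's actual implementation. It assumes the tokenizer operates on UTF-8 bytes and that the compression ratio means UTF-8 bytes per token; neither is stated in the README. The function names `train_bpe`, `merge`, `encode`, and `decode` are hypothetical, and the tiny `vocab_size=300` is chosen only so the toy sample trains instantly.

```python
from collections import Counter


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Learn merges over UTF-8 bytes until the vocabulary reaches `vocab_size`."""
    ids = list(text.encode("utf-8"))
    merges: dict[tuple[int, int], int] = {}
    next_id = 256  # IDs 0-255 are reserved for the raw bytes
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress
        merges[pair] = next_id
        ids = merge(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Turn text into token IDs by replaying the merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand token IDs back to bytes, then decode the bytes as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


# Toy corpus, for illustration only (the real training corpus is not in the README).
sample = "తెలుగు భాష తెలుగు భాష తెలుగు"
merges = train_bpe(sample, vocab_size=300)
ids = encode(sample, merges)

# Lossless round trip: decoding the IDs gives back the input text.
assert decode(ids, merges) == sample

# Compression ratio under the bytes-per-token assumption.
ratio = len(sample.encode("utf-8")) / len(ids)
print(f"{len(ids)} tokens, ratio {ratio:.2f}x")
```

Because the base vocabulary covers all 256 byte values, any input decodes back exactly, including text never seen during training. A tokenizer built on characters rather than bytes would need an explicit fallback for unseen characters to offer the same guarantee.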