Commit ef36b5d (verified) by kishkath · 1 parent: a746578

Update README.md

Files changed (1): README.md (+35 -14)
@@ -1,14 +1,35 @@
- ---
- title: Bpe Tokenizer
- emoji: 🔥
- colorFrom: blue
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.12.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- short_description: Telugu BPE tokenizer with vocabulary of 4800 words.
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Telugu Text Tokenizer
+
+ A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.
+
+ ## Features
+
+ - Encode Telugu text to token IDs
+ - View compression statistics and token visualization
+ - Decode token IDs back to Telugu text
+ - Interactive and user-friendly interface
+
+ ## Usage
+
+ 1. **Encoding Text**
+    - Enter Telugu text in the encoder tab
+    - Click "Encode" to get token IDs and statistics
+    - View token segmentation with color visualization
+
+ 2. **Decoding Text**
+    - Paste encoded token IDs in the decoder tab
+    - Click "Decode" to get back the original text
+
+ ## Technical Details
+
+ - Uses the Byte Pair Encoding (BPE) algorithm
+ - Vocabulary size: 4800 tokens
+ - Supports efficient compression of Telugu text
+ - Lossless: decoding the token IDs reconstructs the original text exactly
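
To make the BPE and reconstruction bullets concrete, here is a minimal character-level sketch of the technique in Python. It is not the Space's actual tokenizer: the function names (`train_bpe`, `merge_pair`, `encode`, `decode`), the character-level base vocabulary, and the stopping rule are all assumptions chosen for illustration.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules from `text` (toy, character-level)."""
    # Base vocabulary: one ID per distinct character (an assumption; the real
    # tokenizer may start from bytes or from some other unit).
    vocab = {i: ch for i, ch in enumerate(sorted(set(text)))}
    char_to_id = {ch: i for i, ch in vocab.items()}
    ids = [char_to_id[ch] for ch in text]
    merges = {}  # (left_id, right_id) -> new_id, kept in learned order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        new_id = len(vocab)
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return vocab, merges


def encode(text, vocab, merges):
    """Map text to token IDs by replaying the merges in the order they were learned."""
    # Raises KeyError for characters that were not in the training text.
    char_to_id = {tok: i for i, tok in vocab.items() if len(tok) == 1}
    ids = [char_to_id[ch] for ch in text]
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """Concatenate the string behind each token ID."""
    return "".join(vocab[i] for i in ids)


sample = "తెలుగు తెలుగు తెలుగు"
vocab, merges = train_bpe(sample, num_merges=10)
ids = encode(sample, vocab, merges)
print(len(sample), "characters ->", len(ids), "tokens")
print("round trip ok:", decode(ids, vocab) == sample)
```

Each merged token stores the concatenation of its two parts, so decoding is plain string concatenation. That is why the round trip is exact for any text whose characters were seen during training.
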
+
+ ## Model Information
+
+ The tokenizer was trained on a diverse corpus of Telugu text under these constraints:
+ - Maximum vocabulary size: 5000 tokens
+ - Target compression ratio: ≥ 3.2x
+ - Perfect reconstruction guarantee
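
The README does not say how the compression ratio is measured. The sketch below assumes it is input size divided by token count, with the size unit (characters or UTF-8 bytes) left as a parameter. The helpers `compression_ratio` and `meets_targets` are hypothetical, and the byte-level `byte_encode`/`byte_decode` pair is only a stand-in so the example runs without the real tokenizer.

```python
def compression_ratio(text, token_ids, unit="bytes"):
    """Input size divided by token count (the unit is an assumption, not from the README)."""
    if unit not in ("bytes", "chars"):
        raise ValueError("unit must be 'bytes' or 'chars'")
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    size = len(text.encode("utf-8")) if unit == "bytes" else len(text)
    return size / len(token_ids)


def meets_targets(text, encode, decode, min_ratio=3.2, unit="bytes"):
    """Return (round_trip_ok, ratio_ok) for one piece of text."""
    ids = encode(text)
    round_trip_ok = decode(ids) == text
    ratio_ok = compression_ratio(text, ids, unit) >= min_ratio
    return round_trip_ok, ratio_ok


# Stand-in tokenizer: one token per UTF-8 byte, so it compresses nothing.
def byte_encode(text):
    return list(text.encode("utf-8"))


def byte_decode(ids):
    return bytes(ids).decode("utf-8")


text = "తెలుగు"
print(compression_ratio(text, byte_encode(text), unit="bytes"))
print(meets_targets(text, byte_encode, byte_decode))
```

Passing the real tokenizer's encode and decode functions in place of the stand-ins would check both targets on any held-out Telugu text.
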