---
title: Bpe Tokenizer
emoji: 🔥
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Telugu BPE tokenizer with vocabulary of 4800 tokens.
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Telugu Text Tokenizer

A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.

## Features

- Encode Telugu text to token IDs
- View compression statistics and token visualization
- Decode token IDs back to Telugu text
- Interactive and user-friendly interface

## Usage

1. **Encoding Text**
   - Enter Telugu text in the encoder tab
   - Click "Encode" to get token IDs and statistics
   - View token segmentation with color visualization

2. **Decoding Text**
   - Paste encoded token IDs in the decoder tab
   - Click "Decode" to get back the original text
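
The decoder tab takes token IDs as pasted text, so the input has to be turned into a list of integers before decoding. The exact format the app accepts is not documented here; the sketch below assumes IDs separated by commas or whitespace, optionally wrapped in square brackets. `parse_token_ids` is a hypothetical helper, not a function from `app.py`.

```python
import re


def parse_token_ids(raw: str) -> list[int]:
    """Parse pasted token IDs such as "[256, 301, 97]" or "256 301 97".

    Hypothetical helper: the separators and bracket handling are assumptions.
    """
    # Drop surrounding whitespace and an optional pair of square brackets.
    cleaned = raw.strip().strip("[]")
    # Split on commas and/or whitespace, ignoring empty pieces.
    parts = [p for p in re.split(r"[,\s]+", cleaned) if p]
    ids = []
    for part in parts:
        # Token IDs are non-negative integers; reject anything else.
        if not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a token ID: {part!r}")
        ids.append(int(part))
    return ids
```

Rejecting malformed input early gives a clearer error than letting a bad ID reach the decoder.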

## Technical Details

- Uses the Byte Pair Encoding (BPE) algorithm
- Vocabulary size: 4800 tokens
- Compresses Telugu text into short token sequences
- Lossless: decoding the token IDs reproduces the input text exactly
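
To illustrate how BPE produces a vocabulary and why decoding is lossless, here is a minimal sketch. It assumes a byte-level variant (256 base tokens, one per UTF-8 byte) with no pre-tokenization; the tokenizer in this Space may differ in both respects. The function names are illustrative and are not taken from `app.py`.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` or no pair repeats."""
    ids = list(text.encode("utf-8"))  # assumption: byte-level base vocabulary
    merges = {}  # (left_id, right_id) -> new_id, kept in training order
    for new_id in range(256, vocab_size):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        if counts[pair] < 2:
            break  # nothing left worth merging
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """Apply the learned merges to new text in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand each token ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Because every merged token is just a concatenation of the bytes it replaced, `decode(encode(text, merges), merges)` returns `text` unchanged in this sketch, which is the round-trip property listed above.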

## Model Information

The tokenizer was trained on a diverse corpus of Telugu text with the following targets:
- Maximum vocabulary size: 5000 tokens (the trained vocabulary has 4800)
- Target compression ratio: ≥ 3.2x
- Lossless round trip: encoding and then decoding returns the original text
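
This README does not define how the compression ratio is measured. One common convention, assumed in the sketch below, is the number of UTF-8 bytes in the input divided by the number of tokens produced; the app may instead count characters. `compression_ratio` is a hypothetical helper.

```python
def compression_ratio(text: str, token_ids: list[int]) -> float:
    """UTF-8 bytes per token (assumed definition, not confirmed by the app)."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)
```

Under this definition, a ratio of 3.2 means each token covers 3.2 bytes on average, which is slightly more than one Telugu code point, since each takes 3 bytes in UTF-8.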