
	---
	title: Bpe Tokenizer
	emoji: 🔥
	colorFrom: blue
	colorTo: yellow
	sdk: gradio
	sdk_version: 5.12.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: Telugu BPE tokenizer with a 4800-token vocabulary.
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

	# Telugu Text Tokenizer

	A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.

	## Features

	- Encode Telugu text to token IDs
	- View compression statistics and token visualization
	- Decode token IDs back to Telugu text
	- Interactive and user-friendly interface

	## Usage

	1. Encoding Text
	   - Enter Telugu text in the encoder tab
	   - Click "Encode" to get token IDs and statistics
	   - View token segmentation with color visualization

	2. Decoding Text
	   - Paste encoded token IDs in the decoder tab
	   - Click "Decode" to get back the original text
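
The color visualization mentioned in the encoding steps could be produced along these lines. This is only a sketch: the function name `render_tokens`, the palette, and the assumption that each token is available as a displayable string are all hypothetical and not taken from this Space's `app.py`.

```python
import html

# Hypothetical palette; the colors used by the actual app are not documented here.
PALETTE = ["#ffd6a5", "#caffbf", "#9bf6ff", "#bdb2ff"]


def render_tokens(tokens: list[str]) -> str:
    """Wrap each token in a <span> whose background color cycles through PALETTE."""
    spans = []
    for i, tok in enumerate(tokens):
        color = PALETTE[i % len(PALETTE)]
        # Escape the token text so characters such as "<" display literally.
        spans.append(f'<span style="background:{color}">{html.escape(tok)}</span>')
    return "".join(spans)


# Example with a hypothetical segmentation of a short Telugu phrase.
print(render_tokens(["తెలు", "గు", " భాష"]))
```

Cycling through a small palette keeps adjacent tokens visually distinct without needing one color per vocabulary entry.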

	## Technical Details

	- Uses the Byte Pair Encoding (BPE) algorithm
	- Vocabulary size: 4800 tokens
	- Compresses Telugu text by merging frequently occurring symbol pairs into single tokens
	- Lossless: decoding the token IDs reproduces the original text exactly
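
To illustrate the points above, here is a minimal BPE sketch. It assumes a byte-level variant (base vocabulary of the 256 byte values, merges learned over UTF-8 bytes); this README does not say whether the Space's tokenizer works on bytes or characters, so the function names and the toy `vocab_size` of 300 are illustrative only.

```python
from collections import Counter


def merge(ids: list[int], pair: tuple[int, int], new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text: str, vocab_size: int) -> dict[tuple[int, int], int]:
    """Learn merges until `vocab_size` is reached or no pair occurs twice."""
    ids = list(text.encode("utf-8"))
    merges: dict[tuple[int, int], int] = {}
    next_id = 256  # IDs 0..255 are reserved for raw bytes
    while next_id < vocab_size and len(ids) >= 2:
        pair, count = Counter(zip(ids, ids[1:])).most_common(1)[0]
        if count < 2:
            break
        ids = merge(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return merges


def encode(text: str, merges: dict[tuple[int, int], int]) -> list[int]:
    """Apply the learned merges in the order they were learned."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids: list[int], merges: dict[tuple[int, int], int]) -> str:
    """Expand each token ID back to its bytes and decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "తెలుగు భాష తెలుగు ప్రజల భాష"
merges = train_bpe(sample, vocab_size=300)
ids = encode(sample, merges)
print(len(sample.encode("utf-8")), "bytes ->", len(ids), "tokens")
print(decode(ids, merges) == sample)
```

Because every token expands to a fixed byte sequence, decoding an encoded text always returns the input, which is the lossless property listed above.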

	## Model Information

	The tokenizer was trained on a diverse corpus of Telugu text with the following targets:

	- Maximum vocabulary size: 5000 tokens (the trained vocabulary has 4800)
	- Target compression ratio: ≥ 3.2x
	- Guaranteed lossless reconstruction of the input text
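
The README does not define how the compression ratio is measured. One common convention, assumed in the sketch below, is UTF-8 bytes of input divided by the number of tokens produced; the helper names and the sample token IDs are hypothetical.

```python
TARGET_RATIO = 3.2  # target stated in this README


def compression_ratio(text: str, token_ids: list[int]) -> float:
    """Assumed definition: UTF-8 byte count divided by token count."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    return len(text.encode("utf-8")) / len(token_ids)


def meets_target(text: str, token_ids: list[int], target: float = TARGET_RATIO) -> bool:
    """Check whether an encoding reaches the target ratio."""
    return compression_ratio(text, token_ids) >= target


# Hypothetical token IDs for a short Telugu word, for illustration only.
text = "తెలుగు"
token_ids = [301, 412, 977, 1203, 55]
print(round(compression_ratio(text, token_ids), 2), meets_target(text, token_ids))
```

If the Space instead reports characters per token, the same helper applies with `len(text)` in place of the byte count.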