
metadata
title: Bpe Tokenizer
emoji: 🔥
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 5.12.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Telugu BPE tokenizer with a vocabulary of 4800 tokens.

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Telugu Text Tokenizer

A Gradio web interface for encoding and decoding Telugu text using a trained BPE tokenizer.

Features

  • Encode Telugu text to token IDs
  • View compression statistics and token visualization
  • Decode token IDs back to Telugu text
  • Interactive and user-friendly interface

Usage

  1. Encoding Text

    • Enter Telugu text in the encoder tab
    • Click "Encode" to get token IDs and statistics
    • View token segmentation with color visualization
  2. Decoding Text

    • Paste encoded token IDs in the decoder tab
    • Click "Decode" to get back the original text
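
As an illustration of the decoding step, the IDs pasted into the decoder tab have to be turned into a list of integers before they can be decoded. The helper below is a hypothetical sketch, not code from `app.py`: the name `parse_token_ids` and the accepted input formats (comma- or whitespace-separated, with optional brackets) are assumptions.

```python
import re


def parse_token_ids(raw: str) -> list[int]:
    """Turn pasted text such as "[258, 97, 301]" or "258 97 301" into a list of ints.

    Hypothetical helper: the input formats accepted here are an assumption.
    """
    # Drop surrounding whitespace and an optional pair of brackets.
    cleaned = raw.strip().strip("[]()")
    if not cleaned:
        return []
    # Split on commas and/or whitespace, ignoring empty fragments.
    parts = [p for p in re.split(r"[,\s]+", cleaned) if p]
    try:
        ids = [int(p) for p in parts]
    except ValueError as exc:
        raise ValueError(f"Token IDs must be integers: {raw!r}") from exc
    if any(i < 0 for i in ids):
        raise ValueError(f"Token IDs must be non-negative: {raw!r}")
    return ids
```

Rejecting malformed input with a clear error keeps a typo in the pasted IDs from silently producing the wrong text.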

Technical Details

  • Uses Byte Pair Encoding (BPE) algorithm
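  • Vocabulary size: 4800 tokens (within the 5000-token training maximum)
  • Compresses Telugu text, with a target compression ratio of ≥ 3.2x
  • Lossless round trip: decoding the encoded token IDs reproduces the original text exactly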
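
For readers unfamiliar with BPE, the points above can be sketched in a few lines of Python. This is a minimal illustration and not the tokenizer shipped with this Space: it assumes a byte-level variant (base vocabulary of the 256 byte values), and the function names `train_bpe`, `merge_pair`, `encode`, and `decode` are made up for the example.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, vocab_size):
    """Learn merge rules on the UTF-8 bytes of `text` (assumed byte-level BPE)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    next_id = 256  # IDs 0..255 are the raw byte values
    while next_id < vocab_size and len(ids) >= 2:
        pair, count = Counter(zip(ids, ids[1:])).most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return merges


def encode(text, merges):
    """Apply the learned merges, in training order, to the bytes of `text`."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # dicts preserve insertion order
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand each token ID back to its bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


sample = "తెలుగు భాష తెలుగు పాట"
merges = train_bpe(sample, vocab_size=300)
ids = encode(sample, merges)
assert decode(ids, merges) == sample  # lossless round trip
```

Because every merge only concatenates the bytes of two existing tokens, decoding always recovers the exact input bytes, which is why a round trip is lossless in this scheme.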

Model Information

The tokenizer was trained on a diverse corpus of Telugu text under these constraints:

  • Maximum vocabulary size: 5000 tokens
  • Target compression ratio: ≥ 3.2x
  • Perfect reconstruction guarantee
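
The three constraints above can be checked mechanically. The sketch below is an assumption-laden illustration: this README does not define the compression ratio, so the code assumes UTF-8 bytes of input per emitted token, and the names `compression_ratio` and `meets_targets` are hypothetical.

```python
def compression_ratio(text, ids):
    """UTF-8 bytes of input per token (an assumed definition of the ratio)."""
    if not ids:
        raise ValueError("cannot compute a ratio for an empty token sequence")
    return len(text.encode("utf-8")) / len(ids)


def meets_targets(text, ids, decoded, vocab_size, max_vocab=5000, min_ratio=3.2):
    """Report which of the three stated constraints hold for one sample."""
    return {
        "vocab_within_limit": vocab_size <= max_vocab,
        "ratio_reached": compression_ratio(text, ids) >= min_ratio,
        "lossless": decoded == text,
    }


# Hypothetical values: `ids` and `decoded` would come from the real tokenizer.
text = "తెలుగు"
report = meets_targets(text, ids=[300, 301, 302, 303], decoded=text, vocab_size=4800)
```

If the ratio were instead defined per character rather than per byte, only `compression_ratio` would need to change.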