---
language:
  - ar
  - en
datasets:
  - fka/awesome-chatgpt-prompts
  - open-r1/codeforces
license: mit
---

# Miscovery Tokenizer

A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.

## Training Data

This tokenizer was trained on:

- The Arabic Quran
- fka/awesome-chatgpt-prompts
- open-r1/codeforces

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Example usage
text = "بسم الله الرحمن الرحيم Hello World"
encoded = tokenizer(text)
print(encoded)
```
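Continuing from the snippet above, you can also inspect the unigram subword pieces directly or decode token IDs back to text (a minimal sketch; the exact pieces and IDs depend on the trained vocabulary):

```python
# Inspect the subword pieces produced for mixed Arabic/English input
tokens = tokenizer.tokenize("بسم الله الرحمن الرحيم Hello World")
print(tokens)

# Round-trip: encode to token IDs and decode back to text
ids = tokenizer.encode("Hello World", add_special_tokens=True)
print(tokenizer.decode(ids, skip_special_tokens=True))
```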

## Features

- Vocabulary size: 70,000
- Model type: Unigram
- Model max length: 512 (see the batching sketch below)
- Handles both Arabic and English text
- Supports Arabic normalization
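Because the model max length is 512, longer inputs are typically truncated and batches padded to a common length. The sketch below uses the standard `transformers` batched call; it assumes the tokenizer defines a pad token and that PyTorch is installed for `return_tensors="pt"`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Batch-encode mixed Arabic/English sentences with padding and truncation
batch = tokenizer(
    ["بسم الله الرحمن الرحيم", "Hello World"],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # drop tokens beyond max_length
    max_length=512,       # matches the model max length
    return_tensors="pt",  # requires PyTorch; omit to get plain Python lists
)
print(batch["input_ids"].shape)
```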