---
language:
  - ar
  - en
datasets:
  - fka/awesome-chatgpt-prompts
  - open-r1/codeforces
license: mit
---

# Miscovery Tokenizer

A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.

## Training Data

This tokenizer was trained on:

- The Arabic Quran
- fka/awesome-chatgpt-prompts
- open-r1/codeforces

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Example usage
text = "بسم الله الرحمن الرحيم Hello World"
encoded = tokenizer(text)
print(encoded)
```
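Continuing from the snippet above, you can also inspect the unigram subword pieces directly or decode token IDs back to text (a minimal sketch; the exact pieces and IDs depend on the trained vocabulary):

```python
# Inspect the subword pieces produced for mixed Arabic/English input
tokens = tokenizer.tokenize("بسم الله الرحمن الرحيم Hello World")
print(tokens)

# Round-trip: encode to token IDs and decode back to text
ids = tokenizer.encode("Hello World", add_special_tokens=True)
print(tokenizer.decode(ids, skip_special_tokens=True))
```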

## Features

- Vocabulary size: 70,000
- Model type: Unigram
- Model max length: 512 (see the batching sketch below)
- Handles both Arabic and English text
- Supports Arabic normalization
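Because the model max length is 512, longer inputs are typically truncated and batches padded to a common length. The sketch below uses the standard `transformers` batched call; it assumes the tokenizer defines a pad token and that PyTorch is installed for `return_tensors="pt"`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Batch-encode mixed Arabic/English sentences with padding and truncation
batch = tokenizer(
    ["بسم الله الرحمن الرحيم", "Hello World"],
    padding=True,         # pad to the longest sequence in the batch
    truncation=True,      # drop tokens beyond max_length
    max_length=512,       # matches the model max length
    return_tensors="pt",  # requires PyTorch; omit to get plain Python lists
)
print(batch["input_ids"].shape)
```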