---
language:
- ar
- en
datasets:
- fka/awesome-chatgpt-prompts
- open-r1/codeforces
license: mit
---
# Miscovery Tokenizer
A SentencePiece unigram tokenizer trained on a mix of Arabic and English text, with a vocabulary size of 70,000 tokens.
## Training Data
This tokenizer was trained on:
- The Arabic Quran
- fka/awesome-chatgpt-prompts
- open-r1/codeforces
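
The card does not ship the training script, but a tokenizer with these characteristics could be produced with the SentencePiece trainer along the following lines. This is a minimal sketch: the corpus path, character coverage, and normalization rule below are illustrative assumptions, not the actual training configuration.

```python
# Hypothetical training sketch (not the actual script used for this tokenizer).
# Assumes the Arabic and English training text has been concatenated into corpus_ar_en.txt.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_ar_en.txt",            # assumed corpus file
    model_prefix="miscovery_ar_en",      # writes miscovery_ar_en.model / .vocab
    model_type="unigram",                # matches the card: SentencePiece unigram
    vocab_size=70000,                    # matches the card: 70,000 tokens
    character_coverage=0.9995,           # assumption; helps cover mixed Arabic/English scripts
    normalization_rule_name="nmt_nfkc",  # assumption; NFKC-style normalization
)
```

The resulting `.model` file can then be wrapped in a `transformers`-compatible tokenizer for upload to the Hub.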
## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("miscovery/arabic-english-tokenizer")

# Example usage
text = "بسم الله الرحمن الرحيم Hello World"
encoded = tokenizer(text)
print(encoded)
```
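
To see the actual subword segmentation and confirm the encoding round-trips cleanly, you can also inspect the tokens and decode the IDs back to text (the exact token strings depend on the learned vocabulary):

```python
# Inspect the subword pieces produced by the unigram model
print(tokenizer.tokenize(text))

# Round-trip: decode the input IDs back to the original string
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))
```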
## Features
- Vocabulary size: 70,000
- Model type: Unigram
- Model max length: 512 tokens
- Handles both Arabic and English text
- Supports Arabic normalization
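
Since the model max length is 512 tokens, longer inputs can be truncated at encoding time. A minimal batch-encoding sketch with padding and truncation (the example sentences are arbitrary):

```python
# Batch-encode a mixed Arabic/English pair, padding to the longest sequence
# and truncating anything beyond the 512-token model max length.
batch = tokenizer(
    ["بسم الله الرحمن الرحيم", "Hello World"],
    padding=True,
    truncation=True,
    max_length=512,
)
print([len(ids) for ids in batch["input_ids"]])
```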