datasets:
- manu/tok_corpus
language:
- fr
- en
BPE tokenizer fitted on a custom corpus, with digit separation, byte fallback, and other features from LlamaTokenizer.
It was fitted on only 1,000,000 samples.
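
As a rough illustration of what digit separation and byte fallback mean, here is a minimal standard-library sketch. It is not this tokenizer's implementation: the helper names `split_digits` and `byte_fallback` and the toy vocabulary are hypothetical, and a real BPE tokenizer would also apply learned merges and fall back per character rather than per piece.

```python
import re


def split_digits(text: str) -> list[str]:
    """Digit separation: every digit becomes its own piece."""
    # Match either a single digit or a run of non-digit characters.
    return re.findall(r"\d|\D+", text)


def byte_fallback(piece: str, vocab: set[str]) -> list[str]:
    """Byte fallback: unknown pieces become <0xNN> UTF-8 byte tokens."""
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"prix: ", "1", "2", "5"}

tokens = []
for piece in split_digits("prix: 125€"):
    tokens.extend(byte_fallback(piece, toy_vocab))

print(tokens)
```

In this sketch the number `125` is emitted as three separate digit tokens, and the euro sign, which is absent from the toy vocabulary, is represented by its three UTF-8 bytes instead of an unknown token.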