Fill-Mask · Transformers · PyTorch · TensorBoard · Safetensors · French · modernbert · camembert
wissamantoun committed · verified
Commit 74b415e · 1 Parent(s): 31a9b4d

Update README.md

Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -13,8 +13,8 @@ tags:
 ---
 # ModernCamemBERT
 
-[ModernCamemBERT](TODO) is a French language model pretrained on a large corpus of 1T tokens of High-Quality French text. It is the French version of the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) model. ModernCamemBERT was trained using the Masked Language Modeling (MLM) objective with 30% mask rate on 1T tokens on 48 H100 GPUs. The dataset used for training is a combination of French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) filtered using heuristic and semantic filtering, French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia. Semantic filtering was done by fine-tuning a BERT classifier trained on a document quality dataset automatically labeled by LLama-3 70B.
-We also re-use the old [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was first trained with 1024 context length which was then increased to 8192 tokens later in the pretraining. More details about the training process can be found in the [ModernCamemBERT](TODO) paper.
+[ModernCamemBERT](https://arxiv.org/abs/2504.08716) is a French language model pretrained on a large corpus of 1T tokens of High-Quality French text. It is the French version of the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) model. ModernCamemBERT was trained using the Masked Language Modeling (MLM) objective with 30% mask rate on 1T tokens on 48 H100 GPUs. The dataset used for training is a combination of French [RedPajama-V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2) filtered using heuristic and semantic filtering, French scientific documents from [HALvest](https://huggingface.co/datasets/almanach/HALvest), and the French Wikipedia. Semantic filtering was done by fine-tuning a BERT classifier trained on a document quality dataset automatically labeled by LLama-3 70B.
+We also re-use the old [CamemBERTav2](https://huggingface.co/almanach/camembertav2-base) tokenizer. The model was first trained with 1024 context length which was then increased to 8192 tokens later in the pretraining. More details about the training process can be found in the [ModernCamemBERT](https://arxiv.org/abs/2504.08716) paper.
 
 The goal of ModernCamemBERT was to run a controlled study by pretraining ModernBERT on the same dataset as CamemBERTaV2, a DeBERTaV3 French model, isolating the effect of model design. Our results show that the previous model generation remains superior in sample efficiency and overall benchmark performance, with ModernBERT’s primary advantage being faster training and inference speed. However, the new proposed model still provides meaningful architectural improvements compared to earlier models such as the BERT and RoBERTa CamemBERT/v2 model. Additionally, we observe that high-quality pre-training data accelerates convergence but does not significantly improve final performance, suggesting potential benchmark saturation.
 
@@ -55,6 +55,13 @@ We use the pretraining codebase from the [ModernBERT repository](https://github.
 ## Citation
 
 ```bibtex
-@misc{TODO
+@misc{antoun2025modernbertdebertav3examiningarchitecture,
+      title={ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance},
+      author={Wissam Antoun and Benoît Sagot and Djamé Seddah},
+      year={2025},
+      eprint={2504.08716},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2504.08716},
 }
 ```
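
For readers of the updated card, here is a minimal sketch of how the model described above could be queried for fill-mask inference with the Transformers pipeline. The repo id `almanach/moderncamembert-base` is an assumption used for illustration (substitute this repository's actual model id); the mask token is read from the tokenizer rather than hard-coded, since the card says the CamemBERTav2 tokenizer is reused.

```python
from transformers import AutoTokenizer, pipeline

# Hypothetical repo id for illustration only; replace with this repository's actual model id.
model_id = "almanach/moderncamembert-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
fill_mask = pipeline("fill-mask", model=model_id, tokenizer=tokenizer)

# Use the tokenizer's own mask token instead of hard-coding one.
text = f"Le camembert est un fromage {tokenizer.mask_token}."
for prediction in fill_mask(text, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```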
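
The card also states that pretraining used the MLM objective with a 30% mask rate via the ModernBERT codebase. As a hedged illustration only, and not the authors' actual pretraining setup, the same masking rate can be expressed with the generic Transformers collator by setting `mlm_probability=0.3`; the repo id below is again an assumption.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical repo id for illustration only; replace with this repository's actual model id.
tokenizer = AutoTokenizer.from_pretrained("almanach/moderncamembert-base")

# A 30% MLM mask rate, mirroring the rate described in the card.
# This uses the generic Transformers collator as a sketch, not the ModernBERT pretraining codebase.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

batch = collator([tokenizer("Le camembert est un fromage français.")])
print(batch["input_ids"].shape, batch["labels"].shape)
```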