pinzhenchen committed
Commit 3c38b13 · verified · 1 Parent(s): f4840f4

upload Marian checkpoint and README

Files changed (2)
  1. README.md +54 -0
  2. model.npz.best-chrf.npz +3 -0
README.md ADDED
@@ -0,0 +1,54 @@
---
language:
- en
- af
tags:
- translation
license: cc-by-4.0
inference: false
---

<img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width="12.5%">

### HPLT MT release v2.0

This repository contains the English-Afrikaans (en->af) encoder-decoder translation model trained on HPLT v2.0 and OPUS parallel data. The model is currently available in the Marian format; we are working on converting it to the Hugging Face format.

### Model Info

* Source language: English
* Target language: Afrikaans
* Data: HPLT v2.0 and OPUS parallel data
* Model architecture: Transformer-base
* Tokenizer: SentencePiece (Unigram); a short loading sketch follows this list

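As a quick way to inspect the shared vocabulary, the SentencePiece model can be loaded with the `sentencepiece` Python package. This is only a sketch, assuming the vocabulary file `model.en-af.spm` from this repository has been downloaded to the working directory; the example sentence is made up.

```python
# Sketch: inspect the SentencePiece (Unigram) vocabulary shipped with this repo.
# Assumes `pip install sentencepiece` and model.en-af.spm in the working directory.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="model.en-af.spm")

text = "Hello, how are you?"
pieces = sp.encode(text, out_type=str)  # subword pieces
ids = sp.encode(text, out_type=int)     # corresponding vocabulary ids

print(pieces)
print(ids)
print(sp.decode(ids))  # round-trip back to the original string
```
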
You can check out our [paper](https://arxiv.org/abs/2503.10267), [GitHub repository](https://github.com/hplt-project/HPLT-MT-Models/tree/main/v2.0), or [website](https://hplt-project.org) for more details.

### Usage

The model has been trained with [MarianNMT](https://github.com/marian-nmt/marian) and the weights are in the Marian format.

#### Using Marian

To run inference with MarianNMT, refer to the [Inference/Decoding/Translation](https://github.com/hplt-project/HPLT-MT-Models/tree/main/v1.0#inferencedecodingtranslation) section of our GitHub repository. You will need the model file `model.npz.best-chrf.npz` and the vocabulary file `model.en-af.spm` from this repository.
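
As a rough illustration of that workflow, the sketch below drives a locally built `marian-decoder` from Python. It is not the official recipe: it assumes `marian-decoder` is on the PATH and that any preprocessing required by the linked instructions has already been applied; the beam size is an arbitrary choice.

```python
# Illustrative sketch only: call a locally built marian-decoder on one sentence.
# Assumptions (not from this README): marian-decoder is on PATH and the input
# has already been preprocessed as the linked GitHub instructions describe.
import subprocess

MODEL = "model.npz.best-chrf.npz"  # model weights from this repository
VOCAB = "model.en-af.spm"          # shared SentencePiece vocabulary

result = subprocess.run(
    [
        "marian-decoder",
        "-m", MODEL,          # translation model
        "-v", VOCAB, VOCAB,   # source and target vocabularies
        "--beam-size", "6",   # arbitrary choice for this sketch
    ],
    input="Hello, how are you?\n",
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # Afrikaans translation on stdout
```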

#### Using transformers

We are working on this.
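
Once a converted checkpoint is published, loading it should look like the standard Marian workflow in `transformers`. The snippet below is purely illustrative; `HPLT/placeholder-en-af` is a made-up model ID and will not resolve until the conversion is released.

```python
# Illustrative only: standard transformers usage for a Marian-style checkpoint.
# "HPLT/placeholder-en-af" is a placeholder, not a real repository name.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "HPLT/placeholder-en-af"  # replace with the real repo once released
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```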

### Acknowledgements

This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10052546].

### Citation

If you find this model useful, please cite the following paper:
```bibtex
@article{hpltv2,
  title={An Expanded Massive Multilingual Dataset for High-Performance Language Technologies},
  author={Laurie Burchell and Ona de Gibert and Nikolay Arefyev and Mikko Aulamo and Marta Bañón and Pinzhen Chen and Mariia Fedorova and Liane Guillou and Barry Haddow and Jan Hajič and Jindřich Helcl and Erik Henriksson and Mateusz Klimaszewski and Ville Komulainen and Andrey Kutuzov and Joona Kytöniemi and Veronika Laippala and Petter Mæhlum and Bhavitvya Malik and Farrokh Mehryary and Vladislav Mikhailov and Nikita Moghe and Amanda Myntti and Dayyán O'Brien and Stephan Oepen and Proyag Pal and Jousia Piha and Sampo Pyysalo and Gema Ramírez-Sánchez and David Samuel and Pavel Stepachev and Jörg Tiedemann and Dušan Variš and Tereza Vojtěchová and Jaume Zaragoza-Bernabeu},
  journal={arXiv preprint arXiv:2503.10267},
  year={2025},
  url={https://arxiv.org/abs/2503.10267},
}
```

model.npz.best-chrf.npz ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0d35f74ec3a6370bd6368a9e8054593b6bfb7a140f2c5453b189130f6a88fade
size 307935566