corasan/tokenisation
Constantin Orasan committed on Nov 14, 2024

Commit a514775 (1 parent: 538dc0d)

Updated the app and the models

Files changed (3)

app.py CHANGED

@@ -5,7 +5,11 @@ examples = [
     "Hello, world!",
     "European Central bank has announced cuts.",
     "This document is a summary of the European Public Assessment Report (EPAR).",
-    "En el presente documento se resume el Informe Público Europeo de Evaluación (EPAR)."]
+    "En el presente documento se resume el Informe Público Europeo de Evaluación (EPAR).",
+    "Solution for injection",
+    "How is Abilify used?",
+    "¿Para qué se utiliza Abilify?",
+    "Tratado de la Unión Europea y Tratado de Funcionamiento de la Unión Europea"]
 def greet(sentence):
@@ -25,9 +29,18 @@ def greet(sentence):
             "</div>")
+description = """
+Demo for SentencePiece. The model is trained on ECB and EMEA datasets in order to see the differences in tokenization.
+The ECB dataset contains financial news articles, while the EMEA dataset contains medical articles.
+The texts included in the training are in English and Spanish, for this reason the tokenisation will work best for these languages.
+You can try some other languages and see how the tokenisation works. However, make sure you use only Latin characters.
+The model did not see any non-Latin characters during training, so the results for languages that do not use Latin characters will be unpredictable.
+Both variants are trained with 5000 vocab size.
+"""
 demo = gr.Interface(fn=greet, inputs="text", outputs="html",
-                    examples=examples, title="SentencePiece BPE",
-                    description="Demo for SentencePiece BPE.",
+                    examples=examples, title="SentencePiece",
+                    description=description,
                     cache_examples="lazy",
                     concurrency_limit=30,
                     css=".output {font-size: 150%;}")
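The new description says the two models tokenise the same input differently and that only Latin characters should be used. The body of `greet` is not visible in this diff, so the following is only a minimal sketch of how such a comparison could be done outside the app. The helper names (`load_encoders`, `uses_only_latin_letters`, `compare_tokenisations`) are hypothetical, and the Latin check is a rough heuristic based on Unicode character names, not something taken from the commit.

```python
import unicodedata
from typing import Callable, Dict, List

# An encoder maps a sentence to a list of string pieces.
Encoder = Callable[[str], List[str]]


def load_encoders(model_paths: Dict[str, str]) -> Dict[str, Encoder]:
    """Load one SentencePiece model per label and return piece-level encoders."""
    # Imported lazily so the other helpers work without the library installed.
    import sentencepiece as spm

    encoders: Dict[str, Encoder] = {}
    for label, path in model_paths.items():
        processor = spm.SentencePieceProcessor(model_file=path)
        # Bind the processor as a default argument to avoid late binding in the loop.
        encoders[label] = lambda text, p=processor: p.encode(text, out_type=str)
    return encoders


def uses_only_latin_letters(text: str) -> bool:
    """Heuristic: True if every alphabetic character has LATIN in its Unicode name."""
    return all("LATIN" in unicodedata.name(ch, "") for ch in text if ch.isalpha())


def compare_tokenisations(sentence: str, encoders: Dict[str, Encoder]) -> Dict[str, List[str]]:
    """Tokenise the same sentence with every encoder, keyed by label."""
    return {label: encode(sentence) for label, encode in encoders.items()}


# Possible usage, assuming both model files from this commit are in the
# working directory:
#
#   encoders = load_encoders({"ECB": "bpe-ECB.model", "EMEA": "bpe-EMEA.model"})
#   for label, pieces in compare_tokenisations("Solution for injection", encoders).items():
#       print(label, len(pieces), pieces)
```

Passing the encoders in as plain callables keeps the comparison independent of SentencePiece itself, so it can be tried with any tokeniser that returns a list of strings.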

bpe-ECB.model CHANGED

Binary files a/bpe-ECB.model and b/bpe-ECB.model differ

bpe-EMEA.model CHANGED

Binary files a/bpe-EMEA.model and b/bpe-EMEA.model differ
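
The training script for the two binary models is not part of this commit. As a hedged sketch only, models like these could be produced with `SentencePieceTrainer.train`. The vocabulary size of 5000 comes from the description in `app.py`. The BPE model type is an assumption inferred from the file names and the previous title, and the corpus paths and the `trainer_kwargs` and `train_model` helpers are hypothetical.

```python
from typing import Any, Dict


def trainer_kwargs(corpus_path: str, model_prefix: str, vocab_size: int = 5000) -> Dict[str, Any]:
    """Build keyword arguments for SentencePieceTrainer.train."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # 5000 per the app description
        "model_type": "bpe",           # assumption: inferred from the file names
    }


def train_model(corpus_path: str, model_prefix: str) -> None:
    """Train one SentencePiece model; requires the sentencepiece package."""
    # Imported lazily so trainer_kwargs can be inspected without the library.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**trainer_kwargs(corpus_path, model_prefix))


# Possible usage with hypothetical corpus files:
#
#   train_model("ecb.en-es.txt", "bpe-ECB")
#   train_model("emea.en-es.txt", "bpe-EMEA")
```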