Constantin Orasan committed on
Commit
a514775
1 Parent(s): 538dc0d

Updated the app and the models

Files changed (3)
  1. app.py +16 -3
  2. bpe-ECB.model +0 -0
  3. bpe-EMEA.model +0 -0
app.py CHANGED
@@ -5,7 +5,11 @@ examples = [
     "Hello, world!",
     "European Central bank has announced cuts.",
     "This document is a summary of the European Public Assessment Report (EPAR).",
-    "En el presente documento se resume el Informe Público Europeo de Evaluación (EPAR)."]
+    "En el presente documento se resume el Informe Público Europeo de Evaluación (EPAR).",
+    "Solution for injection",
+    "How is Abilify used?",
+    "¿Para qué se utiliza Abilify?",
+    "Tratado de la Unión Europea y Tratado de Funcionamiento de la Unión Europea"]
 
 
 def greet(sentence):
@@ -25,9 +29,18 @@ def greet(sentence):
             "</div>")
 
 
+description = """
+Demo for SentencePiece. The model is trained on ECB and EMEA datasets in order to see the differences in tokenization.
+The ECB dataset contains financial news articles, while the EMEA dataset contains medical articles.
+The texts included in the training are in English and Spanish, for this reason the tokenisation will work best for these languages.
+You can try some other languages and see how the tokenisation works. However, make sure you use only Latin characters.
+The model did not see any non-Latin characters during training, so the results for languages that do not use Latin characters will be unpredictable.
+Both variants are trained with 5000 vocab size.
+"""
+
 demo = gr.Interface(fn=greet, inputs="text", outputs="html",
-                    examples=examples, title="SentencePiece BPE",
-                    description="Demo for SentencePiece BPE.",
+                    examples=examples, title="SentencePiece",
+                    description=description,
                     cache_examples="lazy",
                     concurrency_limit=30,
                     css=".output {font-size: 150%;}")
bpe-ECB.model CHANGED
Binary files a/bpe-ECB.model and b/bpe-ECB.model differ
 
bpe-EMEA.model CHANGED
Binary files a/bpe-EMEA.model and b/bpe-EMEA.model differ