The translator app:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/652e9fba05767ec0022c65c0/ycGGxK4zDqsBWqMU9cRQZ.png)

# Model Name
German to English Translator

# Model Description
This model translates German text into English. It was trained using a sequence-to-sequence Transformer (Seq2SeqTransformer) architecture.

- **Developed by:** Neelima Monjusha Preeti
- **Model type:** Seq2SeqTransformer
- **Language(s):** Python
- **License:** MIT
- **Contact:** [email protected]
# Task Description
This app translates German to English. The input text is tokenized and passed through the encoder and decoder of a Seq2SeqTransformer trained for this task.
The output is the corresponding English text.
# Data Processing

The source and target languages are defined first, followed by tokenization. Tokenizers for German and English are initialized using spaCy.
The get_tokenizer function from torchtext is used to obtain a spaCy tokenizer for each language.
A function yield_tokens is defined to tokenize sentences from the data iterator for both the source and target languages.

Special symbols and indices:

Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX).
The special symbols are defined as ['<unk>', '<pad>', '<bos>', '<eos>'].

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using the build_vocab_from_iterator function.
It uses the tokenization function defined earlier to tokenize the data.
The vocabulary is built with a minimum frequency of 1 (so every token is included), with the special symbols inserted first.
For each language's vocabulary, the default index for unknown tokens (UNK_IDX) is set.
```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# Tokenize every sentence of the chosen language from the data iterator
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])


# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Build the vocabulary from the tokenized training sentences
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Map any out-of-vocabulary token to <unk>
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
```
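As a quick sanity check, the built vocabularies can be queried directly; the tokens below are only illustrative, and the returned indices depend on the Multi30k training corpus:

```python
ids = vocab_transform[SRC_LANGUAGE](['ein', 'mann', 'läuft'])
print(ids)                                               # corpus-dependent indices
print(vocab_transform[SRC_LANGUAGE].lookup_tokens(ids))  # map the indices back to tokens
```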
# Model Architecture

For machine translation I used a Seq2SeqTransformer.
The class PositionalEncoding(nn.Module) adds positional encodings to the token embeddings, while the class TokenEmbedding(nn.Module) converts token indices into dense embeddings using an embedding layer.
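Both are standard building blocks; a minimal sketch, following the common PyTorch sequence-to-sequence tutorial, is shown below (the exact implementation in germantoenglish.py may differ in detail):

```python
import math
import torch
import torch.nn as nn
from torch import Tensor

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        # Precompute the sinusoidal position table once
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor) -> Tensor:
        # Add the positional encoding for the first seq_len positions
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor) -> Tensor:
        # Scale embeddings by sqrt(emb_size) as in "Attention Is All You Need"
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```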
The parameters defined and initialized for the model are:

- **num_encoder_layers:** number of layers in the encoder stack (3).
- **num_decoder_layers:** number of layers in the decoder stack (3).
- **emb_size:** dimensionality of the token embeddings (512).
- **nhead:** number of attention heads in the multi-head attention mechanism (512).
- **src_vocab_size:** vocabulary size of the source language.
- **tgt_vocab_size:** vocabulary size of the target language.
- **dim_feedforward:** dimensionality of the feedforward network (default 512).
- **dropout:** dropout probability (default 0.1).
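Under these hyperparameters, model creation typically looks like the following sketch. The constant names mirror the translate() snippet later in this README; the Xavier initialization and device handling are assumptions about the usual training setup, not necessarily the exact code in germantoenglish.py:

```python
import torch
import torch.nn as nn

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
EMB_SIZE = 512
NHEAD = 512            # number of attention heads (value as listed above)
FFN_HID_DIM = 512
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

# Xavier-initialize all weight matrices before training (assumed)
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)
```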
The loss function and optimizer are defined as follows:

```python
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```

Each batch is then passed through the encoder and decoder layers during training, as sketched below.
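A single training step usually looks like this sketch; train_dataloader and create_mask are assumed helper names, following the standard teacher-forcing training loop:

```python
# One training step per batch: teacher forcing with shifted target sequences
transformer.train()
for src, tgt in train_dataloader:
    src, tgt = src.to(DEVICE), tgt.to(DEVICE)
    tgt_input = tgt[:-1, :]                  # decoder input: all tokens except the last

    # Attention masks (causal mask for the decoder) and padding masks
    src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

    logits = transformer(src, tgt_input, src_mask, tgt_mask,
                         src_padding_mask, tgt_padding_mask, src_padding_mask)

    optimizer.zero_grad()
    tgt_out = tgt[1:, :]                     # expected output: all tokens except the first
    loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
    loss.backward()
    optimizer.step()
```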
The helper functions and objects used for preprocessing are:

```python
sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}
```

These utility functions and transformations handle the preprocessing of the text data, including tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model.
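For reference, these helpers are commonly implemented roughly as follows; this is a sketch of the standard pattern, and the exact code in germantoenglish.py may differ:

```python
from typing import List

import torch
from torch.nn.utils.rnn import pad_sequence

# Compose several transforms (tokenize -> numericalize -> tensorize) into one callable
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# Add <bos>/<eos> indices and turn the id list into a tensor
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# Raw string -> token list -> id list -> tensor, for each language
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],
                                               vocab_transform[ln],
                                               tensor_transform)

# Collate (src, tgt) sentence pairs into padded batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```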
The model is then trained as a Seq2SeqTransformer and evaluated with the function evaluate(model).
# Result Analysis
greedy_decode() takes the following parameters:

- **model:** the sequence-to-sequence transformer model.
- **src:** the source sequence tensor.
- **src_mask:** the mask for the source sequence.
- **max_len:** the maximum length of the output sequence.
- **start_symbol:** the index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor ys, which contains the complete translation. A sketch of the function follows this list.
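This is a sketch of the standard greedy decoding loop; generate_square_subsequent_mask is an assumed helper that builds the causal decoder mask, and encode, decode, and generator are assumed methods/attributes of the Seq2SeqTransformer:

```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    # Encode the source sentence once
    memory = model.encode(src, src_mask)
    # Start the target sequence with the start symbol (<bos>)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        # Pick the most probable next token (greedy choice)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```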
## Test Input

The function for translating German to English is translate():

```python
def translate(src_sentence: str):
    # Rebuild the model and load the trained weights
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD,
                               SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()

    # Tokenize and numericalize the source sentence
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)

    # Greedy decoding, then map indices back to tokens and strip special symbols
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```

This function first loads the saved model, then tokenizes the input sentence and runs greedy_decode to generate the translation, which it returns as a string.
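A quick local sanity check might look like this (the German sentence is only an illustrative input; the actual English output depends on the trained weights):

```python
print(translate("Eine Gruppe von Menschen steht vor einem Iglu ."))
```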
# Hugging Face Interface

To create the interface, gradio and torch are imported, together with the Seq2SeqTransformer class and the translate and greedy_decode functions from germantoenglish.py:

```python
import gradio as gr
import torch

from germantoenglish import Seq2SeqTransformer, translate, greedy_decode
```

The app takes a German line as input and shows the translated English text as output:

```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```

The app interface looks like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/652e9fba05767ec0022c65c0/NnXnhj0a6dE2qSjdpTVzC.png)
# Project Structure
```bash
|--- README.md
|
|--- germantoenglish.py   - the full code for processing, training, and evaluation
|
|--- app.py               - creates the app interface
|
|--- Modeltensors         - tensor file needed for loading the app
|
|--- requirements.txt     - packages and dataset that need to be installed for the app to work
|
|--- translate_model.pth  - the model file which is loaded by the app
```
# How to Run

```bash
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
cd GermanToEnglish
pip install -r requirements.txt
python app.py
```
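If the spaCy language models used by the tokenizers are not already available, they can be downloaded from Python before the first run (assuming spaCy itself comes in through requirements.txt):

```python
import spacy.cli

# Download the language models required by get_tokenizer('spacy', ...)
spacy.cli.download("de_core_news_sm")
spacy.cli.download("en_core_web_sm")
```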
# License
This project is licensed under the MIT License.

# Contributor
Neelima Monjusha Preeti - [email protected]

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish