---
app_file: app.py
pinned: false
license: mit
---

The translator app:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b2665fee3f66b2b0f7b765/nVheCVJjZiCK3cvof6x84.png)

# Model Name
German to English Translator

# Model Description
This model translates German to English. It was trained using a sequence-to-sequence Transformer (Seq2SeqTransformer).

- **Developed by:** Neelima Monjusha Preeti
- **Model type:** Seq2SeqTransformer
- **Language(s):** Python
- **License:** MIT
- **Contact:** [email protected]

# Task Description
This app translates German text to English. The input is tokenized and passed through the encoder and decoder of a Seq2SeqTransformer trained for this task, and the English translation is returned as output.

# Data Processing

First the source and target languages are defined, then tokenization is set up. Tokenizers for German and English are initialized from spaCy models; torchtext's get_tokenizer function is used to obtain the tokenizer for each language.
A function yield_tokens is defined to tokenize sentences from the data iterator for both the source and target languages.

Special symbols and indices:

Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX).
The special symbols are defined as ['<unk>', '<pad>', '<bos>', '<eos>'].

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using torchtext's build_vocab_from_iterator function, tokenizing the data with the function defined earlier.
The vocabulary is built with a minimum frequency of 1 (so every token is included), and the special symbols are added first.
For each language's vocabulary, the default index for unknown tokens is set to UNK_IDX.

```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))

    # Build the vocabulary, inserting the special symbols first
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Unknown tokens map to UNK_IDX by default
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
```

# Model Architecture

For machine translation I used a Seq2SeqTransformer.
The class PositionalEncoding(nn.Module) adds positional encodings to the token embeddings, while TokenEmbedding(nn.Module) converts token indices into dense embeddings using an embedding layer; a sketch of both is shown below.
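
Both classes are defined in germantoenglish.py rather than reproduced here; the following is a minimal sketch of what they typically look like, modeled on the standard PyTorch translation tutorial, so the repository's actual code may differ in details.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Adds fixed sinusoidal positional encodings to the token embeddings."""
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: torch.Tensor):
        # Add the (fixed) positional encoding for each position in the sequence
        return self.dropout(token_embedding
                            + self.pos_embedding[:token_embedding.size(0), :])


class TokenEmbedding(nn.Module):
    """Converts token indices into dense, scaled embedding vectors."""
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: torch.Tensor):
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```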

The parameters defined and initialized for the model are (an instantiation sketch follows the list):

- **num_encoder_layers:** number of layers in the encoder stack -- 3.
- **num_decoder_layers:** number of layers in the decoder stack -- 3.
- **emb_size:** dimensionality of the token embeddings -- 512.
- **nhead:** number of attention heads in the multi-head attention mechanism -- 512.
- **src_vocab_size:** vocabulary size of the source language.
- **tgt_vocab_size:** vocabulary size of the target language.
- **dim_feedforward:** dimensionality of the feedforward network (default 512).
- **dropout:** dropout probability (default 0.1).
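
Put together, the model is presumably created roughly like this; a sketch using the values listed above, with constant names taken from the translate() snippet later in this README (DEVICE is an assumed torch.device):

```python
import torch

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameter values as stated above
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
EMB_SIZE = 512
NHEAD = 512
FFN_HID_DIM = 512
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
transformer = transformer.to(DEVICE)
```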

The loss function and optimizer are defined as:

```python
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```

The input then flows through the model's encoder and decoder layers.

The helper functions and containers are:

```python
sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}
```

These utility functions and transformations handle the preprocessing of the text data: tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model. A sketch of them follows.
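
A minimal sketch of these helpers, assuming they follow the standard PyTorch translation tutorial this code appears to be based on (the actual definitions live in germantoenglish.py and may differ):

```python
from typing import List

import torch
from torch.nn.utils.rnn import pad_sequence


# Compose several transforms into a single callable
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func


# Wrap a list of token ids with <bos>/<eos> markers and convert it to a tensor
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))


# Per-language transform: raw string -> tokens -> ids -> tensor with BOS/EOS
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],
                                               vocab_transform[ln],
                                               tensor_transform)


# Collate raw (src, tgt) string pairs into padded batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))

    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```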

The model is then trained as a Seq2SeqTransformer and evaluated with the evaluate(model) function; a sketch of the training loop follows.
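
The training loop itself is not reproduced in this README. A plausible minimal sketch of one training epoch, again following the standard tutorial (create_mask is an assumed helper that builds the attention and padding masks; evaluate(model) would mirror this loop without the optimizer steps):

```python
from torch.utils.data import DataLoader

BATCH_SIZE = 128  # assumed value


def train_epoch(model, optimizer):
    model.train()
    losses, num_batches = 0.0, 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)

    for src, tgt in train_dataloader:
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]  # decoder input: target shifted right

        # create_mask is assumed to return causal and padding masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)

        optimizer.zero_grad()
        tgt_out = tgt[1:, :]  # prediction target: target shifted left
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()

        losses += loss.item()
        num_batches += 1

    return losses / num_batches
```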

# Result Analysis
greedy_decode() takes the following parameters:

- **model:** the sequence-to-sequence transformer model.
- **src:** the source sequence tensor.
- **src_mask:** the mask for the source sequence.
- **max_len:** the maximum length of the output sequence.
- **start_symbol:** the index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor ys, which contains the complete translation.
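
A minimal sketch consistent with these parameters, modeled on the standard tutorial (generate_square_subsequent_mask is an assumed helper, and encode/decode/generator are assumed methods of Seq2SeqTransformer):

```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    memory = model.encode(src, src_mask)  # run the encoder once
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)

    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])     # logits for the next token
        _, next_word = torch.max(prob, dim=1)  # greedy choice
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```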

## Test input

The function that translates German to English is translate():
```python
def translate(src_sentence: str):
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                               NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()

    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```
This function first loads the saved model, then tokenizes the input sentence and runs greedy_decode to produce the translated output, which it returns as a string.
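
For example, a hypothetical call (the sentence and output are illustrative, not taken from the repository):

```python
print(translate("Eine Gruppe von Menschen steht vor einem Iglu."))
# e.g. " A group of people stand in front of an igloo . "
```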

# Hugging Face Interface

To create the interface, gradio and torch are imported, along with the Seq2SeqTransformer class and the translate and greedy_decode functions from germantoenglish.py:
```python
import gradio as gr
import torch
from germantoenglish import Seq2SeqTransformer, translate, greedy_decode
```
The app takes a German sentence as input and displays the translated English text as output.
```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```
The app interface looks like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b2665fee3f66b2b0f7b765/J_Q4eqXiN7cNuhOM3NbjR.png)

# Project Structure
```
|---README.md            - project documentation
|
|---germantoenglish.py   - the full code for data processing, training, and evaluation
|
|---app.py               - creates the app interface
|
|---Modeltensors         - tensor file needed for loading the app
|
|---requirements.txt     - packages and dataset dependencies that must be installed for the app to work
|
|---translate_model.pth  - the model file loaded by the app
```

# How to Run

```bash
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
cd GermanToEnglish
pip install -r requirements.txt
python app.py
```

# License
This project is licensed under the MIT License.

# Contributor
Neelima Monjusha Preeti - [email protected]

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish