The translator app:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/652e9fba05767ec0022c65c0/ycGGxK4zDqsBWqMU9cRQZ.png)

# Model Name
German to English Translator

# Model Description
This model translates German text into English. It was trained using a sequence-to-sequence Transformer (Seq2SeqTransformer) architecture.

- **Developed by:** Neelima Monjusha Preeti
- **Model type:** Seq2SeqTransformer
- **Language(s):** Python
- **License:** MIT
- **Contact:** [email protected]
# Task Description
This app translates German to English. The input text is tokenized and passed through the encoder and decoder of a Seq2SeqTransformer trained for this task.
The output is the corresponding English text.
# Data Processing

The source and target languages are defined first, followed by tokenization. Tokenizers for German and English are initialized using spaCy.
The get_tokenizer function from torchtext is used to obtain a spaCy tokenizer for each language.
A function yield_tokens is defined to tokenize sentences from the data iterator for both the source and target languages.

Special symbols and indices:

Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX).
The special symbols are defined as ['<unk>', '<pad>', '<bos>', '<eos>'].

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using the build_vocab_from_iterator function.
It uses the tokenization function defined earlier to tokenize the data.
The vocabulary is built with a minimum frequency of 1 (so every token is included), with the special symbols inserted first.
For each language's vocabulary, the default index for unknown tokens (UNK_IDX) is set.
```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')


# Tokenize every sentence of the chosen language from the data iterator
def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}

    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])


# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    # Build the vocabulary from the tokenized training sentences
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

# Map any out-of-vocabulary token to <unk>
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
```
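As a quick sanity check, the built vocabularies can be queried directly; the tokens below are only illustrative, and the returned indices depend on the Multi30k training corpus:

```python
ids = vocab_transform[SRC_LANGUAGE](['ein', 'mann', 'läuft'])
print(ids)                                               # corpus-dependent indices
print(vocab_transform[SRC_LANGUAGE].lookup_tokens(ids))  # map the indices back to tokens
```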
# Model Architecture

For machine translation I used a Seq2SeqTransformer.
The class PositionalEncoding(nn.Module) adds positional encodings to the token embeddings, while the class TokenEmbedding(nn.Module) converts token indices into dense embeddings using an embedding layer.
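Both are standard building blocks; a minimal sketch, following the common PyTorch sequence-to-sequence tutorial, is shown below (the exact implementation in germantoenglish.py may differ in detail):

```python
import math
import torch
import torch.nn as nn
from torch import Tensor

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        # Precompute the sinusoidal position table once
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor) -> Tensor:
        # Add the positional encoding for the first seq_len positions
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor) -> Tensor:
        # Scale embeddings by sqrt(emb_size) as in "Attention Is All You Need"
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```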
The parameters defined and initialized for the model are:

- **num_encoder_layers:** number of layers in the encoder stack (3).
- **num_decoder_layers:** number of layers in the decoder stack (3).
- **emb_size:** dimensionality of the token embeddings (512).
- **nhead:** number of attention heads in the multi-head attention mechanism (512).
- **src_vocab_size:** vocabulary size of the source language.
- **tgt_vocab_size:** vocabulary size of the target language.
- **dim_feedforward:** dimensionality of the feedforward network (default 512).
- **dropout:** dropout probability (default 0.1).
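Under these hyperparameters, model creation typically looks like the following sketch. The constant names mirror the translate() snippet later in this README; the Xavier initialization and device handling are assumptions about the usual training setup, not necessarily the exact code in germantoenglish.py:

```python
import torch
import torch.nn as nn

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
EMB_SIZE = 512
NHEAD = 512            # number of attention heads (value as listed above)
FFN_HID_DIM = 512
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)

# Xavier-initialize all weight matrices before training (assumed)
for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)
```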
The loss function and optimizer are defined as follows:

```python
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```

Each batch is then passed through the encoder and decoder layers during training, as sketched below.
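A single training step usually looks like this sketch; train_dataloader and create_mask are assumed helper names, following the standard teacher-forcing training loop:

```python
# One training step per batch: teacher forcing with shifted target sequences
transformer.train()
for src, tgt in train_dataloader:
    src, tgt = src.to(DEVICE), tgt.to(DEVICE)
    tgt_input = tgt[:-1, :]                  # decoder input: all tokens except the last

    # Attention masks (causal mask for the decoder) and padding masks
    src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)

    logits = transformer(src, tgt_input, src_mask, tgt_mask,
                         src_padding_mask, tgt_padding_mask, src_padding_mask)

    optimizer.zero_grad()
    tgt_out = tgt[1:, :]                     # expected output: all tokens except the first
    loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
    loss.backward()
    optimizer.step()
```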
The helper functions and objects used for preprocessing are:

```python
sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}
```

These utility functions and transformations handle the preprocessing of the text data, including tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model.
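For reference, these helpers are commonly implemented roughly as follows; this is a sketch of the standard pattern, and the exact code in germantoenglish.py may differ:

```python
from typing import List

import torch
from torch.nn.utils.rnn import pad_sequence

# Compose several transforms (tokenize -> numericalize -> tensorize) into one callable
def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

# Add <bos>/<eos> indices and turn the id list into a tensor
def tensor_transform(token_ids: List[int]):
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# Raw string -> token list -> id list -> tensor, for each language
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],
                                               vocab_transform[ln],
                                               tensor_transform)

# Collate (src, tgt) sentence pairs into padded batch tensors
def collate_fn(batch):
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```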
The model is then trained as a Seq2SeqTransformer and evaluated with the function evaluate(model).
# Result Analysis
greedy_decode() takes the following parameters:

- **model:** the sequence-to-sequence transformer model.
- **src:** the source sequence tensor.
- **src_mask:** the mask for the source sequence.
- **max_len:** the maximum length of the output sequence.
- **start_symbol:** the index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor ys, which contains the complete translation. A sketch of the function follows this list.
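This is a sketch of the standard greedy decoding loop; generate_square_subsequent_mask is an assumed helper that builds the causal decoder mask, and encode, decode, and generator are assumed methods/attributes of the Seq2SeqTransformer:

```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)

    # Encode the source sentence once
    memory = model.encode(src, src_mask)
    # Start the target sequence with the start symbol (<bos>)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        # Pick the most probable next token (greedy choice)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()

        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```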
## Test Input

The function for translating German to English is translate():

```python
def translate(src_sentence: str):
    # Rebuild the model and load the trained weights
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE, NHEAD,
                               SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()

    # Tokenize and numericalize the source sentence
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)

    # Greedy decoding, then map indices back to tokens and strip special symbols
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```

This function first loads the saved model, then tokenizes the input sentence and runs greedy_decode to generate the translation, which it returns as a string.
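A quick local sanity check might look like this (the German sentence is only an illustrative input; the actual English output depends on the trained weights):

```python
print(translate("Eine Gruppe von Menschen steht vor einem Iglu ."))
```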
# Hugging Face Interface

To create the interface, gradio and torch are imported, together with the Seq2SeqTransformer class and the translate and greedy_decode functions from germantoenglish.py:

```python
import gradio as gr
import torch

from germantoenglish import Seq2SeqTransformer, translate, greedy_decode
```

The app takes a German line as input and shows the translated English text as output:

```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```

The app interface looks like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/652e9fba05767ec0022c65c0/NnXnhj0a6dE2qSjdpTVzC.png)
# Project Structure
```bash
|--- README.md
|
|--- germantoenglish.py   - the full code for processing, training, and evaluation
|
|--- app.py               - creates the app interface
|
|--- Modeltensors         - tensor file needed for loading the app
|
|--- requirements.txt     - packages and dataset that need to be installed for the app to work
|
|--- translate_model.pth  - the model file which is loaded by the app
```
# How to Run

```bash
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
cd GermanToEnglish
pip install -r requirements.txt
python app.py
```
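If the spaCy language models used by the tokenizers are not already available, they can be downloaded from Python before the first run (assuming spaCy itself comes in through requirements.txt):

```python
import spacy.cli

# Download the language models required by get_tokenizer('spacy', ...)
spacy.cli.download("de_core_news_sm")
spacy.cli.download("en_core_web_sm")
```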
# License
This project is licensed under the MIT License.

# Contributor
Neelima Monjusha Preeti - [email protected]

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish