# Overview

This tutorial demonstrates how to train, evaluate, and test three types of models for Question-Answering:
1. BERT-like models for Extractive Question-Answering
2. Sequence-to-Sequence (S2S) models for Generative Question-Answering (e.g., T5/BART-like)
3. GPT-like models for Generative Question-Answering

## Task Description

- Given a context and a natural language query, we want to generate an answer for the query.
- Depending on how the answer is generated, the task can be broadly divided into two types:
    1. Extractive Question Answering
    2. Generative Question Answering


### Extractive Question-Answering with BERT-like models

Given a question and a context, both in natural language, predict the span within the context that answers the question. The span is identified by its start and end positions.
For every token in the input, the model predicts:
- the likelihood that this token is the start of the span
- the likelihood that this token is the end of the span

We are using a BERT encoder with 2 span prediction heads for predicting start and end position of the answer. The span predictions are token classifiers consisting of a single linear layer.
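
To make the span selection concrete, here is a minimal sketch of how a start/end pair could be picked once per-token start and end scores are available. It is an illustration only, not NeMo's decoding code: the function name `best_span`, the `max_answer_len` cap, and the toy scores are all assumptions.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float], max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices that maximize start score + end score."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # the end may not precede the start, and the span length is capped
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# toy scores (made up for illustration), one value per token
tokens = ["the", "eiffel", "tower", "is", "in", "paris", "."]
start_scores = [0.1, 0.2, 0.1, 0.0, 0.3, 4.0, 0.1]
end_scores = [0.0, 0.1, 0.3, 0.1, 0.2, 3.5, 0.4]

s, e = best_span(start_scores, end_scores)
print(tokens[s : e + 1])
```

Scoring a span as the sum of its start and end scores is a common choice for extractive QA; a production decoder would typically also keep an n-best list and handle the "no answer" case.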

### Generative Question-Answering with S2S and GPT-like models

Given a question and a context, both in natural language, generate an answer for the question. Unlike the BERT-like models, there is no constraint that the answer should be a span within the context.
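
The difference between the two task types can be checked mechanically: an extractive answer is always a substring of the context, while a generative answer may not be. The helper below is a hypothetical sketch (the name `find_answer_span` and the example strings are not from NeMo) that returns character offsets when the answer is a span and `None` otherwise.

```python
from typing import Optional, Tuple


def find_answer_span(context: str, answer: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the answer in the context, or None."""
    start = context.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


context = "The Eiffel Tower is located in Paris and was completed in 1889."
extractive_answer = "Paris"
generative_answer = "It is located in Paris, France."

# the extractive answer is a span of the context; the generative one is not
print(find_answer_span(context, extractive_answer))
print(find_answer_span(context, generative_answer))
```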

# Installing NeMo

You can run this notebook either locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run the cell below to set up dependencies.

In [None]:
BRANCH = 'r1.17.0'

In [None]:
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

# Imports and constants

In [None]:
import os
import wget
import gc

import pytorch_lightning as pl
from omegaconf import OmegaConf

from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel
from nemo.collections.nlp.models.question_answering.qa_gpt_model import GPTQAModel
from nemo.collections.nlp.models.question_answering.qa_s2s_model import S2SQAModel
from nemo.utils.exp_manager import exp_manager

pl.seed_everything(42)
gc.disable()

In [None]:
# set the following paths
DATA_DIR = "data_dir" # directory for storing datasets
WORK_DIR = "work_dir" # directory for storing trained models, logs, and additionally downloaded scripts

os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(WORK_DIR, exist_ok=True)

# Configuration

The model is defined in a config file that declares three important sections:
- **model**: All arguments related to the model: language model, span prediction, optimizer and schedulers, datasets, and any other related information
- **trainer**: Any argument to be passed to PyTorch Lightning
- **exp_manager**: All arguments used for setting up the experiment manager: target directory, name, logger information
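
As a rough picture of that layout, the sketch below mimics the three sections with a plain nested dictionary and applies a few overrides by dotted key. It is only a stand-in for illustration: the real config is loaded with OmegaConf and has many more fields, and the helper `set_by_path` is hypothetical.

```python
from typing import Any, Dict


def set_by_path(cfg: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set a nested value, e.g. set_by_path(cfg, "model.optim.lr", 3e-5)."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# illustrative stand-in for the three top-level sections; values are placeholders
config_sketch: Dict[str, Any] = {
    "model": {"language_model": {"pretrained_model_name": None}, "optim": {"lr": None}},
    "trainer": {"max_epochs": None},
    "exp_manager": {"exp_dir": None, "name": None},
}

set_by_path(config_sketch, "model.language_model.pretrained_model_name", "bert-base-uncased")
set_by_path(config_sketch, "model.optim.lr", 3e-5)
set_by_path(config_sketch, "trainer.max_epochs", 1)
set_by_path(config_sketch, "exp_manager.name", "QA-SQuAD2")

print(sorted(config_sketch))
```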

We will download the default config file provided at `NeMo/examples/nlp/question_answering/conf/qa_conf.yaml` and edit the values needed to train the different models.

In [None]:
# download the model's default configuration file 
config_dir = WORK_DIR + '/conf/'
os.makedirs(config_dir, exist_ok=True)
if not os.path.exists(config_dir + "qa_conf.yaml"):
 print('Downloading config file...')
 wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/conf/qa_conf.yaml', config_dir)
else:
 print('config file already exists')

In [None]:
# this will print the entire default config of the model
config_path = f'{WORK_DIR}/conf/qa_conf.yaml'
print(config_path)
config = OmegaConf.load(config_path)
print("Default Config - \n")
print(OmegaConf.to_yaml(config))

# Training and testing models on SQuAD v2.0

## Dataset

For this example, we are going to download the [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset to showcase training and inference. There are two versions of the dataset, SQuAD1.1 and SQuAD2.0. SQuAD1.1, the earlier version, contains 100,000+ question-answer pairs on 500+ articles. SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

To download both datasets, we use `NeMo/examples/nlp/question_answering/get_squad.py`.

In [None]:
# download get_squad.py script to download and preprocess the SQuAD data
os.makedirs(WORK_DIR, exist_ok=True)
if not os.path.exists(WORK_DIR + '/get_squad.py'):
 print('Downloading get_squad.py...')
 wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/get_squad.py', WORK_DIR)
else:
 print('get_squad.py already exists')

In [None]:
# download and preprocess the data
!python $WORK_DIR/get_squad.py --destDir $DATA_DIR

After the above cell runs, your data folder will contain a subfolder "squad" with the following four files for training and evaluation:

```
squad
│
├───v1.1
│   │ - train-v1.1.json
│   │ - dev-v1.1.json
│
└───v2.0
    │ - train-v2.0.json
    │ - dev-v2.0.json
```

In [None]:
!ls -LR {DATA_DIR}/squad
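
For orientation, the sketch below walks the nested SQuAD JSON layout (articles, then paragraphs, then question-answer entries) on a tiny in-memory sample and counts answerable and unanswerable questions. The sample contents are made up, and `count_questions` is a hypothetical helper, not part of NeMo.

```python
from typing import Any, Dict, Tuple

# a tiny in-memory sample in the SQuAD v2.0 JSON layout (contents are made up)
sample: Dict[str, Any] = {
    "version": "v2.0",
    "data": [
        {
            "title": "Eiffel_Tower",
            "paragraphs": [
                {
                    "context": "The Eiffel Tower is located in Paris.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where is the Eiffel Tower located?",
                            "answers": [{"text": "Paris", "answer_start": 31}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "Who painted the Eiffel Tower in 1700?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}


def count_questions(squad: Dict[str, Any]) -> Tuple[int, int]:
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # v1.1 files have no "is_impossible" key, so default to False
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


print(count_questions(sample))
```

To run the same count on the downloaded data, load one of the files listed above with `json.load` and pass the result to `count_questions`.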

## Set dataset config values

In [None]:
# if True, model will load features from cache if file is present, or
# create features and dump to cache file if not already present
config.model.dataset.use_cache = False

# indicates whether the dataset has unanswerable questions
config.model.dataset.version_2_with_negative = True

# indicates whether the dataset is of extractive nature or not
# if True, context spans/chunks that do not contain answer are treated as unanswerable 
config.model.dataset.check_if_answer_in_context = True

# set file paths for train, validation, and test datasets
config.model.train_ds.file = f"{DATA_DIR}/squad/v2.0/train-v2.0.json"
config.model.validation_ds.file = f"{DATA_DIR}/squad/v2.0/dev-v2.0.json"
config.model.test_ds.file = f"{DATA_DIR}/squad/v2.0/dev-v2.0.json"

# set batch sizes for train, validation, and test datasets
config.model.train_ds.batch_size = 8
config.model.validation_ds.batch_size = 8
config.model.test_ds.batch_size = 8

# set number of samples to be used from dataset. setting to -1 uses entire dataset
config.model.train_ds.num_samples = 5000
config.model.validation_ds.num_samples = 1000
config.model.test_ds.num_samples = 100
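
To illustrate what `check_if_answer_in_context` implies, the sketch below splits a long context into overlapping chunks and flags each chunk by whether it contains the answer text. This is a simplified picture under stated assumptions: it windows over whitespace-separated words rather than tokenizer output, and the names `chunk_and_label`, `max_words`, and `stride` are hypothetical, not NeMo parameters.

```python
from typing import List, Tuple


def chunk_and_label(context: str, answer: str, max_words: int = 8, stride: int = 4) -> List[Tuple[str, bool]]:
    """Split the context into overlapping word windows and flag windows containing the answer."""
    words = context.split()
    chunks = []
    start = 0
    while True:
        chunk = " ".join(words[start : start + max_words])
        # a chunk without the answer text would be treated as unanswerable
        chunks.append((chunk, answer in chunk))
        if start + max_words >= len(words):
            break
        start += stride
    return chunks


context = "The Eiffel Tower was completed in 1889 . It is located in Paris , the capital of France ."
for chunk, has_answer in chunk_and_label(context, "Paris"):
    print(has_answer, "|", chunk)
```

For generative datasets the answer often does not appear verbatim in any chunk, which is why this kind of check is only meaningful for extractive data.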

## Set trainer config values

In [None]:
config.trainer.max_epochs = 1
config.trainer.max_steps = -1 # takes precedence over max_epochs
config.trainer.precision = 16
# 0 for CPU, or a list of the GPUs to use, e.g. [0]
# this tutorial does not support multiple GPUs; if needed, please use
# NeMo/examples/nlp/question_answering/question_answering.py
config.trainer.devices = [0]
config.trainer.accelerator = "gpu"
config.trainer.strategy = "dp"
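
As a quick sanity check on these settings, the sketch below estimates how many training steps result from the sample count, batch size, and epoch/step limits. It assumes a single device, no gradient accumulation, a final partial batch that is kept, and exactly one training example per sample (long contexts that are split into several chunks would increase the count). The function `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_samples: int, batch_size: int, max_epochs: int, max_steps: int = -1) -> int:
    """Estimate the number of training steps under the assumptions stated above."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    epoch_limited = steps_per_epoch * max_epochs
    # max_steps = -1 means "no step limit", so the epoch count decides
    if max_steps < 0:
        return epoch_limited
    # otherwise training stops at whichever limit is reached first
    return min(max_steps, epoch_limited)


print(optimizer_steps(num_samples=5000, batch_size=8, max_epochs=1))                 # 625
print(optimizer_steps(num_samples=5000, batch_size=8, max_epochs=1, max_steps=100))  # 100
```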

## Set experiment manager config values

In [None]:
config.exp_manager.exp_dir = WORK_DIR
config.exp_manager.name = "QA-SQuAD2"
config.exp_manager.create_wandb_logger = False

## BERT model for SQuAD v2.0

### Set model config values

In [None]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = "bert-base-uncased"
config.model.tokenizer.tokenizer_name = "bert-base-uncased"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo"

config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 3e-5

### Create trainer and initialize model

In [None]:
trainer = pl.Trainer(**config.trainer)
model = BERTQAModel(config.model, trainer=trainer)

### Train, test, and save the model

In [None]:
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)

### Load the saved model and run inference

In [None]:
model = BERTQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
 devices=eval_device,
 accelerator=config.trainer.accelerator,
 precision=16,
 logger=False,
)

config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)
output_nbest_file = os.path.join(exp_dir, "output_nbest_file.json")
output_prediction_file = os.path.join(exp_dir, "output_prediction_file.json")

all_preds, all_nbest = model.inference(
 config.model.test_ds.file,
 output_prediction_file=output_prediction_file,
 output_nbest_file=output_nbest_file,
 num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
 print(all_preds[question_id])
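
One common way to score predictions like these against reference answers is SQuAD-style exact match and token-level F1. The sketch below is a stand-alone illustration of those two metrics, not NeMo's evaluation code; the function names and example strings are assumptions.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # two empty answers count as a match (the unanswerable case)
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Paris", "paris"))
print(f1_score("in Paris, France", "Paris"))
```

Exact match rewards only answers that are identical after normalization, while F1 gives partial credit for overlapping tokens, which is useful for generative models whose answers are phrased differently from the reference.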

## S2S BART model for SQuAD v2.0

### Set model config values

In [None]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = "facebook/bart-base"
config.model.tokenizer.tokenizer_name = "facebook/bart-base"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bart_squad_v2_0.nemo"

config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 5e-5

# remove vocab_file from the tokenizer config
config.model.tokenizer.vocab_file = None

### Create trainer and initialize model

In [None]:
# uncomment below line and run if you get an error while initializing tokenizer on Colab (reference: https://github.com/huggingface/transformers/issues/8690)
# !rm -r /root/.cache/huggingface/

trainer = pl.Trainer(**config.trainer)
model = S2SQAModel(config.model, trainer=trainer)

### Train, test, and save the model

In [None]:
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)

### Load the saved model and run inference

In [None]:
model = S2SQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
 devices=eval_device,
 accelerator=config.trainer.accelerator,
 precision=16,
 logger=False,
)

config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)
output_nbest_file = os.path.join(exp_dir, "output_nbest_file.json")
output_prediction_file = os.path.join(exp_dir, "output_prediction_file.json")

all_preds, all_nbest = model.inference(
 config.model.test_ds.file,
 output_prediction_file=output_prediction_file,
 output_nbest_file=output_nbest_file,
 num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
 print(all_preds[question_id])

## GPT2 model for SQuAD v2.0

### Set model config values

In [None]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = "gpt2"
config.model.tokenizer.tokenizer_name = "gpt2"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/gpt2_squad_v2_0.nemo"

config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 1e-4

### Create trainer and initialize model

In [None]:
# uncomment below line and run if you get an error while initializing tokenizer on Colab (reference: https://github.com/huggingface/transformers/issues/8690)
# !rm -r /root/.cache/huggingface/

trainer = pl.Trainer(**config.trainer)
model = GPTQAModel(config.model, trainer=trainer)

### Train, test, and save the model

In [None]:
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)

### Load the saved model and run inference

In [None]:
model = GPTQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
 devices=eval_device,
 accelerator=config.trainer.accelerator,
 precision=16,
 logger=False,
)

config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)
output_nbest_file = os.path.join(exp_dir, "output_nbest_file.json")
output_prediction_file = os.path.join(exp_dir, "output_prediction_file.json")

all_preds, all_nbest = model.inference(
 config.model.test_ds.file,
 output_prediction_file=output_prediction_file,
 output_nbest_file=output_nbest_file,
 num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
 print(all_preds[question_id])

# Training and testing models on MS-MARCO

## Dataset

### Downloading the data

MS-MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension, question answering, and passage ranking. MS-MARCO consists of 1,010,916 queries generated from real, anonymized Bing user queries. The contexts are extracted from real web documents and the answers are generated by humans.

Please agree to the Terms of Use at https://microsoft.github.io/msmarco/ before downloading the data.

The data can be downloaded at:
- https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz
- https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz

In [None]:
os.makedirs(os.path.join(DATA_DIR, "msmarco"), exist_ok=True)

!wget https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz -P $DATA_DIR/msmarco
!gunzip $DATA_DIR/msmarco/train_v2.1.json.gz

!wget https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz -P $DATA_DIR/msmarco
!gunzip $DATA_DIR/msmarco/dev_v2.1.json.gz

### Converting to SQuAD format

The script for converting the MS-MARCO dataset to SQuAD format can be found at `NeMo/examples/nlp/question_answering/convert_msmarco_to_squad_format.py`.

In [None]:
# download convert_msmarco_to_squad_format.py script to format the MS-MARCO data
os.makedirs(WORK_DIR, exist_ok=True)
if not os.path.exists(WORK_DIR + '/convert_msmarco_to_squad_format.py'):
 print('Downloading convert_msmarco_to_squad_format.py...')
 wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/convert_msmarco_to_squad_format.py', WORK_DIR)
else:
 print('convert_msmarco_to_squad_format.py already exists')

In [None]:
# we will exclude examples from MS-MARCO dataset that do not have a wellFormedAnswer using a utility script
# download remove_ms_marco_samples_without_wellFormedAnswers.py script to format the MS-MARCO data
os.makedirs(WORK_DIR, exist_ok=True)
if not os.path.exists(WORK_DIR + '/remove_ms_marco_samples_without_wellFormedAnswers.py'):
 print('Downloading remove_ms_marco_samples_without_wellFormedAnswers.py...')
 wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py', WORK_DIR)
else:
 print('remove_ms_marco_samples_without_wellFormedAnswers.py already exists')

In [None]:
!python $WORK_DIR/remove_ms_marco_samples_without_wellFormedAnswers.py --filename $DATA_DIR/msmarco/train_v2.1.json
!python $WORK_DIR/remove_ms_marco_samples_without_wellFormedAnswers.py --filename $DATA_DIR/msmarco/dev_v2.1.json
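
The kind of filtering performed above can be pictured with the sketch below. It assumes a column-oriented JSON layout (field name mapped to a dictionary of example index to value) in which a missing well-formed answer is stored as the string `"[]"`. Both the layout and the helper `keep_well_formed` are assumptions for illustration; the downloaded script may work differently.

```python
from typing import Any, Dict

# toy data in an ASSUMED column-oriented layout: field name -> {example index -> value}
toy: Dict[str, Dict[str, Any]] = {
    "query": {"0": "where is the eiffel tower", "1": "example query without a well-formed answer"},
    "answers": {"0": ["Paris"], "1": ["some answer"]},
    "wellFormedAnswers": {"0": ["The Eiffel Tower is in Paris."], "1": "[]"},
}


def keep_well_formed(data: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Keep only the examples whose wellFormedAnswers entry is a non-empty list."""
    keep = [
        idx
        for idx, wfa in data["wellFormedAnswers"].items()
        if isinstance(wfa, list) and len(wfa) > 0
    ]
    # apply the same selection to every field so the columns stay aligned
    return {field: {idx: values[idx] for idx in keep} for field, values in data.items()}


filtered = keep_well_formed(toy)
print(list(filtered["query"].values()))
```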

In [None]:
!(python $WORK_DIR/convert_msmarco_to_squad_format.py \
 --msmarco_train_input_filepath=$DATA_DIR/msmarco/train_v2.1.json \
 --msmarco_dev_input_filepath=$DATA_DIR/msmarco/dev_v2.1.json \
 --converted_train_save_path=$DATA_DIR/msmarco/msmarco-squad-format-train-v2.1.json \
 --converted_dev_save_path=$DATA_DIR/msmarco/msmarco-squad-format-dev-v2.1.json \
 --exclude_negative_samples=False \
 --keep_only_relevant_passages=False)
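
Conceptually, the conversion maps each MS-MARCO query, its passages, and its answers onto one SQuAD-style entry. The sketch below shows one possible mapping for a single record. It is not the downloaded script: the helper `to_squad_example`, the passage field names `passage_text` and `is_selected`, the `No Answer Present.` marker, and the `-1` placeholder for `answer_start` are all assumptions.

```python
from typing import Any, Dict, List


def to_squad_example(
    query_id: int,
    query: str,
    passages: List[Dict[str, Any]],
    answers: List[str],
    keep_only_relevant: bool = False,
) -> Dict[str, Any]:
    """Build one SQuAD-style article from a single MS-MARCO-like record."""
    if keep_only_relevant:
        passages = [p for p in passages if p.get("is_selected") == 1]
    context = " ".join(p["passage_text"] for p in passages)
    # ASSUMPTION: this marker string denotes an unanswerable query
    real_answers = [a for a in answers if a and a != "No Answer Present."]
    qa = {
        "id": str(query_id),
        "question": query,
        # generative answers need not occur in the context, so no span offset is available
        "answers": [{"text": a, "answer_start": -1} for a in real_answers],
        "is_impossible": not real_answers,
    }
    return {"title": "", "paragraphs": [{"context": context, "qas": [qa]}]}


passages = [
    {"is_selected": 0, "passage_text": "Mount Fuji is in Japan."},
    {"is_selected": 1, "passage_text": "The Eiffel Tower is in Paris."},
]
example = to_squad_example(7, "where is the eiffel tower", passages, ["Paris"], keep_only_relevant=True)
print(example["paragraphs"][0]["context"])
```

Because the answers are free-form, the converted data should be used with `check_if_answer_in_context = False`, as set in the dataset config for MS-MARCO.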

## Set dataset config values

In [None]:
# if True, model will load features from cache if file is present, or
# create features and dump to cache file if not already present
config.model.dataset.use_cache = False

# indicates whether the dataset has unanswerable questions
config.model.dataset.version_2_with_negative = True

# if True, context spans/chunks that do not contain answer are treated as unanswerable 
# should be False for MS-MARCO dataset, or other datasets of generative nature
config.model.dataset.check_if_answer_in_context = False

# set file paths for train, validation, and test datasets
config.model.train_ds.file = f"{DATA_DIR}/msmarco/msmarco-squad-format-train-v2.1.json"
config.model.validation_ds.file = f"{DATA_DIR}/msmarco/msmarco-squad-format-dev-v2.1.json"
config.model.test_ds.file = f"{DATA_DIR}/msmarco/msmarco-squad-format-dev-v2.1.json"

# set batch sizes for train, validation, and test datasets
config.model.train_ds.batch_size = 16
config.model.validation_ds.batch_size = 16
config.model.test_ds.batch_size = 16

# set number of samples to be used from dataset. setting to -1 uses entire dataset
config.model.train_ds.num_samples = 5000
config.model.validation_ds.num_samples = 1000
config.model.test_ds.num_samples = 100

## Set trainer config values

In [None]:
config.trainer.max_epochs = 1
config.trainer.max_steps = -1 # takes precedence over max_epochs
config.trainer.precision = 16
config.trainer.devices = [0] # 0 for CPU, or list of the GPUs to use e.g. [0, 1] or [0]
config.trainer.accelerator = "gpu"

## Set experiment manager config values

In [None]:
config.exp_manager.exp_dir = WORK_DIR
config.exp_manager.name = "QA-MSMARCO"
config.exp_manager.create_wandb_logger = False

## S2S BART model for MS-MARCO

### Set model config values

In [None]:
# set language model and tokenizer to be used
# tokenizer is derived from model if a tokenizer name is not provided
config.model.language_model.pretrained_model_name = "facebook/bart-base"
config.model.tokenizer.tokenizer_name = "facebook/bart-base"

# path where model will be saved
config.model.nemo_path = f"{WORK_DIR}/checkpoints/bart_msmarco_v2_0.nemo"

config.exp_manager.create_checkpoint_callback = True

config.model.optim.lr = 5e-5

### Create trainer and initialize model

In [None]:
trainer = pl.Trainer(**config.trainer)
model = S2SQAModel(config.model, trainer=trainer)

### Train, test, and save the model

In [None]:
trainer.fit(model)
trainer.test(model)

model.save_to(config.model.nemo_path)

### Load the saved model and run inference

In [None]:
model = S2SQAModel.restore_from(config.model.nemo_path)

eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1
model.trainer = pl.Trainer(
 devices=eval_device,
 accelerator=config.trainer.accelerator,
 precision=16,
 logger=False,
)

config.exp_manager.create_checkpoint_callback = False
exp_dir = exp_manager(model.trainer, config.exp_manager)
output_nbest_file = os.path.join(exp_dir, "output_nbest_file.json")
output_prediction_file = os.path.join(exp_dir, "output_prediction_file.json")

all_preds, all_nbest = model.inference(
 config.model.test_ds.file,
 output_prediction_file=output_prediction_file,
 output_nbest_file=output_nbest_file,
 num_samples=10, # setting to -1 will use all samples for inference
)

for question_id in all_preds:
 print(all_preds[question_id])