{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "tiIOhb7iVC3J" }, "source": [ "# Overview" ] }, { "cell_type": "markdown", "metadata": { "id": "PucJwfbhVC3L" }, "source": [ "This tutorial will demonstrate how to train, evaluate, and test three types of models for Question-Answering -\n", "1. BERT-like models for Extractive Question-Answering\n", "2. Sequence-to-Sequence (S2S) models for Generative Question-Answering (ex. T5/BART-like)\n", "3. GPT-like models for Generative Question-Answering\n", "\n", "## Task Description\n", "\n", "- Given a context and a natural language query, we want to generate an answer for the query\n", "- Depending on how the answer is generated, the task can be broadly divided into two types:\n", " 1. Extractive Question Answering\n", " 2. Generative Question Answering\n", "\n", "\n", "### Extractive Question-Answering with BERT-like models\n", "\n", "Given a question and a context, both in natural language, predict the span within the context with a start and end position which indicates the answer to the question.\n", "For every word in our training dataset we’re going to predict:\n", "- likelihood this word is the start of the span \n", "- likelihood this word is the end of the span\n", "\n", "We are using a BERT encoder with 2 span prediction heads for predicting start and end position of the answer. The span predictions are token classifiers consisting of a single linear layer.\n", "\n", "### Generative Question-Answering with S2S and GPT-like models\n", "\n", "Given a question and a context, both in natural language, generate an answer for the question. Unlike the BERT-like models, there is no constraint that the answer should be a span within the context." ] }, { "cell_type": "markdown", "metadata": { "id": "IpX0w2PtVC3M" }, "source": [ "# Installing NeMo" ] }, { "cell_type": "markdown", "metadata": { "id": "72XWYFQYVC3M" }, "source": [ "You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.\n", "\n", "Instructions for setting up Colab are as follows:\n", "1. Open a new Python 3 notebook.\n", "2. Import this notebook from GitHub (File -> Upload Notebook -> \"GITHUB\" tab -> copy/paste GitHub URL)\n", "3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select \"GPU\" for hardware accelerator)\n", "4. Run the cell below to set up dependencies." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_xQBtr0KVC3M" }, "outputs": [], "source": [ "BRANCH = 'r1.17.0'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9R1D6W58VC3N" }, "outputs": [], "source": [ "!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]" ] }, { "cell_type": "markdown", "metadata": { "id": "fof5-57iVC3N" }, "source": [ "# Imports and constants" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KqKD-wReVC3O" }, "outputs": [], "source": [ "import os\n", "import wget\n", "import gc\n", "\n", "import pytorch_lightning as pl\n", "from omegaconf import OmegaConf\n", "\n", "from nemo.collections.nlp.models.question_answering.qa_bert_model import BERTQAModel\n", "from nemo.collections.nlp.models.question_answering.qa_gpt_model import GPTQAModel\n", "from nemo.collections.nlp.models.question_answering.qa_s2s_model import S2SQAModel\n", "from nemo.utils.exp_manager import exp_manager\n", "\n", "pl.seed_everything(42)\n", "gc.disable()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xhPr9Jf_VC3O" }, "outputs": [], "source": [ "# set the following paths\n", "DATA_DIR = \"data_dir\" # directory for storing datasets\n", "WORK_DIR = \"work_dir\" # directory for storing trained models, logs, additionally downloaded scripts\n", "\n", "os.makedirs(DATA_DIR, exist_ok=True)\n", "os.makedirs(WORK_DIR, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "dWymW8e0VC3O" }, "source": [ "# Configuration" ] }, { "cell_type": "markdown", "metadata": { "id": "0YhKTkuXVC3P" }, "source": [ "The model is defined in a config file which declares multiple important sections:\n", "- **model**: All arguments that will relate to the Model - language model, span prediction, optimizer and schedulers, datasets and any other related information\n", "- **trainer**: Any argument to be passed to PyTorch Lightning\n", "- **exp_manager**: All arguments used for setting up the experiment manager - target directory, name, logger information\n", "\n", "We will download the default config file provided at `NeMo/examples/nlp/question_answering/conf/qa_conf.yaml` and edit necessary values for training different models" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WOIWJqQ0VC3P" }, "outputs": [], "source": [ "# download the model's default configuration file \n", "config_dir = WORK_DIR + '/conf/'\n", "os.makedirs(config_dir, exist_ok=True)\n", "if not os.path.exists(config_dir + \"qa_conf.yaml\"):\n", " print('Downloading config file...')\n", " wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/conf/qa_conf.yaml', config_dir)\n", "else:\n", " print ('config file already exists')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cvD-gv-FVC3P" }, "outputs": [], "source": [ "# this will print the entire default config of the model\n", "config_path = f'{WORK_DIR}/conf/qa_conf.yaml'\n", "print(config_path)\n", "config = OmegaConf.load(config_path)\n", "print(\"Default Config - \\n\")\n", "print(OmegaConf.to_yaml(config))" ] }, { "cell_type": "markdown", "metadata": { "id": "E08e-ItPVC3P" }, "source": [ "# Training and testing models on SQuAD v2.0" ] }, { "cell_type": "markdown", "metadata": { "id": "xn022MsKVC3Q" }, "source": [ "## Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "c356CGL1VC3Q" }, "source": [ "For this example, we are going to download the 
[SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) dataset to showcase how to do training and inference. There are two versions of the dataset: SQuAD1.1 and SQuAD2.0. SQuAD1.1, the earlier version, contains 100,000+ question-answer pairs on 500+ articles. The SQuAD2.0 dataset combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. " ] }, { "cell_type": "markdown", "metadata": { "id": "Gaju1h_bVC3Q" }, "source": [ "To download both datasets, we use `NeMo/examples/nlp/question_answering/get_squad.py`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nb840_bZVC3Q" }, "outputs": [], "source": [ "# download get_squad.py script to download and preprocess the SQuAD data\n", "os.makedirs(WORK_DIR, exist_ok=True)\n", "if not os.path.exists(WORK_DIR + '/get_squad.py'):\n", " print('Downloading get_squad.py...')\n", " wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/get_squad.py', WORK_DIR)\n", "else:\n", " print ('get_squad.py already exists')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sOgY0tRzVC3Q" }, "outputs": [], "source": [ "# download and preprocess the data\n", "!python $WORK_DIR/get_squad.py --destDir $DATA_DIR" ] }, { "cell_type": "markdown", "metadata": { "id": "nprGkyvRVC3Q" }, "source": [ "After executing the above cell, your data folder will contain a subfolder \"squad\" with the following four files for training and evaluation:\n", "\n", "```\n", "squad \n", "│\n", "└───v1.1\n", "│ │ - train-v1.1.json\n", "│ │ - dev-v1.1.json\n", "│\n", "└───v2.0\n", " │ - train-v2.0.json\n", " │ - dev-v2.0.json\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GX0KWQXKVC3Q" }, "outputs": [], "source": [ "!ls -LR {DATA_DIR}/squad" ] }, { "cell_type": "markdown", "metadata": { "id": "RFVcvseOVC3R" }, "source": [ "## Set dataset config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Grb0EeRqVC3R" }, "outputs": [], "source": [ "# if True, model will load features from cache if file is present, or\n", "# create features and dump to cache file if not already present\n", "config.model.dataset.use_cache = False\n", "\n", "# indicates whether the dataset has unanswerable questions\n", "config.model.dataset.version_2_with_negative = True\n", "\n", "# indicates whether the dataset is of extractive nature or not\n", "# if True, context spans/chunks that do not contain the answer are treated as unanswerable\n", "config.model.dataset.check_if_answer_in_context = True\n", "\n", "# set file paths for train, validation, and test datasets\n", "config.model.train_ds.file = f\"{DATA_DIR}/squad/v2.0/train-v2.0.json\"\n", "config.model.validation_ds.file = f\"{DATA_DIR}/squad/v2.0/dev-v2.0.json\"\n", "config.model.test_ds.file = f\"{DATA_DIR}/squad/v2.0/dev-v2.0.json\"\n", "\n", "# set batch sizes for train, validation, and test datasets\n", "config.model.train_ds.batch_size = 8\n", "config.model.validation_ds.batch_size = 8\n", "config.model.test_ds.batch_size = 8\n", "\n", "# set number of samples to be used from dataset. 
setting to -1 uses entire dataset\n", "config.model.train_ds.num_samples = 5000\n", "config.model.validation_ds.num_samples = 1000\n", "config.model.test_ds.num_samples = 100" ] }, { "cell_type": "markdown", "metadata": { "id": "rFWF41VwVC3R" }, "source": [ "## Set trainer config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "42yif-GIVC3R" }, "outputs": [], "source": [ "config.trainer.max_epochs = 1\n", "config.trainer.max_steps = -1 # -1 disables the step limit; otherwise takes precedence over max_epochs\n", "config.trainer.precision = 16\n", "config.trainer.devices = [0] # list of GPU indices to use, e.g. [0]; this tutorial does not support multiple GPUs. If needed, please use NeMo/examples/nlp/question_answering/question_answering.py\n", "config.trainer.accelerator = \"gpu\"\n", "config.trainer.strategy = \"dp\"" ] }, { "cell_type": "markdown", "metadata": { "id": "EDQzMBlbVC3R" }, "source": [ "## Set experiment manager config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pxY4rnJBVC3R" }, "outputs": [], "source": [ "config.exp_manager.exp_dir = WORK_DIR\n", "config.exp_manager.name = \"QA-SQuAD2\"\n", "config.exp_manager.create_wandb_logger = False" ] }, { "cell_type": "markdown", "metadata": { "id": "N2_C8reNVC3R" }, "source": [ "## BERT model for SQuAD v2.0" ] }, { "cell_type": "markdown", "metadata": { "id": "4Mf-_rioVC3R" }, "source": [ "### Set model config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gtlGHzVJVC3R" }, "outputs": [], "source": [ "# set language model and tokenizer to be used\n", "# tokenizer is derived from model if a tokenizer name is not provided\n", "config.model.language_model.pretrained_model_name = \"bert-base-uncased\"\n", "config.model.tokenizer.tokenizer_name = \"bert-base-uncased\"\n", "\n", "# path where model will be saved\n", "config.model.nemo_path = f\"{WORK_DIR}/checkpoints/bert_squad_v2_0.nemo\"\n", "\n", "config.exp_manager.create_checkpoint_callback = True\n", "\n", "config.model.optim.lr = 3e-5" ] }, { "cell_type": "markdown", "metadata": { "id": "RaM7fe8rVC3R" }, "source": [ "### Create trainer and initialize model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ukLzGmy9VC3R" }, "outputs": [], "source": [ "trainer = pl.Trainer(**config.trainer)\n", "model = BERTQAModel(config.model, trainer=trainer)" ] }, { "cell_type": "markdown", "metadata": { "id": "qZIA69rlVC3R" }, "source": [ "### Train, test, and save the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "asutB9ZzVC3R" }, "outputs": [], "source": [ "trainer.fit(model)\n", "trainer.test(model)\n", "\n", "model.save_to(config.model.nemo_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "n5AIv0SEVC3S" }, "source": [ "### Load the saved model and run inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7k5kD6tvVC3S" }, "outputs": [], "source": [ "model = BERTQAModel.restore_from(config.model.nemo_path)\n", "\n", "eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1\n", "model.trainer = pl.Trainer(\n", " devices=eval_device,\n", " accelerator=config.trainer.accelerator,\n", " precision=16,\n", " logger=False,\n", ")\n", "\n", "config.exp_manager.create_checkpoint_callback = False\n", "exp_dir = exp_manager(model.trainer, config.exp_manager)\n", "output_nbest_file = os.path.join(exp_dir, \"output_nbest_file.json\")\n", "output_prediction_file = os.path.join(exp_dir, 
\"output_prediction_file.json\")\n", "\n", "all_preds, all_nbest = model.inference(\n", " config.model.test_ds.file,\n", " output_prediction_file=output_prediction_file,\n", " output_nbest_file=output_nbest_file,\n", " num_samples=10, # setting to -1 will use all samples for inference\n", ")\n", "\n", "for question_id in all_preds:\n", " print(all_preds[question_id])" ] }, { "cell_type": "markdown", "metadata": { "id": "zyh0SNiyVC3S" }, "source": [ "## S2S BART model for SQuAD v2.0" ] }, { "cell_type": "markdown", "metadata": { "id": "Sy9IYgVYVC3S" }, "source": [ "### Set model config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PKNmHKV5VC3S" }, "outputs": [], "source": [ "# set language model and tokenizer to be used\n", "# tokenizer is derived from model if a tokenizer name is not provided\n", "config.model.language_model.pretrained_model_name = \"facebook/bart-base\"\n", "config.model.tokenizer.tokenizer_name = \"facebook/bart-base\"\n", "\n", "# path where model will be saved\n", "config.model.nemo_path = f\"{WORK_DIR}/checkpoints/bart_squad_v2_0.nemo\"\n", "\n", "config.exp_manager.create_checkpoint_callback = True\n", "\n", "config.model.optim.lr = 5e-5\n", "\n", "#remove vocab_file from gpt model\n", "config.model.tokenizer.vocab_file = None" ] }, { "cell_type": "markdown", "metadata": { "id": "S_0glS4yVC3S" }, "source": [ "### Create trainer and initialize model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8jWyHY1oVC3S" }, "outputs": [], "source": [ "# uncomment below line and run if you get an error while initializing tokenizer on Colab (reference: https://github.com/huggingface/transformers/issues/8690)\n", "# !rm -r /root/.cache/huggingface/\n", "\n", "trainer = pl.Trainer(**config.trainer)\n", "model = S2SQAModel(config.model, trainer=trainer)" ] }, { "cell_type": "markdown", "metadata": { "id": "xg-j39b4VC3S" }, "source": [ "### Train, test, and save the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ocsf0EBDVC3S" }, "outputs": [], "source": [ "trainer.fit(model)\n", "trainer.test(model)\n", "\n", "model.save_to(config.model.nemo_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "Vs3pl0VMVC3S" }, "source": [ "### Load the saved model and run inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NoW6_GO_VC3S" }, "outputs": [], "source": [ "model = S2SQAModel.restore_from(config.model.nemo_path)\n", "\n", "eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1\n", "model.trainer = pl.Trainer(\n", " devices=eval_device,\n", " accelerator=config.trainer.accelerator,\n", " precision=16,\n", " logger=False,\n", ")\n", "\n", "config.exp_manager.create_checkpoint_callback = False\n", "exp_dir = exp_manager(model.trainer, config.exp_manager)\n", "output_nbest_file = os.path.join(exp_dir, \"output_nbest_file.json\")\n", "output_prediction_file = os.path.join(exp_dir, \"output_prediction_file.json\")\n", "\n", "all_preds, all_nbest = model.inference(\n", " config.model.test_ds.file,\n", " output_prediction_file=output_prediction_file,\n", " output_nbest_file=output_nbest_file,\n", " num_samples=10, # setting to -1 will use all samples for inference\n", ")\n", "\n", "for question_id in all_preds:\n", " print(all_preds[question_id])" ] }, { "cell_type": "markdown", "metadata": { "id": "a7-iInbPVC3S" }, "source": [ "## GPT2 model for SQuAD v2.0" ] }, { "cell_type": "markdown", "metadata": { "id": "VaIC0l2aVC3S" }, 
"source": [ "### Set model config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5j6SVk6fVC3S" }, "outputs": [], "source": [ "# set language model and tokenizer to be used\n", "# tokenizer is derived from model if a tokenizer name is not provided\n", "config.model.language_model.pretrained_model_name = \"gpt2\"\n", "config.model.tokenizer.tokenizer_name = \"gpt2\"\n", "\n", "# path where model will be saved\n", "config.model.nemo_path = f\"{WORK_DIR}/checkpoints/gpt2_squad_v2_0.nemo\"\n", "\n", "config.exp_manager.create_checkpoint_callback = True\n", "\n", "config.model.optim.lr = 1e-4" ] }, { "cell_type": "markdown", "metadata": { "id": "rWhhEuvzVC3S" }, "source": [ "### Create trainer and initialize model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vBtP3ukDVC3S" }, "outputs": [], "source": [ "# uncomment below line and run if you get an error while initializing tokenizer on Colab (reference: https://github.com/huggingface/transformers/issues/8690)\n", "# !rm -r /root/.cache/huggingface/\n", "\n", "trainer = pl.Trainer(**config.trainer)\n", "model = GPTQAModel(config.model, trainer=trainer)" ] }, { "cell_type": "markdown", "metadata": { "id": "EApFrJh8VC3T" }, "source": [ "### Train, test, and save the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zYo2JDdOVC3T" }, "outputs": [], "source": [ "trainer.fit(model)\n", "trainer.test(model)\n", "\n", "model.save_to(config.model.nemo_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "6aNEt06fVC3T" }, "source": [ "### Load the saved model and run inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ioLT4DVbVC3T" }, "outputs": [], "source": [ "model = GPTQAModel.restore_from(config.model.nemo_path)\n", "\n", "eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1\n", "model.trainer = pl.Trainer(\n", " devices=eval_device,\n", " accelerator=config.trainer.accelerator,\n", " precision=16,\n", " logger=False,\n", ")\n", "\n", "config.exp_manager.create_checkpoint_callback = False\n", "exp_dir = exp_manager(model.trainer, config.exp_manager)\n", "output_nbest_file = os.path.join(exp_dir, \"output_nbest_file.json\")\n", "output_prediction_file = os.path.join(exp_dir, \"output_prediction_file.json\")\n", "\n", "all_preds, all_nbest = model.inference(\n", " config.model.test_ds.file,\n", " output_prediction_file=output_prediction_file,\n", " output_nbest_file=output_nbest_file,\n", " num_samples=10, # setting to -1 will use all samples for inference\n", ")\n", "\n", "for question_id in all_preds:\n", " print(all_preds[question_id])" ] }, { "cell_type": "markdown", "metadata": { "id": "hTWOlD9AVC3T" }, "source": [ "# Training and testing models on MS-MARCO" ] }, { "cell_type": "markdown", "metadata": { "id": "lZWsMwnGVC3T" }, "source": [ "## Dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "pRUAwgAbVC3T" }, "source": [ "### Downloading the data" ] }, { "cell_type": "markdown", "metadata": { "id": "qz3DO9JGVC3T" }, "source": [ "MS-MARCO(Microsoft Machine Reading Comprehension) is a large scale dataset focused on machine reading comprehension, question answering, and passage ranking. MS-MARCO consists of 1,010,916 queries generated from real, anonymized Bing user queries. 
The contexts are extracted from real web documents and the answers are generated by humans.\n", "\n", "Please agree to the Terms of Use at https://microsoft.github.io/msmarco/ before downloading the data\n", "\n", "The data can be downloaded at:\n", "- https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz\n", "- https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Fm5MzZ91inP5" }, "outputs": [], "source": [ "os.makedirs(os.path.join(DATA_DIR, \"msmarco\"), exist_ok=True)\n", "\n", "!wget https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz -P $DATA_DIR/msmarco\n", "!gunzip $DATA_DIR/msmarco/train_v2.1.json.gz\n", "\n", "!wget https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz -P $DATA_DIR/msmarco\n", "!gunzip $DATA_DIR/msmarco/dev_v2.1.json.gz" ] }, { "cell_type": "markdown", "metadata": { "id": "nDmFHzBtVC3T" }, "source": [ "### Converting to SQuAD format\n", "\n", "The script for converting MS-MARCO dataset to SQuAD can be found at `NeMo/examples/nlp/question_answering/convert_msmarco_to_squad_format.py`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tJtNIzZQVC3T" }, "outputs": [], "source": [ "# download convert_msmarco_to_squad_format.py script to format the MS-MARCO data\n", "os.makedirs(WORK_DIR, exist_ok=True)\n", "if not os.path.exists(WORK_DIR + '/convert_msmarco_to_squad_format.py'):\n", " print('Downloading convert_msmarco_to_squad_format.py...')\n", " wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/question_answering/convert_msmarco_to_squad_format.py', WORK_DIR)\n", "else:\n", " print ('convert_msmarco_to_squad_format.py already exists')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Io_esJPSuBcW" }, "outputs": [], "source": [ "# we will exclude examples from MS-MARCO dataset that do not have a wellFormedAnswer using a utility script\n", "# download remove_ms_marco_samples_without_wellFormedAnswers.py script to format the MS-MARCO data\n", "os.makedirs(WORK_DIR, exist_ok=True)\n", "if not os.path.exists(WORK_DIR + '/remove_ms_marco_samples_without_wellFormedAnswers.py'):\n", " print('Downloading remove_ms_marco_samples_without_wellFormedAnswers.py...')\n", " wget.download(f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/examples/nlp/dialogue/remove_ms_marco_samples_without_wellFormedAnswers.py', WORK_DIR)\n", "else:\n", " print ('remove_ms_marco_samples_without_wellFormedAnswers.py already exists')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cs_CXkfXuYVQ" }, "outputs": [], "source": [ "!python $WORK_DIR/remove_ms_marco_samples_without_wellFormedAnswers.py --filename $DATA_DIR/msmarco/train_v2.1.json\n", "!python $WORK_DIR/remove_ms_marco_samples_without_wellFormedAnswers.py --filename $DATA_DIR/msmarco/dev_v2.1.json" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AUAKI086VC3T" }, "outputs": [], "source": [ "!(python $WORK_DIR/convert_msmarco_to_squad_format.py \\\n", " --msmarco_train_input_filepath=$DATA_DIR/msmarco/train_v2.1.json \\\n", " --msmarco_dev_input_filepath=$DATA_DIR/msmarco/dev_v2.1.json \\\n", " --converted_train_save_path=$DATA_DIR/msmarco/msmarco-squad-format-train-v2.1.json \\\n", " --converted_dev_save_path=$DATA_DIR/msmarco/msmarco-squad-format-dev-v2.1.json \\\n", " --exclude_negative_samples=False \\\n", " --keep_only_relevant_passages=False)" ] }, { "cell_type": "markdown", 
"metadata": { "id": "AeHesaFcVC3T" }, "source": [ "## Set dataset config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rhx-_1X3VC3T" }, "outputs": [], "source": [ "# if True, model will load features from cache if file is present, or\n", "# create features and dump to cache file if not already present\n", "config.model.dataset.use_cache = False\n", "\n", "# indicates whether the dataset has unanswerable questions\n", "config.model.dataset.version_2_with_negative = True\n", "\n", "# if True, context spans/chunks that do not contain answer are treated as unanswerable \n", "# should be False for MS-MARCO dataset, or other datasets of generative nature\n", "config.model.dataset.check_if_answer_in_context = False\n", "\n", "# set file paths for train, validation, and test datasets\n", "config.model.train_ds.file = f\"{DATA_DIR}/msmarco/msmarco-squad-format-train-v2.1.json\"\n", "config.model.validation_ds.file = f\"{DATA_DIR}/msmarco/msmarco-squad-format-dev-v2.1.json\"\n", "config.model.test_ds.file = f\"{DATA_DIR}/msmarco/msmarco-squad-format-dev-v2.1.json\"\n", "\n", "# set batch sizes for train, validation, and test datasets\n", "config.model.train_ds.batch_size = 16\n", "config.model.validation_ds.batch_size = 16\n", "config.model.test_ds.batch_size = 16\n", "\n", "# set number of samples to be used from dataset. setting to -1 uses entire dataset\n", "config.model.train_ds.num_samples = 5000\n", "config.model.validation_ds.num_samples = 1000\n", "config.model.test_ds.num_samples = 100" ] }, { "cell_type": "markdown", "metadata": { "id": "X43k_EeqVC3T" }, "source": [ "## Set trainer config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HavpkQLPVC3U" }, "outputs": [], "source": [ "config.trainer.max_epochs = 1\n", "config.trainer.max_steps = -1 # takes precedence over max_epochs\n", "config.trainer.precision = 16\n", "config.trainer.devices = [0] # 0 for CPU, or list of the GPUs to use e.g. 
[0, 1] or [0]\n", "config.trainer.accelerator = \"gpu\"" ] }, { "cell_type": "markdown", "metadata": { "id": "R-_FIZE2VC3U" }, "source": [ "## Set experiment manager config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "10TT3okiVC3U" }, "outputs": [], "source": [ "config.exp_manager.exp_dir = WORK_DIR\n", "config.exp_manager.name = \"QA-MSMARCO\"\n", "config.exp_manager.create_wandb_logger=False" ] }, { "cell_type": "markdown", "metadata": { "id": "MKIq6YT-VC3U" }, "source": [ "## S2S BART model for MS-MARCO" ] }, { "cell_type": "markdown", "metadata": { "id": "tvf-QpYLVC3U" }, "source": [ "### Set model config values" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DDVZ1a5fVC3U" }, "outputs": [], "source": [ "# set language model and tokenizer to be used\n", "# tokenizer is derived from model if a tokenizer name is not provided\n", "config.model.language_model.pretrained_model_name = \"facebook/bart-base\"\n", "config.model.tokenizer.tokenizer_name = \"facebook/bart-base\"\n", "\n", "# path where model will be saved\n", "config.model.nemo_path = f\"{WORK_DIR}/checkpoints/bart_msmarco_v2_0.nemo\"\n", "\n", "config.exp_manager.create_checkpoint_callback = True\n", "\n", "config.model.optim.lr = 5e-5" ] }, { "cell_type": "markdown", "metadata": { "id": "3N75cdLRVC3U" }, "source": [ "### Create trainer and initialize model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Bv9UMkfxVC3U" }, "outputs": [], "source": [ "trainer = pl.Trainer(**config.trainer)\n", "model = S2SQAModel(config.model, trainer=trainer)" ] }, { "cell_type": "markdown", "metadata": { "id": "BhVuV9sWVC3U" }, "source": [ "### Train, test, and save the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1JeaJ_OgVC3U" }, "outputs": [], "source": [ "trainer.fit(model)\n", "trainer.test(model)\n", "\n", "model.save_to(config.model.nemo_path)" ] }, { "cell_type": "markdown", "metadata": { "id": "yj0dGexaVC3U" }, "source": [ "### Load the saved model and run inference" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "l1elN-WDVC3U" }, "outputs": [], "source": [ "model = S2SQAModel.restore_from(config.model.nemo_path)\n", "\n", "eval_device = [config.trainer.devices[0]] if isinstance(config.trainer.devices, list) else 1\n", "model.trainer = pl.Trainer(\n", " devices=eval_device,\n", " accelerator=config.trainer.accelerator,\n", " precision=16,\n", " logger=False,\n", ")\n", "\n", "config.exp_manager.create_checkpoint_callback = False\n", "exp_dir = exp_manager(model.trainer, config.exp_manager)\n", "output_nbest_file = os.path.join(exp_dir, \"output_nbest_file.json\")\n", "output_prediction_file = os.path.join(exp_dir, \"output_prediction_file.json\")\n", "\n", "all_preds, all_nbest = model.inference(\n", " config.model.test_ds.file,\n", " output_prediction_file=output_prediction_file,\n", " output_nbest_file=output_nbest_file,\n", " num_samples=10, # setting to -1 will use all samples for inference\n", ")\n", "\n", "for question_id in all_preds:\n", " print(all_preds[question_id])" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "Question_Answering.ipynb", "provenance": [] }, "gpuClass": "standard", "kernelspec": { "display_name": "Python 3.8.0 ('test_ptl_1.7')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.0" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "e987a19b1bc60996a600adb5d563aa4a4c022e7b31abb2e65c324714934e8ea9" } } }, "nbformat": 4, "nbformat_minor": 0 }