Numini

File size: 22,542 Bytes

bc565b0

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "12d87b30",
   "metadata": {},
   "source": [
    "# Load Data\n",
    "This notebook loads and preproceses all necessary data, namely the following.\n",
    "* OpenWebTextCorpus: for base DistilBERT model\n",
    "* SQuAD datasrt: for Q&A\n",
    "* Natural Questions (needs to be downloaded externally but is preprocessed here): for Q&A\n",
    "* HotPotQA: for Q&A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7c82d7fa",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm.auto import tqdm\n",
    "from datasets import load_dataset\n",
    "import os\n",
    "import pandas as pd\n",
    "import random"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1737f219",
   "metadata": {},
   "source": [
    "## Distilbert Data\n",
    "In the following, we download the english openwebtext dataset from huggingface (https://huggingface.co/datasets/openwebtext). The dataset is provided by Aaron Gokaslan and Vanya Cohen from Brown University (https://skylion007.github.io/OpenWebTextCorpus/).\n",
    "\n",
    "We first load the data, investigate the structure and write the dataset into files of each 10 000 texts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cce7623c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = load_dataset(\"openwebtext\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "678a5e86",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DatasetDict({\n",
       "    train: Dataset({\n",
       "        features: ['text'],\n",
       "        num_rows: 8013769\n",
       "    })\n",
       "})"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# we have a text-only training dataset with 8 million entries\n",
    "ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "b141bce7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create necessary folders\n",
    "os.mkdir('data')\n",
    "os.mkdir('data/original')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ca94f995",
   "metadata": {},
   "outputs": [],
   "source": [
    "# save text in chunks of 10000 samples\n",
    "text = []\n",
    "i = 0\n",
    "\n",
    "for sample in tqdm(ds['train']):\n",
    "    # replace all newlines\n",
    "    sample = sample['text'].replace('\\n','')\n",
    "    \n",
    "    # append cleaned sample to all texts\n",
    "    text.append(sample)\n",
    "    \n",
    "    # if we processed 10000 samples, write them to a file and start over\n",
    "    if len(text) == 10000:\n",
    "        with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "            f.write('\\n'.join(text))\n",
    "        text = []\n",
    "        i += 1 \n",
    "\n",
    "# write remaining samples to a file\n",
    "with open(f\"data/original/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "    f.write('\\n'.join(text))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f131dcfc",
   "metadata": {},
   "source": [
    "### Testing\n",
    "If we load the first file, we should get a file that is 10000 lines long and has one column\n",
    "\n",
    "As we do not preprocess the data in any way, but just write the read text into the file, this is all testing necessary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "df50af74",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"data/original/text_0.txt\", 'r', encoding='utf-8') as f:\n",
    "    lines = f.read().split('\\n')\n",
    "lines = pd.DataFrame(lines)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "8ddb0085",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Passed\n"
     ]
    }
   ],
   "source": [
    "assert lines.shape==(10000,1)\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a65b268",
   "metadata": {},
   "source": [
    "## SQuAD Data\n",
    "In the following, we download the SQuAD dataset from huggingface (https://huggingface.co/datasets/squad). It was initially provided by Rajpurkar et al. from Stanford University.\n",
    "\n",
    "We again load the dataset and store it in chunks of 1000 into files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6750ce6e",
   "metadata": {},
   "outputs": [],
   "source": [
    "dataset = load_dataset(\"squad\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65a7ee23",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.mkdir(\"data/training_squad\")\n",
    "os.mkdir(\"data/test_squad\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f6ebf63e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# we already have a training and test split. Each sample has an id, title, context, question and answers.\n",
    "dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f67ae448",
   "metadata": {},
   "outputs": [],
   "source": [
    "# answers are provided like that - we need to extract answer_end for the model\n",
    "dataset['train']['answers'][0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "101cd650",
   "metadata": {},
   "outputs": [],
   "source": [
    "# column contains the split (either train or validation), save_dir is the directory\n",
    "def save_samples(column, save_dir):\n",
    "    text = []\n",
    "    i = 0\n",
    "\n",
    "    for sample in tqdm(dataset[column]):\n",
    "        \n",
    "        # preprocess the context and question by removing the newlines\n",
    "        context = sample['context'].replace('\\n','')\n",
    "        question = sample['question'].replace('\\n','')\n",
    "\n",
    "        # get the answer as text and start character index\n",
    "        answer_text = sample['answers']['text'][0]\n",
    "        answer_start = str(sample['answers']['answer_start'][0])\n",
    "        \n",
    "        text.append([context, question, answer_text, answer_start])\n",
    "\n",
    "        # we choose chunks of 1000\n",
    "        if len(text) == 1000:\n",
    "            with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "                f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
    "            text = []\n",
    "            i += 1\n",
    "\n",
    "    # save remaining\n",
    "    with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "        f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
    "\n",
    "save_samples(\"train\", \"training_squad\")\n",
    "save_samples(\"validation\", \"test_squad\")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67044d13",
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    }
   },
   "source": [
    "### Testing\n",
    "If we load a file, we should get a file with 10000 lines and 4 columns\n",
    "\n",
    "Also, we want to assure the correct interval. Hence, the second test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "446281cf",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"data/training_squad/text_0.txt\", 'r', encoding='utf-8') as f:\n",
    "    lines = f.read().split('\\n')\n",
    "    \n",
    "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ccd5c650",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert lines.shape==(1000,4)\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c9e4b70",
   "metadata": {},
   "outputs": [],
   "source": [
    "# we assert that we have the right interval\n",
    "for ind, line in lines.iterrows():\n",
    "    sample = line\n",
    "    answer_start = int(sample['answer_start'])\n",
    "    assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02265ace",
   "metadata": {},
   "source": [
    "## Natural Questions Dataset\n",
    "* Download from https://ai.google.com/research/NaturalQuestions via gsutil (the one from huggingface has 134.92GB, the one from google cloud is in archives)\n",
    "* Use gunzip to get some samples - we then get `.jsonl`files\n",
    "* The dataset is a lot more messy, as it is just wikipedia articles with all web artifacts\n",
    "  * I cleaned the html tags\n",
    "  * Also I chose a random interval (containing the answer) from the dataset\n",
    "  * We can't send the whole text into the model anyways"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3bce0c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "paths = [str(x) for x in Path('data/natural_questions/v1.0/train/').glob('**/*.jsonl')]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9c58c00",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.mkdir(\"data/natural_questions_train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ed7ba6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# clean html tags\n",
    "CLEANR = re.compile('<.+?>')\n",
    "# clean multiple spaces\n",
    "CLEANMULTSPACE = re.compile('(\\s)+')\n",
    "\n",
    "# the function takes an html documents and removes artifacts\n",
    "def cleanhtml(raw_html):\n",
    "    # tags\n",
    "    cleantext = re.sub(CLEANR, '', raw_html)\n",
    "    # newlines\n",
    "    cleantext = cleantext.replace(\"\\n\", '')\n",
    "    # tabs\n",
    "    cleantext = cleantext.replace(\"\\t\", '')\n",
    "    # character encodings\n",
    "    cleantext = cleantext.replace(\"&#39;\", \"'\")\n",
    "    cleantext = cleantext.replace(\"&amp;\", \"'\")\n",
    "    cleantext = cleantext.replace(\"&quot;\", '\"')\n",
    "    # multiple spaces\n",
    "    cleantext = re.sub(CLEANMULTSPACE, ' ', cleantext)\n",
    "    # documents end with this tags, if it is present in the string, cut it off\n",
    "    idx = cleantext.find(\"<!-- NewPP limit\")\n",
    "    if idx > -1:\n",
    "        cleantext = cleantext[:idx]\n",
    "    return cleantext.strip()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66ca19ac",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "\n",
    "# file count\n",
    "i = 0\n",
    "data = []\n",
    "\n",
    "# iterate over all json files\n",
    "for path in paths:\n",
    "    print(path)\n",
    "    # read file and store as list (this requires much memory, as the files are huge)\n",
    "    with open(path, 'r') as json_file:\n",
    "        json_list = list(json_file)\n",
    "    \n",
    "    # process every context, question, answer pair\n",
    "    for json_str in json_list:\n",
    "        result = json.loads(json_str)\n",
    "\n",
    "        # append a question mark - SQuAD questions end with a qm too\n",
    "        question = result['question_text'] + \"?\"\n",
    "        \n",
    "        # some question do not contain an answer - we do not need them\n",
    "        if(len(result['annotations'][0]['short_answers'])==0):\n",
    "            continue\n",
    "\n",
    "        # get true start/end byte\n",
    "        true_start = result['annotations'][0]['short_answers'][0]['start_byte']\n",
    "        true_end = result['annotations'][0]['short_answers'][0]['end_byte']\n",
    "\n",
    "        # convert to bytes\n",
    "        byte_encoding = bytes(result['document_html'], encoding='utf-8')\n",
    "        \n",
    "        # the document is the whole wikipedia article, we randomly choose an appropriate part (containing the\n",
    "        # answer): we have 512 tokens as the input for the model - 4000 bytes lead to a good length\n",
    "        max_back = 3500 if true_start >= 3500 else true_start\n",
    "        first = random.randint(int(true_start)-max_back, int(true_start))\n",
    "        end = first + 3500 + true_end - true_start\n",
    "        \n",
    "        # get chosen context\n",
    "        cleanbytes = byte_encoding[first:end]\n",
    "        # decode back to text - if our end byte is the middle of a word, we ignore it and cut it off\n",
    "        cleantext = bytes.decode(cleanbytes, errors='ignore')\n",
    "        # clean html tags\n",
    "        cleantext = cleanhtml(cleantext)\n",
    "\n",
    "        # find the true answer\n",
    "        answer_start = cleanbytes.find(byte_encoding[true_start:true_end])\n",
    "        true_answer = bytes.decode(cleanbytes[answer_start:answer_start+(true_end-true_start)])\n",
    "        \n",
    "        # clean html tags\n",
    "        true_answer = cleanhtml(true_answer)\n",
    "        \n",
    "        start_ind = cleantext.find(true_answer)\n",
    "        \n",
    "        # If cleaning the string makes the answer not findable skip it\n",
    "        # this hardly ever happens, except if there is an emense amount of web artifacts\n",
    "        if start_ind == -1:\n",
    "            continue\n",
    "            \n",
    "        data.append([cleantext, question, true_answer, str(start_ind)])\n",
    "\n",
    "        if len(data) == 1000:\n",
    "            with open(f\"data/natural_questions_train/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "                f.write(\"\\n\".join([\"\\t\".join(t) for t in data]))\n",
    "            i += 1\n",
    "            data = []\n",
    "with open(f\"data/natural_questions_train/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "    f.write(\"\\n\".join([\"\\t\".join(t) for t in data]))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30f26b4e",
   "metadata": {},
   "source": [
    "### Testing\n",
    "In the following, we first check if the shape of the file is correct.\n",
    "\n",
    "Then we iterate over the file and check if the answers according to the file are the same as in the original file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "490ac0db",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"data/natural_questions_train/text_0.txt\", 'r', encoding='utf-8') as f:\n",
    "    lines = f.read().split('\\n')\n",
    "    \n",
    "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d7cc3ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert lines.shape == (1000, 4)\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0fd8a854",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"data/natural_questions/v1.0/train/nq-train-00.jsonl\", 'r') as json_file:\n",
    "    json_list = list(json_file)[:500]\n",
    "del json_file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "170bff30",
   "metadata": {},
   "outputs": [],
   "source": [
    "lines_index = 0\n",
    "for i in range(len(json_list)):\n",
    "    result = json.loads(json_list[i])\n",
    "     \n",
    "    if(len(result['annotations'][0]['short_answers'])==0):\n",
    "        pass\n",
    "    else: \n",
    "        # assert that the question text is the same\n",
    "        assert result['question_text'] + \"?\" == lines.loc[lines_index, 'question']\n",
    "        true_start = result['annotations'][0]['short_answers'][0]['start_byte']\n",
    "        true_end = result['annotations'][0]['short_answers'][0]['end_byte']\n",
    "        true_answer = bytes.decode(bytes(result['document_html'], encoding='utf-8')[true_start:true_end])\n",
    "        \n",
    "        processed_answer = lines.loc[lines_index, 'answer']\n",
    "        # assert that the answer is the same\n",
    "        assert cleanhtml(true_answer) == processed_answer\n",
    "    \n",
    "        start_ind = int(lines.loc[lines_index, 'answer_start'])\n",
    "        # assert that the answer (according to the index) is the same\n",
    "        assert cleanhtml(true_answer) == lines.loc[lines_index, 'context'][start_ind:start_ind+len(processed_answer)]\n",
    "        \n",
    "        lines_index += 1\n",
    "    \n",
    "    if lines_index == len(lines):\n",
    "        break\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78e6e737",
   "metadata": {},
   "source": [
    "## Hotpot QA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27efcc8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = load_dataset(\"hotpot_qa\", 'fullwiki')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1493f21f",
   "metadata": {},
   "outputs": [],
   "source": [
    "ds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a047946",
   "metadata": {},
   "outputs": [],
   "source": [
    "os.mkdir('data/hotpotqa_training')\n",
    "os.mkdir('data/hotpotqa_test')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e65b6485",
   "metadata": {},
   "outputs": [],
   "source": [
    "# column contains the split (either train or validation), save_dir is the directory\n",
    "def save_samples(column, save_dir):\n",
    "    text = []\n",
    "    i = 0\n",
    "\n",
    "    for sample in tqdm(ds[column]):\n",
    "        \n",
    "        # preprocess the context and question by removing the newlines\n",
    "        context = sample['context']['sentences']\n",
    "        context = \" \".join([\"\".join(sentence) for sentence in context])\n",
    "        question = sample['question'].replace('\\n','')\n",
    "        \n",
    "        # get the answer as text and start character index\n",
    "        answer_text = sample['answer']\n",
    "        answer_start = context.find(answer_text)\n",
    "        if answer_start == -1:\n",
    "            continue\n",
    "            \n",
    "        \n",
    "            \n",
    "        if answer_start > 1500:\n",
    "            first = random.randint(answer_start-1500, answer_start)\n",
    "            end = first + 1500 + len(answer_text)\n",
    "            \n",
    "            context = context[first:end+1]\n",
    "            answer_start = context.find(answer_text)\n",
    "            \n",
    "            if answer_start == -1:continue\n",
    "            \n",
    "        text.append([context, question, answer_text, str(answer_start)])\n",
    "\n",
    "        # we choose chunks of 1000\n",
    "        if len(text) == 1000:\n",
    "            with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "                f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
    "            text = []\n",
    "            i += 1\n",
    "\n",
    "    # save remaining\n",
    "    with open(f\"data/{save_dir}/text_{i}.txt\", 'w', encoding='utf-8') as f:\n",
    "        f.write(\"\\n\".join([\"\\t\".join(t) for t in text]))\n",
    "\n",
    "save_samples(\"train\", \"hotpotqa_training\")\n",
    "save_samples(\"validation\", \"hotpotqa_test\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "97cc358f",
   "metadata": {},
   "source": [
    "## Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f321483c",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"data/hotpotqa_training/text_0.txt\", 'r', encoding='utf-8') as f:\n",
    "    lines = f.read().split('\\n')\n",
    "    \n",
    "lines = pd.DataFrame([line.split(\"\\t\") for line in lines], columns=[\"context\", \"question\", \"answer\", \"answer_start\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72a96e78",
   "metadata": {},
   "outputs": [],
   "source": [
    "assert lines.shape == (1000, 4)\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c32c2f16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# we assert that we have the right interval\n",
    "for ind, line in lines.iterrows():\n",
    "    sample = line\n",
    "    answer_start = int(sample['answer_start'])\n",
    "    assert sample['context'][answer_start:answer_start+len(sample['answer'])] == sample['answer']\n",
    "print(\"Passed\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc36fe7d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.16"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "85bf9c14e9ba73b783ed1274d522bec79eb0b2b739090180d8ce17bb11aff4aa"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}