Spaces:

mamogasr
/

llm_engineering

Sleeping

File size: 6,927 Bytes

5fdb69e

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c7d95a7f-205a-4262-a1af-4579489025ff",
   "metadata": {},
   "source": [
    "# Hello everyone."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc815dbc-acf7-45f9-a043-5767184c44c6",
   "metadata": {},
   "source": [
    "I completed the day 1, first LLM Experiment moments ago and found it really awesome. After the challenge was done, I wanted to chip in my two cents by making a PDF summarizer, basing myself on the code for the Website Summarizer. I want to share it in this contribution!\n",
    "### To consider:\n",
    "* To extract the contents of PDF files, I used the PyPDF2 library, which doesn't come with the default configuration of the virtual environment. To remedy the situation, you need to follow the steps:\n",
    "  1. Shut down Anaconda. Running `CTRL-C` in the Anaconda terminal should achieve this.\n",
    "  2. Run the following command, `pip install PyPDF2 --user`\n",
    "  3. Restart Jupyter lab with `jupyter lab`\n",
    "* To find PDF files online, you can add `filetype:url` on your browser query, i.e. searching the following can give you PDF files to add as input: `AI Engineering prompts filetype:pdf`!\n",
    "\n",
    "Without further ado, here's the PDF Summarizer!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06b63787-c6c8-4868-8a71-eb56b7618626",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import statements\n",
    "import os\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "from io import BytesIO\n",
    "from PyPDF2 import PdfReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "284ca770-5da4-495c-b1cf-637727a8609f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables in a file called .env\n",
    "\n",
    "load_dotenv()\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Check the key\n",
    "\n",
    "if not api_key:\n",
    "    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
    "elif not api_key.startswith(\"sk-proj-\"):\n",
    "    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
    "elif api_key.strip() != api_key:\n",
    "    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
    "else:\n",
    "    print(\"API key found and looks good so far!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c316d7-d9c9-4400-b03e-1dd629c6b2ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai = OpenAI()\n",
    "\n",
    "# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
    "# If it STILL doesn't work (horrors!) then please see the troubleshooting notebook, or try the below line instead:\n",
    "# openai = OpenAI(api_key=\"your-key-here-starting-sk-proj-\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a053092-f4f6-4156-8721-39353c8a9367",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 0: Create article class\n",
    "class Article:\n",
    "    def __init__(self, url):\n",
    "        \"\"\"\n",
    "        Create this Article object from the given url using the PyPDF2 library\n",
    "        \"\"\"\n",
    "        self.url = url \n",
    "        response = requests.get(self.url)\n",
    "        if response.status_code == 200:\n",
    "            pdf_bytes = BytesIO(response.content)\n",
    "            reader = PdfReader(pdf_bytes)\n",
    "        \n",
    "            text = \"\"\n",
    "            for page in reader.pages:\n",
    "                text += page.extract_text()\n",
    "        \n",
    "            self.text = text\n",
    "            self.title = reader.metadata.get(\"/Title\", \"No title found\")\n",
    "        else:\n",
    "            print(f\"Failed to fetch PDF. Status code: {response.status_code}\")\n",
    "            self.text = \"No text found\"\n",
    "            self.title = \"No title found\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adc528f2-25ca-47b5-896e-9d417ba0195f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1: Create your prompts\n",
    "\n",
    "def craft_user_prompt(article):\n",
    "    user_prompt = f\"You are looking at a research article titled {article.title}\\n Based on the body of the article, how are micro RNAs produced in the cell? State the function of the proteins \\\n",
    "    involved. The body of the article is as follows.\"\n",
    "    user_prompt += article.text\n",
    "    return user_prompt\n",
    "\n",
    "# Step 2: Make the messages list\n",
    "def craft_messages(article):\n",
    "    system_prompt = \"You are an assistant that analyses the contents of a research article and provide answers to the question asked by the user in 250 words or less. \\\n",
    "                Ignore text that doesn't belong to the article, like headers or navigation related text. Respond in markdown. Structure your text in the form of question/answer.\"\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": craft_user_prompt(article)}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81ab896e-1ba9-4964-a477-2a0608b7036c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 3: Call OpenAI\n",
    "def summarize(url):\n",
    "    article = Article(url)\n",
    "    response = openai.chat.completions.create(\n",
    "        model = \"gpt-4o-mini\",\n",
    "        messages = craft_messages(article)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7a98cdf-0d3b-477d-8e39-a6a4264b9feb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 4: Print the result of an example pdf\n",
    "summary = summarize(\"https://www.nature.com/articles/s12276-023-01050-9.pdf\")\n",
    "display(Markdown(summary))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}