File size: 6,927 Bytes
5fdb69e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c7d95a7f-205a-4262-a1af-4579489025ff",
   "metadata": {},
   "source": [
    "# Hello everyone."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc815dbc-acf7-45f9-a043-5767184c44c6",
   "metadata": {},
   "source": [
    "I completed the day 1, first LLM Experiment moments ago and found it really awesome. After the challenge was done, I wanted to chip in my two cents by making a PDF summarizer, basing myself on the code for the Website Summarizer. I want to share it in this contribution!\n",
    "### To consider:\n",
    "* To extract the contents of PDF files, I used the PyPDF2 library, which doesn't come with the default configuration of the virtual environment. To remedy the situation, you need to follow the steps:\n",
    "  1. Shut down Anaconda. Running `CTRL-C` in the Anaconda terminal should achieve this.\n",
    "  2. Run the following command, `pip install PyPDF2 --user`\n",
    "  3. Restart Jupyter lab with `jupyter lab`\n",
    "* To find PDF files online, you can add `filetype:url` on your browser query, i.e. searching the following can give you PDF files to add as input: `AI Engineering prompts filetype:pdf`!\n",
    "\n",
    "Without further ado, here's the PDF Summarizer!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06b63787-c6c8-4868-8a71-eb56b7618626",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import statements\n",
    "import os\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI\n",
    "from io import BytesIO\n",
    "from PyPDF2 import PdfReader"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "284ca770-5da4-495c-b1cf-637727a8609f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables in a file called .env\n",
    "\n",
    "load_dotenv()\n",
    "api_key = os.getenv('OPENAI_API_KEY')\n",
    "\n",
    "# Check the key\n",
    "\n",
    "if not api_key:\n",
    "    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
    "elif not api_key.startswith(\"sk-proj-\"):\n",
    "    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
    "elif api_key.strip() != api_key:\n",
    "    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
    "else:\n",
    "    print(\"API key found and looks good so far!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4c316d7-d9c9-4400-b03e-1dd629c6b2ad",
   "metadata": {},
   "outputs": [],
   "source": [
    "openai = OpenAI()\n",
    "\n",
    "# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
    "# If it STILL doesn't work (horrors!) then please see the troubleshooting notebook, or try the below line instead:\n",
    "# openai = OpenAI(api_key=\"your-key-here-starting-sk-proj-\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a053092-f4f6-4156-8721-39353c8a9367",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 0: Create article class\n",
    "class Article:\n",
    "    def __init__(self, url):\n",
    "        \"\"\"\n",
    "        Create this Article object from the given url using the PyPDF2 library\n",
    "        \"\"\"\n",
    "        self.url = url \n",
    "        response = requests.get(self.url)\n",
    "        if response.status_code == 200:\n",
    "            pdf_bytes = BytesIO(response.content)\n",
    "            reader = PdfReader(pdf_bytes)\n",
    "        \n",
    "            text = \"\"\n",
    "            for page in reader.pages:\n",
    "                text += page.extract_text()\n",
    "        \n",
    "            self.text = text\n",
    "            self.title = reader.metadata.get(\"/Title\", \"No title found\")\n",
    "        else:\n",
    "            print(f\"Failed to fetch PDF. Status code: {response.status_code}\")\n",
    "            self.text = \"No text found\"\n",
    "            self.title = \"No title found\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "adc528f2-25ca-47b5-896e-9d417ba0195f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 1: Create your prompts\n",
    "\n",
    "def craft_user_prompt(article):\n",
    "    user_prompt = f\"You are looking at a research article titled {article.title}\\n Based on the body of the article, how are micro RNAs produced in the cell? State the function of the proteins \\\n",
    "    involved. The body of the article is as follows.\"\n",
    "    user_prompt += article.text\n",
    "    return user_prompt\n",
    "\n",
    "# Step 2: Make the messages list\n",
    "def craft_messages(article):\n",
    "    system_prompt = \"You are an assistant that analyses the contents of a research article and provide answers to the question asked by the user in 250 words or less. \\\n",
    "                Ignore text that doesn't belong to the article, like headers or navigation related text. Respond in markdown. Structure your text in the form of question/answer.\"\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": craft_user_prompt(article)}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81ab896e-1ba9-4964-a477-2a0608b7036c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 3: Call OpenAI\n",
    "def summarize(url):\n",
    "    article = Article(url)\n",
    "    response = openai.chat.completions.create(\n",
    "        model = \"gpt-4o-mini\",\n",
    "        messages = craft_messages(article)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7a98cdf-0d3b-477d-8e39-a6a4264b9feb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Step 4: Print the result of an example pdf\n",
    "summary = summarize(\"https://www.nature.com/articles/s12276-023-01050-9.pdf\")\n",
    "display(Markdown(summary))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}