{
"cells": [
{
"cell_type": "markdown",
"id": "c7d95a7f-205a-4262-a1af-4579489025ff",
"metadata": {},
"source": [
"# Hello everyone."
]
},
{
"cell_type": "markdown",
"id": "bc815dbc-acf7-45f9-a043-5767184c44c6",
"metadata": {},
"source": [
"I completed the Day 1 exercise, the first LLM experiment, moments ago and found it really awesome. After finishing the challenge, I wanted to chip in my two cents by building a PDF summarizer based on the code for the Website Summarizer. I want to share it in this contribution!\n",
"### To consider:\n",
"* To extract the contents of PDF files, I used the PyPDF2 library, which isn't included in the default configuration of the virtual environment. To install it, follow these steps:\n",
"    1. Shut down Jupyter Lab. Pressing `CTRL-C` in the Anaconda terminal should achieve this.\n",
"    2. Run the following command: `pip install PyPDF2 --user`\n",
"    3. Restart Jupyter Lab with `jupyter lab`\n",
"* To find PDF files online, you can add `filetype:pdf` to your search query, e.g. searching for the following can give you PDF files to use as input: `AI Engineering prompts filetype:pdf`!\n",
"\n",
"Without further ado, here's the PDF Summarizer!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "06b63787-c6c8-4868-8a71-eb56b7618626",
"metadata": {},
"outputs": [],
"source": [
"# Import statements\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv\n",
"from IPython.display import Markdown, display\n",
"from openai import OpenAI\n",
"from io import BytesIO\n",
"from PyPDF2 import PdfReader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "284ca770-5da4-495c-b1cf-637727a8609f",
"metadata": {},
"outputs": [],
"source": [
"# Load environment variables in a file called .env\n",
"\n",
"load_dotenv()\n",
"api_key = os.getenv('OPENAI_API_KEY')\n",
"\n",
"# Check the key\n",
"\n",
"if not api_key:\n",
"    print(\"No API key was found - please head over to the troubleshooting notebook in this folder to identify & fix!\")\n",
"elif not api_key.startswith(\"sk-proj-\"):\n",
"    print(\"An API key was found, but it doesn't start sk-proj-; please check you're using the right key - see troubleshooting notebook\")\n",
"elif api_key.strip() != api_key:\n",
"    print(\"An API key was found, but it looks like it might have space or tab characters at the start or end - please remove them - see troubleshooting notebook\")\n",
"else:\n",
"    print(\"API key found and looks good so far!\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4c316d7-d9c9-4400-b03e-1dd629c6b2ad",
"metadata": {},
"outputs": [],
"source": [
"openai = OpenAI()\n",
"\n",
"# If this doesn't work, try Kernel menu >> Restart Kernel and Clear Outputs Of All Cells, then run the cells from the top of this notebook down.\n",
"# If it STILL doesn't work (horrors!) then please see the troubleshooting notebook, or try the below line instead:\n",
"# openai = OpenAI(api_key=\"your-key-here-starting-sk-proj-\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a053092-f4f6-4156-8721-39353c8a9367",
"metadata": {},
"outputs": [],
"source": [
"# Step 0: Create article class\n",
"class Article:\n",
"    def __init__(self, url):\n",
"        \"\"\"\n",
"        Create this Article object from the given url using the PyPDF2 library\n",
"        \"\"\"\n",
"        self.url = url\n",
"        response = requests.get(self.url)\n",
"        if response.status_code == 200:\n",
"            pdf_bytes = BytesIO(response.content)\n",
"            reader = PdfReader(pdf_bytes)\n",
"\n",
"            text = \"\"\n",
"            for page in reader.pages:\n",
"                # Guard against pages that yield no extractable text\n",
"                text += page.extract_text() or \"\"\n",
"\n",
"            self.text = text\n",
"            # reader.metadata is None when the PDF carries no document information\n",
"            metadata = reader.metadata or {}\n",
"            self.title = metadata.get(\"/Title\", \"No title found\")\n",
"        else:\n",
"            print(f\"Failed to fetch PDF. Status code: {response.status_code}\")\n",
"            self.text = \"No text found\"\n",
"            self.title = \"No title found\""
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "adc528f2-25ca-47b5-896e-9d417ba0195f",
"metadata": {},
"outputs": [],
"source": [
"# Step 1: Create your prompts\n",
"\n",
"def craft_user_prompt(article):\n",
"    # Adjacent string literals are joined, so no stray indentation ends up in the prompt\n",
"    user_prompt = (\n",
"        f\"You are looking at a research article titled {article.title}.\\n\"\n",
"        \"Based on the body of the article, how are micro RNAs produced in the cell? \"\n",
"        \"State the function of the proteins involved. \"\n",
"        \"The body of the article is as follows.\\n\\n\"\n",
"    )\n",
"    user_prompt += article.text\n",
"    return user_prompt\n",
"\n",
"# Step 2: Make the messages list\n",
"def craft_messages(article):\n",
"    system_prompt = (\n",
"        \"You are an assistant that analyses the contents of a research article and provides answers \"\n",
"        \"to the question asked by the user in 250 words or less. \"\n",
"        \"Ignore text that doesn't belong to the article, like headers or navigation related text. \"\n",
"        \"Respond in markdown. Structure your text in the form of question/answer.\"\n",
"    )\n",
"    return [\n",
"        {\"role\": \"system\", \"content\": system_prompt},\n",
"        {\"role\": \"user\", \"content\": craft_user_prompt(article)}\n",
"    ]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81ab896e-1ba9-4964-a477-2a0608b7036c",
"metadata": {},
"outputs": [],
"source": [
"# Step 3: Call OpenAI\n",
"def summarize(url):\n",
"    article = Article(url)\n",
"    response = openai.chat.completions.create(\n",
"        model=\"gpt-4o-mini\",\n",
"        messages=craft_messages(article)\n",
"    )\n",
"    return response.choices[0].message.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7a98cdf-0d3b-477d-8e39-a6a4264b9feb",
"metadata": {},
"outputs": [],
"source": [
"# Step 4: Print the result of an example pdf\n",
"summary = summarize(\"https://www.nature.com/articles/s12276-023-01050-9.pdf\")\n",
"display(Markdown(summary))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}