Spaces:

mamogasr
/

llm_engineering

Sleeping

File size: 12,582 Bytes

5fdb69e

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d15d8294-3328-4e07-ad16-8a03e9bbfdb9",
   "metadata": {},
   "source": [
    "# Instant Gratification!\n",
    "\n",
    "Let's build a useful LLM solution - in a matter of minutes.\n",
    "\n",
    "Our goal is to code a new kind of Web Browser. Give it a URL, and it will respond with a summary. The Reader's Digest of the internet!!\n",
    "\n",
    "Before starting, be sure to have followed the instructions in the \"README\" file, including creating your API key with OpenAI and adding it to the `.env` file.\n",
    "\n",
    "## If you're new to Jupyter Lab\n",
    "\n",
    "Welcome to the wonderful world of Data Science experimentation! Once you've used Jupyter Lab, you'll wonder how you ever lived without it. Simply click in each \"cell\" with code in it, like the cell immediately below this text, and hit Shift+Return to execute that cell. As you wish, you can add a cell with the + button in the toolbar, and print values of variables, or try out variations.\n",
    "\n",
    "If you need to start again, go to Kernel menu >> Restart kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4e2a9393-7767-488e-a8bf-27c12dca35bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "# imports\n",
    "\n",
    "import os\n",
    "import requests\n",
    "from dotenv import load_dotenv\n",
    "from bs4 import BeautifulSoup\n",
    "from IPython.display import Markdown, display\n",
    "from openai import OpenAI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6900b2a8-6384-4316-8aaa-5e519fca4254",
   "metadata": {},
   "source": [
    "# Connecting to OpenAI\n",
    "\n",
    "The next cell is where we load in the environment variables in your `.env` file and connect to OpenAI.\n",
    "\n",
    "## Troubleshooting if you have problems:\n",
    "\n",
    "1. OpenAI takes a few minutes to register after you set up an account. If you receive an error about being over quota, try waiting a few minutes and try again.\n",
    "2. Also, double check you have the right kind of API token with the right permissions. You should find it on [this webpage](https://platform.openai.com/api-keys) and it should show with Permissions of \"All\". If not, try creating another key by:\n",
    "- Pressing \"Create new secret key\" on the top right\n",
    "- Select **Owned by:** you, **Project:** Default project, **Permissions:** All\n",
    "- Click Create secret key, and use that new key in the code and the `.env` file (it might take a few minutes to activate)\n",
    "- Do a Kernel >> Restart kernel, and execute the cells in this Jupyter lab starting at the top\n",
    "4. As a fallback, replace the line `openai = OpenAI()` with `openai = OpenAI(api_key=\"your-key-here\")` - while it's not recommended to hard code tokens in Jupyter lab, because then you can't share your lab with others, it's a workaround for now\n",
    "5. Contact me! Message me or email [email protected] and we will get this to work.\n",
    "\n",
    "Any concerns about API costs? See my notes in the README - costs should be minimal, and you can control it at every point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7b87cadb-d513-4303-baee-a37b6f938e4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load environment variables in a file called .env\n",
    "\n",
    "load_dotenv()\n",
    "os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY','your-key-if-not-using-env')\n",
    "openai = OpenAI()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5e793b2-6775-426a-a139-4848291d0463",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A class to represent a Webpage\n",
    "\n",
    "class Website:\n",
    "    url: str\n",
    "    title: str\n",
    "    text: str\n",
    "\n",
    "    def __init__(self, url):\n",
    "        self.url = url\n",
    "        response = requests.get(url)\n",
    "        soup = BeautifulSoup(response.content, 'html.parser')\n",
    "        self.title = soup.title.string if soup.title else \"No title found\"\n",
    "        for irrelevant in soup.body([\"script\", \"style\", \"img\", \"input\"]):\n",
    "            irrelevant.decompose()\n",
    "        self.text = soup.body.get_text(separator=\"\\n\", strip=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ef960cf-6dc2-4cda-afb3-b38be12f4c97",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's try one out\n",
    "\n",
    "ed = Website(\"https://edwarddonner.com\")\n",
    "print(ed.title)\n",
    "print(ed.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a478a0c-2c53-48ff-869c-4d08199931e1",
   "metadata": {},
   "source": [
    "## Types of prompts\n",
    "\n",
    "You may know this already - but if not, you will get very familiar with it!\n",
    "\n",
    "Models like GPT4o have been trained to receive instructions in a particular way.\n",
    "\n",
    "They expect to receive:\n",
    "\n",
    "**A system prompt** that tells them what task they are performing and what tone they should use\n",
    "\n",
    "**A user prompt** -- the conversation starter that they should reply to"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "abdb8417-c5dc-44bc-9bee-2e059d162699",
   "metadata": {},
   "outputs": [],
   "source": [
    "system_prompt = \"You are an assistant that analyzes the contents of a website \\\n",
    "and provides a short summary, ignoring text that might be navigation related. \\\n",
    "Respond in markdown.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0275b1b-7cfe-4f9d-abfa-7650d378da0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "def user_prompt_for(website):\n",
    "    user_prompt = f\"You are looking at a website titled {website.title}\"\n",
    "    user_prompt += \"The contents of this website is as follows; \\\n",
    "please provide a short summary of this website in markdown. \\\n",
    "If it includes news or announcements, then summarize these too.\\n\\n\"\n",
    "    user_prompt += website.text\n",
    "    return user_prompt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea211b5f-28e1-4a86-8e52-c0b7677cadcc",
   "metadata": {},
   "source": [
    "## Messages\n",
    "\n",
    "The API from OpenAI expects to receive messages in a particular structure.\n",
    "Many of the other APIs share this structure:\n",
    "\n",
    "```\n",
    "[\n",
    "    {\"role\": \"system\", \"content\": \"system message goes here\"},\n",
    "    {\"role\": \"user\", \"content\": \"user message goes here\"}\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0134dfa4-8299-48b5-b444-f2a8c3403c88",
   "metadata": {},
   "outputs": [],
   "source": [
    "def messages_for(website):\n",
    "    return [\n",
    "        {\"role\": \"system\", \"content\": system_prompt},\n",
    "        {\"role\": \"user\", \"content\": user_prompt_for(website)}\n",
    "    ]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16f49d46-bf55-4c3e-928f-68fc0bf715b0",
   "metadata": {},
   "source": [
    "## Time to bring it together - the API for OpenAI is very simple!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "905b9919-aba7-45b5-ae65-81b3d1d78e34",
   "metadata": {},
   "outputs": [],
   "source": [
    "def summarize(url):\n",
    "    website = Website(url)\n",
    "    response = openai.chat.completions.create(\n",
    "        model = \"gpt-4o-mini\",\n",
    "        messages = messages_for(website)\n",
    "    )\n",
    "    return response.choices[0].message.content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "05e38d41-dfa4-4b20-9c96-c46ea75d9fb5",
   "metadata": {},
   "outputs": [],
   "source": [
    "summarize(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d926d59-450e-4609-92ba-2d6f244f1342",
   "metadata": {},
   "outputs": [],
   "source": [
    "def display_summary(url):\n",
    "    summary = summarize(url)\n",
    "    display(Markdown(summary))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3018853a-445f-41ff-9560-d925d1774b2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45d83403-a24c-44b5-84ac-961449b4008f",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://cnn.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "75e9fd40-b354-4341-991e-863ef2e59db7",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://anthropic.com\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36ed9f14-b349-40e9-a42c-b367e77f8bda",
   "metadata": {},
   "source": [
    "## An extra exercise for those who enjoy web scraping\n",
    "\n",
    "You may notice that if you try `display_summary(\"https://openai.com\")` - it doesn't work! That's because OpenAI has a fancy website that uses Javascript. There are many ways around this that some of you might be familiar with. For example, Selenium is a hugely popular framework that runs a browser behind the scenes, renders the page, and allows you to query it. If you have experience with Selenium, Playwright or similar, then feel free to improve the Website class to use them. Please push your code afterwards so I can share it with other students!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "52ae98bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://openai.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5d57e958",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Parse webpages which is designed using JavaScript heavely\n",
    "# download the chorme driver from here as per your version of chrome - https://developer.chrome.com/docs/chromedriver/downloads\n",
    "from selenium import webdriver\n",
    "from selenium.webdriver.chrome.service import Service\n",
    "from selenium.webdriver.common.by import By\n",
    "from selenium.webdriver.chrome.options import Options\n",
    "\n",
    "PATH_TO_CHROME_DRIVER = '..\\\\path\\\\to\\\\chromedriver.exe'\n",
    "\n",
    "class Website:\n",
    "    url: str\n",
    "    title: str\n",
    "    text: str\n",
    "\n",
    "    def __init__(self, url):\n",
    "        self.url = url\n",
    "\n",
    "        options = Options()\n",
    "\n",
    "        options.add_argument(\"--no-sandbox\")\n",
    "        options.add_argument(\"--disable-dev-shm-usage\")\n",
    "\n",
    "        service = Service(PATH_TO_CHROME_DRIVER)\n",
    "        driver = webdriver.Chrome(service=service, options=options)\n",
    "        driver.get(url)\n",
    "\n",
    "        input(\"Please complete the verification in the browser and press Enter to continue...\")\n",
    "        page_source = driver.page_source\n",
    "        driver.quit()\n",
    "\n",
    "        soup = BeautifulSoup(page_source, 'html.parser')\n",
    "        self.title = soup.title.string if soup.title else \"No title found\"\n",
    "        for irrelevant in soup([\"script\", \"style\", \"img\", \"input\"]):\n",
    "            irrelevant.decompose()\n",
    "        self.text = soup.get_text(separator=\"\\n\", strip=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65192f6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://openai.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2eb9599",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://edwarddonner.com\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7ba56c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "display_summary(\"https://cnn.com\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}