
🧠 Community Contribution: Async Playwright-based OpenAI Scraper

This contribution presents a fully asynchronous scraper for https://openai.com that drives a headless browser with Playwright, as an alternative to Selenium.

Developed by: lakovicb
IDE used: WingIDE Pro (Jupyter compatibility via nest_asyncio)


📦 Features

  • 🧭 Simulates human-like interactions (mouse movement, scrolling)
  • 🧠 GPT-based analysis using OpenAI's API
  • 🧪 Works inside JupyterLab using nest_asyncio
  • 📊 Prometheus metrics for scraping observability
  • ⚡ Smart content caching via diskcache
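The human-like interaction feature could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: `plan_scroll_steps` and `humanlike_browse` are hypothetical names, the pixel and timing ranges are arbitrary assumptions, and `page` is assumed to be a Playwright async `Page` (only `page.mouse.move`, `page.mouse.wheel`, and `page.wait_for_timeout` are used).

```python
import random


def plan_scroll_steps(total_px, rng, min_step=200, max_step=600):
    """Split a scroll distance into uneven chunks so it looks less mechanical."""
    steps = []
    remaining = total_px
    while remaining > 0:
        # The final chunk may be smaller than min_step (whatever is left).
        step = min(remaining, rng.randint(min_step, max_step))
        steps.append(step)
        remaining -= step
    return steps


async def humanlike_browse(page, total_px=3000, seed=None):
    """Move the pointer and scroll in small, randomly timed increments.

    `page` is assumed to be a Playwright async Page; all ranges below are
    illustrative guesses, not values taken from the original scraper.
    """
    rng = random.Random(seed)
    for step in plan_scroll_steps(total_px, rng):
        # `steps` asks Playwright to emit intermediate mouse-move events.
        await page.mouse.move(
            rng.randint(100, 900), rng.randint(100, 600), steps=rng.randint(5, 15)
        )
        # Scroll down by `step` pixels, then pause for a random interval.
        await page.mouse.wheel(0, step)
        await page.wait_for_timeout(rng.randint(300, 1200))
```

Passing a `seed` makes a run reproducible, which can help when debugging; leaving it as `None` gives different movement on every run.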

🚀 How to Run

1. Install dependencies

pip install -r requirements.txt

Make sure Playwright is installed and its browser binaries are downloaded:

playwright install

2. Set environment variables in .env

OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser

Optional proxy and login parameters can also be defined in .env if needed.
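One way the two variables above might be read and validated at startup is sketched below. `ScraperSettings` and `load_settings` are hypothetical names, and treating BROWSER_PATH as optional is an assumption. If the project relies on python-dotenv, calling `load_dotenv()` beforehand would copy the .env entries into `os.environ`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScraperSettings:
    """Hypothetical container for the values defined in .env."""
    openai_api_key: str
    browser_path: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> ScraperSettings:
    """Read settings from an environment mapping and fail early if the key is missing."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the environment")
    # Assumption: BROWSER_PATH is optional; an empty value becomes None.
    browser_path = env.get("BROWSER_PATH", "").strip() or None
    return ScraperSettings(openai_api_key=api_key, browser_path=browser_path)
```

The resulting `browser_path` could then be handed to Playwright's `chromium.launch(executable_path=...)`, and leaving it unset would let Playwright fall back to its own downloaded browser.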


📘 Notebooks Included

| Notebook | Description |
| --- | --- |
| Playwright_Solution_JupyterAsync.ipynb | Executes the async scraper directly inside Jupyter |
| Playwright_Solution_Showcase_Formatted.ipynb | Nicely formatted output for human reading |
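Running async code inside Jupyter needs some care because the kernel already has an event loop running, so a plain `asyncio.run(...)` raises an error there. The sketch below shows one way nest_asyncio can bridge that gap. `run_coro` and `fetch_title_stub` are hypothetical names, not functions from the notebooks.

```python
import asyncio


def run_coro(coro):
    """Run `coro` to completion from a plain script or from a Jupyter cell."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): asyncio.run is enough.
        return asyncio.run(coro)
    # A loop is already running (e.g. a Jupyter kernel): patch it so that
    # run_until_complete can be called re-entrantly.
    import nest_asyncio  # third-party; only needed in this branch

    nest_asyncio.apply(loop)
    return loop.run_until_complete(coro)


async def fetch_title_stub():
    """Hypothetical stand-in for the real async scraping coroutine."""
    await asyncio.sleep(0)
    return "ok"


if __name__ == "__main__":
    print(run_coro(fetch_title_stub()))
```

Recent Jupyter versions also accept a top-level `await` in a cell, so the patch mostly matters when synchronous helper code needs to call into the async scraper.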

🔍 Output Example

  • GPT-generated summary
  • Timeline of updates
  • Entities and projects mentioned
  • Structured topics & themes
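The four output parts listed above could be carried in a small typed structure. The sketch below assumes the model is asked to reply with a JSON object using the keys `summary`, `timeline`, `entities`, and `topics`. Those key names, `PageAnalysis`, and `parse_analysis` are illustrative assumptions, not the notebooks' actual schema.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageAnalysis:
    """Hypothetical shape of one GPT analysis result."""
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


def parse_analysis(raw: str) -> PageAnalysis:
    """Parse a JSON reply into a PageAnalysis, tolerating missing keys."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    def as_list(key: str) -> List[str]:
        # Missing or null values become an empty list; a scalar becomes a one-item list.
        value = data.get(key)
        if value is None:
            return []
        if isinstance(value, list):
            return [str(item) for item in value]
        return [str(value)]

    return PageAnalysis(
        summary=str(data.get("summary") or ""),
        timeline=as_list("timeline"),
        entities=as_list("entities"),
        topics=as_list("topics"),
    )
```

Parsing into a fixed structure like this would keep later steps, such as formatting for display or exporting, independent of the exact wording of the model's reply.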

✅ Can be extended with PDF export, a LangChain pipeline, or vector store ingestion.


🙏 Thanks

Huge thanks to Ed Donner for the amazing course and challenge inspiration!