Community Contribution: Async Playwright-based OpenAI Scraper
This contribution presents a fully asynchronous, headless-browser-based scraper for https://openai.com using Playwright, an alternative to Selenium.
Developed by: lakovicb
IDE used: WingIDE Pro (Jupyter compatibility via nest_asyncio)
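As a rough illustration of the asynchronous, headless approach described above, the sketch below loads a page with Playwright's async API and returns its HTML. The function name `fetch_html` and the `domcontentloaded` wait strategy are assumptions made for illustration; they are not taken from the contribution's notebooks.

```python
import asyncio


async def fetch_html(url: str, headless: bool = True) -> str:
    """Open `url` in headless Chromium and return the rendered HTML.

    Hypothetical helper: a minimal sketch, not the contribution's actual code.
    """
    # Imported lazily so the module can be loaded without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=headless)
        try:
            page = await browser.new_page()
            # Wait only for the DOM to be parsed; heavier pages may need
            # a different wait strategy (assumption).
            await page.goto(url, wait_until="domcontentloaded")
            return await page.content()
        finally:
            # Always release the browser, even if navigation fails.
            await browser.close()
```

Inside Jupyter, where an event loop is already running, `await fetch_html("https://openai.com")` can be used directly in a cell. Calling `nest_asyncio.apply()` first is one way to make `asyncio.run(...)` usable there as well.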
Features
- Simulates human-like interactions (mouse movement, scrolling)
- GPT-based analysis using OpenAI's API
- Works inside JupyterLab using nest_asyncio
- Prometheus metrics for scraping observability
- Smart content caching via diskcache
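The caching feature listed above could work roughly as follows: check the cache before fetching, and store the result with an expiry time after a miss. This sketch assumes a cache object with `get(key)` and `set(key, value, expire=...)` methods, which matches `diskcache.Cache`. The names `cache_key`, `get_page_cached`, and `open_disk_cache`, the cache directory, and the one-hour expiry are hypothetical.

```python
import asyncio
import hashlib


def cache_key(url: str) -> str:
    """Derive a stable, filesystem-safe key from a URL."""
    return "page:" + hashlib.sha256(url.encode("utf-8")).hexdigest()


async def get_page_cached(url, fetch, cache, ttl_seconds=3600):
    """Return cached content for `url`, calling `fetch(url)` only on a miss.

    `fetch` is any coroutine function returning the page content;
    `cache` is any object with diskcache-style `get` and `set` methods.
    """
    key = cache_key(url)
    cached = cache.get(key)
    if cached is not None:
        return cached
    content = await fetch(url)
    # Store with an expiry so stale pages are eventually re-fetched.
    cache.set(key, content, expire=ttl_seconds)
    return content


def open_disk_cache(directory="./.scrape_cache"):
    """Open an on-disk cache (directory name is an assumption)."""
    # Imported lazily so the helpers above work without diskcache installed.
    import diskcache

    return diskcache.Cache(directory)
```

Passing the cache and the fetch function in as arguments keeps the caching logic independent of the browser code, so it can be exercised with an in-memory stand-in.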
How to Run
1. Install dependencies
pip install -r requirements.txt
playwright install
2. Set environment variables in .env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
You can also define optional proxy/login params if needed.
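One way the two variables above might be read at startup is sketched below. It assumes the `.env` file is loaded with the python-dotenv package, which the source does not confirm, and the names `ScraperSettings`, `load_settings`, and `load_dotenv_if_available` are hypothetical. The optional proxy and login parameters are left out because their names are not given.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScraperSettings:
    openai_api_key: str
    browser_path: Optional[str]  # None means "use Playwright's bundled browser"


def load_dotenv_if_available() -> bool:
    """Load `.env` into the process environment if python-dotenv is installed."""
    try:
        from dotenv import load_dotenv  # third-party: python-dotenv (assumption)
    except ImportError:
        return False
    load_dotenv()
    return True


def load_settings(env: Mapping[str, str] = os.environ) -> ScraperSettings:
    """Read the required and optional settings, failing early if the key is missing."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or the environment")
    # Treat an empty BROWSER_PATH the same as an unset one.
    browser_path = env.get("BROWSER_PATH", "").strip() or None
    return ScraperSettings(openai_api_key=api_key, browser_path=browser_path)
```

The resulting `browser_path` could then be handed to Playwright, for example as `chromium.launch(executable_path=settings.browser_path)`.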
Notebooks Included
Notebook | Description |
---|---|
Playwright_Solution_JupyterAsync.ipynb | Executes async scraper directly inside Jupyter |
Playwright_Solution_Showcase_Formatted.ipynb | Nicely formatted output for human reading |
Output Example
- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes
Can be extended with PDF export, a LangChain pipeline, or vector store ingestion.
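For downstream steps such as export or ingestion, the four output parts listed above could be held in a small typed structure. The sketch below assumes the model is asked to reply with a JSON object. The class name `PageAnalysis`, the function `parse_analysis`, and the field names `summary`, `timeline`, `entities`, and `topics` are hypothetical and do not come from the notebooks.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageAnalysis:
    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


def _as_str_list(value) -> List[str]:
    """Coerce a JSON value to a list of strings; anything else becomes empty."""
    if isinstance(value, list):
        return [str(item) for item in value]
    return []


def parse_analysis(raw_json: str) -> PageAnalysis:
    """Parse a JSON reply into a PageAnalysis, tolerating missing fields."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return PageAnalysis(
        summary=str(data.get("summary", "")),
        timeline=_as_str_list(data.get("timeline")),
        entities=_as_str_list(data.get("entities")),
        topics=_as_str_list(data.get("topics")),
    )
```

Missing or malformed fields fall back to empty values instead of raising, which keeps a single odd reply from stopping a longer scraping run.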
Thanks
Huge thanks to Ed Donner for the amazing course and challenge inspiration!