# 🧠 Community Contribution: Async Playwright-based OpenAI Scraper
This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) built with **Playwright**, as an alternative to Selenium.
Developed by: [lakovicb](https://github.com/lakovicb)
IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)
---
## 📦 Features
- 🧭 Simulates human-like interactions (mouse movement, scrolling)
- 🧠 GPT-based analysis using OpenAI's API
- 🧪 Works inside **JupyterLab** using `nest_asyncio`
- 📊 Prometheus metrics for scraping observability
- ⚡ Smart content caching via `diskcache`
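The caching feature can be illustrated with a minimal sketch. It is not taken from the notebooks: `cached_fetch`, `MemoryCache`, and the `ttl_seconds` default are hypothetical names chosen for illustration, and `fetch` stands in for whatever coroutine actually loads the page. The cache only needs `get` and `set(..., expire=...)`, which is the shape `diskcache.Cache` offers.

```python
import asyncio
import hashlib
from typing import Any, Awaitable, Callable, Dict, Optional


class MemoryCache:
    """In-memory stand-in exposing the get/set shape of diskcache.Cache.

    Unlike diskcache, it ignores `expire` and does not persist to disk.
    """

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def set(self, key: str, value: Any, expire: Optional[float] = None) -> bool:
        self._data[key] = value
        return True


async def cached_fetch(
    url: str,
    fetch: Callable[[str], Awaitable[str]],
    cache: Any,
    ttl_seconds: float = 3600.0,  # assumed default, not from the project
) -> str:
    """Return cached content for `url`, calling `fetch` only on a miss."""
    # Hash the URL so the key stays short and filesystem-friendly.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cached = cache.get(key)
    if cached is not None:
        return cached
    content = await fetch(url)
    cache.set(key, content, expire=ttl_seconds)
    return content
```

With `diskcache` installed, passing `diskcache.Cache("some_directory")` instead of `MemoryCache()` should give the same behavior with on-disk persistence and real expiry.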
---
## 🚀 How to Run
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
> Ensure [Playwright is installed & browsers are downloaded](https://playwright.dev/python/docs/intro)
```bash
playwright install
```
### 2. Set environment variables in `.env`
```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```
You can also define optional proxy/login params if needed.
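As a rough sketch of how the two variables above might be read at startup, the snippet below validates them from the process environment. `ScraperConfig` and `from_env` are hypothetical names, and the sketch assumes the `.env` file has already been loaded into the environment (for example with `load_dotenv()` from `python-dotenv`, if the project uses it). The optional proxy/login parameters are left out because their names are not given here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScraperConfig:
    """Settings read from the environment (hypothetical helper)."""

    openai_api_key: str
    browser_path: Optional[str]

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ScraperConfig":
        # The API key is required; fail early with a clear message.
        api_key = env.get("OPENAI_API_KEY", "").strip()
        if not api_key:
            raise RuntimeError("OPENAI_API_KEY is not set; add it to .env")
        # Assumption: an empty or missing BROWSER_PATH means
        # "use Playwright's bundled browser".
        browser_path = env.get("BROWSER_PATH", "").strip() or None
        return cls(openai_api_key=api_key, browser_path=browser_path)
```

Calling `ScraperConfig.from_env()` with no arguments reads `os.environ`; passing a plain dict makes the function easy to test.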
---
## 📘 Notebooks Included
| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |
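Running async code inside Jupyter is awkward because the notebook kernel already has an event loop running, so `asyncio.run()` raises an error there. The sketch below shows one common way to handle both cases. `run_coro` is a hypothetical helper, not a function from the notebooks, and it only imports `nest_asyncio` when a loop is already running.

```python
import asyncio
from typing import Any, Coroutine, TypeVar

T = TypeVar("T")


def run_coro(coro: Coroutine[Any, Any, T]) -> T:
    """Run a coroutine to completion from a script or a notebook cell."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): asyncio.run is enough.
        return asyncio.run(coro)

    # A loop is already running (e.g. Jupyter): patch it so that
    # run_until_complete can be called re-entrantly.
    import nest_asyncio

    nest_asyncio.apply(loop)
    return loop.run_until_complete(coro)
```

In recent Jupyter versions a top-level `await` in a cell also works, so a helper like this is mainly useful when the same code must run both as a script and in a notebook.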
---
## 🔍 Output Example
- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes
✅ *Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*
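For downstream steps such as export or ingestion, the four output parts listed above could be held in a small typed container. This is only a sketch under assumptions: the field names, the `AnalysisResult` class, and the idea that the model is asked to reply with a JSON object are all hypothetical and not confirmed by the notebooks.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class AnalysisResult:
    """Hypothetical container mirroring the four output parts."""

    summary: str = ""
    timeline: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


def _as_str_list(value: Any) -> List[str]:
    # Tolerate missing or malformed fields by falling back to an empty list.
    return [str(item) for item in value] if isinstance(value, list) else []


def parse_analysis(raw_json: str) -> AnalysisResult:
    """Parse a JSON object (assumed model reply) into an AnalysisResult."""
    data = json.loads(raw_json)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return AnalysisResult(
        summary=str(data.get("summary", "")),
        timeline=_as_str_list(data.get("timeline")),
        entities=_as_str_list(data.get("entities")),
        topics=_as_str_list(data.get("topics")),
    )
```

Keeping the parsing lenient means a partially formed reply still yields a usable object, while a non-object reply fails loudly.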
---
## πŸ™ Thanks
Huge thanks to Ed Donner for the amazing course and challenge inspiration!