# Community Contribution: Async Playwright-based OpenAI Scraper

This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright**, an alternative to Selenium.

Developed by: [lakovicb](https://github.com/lakovicb)

IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)

---

## Features

- Simulates human-like interactions (mouse movement, scrolling)
- GPT-based analysis using OpenAI's API
- Works inside **JupyterLab** using `nest_asyncio`
- Prometheus metrics for scraping observability
- Smart content caching via `diskcache`
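
The human-like interaction feature could look roughly like the sketch below. It is an illustration under stated assumptions, not the contributor's code: `plan_scroll`, `browse_like_human`, and `fetch_html` are hypothetical names, and the step sizes and pause ranges are arbitrary values. From Playwright's async API it relies only on `page.mouse.move`, `page.mouse.wheel`, and `page.wait_for_timeout`. Playwright is imported lazily inside `fetch_html`, so the planning helper can be exercised without a browser; a call such as `asyncio.run(fetch_html("https://openai.com"))` would drive the whole flow.

```python
import random


def plan_scroll(total_px, rng, min_step=200, max_step=600):
    """Split a scroll distance into irregular (delta_px, pause_ms) steps."""
    steps = []
    remaining = total_px
    while remaining > 0:
        # Never scroll past the requested distance on the last step.
        delta = min(remaining, rng.randint(min_step, max_step))
        pause_ms = rng.randint(150, 900)
        steps.append((delta, pause_ms))
        remaining -= delta
    return steps


async def browse_like_human(page, total_px=3000, seed=None):
    """Move the pointer and scroll a Playwright page with irregular timing."""
    rng = random.Random(seed)
    # A few pointer moves; steps > 1 makes Playwright emit intermediate
    # mousemove events instead of jumping straight to the target.
    for _ in range(3):
        await page.mouse.move(
            rng.randint(50, 1000), rng.randint(50, 600), steps=rng.randint(5, 20)
        )
        await page.wait_for_timeout(rng.randint(100, 400))
    # Scroll down in uneven chunks, pausing between them.
    for delta, pause_ms in plan_scroll(total_px, rng):
        await page.mouse.wheel(0, delta)
        await page.wait_for_timeout(pause_ms)


async def fetch_html(url, executable_path=None):
    """Open `url` in headless Chromium, interact with it, return the HTML."""
    # Third-party dependency, imported lazily (assumes `pip install playwright`).
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True, executable_path=executable_path
        )
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="domcontentloaded")
            await browse_like_human(page)
            return await page.content()
        finally:
            await browser.close()
```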

---

## How to Run

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

> Ensure [Playwright is installed and its browsers are downloaded](https://playwright.dev/python/docs/intro):

```bash
playwright install
```

### 2. Set environment variables in `.env`

```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```

Optional proxy and login parameters can also be defined here if needed.
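
One way to read these variables in Python is sketched below. `Settings` and `load_settings` are hypothetical names, and treating `BROWSER_PATH` as optional is an assumption; the notebooks may handle configuration differently. The sketch reads from a mapping (by default `os.environ`), so it works whether the values come from the shell or from a `.env` file loaded beforehand, for example with `load_dotenv()` from the `python-dotenv` package.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    browser_path: Optional[str] = None  # assumption: optional, falls back to Playwright's bundled browser


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate and collect the scraper's configuration from `env`."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or the environment"
        )
    # Empty or missing BROWSER_PATH is normalised to None.
    browser_path = env.get("BROWSER_PATH", "").strip() or None
    return Settings(openai_api_key=api_key, browser_path=browser_path)
```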

---

## Notebooks Included

| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes the async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Shows the output formatted for human reading |
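
Running async code inside Jupyter is awkward because the kernel already has an event loop running, so `asyncio.run` raises an error there. The helper below is a minimal sketch of how `nest_asyncio` can bridge that gap; `run_async` is a hypothetical name and the notebooks may do this differently. Outside a notebook it falls back to plain `asyncio.run`, and `nest_asyncio` is only imported when a loop is already running. Recent IPython versions also accept a top-level `await` in a cell, which can make the helper unnecessary.

```python
import asyncio


def run_async(coro):
    """Run a coroutine to completion from a script or from a notebook cell."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): let asyncio manage one.
        return asyncio.run(coro)
    # A loop is already running (e.g. Jupyter): patch it so that
    # run_until_complete can be called re-entrantly.
    import nest_asyncio  # third-party: pip install nest_asyncio

    nest_asyncio.apply(loop)
    return loop.run_until_complete(coro)
```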

---

## Output Example

- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes

*Can be extended with PDF export, a LangChain pipeline, or vector store ingestion.*
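
For extensions like these, it can help to hold the analysis in a typed structure rather than free text. The sketch below assumes the model is prompted to reply with a JSON object whose keys are `summary`, `timeline`, `entities`, and `topics`; that schema, and the names `Analysis` and `parse_analysis`, are illustrative assumptions, not something the notebooks are known to use.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Analysis:
    summary: str
    timeline: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)


def parse_analysis(raw: str) -> Analysis:
    """Parse a JSON reply (hypothetical schema) into an Analysis."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        raise ValueError("expected a JSON object with a string 'summary'")

    def as_list(key: str) -> List[str]:
        # Missing or malformed fields degrade to an empty list.
        value = data.get(key, [])
        return [str(item) for item in value] if isinstance(value, list) else []

    return Analysis(
        summary=data["summary"],
        timeline=as_list("timeline"),
        entities=as_list("entities"),
        topics=as_list("topics"),
    )
```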

---

## Thanks

Huge thanks to Ed Donner for the amazing course and challenge inspiration!