Spaces:
Sleeping
Sleeping
File size: 1,682 Bytes
5fdb69e |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 |
# π§ Community Contribution: Async Playwright-based OpenAI Scraper
This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright** β an alternative to Selenium.
Developed by: [lakovicb](https://github.com/lakovicb)
IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)
---
## π¦ Features
- π§ Simulates human-like interactions (mouse movement, scrolling)
- π§ GPT-based analysis using OpenAI's API
- π§ͺ Works inside **JupyterLab** using `nest_asyncio`
- π Prometheus metrics for scraping observability
- β‘ Smart content caching via `diskcache`
---
## π How to Run
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
> Ensure [Playwright is installed & browsers are downloaded](https://playwright.dev/python/docs/intro)
```bash
playwright install
```
### 2. Set environment variables in `.env`
```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```
You can also define optional proxy/login params if needed.
---
## π Notebooks Included
| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |
---
## π Output Example
- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes
β
*Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*
---
## π Thanks
Huge thanks to Ed Donner for the amazing course and challenge inspiration!
|