# 🧠 Community Contribution: Async Playwright-based OpenAI Scraper

This contribution presents a fully asynchronous, headless-browser-based scraper for [https://openai.com](https://openai.com) using **Playwright**, an alternative to Selenium.

Developed by: [lakovicb](https://github.com/lakovicb)  
IDE used: WingIDE Pro (Jupyter compatibility via `nest_asyncio`)

---

## 📦 Features

- 🧭 Simulates human-like interactions (mouse movement, scrolling)
- 🧠 GPT-based analysis using OpenAI's API
- 🧪 Works inside **JupyterLab** using `nest_asyncio`
- 📊 Prometheus metrics for scraping observability
- ⚡ Smart content caching via `diskcache`
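
As a rough illustration of the human-like interaction feature, the sketch below moves the mouse through a few random points and then scrolls in small steps, pausing a random interval between actions. It is not the notebook's actual code: `simulate_human`, its parameters, and the viewport size are hypothetical. The only assumption about `page` is that it behaves like a Playwright async `Page`, exposing `page.mouse.move` and `page.mouse.wheel`.

```python
import asyncio
import random


async def simulate_human(page, moves=5, scrolls=3, width=1280, height=720,
                         min_pause=0.1, max_pause=0.4, rng=None):
    """Hypothetical helper: random mouse moves and scrolling with pauses.

    `page` is assumed to expose `mouse.move(x, y, steps=...)` and
    `mouse.wheel(delta_x, delta_y)`, as Playwright's async Page does.
    """
    rng = rng or random.Random()

    for _ in range(moves):
        x = rng.randint(0, width - 1)
        y = rng.randint(0, height - 1)
        # `steps` asks for intermediate mouse positions instead of one jump.
        await page.mouse.move(x, y, steps=rng.randint(5, 15))
        await asyncio.sleep(rng.uniform(min_pause, max_pause))

    for _ in range(scrolls):
        # A positive vertical delta scrolls the page down.
        await page.mouse.wheel(0, rng.randint(200, 600))
        await asyncio.sleep(rng.uniform(min_pause, max_pause))
```

Inside a notebook cell this would be called as `await simulate_human(page)` once a page is open. Passing a seeded `random.Random` makes the sequence of moves reproducible for debugging.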

---

## 🚀 How to Run

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

> Make sure Playwright is installed and its browsers are downloaded (see the [Playwright installation guide](https://playwright.dev/python/docs/intro)):

```bash
playwright install
```

### 2. Set environment variables in `.env`

```env
OPENAI_API_KEY=your_openai_key
BROWSER_PATH=/usr/bin/chromium-browser
```

You can also define optional proxy/login params if needed.
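
One possible way to read these values in Python is sketched below. It assumes the `.env` file has already been loaded into the process environment (for example with `python-dotenv`, which is an assumption since `requirements.txt` is not shown here). The `load_settings` helper and its return keys are hypothetical, and treating `BROWSER_PATH` as optional is a guess, not documented behavior.

```python
import os


def load_settings(env=None):
    """Hypothetical helper: collect scraper settings from environment variables."""
    env = os.environ if env is None else env

    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # Fail early so a missing key is not discovered mid-scrape.
        raise RuntimeError("OPENAI_API_KEY is not set")

    return {
        "openai_api_key": api_key,
        # None signals that no custom browser path was provided.
        "browser_path": env.get("BROWSER_PATH") or None,
    }
```

Accepting a plain mapping as `env` keeps the function easy to exercise without touching the real environment.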

---

## 📘 Notebooks Included

| Notebook | Description |
|----------|-------------|
| `Playwright_Solution_JupyterAsync.ipynb` | Executes async scraper directly inside Jupyter |
| `Playwright_Solution_Showcase_Formatted.ipynb` | Nicely formatted output for human reading |

---

## 🔍 Output Example

- GPT-generated summary
- Timeline of updates
- Entities and projects mentioned
- Structured topics & themes

✅ *Can be extended with PDF export, LangChain pipeline, or vector store ingestion.*
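
For downstream use such as the extensions mentioned above, the four output items could be held in a small structured container. The sketch below is only one possible shape: the `AnalysisResult` class and its field names are hypothetical and do not come from the notebooks.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AnalysisResult:
    """Hypothetical container mirroring the four output items listed above."""
    summary: str                                    # GPT-generated summary
    timeline: list = field(default_factory=list)    # timeline of updates
    entities: list = field(default_factory=list)    # entities and projects mentioned
    topics: list = field(default_factory=list)      # structured topics and themes

    def to_json(self):
        """Serialize the result so it can be stored or passed to another tool."""
        return json.dumps(asdict(self), indent=2)
```

A result built this way can be written to disk with `to_json()` and reloaded with `json.loads`.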

---

## 🙏 Thanks

Huge thanks to Ed Donner for the amazing course and challenge inspiration!