---
license: gpl-3.0
title: YouTube RAG Assistant
sdk: docker
emoji: πŸŽ₯
colorFrom: blue
colorTo: red
short_description: Chat with YouTube video transcripts using RAG
pinned: false
app_port: 7860
---
# YouTube Assistant

## Problem Description

In the era of abundant video content on YouTube, users often struggle to efficiently extract specific information or insights from lengthy videos without watching them in their entirety. This challenge is particularly acute when dealing with educational content, tutorials, or informative videos where key points may be scattered throughout the video's duration.

The YouTube Assistant project addresses this problem by providing a Retrieval-Augmented Generation (RAG) application that allows users to interact with and query video transcripts directly. This solution enables users to quickly access relevant information from YouTube videos without the need to watch them completely, saving time and improving the efficiency of information retrieval from video content.

## Data

The YouTube Assistant utilizes data pulled in real-time using the YouTube Data API v3. This data is then processed and stored in two databases:

1. SQLite database: For structured data storage
2. Elasticsearch vector database: For efficient similarity searches on embedded text
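
The SQLite side might look roughly like this. This is a minimal sketch: the actual table layout and helper names live in `database.py` and may differ.

```python
import sqlite3

# Hypothetical table for structured video metadata; the real schema
# in database.py may use different names or extra columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    video_id      TEXT PRIMARY KEY,
    title         TEXT,
    author        TEXT,
    upload_date   TEXT,
    view_count    INTEGER,
    like_count    INTEGER,
    comment_count INTEGER
);
"""

def save_video(conn, meta):
    """Insert or update one video's metadata row."""
    conn.execute(
        "INSERT OR REPLACE INTO videos VALUES (:video_id, :title, :author, "
        ":upload_date, :view_count, :like_count, :comment_count)",
        meta,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # the app persists to data/sqlite.db
conn.executescript(SCHEMA)
save_video(conn, {
    "video_id": "abc123", "title": "Intro to RAG", "author": "ganesh3",
    "upload_date": "2024-01-01", "view_count": 100, "like_count": 10,
    "comment_count": 2,
})
row = conn.execute(
    "SELECT title FROM videos WHERE video_id = 'abc123'"
).fetchone()
```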

### Data Schema

The main columns in our data structure are:

```json
{
    "content": {"type": "text"},
    "video_id": {"type": "keyword"},
    "segment_id": {"type": "keyword"},
    "start_time": {"type": "float"},
    "duration": {"type": "float"},
    "title": {"type": "text"},
    "author": {"type": "keyword"},
    "upload_date": {"type": "date"},
    "view_count": {"type": "integer"},
    "like_count": {"type": "integer"},
    "comment_count": {"type": "integer"},
    "video_duration": {"type": "text"}
}
```

This schema allows for comprehensive storage of video metadata alongside the transcript content, enabling rich querying and analysis capabilities.
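
Creating the Elasticsearch index from this mapping might look like the sketch below. The index name and the embedding field (`content_vector`, with its dimension) are assumptions for illustration; the real setup lives in `elasticsearch_handler.py` and `config.yaml`.

```python
# Mapping mirroring the schema above, plus a dense_vector field for
# similarity search. The 384-dim size is an assumption and must match
# whatever embedding model the app actually uses.
MAPPINGS = {
    "properties": {
        "content": {"type": "text"},
        "video_id": {"type": "keyword"},
        "segment_id": {"type": "keyword"},
        "start_time": {"type": "float"},
        "duration": {"type": "float"},
        "title": {"type": "text"},
        "author": {"type": "keyword"},
        "upload_date": {"type": "date"},
        "view_count": {"type": "integer"},
        "like_count": {"type": "integer"},
        "comment_count": {"type": "integer"},
        "video_duration": {"type": "text"},
        "content_vector": {"type": "dense_vector", "dims": 384},
    }
}

def create_index(es, index_name="youtube_transcripts"):
    """Create the index if it does not exist.

    `es` is an elasticsearch.Elasticsearch client; the index name here
    is a hypothetical default.
    """
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name, mappings=MAPPINGS)
```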

## Functionality

The YouTube Assistant offers the following key features:

1. **Real-time Data Extraction**: Utilizes the YouTube Data API v3 to fetch video data and transcripts on-demand.

2. **Efficient Data Storage**: Stores structured data in SQLite and uses Elasticsearch for vector embeddings, allowing for fast retrieval and similarity searches.

3. **Interactive Querying**: Provides a chat interface where users can ask questions about the video transcripts that have been downloaded or extracted in real-time.

4. **Contextual Understanding**: Leverages RAG technology to understand the context of user queries and provide relevant information from the video transcripts.

5. **Metadata Analysis**: Allows users to query not just the content of the videos but also metadata such as view counts, likes, and upload dates.

6. **Time-stamped Responses**: Can provide information about specific segments of videos, including start times and durations.

By combining these features, the YouTube Assistant empowers users to efficiently extract insights and information from YouTube videos without the need to watch them in full, significantly enhancing the way people interact with and learn from video content.
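
As an illustration of the time-stamped responses feature, retrieved segments can be joined into an LLM context where each chunk carries its start time. The function names here are hypothetical; the real logic is in `rag.py`.

```python
def format_timestamp(seconds: float) -> str:
    """Render a start offset as M:SS for display alongside an answer."""
    m, s = divmod(int(seconds), 60)
    return f"{m}:{s:02d}"

def build_context(segments: list) -> str:
    """Join retrieved transcript segments, each tagged with its start
    time, into the context block passed to the LLM."""
    return "\n".join(
        f"[{format_timestamp(seg['start_time'])}] {seg['content']}"
        for seg in segments
    )

segments = [
    {"start_time": 65.0, "content": "RAG combines retrieval with generation."},
    {"start_time": 130.5, "content": "Embeddings are stored in Elasticsearch."},
]
context = build_context(segments)
```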

## Project Structure

The YouTube Assistant project is organized as follows:

```
youtube-rag-app/
β”œβ”€β”€ app/
β”‚   β”œβ”€β”€ home.py
β”‚   β”œβ”€β”€ pages/
β”‚   β”‚   β”œβ”€β”€ chat_interface.py
β”‚   β”‚   β”œβ”€β”€ data_ingestion.py
β”‚   β”‚   β”œβ”€β”€ evaluation.py
β”‚   β”‚   └── ground_truth.py
β”‚   β”œβ”€β”€ transcript_extractor.py
β”‚   β”œβ”€β”€ data_processor.py
β”‚   β”œβ”€β”€ elasticsearch_handler.py
β”‚   β”œβ”€β”€ database.py
β”‚   β”œβ”€β”€ rag.py
β”‚   β”œβ”€β”€ query_rewriter.py
β”‚   β”œβ”€β”€ evaluation.py
β”‚   └── utils.py
β”œβ”€β”€ data/
β”‚   └── sqlite.db
β”œβ”€β”€ config/
β”‚   └── config.yaml
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ Dockerfile
└── docker-compose.yml
```

### Directory and File Descriptions:

- `app/`: Contains the main application code
  - `home.py`: Entry point of the Streamlit application
  - `pages/`: Streamlit pages for chat, data ingestion, evaluation, and ground-truth generation
  - `transcript_extractor.py`: Manages YouTube transcript extraction
  - `data_processor.py`: Processes and prepares data for storage and analysis
  - `elasticsearch_handler.py`: Manages interactions with Elasticsearch
  - `database.py`: Handles SQLite database operations
  - `rag.py`: Implements the Retrieval-Augmented Generation logic
  - `query_rewriter.py`: Refines and optimizes user queries
  - `evaluation.py`: Contains evaluation metrics and functions
  - `utils.py`: Shared helper functions
- `data/`: Stores the SQLite database
- `config/`: Contains configuration files
- `requirements.txt`: Lists all Python dependencies
- `Dockerfile`: Defines the Docker image for the application
- `docker-compose.yml`: Orchestrates the application and its services

## Getting Started

```bash
git clone git@github.com:ganesh3/rag-youtube-assistant.git
cd rag-youtube-assistant
docker-compose build app
docker-compose up -d
```

You need Docker Desktop installed on your laptop or workstation (along with WSL2 on Windows).

## License
GPL v3

### Interface

I use Streamlit to ingest YouTube transcripts, query the transcripts using an LLM with RAG, generate ground truth, and evaluate against it.

### Ingestion

I ingest YouTube transcripts using the YouTube Data API v3 and the `youtube-transcript-api` package; the code is in `transcript_extractor.py` and runs inside the Streamlit app.
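
Before indexing, the short raw segments can be merged into larger chunks. This is a sketch under assumptions: the function name and 60-second chunk size are made up here, and `youtube-transcript-api` returns segments as dicts with `text`, `start`, and `duration` keys.

```python
def chunk_transcript(segments, max_seconds=60.0):
    """Merge raw transcript segments into chunks of at most max_seconds,
    keeping each chunk's start time and total duration."""
    chunks, current = [], None
    for seg in segments:
        if current is not None and current["duration"] + seg["duration"] <= max_seconds:
            # Segment still fits: append its text and extend the duration.
            current["content"] += " " + seg["text"]
            current["duration"] += seg["duration"]
        else:
            # Start a new chunk (first segment, or current chunk is full).
            if current is not None:
                chunks.append(current)
            current = {"content": seg["text"],
                       "start_time": seg["start"],
                       "duration": seg["duration"]}
    if current is not None:
        chunks.append(current)
    return chunks

raw = [
    {"text": "hello", "start": 0.0, "duration": 40.0},
    {"text": "world", "start": 40.0, "duration": 30.0},
]
chunks = chunk_transcript(raw)
```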

### Retrieval

"hit_rate":1, "mrr":1

### RAG Flow

I used the LLM-as-a-Judge metric to evaluate the quality of the RAG flow. Since this ran on my local, CPU-only machine, the number of evaluated records is small (12):

* RELEVANT - 12 (100%)
* PARTLY_RELEVANT - 0 (0%)
* NON_RELEVANT - 0 (0%)
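
The judge's free-form verdict can be mapped onto these three labels with a small helper like the one below. This is a hypothetical sketch; the actual judge prompt and parsing in the real code may differ.

```python
# Order matters: the other two labels contain "RELEVANT" as a substring,
# so plain "RELEVANT" must be checked last.
LABELS = ("PARTLY_RELEVANT", "NON_RELEVANT", "RELEVANT")

def parse_verdict(llm_output: str) -> str:
    """Return the first known label mentioned in the judge LLM's reply;
    fall back to NON_RELEVANT when none is found."""
    text = llm_output.upper().replace(" ", "_").replace("-", "_")
    for label in LABELS:
        if label in text:
            return label
    return "NON_RELEVANT"
```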

### Monitoring

I used Grafana to monitor the metrics, user feedback, evaluation results, and search performance.