Mr-Geo committed on
Commit 2c1f9fb · verified · 1 Parent(s): 3a4341d

Update README.md

Files changed (1)
  1. README.md +61 -6
README.md CHANGED
@@ -1,13 +1,68 @@
- ---
- title: BAS Website Chat
- emoji: πŸ†
  colorFrom: indigo
  colorTo: green
  sdk: gradio
  sdk_version: 5.14.0
  app_file: app.py
  pinned: false
- short_description: LLM RAG Web scraper. on Hugging Face Spaces
- ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
+ title: BAS Website AI
+ emoji: 🌍
  colorFrom: indigo
  colorTo: green
  sdk: gradio
  sdk_version: 5.14.0
  app_file: app.py
  pinned: false
+ short_description: LLM RAG Web scraper for the British Antarctic Survey on Hugging Face Spaces
 
+ ## Overview
+ This project implements a RAG (Retrieval-Augmented Generation) system that allows users to chat with website content using LLMs. It consists of two main components:
+
+ 1. **Web Scraper**: A robust scraping system that:
+ - Crawls websites systematically with error handling and retry logic
+ - Processes and chunks content for optimal retrieval
+ - Stores data in ChromaDB with embeddings
+ - Supports checkpoint-based resumption of scraping
+
+ 2. **Chat Interface**: A Gradio-based chat application that:
+ - Uses Groq's LLM API for responses
+ - Implements semantic search with ChromaDB
+ - Features cross-encoder reranking for improved result relevance
+ - Provides source citations for responses
+
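The "processes and chunks content" step in the scraper description can be illustrated with a minimal sketch. This is not the code from `scraper.py`: the function name `chunk_text`, the word-based splitting, and the default sizes are all assumptions chosen for illustration. It splits text into fixed-size windows that overlap slightly, so a sentence cut at a chunk boundary still appears whole in one of the neighbouring chunks.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration, not the project's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each returned chunk would then be embedded and stored as one record in the vector database. Splitting on words is a simplification; a real pipeline might split on tokens, sentences, or HTML structure instead.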
+ ## Key Features
+ - 🔄 Resumable web scraping with progress tracking
+ - 💾 Persistent vector storage using ChromaDB
+ - 🔍 Advanced retrieval with semantic search and reranking
+ - 🤖 Integration with Groq's LLM API
+ - 📱 User-friendly chat interface
+ - 🔗 Source attribution for responses
+
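The "semantic search and reranking" feature follows a common two-stage pattern: a cheap similarity search selects a handful of candidates, then a more expensive scorer reorders them. The sketch below shows only the shape of that pipeline using the standard library. The bag-of-words cosine stands in for embedding search in ChromaDB, and `phrase_scorer` stands in for a cross-encoder; both are toy assumptions, and every function name here is hypothetical.

```python
import math
from collections import Counter
from typing import Callable


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (toy stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def phrase_scorer(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: rewards an exact phrase match."""
    return 1.0 if query.lower() in doc.lower() else 0.0


def retrieve_then_rerank(
    query: str,
    docs: list[str],
    pair_scorer: Callable[[str, str], float],
    k_retrieve: int = 3,
    k_final: int = 2,
) -> list[str]:
    # Stage 1: score every document cheaply and keep the top k_retrieve.
    candidates = sorted(docs, key=lambda d: bow_cosine(query, d), reverse=True)
    candidates = candidates[:k_retrieve]
    # Stage 2: rescore only the candidates with the pairwise scorer.
    reranked = sorted(candidates, key=lambda d: pair_scorer(query, d), reverse=True)
    return reranked[:k_final]
```

The design point is that the second-stage scorer sees the query and document together, so it can catch relevance signals (here, word order) that independent vectors miss, while only running on a few candidates.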
+ ## Setup
+ 1. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. Set up environment variables:
+ ```
+ GROQ_API_KEY=your_groq_api_key
+ HF_TOKEN=your_huggingface_token
+ ```
+
+ 3. Run the scraper:
+ ```bash
+ python scraper_app.py
+ ```
+ - Use the `--rescrape` flag for a fresh start
+
+ 4. Launch the chat interface:
+ ```bash
+ python app.py
+ ```
+
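Since both the scraper and the chat app depend on the two environment variables from step 2, it can help to fail early with a clear message when one is missing. The sketch below is one possible way to do that. How `app.py` actually reads these variables is not shown in this commit, so `load_settings` and `REQUIRED_VARS` are hypothetical names.

```python
import os
from collections.abc import Mapping

# Variable names taken from the Setup section; the helper itself is an assumption.
REQUIRED_VARS = ("GROQ_API_KEY", "HF_TOKEN")


def load_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the required settings, or raise if any is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` at startup reads from the real environment; passing a plain dictionary instead makes the check easy to test.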
+ ## Tools & Technologies
+ - ChromaDB for vector storage
+ - Sentence Transformers for embeddings
+ - Groq for LLM inference
+ - Gradio for web interface
+ - BeautifulSoup4 for web scraping
+
+ ## Project Structure
+ - `app.py`: Main chat application
+ - `scraper.py`: Web scraping logic
+ - `scraper_app.py`: Scraper management
+ - `chroma_explorer.ipynb`: Database exploration notebook