README.md · Mr-Geo/BAS_Website

title: BAS Website AI
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
short_description: LLM RAG Web scraper on Hugging Face Spaces

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Overview

This project implements a RAG (Retrieval-Augmented Generation) system that allows users to chat with website content using LLMs. It consists of two main components:

Web Scraper: A robust scraping system that:
- Crawls websites systematically with error handling and retry logic
- Processes and chunks content for optimal retrieval
- Stores data in ChromaDB with embeddings
- Supports checkpoint-based resumption of scraping

Chat Interface: A Gradio-based chat application that:
- Uses Groq's LLM API for responses
- Implements semantic search with ChromaDB
- Features cross-encoder reranking for improved result relevance
- Provides source citations for responses
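
As an illustration of the "processes and chunks content" step, here is a minimal sketch of word-based chunking with overlap. The function name `chunk_text` and the `chunk_size` and `overlap` defaults are assumptions made for this example; the project's scraper may split content differently (for instance by tokens or sentences).

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    Hypothetical helper for illustration, not the project's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=1):
        print(chunk)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.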

Key Features

🔄 Resumable web scraping with progress tracking
💾 Persistent vector storage using ChromaDB
🔍 Advanced retrieval with semantic search and reranking
🤖 Integration with Groq's LLM API
📱 User-friendly chat interface
🔗 Source attribution for responses
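
The "semantic search and reranking" feature follows a two-stage pattern: a cheap first pass selects candidates, then a more expensive scorer reorders them. The sketch below shows only the shape of that pipeline using the standard library. Bag-of-words cosine similarity stands in for the embedding search that ChromaDB would perform, and a toy bigram scorer stands in for a cross-encoder. All function names here (`bow_vector`, `cosine`, `retrieve`, `rerank`, `phrase_score`) are hypothetical.

```python
import math
from collections import Counter
from typing import Callable


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list[str], k: int) -> list[str]:
    """Stage 1: return the k documents most similar to the query."""
    qv = bow_vector(query)
    return sorted(docs, key=lambda d: cosine(qv, bow_vector(d)), reverse=True)[:k]


def rerank(query: str, candidates: list[str],
           pair_score: Callable[[str, str], float], top_n: int) -> list[str]:
    """Stage 2: reorder candidates with a scorer that sees query and document together."""
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)[:top_n]


def phrase_score(query: str, doc: str) -> float:
    """Toy pair scorer: counts query bigrams that appear in order in the document."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return float(sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams))


if __name__ == "__main__":
    docs = [
        "ice core drilling in antarctica",
        "antarctica core ice",
        "penguins eat fish",
    ]
    candidates = retrieve("ice core", docs, k=2)
    print(candidates)
    print(rerank("ice core", candidates, phrase_score, top_n=1))
```

In this example the first stage ranks the shorter document higher, while the pair scorer promotes the one that contains the query phrase in order, which is the kind of correction a reranking stage is meant to provide.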

Setup

Install dependencies:
```
pip install -r requirements.txt
```

Set up environment variables:

GROQ_API_KEY=your_groq_api_key
HF_TOKEN=your_huggingface_token
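
A small sketch of how an application could read these variables at startup and fail early with a clear message. It assumes both variables are required, which this README does not state explicitly; `load_settings` and `REQUIRED_VARS` are hypothetical names.

```python
import os

# Variable names taken from the setup step above; treating both as required is an assumption.
REQUIRED_VARS = ("GROQ_API_KEY", "HF_TOKEN")


def load_settings(env=None) -> dict[str, str]:
    """Return the required settings, or raise if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


if __name__ == "__main__":
    try:
        settings = load_settings()
        print("Loaded:", ", ".join(settings))
    except RuntimeError as err:
        print(err)
```

On Hugging Face Spaces, secrets configured in the Space settings are exposed to the app as environment variables, so the same lookup works locally and when deployed.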

Run the scraper:
```
python scraper_app.py
```
- Use the --rescrape flag for a fresh start

Launch the chat interface:
```
python app.py
```
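
For readers curious how a `--rescrape` flag and checkpoint-based resumption can fit together, here is a minimal sketch. It is not the code in scraper_app.py: the checkpoint file format, the state keys `visited` and `queue`, and the helper names are all assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line; --rescrape is the only flag named in this README."""
    parser = argparse.ArgumentParser(description="Run the website scraper")
    parser.add_argument("--rescrape", action="store_true",
                        help="ignore saved progress and start fresh")
    return parser.parse_args(argv)


def load_checkpoint(path: Path, rescrape: bool) -> dict:
    """Return saved crawl state, or an empty state on a fresh start."""
    if rescrape or not path.exists():
        return {"visited": [], "queue": []}  # hypothetical state layout
    return json.loads(path.read_text(encoding="utf-8"))


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temporary file, then swap it in, so a crash cannot leave a half-written file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state), encoding="utf-8")
    tmp.replace(path)


if __name__ == "__main__":
    args = parse_args()
    checkpoint_path = Path("checkpoint.json")  # hypothetical file name
    state = load_checkpoint(checkpoint_path, args.rescrape)
    print(f"Resuming with {len(state['visited'])} visited pages")
    save_checkpoint(checkpoint_path, state)
```

Saving the checkpoint after each page (or each batch of pages) bounds how much work is repeated when a long crawl is interrupted.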

Tools & Technologies

ChromaDB for vector storage
Sentence Transformers for embeddings
Groq for LLM inference
Gradio for web interface
BeautifulSoup4 for web scraping

Project Structure

app.py: Main chat application
scraper.py: Web scraping logic
scraper_app.py: Scraper management
chroma_explorer.ipynb: Database exploration notebook