metadata
title: BAS Website AI
emoji: π
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
short_description: LLM RAG Web scraper on Hugging Face Spaces
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Overview
This project implements a RAG (Retrieval-Augmented Generation) system that allows users to chat with website content using LLMs. It consists of two main components:
Web Scraper: A robust scraping system that:
- Crawls websites systematically with error handling and retry logic
- Processes and chunks content for optimal retrieval
- Stores data in ChromaDB with embeddings
- Supports checkpoint-based resumption of scraping
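Two of the scraper steps above, retrying a failed fetch and chunking content, can be sketched as follows. This is a minimal illustration, not the code in scraper.py: the function names, the word-based chunk size, the overlap, and the exponential backoff schedule are all assumptions made for the example.

```python
import time
from typing import Callable


def fetch_with_retry(fetch: Callable[[str], str], url: str,
                     max_attempts: int = 3, base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch(url), retrying on any exception with exponential backoff.

    `fetch` is a hypothetical callable that downloads a page and returns text.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Overlapping chunks keep a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.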
Chat Interface: A Gradio-based chat application that:
- Uses Groq's LLM API for responses
- Implements semantic search with ChromaDB
- Features cross-encoder reranking for improved result relevance
- Provides source citations for responses
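The retrieval flow above (semantic search, then reranking, then citations) follows a two-stage pattern that can be sketched with the standard library alone. Everything here is an assumption for illustration: in the app the first stage would be a ChromaDB query and the second a cross-encoder model, whereas this sketch uses in-memory vectors and a toy word-overlap scorer as a stand-in.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve_then_rerank(query: str, query_vec: Sequence[float], docs: list[dict],
                         score_pair: Callable[[str, str], float] = overlap_score,
                         k_retrieve: int = 10, k_final: int = 3) -> list[dict]:
    """Stage 1: top-k by embedding similarity. Stage 2: reorder by pair score.

    Each doc is a dict with hypothetical keys 'text', 'source' and 'vector'.
    """
    candidates = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]),
                        reverse=True)[:k_retrieve]
    reranked = sorted(candidates, key=lambda d: score_pair(query, d["text"]),
                      reverse=True)
    return reranked[:k_final]


def format_citations(results: list[dict]) -> list[str]:
    """Return the unique source URLs of the results, in ranked order."""
    seen: list[str] = []
    for doc in results:
        if doc["source"] not in seen:
            seen.append(doc["source"])
    return seen
```

The first stage is cheap and casts a wide net; the second stage scores each query and passage together, which is slower but usually orders the short list more accurately.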
Key Features
- Resumable web scraping with progress tracking
- Persistent vector storage using ChromaDB
- Advanced retrieval with semantic search and reranking
- Integration with Groq's LLM API
- User-friendly chat interface
- Source attribution for responses
Setup
Install dependencies:
pip install -r requirements.txt
Set up environment variables:
GROQ_API_KEY=your_groq_api_key
HF_TOKEN=your_huggingface_token
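A startup check along these lines can fail early with a clear message when either variable is missing. It is only a sketch: the `load_settings` helper is hypothetical and is not part of app.py.

```python
import os
from typing import Mapping, Optional


def load_settings(environ: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the required settings, or raise if any is unset or empty."""
    env = os.environ if environ is None else environ
    required = ("GROQ_API_KEY", "HF_TOKEN")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}
```

Passing a plain dict as `environ` makes the check easy to exercise in tests without touching the real environment.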
Run the scraper:
python scraper_app.py
- Use the --rescrape flag for a fresh start
- Use
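One way a --rescrape flag can interact with checkpoint-based resumption is sketched below. Only the flag name comes from this README; the JSON checkpoint format, the file layout, and the helper names are assumptions and may differ from what scraper_app.py actually does.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None) -> argparse.Namespace:
    """Parse command-line options; --rescrape is a simple on/off switch."""
    parser = argparse.ArgumentParser(description="Run the website scraper")
    parser.add_argument("--rescrape", action="store_true",
                        help="ignore any saved checkpoint and start from scratch")
    return parser.parse_args(argv)


def load_checkpoint(path: Path, rescrape: bool) -> set[str]:
    """Return URLs already scraped; empty on a fresh start or first run."""
    if rescrape or not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_checkpoint(path: Path, visited: set[str]) -> None:
    """Persist the visited URLs so an interrupted run can resume later."""
    path.write_text(json.dumps(sorted(visited)), encoding="utf-8")
```

Saving the checkpoint after each page (or each small batch) bounds how much work is repeated when a run is interrupted.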
Launch the chat interface:
python app.py
Tools & Technologies
- ChromaDB for vector storage
- Sentence Transformers for embeddings
- Groq for LLM inference
- Gradio for web interface
- BeautifulSoup4 for web scraping
Project Structure
- app.py: Main chat application
- scraper.py: Web scraping logic
- scraper_app.py: Scraper management
- chroma_explorer.ipynb: Database exploration notebook