title: BAS Website AI
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
short_description: LLM RAG Web scraper on Hugging Face Spaces

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Overview

This project implements a RAG (Retrieval-Augmented Generation) system that allows users to chat with website content using LLMs. It consists of two main components:

  1. Web Scraper: A robust scraping system that:

    • Crawls websites systematically with error handling and retry logic
    • Processes and chunks content for optimal retrieval
    • Stores data in ChromaDB with embeddings
    • Supports checkpoint-based resumption of scraping
  2. Chat Interface: A Gradio-based chat application that:

    • Uses Groq's LLM API for responses
    • Implements semantic search with ChromaDB
    • Features cross-encoder reranking for improved result relevance
    • Provides source citations for responses
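The retrieve, rerank, and cite flow described for the chat interface can be sketched roughly as follows. This is an illustration, not the code in app.py: `Passage`, `rerank`, `format_sources`, and `overlap_scorer` are hypothetical names, and the toy word-overlap scorer only stands in for a real cross-encoder (with Sentence Transformers, a `CrossEncoder` model's `predict` method could be passed as `score_pairs` instead).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Passage:
    text: str
    url: str  # source page the chunk was scraped from (hypothetical field)


def rerank(
    query: str,
    candidates: Sequence[Passage],
    score_pairs: Callable[[List[Tuple[str, str]]], Sequence[float]],
    top_k: int = 3,
) -> List[Passage]:
    """Re-order retrieved passages by a pairwise (query, passage) score."""
    if not candidates:
        return []
    pairs = [(query, p.text) for p in candidates]
    scores = score_pairs(pairs)
    # Highest score first; ties keep their original retrieval order.
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
    return [p for _, p in ranked[:top_k]]


def format_sources(passages: Sequence[Passage]) -> str:
    """Build a numbered, de-duplicated source list to append to an answer."""
    seen: List[str] = []
    for p in passages:
        if p.url not in seen:
            seen.append(p.url)
    return "\n".join(f"[{i}] {url}" for i, url in enumerate(seen, start=1))


def overlap_scorer(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in scorer: counts query words that appear in the passage."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]
```

Keeping the scorer as a plain callable means the ranking and citation logic can be exercised without downloading a model.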

Key Features

  • 🔄 Resumable web scraping with progress tracking
  • 💾 Persistent vector storage using ChromaDB
  • 🔍 Advanced retrieval with semantic search and reranking
  • 🤖 Integration with Groq's LLM API
  • 📱 User-friendly chat interface
  • 🔗 Source attribution for responses
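One way resumable scraping could work is to record each finished URL in a small checkpoint file and skip those URLs on the next run. The sketch below assumes a JSON checkpoint and uses hypothetical names (`load_checkpoint`, `save_checkpoint`, `crawl`); the actual checkpoint format used by the scraper is not documented here.

```python
import json
from pathlib import Path
from typing import Callable, Iterable, List, Set


def load_checkpoint(path: Path) -> Set[str]:
    """Return the set of URLs already processed, or an empty set."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_checkpoint(path: Path, done: Set[str]) -> None:
    """Write to a temp file, then rename, so a crash cannot truncate the file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(sorted(done)), encoding="utf-8")
    tmp.replace(path)


def crawl(
    urls: Iterable[str],
    process: Callable[[str], object],
    checkpoint: Path,
    rescrape: bool = False,
) -> List[str]:
    """Process each URL once, persisting progress after every page."""
    # rescrape=True ignores earlier progress, mirroring a fresh start.
    done = set() if rescrape else load_checkpoint(checkpoint)
    processed: List[str] = []
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        process(url)  # may raise; earlier pages are already checkpointed
        done.add(url)
        processed.append(url)
        save_checkpoint(checkpoint, done)
    return processed
```

Saving after every page costs some I/O but means an interrupted run loses at most the page in flight.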

Setup

  1. Install dependencies:

    pip install -r requirements.txt
    
  2. Set up environment variables:

    GROQ_API_KEY=your_groq_api_key
    HF_TOKEN=your_huggingface_token
    
  3. Run the scraper:

    python scraper_app.py
    
    • Use the --rescrape flag for a fresh start
  4. Launch the chat interface:

    python app.py
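Before launching either script, it can help to confirm that the two variables from step 2 are actually set. The following is a minimal sketch of such a check; `missing_env_vars` and `require_env` are hypothetical helpers and are not part of this repository.

```python
import os
from typing import List, Mapping

# Variable names taken from the setup step above.
REQUIRED_VARS = ("GROQ_API_KEY", "HF_TOKEN")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def require_env(env: Mapping[str, str] = os.environ) -> None:
    """Raise a readable error if any required variable is missing."""
    missing = missing_env_vars(env)
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
        )
```

Calling `require_env()` at the top of a script fails fast with a clear message instead of surfacing later as an authentication error.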
    

Tools & Technologies

  • ChromaDB for vector storage
  • Sentence Transformers for embeddings
  • Groq for LLM inference
  • Gradio for web interface
  • BeautifulSoup4 for web scraping
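Before scraped text is embedded and stored, it is split into chunks. A minimal sketch of overlapping, word-based chunking is shown below. The function name `chunk_text`, the default sizes, and the choice to split on whitespace are all assumptions for illustration; the project's own chunking strategy may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some words twice.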

Project Structure

  • app.py: Main chat application
  • scraper.py: Web scraping logic
  • scraper_app.py: Scraper management
  • chroma_explorer.ipynb: Database exploration notebook
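
The overview mentions that the scraper retries failed requests. As a rough illustration of that idea (not the code in scraper.py), a retry wrapper with exponential backoff might look like this. `with_retries` and its parameters are hypothetical, and the `sleep` argument is injectable only so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A real scraper would likely retry only on network or server errors rather than on every exception.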