Upload 29 files
- README.md +116 -10
- app.yaml +25 -0
- data/documents/.gitkeep +0 -0
- data/documents/test_document.txt +25 -0
- data/processed/.gitkeep +0 -0
- design/chat_interface.md +256 -0
- design/document_processing.md +170 -0
- design/rag_architecture.md +197 -0
- prepare_deployment.sh +37 -0
- requirements-minimal.txt +21 -0
- requirements-ultra-light.txt +7 -0
- requirements.txt +25 -1
- research/norwegian_llm_research.md +81 -0
- src/api/__init__.py +3 -0
- src/api/config.py +61 -0
- src/api/huggingface_api.py +213 -0
- src/document_processing/__init__.py +3 -0
- src/document_processing/chunker.py +262 -0
- src/document_processing/extractor.py +167 -0
- src/document_processing/processor.py +306 -0
- src/main.py +60 -0
- src/project_structure.md +79 -0
- src/rag/__init__.py +3 -0
- src/rag/generator.py +87 -0
- src/rag/retriever.py +163 -0
- src/web/__init__.py +3 -0
- src/web/app.py +301 -0
- src/web/embed.py +211 -0
- todo.md +26 -0
README.md
CHANGED
@@ -1,13 +1,119 @@
---
-
-
-
-
sdk: gradio
- sdk_version:
- app_file:
- pinned:
license: mit
- ---
-
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
# Norwegian RAG Chatbot

A Retrieval-Augmented Generation (RAG) chatbot with strong Norwegian language support, built on Hugging Face's Inference API.

## Features

- **Norwegian Language Support**: Leverages state-of-the-art Norwegian language models such as NorMistral, Viking, and NorskGPT
- **Document Processing**: Upload and process documents in various formats (PDF, TXT, HTML)
- **RAG Implementation**: Retrieves relevant context from documents to generate accurate responses
- **Embeddable Interface**: Easily embed the chatbot in any website using an iframe or a JavaScript widget
- **Lightweight Architecture**: Uses Hugging Face's Inference API instead of running models locally

## Architecture

This chatbot uses a lightweight architecture that leverages Hugging Face's hosted models:

1. **Document Processing**: Documents are processed locally, extracting text and splitting it into chunks
2. **Embedding Generation**: Document chunks are embedded via Hugging Face's Inference API
3. **Retrieval**: When a query is received, the most relevant document chunks are retrieved
4. **Response Generation**: The LLM generates a response based on the retrieved context (see the sketch below)
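
As a rough illustration only (not shipped in the repository), these four steps can be wired together with the `HuggingFaceAPI` client and `create_rag_prompt` helper from `src/api/huggingface_api.py`; the `chunks` and `chunk_embeddings` inputs are assumed outputs of the document pipeline:

```python
# Minimal sketch of the four-step RAG flow (assumes the repo's API client
# and precomputed `chunks` / `chunk_embeddings` from the document pipeline).
import numpy as np
from src.api.huggingface_api import HuggingFaceAPI, create_rag_prompt

api = HuggingFaceAPI()  # reads HF_API_KEY from the environment

def answer(query: str, chunks: list[str], chunk_embeddings: np.ndarray) -> str:
    # 2. Embed the query with the same model used for the documents
    q = np.array(api.generate_embeddings(query)[0])
    # 3. Retrieve the most similar chunks by cosine similarity
    sims = chunk_embeddings @ q / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q) + 1e-10
    )
    top = [chunks[i] for i in np.argsort(-sims)[:5]]
    # 4. Generate a grounded answer from the retrieved context
    return api.generate_text(create_rag_prompt(query, top))
```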
## Getting Started

### Prerequisites

- Python 3.10+
- A Hugging Face account (for API access)

### Installation

1. Clone the repository:
```bash
git clone https://huggingface.co/spaces/username/norwegian-rag-chatbot
cd norwegian-rag-chatbot
```

2. Install dependencies:
```bash
pip install -r requirements-ultra-light.txt
```

3. Set up your Hugging Face API key:
```bash
export HF_API_KEY="your_api_key_here"
```

### Running the Chatbot

```bash
python src/main.py
```

The chatbot will be available at http://localhost:7860

## Usage

### Chat Interface

The main chat interface allows you to:
- Ask questions in Norwegian
- Receive responses based on your uploaded documents
- Adjust temperature and other settings

### Document Upload

You can upload documents to provide context for the chatbot:
- Supported formats: PDF, TXT, HTML
- Documents are automatically processed and indexed
- The chatbot will use these documents to provide more accurate responses

### Embedding

You can embed the chatbot in your website using:
- iFrame embedding
- JavaScript widget
- Direct link

## Deployment

The chatbot is designed to be deployed to Hugging Face Spaces:

1. Create a new Space on Hugging Face
2. Upload the code to the Space
3. Set the HF_API_KEY secret in the Space settings
4. The Space will automatically build and deploy the chatbot

## Models

The chatbot can use various Norwegian language models:

- **NorMistral-7b-scratch**: A large Norwegian language model pretrained from scratch
- **Viking 7B**: A multilingual model for Nordic languages
- **NorskGPT**: A Norwegian language model based on Mistral or Llama 2

For embeddings, it uses:
- **NbAiLab/nb-sbert-base**: A Norwegian sentence embedding model

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgements

- [Hugging Face](https://huggingface.co/) for hosting the models and providing the Inference API
- [Gradio](https://gradio.app/) for the web interface framework
- The creators of the Norwegian language models used in this project

---

name: norwegian-rag-chatbot
title: Norwegian RAG Chatbot
emoji: 🇳🇴
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.0.0
app_file: src/main.py
pinned: true
license: mit
app.yaml
ADDED
@@ -0,0 +1,25 @@
sdk:
  base_image: python:3.10
  build_commands:
    - pip install -r requirements-ultra-light.txt
  python_packages:
    - gradio>=4.0.0
    - huggingface_hub>=0.19.0
    - requests>=2.31.0
    - numpy>=1.24.0
    - PyPDF2>=3.0.0
    - beautifulsoup4>=4.12.0

app:
  title: Norwegian RAG Chatbot
  emoji: 🇳🇴
  colorPrimary: "#00205B"
  colorSecondary: "#EF2B2D"
  pinned: true
  sdk: gradio
  python_version: "3.10"
  suggested_hardware: cpu-basic
  models:
    - norallm/normistral-7b-scratch
    - NbAiLab/nb-sbert-base
  spaces_server_url: https://api-inference.huggingface.co/models/
data/documents/.gitkeep
ADDED
File without changes
data/documents/test_document.txt
ADDED
@@ -0,0 +1,25 @@
# Norsk historie

Norge har en rik og fascinerende historie som strekker seg tilbake til vikingtiden. Vikingene var kjent for sine sjøreiser, handel og plyndring i store deler av Europa fra slutten av 700-tallet til midten av 1000-tallet.

## Middelalderen

I 1030 døde Olav Haraldsson (senere kjent som Olav den hellige) i slaget ved Stiklestad. Hans død markerte begynnelsen på kristendommens endelige gjennombrudd i Norge.

Norge ble forent til ett rike under Harald Hårfagre på 800-tallet. Etter vikingtiden fulgte en periode med borgerkrig før landet ble stabilisert under Håkon Håkonsson på 1200-tallet.

## Union med Danmark

Fra 1380 til 1814 var Norge i union med Danmark, en periode kjent som "dansketiden". Under denne perioden ble dansk det offisielle språket i administrasjon og litteratur, noe som hadde stor innflytelse på det norske språket.

## Grunnloven og union med Sverige

I 1814 fikk Norge sin egen grunnlov, signert på Eidsvoll 17. mai. Samme år ble Norge tvunget inn i en union med Sverige, som varte frem til 1905.

## Moderne Norge

Norge ble okkupert av Nazi-Tyskland under andre verdenskrig fra 1940 til 1945. Etter krigen opplevde landet rask økonomisk vekst.

Oppdagelsen av olje i Nordsjøen på slutten av 1960-tallet forvandlet Norge til en av verdens rikeste nasjoner per innbygger.

I dag er Norge kjent for sin velferdsstat, naturskjønnhet og høy levestandard.
data/processed/.gitkeep
ADDED
File without changes
design/chat_interface.md
ADDED
@@ -0,0 +1,256 @@
# Chat Interface Design

This document outlines the design for the chat interface of our Norwegian RAG-based chatbot. The interface will be implemented using Gradio and deployed on Hugging Face Spaces.

## Interface Requirements

### Functional Requirements

1. **Chat Interaction**:
   - Text input field for user queries
   - Response display area for chatbot answers
   - Support for multi-turn conversations
   - Message history display

2. **Document Management**:
   - Document upload functionality
   - Document list display
   - Status indicators for processing

3. **Configuration Options**:
   - Model selection (if multiple models are supported)
   - Language selection (Norwegian/English toggle)
   - Advanced parameter adjustment (optional)

4. **Embedding Functionality**:
   - Code snippet generation for embedding
   - Preview of the embedded widget
   - Copy-to-clipboard functionality

### Non-Functional Requirements

1. **Responsiveness**:
   - Mobile-friendly design
   - Adaptive layout for different screen sizes

2. **Performance**:
   - Efficient loading times
   - Progress indicators for long operations
   - Streaming responses for a better user experience

3. **Accessibility**:
   - WCAG 2.1 compliance
   - Keyboard navigation support
   - Screen reader compatibility

4. **Multilingual Support**:
   - Norwegian as the primary language
   - English as the secondary language
   - Language detection and switching

## UI Design

### Main Chat Interface

```
┌─────────────────────────────────────────────────────────────┐
│ Norwegian RAG Chatbot                            [🇳🇴/🇬🇧]    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                                                     │   │
│  │              Chat History Display                   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ Bot: Hei! Hvordan kan jeg hjelpe deg i dag? │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ User: Fortell meg om norsk historie.        │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ Bot: Norsk historie strekker seg...         │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Type your message...                       [Send]   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Clear Chat] [Settings] [Upload Documents] [Embed]         │
└─────────────────────────────────────────────────────────────┘
```

### Document Upload Interface

```
┌─────────────────────────────────────────────────────────────┐
│ Document Management                               [Close]   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  [Upload New Document]                                      │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                  Document List                      │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ norsk_historie.pdf                 [Remove] │   │   │
│  │  │ Status: Processed ✓                         │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ vikinger.docx                      [Remove] │   │   │
│  │  │ Status: Processing... 75%                   │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Process All] [Remove All]                                 │
└─────────────────────────────────────────────────────────────┘
```

### Embed Code Interface

```
┌─────────────────────────────────────────────────────────────┐
│ Embed Chatbot                                     [Close]   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Embed Code (iFrame)                                 │   │
│  │                                                     │   │
│  │ <iframe src="https://huggingface.co/spaces/...      │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Copy to Clipboard]                                        │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Embed Code (JavaScript Widget)                      │   │
│  │                                                     │   │
│  │ <script src="https://huggingface.co/spaces/...      │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Copy to Clipboard]                                        │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Preview                                             │   │
│  │                                                     │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```

## Implementation with Gradio

Gradio is an ideal choice for implementing this interface due to its simplicity, Python integration, and native support on Hugging Face Spaces.

### Core Components

1. **Chat Interface**:
```python
with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Message")
    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # RAG processing logic here
        bot_message = get_rag_response(message)
        chat_history.append((message, bot_message))
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])
    clear.click(lambda: None, None, chatbot, queue=False)
```

2. **Document Upload**:
```python
with gr.Tab("Upload Documents"):
    file_output = gr.File()
    upload_button = gr.UploadButton("Click to Upload a File", file_types=[".pdf", ".docx", ".txt"])

    def upload_file(file):
        # Document processing logic here
        process_document(file.name)
        return file.name

    upload_button.upload(upload_file, upload_button, file_output)
```

3. **Embedding Code Generation**:
```python
with gr.Tab("Embed"):
    iframe_code = gr.Textbox(label="iFrame Embed Code")
    js_code = gr.Textbox(label="JavaScript Widget Code")

    def generate_embed_code():
        iframe = f'<iframe src="{SPACE_URL}" width="100%" height="500px"></iframe>'
        js = f'<script src="{SPACE_URL}/widget.js"></script>'
        return iframe, js

    embed_button = gr.Button("Generate Embed Code")
    embed_button.click(generate_embed_code, None, [iframe_code, js_code])
```

## Norwegian Language Support

1. **Interface Localization**:
   - Implement language-switching functionality
   - Store UI text in language-specific dictionaries
   - Apply translations based on the selected language (see the sketch after this list)

2. **Input Processing**:
   - Handle Norwegian special characters correctly
   - Implement Norwegian-specific text normalization

3. **Response Generation**:
   - Ensure proper formatting of Norwegian text
   - Handle Norwegian grammar and syntax correctly
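
A minimal sketch of the dictionary-based localization described above; `UI_TEXT` and its keys are illustrative placeholders, not existing project code:

```python
# Localization sketch: UI strings keyed by language code.
# UI_TEXT and the keys below are illustrative assumptions.
UI_TEXT = {
    "no": {"send": "Send", "clear": "Tøm samtale", "placeholder": "Skriv meldingen din..."},
    "en": {"send": "Send", "clear": "Clear chat", "placeholder": "Type your message..."},
}

def t(key: str, lang: str = "no") -> str:
    """Look up a UI string, falling back to Norwegian, then to the key itself."""
    return UI_TEXT.get(lang, UI_TEXT["no"]).get(key, key)
```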
## Responsive Design

1. **CSS Customization**:
```python
with gr.Blocks(css="""
    @media (max-width: 600px) {
        .container { padding: 5px; }
        .input-box { font-size: 14px; }
    }
""") as demo:
    pass  # Interface components
```

2. **Layout Adaptation**:
   - Use flexible layouts that adapt to screen size
   - Implement collapsible sections for mobile view
   - Ensure touch-friendly UI elements

## Deployment on Hugging Face Spaces

1. **Space Configuration**:
   - Create a `requirements.txt` file with all dependencies
   - Set up appropriate environment variables
   - Configure resource allocation

2. **Continuous Integration**:
   - Set up a GitHub repository for the project
   - Configure automatic deployment to Hugging Face Spaces
   - Implement version control for the interface

3. **Monitoring and Analytics**:
   - Add usage tracking
   - Implement error logging
   - Set up performance monitoring

## Next Steps

1. Implement the basic chat interface with Gradio
2. Add document upload and processing functionality
3. Create the embed code generation feature
4. Implement responsive design and language switching
5. Deploy to Hugging Face Spaces for testing
6. Gather feedback and iterate on the design
design/document_processing.md
ADDED
@@ -0,0 +1,170 @@
# Document Processing Pipeline Design

This document outlines the design for the document processing pipeline of our Norwegian RAG-based chatbot. The pipeline transforms raw documents into embeddings that can be efficiently retrieved during the chat process.

## Pipeline Overview

```
Raw Documents → Text Extraction → Text Chunking → Text Cleaning → Embedding Generation → Vector Storage
```

## Components

### 1. Text Extraction

**Purpose**: Extract plain text from various document formats.

**Supported Formats**:
- PDF (.pdf)
- Word documents (.docx, .doc)
- Text files (.txt)
- HTML (.html, .htm)
- Markdown (.md)

**Implementation**:
- Use PyPDF2 for PDF extraction
- Use python-docx for Word documents
- Use BeautifulSoup for HTML parsing
- Direct reading for text and Markdown files

### 2. Text Chunking

**Purpose**: Split documents into manageable chunks for more precise retrieval.

**Chunking Strategies**:
- Fixed-size chunks (512 tokens recommended for Norwegian text)
- Semantic chunking (split at paragraph or section boundaries)
- Overlapping chunks (100-token overlap recommended)

**Implementation** (see the sketch below):
- Use LangChain's text splitters
- Implement custom Norwegian-aware chunking logic
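
A minimal sketch of the LangChain option; note that `chunk_size`/`chunk_overlap` count characters here, not tokens, unless a token-based `length_function` is supplied:

```python
# Sketch: chunking with LangChain's RecursiveCharacterTextSplitter.
from langchain.text_splitter import RecursiveCharacterTextSplitter

raw_text = open("data/documents/test_document.txt", encoding="utf-8").read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    # Prefer paragraph, then line, then sentence, then word boundaries.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(raw_text)
```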
### 3. Text Cleaning

**Purpose**: Normalize and clean text to improve embedding quality.

**Cleaning Operations**:
- Remove excessive whitespace
- Normalize Norwegian characters (æ, ø, å)
- Remove irrelevant content (headers, footers, page numbers)
- Handle special characters and symbols

**Implementation**:
- Custom text-cleaning functions
- Norwegian-specific normalization rules

### 4. Embedding Generation

**Purpose**: Generate vector representations of text chunks.

**Embedding Model**:
- Primary: NbAiLab/nb-sbert-base (768 dimensions)
- Alternative: FFI/SimCSE-NB-BERT-large

**Implementation** (sketched below):
- Use the sentence-transformers library
- Batch processing for efficiency
- Caching mechanism for frequently embedded chunks
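
A minimal batched-embedding sketch with sentence-transformers and the primary model named above (the example chunks are illustrative):

```python
# Sketch: batched embedding generation with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NbAiLab/nb-sbert-base")
chunks = ["Norge har en rik historie.", "Oslo er hovedstaden i Norge."]
embeddings = model.encode(
    chunks,
    batch_size=32,               # batch for throughput
    normalize_embeddings=True,   # unit vectors, so inner product == cosine
)
# embeddings: numpy array of shape (len(chunks), 768)
```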
### 5. Vector Storage

**Purpose**: Store and index embeddings for efficient retrieval.

**Storage Options**:
- Primary: FAISS (Facebook AI Similarity Search)
- Alternative: Milvus (for larger deployments)

**Implementation** (see the sketch below):
- FAISS IndexFlatIP (inner product) for cosine similarity
- Metadata storage for mapping vectors to the original text
- Serialization for persistence
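
Continuing from the embedding sketch above, a minimal FAISS sketch; the index path under `data/processed/` is an assumption:

```python
# Sketch: FAISS IndexFlatIP over L2-normalized vectors, so inner product
# equals cosine similarity; the index is serialized for persistence.
import faiss
import numpy as np

vecs = np.asarray(embeddings, dtype="float32")  # from the embedding sketch above
faiss.normalize_L2(vecs)                        # no-op if already normalized
index = faiss.IndexFlatIP(vecs.shape[1])        # 768 for nb-sbert-base
index.add(vecs)
faiss.write_index(index, "data/processed/chunks.faiss")  # persistence

# Query: embed with the same model, normalize, then search.
query = model.encode(["Hva er hovedstaden i Norge?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)  # top-2 chunk indices and similarities
```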
## Processing Flow

1. **Document Ingestion**:
   - Accept documents via the upload interface
   - Store original documents in a document store
   - Extract document metadata (title, date, source)

2. **Processing Pipeline Execution**:
   - Process documents through the pipeline components
   - Track processing status and errors
   - Generate unique IDs for each chunk

3. **Index Management**:
   - Create and update vector indices
   - Implement versioning for indices
   - Provide reindexing capabilities

## Norwegian Language Considerations

- **Character Encoding**: Ensure proper handling of Norwegian characters (UTF-8)
- **Tokenization**: Use tokenizers that properly handle Norwegian word structures
- **Stopwords**: Implement Norwegian stopword filtering for improved retrieval
- **Stemming/Lemmatization**: Consider Norwegian-specific stemming or lemmatization

## Implementation Plan

1. Create the document processor class structure
2. Implement text extraction for different formats
3. Develop chunking strategies optimized for Norwegian
4. Build text cleaning and normalization functions
5. Integrate with the embedding model
6. Set up vector storage and retrieval mechanisms
7. Create a unified API for the entire pipeline

## Code Structure

```python
# Example structure for the document processing pipeline

class DocumentProcessor:
    def __init__(self, embedding_model, vector_store):
        self.embedding_model = embedding_model
        self.vector_store = vector_store

    def process_document(self, document_path):
        # Extract text based on document type
        raw_text = self._extract_text(document_path)

        # Split text into chunks
        chunks = self._chunk_text(raw_text)

        # Clean and normalize text chunks
        cleaned_chunks = [self._clean_text(chunk) for chunk in chunks]

        # Generate embeddings
        embeddings = self._generate_embeddings(cleaned_chunks)

        # Store in vector database
        self._store_embeddings(embeddings, cleaned_chunks)

    def _extract_text(self, document_path):
        # Implementation for different document types
        pass

    def _chunk_text(self, text):
        # Implementation of the chunking strategy
        pass

    def _clean_text(self, text):
        # Text normalization and cleaning
        pass

    def _generate_embeddings(self, chunks):
        # Use the embedding model to generate vectors
        pass

    def _store_embeddings(self, embeddings, chunks):
        # Store in the vector database with metadata
        pass
```

## Next Steps

1. Implement the document processor class
2. Create test documents in Norwegian
3. Evaluate chunking strategies for Norwegian text
4. Benchmark embedding generation performance
5. Test retrieval accuracy with Norwegian queries
design/rag_architecture.md
ADDED
@@ -0,0 +1,197 @@
# RAG Architecture for a Norwegian Chatbot

## Overview

This document outlines the architecture for a Retrieval-Augmented Generation (RAG) chatbot optimized for Norwegian, designed to be hosted on Hugging Face. The architecture leverages open-source models with strong Norwegian language support and integrates with Hugging Face's infrastructure for seamless deployment.

## System Components

### 1. Language Model (LLM)

Based on our research, we recommend one of the following models:

**Primary Option: NorMistral-7b-scratch**
- Strong Norwegian language support
- Apache 2.0 license (allows commercial use)
- 7B parameters (a reasonable size for deployment)
- Good performance on Norwegian language tasks
- Available on Hugging Face

**Alternative Option: Viking 7B**
- Specifically designed for Nordic languages
- Apache 2.0 license
- 4K context length
- Good multilingual capabilities (useful if the chatbot needs to handle some English queries)

**Fallback Option: NorskGPT-Mistral**
- Specifically designed for Norwegian
- Note: non-commercial license (CC BY-NC-SA 4.0)

### 2. Embedding Model

**Recommended: NbAiLab/nb-sbert-base**
- Specifically trained for Norwegian
- 768-dimensional embeddings
- Good performance on sentence similarity tasks
- Works well with both Norwegian and English content
- Apache 2.0 license
- High download count on Hugging Face (41,370 last month)

### 3. Vector Database

**Recommended: FAISS**
- Lightweight and efficient
- Easy integration with Hugging Face
- Can be packaged with the application
- Works well for moderate-sized document collections

**Alternative: Milvus**
- More scalable for larger document collections
- Well-documented integration with Hugging Face
- Better for production deployments with large document bases

### 4. Document Processing Pipeline

1. **Text Extraction**: Extract text from various document formats (PDF, DOCX, TXT)
2. **Text Chunking**: Split documents into manageable chunks (recommended chunk size: 512 tokens)
3. **Text Cleaning**: Remove irrelevant content, normalize text
4. **Embedding Generation**: Generate embeddings using NbAiLab/nb-sbert-base
5. **Vector Storage**: Store embeddings in a FAISS index

### 5. Retrieval Mechanism

1. **Query Processing**: Process the user query
2. **Query Embedding**: Generate an embedding for the query using the same embedding model
3. **Similarity Search**: Find the most relevant document chunks using cosine similarity
4. **Context Assembly**: Assemble retrieved chunks into context for the LLM

### 6. Generation Component

1. **Prompt Construction**: Construct a prompt with the retrieved context and user query
2. **LLM Inference**: Generate a response using the LLM
3. **Response Post-processing**: Format and clean the response (a sketch of sections 5-6 follows)
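
A condensed, illustrative sketch of the retrieval and generation components together; it assumes a populated FAISS `index`, the matching `chunks` list, a SentenceTransformer `embed_model`, and an `llm` callable wrapping LLM inference (all names are placeholders):

```python
# Sketch of retrieval (section 5) + generation (section 6).
# `index`, `chunks`, `embed_model`, and `llm` are assumed to exist.
import numpy as np

def rag_answer(query: str, k: int = 5) -> str:
    # 5.2-5.3: embed the query and search by cosine similarity
    q = embed_model.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, k)
    # 5.4: assemble retrieved chunks into a context block
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    # 6.1-6.2: construct the prompt and run LLM inference
    prompt = (
        "Du er en hjelpsom assistent som svarer på norsk.\n\n"
        f"KONTEKST:\n{context}\n\nSPØRSMÅL:\n{query}\n\nSVAR:\n"
    )
    return llm(prompt)
```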
### 7. Chat Interface

1. **Frontend**: Lightweight, responsive web interface
2. **API Layer**: RESTful API for communication between frontend and backend
3. **Session Management**: Maintain conversation history

## Hugging Face Integration

### Deployment Options

1. **Hugging Face Spaces**:
   - Deploy the entire application as a Gradio or Streamlit app
   - Provides a public URL for access
   - Supports Git-based deployment

2. **Model Hosting**:
   - Host the fine-tuned LLM on the Hugging Face Model Hub
   - Use the Hugging Face Inference API for model inference

3. **Datasets**:
   - Store and version document collections on Hugging Face Datasets

### Implementation Approach

1. **Gradio Interface**:
   - Create a Gradio app for the chat interface
   - Deploy to Hugging Face Spaces

2. **Backend Processing**:
   - Use the Hugging Face Transformers and Sentence-Transformers libraries
   - Implement the document processing pipeline
   - Set up FAISS for vector storage and retrieval

3. **Model Integration**:
   - Load models from the Hugging Face Model Hub
   - Implement caching for better performance

## Technical Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                       Hugging Face Spaces                       │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Web Interface                           │
│                                                                 │
│  ┌─────────────┐                              ┌────────────┐    │
│  │   Gradio    │                              │  Session   │    │
│  │  Interface  │◄─────────────────────────────┤  Manager   │    │
│  └─────────────┘                              └────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Backend Processing                        │
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Query     │    │  Retrieval  │    │     Generation      │  │
│  │ Processing  │───►│   Engine    │───►│       Engine        │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│                            │                      ▲             │
│                            ▼                      │             │
│                     ┌─────────────┐               │             │
│                     │    FAISS    │               │             │
│                     │   Vector    │               │             │
│                     │    Store    │               │             │
│                     └─────────────┘               │             │
│                            ▲                      │             │
│                            │                      │             │
│  ┌─────────────────────────┴──────────────────────┴──────────┐  │
│  │                   Document Processor                      │  │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Hugging Face Model Hub                      │
│                                                                 │
│  ┌─────────────────┐            ┌───────────────────┐           │
│  │    NbAiLab/     │            │    NorMistral-    │           │
│  │  nb-sbert-base  │            │    7b-scratch     │           │
│  │  (Embeddings)   │            │       (LLM)       │           │
│  └─────────────────┘            └───────────────────┘           │
└─────────────────────────────────────────────────────────────────┘
```

## Implementation Considerations

### 1. Performance Optimization

- **Model Quantization**: Use GGUF- or GPTQ-quantized versions of the LLM to reduce memory requirements
- **Batch Processing**: Implement batch processing for document embedding generation
- **Caching**: Cache frequent queries and responses
- **Progressive Loading**: Implement progressive loading for large document collections

### 2. Norwegian Language Optimization

- **Tokenization**: Ensure proper tokenization of Norwegian-specific characters and word structures
- **Text Normalization**: Implement Norwegian-specific text normalization (handling of "æ", "ø", "å")
- **Stopword Removal**: Use a Norwegian stopword list for improved retrieval

### 3. Embedding Functionality

- **iFrame Integration**: Provide code snippets for embedding the chatbot in iframes
- **JavaScript Widget**: Create a JavaScript widget for easy integration into any website
- **API Access**: Provide API endpoints for programmatic access

### 4. Security and Privacy

- **Data Handling**: Implement proper data handling practices
- **User Authentication**: Add optional user authentication for personalized experiences
- **Rate Limiting**: Implement rate limiting to prevent abuse

## Next Steps

1. Set up the development environment
2. Implement the document processing pipeline
3. Integrate the LLM and embedding models
4. Create the chat interface
5. Develop the embedding functionality
6. Deploy to Hugging Face
7. Test and optimize the solution
prepare_deployment.sh
ADDED
@@ -0,0 +1,37 @@
#!/bin/bash
# Create empty directories for data storage
mkdir -p /home/ubuntu/chatbot_project/data/documents
mkdir -p /home/ubuntu/chatbot_project/data/processed
touch /home/ubuntu/chatbot_project/data/documents/.gitkeep
touch /home/ubuntu/chatbot_project/data/processed/.gitkeep

# Create a simple test document
cat > /home/ubuntu/chatbot_project/data/documents/test_document.txt << 'EOL'
# Norsk historie

Norge har en rik og fascinerende historie som strekker seg tilbake til vikingtiden. Vikingene var kjent for sine sjøreiser, handel og plyndring i store deler av Europa fra slutten av 700-tallet til midten av 1000-tallet.

## Middelalderen

I 1030 døde Olav Haraldsson (senere kjent som Olav den hellige) i slaget ved Stiklestad. Hans død markerte begynnelsen på kristendommens endelige gjennombrudd i Norge.

Norge ble forent til ett rike under Harald Hårfagre på 800-tallet. Etter vikingtiden fulgte en periode med borgerkrig før landet ble stabilisert under Håkon Håkonsson på 1200-tallet.

## Union med Danmark

Fra 1380 til 1814 var Norge i union med Danmark, en periode kjent som "dansketiden". Under denne perioden ble dansk det offisielle språket i administrasjon og litteratur, noe som hadde stor innflytelse på det norske språket.

## Grunnloven og union med Sverige

I 1814 fikk Norge sin egen grunnlov, signert på Eidsvoll 17. mai. Samme år ble Norge tvunget inn i en union med Sverige, som varte frem til 1905.

## Moderne Norge

Norge ble okkupert av Nazi-Tyskland under andre verdenskrig fra 1940 til 1945. Etter krigen opplevde landet rask økonomisk vekst.

Oppdagelsen av olje i Nordsjøen på slutten av 1960-tallet forvandlet Norge til en av verdens rikeste nasjoner per innbygger.

I dag er Norge kjent for sin velferdsstat, naturskjønnhet og høy levestandard.
EOL

echo "Deployment files prepared successfully"
requirements-minimal.txt
ADDED
@@ -0,0 +1,21 @@
# Core dependencies - minimal version
transformers>=4.36.0
sentence-transformers>=2.2.2
torch>=2.0.0
gradio>=4.0.0
huggingface_hub>=0.19.0

# Document processing - essential only
PyPDF2>=3.0.0
beautifulsoup4>=4.12.0

# Vector database - lightweight option
faiss-cpu>=1.7.4

# Utilities - minimal set
numpy>=1.24.0
tqdm>=4.66.0
requests>=2.31.0

# Norwegian language support
nltk>=3.8.0
requirements-ultra-light.txt
ADDED
@@ -0,0 +1,7 @@
# Core dependencies - ultra lightweight
requests>=2.31.0
gradio>=4.0.0
huggingface_hub>=0.19.0
numpy>=1.24.0
PyPDF2>=3.0.0
beautifulsoup4>=4.12.0
requirements.txt
CHANGED
@@ -1 +1,25 @@
-
# Core dependencies
transformers>=4.36.0
sentence-transformers>=2.2.2
torch>=2.0.0
gradio>=4.0.0
huggingface_hub>=0.19.0

# Document processing
PyPDF2>=3.0.0
python-docx>=0.8.11
beautifulsoup4>=4.12.0
markdown>=3.5.0

# Vector database
faiss-cpu>=1.7.4
langchain>=0.1.0

# Utilities
numpy>=1.24.0
pandas>=2.0.0
tqdm>=4.66.0
requests>=2.31.0

# Norwegian language support
nltk>=3.8.0
research/norwegian_llm_research.md
ADDED
@@ -0,0 +1,81 @@
# Norwegian LLM and Embedding Models Research

## Open-Source LLMs with Norwegian Language Support

### 1. NorMistral-7b-scratch
- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts).
- **Architecture**: Based on the Mistral architecture with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
  - Perplexity on the NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache 2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo

### 2. Viking 7B
- **Description**: The first multilingual large language model for all Nordic languages (including Norwegian)
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
- **Context Length**: 4k tokens
- **Performance**: Best-in-class performance in all Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
  - Developed by Silo AI and the University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages

### 3. NorskGPT
- **Description**: A Norwegian large language model made for Norwegian society
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with an 8K context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with 4K context length, based on LLAMA2
- **License**: CC BY-NC-SA 4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm

## Embedding Models for Norwegian

### 1. NbAiLab/nb-sbert-base
- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
  - Cosine similarity: Pearson 0.8275, Spearman 0.8245
- **License**: Apache 2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it ideal for bilingual applications (see the similarity sketch below)
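
A minimal sketch illustrating this bilingual behaviour; the example sentence pair is ours, not from the model card:

```python
# Sketch: cross-lingual sentence similarity with nb-sbert-base.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")
no = model.encode("Hvordan ble Norge samlet til ett rike?", convert_to_tensor=True)
en = model.encode("How was Norway unified into one kingdom?", convert_to_tensor=True)
print(util.cos_sim(no, en).item())  # high score for this bilingual paraphrase pair
```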
### 2. FFI/SimCSE-NB-BERT-large
- **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large

## Vector Database Options for Hugging Face RAG Integration

### 1. Milvus
- **Integration**: Well-documented integration with Hugging Face for RAG pipelines
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus

### 2. MongoDB
- **Integration**: Can be used with Hugging Face models for RAG systems
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb

### 3. MyScale
- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293

### 4. FAISS (Facebook AI Similarity Search)
- **Integration**: Lightweight vector index that works well with Hugging Face
- **Notes**: Can be used with `autofaiss` for quick experimentation

## Hugging Face RAG Implementation Options

1. **Transformers Library**: Provides access to pre-trained models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
src/api/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
API integration module for Norwegian RAG chatbot.
"""
src/api/config.py
ADDED
@@ -0,0 +1,61 @@
"""
Configuration for Hugging Face API integration.
Contains model IDs, API endpoints, and other configuration parameters.
"""

# Norwegian LLM options
LLM_MODELS = {
    "normistral": {
        "model_id": "norallm/normistral-7b-scratch",
        "description": "NorMistral 7B - Norwegian language model based on Mistral architecture"
    },
    "viking": {
        "model_id": "silo-ai/viking-7b",
        "description": "Viking 7B - Multilingual model for Nordic languages"
    },
    "norskgpt": {
        "model_id": "NbAiLab/NorskGPT",
        "description": "NorskGPT - Norwegian language model"
    }
}

# Default LLM model
DEFAULT_LLM_MODEL = "normistral"

# Norwegian embedding models
EMBEDDING_MODELS = {
    "nb-sbert": {
        "model_id": "NbAiLab/nb-sbert-base",
        "description": "NB-SBERT-BASE - Norwegian sentence embedding model"
    },
    "simcse": {
        "model_id": "FFI/SimCSE-NB-BERT-large",
        "description": "SimCSE-NB-BERT-large - Norwegian sentence embedding model"
    }
}

# Default embedding model
DEFAULT_EMBEDDING_MODEL = "nb-sbert"

# Hugging Face API endpoints
HF_API_ENDPOINTS = {
    "inference": "https://api-inference.huggingface.co/models/",
    "feature-extraction": "https://api-inference.huggingface.co/pipeline/feature-extraction/"
}

# API request parameters
API_PARAMS = {
    "max_length": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1
}

# Document processing parameters
CHUNK_SIZE = 512
CHUNK_OVERLAP = 100

# RAG parameters
MAX_CHUNKS_TO_RETRIEVE = 5
SIMILARITY_THRESHOLD = 0.75
src/api/huggingface_api.py
ADDED
@@ -0,0 +1,213 @@
"""
Hugging Face API integration for Norwegian RAG chatbot.
Provides functions to interact with the Hugging Face Inference API for both LLM and embedding models.
"""

import os
import json
import time
import requests
from typing import Dict, List, Optional, Union, Any

from .config import (
    LLM_MODELS,
    DEFAULT_LLM_MODEL,
    EMBEDDING_MODELS,
    DEFAULT_EMBEDDING_MODEL,
    HF_API_ENDPOINTS,
    API_PARAMS
)

class HuggingFaceAPI:
    """
    Client for interacting with the Hugging Face Inference API.
    Supports both text generation (LLM) and embedding generation.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        llm_model: str = DEFAULT_LLM_MODEL,
        embedding_model: str = DEFAULT_EMBEDDING_MODEL
    ):
        """
        Initialize the Hugging Face API client.

        Args:
            api_key: Hugging Face API key (optional, can use HF_API_KEY env var)
            llm_model: LLM model identifier from config
            embedding_model: Embedding model identifier from config
        """
        self.api_key = api_key or os.environ.get("HF_API_KEY", "")

        # Set up model IDs, falling back to the defaults for unknown keys
        self.llm_model_id = LLM_MODELS[llm_model]["model_id"] if llm_model in LLM_MODELS else LLM_MODELS[DEFAULT_LLM_MODEL]["model_id"]
        self.embedding_model_id = EMBEDDING_MODELS[embedding_model]["model_id"] if embedding_model in EMBEDDING_MODELS else EMBEDDING_MODELS[DEFAULT_EMBEDDING_MODEL]["model_id"]

        # Set up headers
        self.headers = {"Authorization": f"Bearer {self.api_key}"}
        if not self.api_key:
            print("Warning: No API key provided. API calls may be rate limited.")
            self.headers = {}

    def generate_text(
        self,
        prompt: str,
        max_length: int = API_PARAMS["max_length"],
        temperature: float = API_PARAMS["temperature"],
        top_p: float = API_PARAMS["top_p"],
        top_k: int = API_PARAMS["top_k"],
        repetition_penalty: float = API_PARAMS["repetition_penalty"],
        wait_for_model: bool = True
    ) -> str:
        """
        Generate text using the LLM model.

        Args:
            prompt: Input text prompt
            max_length: Maximum length of generated text
            temperature: Sampling temperature
            top_p: Top-p sampling parameter
            top_k: Top-k sampling parameter
            repetition_penalty: Penalty for repetition
            wait_for_model: Whether to wait for the model to load

        Returns:
            Generated text response
        """
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_length": max_length,
                "temperature": temperature,
                "top_p": top_p,
                "top_k": top_k,
                "repetition_penalty": repetition_penalty
            }
        }

        api_url = f"{HF_API_ENDPOINTS['inference']}{self.llm_model_id}"

        # Make API request
        response = self._make_api_request(api_url, payload, wait_for_model)

        # Parse response
        if isinstance(response, list) and len(response) > 0:
            if "generated_text" in response[0]:
                return response[0]["generated_text"]
            return response[0].get("text", "")
        elif isinstance(response, dict):
            return response.get("generated_text", "")

        # Fallback
        return str(response)

    def generate_embeddings(
        self,
        texts: Union[str, List[str]],
        wait_for_model: bool = True
    ) -> List[List[float]]:
        """
        Generate embeddings for text using the embedding model.

        Args:
            texts: Single text or list of texts to embed
            wait_for_model: Whether to wait for the model to load

        Returns:
            List of embedding vectors
        """
        # Ensure texts is a list
        if isinstance(texts, str):
            texts = [texts]

        payload = {
            "inputs": texts,
        }

        api_url = f"{HF_API_ENDPOINTS['feature-extraction']}{self.embedding_model_id}"

        # Make API request
        response = self._make_api_request(api_url, payload, wait_for_model)

        # Return embeddings
        return response

    def _make_api_request(
        self,
        api_url: str,
        payload: Dict[str, Any],
        wait_for_model: bool = True,
        max_retries: int = 5,
        retry_delay: int = 1
    ) -> Any:
        """
        Make a request to the Hugging Face API with retry logic.

        Args:
            api_url: API endpoint URL
            payload: Request payload
            wait_for_model: Whether to wait for the model to load
            max_retries: Maximum number of retries
            retry_delay: Delay between retries in seconds

        Returns:
            API response
        """
        for attempt in range(max_retries):
            try:
                response = requests.post(api_url, headers=self.headers, json=payload)

                # Check if the model is still loading
                if response.status_code == 503 and wait_for_model:
                    # Model is loading; wait and retry
                    estimated_time = json.loads(response.content.decode("utf-8")).get("estimated_time", 20)
                    print(f"Model is loading. Waiting {estimated_time} seconds...")
                    time.sleep(estimated_time)
                    continue

                # Check for other errors
                if response.status_code != 200:
                    print(f"API request failed with status code {response.status_code}: {response.text}")
                    if attempt < max_retries - 1:
                        time.sleep(retry_delay * (2 ** attempt))  # Exponential backoff
                        continue
                    return {"error": response.text}

                return response.json()

            except Exception as e:
                print(f"API request failed: {str(e)}")
                if attempt < max_retries - 1:
                    time.sleep(retry_delay * (2 ** attempt))  # Exponential backoff
                    continue
                return {"error": str(e)}

        return {"error": "Max retries exceeded"}


# Example RAG prompt template for Norwegian
def create_rag_prompt(query: str, context: List[str]) -> str:
    """
    Create a RAG prompt with retrieved context for the LLM.

    Args:
        query: User query
        context: List of retrieved document chunks

    Returns:
        Formatted prompt with context
    """
    context_text = "\n\n".join([f"Dokument {i+1}:\n{chunk}" for i, chunk in enumerate(context)])

    prompt = f"""Du er en hjelpsom assistent som svarer på norsk. Bruk følgende kontekst for å svare på spørsmålet.

KONTEKST:
{context_text}

SPØRSMÅL:
{query}

SVAR:
"""
    return prompt
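
A short usage sketch of this client (not part of the file above); it assumes `HF_API_KEY` is exported and the illustrative context string is ours:

```python
# Usage sketch for HuggingFaceAPI; assumes HF_API_KEY is set in the environment.
from src.api.huggingface_api import HuggingFaceAPI, create_rag_prompt

api = HuggingFaceAPI()  # defaults: normistral LLM, nb-sbert embeddings
context = ["Norge har en rik historie som strekker seg tilbake til vikingtiden."]
prompt = create_rag_prompt("Fortell meg om norsk historie.", context)
print(api.generate_text(prompt))
```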
src/document_processing/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
Document processing module for Norwegian RAG chatbot.
"""
src/document_processing/chunker.py
ADDED
@@ -0,0 +1,262 @@
"""
Text chunking module for Norwegian RAG chatbot.
Splits documents into manageable chunks for embedding and retrieval.
"""

import re
import unicodedata
from typing import List, Optional, Tuple

from ..api.config import CHUNK_SIZE, CHUNK_OVERLAP


class TextChunker:
    """
    Splits documents into manageable chunks for embedding and retrieval.
    Supports different chunking strategies optimized for Norwegian text.
    """

    @staticmethod
    def chunk_text(
        text: str,
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP,
        strategy: str = "paragraph"
    ) -> List[str]:
        """
        Split text into chunks using the specified strategy.

        Args:
            text: Text to split into chunks
            chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks
            strategy: Chunking strategy ('fixed', 'paragraph', or 'sentence')

        Returns:
            List of text chunks
        """
        if not text:
            return []

        if strategy == "fixed":
            return TextChunker.fixed_size_chunks(text, chunk_size, chunk_overlap)
        elif strategy == "paragraph":
            return TextChunker.paragraph_chunks(text, chunk_size, chunk_overlap)
        elif strategy == "sentence":
            return TextChunker.sentence_chunks(text, chunk_size, chunk_overlap)
        else:
            raise ValueError(f"Unknown chunking strategy: {strategy}")

    @staticmethod
    def fixed_size_chunks(
        text: str,
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP
    ) -> List[str]:
        """
        Split text into fixed-size chunks with overlap.

        Args:
            text: Text to split into chunks
            chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks

        Returns:
            List of text chunks
        """
        if not text:
            return []

        chunks = []
        start = 0
        text_length = len(text)

        while start < text_length:
            end = min(start + chunk_size, text_length)

            # If this is not the first chunk and we're not at the end,
            # try to find a good breaking point (whitespace)
            if start > 0 and end < text_length:
                # Look for the last whitespace within the chunk
                last_whitespace = text.rfind(' ', start, end)
                if last_whitespace != -1:
                    end = last_whitespace + 1  # Include the space

            # Add the chunk
            chunks.append(text[start:end].strip())

            # Move the start position for the next chunk, considering overlap
            start = end - chunk_overlap if end < text_length else text_length

        return chunks

    @staticmethod
    def paragraph_chunks(
        text: str,
        max_chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP
    ) -> List[str]:
        """
        Split text into chunks based on paragraphs.

        Args:
            text: Text to split into chunks
            max_chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks

        Returns:
            List of text chunks
        """
        if not text:
            return []

        # Split text into paragraphs
        paragraphs = re.split(r'\n\s*\n', text)
        paragraphs = [p.strip() for p in paragraphs if p.strip()]

        chunks = []
        current_chunk = []
        current_size = 0

        for paragraph in paragraphs:
            paragraph_size = len(paragraph)

            # If adding this paragraph would exceed the max chunk size and we already have content,
            # save the current chunk and start a new one
            if current_size + paragraph_size > max_chunk_size and current_chunk:
                chunks.append('\n\n'.join(current_chunk))

                # For overlap, keep some paragraphs from the previous chunk
                overlap_size = 0
                overlap_paragraphs = []

                # Add paragraphs from the end until we reach the desired overlap
                for p in reversed(current_chunk):
                    if overlap_size + len(p) <= chunk_overlap:
                        overlap_paragraphs.insert(0, p)
                        overlap_size += len(p)
                    else:
                        break

                current_chunk = overlap_paragraphs
                current_size = overlap_size

            # If the paragraph itself is larger than the max chunk size, split it further
            if paragraph_size > max_chunk_size:
                # First, add the current chunk if it's not empty
                if current_chunk:
                    chunks.append('\n\n'.join(current_chunk))
                    current_chunk = []
                    current_size = 0

                # Then split the large paragraph into fixed-size chunks
                paragraph_chunks = TextChunker.fixed_size_chunks(paragraph, max_chunk_size, chunk_overlap)
                chunks.extend(paragraph_chunks)
            else:
                # Add the paragraph to the current chunk
                current_chunk.append(paragraph)
                current_size += paragraph_size

        # Add the last chunk if it's not empty
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))

        return chunks

    @staticmethod
    def sentence_chunks(
        text: str,
        max_chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP
    ) -> List[str]:
        """
        Split text into chunks based on sentences.

        Args:
            text: Text to split into chunks
            max_chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks

        Returns:
            List of text chunks
        """
        if not text:
            return []

        # Norwegian-aware sentence splitting
        # This pattern handles common Norwegian sentence endings
        sentence_pattern = r'(?<=[.!?])\s+(?=[A-ZÆØÅ])'
        sentences = re.split(sentence_pattern, text)
        sentences = [s.strip() for s in sentences if s.strip()]

        chunks = []
        current_chunk = []
        current_size = 0

        for sentence in sentences:
            sentence_size = len(sentence)

            # If adding this sentence would exceed the max chunk size and we already have content,
            # save the current chunk and start a new one
            if current_size + sentence_size > max_chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))

                # For overlap, keep some sentences from the previous chunk
                overlap_size = 0
                overlap_sentences = []

                # Add sentences from the end until we reach the desired overlap
                for s in reversed(current_chunk):
                    if overlap_size + len(s) <= chunk_overlap:
                        overlap_sentences.insert(0, s)
                        overlap_size += len(s)
                    else:
                        break

                current_chunk = overlap_sentences
                current_size = overlap_size

            # If the sentence itself is larger than the max chunk size, split it further
            if sentence_size > max_chunk_size:
                # First, add the current chunk if it's not empty
                if current_chunk:
                    chunks.append(' '.join(current_chunk))
                    current_chunk = []
                    current_size = 0

                # Then split the large sentence into fixed-size chunks
                sentence_chunks = TextChunker.fixed_size_chunks(sentence, max_chunk_size, chunk_overlap)
                chunks.extend(sentence_chunks)
            else:
                # Add the sentence to the current chunk
                current_chunk.append(sentence)
                current_size += sentence_size

        # Add the last chunk if it's not empty
        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks

    @staticmethod
    def clean_chunk(chunk: str) -> str:
        """
        Clean a text chunk by removing excessive whitespace and normalizing.

        Args:
            chunk: Text chunk to clean

        Returns:
            Cleaned text chunk
        """
        if not chunk:
            return ""

        # Replace multiple whitespace with a single space
        cleaned = re.sub(r'\s+', ' ', chunk)

        # Normalize to NFC so composed and decomposed Unicode forms of the
        # Norwegian characters æ, ø, å are handled consistently
        cleaned = unicodedata.normalize('NFC', cleaned)

        return cleaned.strip()
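A quick, self-contained sketch of the chunker on a toy text; the explicit size and overlap values here are illustrative overrides of the `CHUNK_SIZE`/`CHUNK_OVERLAP` defaults from `src/api/config.py`.

```python
from src.document_processing.chunker import TextChunker

text = (
    "Første avsnitt handler om Oslo.\n\n"
    "Andre avsnitt handler om Bergen.\n\n"
    "Tredje avsnitt handler om Trondheim."
)

# Paragraph strategy: whole paragraphs are packed into chunks up to the limit.
for chunk in TextChunker.chunk_text(text, chunk_size=70, chunk_overlap=10, strategy="paragraph"):
    print(repr(TextChunker.clean_chunk(chunk)))
```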
src/document_processing/extractor.py
ADDED
@@ -0,0 +1,167 @@
"""
Text extraction module for Norwegian RAG chatbot.
Extracts text from various document formats.
"""

import os
import PyPDF2
from typing import List, Optional
from bs4 import BeautifulSoup


class TextExtractor:
    """
    Extracts text from various document formats.
    Currently supports:
    - PDF (.pdf)
    - Text files (.txt)
    - HTML (.html, .htm)
    """

    @staticmethod
    def extract_from_file(file_path: str) -> str:
        """
        Extract text from a file based on its extension.

        Args:
            file_path: Path to the document file

        Returns:
            Extracted text content
        """
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        file_extension = os.path.splitext(file_path)[1].lower()

        if file_extension == '.pdf':
            return TextExtractor.extract_from_pdf(file_path)
        elif file_extension == '.txt':
            return TextExtractor.extract_from_text(file_path)
        elif file_extension in ['.html', '.htm']:
            return TextExtractor.extract_from_html(file_path)
        else:
            raise ValueError(f"Unsupported file format: {file_extension}")

    @staticmethod
    def extract_from_pdf(file_path: str) -> str:
        """
        Extract text from a PDF file.

        Args:
            file_path: Path to the PDF file

        Returns:
            Extracted text content
        """
        text = ""
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page_num in range(len(pdf_reader.pages)):
                    page = pdf_reader.pages[page_num]
                    text += page.extract_text() + "\n\n"
        except Exception as e:
            print(f"Error extracting text from PDF {file_path}: {str(e)}")
            return ""

        return text

    @staticmethod
    def extract_from_text(file_path: str) -> str:
        """
        Extract text from a plain text file.

        Args:
            file_path: Path to the text file

        Returns:
            Extracted text content
        """
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()
        except UnicodeDecodeError:
            # Try with different encoding if UTF-8 fails
            try:
                with open(file_path, 'r', encoding='latin-1') as file:
                    return file.read()
            except Exception as e:
                print(f"Error extracting text from file {file_path}: {str(e)}")
                return ""
        except Exception as e:
            print(f"Error extracting text from file {file_path}: {str(e)}")
            return ""

    @staticmethod
    def extract_from_html(file_path: str) -> str:
        """
        Extract text from an HTML file.

        Args:
            file_path: Path to the HTML file

        Returns:
            Extracted text content
        """
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                html_content = file.read()
                soup = BeautifulSoup(html_content, 'html.parser')

                # Remove script and style elements
                for script in soup(["script", "style"]):
                    script.extract()

                # Get text
                text = soup.get_text()

                # Break into lines and remove leading and trailing space on each
                lines = (line.strip() for line in text.splitlines())

                # Break multi-headlines into a line each
                chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

                # Drop blank lines
                text = '\n'.join(chunk for chunk in chunks if chunk)

                return text
        except Exception as e:
            print(f"Error extracting text from HTML {file_path}: {str(e)}")
            return ""

    @staticmethod
    def extract_from_url(url: str) -> str:
        """
        Extract text from a web URL.

        Args:
            url: Web URL to extract text from

        Returns:
            Extracted text content
        """
        try:
            import requests
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove script and style elements
            for script in soup(["script", "style"]):
                script.extract()

            # Get text
            text = soup.get_text()

            # Break into lines and remove leading and trailing space on each
            lines = (line.strip() for line in text.splitlines())

            # Break multi-headlines into a line each
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

            # Drop blank lines
            text = '\n'.join(chunk for chunk in chunks if chunk)

            return text
        except Exception as e:
            print(f"Error extracting text from URL {url}: {str(e)}")
            return ""
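A minimal sketch of the extractor's dispatch-by-extension behaviour, using the test document bundled with this upload (`data/documents/test_document.txt`); the relative path assumes you run from the repository root.

```python
from src.document_processing.extractor import TextExtractor

# A .txt file dispatches to extract_from_text; .pdf and .html have their own paths.
text = TextExtractor.extract_from_file("data/documents/test_document.txt")
print(text[:200])
```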
src/document_processing/processor.py
ADDED
@@ -0,0 +1,306 @@
"""
Document processor module for Norwegian RAG chatbot.
Orchestrates the document processing pipeline with remote embeddings.
"""

import os
import json
import numpy as np
from typing import List, Dict, Any, Optional, Tuple, Union
from datetime import datetime

from .extractor import TextExtractor
from .chunker import TextChunker
from ..api.huggingface_api import HuggingFaceAPI
from ..api.config import CHUNK_SIZE, CHUNK_OVERLAP


class DocumentProcessor:
    """
    Orchestrates the document processing pipeline:
    1. Extract text from documents
    2. Split text into chunks
    3. Generate embeddings using remote API
    4. Store processed documents and embeddings
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
        documents_dir: str = "/home/ubuntu/chatbot_project/data/documents",
        processed_dir: str = "/home/ubuntu/chatbot_project/data/processed",
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP,
        chunking_strategy: str = "paragraph"
    ):
        """
        Initialize the document processor.

        Args:
            api_client: HuggingFaceAPI client for generating embeddings
            documents_dir: Directory for storing original documents
            processed_dir: Directory for storing processed documents and embeddings
            chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks
            chunking_strategy: Strategy for chunking text ('fixed', 'paragraph', or 'sentence')
        """
        self.api_client = api_client or HuggingFaceAPI()
        self.documents_dir = documents_dir
        self.processed_dir = processed_dir
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunking_strategy = chunking_strategy

        # Ensure directories exist
        os.makedirs(self.documents_dir, exist_ok=True)
        os.makedirs(self.processed_dir, exist_ok=True)

        # Initialize document index
        self.document_index_path = os.path.join(self.processed_dir, "document_index.json")
        self.document_index = self._load_document_index()

    def process_document(
        self,
        file_path: str,
        document_id: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None
    ) -> str:
        """
        Process a document through the entire pipeline.

        Args:
            file_path: Path to the document file
            document_id: Optional custom document ID
            metadata: Optional metadata for the document

        Returns:
            Document ID
        """
        # Generate document ID if not provided
        if document_id is None:
            document_id = f"doc_{datetime.now().strftime('%Y%m%d%H%M%S')}_{os.path.basename(file_path)}"

        # Extract text from document
        text = TextExtractor.extract_from_file(file_path)
        if not text:
            raise ValueError(f"Failed to extract text from {file_path}")

        # Split text into chunks
        chunks = TextChunker.chunk_text(
            text,
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            strategy=self.chunking_strategy
        )

        # Clean chunks
        chunks = [TextChunker.clean_chunk(chunk) for chunk in chunks]

        # Generate embeddings using remote API
        embeddings = self.api_client.generate_embeddings(chunks)

        # Prepare metadata
        if metadata is None:
            metadata = {}

        metadata.update({
            "filename": os.path.basename(file_path),
            "processed_date": datetime.now().isoformat(),
            "chunk_count": len(chunks),
            "chunking_strategy": self.chunking_strategy,
            "embedding_model": self.api_client.embedding_model_id
        })

        # Save processed document
        self._save_processed_document(document_id, chunks, embeddings, metadata)

        # Update document index
        self._update_document_index(document_id, metadata)

        return document_id

    def process_text(
        self,
        text: str,
        document_id: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None
    ) -> str:
        """
        Process text directly through the pipeline.

        Args:
            text: Text content to process
            document_id: Optional custom document ID
            metadata: Optional metadata for the document

        Returns:
            Document ID
        """
        # Generate document ID if not provided
        if document_id is None:
            document_id = f"text_{datetime.now().strftime('%Y%m%d%H%M%S')}"

        # Split text into chunks
        chunks = TextChunker.chunk_text(
            text,
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            strategy=self.chunking_strategy
        )

        # Clean chunks
        chunks = [TextChunker.clean_chunk(chunk) for chunk in chunks]

        # Generate embeddings using remote API
        embeddings = self.api_client.generate_embeddings(chunks)

        # Prepare metadata
        if metadata is None:
            metadata = {}

        metadata.update({
            "source": "direct_text",
            "processed_date": datetime.now().isoformat(),
            "chunk_count": len(chunks),
            "chunking_strategy": self.chunking_strategy,
            "embedding_model": self.api_client.embedding_model_id
        })

        # Save processed document
        self._save_processed_document(document_id, chunks, embeddings, metadata)

        # Update document index
        self._update_document_index(document_id, metadata)

        return document_id

    def get_document_chunks(self, document_id: str) -> List[str]:
        """
        Get all chunks for a document.

        Args:
            document_id: Document ID

        Returns:
            List of text chunks
        """
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if not os.path.exists(document_path):
            raise FileNotFoundError(f"Document not found: {document_id}")

        with open(document_path, 'r', encoding='utf-8') as f:
            document_data = json.load(f)

        return document_data.get("chunks", [])

    def get_document_embeddings(self, document_id: str) -> List[List[float]]:
        """
        Get all embeddings for a document.

        Args:
            document_id: Document ID

        Returns:
            List of embedding vectors
        """
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if not os.path.exists(document_path):
            raise FileNotFoundError(f"Document not found: {document_id}")

        with open(document_path, 'r', encoding='utf-8') as f:
            document_data = json.load(f)

        return document_data.get("embeddings", [])

    def get_all_documents(self) -> Dict[str, Dict[str, Any]]:
        """
        Get all documents in the index.

        Returns:
            Dictionary of document IDs to metadata
        """
        return self.document_index

    def delete_document(self, document_id: str) -> bool:
        """
        Delete a document and its processed data.

        Args:
            document_id: Document ID

        Returns:
            True if successful, False otherwise
        """
        if document_id not in self.document_index:
            return False

        # Remove from index
        del self.document_index[document_id]
        self._save_document_index()

        # Delete processed file
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if os.path.exists(document_path):
            os.remove(document_path)

        return True

    def _save_processed_document(
        self,
        document_id: str,
        chunks: List[str],
        embeddings: List[List[float]],
        metadata: Dict[str, Any]
    ) -> None:
        """
        Save processed document data.

        Args:
            document_id: Document ID
            chunks: List of text chunks
            embeddings: List of embedding vectors
            metadata: Document metadata
        """
        document_data = {
            "document_id": document_id,
            "metadata": metadata,
            "chunks": chunks,
            "embeddings": embeddings
        }

        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        with open(document_path, 'w', encoding='utf-8') as f:
            json.dump(document_data, f, ensure_ascii=False, indent=2)

    def _load_document_index(self) -> Dict[str, Dict[str, Any]]:
        """
        Load the document index from disk.

        Returns:
            Dictionary of document IDs to metadata
        """
        if os.path.exists(self.document_index_path):
            try:
                with open(self.document_index_path, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except Exception as e:
                print(f"Error loading document index: {str(e)}")

        return {}

    def _save_document_index(self) -> None:
        """
        Save the document index to disk.
        """
        with open(self.document_index_path, 'w', encoding='utf-8') as f:
            json.dump(self.document_index, f, ensure_ascii=False, indent=2)

    def _update_document_index(self, document_id: str, metadata: Dict[str, Any]) -> None:
        """
        Update the document index with a new or updated document.

        Args:
            document_id: Document ID
            metadata: Document metadata
        """
        self.document_index[document_id] = metadata
        self._save_document_index()
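For context, a sketch of the full pipeline for one file. Note the hard-coded `/home/ubuntu/...` defaults above: the relative overrides here are assumptions for running outside that environment, and a valid `HF_API_KEY` is assumed.

```python
from src.api.huggingface_api import HuggingFaceAPI
from src.document_processing.processor import DocumentProcessor

processor = DocumentProcessor(
    api_client=HuggingFaceAPI(),
    documents_dir="data/documents",   # overrides the /home/ubuntu defaults
    processed_dir="data/processed",
)
doc_id = processor.process_document("data/documents/test_document.txt")
print(doc_id, "->", len(processor.get_document_chunks(doc_id)), "chunks stored")
```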
src/main.py
ADDED
@@ -0,0 +1,60 @@
"""
Main application entry point for Norwegian RAG chatbot.
"""

import os
import argparse
from typing import Dict, Any, Optional

from src.api.huggingface_api import HuggingFaceAPI
from src.document_processing.processor import DocumentProcessor
from src.rag.retriever import Retriever
from src.rag.generator import Generator
from src.web.app import ChatbotApp
from src.web.embed import EmbedGenerator, create_embed_html_file


def main():
    """
    Main entry point for the Norwegian RAG chatbot application.
    """
    # Parse command line arguments
    parser = argparse.ArgumentParser(description="Norwegian RAG Chatbot")
    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to run the server on")
    parser.add_argument("--port", type=int, default=7860, help="Port to run the server on")
    parser.add_argument("--share", action="store_true", help="Create a public link for sharing")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    args = parser.parse_args()

    # Initialize API client
    api_key = os.environ.get("HF_API_KEY", "")
    api_client = HuggingFaceAPI(api_key=api_key)

    # Initialize components
    document_processor = DocumentProcessor(api_client=api_client)
    retriever = Retriever(api_client=api_client)
    generator = Generator(api_client=api_client)

    # Create app
    app = ChatbotApp(
        api_client=api_client,
        document_processor=document_processor,
        retriever=retriever,
        generator=generator,
        title="Norwegian RAG Chatbot",
        description="En chatbot basert på Retrieval-Augmented Generation (RAG) for norsk språk."
    )

    # Create embedding example
    embed_generator = EmbedGenerator()
    create_embed_html_file(embed_generator)

    # Launch app
    app.launch(
        server_name=args.host,
        server_port=args.port,
        share=args.share,
        debug=args.debug
    )


if __name__ == "__main__":
    main()
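Since this entry point imports everything through the `src.` package, it is presumably launched as a module from the repository root, e.g. `python -m src.main --port 7860` with `HF_API_KEY` exported first; the exact invocation is an assumption, as the upload does not document it.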
src/project_structure.md
ADDED
@@ -0,0 +1,79 @@
# Norwegian RAG Chatbot Project Structure

## Overview
This document outlines the project structure for our lightweight Norwegian RAG chatbot implementation that uses Hugging Face's Inference API instead of running models locally.

## Directory Structure
```
chatbot_project/
├── design/                      # Design documents
│   ├── rag_architecture.md
│   ├── document_processing.md
│   └── chat_interface.md
├── research/                    # Research findings
│   └── norwegian_llm_research.md
├── src/                         # Source code
│   ├── api/                     # API integration
│   │   ├── __init__.py
│   │   ├── huggingface_api.py   # HF Inference API integration
│   │   └── config.py            # API configuration
│   ├── document_processing/     # Document processing
│   │   ├── __init__.py
│   │   ├── extractor.py         # Text extraction from documents
│   │   ├── chunker.py           # Text chunking
│   │   └── processor.py         # Main document processor
│   ├── rag/                     # RAG implementation
│   │   ├── __init__.py
│   │   ├── retriever.py         # Document retrieval
│   │   └── generator.py         # Response generation
│   ├── web/                     # Web interface
│   │   ├── __init__.py
│   │   ├── app.py               # Gradio app
│   │   └── embed.py             # Embedding functionality
│   ├── utils/                   # Utilities
│   │   ├── __init__.py
│   │   └── helpers.py           # Helper functions
│   └── main.py                  # Main application entry point
├── data/                        # Data storage
│   ├── documents/               # Original documents
│   └── processed/               # Processed documents and embeddings
├── tests/                       # Tests
│   ├── test_api.py
│   ├── test_document_processing.py
│   └── test_rag.py
├── venv/                        # Virtual environment
├── requirements-ultra-light.txt # Lightweight dependencies
├── requirements.txt             # Original requirements (for reference)
└── README.md                    # Project documentation
```

## Key Components

### 1. API Integration (`src/api/`)
- `huggingface_api.py`: Integration with Hugging Face Inference API for both LLM and embedding models
- `config.py`: Configuration for API endpoints, model IDs, and API keys

### 2. Document Processing (`src/document_processing/`)
- `extractor.py`: Extract text from various document formats
- `chunker.py`: Split documents into manageable chunks
- `processor.py`: Orchestrate the document processing pipeline

### 3. RAG Implementation (`src/rag/`)
- `retriever.py`: Retrieve relevant document chunks based on query
- `generator.py`: Generate responses using retrieved context

### 4. Web Interface (`src/web/`)
- `app.py`: Gradio web interface for the chatbot
- `embed.py`: Generate embedding code for website integration

### 5. Main Application (`src/main.py`)
- Entry point for the application
- Orchestrates the different components

## Implementation Approach

1. **Remote Model Execution**: Use Hugging Face's Inference API for both LLM and embedding models
2. **Lightweight Document Processing**: Process documents locally but use remote APIs for embedding generation
3. **Simple Vector Storage**: Store embeddings in a simple file-based format rather than a dedicated vector database (see the record sketch after this file)
4. **Gradio Interface**: Create a simple but effective chat interface using Gradio
5. **Hugging Face Spaces Deployment**: Deploy the final solution to Hugging Face Spaces
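To make point 3 of the implementation approach concrete, here is a sketch of the per-document JSON record that `DocumentProcessor._save_processed_document` writes to `data/processed/`; the field names follow that method, while every value below is an illustrative placeholder.

```python
import json

# Illustrative record only; shapes mirror _save_processed_document.
record = {
    "document_id": "doc_20250101120000_test_document.txt",
    "metadata": {
        "filename": "test_document.txt",
        "processed_date": "2025-01-01T12:00:00",
        "chunk_count": 2,
        "chunking_strategy": "paragraph",
        "embedding_model": "<embedding-model-id>",  # placeholder, see src/api/config.py
    },
    "chunks": ["Første tekstbit.", "Andre tekstbit."],
    "embeddings": [[0.012, -0.034], [0.051, 0.007]],  # truncated vectors
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```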
src/rag/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
RAG module for Norwegian chatbot.
"""
src/rag/generator.py
ADDED
@@ -0,0 +1,87 @@
"""
Generator module for Norwegian RAG chatbot.
Generates responses using retrieved context and LLM.
"""

from typing import List, Dict, Any, Optional

from ..api.huggingface_api import HuggingFaceAPI, create_rag_prompt


class Generator:
    """
    Generates responses using retrieved context and LLM.
    Uses Hugging Face Inference API for text generation.
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
    ):
        """
        Initialize the generator.

        Args:
            api_client: HuggingFaceAPI client for text generation
        """
        self.api_client = api_client or HuggingFaceAPI()

    def generate(
        self,
        query: str,
        retrieved_chunks: List[Dict[str, Any]],
        temperature: float = 0.7
    ) -> str:
        """
        Generate a response using retrieved context.

        Args:
            query: User query
            retrieved_chunks: List of retrieved chunks with metadata
            temperature: Temperature for text generation

        Returns:
            Generated response
        """
        # Extract text from retrieved chunks
        context_texts = [chunk["chunk_text"] for chunk in retrieved_chunks]

        # If no context is retrieved, generate a response without context
        if not context_texts:
            return self._generate_without_context(query, temperature)

        # Create RAG prompt
        prompt = create_rag_prompt(query, context_texts)

        # Generate response
        response = self.api_client.generate_text(
            prompt=prompt,
            temperature=temperature
        )

        return response

    def _generate_without_context(self, query: str, temperature: float = 0.7) -> str:
        """
        Generate a response without context when no relevant chunks are found.

        Args:
            query: User query
            temperature: Temperature for text generation

        Returns:
            Generated response
        """
        prompt = f"""Du er en hjelpsom assistent som svarer på norsk. Svar på følgende spørsmål så godt du kan.

SPØRSMÅL:
{query}

SVAR:
"""

        response = self.api_client.generate_text(
            prompt=prompt,
            temperature=temperature
        )

        return response
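A sketch of the generator driven with hand-made chunks, so both code paths can be seen without a populated index; the chunk dict mirrors the shape produced by `Retriever.retrieve` later in this commit, and a live API key is assumed.

```python
from src.rag.generator import Generator

generator = Generator()

# RAG path: chunks present, so create_rag_prompt builds the context prompt.
chunks = [{
    "document_id": "doc_x", "chunk_index": 0,
    "chunk_text": "Oslo er hovedstaden i Norge.",
    "similarity": 0.91, "metadata": {},
}]
print(generator.generate("Hva er hovedstaden i Norge?", chunks, temperature=0.3))

# Fallback path: no chunks, so _generate_without_context is used.
print(generator.generate("Hva er hovedstaden i Norge?", [], temperature=0.3))
```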
src/rag/retriever.py
ADDED
@@ -0,0 +1,163 @@
"""
Retriever module for Norwegian RAG chatbot.
Retrieves relevant document chunks based on query embeddings.
"""

import os
import json
import numpy as np
from typing import List, Dict, Any, Optional, Tuple, Union

from ..api.huggingface_api import HuggingFaceAPI
from ..api.config import MAX_CHUNKS_TO_RETRIEVE, SIMILARITY_THRESHOLD


class Retriever:
    """
    Retrieves relevant document chunks based on query embeddings.
    Uses cosine similarity to find the most relevant chunks.
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
        processed_dir: str = "/home/ubuntu/chatbot_project/data/processed",
        max_chunks: int = MAX_CHUNKS_TO_RETRIEVE,
        similarity_threshold: float = SIMILARITY_THRESHOLD
    ):
        """
        Initialize the retriever.

        Args:
            api_client: HuggingFaceAPI client for generating embeddings
            processed_dir: Directory containing processed documents
            max_chunks: Maximum number of chunks to retrieve
            similarity_threshold: Minimum similarity score for retrieval
        """
        self.api_client = api_client or HuggingFaceAPI()
        self.processed_dir = processed_dir
        self.max_chunks = max_chunks
        self.similarity_threshold = similarity_threshold

        # Load document index
        self.document_index_path = os.path.join(self.processed_dir, "document_index.json")
        self.document_index = self._load_document_index()

    def retrieve(self, query: str) -> List[Dict[str, Any]]:
        """
        Retrieve relevant document chunks for a query.

        Args:
            query: User query

        Returns:
            List of retrieved chunks with metadata
        """
        # Generate embedding for the query
        query_embedding = self.api_client.generate_embeddings(query)[0]

        # Find relevant chunks across all documents
        all_results = []

        for doc_id in self.document_index:
            try:
                # Load document data
                doc_results = self._retrieve_from_document(doc_id, query_embedding)
                all_results.extend(doc_results)
            except Exception as e:
                print(f"Error retrieving from document {doc_id}: {str(e)}")

        # Sort all results by similarity score
        all_results.sort(key=lambda x: x["similarity"], reverse=True)

        # Return top results above threshold
        return [
            result for result in all_results[:self.max_chunks]
            if result["similarity"] >= self.similarity_threshold
        ]

    def _retrieve_from_document(
        self,
        document_id: str,
        query_embedding: List[float]
    ) -> List[Dict[str, Any]]:
        """
        Retrieve relevant chunks from a specific document.

        Args:
            document_id: Document ID
            query_embedding: Query embedding vector

        Returns:
            List of retrieved chunks with metadata
        """
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if not os.path.exists(document_path):
            return []

        # Load document data
        with open(document_path, 'r', encoding='utf-8') as f:
            document_data = json.load(f)

        chunks = document_data.get("chunks", [])
        embeddings = document_data.get("embeddings", [])
        metadata = document_data.get("metadata", {})

        if not chunks or not embeddings or len(chunks) != len(embeddings):
            return []

        # Calculate similarity scores
        results = []
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            similarity = self._cosine_similarity(query_embedding, embedding)

            results.append({
                "document_id": document_id,
                "chunk_index": i,
                "chunk_text": chunk,
                "similarity": similarity,
                "metadata": metadata
            })

        # Sort by similarity
        results.sort(key=lambda x: x["similarity"], reverse=True)

        return results[:self.max_chunks]

    def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """
        Calculate cosine similarity between two vectors.

        Args:
            vec1: First vector
            vec2: Second vector

        Returns:
            Cosine similarity score
        """
        vec1 = np.array(vec1)
        vec2 = np.array(vec2)

        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)

        if norm1 == 0 or norm2 == 0:
            return 0.0

        return dot_product / (norm1 * norm2)

    def _load_document_index(self) -> Dict[str, Dict[str, Any]]:
        """
        Load the document index from disk.

        Returns:
            Dictionary of document IDs to metadata
        """
        if os.path.exists(self.document_index_path):
            try:
                with open(self.document_index_path, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except Exception as e:
                print(f"Error loading document index: {str(e)}")

        return {}
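Putting retrieval and generation together, the core query path is roughly the following sketch; it assumes documents have already been processed into the given `processed_dir` (the relative path is an assumption).

```python
from src.rag.retriever import Retriever
from src.rag.generator import Generator

retriever = Retriever(processed_dir="data/processed")
generator = Generator()

query = "Hva er hovedstaden i Norge?"
hits = retriever.retrieve(query)   # ranked by cosine similarity, thresholded
print(generator.generate(query, hits))
```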
src/web/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
Web interface module for Norwegian RAG chatbot.
"""
src/web/app.py
ADDED
@@ -0,0 +1,301 @@
"""
Gradio app for Norwegian RAG chatbot.
Provides a web interface for interacting with the chatbot.
"""

import os
import gradio as gr
import tempfile
from typing import List, Dict, Any, Tuple, Optional

from ..api.huggingface_api import HuggingFaceAPI
from ..document_processing.processor import DocumentProcessor
from ..rag.retriever import Retriever
from ..rag.generator import Generator


class ChatbotApp:
    """
    Gradio app for Norwegian RAG chatbot.
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
        document_processor: Optional[DocumentProcessor] = None,
        retriever: Optional[Retriever] = None,
        generator: Optional[Generator] = None,
        title: str = "Norwegian RAG Chatbot",
        description: str = "En chatbot basert på Retrieval-Augmented Generation (RAG) for norsk språk."
    ):
        """
        Initialize the chatbot app.

        Args:
            api_client: HuggingFaceAPI client
            document_processor: Document processor
            retriever: Retriever for finding relevant chunks
            generator: Generator for creating responses
            title: App title
            description: App description
        """
        # Initialize components
        self.api_client = api_client or HuggingFaceAPI()
        self.document_processor = document_processor or DocumentProcessor(api_client=self.api_client)
        self.retriever = retriever or Retriever(api_client=self.api_client)
        self.generator = generator or Generator(api_client=self.api_client)

        # App settings
        self.title = title
        self.description = description

        # Initialize Gradio app
        self.app = self._build_interface()

    def _build_interface(self) -> gr.Blocks:
        """
        Build the Gradio interface.

        Returns:
            Gradio Blocks interface
        """
        with gr.Blocks(title=self.title) as app:
            gr.Markdown(f"# {self.title}")
            gr.Markdown(self.description)

            with gr.Tabs():
                # Chat tab
                with gr.Tab("Chat"):
                    chatbot = gr.Chatbot(height=500)

                    with gr.Row():
                        msg = gr.Textbox(
                            placeholder="Skriv din melding her...",
                            show_label=False,
                            scale=9
                        )
                        submit_btn = gr.Button("Send", scale=1)

                    with gr.Accordion("Avanserte innstillinger", open=False):
                        temperature = gr.Slider(
                            minimum=0.1,
                            maximum=1.0,
                            value=0.7,
                            step=0.1,
                            label="Temperatur"
                        )

                    clear_btn = gr.Button("Tøm chat")

                    # Set up event handlers
                    submit_btn.click(
                        fn=self._respond,
                        inputs=[msg, chatbot, temperature],
                        outputs=[msg, chatbot]
                    )

                    msg.submit(
                        fn=self._respond,
                        inputs=[msg, chatbot, temperature],
                        outputs=[msg, chatbot]
                    )

                    clear_btn.click(
                        fn=lambda: None,
                        inputs=None,
                        outputs=chatbot,
                        queue=False
                    )

                # Document upload tab
                with gr.Tab("Last opp dokumenter"):
                    with gr.Row():
                        with gr.Column(scale=2):
                            file_output = gr.File(label="Opplastede dokumenter")
                            upload_button = gr.UploadButton(
                                "Klikk for å laste opp dokument",
                                file_types=["pdf", "txt", "html"],
                                file_count="multiple"
                            )

                        with gr.Column(scale=3):
                            documents_list = gr.Dataframe(
                                headers=["Dokument ID", "Filnavn", "Dato", "Chunks"],
                                label="Dokumentliste",
                                interactive=False
                            )

                    process_status = gr.Textbox(label="Status", interactive=False)
                    refresh_btn = gr.Button("Oppdater dokumentliste")

                    # Set up event handlers
                    upload_button.upload(
                        fn=self._process_uploaded_files,
                        inputs=[upload_button],
                        outputs=[process_status, documents_list]
                    )

                    refresh_btn.click(
                        fn=self._get_documents_list,
                        inputs=None,
                        outputs=[documents_list]
                    )

                # Embed tab
                with gr.Tab("Integrer"):
                    gr.Markdown("## Integrer chatboten på din nettside")

                    with gr.Row():
                        with gr.Column():
                            gr.Markdown("### iFrame-kode")
                            iframe_code = gr.Code(
                                label="iFrame",
                                language="html",
                                value='<iframe src="https://huggingface.co/spaces/username/norwegian-rag-chatbot" width="100%" height="500px"></iframe>'
                            )

                        with gr.Column():
                            gr.Markdown("### JavaScript Widget")
                            js_code = gr.Code(
                                label="JavaScript",
                                language="html",
                                value='<script src="https://huggingface.co/spaces/username/norwegian-rag-chatbot/widget.js"></script>'
                            )

                    gr.Markdown("### Forhåndsvisning")
                    gr.Markdown("*Forhåndsvisning vil være tilgjengelig etter at chatboten er distribuert til Hugging Face Spaces.*")

            gr.Markdown("---")
            gr.Markdown("Bygget med [Hugging Face](https://huggingface.co/) og [Gradio](https://gradio.app/)")

        return app

    def _respond(
        self,
        message: str,
        chat_history: List[Tuple[str, str]],
        temperature: float
    ) -> Tuple[str, List[Tuple[str, str]]]:
        """
        Generate a response to the user message.

        Args:
            message: User message
            chat_history: Chat history
            temperature: Temperature for text generation

        Returns:
            Empty message and updated chat history
        """
        if not message:
            return "", chat_history

        # Add user message to chat history
        chat_history.append((message, None))

        try:
            # Retrieve relevant chunks
            retrieved_chunks = self.retriever.retrieve(message)

            # Generate response
            response = self.generator.generate(
                query=message,
                retrieved_chunks=retrieved_chunks,
                temperature=temperature
            )

            # Update chat history with response
            chat_history[-1] = (message, response)
        except Exception as e:
            # Handle errors
            error_message = f"Beklager, det oppstod en feil: {str(e)}"
            chat_history[-1] = (message, error_message)

        return "", chat_history

    def _process_uploaded_files(
        self,
        files: List[tempfile._TemporaryFileWrapper]
    ) -> Tuple[str, List[List[str]]]:
        """
        Process uploaded files.

        Args:
            files: List of uploaded files

        Returns:
            Status message and updated documents list
        """
        if not files:
            return "Ingen filer lastet opp.", self._get_documents_list()

        processed_files = []

        for file in files:
            try:
                # Process the document
                document_id = self.document_processor.process_document(file.name)
                processed_files.append(os.path.basename(file.name))
            except Exception as e:
                return f"Feil ved behandling av {os.path.basename(file.name)}: {str(e)}", self._get_documents_list()

        if len(processed_files) == 1:
            status = f"Fil behandlet: {processed_files[0]}"
        else:
            status = f"{len(processed_files)} filer behandlet: {', '.join(processed_files)}"

        return status, self._get_documents_list()

    def _get_documents_list(self) -> List[List[str]]:
        """
        Get list of processed documents.

        Returns:
            List of document information
        """
        documents = self.document_processor.get_all_documents()

        # Format for dataframe
        documents_list = []
        for doc_id, metadata in documents.items():
            filename = metadata.get("filename", "N/A")
            processed_date = metadata.get("processed_date", "N/A")
            chunk_count = metadata.get("chunk_count", 0)

            documents_list.append([doc_id, filename, processed_date, chunk_count])

        return documents_list

    def launch(self, **kwargs):
        """
        Launch the Gradio app.

        Args:
            **kwargs: Additional arguments for gr.launch()
        """
        self.app.launch(**kwargs)


def create_app():
    """
    Create and configure the chatbot app.

    Returns:
        Configured ChatbotApp instance
    """
    # Initialize API client
    api_client = HuggingFaceAPI()

    # Initialize components
    document_processor = DocumentProcessor(api_client=api_client)
    retriever = Retriever(api_client=api_client)
    generator = Generator(api_client=api_client)

    # Create app
    app = ChatbotApp(
        api_client=api_client,
        document_processor=document_processor,
        retriever=retriever,
        generator=generator
    )

    return app
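As a usage note, the `create_app()` factory above allows launching the interface directly, without going through `src/main.py`; the host/port values here are illustrative.

```python
from src.web.app import create_app

create_app().launch(server_name="0.0.0.0", server_port=7860)
```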
src/web/embed.py
ADDED
@@ -0,0 +1,211 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
Website embedding functionality for the Norwegian RAG chatbot.
Provides utilities for embedding the chat interface in external websites
(distinct from the text embeddings used for retrieval).
"""

import html
from typing import Dict, Optional


class EmbedGenerator:
    """
    Generates embed code for integrating the chatbot into external websites.
    """

    def __init__(
        self,
        space_name: Optional[str] = None,
        username: Optional[str] = None,
        height: int = 500,
        width: str = "100%"
    ):
        """
        Initialize the embed generator.

        Args:
            space_name: Hugging Face Space name
            username: Hugging Face username
            height: Default iframe height in pixels
            width: Default iframe width (pixels or percentage)
        """
        self.space_name = space_name or "norwegian-rag-chatbot"
        self.username = username or "username"
        self.height = height
        self.width = width

    def get_iframe_code(
        self,
        height: Optional[int] = None,
        width: Optional[str] = None
    ) -> str:
        """
        Generate iframe embed code.

        Args:
            height: Optional custom height in pixels
            width: Optional custom width

        Returns:
            HTML iframe code
        """
        h = height or self.height
        w = width or self.width

        return f'<iframe src="https://huggingface.co/spaces/{self.username}/{self.space_name}" width="{w}" height="{h}" frameborder="0"></iframe>'

    def get_javascript_widget_code(self) -> str:
        """
        Generate JavaScript widget embed code.

        Returns:
            HTML script tag for the widget
        """
        return f'<script src="https://huggingface.co/spaces/{self.username}/{self.space_name}/widget.js"></script>'

    def get_direct_url(self) -> str:
        """
        Get the direct URL to the Hugging Face Space.

        Returns:
            URL to the Hugging Face Space
        """
        return f"https://huggingface.co/spaces/{self.username}/{self.space_name}"

    def get_embed_options(self) -> Dict[str, str]:
        """
        Get all embedding options.

        Returns:
            Dictionary mapping option names ("iframe", "javascript", "url") to snippets
        """
        return {
            "iframe": self.get_iframe_code(),
            "javascript": self.get_javascript_widget_code(),
            "url": self.get_direct_url()
        }

    def update_space_info(self, username: str, space_name: str) -> None:
        """
        Update Hugging Face Space information.

        Args:
            username: Hugging Face username
            space_name: Hugging Face Space name
        """
        self.username = username
        self.space_name = space_name


def create_embed_html_file(
    embed_generator: EmbedGenerator,
    output_path: str = "embed_example.html"
) -> str:
    """
    Create an HTML file with embedding examples.

    Args:
        embed_generator: EmbedGenerator instance
        output_path: Path to save the HTML file

    Returns:
        Path to the created HTML file
    """
    # Escape the snippets shown inside <pre> blocks so the browser displays
    # them as text instead of rendering them.
    iframe_snippet = html.escape(embed_generator.get_iframe_code())
    widget_snippet = html.escape(embed_generator.get_javascript_widget_code())
    custom_snippet = html.escape(embed_generator.get_iframe_code(height=600, width="80%"))

    html_content = f"""<!DOCTYPE html>
<html lang="no">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Norwegian RAG Chatbot - Embedding Examples</title>
    <style>
        body {{
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
        }}
        h1, h2, h3 {{
            color: #2c3e50;
        }}
        .code-block {{
            background-color: #f8f9fa;
            border: 1px solid #ddd;
            border-radius: 4px;
            padding: 15px;
            margin: 15px 0;
            overflow-x: auto;
        }}
        .example {{
            margin: 30px 0;
            padding: 20px;
            border: 1px solid #eee;
            border-radius: 5px;
        }}
    </style>
</head>
<body>
    <h1>Norwegian RAG Chatbot - Embedding Examples</h1>

    <p>
        This page demonstrates how to embed the Norwegian RAG Chatbot into your website.
        There are multiple ways to integrate the chatbot, depending on your needs.
    </p>

    <h2>Option 1: iFrame Embedding</h2>
    <p>
        The simplest way to embed the chatbot is using an iFrame. Copy and paste the following code into your HTML:
    </p>
    <div class="code-block">
        <pre>{iframe_snippet}</pre>
    </div>

    <div class="example">
        <h3>Example:</h3>
        {embed_generator.get_iframe_code()}
    </div>

    <h2>Option 2: JavaScript Widget</h2>
    <p>
        For a more integrated experience, you can use the JavaScript widget. Copy and paste the following code into your HTML:
    </p>
    <div class="code-block">
        <pre>{widget_snippet}</pre>
    </div>

    <div class="example">
        <h3>Example:</h3>
        <p>The widget will appear below once the page is hosted on a web server:</p>
        <!-- Widget will be inserted here when the script runs -->
    </div>

    <h2>Option 3: Direct Link</h2>
    <p>
        You can also provide a direct link to the chatbot:
    </p>
    <div class="code-block">
        <pre>{embed_generator.get_direct_url()}</pre>
    </div>

    <h2>Customization</h2>
    <p>
        You can customize the appearance of the embedded chatbot by adjusting the iFrame dimensions:
    </p>
    <div class="code-block">
        <pre>{custom_snippet}</pre>
    </div>

    <footer>
        <p>
            <small>
                Created with <a href="https://huggingface.co/" target="_blank">Hugging Face</a> and
                <a href="https://gradio.app/" target="_blank">Gradio</a>.
            </small>
        </p>
    </footer>
</body>
</html>
"""

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html_content)

    return output_path
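
For illustration, a short sketch of how `EmbedGenerator` and `create_embed_html_file` might be exercised once a Space is published. The `myuser` username is a placeholder; no such Space is assumed to exist, and this snippet is not part of the commit.

```python
# Sketch: generate embed snippets for a hypothetical published Space.
from src.web.embed import EmbedGenerator, create_embed_html_file

generator = EmbedGenerator(username="myuser", space_name="norwegian-rag-chatbot")

# Individual snippets
print(generator.get_iframe_code(height=600, width="80%"))
print(generator.get_direct_url())

# Or write a self-contained demo page covering all three options
path = create_embed_html_file(generator, output_path="embed_example.html")
print(f"Embedding examples written to {path}")
```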
todo.md
ADDED
@@ -0,0 +1,26 @@
# Norwegian RAG Chatbot Project Todo

## Research Phase
- [x] Research open-source LLMs with good Norwegian language support
- [x] Evaluate embedding models for Norwegian text
- [x] Research vector database options for RAG implementation
- [x] Document findings and select best options

## Design Phase
- [x] Design RAG architecture
- [x] Plan document processing pipeline
- [x] Design chat interface
- [x] Plan embedding functionality

## Implementation Phase
- [ ] Set up development environment
- [ ] Implement document processing and embedding
- [ ] Integrate LLM
- [ ] Create chat interface
- [ ] Develop embedding functionality

## Testing and Finalization
- [ ] Test with Norwegian content
- [ ] Optimize performance
- [ ] Document usage and integration
- [ ] Finalize solution