hevold committed on
Commit b34efa5 · verified · 1 Parent(s): 1fa38f0

Upload 29 files

README.md CHANGED
@@ -1,13 +1,119 @@
  ---
- title: Iver
- emoji: 💬
- colorFrom: yellow
- colorTo: purple
  sdk: gradio
- sdk_version: 5.0.1
- app_file: app.py
- pinned: false
  license: mit
- ---
-
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).

+ # Norwegian RAG Chatbot
+
+ A Retrieval-Augmented Generation (RAG) chatbot with strong Norwegian language support, built on Hugging Face's Inference API.
+
+ ## Features
+
+ - **Norwegian Language Support**: Leverages state-of-the-art Norwegian language models such as NorMistral, Viking, and NorskGPT
+ - **Document Processing**: Upload and process documents in various formats (PDF, TXT, HTML)
+ - **RAG Implementation**: Retrieves relevant context from uploaded documents to ground the generated responses
+ - **Embeddable Interface**: Easily embed the chatbot in any website using an iframe or a JavaScript widget
+ - **Lightweight Architecture**: Uses Hugging Face's Inference API instead of running models locally
+
+ ## Architecture
+
+ This chatbot uses a lightweight architecture that leverages Hugging Face's hosted models, as sketched below:
+
+ 1. **Document Processing**: Documents are processed locally; text is extracted and split into chunks
+ 2. **Embedding Generation**: Document chunks are embedded using Hugging Face's Inference API
+ 3. **Retrieval**: When a query is received, the most relevant document chunks are retrieved
+ 4. **Response Generation**: The LLM generates a response based on the retrieved context
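+
+ A simplified sketch of this flow (the function names here are illustrative, not the project's actual API):
+
+ ```python
+ # Hypothetical end-to-end RAG flow; the real implementation lives under src/.
+ def answer(query: str) -> str:
+     query_embedding = embed(query)               # embed the query via the Inference API
+     chunks = retrieve(query_embedding, top_k=5)  # find the nearest document chunks
+     prompt = build_prompt(query, chunks)         # assemble retrieved context + question
+     return generate(prompt)                      # generate the answer with the LLM
+ ```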
+
+ ## Getting Started
+
+ ### Prerequisites
+
+ - Python 3.10+
+ - A Hugging Face account (for API access)
+
+ ### Installation
+
+ 1. Clone the repository:
+ ```bash
+ git clone https://huggingface.co/spaces/username/norwegian-rag-chatbot
+ cd norwegian-rag-chatbot
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements-ultra-light.txt
+ ```
+
+ 3. Set up your Hugging Face API key:
+ ```bash
+ export HF_API_KEY="your_api_key_here"
+ ```
+
+ ### Running the Chatbot
+
+ ```bash
+ python src/main.py
+ ```
+
+ The chatbot will be available at http://localhost:7860
+
+ ## Usage
+
+ ### Chat Interface
+
+ The main chat interface allows you to:
+ - Ask questions in Norwegian
+ - Receive responses grounded in your uploaded documents
+ - Adjust temperature and other generation settings
+
+ ### Document Upload
+
+ You can upload documents to provide context for the chatbot:
+ - Supported formats: PDF, TXT, HTML
+ - Documents are automatically processed and indexed
+ - The chatbot uses these documents to provide more accurate responses
+
+ ### Embedding
+
+ You can embed the chatbot in your website using:
+ - iFrame embedding
+ - JavaScript widget
+ - Direct link
+
+ ## Deployment
+
+ The chatbot is designed to be deployed to Hugging Face Spaces:
+
+ 1. Create a new Space on Hugging Face
+ 2. Upload the code to the Space
+ 3. Set the HF_API_KEY secret in the Space settings
+ 4. The Space will automatically build and deploy the chatbot
+
+ ## Models
+
+ The chatbot can use various Norwegian language models:
+
+ - **NorMistral-7b-scratch**: A large Norwegian language model pretrained from scratch
+ - **Viking 7B**: A multilingual model for the Nordic languages
+ - **NorskGPT**: A Norwegian language model based on Mistral or LLaMA 2
+
+ For embeddings, it uses:
+ - **NbAiLab/nb-sbert-base**: A Norwegian sentence embedding model
+
+ ## License
+
+ This project is licensed under the MIT License - see the LICENSE file for details.
+
+ ## Acknowledgements
+
+ - [Hugging Face](https://huggingface.co/) for hosting the models and providing the Inference API
+ - [Gradio](https://gradio.app/) for the web interface framework
+ - The creators of the Norwegian language models used in this project
+
  ---
+
+ name: norwegian-rag-chatbot
+ title: Norwegian RAG Chatbot
+ emoji: 🇳🇴
+ colorFrom: blue
+ colorTo: red
  sdk: gradio
+ sdk_version: 4.0.0
+ app_file: src/main.py
+ pinned: true
  license: mit
 
 
 
app.yaml ADDED
@@ -0,0 +1,25 @@
+ sdk:
+   base_image: python:3.10
+   build_commands:
+     - pip install -r requirements-ultra-light.txt
+   python_packages:
+     - gradio>=4.0.0
+     - huggingface_hub>=0.19.0
+     - requests>=2.31.0
+     - numpy>=1.24.0
+     - PyPDF2>=3.0.0
+     - beautifulsoup4>=4.12.0
+
+ app:
+   title: Norwegian RAG Chatbot
+   emoji: 🇳🇴
+   colorPrimary: "#00205B"
+   colorSecondary: "#EF2B2D"
+   pinned: true
+   sdk: gradio
+   python_version: "3.10"
+   suggested_hardware: cpu-basic
+   models:
+     - norallm/normistral-7b-scratch
+     - NbAiLab/nb-sbert-base
+   spaces_server_url: https://api-inference.huggingface.co/models/
data/documents/.gitkeep ADDED
File without changes
data/documents/test_document.txt ADDED
@@ -0,0 +1,25 @@
+ # Norsk historie
+
+ Norge har en rik og fascinerende historie som strekker seg tilbake til vikingtiden. Vikingene var kjent for sine sjøreiser, handel og plyndring i store deler av Europa fra slutten av 700-tallet til midten av 1000-tallet.
+
+ ## Middelalderen
+
+ I 1030 døde Olav Haraldsson (senere kjent som Olav den hellige) i slaget ved Stiklestad. Hans død markerte begynnelsen på kristendommens endelige gjennombrudd i Norge.
+
+ Norge ble forent til ett rike under Harald Hårfagre på 800-tallet. Etter vikingtiden fulgte en periode med borgerkrig før landet ble stabilisert under Håkon Håkonsson på 1200-tallet.
+
+ ## Union med Danmark
+
+ Fra 1380 til 1814 var Norge i union med Danmark, en periode kjent som "dansketiden". Under denne perioden ble dansk det offisielle språket i administrasjon og litteratur, noe som hadde stor innflytelse på det norske språket.
+
+ ## Grunnloven og union med Sverige
+
+ I 1814 fikk Norge sin egen grunnlov, signert på Eidsvoll 17. mai. Samme år ble Norge tvunget inn i en union med Sverige, som varte frem til 1905.
+
+ ## Moderne Norge
+
+ Norge ble okkupert av Nazi-Tyskland under andre verdenskrig fra 1940 til 1945. Etter krigen opplevde landet rask økonomisk vekst.
+
+ Oppdagelsen av olje i Nordsjøen på slutten av 1960-tallet forvandlet Norge til en av verdens rikeste nasjoner per innbygger.
+
+ I dag er Norge kjent for sin velferdsstat, naturskjønnhet og høy levestandard.
data/processed/.gitkeep ADDED
File without changes
design/chat_interface.md ADDED
@@ -0,0 +1,256 @@
+ # Chat Interface Design
+
+ This document outlines the design for the chat interface of our Norwegian RAG-based chatbot. The interface will be implemented using Gradio and deployed on Hugging Face Spaces.
+
+ ## Interface Requirements
+
+ ### Functional Requirements
+
+ 1. **Chat Interaction**:
+    - Text input field for user queries
+    - Response display area for chatbot answers
+    - Support for multi-turn conversations
+    - Message history display
+
+ 2. **Document Management**:
+    - Document upload functionality
+    - Document list display
+    - Status indicators for processing
+
+ 3. **Configuration Options**:
+    - Model selection (if multiple models are supported)
+    - Language selection (Norwegian/English toggle)
+    - Advanced parameter adjustment (optional)
+
+ 4. **Embedding Functionality**:
+    - Code snippet generation for embedding
+    - Preview of the embedded widget
+    - Copy-to-clipboard functionality
+
+ ### Non-Functional Requirements
+
+ 1. **Responsiveness**:
+    - Mobile-friendly design
+    - Adaptive layout for different screen sizes
+
+ 2. **Performance**:
+    - Efficient loading times
+    - Progress indicators for long operations
+    - Streaming responses for a better user experience
+
+ 3. **Accessibility**:
+    - WCAG 2.1 compliance
+    - Keyboard navigation support
+    - Screen reader compatibility
+
+ 4. **Multilingual Support**:
+    - Norwegian as the primary language
+    - English as the secondary language
+    - Language detection and switching
+
+ ## UI Design
+
+ ### Main Chat Interface
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │ Norwegian RAG Chatbot [🇳🇴/🇬🇧] │
+ ├─────────────────────────────────────────────────────────────┤
+ │ │
+ │ ┌─────────────────────────────────────────────────────┐ │
+ │ │ │ │
+ │ │ Chat History Display │ │
+ │ │ │ │
+ │ │ ┌─────────────────────────────────────────────┐ │ │
+ │ │ │ Bot: Hei! Hvordan kan jeg hjelpe deg i dag? │ │ │
+ │ │ └─────────────────────────────────────────────┘ │ │
+ │ │ │ │
+ │ │ ┌─────────────────────────────────────────────┐ │ │
+ │ │ │ User: Fortell meg om norsk historie. │ │ │
+ │ │ └─────────────────────────────────────────────┘ │ │
+ │ │ │ │
+ │ │ ┌─────────────────────────────────────────────┐ │ │
+ │ │ │ Bot: Norsk historie strekker seg... │ │ │
+ │ │ └─────────────────────────────────────────────┘ │ │
+ │ │ │ │
+ │ └─────────────────────────────────────────────────────┘ │
+ │ │
+ │ ┌─────────────────────────────────────────────────────┐ │
+ │ │ Type your message... [Send] │ │
+ │ └─────────────────────────────────────────────────────┘ │
+ │ │
+ │ [Clear Chat] [Settings] [Upload Documents] [Embed] │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Document Upload Interface
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │ Document Management [Close] │
+ ├─────────────────────────────────────────────────────────────┤
+ │ │
+ │ [Upload New Document] │
+ │ │
+ │ ┌─────────────────────────────────────────────────────┐ │
+ │ │ Document List │ │
+ │ │ │ │
+ │ │ ┌─────────────────────────────────────────────┐ │ │
+ │ │ │ norsk_historie.pdf [Remove] │ │ │
+ │ │ │ Status: Processed ✓ │ │ │
+ │ │ └─────────────────────────────────────────────┘ │ │
+ │ │ │ │
+ │ │ ┌─────────────────────────────────────────────┐ │ │
+ │ │ │ vikinger.docx [Remove] │ │ │
+ │ │ │ Status: Processing... 75% │ │ │
+ │ │ └─────────────────────────────────────────────┘ │ │
+ │ │ │ │
+ │ └─────────────────────────────────────────────────────┘ │
+ │ │
+ │ [Process All] [Remove All] │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Embed Code Interface
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │ Embed Chatbot [Close] │
+ ├─────────────────────────────────────────────────────────────┤
+ │ │
+ │ ┌─────────────────────────────────────────────────────┐ │
+ │ │ Embed Code (iFrame) │ │
+ │ │ │ │
+ │ │ <iframe src="https://huggingface.co/spaces/... │ │
+ │ │ │ │
+ │ └─────────────────────────────────────────────────────┘ │
+ │ │
+ │ [Copy to Clipboard] │
+ │ │
+ │ ┌─────────────────────────────────────────────────────┐ │
+ │ │ Embed Code (JavaScript Widget) │ │
+ │ │ │ │
+ │ │ <script src="https://huggingface.co/spaces/... │ │
+ │ │ │ │
+ │ └─────────────────────────────────────────────────────┘ │
+ │ │
+ │ [Copy to Clipboard] │
+ │ │
+ │ ┌─────────────────────────────────────────────────────┐ │
+ │ │ Preview │ │
+ │ │ │ │
+ │ │ │ │
+ │ └─────────────────────────────────────────────────────┘ │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ## Implementation with Gradio
+
+ Gradio is an ideal choice for implementing this interface due to its simplicity, Python integration, and native support on Hugging Face Spaces.
+
+ ### Core Components
+
+ 1. **Chat Interface**:
+ ```python
+ with gr.Blocks() as demo:
+     chatbot = gr.Chatbot()
+     msg = gr.Textbox(label="Message")
+     clear = gr.Button("Clear")
+
+     def respond(message, chat_history):
+         # RAG processing logic here
+         bot_message = get_rag_response(message)
+         chat_history.append((message, bot_message))
+         return "", chat_history
+
+     msg.submit(respond, [msg, chatbot], [msg, chatbot])
+     clear.click(lambda: None, None, chatbot, queue=False)
+ ```
+
+ 2. **Document Upload**:
+ ```python
+ with gr.Tab("Upload Documents"):
+     file_output = gr.File()
+     upload_button = gr.UploadButton("Click to Upload a File", file_types=["pdf", "docx", "txt"])
+
+     def upload_file(file):
+         # Document processing logic here
+         process_document(file.name)
+         return file.name
+
+     upload_button.upload(upload_file, upload_button, file_output)
+ ```
+
+ 3. **Embedding Code Generation**:
+ ```python
+ with gr.Tab("Embed"):
+     iframe_code = gr.Textbox(label="iFrame Embed Code")
+     js_code = gr.Textbox(label="JavaScript Widget Code")
+
+     def generate_embed_code():
+         iframe = f'<iframe src="{SPACE_URL}" width="100%" height="500px"></iframe>'
+         js = f'<script src="{SPACE_URL}/widget.js"></script>'
+         return iframe, js
+
+     embed_button = gr.Button("Generate Embed Code")
+     embed_button.click(generate_embed_code, None, [iframe_code, js_code])
+ ```
+
+ ## Norwegian Language Support
+
+ 1. **Interface Localization**:
+    - Implement language switching functionality
+    - Store UI text in language-specific dictionaries (see the sketch below)
+    - Apply translations based on the selected language
+
+ 2. **Input Processing**:
+    - Handle Norwegian special characters correctly
+    - Implement Norwegian-specific text normalization
+
+ 3. **Response Generation**:
+    - Ensure proper formatting of Norwegian text
+    - Handle Norwegian grammar and syntax correctly
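+
+ A minimal sketch of the localization dictionaries (the strings and helper are illustrative):
+
+ ```python
+ # UI strings per language, applied when the user toggles 🇳🇴/🇬🇧.
+ UI_TEXT = {
+     "no": {"clear": "Tøm samtale", "placeholder": "Skriv meldingen din..."},
+     "en": {"clear": "Clear chat", "placeholder": "Type your message..."},
+ }
+
+ def t(key: str, lang: str = "no") -> str:
+     """Look up a UI string, falling back to Norwegian."""
+     return UI_TEXT.get(lang, UI_TEXT["no"]).get(key, key)
+ ```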
+
+ ## Responsive Design
+
+ 1. **CSS Customization**:
+ ```python
+ with gr.Blocks(css="""
+     @media (max-width: 600px) {
+         .container { padding: 5px; }
+         .input-box { font-size: 14px; }
+     }
+ """) as demo:
+     # Interface components
+     ...
+ ```
+
+ 2. **Layout Adaptation**:
+    - Use flexible layouts that adapt to the screen size
+    - Implement collapsible sections for the mobile view
+    - Ensure touch-friendly UI elements
+
+ ## Deployment on Hugging Face Spaces
+
+ 1. **Space Configuration**:
+    - Create a `requirements.txt` file with all dependencies
+    - Set up appropriate environment variables
+    - Configure resource allocation
+
+ 2. **Continuous Integration**:
+    - Set up a GitHub repository for the project
+    - Configure automatic deployment to Hugging Face Spaces
+    - Implement version control for the interface
+
+ 3. **Monitoring and Analytics**:
+    - Add usage tracking
+    - Implement error logging
+    - Set up performance monitoring
+
+ ## Next Steps
+
+ 1. Implement the basic chat interface with Gradio
+ 2. Add document upload and processing functionality
+ 3. Create the embedding code generation feature
+ 4. Implement responsive design and language switching
+ 5. Deploy to Hugging Face Spaces for testing
+ 6. Gather feedback and iterate on the design
design/document_processing.md ADDED
@@ -0,0 +1,170 @@
+ # Document Processing Pipeline Design
+
+ This document outlines the design for the document processing pipeline of our Norwegian RAG-based chatbot. The pipeline transforms raw documents into embeddings that can be efficiently retrieved during the chat process.
+
+ ## Pipeline Overview
+
+ ```
+ Raw Documents → Text Extraction → Text Chunking → Text Cleaning → Embedding Generation → Vector Storage
+ ```
+
+ ## Components
+
+ ### 1. Text Extraction
+
+ **Purpose**: Extract plain text from various document formats.
+
+ **Supported Formats**:
+ - PDF (.pdf)
+ - Word documents (.docx, .doc)
+ - Text files (.txt)
+ - HTML (.html, .htm)
+ - Markdown (.md)
+
+ **Implementation**:
+ - Use PyPDF2 for PDF extraction
+ - Use python-docx for Word documents
+ - Use BeautifulSoup for HTML parsing
+ - Direct reading for text and markdown files
+
+ ### 2. Text Chunking
+
+ **Purpose**: Split documents into manageable chunks for more precise retrieval.
+
+ **Chunking Strategies**:
+ - Fixed-size chunks (512 tokens recommended for Norwegian text)
+ - Semantic chunking (split at paragraph or section boundaries)
+ - Overlapping chunks (100-token overlap recommended)
+
+ **Implementation** (see the sketch below):
+ - Use LangChain's text splitters
+ - Implement custom Norwegian-aware chunking logic
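+
+ A minimal sketch of fixed-size chunking with overlap (character-based for brevity; a token-based splitter would count tokens instead):
+
+ ```python
+ def chunk_text(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
+     chunks = []
+     start = 0
+     while start < len(text):
+         end = min(start + chunk_size, len(text))
+         chunks.append(text[start:end])
+         if end == len(text):
+             break
+         start = end - overlap  # step back to create the overlap
+     return chunks
+ ```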
+
+ ### 3. Text Cleaning
+
+ **Purpose**: Normalize and clean text to improve embedding quality.
+
+ **Cleaning Operations**:
+ - Remove excessive whitespace
+ - Normalize Norwegian characters (æ, ø, å)
+ - Remove irrelevant content (headers, footers, page numbers)
+ - Handle special characters and symbols
+
+ **Implementation** (see the sketch below):
+ - Custom text cleaning functions
+ - Norwegian-specific normalization rules
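+
+ A minimal sketch of the cleaning step (whitespace collapsing plus Unicode NFC composition so that æ, ø, å are encoded consistently):
+
+ ```python
+ import re
+ import unicodedata
+
+ def clean_text(text: str) -> str:
+     # Compose Norwegian characters consistently (æ, ø, å)
+     text = unicodedata.normalize("NFC", text)
+     # Collapse whitespace runs left over from extraction
+     text = re.sub(r"\s+", " ", text)
+     return text.strip()
+ ```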
+
+ ### 4. Embedding Generation
+
+ **Purpose**: Generate vector representations of text chunks.
+
+ **Embedding Model**:
+ - Primary: NbAiLab/nb-sbert-base (768 dimensions)
+ - Alternative: FFI/SimCSE-NB-BERT-large
+
+ **Implementation** (see the sketch below):
+ - Use the sentence-transformers library
+ - Batch processing for efficiency
+ - Caching mechanism for frequently embedded chunks
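+
+ A minimal sketch with sentence-transformers (here `chunks` is the list of cleaned chunk strings from the previous steps):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("NbAiLab/nb-sbert-base")
+ # Batched encoding; normalized vectors make inner product equal cosine similarity
+ embeddings = model.encode(chunks, batch_size=32, normalize_embeddings=True)
+ ```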
+
+ ### 5. Vector Storage
+
+ **Purpose**: Store and index embeddings for efficient retrieval.
+
+ **Storage Options**:
+ - Primary: FAISS (Facebook AI Similarity Search)
+ - Alternative: Milvus (for larger deployments)
+
+ **Implementation** (see the sketch below):
+ - FAISS IndexFlatIP (inner product) for cosine similarity
+ - Metadata storage for mapping vectors back to the original text
+ - Serialization for persistence
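+
+ A minimal FAISS sketch following the choices above (the index path is illustrative):
+
+ ```python
+ import faiss
+ import numpy as np
+
+ dim = 768  # NbAiLab/nb-sbert-base embedding size
+ index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors
+
+ vectors = np.asarray(embeddings, dtype="float32")
+ faiss.normalize_L2(vectors)
+ index.add(vectors)
+
+ faiss.write_index(index, "data/processed/index.faiss")  # persistence
+ ```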
+
+ ## Processing Flow
+
+ 1. **Document Ingestion**:
+    - Accept documents via the upload interface
+    - Store original documents in a document store
+    - Extract document metadata (title, date, source)
+
+ 2. **Processing Pipeline Execution**:
+    - Process documents through the pipeline components
+    - Track processing status and errors
+    - Generate unique IDs for each chunk
+
+ 3. **Index Management**:
+    - Create and update vector indices
+    - Implement versioning for indices
+    - Provide reindexing capabilities
+
+ ## Norwegian Language Considerations
+
+ - **Character Encoding**: Ensure proper handling of Norwegian characters (UTF-8)
+ - **Tokenization**: Use tokenizers that properly handle Norwegian word structures
+ - **Stopwords**: Implement Norwegian stopword filtering for improved retrieval
+ - **Stemming/Lemmatization**: Consider Norwegian-specific stemming or lemmatization
+
+ ## Implementation Plan
+
+ 1. Create the document processor class structure
+ 2. Implement text extraction for the different formats
+ 3. Develop chunking strategies optimized for Norwegian
+ 4. Build text cleaning and normalization functions
+ 5. Integrate with the embedding model
+ 6. Set up vector storage and retrieval mechanisms
+ 7. Create a unified API for the entire pipeline
+
+ ## Code Structure
+
+ ```python
+ # Example structure for the document processing pipeline
+
+ class DocumentProcessor:
+     def __init__(self, embedding_model, vector_store):
+         self.embedding_model = embedding_model
+         self.vector_store = vector_store
+
+     def process_document(self, document_path):
+         # Extract text based on document type
+         raw_text = self._extract_text(document_path)
+
+         # Split text into chunks
+         chunks = self._chunk_text(raw_text)
+
+         # Clean and normalize text chunks
+         cleaned_chunks = [self._clean_text(chunk) for chunk in chunks]
+
+         # Generate embeddings
+         embeddings = self._generate_embeddings(cleaned_chunks)
+
+         # Store in vector database
+         self._store_embeddings(embeddings, cleaned_chunks)
+
+     def _extract_text(self, document_path):
+         # Implementation for different document types
+         pass
+
+     def _chunk_text(self, text):
+         # Implementation of chunking strategy
+         pass
+
+     def _clean_text(self, text):
+         # Text normalization and cleaning
+         pass
+
+     def _generate_embeddings(self, chunks):
+         # Use embedding model to generate vectors
+         pass
+
+     def _store_embeddings(self, embeddings, chunks):
+         # Store in vector database with metadata
+         pass
+ ```
+
+ ## Next Steps
+
+ 1. Implement the document processor class
+ 2. Create test documents in Norwegian
+ 3. Evaluate chunking strategies for Norwegian text
+ 4. Benchmark embedding generation performance
+ 5. Test retrieval accuracy with Norwegian queries
design/rag_architecture.md ADDED
@@ -0,0 +1,197 @@
+ # RAG Architecture for Norwegian Chatbot
+
+ ## Overview
+
+ This document outlines the architecture for a Retrieval-Augmented Generation (RAG) chatbot optimized for Norwegian, designed to be hosted on Hugging Face. The architecture leverages open-source models with strong Norwegian language support and integrates with Hugging Face's infrastructure for seamless deployment.
+
+ ## System Components
+
+ ### 1. Language Model (LLM)
+
+ Based on our research, we recommend one of the following models:
+
+ **Primary Option: NorMistral-7b-scratch**
+ - Strong Norwegian language support
+ - Apache 2.0 license (allows commercial use)
+ - 7B parameters (a reasonable size for deployment)
+ - Good performance on Norwegian language tasks
+ - Available on Hugging Face
+
+ **Alternative Option: Viking 7B**
+ - Specifically designed for the Nordic languages
+ - Apache 2.0 license
+ - 4K context length
+ - Good multilingual capabilities (useful if the chatbot needs to handle some English queries)
+
+ **Fallback Option: NorskGPT-Mistral**
+ - Specifically designed for Norwegian
+ - Note: non-commercial license (cc-by-nc-sa-4.0)
+
+ ### 2. Embedding Model
+
+ **Recommended: NbAiLab/nb-sbert-base**
+ - Specifically trained for Norwegian
+ - 768-dimensional embeddings
+ - Good performance on sentence similarity tasks
+ - Works well with both Norwegian and English content
+ - Apache 2.0 license
+ - High download count on Hugging Face (41,370 last month)
+
+ ### 3. Vector Database
+
+ **Recommended: FAISS**
+ - Lightweight and efficient
+ - Easy integration with Hugging Face
+ - Can be packaged with the application
+ - Works well for moderate-sized document collections
+
+ **Alternative: Milvus**
+ - More scalable for larger document collections
+ - Well-documented integration with Hugging Face
+ - Better for production deployments with large document bases
+
+ ### 4. Document Processing Pipeline
+
+ 1. **Text Extraction**: Extract text from various document formats (PDF, DOCX, TXT)
+ 2. **Text Chunking**: Split documents into manageable chunks (recommended chunk size: 512 tokens)
+ 3. **Text Cleaning**: Remove irrelevant content and normalize the text
+ 4. **Embedding Generation**: Generate embeddings using NbAiLab/nb-sbert-base
+ 5. **Vector Storage**: Store embeddings in a FAISS index
+
+ ### 5. Retrieval Mechanism
+
+ 1. **Query Processing**: Process the user query
+ 2. **Query Embedding**: Generate an embedding for the query using the same embedding model
+ 3. **Similarity Search**: Find the most relevant document chunks using cosine similarity (see the sketch below)
+ 4. **Context Assembly**: Assemble the retrieved chunks into context for the LLM
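+
+ A minimal sketch of steps 2-3, assuming an embedder and FAISS index built by the document pipeline (the names are illustrative):
+
+ ```python
+ import faiss
+ import numpy as np
+
+ def retrieve(query: str, embedder, index, chunks: list[str], top_k: int = 5) -> list[str]:
+     # Embed the query with the same model used for the documents
+     q = np.asarray([embedder.encode(query)], dtype="float32")
+     faiss.normalize_L2(q)  # inner product then behaves as cosine similarity
+     scores, ids = index.search(q, top_k)
+     return [chunks[i] for i in ids[0] if i != -1]
+ ```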
+
+ ### 6. Generation Component
+
+ 1. **Prompt Construction**: Construct a prompt from the retrieved context and the user query (see the sketch below)
+ 2. **LLM Inference**: Generate a response using the LLM
+ 3. **Response Post-processing**: Format and clean the response
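+
+ A sketch of the prompt construction step (the template wording is illustrative; the version actually used lives in src/api/huggingface_api.py):
+
+ ```python
+ def build_prompt(query: str, context_chunks: list[str]) -> str:
+     context = "\n\n".join(context_chunks)
+     return (
+         "Du er en hjelpsom assistent som svarer på norsk. "
+         "Bruk følgende kontekst for å svare på spørsmålet.\n\n"
+         f"KONTEKST:\n{context}\n\n"
+         f"SPØRSMÅL:\n{query}\n\n"
+         "SVAR:"
+     )
+ ```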
+
+ ### 7. Chat Interface
+
+ 1. **Frontend**: Lightweight, responsive web interface
+ 2. **API Layer**: RESTful API for communication between the frontend and backend
+ 3. **Session Management**: Maintain conversation history
+
+ ## Hugging Face Integration
+
+ ### Deployment Options
+
+ 1. **Hugging Face Spaces**:
+    - Deploy the entire application as a Gradio or Streamlit app
+    - Provides a public URL for access
+    - Supports Git-based deployment
+
+ 2. **Model Hosting**:
+    - Host the fine-tuned LLM on the Hugging Face Model Hub
+    - Use the Hugging Face Inference API for model inference
+
+ 3. **Datasets**:
+    - Store and version document collections on Hugging Face Datasets
+
+ ### Implementation Approach
+
+ 1. **Gradio Interface**:
+    - Create a Gradio app for the chat interface
+    - Deploy to Hugging Face Spaces
+
+ 2. **Backend Processing**:
+    - Use the Hugging Face Transformers and Sentence-Transformers libraries
+    - Implement the document processing pipeline
+    - Set up FAISS for vector storage and retrieval
+
+ 3. **Model Integration**:
+    - Load models from the Hugging Face Model Hub
+    - Implement caching for better performance
+
+ ## Technical Architecture Diagram
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ Hugging Face Spaces │
+ └─────────────────────────────────────────────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ Web Interface │
+ │ │
+ │ ┌─────────────┐ ┌────────────┐ │
+ │ │ Gradio │ │ Session │ │
+ │ │ Interface │◄──────────────────────────────┤ Manager │ │
+ │ └─────────────┘ └────────────┘ │
+ └─────────────────────────────────────────────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ Backend Processing │
+ │ │
+ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
+ │ │ Query │ │ Retrieval │ │ Generation │ │
+ │ │ Processing │───►│ Engine │───►│ Engine │ │
+ │ └─────────────┘ └─────────────┘ └─────────────────────┘ │
+ │ │ ▲ │
+ │ ▼ │ │
+ │ ┌─────────────┐ │ │
+ │ │ FAISS │ │ │
+ │ │ Vector │ │ │
+ │ │ Store │ │ │
+ │ └─────────────┘ │ │
+ │ ▲ │ │
+ │ │ │ │
+ │ ┌─────────────────────────┴──────────────────────┴───────────┐ │
+ │ │ Document Processor │ │
+ │ └─────────────────────────────────────────────────────────────┘ │
+ └─────────────────────────────────────────────────────────────────┘
+
+
+ ┌─────────────────────────────────────────────────────────────────┐
+ │ Hugging Face Model Hub │
+ │ │
+ │ ┌─────────────────┐ ┌───────────────────┐ │
+ │ │ NbAiLab/ │ │ NorMistral- │ │
+ │ │ nb-sbert-base │ │ 7b-scratch │ │
+ │ │ (Embeddings) │ │ (LLM) │ │
+ │ └─────────────────┘ └───────────────────┘ │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ## Implementation Considerations
+
+ ### 1. Performance Optimization
+
+ - **Model Quantization**: Use GGUF or GPTQ quantized versions of the LLM to reduce memory requirements
+ - **Batch Processing**: Implement batch processing for document embedding generation
+ - **Caching**: Cache frequent queries and responses
+ - **Progressive Loading**: Implement progressive loading for large document collections
+
+ ### 2. Norwegian Language Optimization
+
+ - **Tokenization**: Ensure proper tokenization for Norwegian-specific characters and word structures
+ - **Text Normalization**: Implement Norwegian-specific text normalization (handling of "æ", "ø", "å")
+ - **Stopword Removal**: Use a Norwegian stopword list for improved retrieval (see the sketch below)
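+
+ NLTK (already listed in requirements.txt) ships a Norwegian stopword list; a minimal sketch:
+
+ ```python
+ import nltk
+ from nltk.corpus import stopwords
+
+ nltk.download("stopwords", quiet=True)  # one-time download
+ norwegian_stopwords = set(stopwords.words("norwegian"))
+
+ def strip_stopwords(tokens: list[str]) -> list[str]:
+     return [t for t in tokens if t.lower() not in norwegian_stopwords]
+ ```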
+
+ ### 3. Embedding Functionality
+
+ - **iFrame Integration**: Provide code snippets for embedding the chatbot in iframes
+ - **JavaScript Widget**: Create a JavaScript widget for easy integration into any website
+ - **API Access**: Provide API endpoints for programmatic access
+
+ ### 4. Security and Privacy
+
+ - **Data Handling**: Implement proper data handling practices
+ - **User Authentication**: Add optional user authentication for personalized experiences
+ - **Rate Limiting**: Implement rate limiting to prevent abuse
+
+ ## Next Steps
+
+ 1. Set up the development environment
+ 2. Implement the document processing pipeline
+ 3. Integrate the LLM and embedding models
+ 4. Create the chat interface
+ 5. Develop the embedding functionality
+ 6. Deploy to Hugging Face
+ 7. Test and optimize the solution
prepare_deployment.sh ADDED
@@ -0,0 +1,37 @@
+ #!/bin/bash
+ # Create empty directories for data storage
+ mkdir -p /home/ubuntu/chatbot_project/data/documents
+ mkdir -p /home/ubuntu/chatbot_project/data/processed
+ touch /home/ubuntu/chatbot_project/data/documents/.gitkeep
+ touch /home/ubuntu/chatbot_project/data/processed/.gitkeep
+
+ # Create a simple test document
+ cat > /home/ubuntu/chatbot_project/data/documents/test_document.txt << 'EOL'
+ # Norsk historie
+
+ Norge har en rik og fascinerende historie som strekker seg tilbake til vikingtiden. Vikingene var kjent for sine sjøreiser, handel og plyndring i store deler av Europa fra slutten av 700-tallet til midten av 1000-tallet.
+
+ ## Middelalderen
+
+ I 1030 døde Olav Haraldsson (senere kjent som Olav den hellige) i slaget ved Stiklestad. Hans død markerte begynnelsen på kristendommens endelige gjennombrudd i Norge.
+
+ Norge ble forent til ett rike under Harald Hårfagre på 800-tallet. Etter vikingtiden fulgte en periode med borgerkrig før landet ble stabilisert under Håkon Håkonsson på 1200-tallet.
+
+ ## Union med Danmark
+
+ Fra 1380 til 1814 var Norge i union med Danmark, en periode kjent som "dansketiden". Under denne perioden ble dansk det offisielle språket i administrasjon og litteratur, noe som hadde stor innflytelse på det norske språket.
+
+ ## Grunnloven og union med Sverige
+
+ I 1814 fikk Norge sin egen grunnlov, signert på Eidsvoll 17. mai. Samme år ble Norge tvunget inn i en union med Sverige, som varte frem til 1905.
+
+ ## Moderne Norge
+
+ Norge ble okkupert av Nazi-Tyskland under andre verdenskrig fra 1940 til 1945. Etter krigen opplevde landet rask økonomisk vekst.
+
+ Oppdagelsen av olje i Nordsjøen på slutten av 1960-tallet forvandlet Norge til en av verdens rikeste nasjoner per innbygger.
+
+ I dag er Norge kjent for sin velferdsstat, naturskjønnhet og høy levestandard.
+ EOL
+
+ echo "Deployment files prepared successfully"
requirements-minimal.txt ADDED
@@ -0,0 +1,21 @@
+ # Core dependencies - minimal version
+ transformers>=4.36.0
+ sentence-transformers>=2.2.2
+ torch>=2.0.0
+ gradio>=4.0.0
+ huggingface_hub>=0.19.0
+
+ # Document processing - essential only
+ PyPDF2>=3.0.0
+ beautifulsoup4>=4.12.0
+
+ # Vector database - lightweight option
+ faiss-cpu>=1.7.4
+
+ # Utilities - minimal set
+ numpy>=1.24.0
+ tqdm>=4.66.0
+ requests>=2.31.0
+
+ # Norwegian language support
+ nltk>=3.8.0
requirements-ultra-light.txt ADDED
@@ -0,0 +1,7 @@
+ # Core dependencies - ultra lightweight
+ requests>=2.31.0
+ gradio>=4.0.0
+ huggingface_hub>=0.19.0
+ numpy>=1.24.0
+ PyPDF2>=3.0.0
+ beautifulsoup4>=4.12.0
requirements.txt CHANGED
@@ -1 +1,25 @@
- huggingface_hub==0.25.2
+ # Core dependencies
+ transformers>=4.36.0
+ sentence-transformers>=2.2.2
+ torch>=2.0.0
+ gradio>=4.0.0
+ huggingface_hub>=0.19.0
+
+ # Document processing
+ PyPDF2>=3.0.0
+ python-docx>=0.8.11
+ beautifulsoup4>=4.12.0
+ markdown>=3.5.0
+
+ # Vector database
+ faiss-cpu>=1.7.4
+ langchain>=0.1.0
+
+ # Utilities
+ numpy>=1.24.0
+ pandas>=2.0.0
+ tqdm>=4.66.0
+ requests>=2.31.0
+
+ # Norwegian language support
+ nltk>=3.8.0
research/norwegian_llm_research.md ADDED
@@ -0,0 +1,81 @@
+ # Norwegian LLM and Embedding Models Research
+
+ ## Open-Source LLMs with Norwegian Language Support
+
+ ### 1. NorMistral-7b-scratch
+ - **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts)
+ - **Architecture**: Based on the Mistral architecture, with 7 billion parameters
+ - **Context Length**: 2k tokens
+ - **Performance**:
+   - Perplexity on the NCC validation set: 7.43
+   - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
+ - **License**: Apache 2.0
+ - **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
+ - **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo
+
+ ### 2. Viking 7B
+ - **Description**: The first multilingual large language model for all Nordic languages (including Norwegian)
+ - **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
+ - **Context Length**: 4k tokens
+ - **Performance**: Best-in-class performance in all Nordic languages without compromising English performance
+ - **License**: Apache 2.0
+ - **Notes**:
+   - Developed by Silo AI and the University of Turku's research group TurkuNLP
+   - Also available in larger sizes (13B and 33B parameters)
+   - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages
+
+ ### 3. NorskGPT
+ - **Description**: A Norwegian large language model made for Norwegian society
+ - **Versions**:
+   - NorskGPT-Mistral: 7B dense transformer with an 8K context window, based on Mistral 7B
+   - NorskGPT-LLAMA2: 7B and 13B parameter models with 4K context length, based on LLaMA 2
+ - **License**: cc-by-nc-sa-4.0 (non-commercial)
+ - **Website**: https://www.norskgpt.com/norskgpt-llm
+
+ ## Embedding Models for Norwegian
+
+ ### 1. NbAiLab/nb-sbert-base
+ - **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
+ - **Architecture**: Based on nb-bert-base
+ - **Vector Dimensions**: 768
+ - **Performance**:
+   - Cosine similarity: Pearson 0.8275, Spearman 0.8245
+ - **License**: Apache 2.0
+ - **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
+ - **Use Cases**:
+   - Sentence similarity
+   - Semantic search
+   - Few-shot classification (with SetFit)
+   - Keyword extraction (with KeyBERT)
+   - Topic modeling (with BERTopic)
+ - **Notes**: Works well with both Norwegian and English, making it ideal for bilingual applications (illustrated below)
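+
+ A quick illustration of the sentence-similarity use case (the example sentences are ours):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ model = SentenceTransformer("NbAiLab/nb-sbert-base")
+ emb = model.encode(["Hva er hovedstaden i Norge?", "Oslo er Norges hovedstad."])
+ print(util.cos_sim(emb[0], emb[1]))  # related sentences score high
+ ```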
+
+ ### 2. FFI/SimCSE-NB-BERT-large
+ - **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology
+ - **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large
+
+ ## Vector Database Options for Hugging Face RAG Integration
+
+ ### 1. Milvus
+ - **Integration**: Well-documented integration with Hugging Face for RAG pipelines
+ - **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus
+
+ ### 2. MongoDB
+ - **Integration**: Can be used with Hugging Face models for RAG systems
+ - **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb
+
+ ### 3. MyScale
+ - **Integration**: Supports building RAG applications with Hugging Face embedding models
+ - **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293
+
+ ### 4. FAISS (Facebook AI Similarity Search)
+ - **Integration**: A lightweight vector index that works well with Hugging Face
+ - **Notes**: Can be used with `autofaiss` for quick experimentation
+
+ ## Hugging Face RAG Implementation Options
+
+ 1. **Transformers Library**: Provides access to pre-trained models
+ 2. **Sentence Transformers**: For text embeddings
+ 3. **Datasets**: For managing and processing data
+ 4. **LangChain Integration**: For advanced RAG pipelines
+ 5. **Spaces**: For deploying and sharing the application
src/api/__init__.py ADDED
@@ -0,0 +1,3 @@
+ """
+ API integration module for Norwegian RAG chatbot.
+ """
src/api/config.py ADDED
@@ -0,0 +1,61 @@
+ """
+ Configuration for Hugging Face API integration.
+ Contains model IDs, API endpoints, and other configuration parameters.
+ """
+
+ # Norwegian LLM options
+ LLM_MODELS = {
+     "normistral": {
+         "model_id": "norallm/normistral-7b-scratch",
+         "description": "NorMistral 7B - Norwegian language model based on Mistral architecture"
+     },
+     "viking": {
+         "model_id": "silo-ai/viking-7b",
+         "description": "Viking 7B - Multilingual model for Nordic languages"
+     },
+     "norskgpt": {
+         "model_id": "NbAiLab/NorskGPT",
+         "description": "NorskGPT - Norwegian language model"
+     }
+ }
+
+ # Default LLM model
+ DEFAULT_LLM_MODEL = "normistral"
+
+ # Norwegian embedding models
+ EMBEDDING_MODELS = {
+     "nb-sbert": {
+         "model_id": "NbAiLab/nb-sbert-base",
+         "description": "NB-SBERT-BASE - Norwegian sentence embedding model"
+     },
+     "simcse": {
+         "model_id": "FFI/SimCSE-NB-BERT-large",
+         "description": "SimCSE-NB-BERT-large - Norwegian sentence embedding model"
+     }
+ }
+
+ # Default embedding model
+ DEFAULT_EMBEDDING_MODEL = "nb-sbert"
+
+ # Hugging Face API endpoints
+ HF_API_ENDPOINTS = {
+     "inference": "https://api-inference.huggingface.co/models/",
+     "feature-extraction": "https://api-inference.huggingface.co/pipeline/feature-extraction/"
+ }
+
+ # API request parameters
+ API_PARAMS = {
+     "max_length": 512,
+     "temperature": 0.7,
+     "top_p": 0.9,
+     "top_k": 50,
+     "repetition_penalty": 1.1
+ }
+
+ # Document processing parameters
+ CHUNK_SIZE = 512
+ CHUNK_OVERLAP = 100
+
+ # RAG parameters
+ MAX_CHUNKS_TO_RETRIEVE = 5
+ SIMILARITY_THRESHOLD = 0.75
src/api/huggingface_api.py ADDED
@@ -0,0 +1,213 @@
+ """
+ Hugging Face API integration for Norwegian RAG chatbot.
+ Provides functions to interact with Hugging Face Inference API for both LLM and embedding models.
+ """
+
+ import os
+ import json
+ import time
+ import requests
+ from typing import Dict, List, Optional, Union, Any
+
+ from .config import (
+     LLM_MODELS,
+     DEFAULT_LLM_MODEL,
+     EMBEDDING_MODELS,
+     DEFAULT_EMBEDDING_MODEL,
+     HF_API_ENDPOINTS,
+     API_PARAMS
+ )
+
+ class HuggingFaceAPI:
+     """
+     Client for interacting with Hugging Face Inference API.
+     Supports both text generation (LLM) and embedding generation.
+     """
+
+     def __init__(
+         self,
+         api_key: Optional[str] = None,
+         llm_model: str = DEFAULT_LLM_MODEL,
+         embedding_model: str = DEFAULT_EMBEDDING_MODEL
+     ):
+         """
+         Initialize the Hugging Face API client.
+
+         Args:
+             api_key: Hugging Face API key (optional, can use HF_API_KEY env var)
+             llm_model: LLM model identifier from config
+             embedding_model: Embedding model identifier from config
+         """
+         self.api_key = api_key or os.environ.get("HF_API_KEY", "")
+
+         # Set up model IDs
+         self.llm_model_id = LLM_MODELS[llm_model]["model_id"] if llm_model in LLM_MODELS else LLM_MODELS[DEFAULT_LLM_MODEL]["model_id"]
+         self.embedding_model_id = EMBEDDING_MODELS[embedding_model]["model_id"] if embedding_model in EMBEDDING_MODELS else EMBEDDING_MODELS[DEFAULT_EMBEDDING_MODEL]["model_id"]
+
+         # Set up headers
+         self.headers = {"Authorization": f"Bearer {self.api_key}"}
+         if not self.api_key:
+             print("Warning: No API key provided. API calls may be rate limited.")
+             self.headers = {}
+
+     def generate_text(
+         self,
+         prompt: str,
+         max_length: int = API_PARAMS["max_length"],
+         temperature: float = API_PARAMS["temperature"],
+         top_p: float = API_PARAMS["top_p"],
+         top_k: int = API_PARAMS["top_k"],
+         repetition_penalty: float = API_PARAMS["repetition_penalty"],
+         wait_for_model: bool = True
+     ) -> str:
+         """
+         Generate text using the LLM model.
+
+         Args:
+             prompt: Input text prompt
+             max_length: Maximum length of generated text
+             temperature: Sampling temperature
+             top_p: Top-p sampling parameter
+             top_k: Top-k sampling parameter
+             repetition_penalty: Penalty for repetition
+             wait_for_model: Whether to wait for model to load
+
+         Returns:
+             Generated text response
+         """
+         payload = {
+             "inputs": prompt,
+             "parameters": {
+                 "max_length": max_length,
+                 "temperature": temperature,
+                 "top_p": top_p,
+                 "top_k": top_k,
+                 "repetition_penalty": repetition_penalty
+             }
+         }
+
+         api_url = f"{HF_API_ENDPOINTS['inference']}{self.llm_model_id}"
+
+         # Make API request
+         response = self._make_api_request(api_url, payload, wait_for_model)
+
+         # Parse response
+         if isinstance(response, list) and len(response) > 0:
+             if "generated_text" in response[0]:
+                 return response[0]["generated_text"]
+             return response[0].get("text", "")
+         elif isinstance(response, dict):
+             return response.get("generated_text", "")
+
+         # Fallback
+         return str(response)
+
+     def generate_embeddings(
+         self,
+         texts: Union[str, List[str]],
+         wait_for_model: bool = True
+     ) -> List[List[float]]:
+         """
+         Generate embeddings for text using the embedding model.
+
+         Args:
+             texts: Single text or list of texts to embed
+             wait_for_model: Whether to wait for model to load
+
+         Returns:
+             List of embedding vectors
+         """
+         # Ensure texts is a list
+         if isinstance(texts, str):
+             texts = [texts]
+
+         payload = {
+             "inputs": texts,
+         }
+
+         api_url = f"{HF_API_ENDPOINTS['feature-extraction']}{self.embedding_model_id}"
+
+         # Make API request
+         response = self._make_api_request(api_url, payload, wait_for_model)
+
+         # Return embeddings
+         return response
+
+     def _make_api_request(
+         self,
+         api_url: str,
+         payload: Dict[str, Any],
+         wait_for_model: bool = True,
+         max_retries: int = 5,
+         retry_delay: int = 1
+     ) -> Any:
+         """
+         Make a request to the Hugging Face API with retry logic.
+
+         Args:
+             api_url: API endpoint URL
+             payload: Request payload
+             wait_for_model: Whether to wait for model to load
+             max_retries: Maximum number of retries
+             retry_delay: Delay between retries in seconds
+
+         Returns:
+             API response
+         """
+         for attempt in range(max_retries):
+             try:
+                 response = requests.post(api_url, headers=self.headers, json=payload)
+
+                 # Check if model is still loading
+                 if response.status_code == 503 and wait_for_model:
+                     # Model is loading, wait and retry
+                     estimated_time = json.loads(response.content.decode("utf-8")).get("estimated_time", 20)
+                     print(f"Model is loading. Waiting {estimated_time} seconds...")
+                     time.sleep(estimated_time)
+                     continue
+
+                 # Check for other errors
+                 if response.status_code != 200:
+                     print(f"API request failed with status code {response.status_code}: {response.text}")
+                     if attempt < max_retries - 1:
+                         time.sleep(retry_delay * (2 ** attempt))  # Exponential backoff
+                         continue
+                     return {"error": response.text}
+
+                 return response.json()
+
+             except Exception as e:
+                 print(f"API request failed: {str(e)}")
+                 if attempt < max_retries - 1:
+                     time.sleep(retry_delay * (2 ** attempt))  # Exponential backoff
+                     continue
+                 return {"error": str(e)}
+
+         return {"error": "Max retries exceeded"}
+
+
+ # Example RAG prompt template for Norwegian
+ def create_rag_prompt(query: str, context: List[str]) -> str:
+     """
+     Create a RAG prompt with retrieved context for the LLM.
+
+     Args:
+         query: User query
+         context: List of retrieved document chunks
+
+     Returns:
+         Formatted prompt with context
+     """
+     context_text = "\n\n".join([f"Dokument {i+1}:\n{chunk}" for i, chunk in enumerate(context)])
+
+     prompt = f"""Du er en hjelpsom assistent som svarer på norsk. Bruk følgende kontekst for å svare på spørsmålet.
+
+ KONTEKST:
+ {context_text}
+
+ SPØRSMÅL:
+ {query}
+
+ SVAR:
+ """
+     return prompt
src/document_processing/__init__.py ADDED
@@ -0,0 +1,3 @@
+ """
+ Document processing module for Norwegian RAG chatbot.
+ """
src/document_processing/chunker.py ADDED
@@ -0,0 +1,262 @@
+ """
+ Text chunking module for Norwegian RAG chatbot.
+ Splits documents into manageable chunks for embedding and retrieval.
+ """
+
+ import re
+ import unicodedata
+ from typing import List, Optional, Tuple
+
+ from ..api.config import CHUNK_SIZE, CHUNK_OVERLAP
+
+ class TextChunker:
+     """
+     Splits documents into manageable chunks for embedding and retrieval.
+     Supports different chunking strategies optimized for Norwegian text.
+     """
+
+     @staticmethod
+     def chunk_text(
+         text: str,
+         chunk_size: int = CHUNK_SIZE,
+         chunk_overlap: int = CHUNK_OVERLAP,
+         strategy: str = "paragraph"
+     ) -> List[str]:
+         """
+         Split text into chunks using the specified strategy.
+
+         Args:
+             text: Text to split into chunks
+             chunk_size: Maximum size of each chunk
+             chunk_overlap: Overlap between consecutive chunks
+             strategy: Chunking strategy ('fixed', 'paragraph', or 'sentence')
+
+         Returns:
+             List of text chunks
+         """
+         if not text:
+             return []
+
+         if strategy == "fixed":
+             return TextChunker.fixed_size_chunks(text, chunk_size, chunk_overlap)
+         elif strategy == "paragraph":
+             return TextChunker.paragraph_chunks(text, chunk_size, chunk_overlap)
+         elif strategy == "sentence":
+             return TextChunker.sentence_chunks(text, chunk_size, chunk_overlap)
+         else:
+             raise ValueError(f"Unknown chunking strategy: {strategy}")
+
+     @staticmethod
+     def fixed_size_chunks(
+         text: str,
+         chunk_size: int = CHUNK_SIZE,
+         chunk_overlap: int = CHUNK_OVERLAP
+     ) -> List[str]:
+         """
+         Split text into fixed-size chunks with overlap.
+
+         Args:
+             text: Text to split into chunks
+             chunk_size: Maximum size of each chunk
+             chunk_overlap: Overlap between consecutive chunks
+
+         Returns:
+             List of text chunks
+         """
+         if not text:
+             return []
+
+         chunks = []
+         start = 0
+         text_length = len(text)
+
+         while start < text_length:
+             end = min(start + chunk_size, text_length)
+
+             # If this is not the first chunk and we're not at the end,
+             # try to find a good breaking point (whitespace)
+             if start > 0 and end < text_length:
+                 # Look for the last whitespace within the chunk
+                 last_whitespace = text.rfind(' ', start, end)
+                 if last_whitespace != -1:
+                     end = last_whitespace + 1  # Include the space
+
+             # Add the chunk
+             chunks.append(text[start:end].strip())
+
+             # Move the start position for the next chunk, considering overlap
+             start = end - chunk_overlap if end < text_length else text_length
+
+         return chunks
+
+     @staticmethod
+     def paragraph_chunks(
+         text: str,
+         max_chunk_size: int = CHUNK_SIZE,
+         chunk_overlap: int = CHUNK_OVERLAP
+     ) -> List[str]:
+         """
+         Split text into chunks based on paragraphs.
+
+         Args:
+             text: Text to split into chunks
+             max_chunk_size: Maximum size of each chunk
+             chunk_overlap: Overlap between consecutive chunks
+
+         Returns:
+             List of text chunks
+         """
+         if not text:
+             return []
+
+         # Split text into paragraphs
+         paragraphs = re.split(r'\n\s*\n', text)
+         paragraphs = [p.strip() for p in paragraphs if p.strip()]
+
+         chunks = []
+         current_chunk = []
+         current_size = 0
+
+         for paragraph in paragraphs:
+             paragraph_size = len(paragraph)
+
+             # If adding this paragraph would exceed the max chunk size and we already have content,
+             # save the current chunk and start a new one
+             if current_size + paragraph_size > max_chunk_size and current_chunk:
+                 chunks.append('\n\n'.join(current_chunk))
+
+                 # For overlap, keep some paragraphs from the previous chunk
+                 overlap_size = 0
+                 overlap_paragraphs = []
+
+                 # Add paragraphs from the end until we reach the desired overlap
+                 for p in reversed(current_chunk):
+                     if overlap_size + len(p) <= chunk_overlap:
+                         overlap_paragraphs.insert(0, p)
+                         overlap_size += len(p)
+                     else:
+                         break
+
+                 current_chunk = overlap_paragraphs
+                 current_size = overlap_size
+
+             # If the paragraph itself is larger than the max chunk size, split it further
+             if paragraph_size > max_chunk_size:
+                 # First, add the current chunk if it's not empty
+                 if current_chunk:
+                     chunks.append('\n\n'.join(current_chunk))
+                     current_chunk = []
+                     current_size = 0
+
+                 # Then split the large paragraph into fixed-size chunks
+                 paragraph_chunks = TextChunker.fixed_size_chunks(paragraph, max_chunk_size, chunk_overlap)
+                 chunks.extend(paragraph_chunks)
+             else:
+                 # Add the paragraph to the current chunk
+                 current_chunk.append(paragraph)
+                 current_size += paragraph_size
+
+         # Add the last chunk if it's not empty
+         if current_chunk:
+             chunks.append('\n\n'.join(current_chunk))
+
+         return chunks
+
+     @staticmethod
+     def sentence_chunks(
+         text: str,
+         max_chunk_size: int = CHUNK_SIZE,
+         chunk_overlap: int = CHUNK_OVERLAP
+     ) -> List[str]:
+         """
+         Split text into chunks based on sentences.
+
+         Args:
+             text: Text to split into chunks
+             max_chunk_size: Maximum size of each chunk
+             chunk_overlap: Overlap between consecutive chunks
+
+         Returns:
+             List of text chunks
+         """
+         if not text:
+             return []
+
+         # Norwegian-aware sentence splitting
+         # This pattern handles common Norwegian sentence endings
+         sentence_pattern = r'(?<=[.!?])\s+(?=[A-ZÆØÅ])'
+         sentences = re.split(sentence_pattern, text)
+         sentences = [s.strip() for s in sentences if s.strip()]
+
+         chunks = []
+         current_chunk = []
+         current_size = 0
+
+         for sentence in sentences:
+             sentence_size = len(sentence)
+
+             # If adding this sentence would exceed the max chunk size and we already have content,
+             # save the current chunk and start a new one
+             if current_size + sentence_size > max_chunk_size and current_chunk:
+                 chunks.append(' '.join(current_chunk))
+
+                 # For overlap, keep some sentences from the previous chunk
+                 overlap_size = 0
+                 overlap_sentences = []
+
+                 # Add sentences from the end until we reach the desired overlap
+                 for s in reversed(current_chunk):
+                     if overlap_size + len(s) <= chunk_overlap:
+                         overlap_sentences.insert(0, s)
+                         overlap_size += len(s)
+                     else:
+                         break
+
+                 current_chunk = overlap_sentences
+                 current_size = overlap_size
+
+             # If the sentence itself is larger than the max chunk size, split it further
+             if sentence_size > max_chunk_size:
+                 # First, add the current chunk if it's not empty
+                 if current_chunk:
+                     chunks.append(' '.join(current_chunk))
+                     current_chunk = []
+                     current_size = 0
+
+                 # Then split the large sentence into fixed-size chunks
+                 sentence_chunks = TextChunker.fixed_size_chunks(sentence, max_chunk_size, chunk_overlap)
+                 chunks.extend(sentence_chunks)
+             else:
+                 # Add the sentence to the current chunk
+                 current_chunk.append(sentence)
+                 current_size += sentence_size
+
+         # Add the last chunk if it's not empty
+         if current_chunk:
+             chunks.append(' '.join(current_chunk))
+
+         return chunks
+
+     @staticmethod
+     def clean_chunk(chunk: str) -> str:
+         """
+         Clean a text chunk by removing excessive whitespace and normalizing.
+
+         Args:
+             chunk: Text chunk to clean
+
+         Returns:
+             Cleaned text chunk
+         """
+         if not chunk:
+             return ""
+
+         # Replace multiple whitespace with a single space
+         cleaned = re.sub(r'\s+', ' ', chunk)
+
+         # Normalize Norwegian characters to composed form (NFC)
+         # so that æ, ø, å are represented consistently
+         cleaned = unicodedata.normalize('NFC', cleaned)
+
+         return cleaned.strip()
src/document_processing/extractor.py ADDED
@@ -0,0 +1,167 @@
+ """
+ Text extraction module for Norwegian RAG chatbot.
+ Extracts text from various document formats.
+ """
+
+ import os
+ import PyPDF2
+ from typing import List, Optional
+ from bs4 import BeautifulSoup
+
+ class TextExtractor:
+     """
+     Extracts text from various document formats.
+     Currently supports:
+     - PDF (.pdf)
+     - Text files (.txt)
+     - HTML (.html, .htm)
+     """
+
+     @staticmethod
+     def extract_from_file(file_path: str) -> str:
+         """
+         Extract text from a file based on its extension.
+
+         Args:
+             file_path: Path to the document file
+
+         Returns:
+             Extracted text content
+         """
+         if not os.path.exists(file_path):
+             raise FileNotFoundError(f"File not found: {file_path}")
+
+         file_extension = os.path.splitext(file_path)[1].lower()
+
+         if file_extension == '.pdf':
+             return TextExtractor.extract_from_pdf(file_path)
+         elif file_extension == '.txt':
+             return TextExtractor.extract_from_text(file_path)
+         elif file_extension in ['.html', '.htm']:
+             return TextExtractor.extract_from_html(file_path)
+         else:
+             raise ValueError(f"Unsupported file format: {file_extension}")
+
+     @staticmethod
+     def extract_from_pdf(file_path: str) -> str:
+         """
+         Extract text from a PDF file.
+
+         Args:
+             file_path: Path to the PDF file
+
+         Returns:
+             Extracted text content
+         """
+         text = ""
+         try:
+             with open(file_path, 'rb') as file:
+                 pdf_reader = PyPDF2.PdfReader(file)
+                 for page_num in range(len(pdf_reader.pages)):
+                     page = pdf_reader.pages[page_num]
+                     text += page.extract_text() + "\n\n"
+         except Exception as e:
+             print(f"Error extracting text from PDF {file_path}: {str(e)}")
+             return ""
+
+         return text
+
+     @staticmethod
+     def extract_from_text(file_path: str) -> str:
+         """
+         Extract text from a plain text file.
+
+         Args:
+             file_path: Path to the text file
+
+         Returns:
+             Extracted text content
+         """
+         try:
+             with open(file_path, 'r', encoding='utf-8') as file:
+                 return file.read()
+         except UnicodeDecodeError:
+             # Try with different encoding if UTF-8 fails
+             try:
+                 with open(file_path, 'r', encoding='latin-1') as file:
+                     return file.read()
+             except Exception as e:
+                 print(f"Error extracting text from file {file_path}: {str(e)}")
+                 return ""
+         except Exception as e:
+             print(f"Error extracting text from file {file_path}: {str(e)}")
+             return ""
+
+     @staticmethod
+     def extract_from_html(file_path: str) -> str:
+         """
+         Extract text from an HTML file.
+
+         Args:
+             file_path: Path to the HTML file
+
+         Returns:
+             Extracted text content
+         """
+         try:
+             with open(file_path, 'r', encoding='utf-8') as file:
+                 html_content = file.read()
+                 soup = BeautifulSoup(html_content, 'html.parser')
+
+                 # Remove script and style elements
+                 for script in soup(["script", "style"]):
+                     script.extract()
+
+                 # Get text
+                 text = soup.get_text()
+
+                 # Break into lines and remove leading and trailing space on each
+                 lines = (line.strip() for line in text.splitlines())
+
+                 # Break multi-headlines into a line each
122
+ chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
123
+
124
+ # Drop blank lines
125
+ text = '\n'.join(chunk for chunk in chunks if chunk)
126
+
127
+ return text
128
+ except Exception as e:
129
+ print(f"Error extracting text from HTML {file_path}: {str(e)}")
130
+ return ""
131
+
132
+ @staticmethod
133
+ def extract_from_url(url: str) -> str:
134
+ """
135
+ Extract text from a web URL.
136
+
137
+ Args:
138
+ url: Web URL to extract text from
139
+
140
+ Returns:
141
+ Extracted text content
142
+ """
143
+ try:
144
+ import requests
145
+ response = requests.get(url)
146
+ soup = BeautifulSoup(response.content, 'html.parser')
147
+
148
+ # Remove script and style elements
149
+ for script in soup(["script", "style"]):
150
+ script.extract()
151
+
152
+ # Get text
153
+ text = soup.get_text()
154
+
155
+ # Break into lines and remove leading and trailing space on each
156
+ lines = (line.strip() for line in text.splitlines())
157
+
158
+ # Break multi-headlines into a line each
159
+ chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
160
+
161
+ # Drop blank lines
162
+ text = '\n'.join(chunk for chunk in chunks if chunk)
163
+
164
+ return text
165
+ except Exception as e:
166
+ print(f"Error extracting text from URL {url}: {str(e)}")
167
+ return ""
src/document_processing/processor.py ADDED
@@ -0,0 +1,306 @@
+ """
2
+ Document processor module for Norwegian RAG chatbot.
3
+ Orchestrates the document processing pipeline with remote embeddings.
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import numpy as np
9
+ from typing import List, Dict, Any, Optional, Tuple, Union
10
+ from datetime import datetime
11
+
12
+ from .extractor import TextExtractor
13
+ from .chunker import TextChunker
14
+ from ..api.huggingface_api import HuggingFaceAPI
15
+ from ..api.config import CHUNK_SIZE, CHUNK_OVERLAP
16
+
17
+ class DocumentProcessor:
18
+ """
19
+ Orchestrates the document processing pipeline:
20
+ 1. Extract text from documents
21
+ 2. Split text into chunks
22
+ 3. Generate embeddings using remote API
23
+ 4. Store processed documents and embeddings
24
+ """
25
+
26
+ def __init__(
27
+ self,
28
+ api_client: Optional[HuggingFaceAPI] = None,
29
+ documents_dir: str = "/home/ubuntu/chatbot_project/data/documents",
30
+ processed_dir: str = "/home/ubuntu/chatbot_project/data/processed",
31
+ chunk_size: int = CHUNK_SIZE,
32
+ chunk_overlap: int = CHUNK_OVERLAP,
33
+ chunking_strategy: str = "paragraph"
34
+ ):
35
+ """
36
+ Initialize the document processor.
37
+
38
+ Args:
39
+ api_client: HuggingFaceAPI client for generating embeddings
40
+ documents_dir: Directory for storing original documents
41
+ processed_dir: Directory for storing processed documents and embeddings
42
+ chunk_size: Maximum size of each chunk
43
+ chunk_overlap: Overlap between consecutive chunks
44
+ chunking_strategy: Strategy for chunking text ('fixed', 'paragraph', or 'sentence')
45
+ """
46
+ self.api_client = api_client or HuggingFaceAPI()
47
+ self.documents_dir = documents_dir
48
+ self.processed_dir = processed_dir
49
+ self.chunk_size = chunk_size
50
+ self.chunk_overlap = chunk_overlap
51
+ self.chunking_strategy = chunking_strategy
52
+
53
+ # Ensure directories exist
54
+ os.makedirs(self.documents_dir, exist_ok=True)
55
+ os.makedirs(self.processed_dir, exist_ok=True)
56
+
57
+ # Initialize document index
58
+ self.document_index_path = os.path.join(self.processed_dir, "document_index.json")
59
+ self.document_index = self._load_document_index()
60
+
61
+ def process_document(
62
+ self,
63
+ file_path: str,
64
+ document_id: Optional[str] = None,
65
+ metadata: Optional[Dict[str, Any]] = None
66
+ ) -> str:
67
+ """
68
+ Process a document through the entire pipeline.
69
+
70
+ Args:
71
+ file_path: Path to the document file
72
+ document_id: Optional custom document ID
73
+ metadata: Optional metadata for the document
74
+
75
+ Returns:
76
+ Document ID
77
+ """
78
+ # Generate document ID if not provided
79
+ if document_id is None:
80
+ document_id = f"doc_{datetime.now().strftime('%Y%m%d%H%M%S')}_{os.path.basename(file_path)}"
81
+
82
+ # Extract text from document
83
+ text = TextExtractor.extract_from_file(file_path)
84
+ if not text:
85
+ raise ValueError(f"Failed to extract text from {file_path}")
86
+
87
+ # Split text into chunks
88
+ chunks = TextChunker.chunk_text(
89
+ text,
90
+ chunk_size=self.chunk_size,
91
+ chunk_overlap=self.chunk_overlap,
92
+ strategy=self.chunking_strategy
93
+ )
94
+
95
+ # Clean chunks
96
+ chunks = [TextChunker.clean_chunk(chunk) for chunk in chunks]
97
+
98
+ # Generate embeddings using remote API
99
+ embeddings = self.api_client.generate_embeddings(chunks)
100
+
101
+ # Prepare metadata
102
+ if metadata is None:
103
+ metadata = {}
104
+
105
+ metadata.update({
106
+ "filename": os.path.basename(file_path),
107
+ "processed_date": datetime.now().isoformat(),
108
+ "chunk_count": len(chunks),
109
+ "chunking_strategy": self.chunking_strategy,
110
+ "embedding_model": self.api_client.embedding_model_id
111
+ })
112
+
113
+ # Save processed document
114
+ self._save_processed_document(document_id, chunks, embeddings, metadata)
115
+
116
+ # Update document index
117
+ self._update_document_index(document_id, metadata)
118
+
119
+ return document_id
120
+
121
+ def process_text(
122
+ self,
123
+ text: str,
124
+ document_id: Optional[str] = None,
125
+ metadata: Optional[Dict[str, Any]] = None
126
+ ) -> str:
127
+ """
128
+ Process text directly through the pipeline.
129
+
130
+ Args:
131
+ text: Text content to process
132
+ document_id: Optional custom document ID
133
+ metadata: Optional metadata for the document
134
+
135
+ Returns:
136
+ Document ID
137
+ """
138
+ # Generate document ID if not provided
139
+ if document_id is None:
140
+ document_id = f"text_{datetime.now().strftime('%Y%m%d%H%M%S')}"
141
+
142
+ # Split text into chunks
143
+ chunks = TextChunker.chunk_text(
144
+ text,
145
+ chunk_size=self.chunk_size,
146
+ chunk_overlap=self.chunk_overlap,
147
+ strategy=self.chunking_strategy
148
+ )
149
+
150
+ # Clean chunks
151
+ chunks = [TextChunker.clean_chunk(chunk) for chunk in chunks]
152
+
153
+ # Generate embeddings using remote API
154
+ embeddings = self.api_client.generate_embeddings(chunks)
155
+
156
+ # Prepare metadata
157
+ if metadata is None:
158
+ metadata = {}
159
+
160
+ metadata.update({
161
+ "source": "direct_text",
162
+ "processed_date": datetime.now().isoformat(),
163
+ "chunk_count": len(chunks),
164
+ "chunking_strategy": self.chunking_strategy,
165
+ "embedding_model": self.api_client.embedding_model_id
166
+ })
167
+
168
+ # Save processed document
169
+ self._save_processed_document(document_id, chunks, embeddings, metadata)
170
+
171
+ # Update document index
172
+ self._update_document_index(document_id, metadata)
173
+
174
+ return document_id
175
+
176
+ def get_document_chunks(self, document_id: str) -> List[str]:
177
+ """
178
+ Get all chunks for a document.
179
+
180
+ Args:
181
+ document_id: Document ID
182
+
183
+ Returns:
184
+ List of text chunks
185
+ """
186
+ document_path = os.path.join(self.processed_dir, f"{document_id}.json")
187
+ if not os.path.exists(document_path):
188
+ raise FileNotFoundError(f"Document not found: {document_id}")
189
+
190
+ with open(document_path, 'r', encoding='utf-8') as f:
191
+ document_data = json.load(f)
192
+
193
+ return document_data.get("chunks", [])
194
+
195
+ def get_document_embeddings(self, document_id: str) -> List[List[float]]:
196
+ """
197
+ Get all embeddings for a document.
198
+
199
+ Args:
200
+ document_id: Document ID
201
+
202
+ Returns:
203
+ List of embedding vectors
204
+ """
205
+ document_path = os.path.join(self.processed_dir, f"{document_id}.json")
206
+ if not os.path.exists(document_path):
207
+ raise FileNotFoundError(f"Document not found: {document_id}")
208
+
209
+ with open(document_path, 'r', encoding='utf-8') as f:
210
+ document_data = json.load(f)
211
+
212
+ return document_data.get("embeddings", [])
213
+
214
+ def get_all_documents(self) -> Dict[str, Dict[str, Any]]:
215
+ """
216
+ Get all documents in the index.
217
+
218
+ Returns:
219
+ Dictionary of document IDs to metadata
220
+ """
221
+ return self.document_index
222
+
223
+ def delete_document(self, document_id: str) -> bool:
224
+ """
225
+ Delete a document and its processed data.
226
+
227
+ Args:
228
+ document_id: Document ID
229
+
230
+ Returns:
231
+ True if successful, False otherwise
232
+ """
233
+ if document_id not in self.document_index:
234
+ return False
235
+
236
+ # Remove from index
237
+ del self.document_index[document_id]
238
+ self._save_document_index()
239
+
240
+ # Delete processed file
241
+ document_path = os.path.join(self.processed_dir, f"{document_id}.json")
242
+ if os.path.exists(document_path):
243
+ os.remove(document_path)
244
+
245
+ return True
246
+
247
+ def _save_processed_document(
248
+ self,
249
+ document_id: str,
250
+ chunks: List[str],
251
+ embeddings: List[List[float]],
252
+ metadata: Dict[str, Any]
253
+ ) -> None:
254
+ """
255
+ Save processed document data.
256
+
257
+ Args:
258
+ document_id: Document ID
259
+ chunks: List of text chunks
260
+ embeddings: List of embedding vectors
261
+ metadata: Document metadata
262
+ """
263
+ document_data = {
264
+ "document_id": document_id,
265
+ "metadata": metadata,
266
+ "chunks": chunks,
267
+ "embeddings": embeddings
268
+ }
269
+
270
+ document_path = os.path.join(self.processed_dir, f"{document_id}.json")
271
+ with open(document_path, 'w', encoding='utf-8') as f:
272
+ json.dump(document_data, f, ensure_ascii=False, indent=2)
273
+
274
+ def _load_document_index(self) -> Dict[str, Dict[str, Any]]:
275
+ """
276
+ Load the document index from disk.
277
+
278
+ Returns:
279
+ Dictionary of document IDs to metadata
280
+ """
281
+ if os.path.exists(self.document_index_path):
282
+ try:
283
+ with open(self.document_index_path, 'r', encoding='utf-8') as f:
284
+ return json.load(f)
285
+ except Exception as e:
286
+ print(f"Error loading document index: {str(e)}")
287
+
288
+ return {}
289
+
290
+ def _save_document_index(self) -> None:
291
+ """
292
+ Save the document index to disk.
293
+ """
294
+ with open(self.document_index_path, 'w', encoding='utf-8') as f:
295
+ json.dump(self.document_index, f, ensure_ascii=False, indent=2)
296
+
297
+ def _update_document_index(self, document_id: str, metadata: Dict[str, Any]) -> None:
298
+ """
299
+ Update the document index with a new or updated document.
300
+
301
+ Args:
302
+ document_id: Document ID
303
+ metadata: Document metadata
304
+ """
305
+ self.document_index[document_id] = metadata
306
+ self._save_document_index()
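
A minimal end-to-end sketch of the pipeline above (assumes `HF_API_KEY` is set so `HuggingFaceAPI` can reach the Inference API; the document path is a placeholder):

```python
from src.document_processing.processor import DocumentProcessor

processor = DocumentProcessor(chunking_strategy="sentence")

# Extract -> chunk -> embed -> persist as <processed_dir>/<doc_id>.json
doc_id = processor.process_document("data/documents/example.pdf")

chunks = processor.get_document_chunks(doc_id)
embeddings = processor.get_document_embeddings(doc_id)
print(f"{len(chunks)} chunks, embedding dimension {len(embeddings[0])}")
```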
src/main.py ADDED
@@ -0,0 +1,60 @@
+ """
2
+ Main application entry point for Norwegian RAG chatbot.
3
+ """
4
+
5
+ import os
6
+ import argparse
7
+ from typing import Dict, Any, Optional
8
+
9
+ from src.api.huggingface_api import HuggingFaceAPI
10
+ from src.document_processing.processor import DocumentProcessor
11
+ from src.rag.retriever import Retriever
12
+ from src.rag.generator import Generator
13
+ from src.web.app import ChatbotApp
14
+ from src.web.embed import EmbedGenerator, create_embed_html_file
15
+
16
+ def main():
17
+ """
18
+ Main entry point for the Norwegian RAG chatbot application.
19
+ """
20
+ # Parse command line arguments
21
+ parser = argparse.ArgumentParser(description="Norwegian RAG Chatbot")
22
+ parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to run the server on")
23
+ parser.add_argument("--port", type=int, default=7860, help="Port to run the server on")
24
+ parser.add_argument("--share", action="store_true", help="Create a public link for sharing")
25
+ parser.add_argument("--debug", action="store_true", help="Enable debug mode")
26
+ args = parser.parse_args()
27
+
28
+ # Initialize API client
29
+ api_key = os.environ.get("HF_API_KEY", "")
30
+ api_client = HuggingFaceAPI(api_key=api_key)
31
+
32
+ # Initialize components
33
+ document_processor = DocumentProcessor(api_client=api_client)
34
+ retriever = Retriever(api_client=api_client)
35
+ generator = Generator(api_client=api_client)
36
+
37
+ # Create app
38
+ app = ChatbotApp(
39
+ api_client=api_client,
40
+ document_processor=document_processor,
41
+ retriever=retriever,
42
+ generator=generator,
43
+ title="Norwegian RAG Chatbot",
44
+ description="En chatbot basert på Retrieval-Augmented Generation (RAG) for norsk språk."
45
+ )
46
+
47
+ # Create embedding example
48
+ embed_generator = EmbedGenerator()
49
+ create_embed_html_file(embed_generator)
50
+
51
+ # Launch app
52
+ app.launch(
53
+ server_name=args.host,
54
+ server_port=args.port,
55
+ share=args.share,
56
+ debug=args.debug
57
+ )
58
+
59
+ if __name__ == "__main__":
60
+ main()
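
With the arguments defined above, the entry point can be launched like this (a sketch; adjust host and port to your environment):

```bash
# Local development run with a public share link and debug logging
python src/main.py --host 127.0.0.1 --port 7860 --share --debug
```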
src/project_structure.md ADDED
@@ -0,0 +1,79 @@
+ # Norwegian RAG Chatbot Project Structure
+
+ ## Overview
+ This document outlines the project structure for our lightweight Norwegian RAG chatbot implementation that uses Hugging Face's Inference API instead of running models locally.
+
+ ## Directory Structure
+ ```
+ chatbot_project/
+ ├── design/                          # Design documents
+ │   ├── rag_architecture.md
+ │   ├── document_processing.md
+ │   └── chat_interface.md
+ ├── research/                        # Research findings
+ │   └── norwegian_llm_research.md
+ ├── src/                             # Source code
+ │   ├── api/                         # API integration
+ │   │   ├── __init__.py
+ │   │   ├── huggingface_api.py       # HF Inference API integration
+ │   │   └── config.py                # API configuration
+ │   ├── document_processing/         # Document processing
+ │   │   ├── __init__.py
+ │   │   ├── extractor.py             # Text extraction from documents
+ │   │   ├── chunker.py               # Text chunking
+ │   │   └── processor.py             # Main document processor
+ │   ├── rag/                         # RAG implementation
+ │   │   ├── __init__.py
+ │   │   ├── retriever.py             # Document retrieval
+ │   │   └── generator.py             # Response generation
+ │   ├── web/                         # Web interface
+ │   │   ├── __init__.py
+ │   │   ├── app.py                   # Gradio app
+ │   │   └── embed.py                 # Embedding functionality
+ │   ├── utils/                       # Utilities
+ │   │   ├── __init__.py
+ │   │   └── helpers.py               # Helper functions
+ │   └── main.py                      # Main application entry point
+ ├── data/                            # Data storage
+ │   ├── documents/                   # Original documents
+ │   └── processed/                   # Processed documents and embeddings
+ ├── tests/                           # Tests
+ │   ├── test_api.py
+ │   ├── test_document_processing.py
+ │   └── test_rag.py
+ ├── venv/                            # Virtual environment
+ ├── requirements-ultra-light.txt     # Lightweight dependencies
+ ├── requirements.txt                 # Original requirements (for reference)
+ └── README.md                        # Project documentation
+ ```
+
+ ## Key Components
+
+ ### 1. API Integration (`src/api/`)
+ - `huggingface_api.py`: Integration with Hugging Face Inference API for both LLM and embedding models
+ - `config.py`: Configuration for API endpoints, model IDs, and API keys
+
+ ### 2. Document Processing (`src/document_processing/`)
+ - `extractor.py`: Extract text from various document formats
+ - `chunker.py`: Split documents into manageable chunks
+ - `processor.py`: Orchestrate the document processing pipeline
+
+ ### 3. RAG Implementation (`src/rag/`)
+ - `retriever.py`: Retrieve relevant document chunks based on the query
+ - `generator.py`: Generate responses using retrieved context
+
+ ### 4. Web Interface (`src/web/`)
+ - `app.py`: Gradio web interface for the chatbot
+ - `embed.py`: Generate embedding code for website integration
+
+ ### 5. Main Application (`src/main.py`)
+ - Entry point for the application
+ - Orchestrates the different components
+
+ ## Implementation Approach
+
+ 1. **Remote Model Execution**: Use Hugging Face's Inference API for both LLM and embedding models
+ 2. **Lightweight Document Processing**: Process documents locally but use remote APIs for embedding generation
+ 3. **Simple Vector Storage**: Store embeddings in a simple file-based format rather than a dedicated vector database (see the sketch after this list)
+ 4. **Gradio Interface**: Create a simple but effective chat interface using Gradio
+ 5. **Hugging Face Spaces Deployment**: Deploy the final solution to Hugging Face Spaces
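
To make point 3 concrete, this is roughly what one processed-document file looks like as written by `DocumentProcessor._save_processed_document` earlier in this diff (all values are illustrative, and the embedding vectors are shortened to two dimensions for readability):

```json
{
  "document_id": "doc_20250101120000_example.pdf",
  "metadata": {
    "filename": "example.pdf",
    "processed_date": "2025-01-01T12:00:00",
    "chunk_count": 2,
    "chunking_strategy": "paragraph",
    "embedding_model": "<embedding-model-id>"
  },
  "chunks": ["Første tekstbit av dokumentet.", "Andre tekstbit av dokumentet."],
  "embeddings": [[0.012, -0.034], [0.051, 0.007]]
}
```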
src/rag/__init__.py ADDED
@@ -0,0 +1,3 @@
+ """
2
+ RAG module for Norwegian chatbot.
3
+ """
src/rag/generator.py ADDED
@@ -0,0 +1,87 @@
+ """
2
+ Generator module for Norwegian RAG chatbot.
3
+ Generates responses using retrieved context and LLM.
4
+ """
5
+
6
+ from typing import List, Dict, Any, Optional
7
+
8
+ from ..api.huggingface_api import HuggingFaceAPI, create_rag_prompt
9
+
10
+ class Generator:
11
+ """
12
+ Generates responses using retrieved context and LLM.
13
+ Uses Hugging Face Inference API for text generation.
14
+ """
15
+
16
+ def __init__(
17
+ self,
18
+ api_client: Optional[HuggingFaceAPI] = None,
19
+ ):
20
+ """
21
+ Initialize the generator.
22
+
23
+ Args:
24
+ api_client: HuggingFaceAPI client for text generation
25
+ """
26
+ self.api_client = api_client or HuggingFaceAPI()
27
+
28
+ def generate(
29
+ self,
30
+ query: str,
31
+ retrieved_chunks: List[Dict[str, Any]],
32
+ temperature: float = 0.7
33
+ ) -> str:
34
+ """
35
+ Generate a response using retrieved context.
36
+
37
+ Args:
38
+ query: User query
39
+ retrieved_chunks: List of retrieved chunks with metadata
40
+ temperature: Temperature for text generation
41
+
42
+ Returns:
43
+ Generated response
44
+ """
45
+ # Extract text from retrieved chunks
46
+ context_texts = [chunk["chunk_text"] for chunk in retrieved_chunks]
47
+
48
+ # If no context is retrieved, generate a response without context
49
+ if not context_texts:
50
+ return self._generate_without_context(query, temperature)
51
+
52
+ # Create RAG prompt
53
+ prompt = create_rag_prompt(query, context_texts)
54
+
55
+ # Generate response
56
+ response = self.api_client.generate_text(
57
+ prompt=prompt,
58
+ temperature=temperature
59
+ )
60
+
61
+ return response
62
+
63
+ def _generate_without_context(self, query: str, temperature: float = 0.7) -> str:
64
+ """
65
+ Generate a response without context when no relevant chunks are found.
66
+
67
+ Args:
68
+ query: User query
69
+ temperature: Temperature for text generation
70
+
71
+ Returns:
72
+ Generated response
73
+ """
74
+ prompt = f"""Du er en hjelpsom assistent som svarer på norsk. Svar på følgende spørsmål så godt du kan.
75
+
76
+ SPØRSMÅL:
77
+ {query}
78
+
79
+ SVAR:
80
+ """
81
+
82
+ response = self.api_client.generate_text(
83
+ prompt=prompt,
84
+ temperature=temperature
85
+ )
86
+
87
+ return response
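
A sketch of how the generator consumes retriever output (the chunk dict mirrors the shape built in `retriever.py` below; the text and scores are made up, and API access is assumed to be configured):

```python
from src.rag.generator import Generator

generator = Generator()

retrieved_chunks = [
    {
        "document_id": "doc_20250101120000_example.pdf",
        "chunk_index": 0,
        "chunk_text": "Oslo er hovedstaden i Norge.",
        "similarity": 0.87,
        "metadata": {"filename": "example.pdf"},
    }
]

svar = generator.generate(
    query="Hva er hovedstaden i Norge?",
    retrieved_chunks=retrieved_chunks,
    temperature=0.3,
)
print(svar)
```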
src/rag/retriever.py ADDED
@@ -0,0 +1,163 @@
+ """
2
+ Retriever module for Norwegian RAG chatbot.
3
+ Retrieves relevant document chunks based on query embeddings.
4
+ """
5
+
6
+ import os
7
+ import json
8
+ import numpy as np
9
+ from typing import List, Dict, Any, Optional, Tuple, Union
10
+
11
+ from ..api.huggingface_api import HuggingFaceAPI
12
+ from ..api.config import MAX_CHUNKS_TO_RETRIEVE, SIMILARITY_THRESHOLD
13
+
14
+ class Retriever:
15
+ """
16
+ Retrieves relevant document chunks based on query embeddings.
17
+ Uses cosine similarity to find the most relevant chunks.
18
+ """
19
+
20
+ def __init__(
21
+ self,
22
+ api_client: Optional[HuggingFaceAPI] = None,
23
+ processed_dir: str = "/home/ubuntu/chatbot_project/data/processed",
24
+ max_chunks: int = MAX_CHUNKS_TO_RETRIEVE,
25
+ similarity_threshold: float = SIMILARITY_THRESHOLD
26
+ ):
27
+ """
28
+ Initialize the retriever.
29
+
30
+ Args:
31
+ api_client: HuggingFaceAPI client for generating embeddings
32
+ processed_dir: Directory containing processed documents
33
+ max_chunks: Maximum number of chunks to retrieve
34
+ similarity_threshold: Minimum similarity score for retrieval
35
+ """
36
+ self.api_client = api_client or HuggingFaceAPI()
37
+ self.processed_dir = processed_dir
38
+ self.max_chunks = max_chunks
39
+ self.similarity_threshold = similarity_threshold
40
+
41
+ # Load document index
42
+ self.document_index_path = os.path.join(self.processed_dir, "document_index.json")
43
+ self.document_index = self._load_document_index()
44
+
45
+ def retrieve(self, query: str) -> List[Dict[str, Any]]:
46
+ """
47
+ Retrieve relevant document chunks for a query.
48
+
49
+ Args:
50
+ query: User query
51
+
52
+ Returns:
53
+ List of retrieved chunks with metadata
54
+ """
55
+ # Generate embedding for the query
56
+ query_embedding = self.api_client.generate_embeddings(query)[0]
57
+
58
+ # Find relevant chunks across all documents
59
+ all_results = []
60
+
61
+ for doc_id in self.document_index:
62
+ try:
63
+ # Load document data
64
+ doc_results = self._retrieve_from_document(doc_id, query_embedding)
65
+ all_results.extend(doc_results)
66
+ except Exception as e:
67
+ print(f"Error retrieving from document {doc_id}: {str(e)}")
68
+
69
+ # Sort all results by similarity score
70
+ all_results.sort(key=lambda x: x["similarity"], reverse=True)
71
+
72
+ # Return top results above threshold
73
+ return [
74
+ result for result in all_results[:self.max_chunks]
75
+ if result["similarity"] >= self.similarity_threshold
76
+ ]
77
+
78
+ def _retrieve_from_document(
79
+ self,
80
+ document_id: str,
81
+ query_embedding: List[float]
82
+ ) -> List[Dict[str, Any]]:
83
+ """
84
+ Retrieve relevant chunks from a specific document.
85
+
86
+ Args:
87
+ document_id: Document ID
88
+ query_embedding: Query embedding vector
89
+
90
+ Returns:
91
+ List of retrieved chunks with metadata
92
+ """
93
+ document_path = os.path.join(self.processed_dir, f"{document_id}.json")
94
+ if not os.path.exists(document_path):
95
+ return []
96
+
97
+ # Load document data
98
+ with open(document_path, 'r', encoding='utf-8') as f:
99
+ document_data = json.load(f)
100
+
101
+ chunks = document_data.get("chunks", [])
102
+ embeddings = document_data.get("embeddings", [])
103
+ metadata = document_data.get("metadata", {})
104
+
105
+ if not chunks or not embeddings or len(chunks) != len(embeddings):
106
+ return []
107
+
108
+ # Calculate similarity scores
109
+ results = []
110
+ for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
111
+ similarity = self._cosine_similarity(query_embedding, embedding)
112
+
113
+ results.append({
114
+ "document_id": document_id,
115
+ "chunk_index": i,
116
+ "chunk_text": chunk,
117
+ "similarity": similarity,
118
+ "metadata": metadata
119
+ })
120
+
121
+ # Sort by similarity
122
+ results.sort(key=lambda x: x["similarity"], reverse=True)
123
+
124
+ return results[:self.max_chunks]
125
+
126
+ def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
127
+ """
128
+ Calculate cosine similarity between two vectors.
129
+
130
+ Args:
131
+ vec1: First vector
132
+ vec2: Second vector
133
+
134
+ Returns:
135
+ Cosine similarity score
136
+ """
137
+ vec1 = np.array(vec1)
138
+ vec2 = np.array(vec2)
139
+
140
+ dot_product = np.dot(vec1, vec2)
141
+ norm1 = np.linalg.norm(vec1)
142
+ norm2 = np.linalg.norm(vec2)
143
+
144
+ if norm1 == 0 or norm2 == 0:
145
+ return 0.0
146
+
147
+ return dot_product / (norm1 * norm2)
148
+
149
+ def _load_document_index(self) -> Dict[str, Dict[str, Any]]:
150
+ """
151
+ Load the document index from disk.
152
+
153
+ Returns:
154
+ Dictionary of document IDs to metadata
155
+ """
156
+ if os.path.exists(self.document_index_path):
157
+ try:
158
+ with open(self.document_index_path, 'r', encoding='utf-8') as f:
159
+ return json.load(f)
160
+ except Exception as e:
161
+ print(f"Error loading document index: {str(e)}")
162
+
163
+ return {}
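
To see what `_cosine_similarity` and `SIMILARITY_THRESHOLD` are doing, here is a small worked example with toy 2-d vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Same math as Retriever._cosine_similarity above
    vec1, vec2 = np.array(vec1), np.array(vec2)
    denom = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    return float(np.dot(vec1, vec2) / denom) if denom else 0.0

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0   (same direction)
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # ~0.707
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0   (orthogonal)
```

Only chunks whose score clears the configured threshold are passed on to the generator.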
src/web/__init__.py ADDED
@@ -0,0 +1,3 @@
+ """
2
+ Web interface module for Norwegian RAG chatbot.
3
+ """
src/web/app.py ADDED
@@ -0,0 +1,301 @@
+ """
2
+ Gradio app for Norwegian RAG chatbot.
3
+ Provides a web interface for interacting with the chatbot.
4
+ """
5
+
6
+ import os
7
+ import gradio as gr
8
+ import tempfile
9
+ from typing import List, Dict, Any, Tuple, Optional
10
+
11
+ from ..api.huggingface_api import HuggingFaceAPI
12
+ from ..document_processing.processor import DocumentProcessor
13
+ from ..rag.retriever import Retriever
14
+ from ..rag.generator import Generator
15
+
16
+ class ChatbotApp:
17
+ """
18
+ Gradio app for Norwegian RAG chatbot.
19
+ """
20
+
21
+ def __init__(
22
+ self,
23
+ api_client: Optional[HuggingFaceAPI] = None,
24
+ document_processor: Optional[DocumentProcessor] = None,
25
+ retriever: Optional[Retriever] = None,
26
+ generator: Optional[Generator] = None,
27
+ title: str = "Norwegian RAG Chatbot",
28
+ description: str = "En chatbot basert på Retrieval-Augmented Generation (RAG) for norsk språk."
29
+ ):
30
+ """
31
+ Initialize the chatbot app.
32
+
33
+ Args:
34
+ api_client: HuggingFaceAPI client
35
+ document_processor: Document processor
36
+ retriever: Retriever for finding relevant chunks
37
+ generator: Generator for creating responses
38
+ title: App title
39
+ description: App description
40
+ """
41
+ # Initialize components
42
+ self.api_client = api_client or HuggingFaceAPI()
43
+ self.document_processor = document_processor or DocumentProcessor(api_client=self.api_client)
44
+ self.retriever = retriever or Retriever(api_client=self.api_client)
45
+ self.generator = generator or Generator(api_client=self.api_client)
46
+
47
+ # App settings
48
+ self.title = title
49
+ self.description = description
50
+
51
+ # Initialize Gradio app
52
+ self.app = self._build_interface()
53
+
54
+ def _build_interface(self) -> gr.Blocks:
55
+ """
56
+ Build the Gradio interface.
57
+
58
+ Returns:
59
+ Gradio Blocks interface
60
+ """
61
+ with gr.Blocks(title=self.title) as app:
62
+ gr.Markdown(f"# {self.title}")
63
+ gr.Markdown(self.description)
64
+
65
+ with gr.Tabs():
66
+ # Chat tab
67
+ with gr.Tab("Chat"):
68
+ chatbot = gr.Chatbot(height=500)
69
+
70
+ with gr.Row():
71
+ msg = gr.Textbox(
72
+ placeholder="Skriv din melding her...",
73
+ show_label=False,
74
+ scale=9
75
+ )
76
+ submit_btn = gr.Button("Send", scale=1)
77
+
78
+ with gr.Accordion("Avanserte innstillinger", open=False):
79
+ temperature = gr.Slider(
80
+ minimum=0.1,
81
+ maximum=1.0,
82
+ value=0.7,
83
+ step=0.1,
84
+ label="Temperatur"
85
+ )
86
+
87
+ clear_btn = gr.Button("Tøm chat")
88
+
89
+ # Set up event handlers
90
+ submit_btn.click(
91
+ fn=self._respond,
92
+ inputs=[msg, chatbot, temperature],
93
+ outputs=[msg, chatbot]
94
+ )
95
+
96
+ msg.submit(
97
+ fn=self._respond,
98
+ inputs=[msg, chatbot, temperature],
99
+ outputs=[msg, chatbot]
100
+ )
101
+
102
+ clear_btn.click(
103
+ fn=lambda: None,
104
+ inputs=None,
105
+ outputs=chatbot,
106
+ queue=False
107
+ )
108
+
109
+ # Document upload tab
110
+ with gr.Tab("Last opp dokumenter"):
111
+ with gr.Row():
112
+ with gr.Column(scale=2):
113
+ file_output = gr.File(label="Opplastede dokumenter")
114
+ upload_button = gr.UploadButton(
115
+ "Klikk for å laste opp dokument",
116
+ file_types=["pdf", "txt", "html"],
117
+ file_count="multiple"
118
+ )
119
+
120
+ with gr.Column(scale=3):
121
+ documents_list = gr.Dataframe(
122
+ headers=["Dokument ID", "Filnavn", "Dato", "Chunks"],
123
+ label="Dokumentliste",
124
+ interactive=False
125
+ )
126
+
127
+ process_status = gr.Textbox(label="Status", interactive=False)
128
+ refresh_btn = gr.Button("Oppdater dokumentliste")
129
+
130
+ # Set up event handlers
131
+ upload_button.upload(
132
+ fn=self._process_uploaded_files,
133
+ inputs=[upload_button],
134
+ outputs=[process_status, documents_list]
135
+ )
136
+
137
+ refresh_btn.click(
138
+ fn=self._get_documents_list,
139
+ inputs=None,
140
+ outputs=[documents_list]
141
+ )
142
+
143
+ # Embed tab
144
+ with gr.Tab("Integrer"):
145
+ gr.Markdown("## Integrer chatboten på din nettside")
146
+
147
+ with gr.Row():
148
+ with gr.Column():
149
+ gr.Markdown("### iFrame-kode")
150
+ iframe_code = gr.Code(
151
+ label="iFrame",
152
+ language="html",
153
+ value='<iframe src="https://huggingface.co/spaces/username/norwegian-rag-chatbot" width="100%" height="500px"></iframe>'
154
+ )
155
+
156
+ with gr.Column():
157
+ gr.Markdown("### JavaScript Widget")
158
+ js_code = gr.Code(
159
+ label="JavaScript",
160
+ language="html",
161
+ value='<script src="https://huggingface.co/spaces/username/norwegian-rag-chatbot/widget.js"></script>'
162
+ )
163
+
164
+ gr.Markdown("### Forhåndsvisning")
165
+ gr.Markdown("*Forhåndsvisning vil være tilgjengelig etter at chatboten er distribuert til Hugging Face Spaces.*")
166
+
167
+ gr.Markdown("---")
168
+ gr.Markdown("Bygget med [Hugging Face](https://huggingface.co/) og [Gradio](https://gradio.app/)")
169
+
170
+ return app
171
+
172
+ def _respond(
173
+ self,
174
+ message: str,
175
+ chat_history: List[Tuple[str, str]],
176
+ temperature: float
177
+ ) -> Tuple[str, List[Tuple[str, str]]]:
178
+ """
179
+ Generate a response to the user message.
180
+
181
+ Args:
182
+ message: User message
183
+ chat_history: Chat history
184
+ temperature: Temperature for text generation
185
+
186
+ Returns:
187
+ Empty message and updated chat history
188
+ """
189
+ if not message:
190
+ return "", chat_history
191
+
192
+ # Add user message to chat history
193
+ chat_history.append((message, None))
194
+
195
+ try:
196
+ # Retrieve relevant chunks
197
+ retrieved_chunks = self.retriever.retrieve(message)
198
+
199
+ # Generate response
200
+ response = self.generator.generate(
201
+ query=message,
202
+ retrieved_chunks=retrieved_chunks,
203
+ temperature=temperature
204
+ )
205
+
206
+ # Update chat history with response
207
+ chat_history[-1] = (message, response)
208
+ except Exception as e:
209
+ # Handle errors
210
+ error_message = f"Beklager, det oppstod en feil: {str(e)}"
211
+ chat_history[-1] = (message, error_message)
212
+
213
+ return "", chat_history
214
+
215
+ def _process_uploaded_files(
216
+ self,
217
+ files: List[tempfile._TemporaryFileWrapper]
218
+ ) -> Tuple[str, List[List[str]]]:
219
+ """
220
+ Process uploaded files.
221
+
222
+ Args:
223
+ files: List of uploaded files
224
+
225
+ Returns:
226
+ Status message and updated documents list
227
+ """
228
+ if not files:
229
+ return "Ingen filer lastet opp.", self._get_documents_list()
230
+
231
+ processed_files = []
232
+
233
+ for file in files:
234
+ try:
235
+ # Process the document
236
+ document_id = self.document_processor.process_document(file.name)
237
+ processed_files.append(os.path.basename(file.name))
238
+ except Exception as e:
239
+ return f"Feil ved behandling av {os.path.basename(file.name)}: {str(e)}", self._get_documents_list()
240
+
241
+ if len(processed_files) == 1:
242
+ status = f"Fil behandlet: {processed_files[0]}"
243
+ else:
244
+ status = f"{len(processed_files)} filer behandlet: {', '.join(processed_files)}"
245
+
246
+ return status, self._get_documents_list()
247
+
248
+ def _get_documents_list(self) -> List[List[str]]:
249
+ """
250
+ Get list of processed documents.
251
+
252
+ Returns:
253
+ List of document information
254
+ """
255
+ documents = self.document_processor.get_all_documents()
256
+
257
+ # Format for dataframe
258
+ documents_list = []
259
+ for doc_id, metadata in documents.items():
260
+ filename = metadata.get("filename", "N/A")
261
+ processed_date = metadata.get("processed_date", "N/A")
262
+ chunk_count = metadata.get("chunk_count", 0)
263
+
264
+ documents_list.append([doc_id, filename, processed_date, chunk_count])
265
+
266
+ return documents_list
267
+
268
+ def launch(self, **kwargs):
269
+ """
270
+ Launch the Gradio app.
271
+
272
+ Args:
273
+ **kwargs: Additional arguments for gr.launch()
274
+ """
275
+ self.app.launch(**kwargs)
276
+
277
+
278
+ def create_app():
279
+ """
280
+ Create and configure the chatbot app.
281
+
282
+ Returns:
283
+ Configured ChatbotApp instance
284
+ """
285
+ # Initialize API client
286
+ api_client = HuggingFaceAPI()
287
+
288
+ # Initialize components
289
+ document_processor = DocumentProcessor(api_client=api_client)
290
+ retriever = Retriever(api_client=api_client)
291
+ generator = Generator(api_client=api_client)
292
+
293
+ # Create app
294
+ app = ChatbotApp(
295
+ api_client=api_client,
296
+ document_processor=document_processor,
297
+ retriever=retriever,
298
+ generator=generator
299
+ )
300
+
301
+ return app
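
For a Hugging Face Space, the module-level factory above is typically all an entry file needs (a sketch; the launch kwargs are passed straight through to Gradio):

```python
from src.web.app import create_app

app = create_app()
app.launch(server_name="0.0.0.0", server_port=7860)
```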
src/web/embed.py ADDED
@@ -0,0 +1,211 @@
+ """
2
+ Embedding functionality for Norwegian RAG chatbot.
3
+ Provides utilities for embedding the chatbot in external websites.
4
+ """
5
+
6
+ import os
7
+ from typing import Dict, Optional
8
+
9
+ class EmbedGenerator:
10
+ """
11
+ Generates embedding code for integrating the chatbot into external websites.
12
+ """
13
+
14
+ def __init__(
15
+ self,
16
+ space_name: Optional[str] = None,
17
+ username: Optional[str] = None,
18
+ height: int = 500,
19
+ width: str = "100%"
20
+ ):
21
+ """
22
+ Initialize the embed generator.
23
+
24
+ Args:
25
+ space_name: Hugging Face Space name
26
+ username: Hugging Face username
27
+ height: Default iframe height in pixels
28
+ width: Default iframe width (can be pixels or percentage)
29
+ """
30
+ self.space_name = space_name or "norwegian-rag-chatbot"
31
+ self.username = username or "username"
32
+ self.height = height
33
+ self.width = width
34
+
35
+ def get_iframe_code(
36
+ self,
37
+ height: Optional[int] = None,
38
+ width: Optional[str] = None
39
+ ) -> str:
40
+ """
41
+ Generate iframe embed code.
42
+
43
+ Args:
44
+ height: Optional custom height
45
+ width: Optional custom width
46
+
47
+ Returns:
48
+ HTML iframe code
49
+ """
50
+ h = height or self.height
51
+ w = width or self.width
52
+
53
+ return f'<iframe src="https://huggingface.co/spaces/{self.username}/{self.space_name}" width="{w}" height="{h}px" frameborder="0"></iframe>'
54
+
55
+ def get_javascript_widget_code(self) -> str:
56
+ """
57
+ Generate JavaScript widget embed code.
58
+
59
+ Returns:
60
+ HTML script tag for widget
61
+ """
62
+ return f'<script src="https://huggingface.co/spaces/{self.username}/{self.space_name}/widget.js"></script>'
63
+
64
+ def get_direct_url(self) -> str:
65
+ """
66
+ Get direct URL to the Hugging Face Space.
67
+
68
+ Returns:
69
+ URL to the Hugging Face Space
70
+ """
71
+ return f"https://huggingface.co/spaces/{self.username}/{self.space_name}"
72
+
73
+ def get_embed_options(self) -> Dict[str, str]:
74
+ """
75
+ Get all embedding options.
76
+
77
+ Returns:
78
+ Dictionary of embedding options
79
+ """
80
+ return {
81
+ "iframe": self.get_iframe_code(),
82
+ "javascript": self.get_javascript_widget_code(),
83
+ "url": self.get_direct_url()
84
+ }
85
+
86
+ def update_space_info(self, username: str, space_name: str) -> None:
87
+ """
88
+ Update Hugging Face Space information.
89
+
90
+ Args:
91
+ username: Hugging Face username
92
+ space_name: Hugging Face Space name
93
+ """
94
+ self.username = username
95
+ self.space_name = space_name
96
+
97
+
98
+ def create_embed_html_file(
99
+ embed_generator: EmbedGenerator,
100
+ output_path: str = "/home/ubuntu/chatbot_project/embed_example.html"
101
+ ) -> str:
102
+ """
103
+ Create an HTML file with embedding examples.
104
+
105
+ Args:
106
+ embed_generator: EmbedGenerator instance
107
+ output_path: Path to save the HTML file
108
+
109
+ Returns:
110
+ Path to the created HTML file
111
+ """
112
+ html_content = f"""<!DOCTYPE html>
113
+ <html lang="no">
114
+ <head>
115
+ <meta charset="UTF-8">
116
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
117
+ <title>Norwegian RAG Chatbot - Embedding Examples</title>
118
+ <style>
119
+ body {{
120
+ font-family: Arial, sans-serif;
121
+ line-height: 1.6;
122
+ max-width: 800px;
123
+ margin: 0 auto;
124
+ padding: 20px;
125
+ }}
126
+ h1, h2, h3 {{
127
+ color: #2c3e50;
128
+ }}
129
+ .code-block {{
130
+ background-color: #f8f9fa;
131
+ border: 1px solid #ddd;
132
+ border-radius: 4px;
133
+ padding: 15px;
134
+ margin: 15px 0;
135
+ overflow-x: auto;
136
+ }}
137
+ .example {{
138
+ margin: 30px 0;
139
+ padding: 20px;
140
+ border: 1px solid #eee;
141
+ border-radius: 5px;
142
+ }}
143
+ </style>
144
+ </head>
145
+ <body>
146
+ <h1>Norwegian RAG Chatbot - Embedding Examples</h1>
147
+
148
+ <p>
149
+ This page demonstrates how to embed the Norwegian RAG Chatbot into your website.
150
+ There are multiple ways to integrate the chatbot, depending on your needs.
151
+ </p>
152
+
153
+ <h2>Option 1: iFrame Embedding</h2>
154
+ <p>
155
+ The simplest way to embed the chatbot is using an iFrame. Copy and paste the following code into your HTML:
156
+ </p>
157
+ <div class="code-block">
158
+ <pre>{embed_generator.get_iframe_code()}</pre>
159
+ </div>
160
+
161
+ <div class="example">
162
+ <h3>Example:</h3>
163
+ {embed_generator.get_iframe_code()}
164
+ </div>
165
+
166
+ <h2>Option 2: JavaScript Widget</h2>
167
+ <p>
168
+ For a more integrated experience, you can use the JavaScript widget. Copy and paste the following code into your HTML:
169
+ </p>
170
+ <div class="code-block">
171
+ <pre>{embed_generator.get_javascript_widget_code()}</pre>
172
+ </div>
173
+
174
+ <div class="example">
175
+ <h3>Example:</h3>
176
+ <p>The widget will appear below once the page is hosted on a web server:</p>
177
+ <!-- Widget will be inserted here when the script runs -->
178
+ </div>
179
+
180
+ <h2>Option 3: Direct Link</h2>
181
+ <p>
182
+ You can also provide a direct link to the chatbot:
183
+ </p>
184
+ <div class="code-block">
185
+ <pre>{embed_generator.get_direct_url()}</pre>
186
+ </div>
187
+
188
+ <h2>Customization</h2>
189
+ <p>
190
+ You can customize the appearance of the embedded chatbot by modifying the iFrame dimensions:
191
+ </p>
192
+ <div class="code-block">
193
+ <pre>{embed_generator.get_iframe_code(height=600, width="80%")}</pre>
194
+ </div>
195
+
196
+ <footer>
197
+ <p>
198
+ <small>
199
+ Created with <a href="https://huggingface.co/" target="_blank">Hugging Face</a> and
200
+ <a href="https://gradio.app/" target="_blank">Gradio</a>.
201
+ </small>
202
+ </p>
203
+ </footer>
204
+ </body>
205
+ </html>
206
+ """
207
+
208
+ with open(output_path, 'w', encoding='utf-8') as f:
209
+ f.write(html_content)
210
+
211
+ return output_path
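
A quick sketch of generating embed snippets once a real Space exists (the username and output path are placeholders):

```python
from src.web.embed import EmbedGenerator, create_embed_html_file

gen = EmbedGenerator()
gen.update_space_info(username="my-user", space_name="norwegian-rag-chatbot")

options = gen.get_embed_options()
print(options["iframe"])  # <iframe src="https://huggingface.co/spaces/my-user/...">
print(options["url"])     # direct link to the Space

# Write a self-contained demo page showing all three integration options
create_embed_html_file(gen, output_path="embed_example.html")
```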
todo.md ADDED
@@ -0,0 +1,26 @@
+ # Norwegian RAG Chatbot Project Todo
+
+ ## Research Phase
+ - [x] Research open-source LLMs with good Norwegian language support
+ - [x] Evaluate embedding models for Norwegian text
+ - [x] Research vector database options for RAG implementation
+ - [x] Document findings and select best options
+
+ ## Design Phase
+ - [x] Design RAG architecture
+ - [x] Plan document processing pipeline
+ - [x] Design chat interface
+ - [x] Plan embedding functionality
+
+ ## Implementation Phase
+ - [ ] Set up development environment
+ - [ ] Implement document processing and embedding
+ - [ ] Integrate LLM
+ - [ ] Create chat interface
+ - [ ] Develop embedding functionality
+
+ ## Testing and Finalization
+ - [ ] Test with Norwegian content
+ - [ ] Optimize performance
+ - [ ] Document usage and integration
+ - [ ] Finalize solution