Upload 29 files
- README.md +116 -10
- app.yaml +25 -0
- data/documents/.gitkeep +0 -0
- data/documents/test_document.txt +25 -0
- data/processed/.gitkeep +0 -0
- design/chat_interface.md +256 -0
- design/document_processing.md +170 -0
- design/rag_architecture.md +197 -0
- prepare_deployment.sh +37 -0
- requirements-minimal.txt +21 -0
- requirements-ultra-light.txt +7 -0
- requirements.txt +25 -1
- research/norwegian_llm_research.md +81 -0
- src/api/__init__.py +3 -0
- src/api/config.py +61 -0
- src/api/huggingface_api.py +213 -0
- src/document_processing/__init__.py +3 -0
- src/document_processing/chunker.py +262 -0
- src/document_processing/extractor.py +167 -0
- src/document_processing/processor.py +306 -0
- src/main.py +60 -0
- src/project_structure.md +79 -0
- src/rag/__init__.py +3 -0
- src/rag/generator.py +87 -0
- src/rag/retriever.py +163 -0
- src/web/__init__.py +3 -0
- src/web/app.py +301 -0
- src/web/embed.py +211 -0
- todo.md +26 -0
README.md
CHANGED
@@ -1,13 +1,119 @@
---
-
-
-
-
sdk: gradio
- sdk_version:
- app_file:
- pinned:
license: mit
- ---
-
- An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
# Norwegian RAG Chatbot

A Retrieval-Augmented Generation (RAG) chatbot with strong Norwegian language support, built on Hugging Face's Inference API.

## Features

- **Norwegian Language Support**: Leverages state-of-the-art Norwegian language models such as NorMistral, Viking, and NorskGPT
- **Document Processing**: Upload and process documents in various formats (PDF, TXT, HTML)
- **RAG Implementation**: Retrieves relevant context from documents to generate accurate responses
- **Embeddable Interface**: Easily embed the chatbot in any website using an iframe or a JavaScript widget
- **Lightweight Architecture**: Uses Hugging Face's Inference API instead of running models locally

## Architecture

This chatbot uses a lightweight architecture that leverages Hugging Face's hosted models:

1. **Document Processing**: Documents are processed locally, extracting text and splitting it into chunks
2. **Embedding Generation**: Document chunks are embedded via Hugging Face's Inference API
3. **Retrieval**: When a query is received, the most relevant document chunks are retrieved
4. **Response Generation**: The LLM generates a response based on the retrieved context (see the sketch below)
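
As a rough illustration only (not shipped in the repository), these four steps can be wired together with the `HuggingFaceAPI` client and `create_rag_prompt` helper from `src/api/huggingface_api.py`; the `chunks` and `chunk_embeddings` inputs are assumed outputs of the document pipeline:

```python
# Minimal sketch of the four-step RAG flow (assumes the repo's API client
# and precomputed `chunks` / `chunk_embeddings` from the document pipeline).
import numpy as np
from src.api.huggingface_api import HuggingFaceAPI, create_rag_prompt

api = HuggingFaceAPI()  # reads HF_API_KEY from the environment

def answer(query: str, chunks: list[str], chunk_embeddings: np.ndarray) -> str:
    # 2. Embed the query with the same model used for the documents
    q = np.array(api.generate_embeddings(query)[0])
    # 3. Retrieve the most similar chunks by cosine similarity
    sims = chunk_embeddings @ q / (
        np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(q) + 1e-10
    )
    top = [chunks[i] for i in np.argsort(-sims)[:5]]
    # 4. Generate a grounded answer from the retrieved context
    return api.generate_text(create_rag_prompt(query, top))
```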
## Getting Started

### Prerequisites

- Python 3.10+
- A Hugging Face account (for API access)

### Installation

1. Clone the repository:
```bash
git clone https://huggingface.co/spaces/username/norwegian-rag-chatbot
cd norwegian-rag-chatbot
```

2. Install dependencies:
```bash
pip install -r requirements-ultra-light.txt
```

3. Set up your Hugging Face API key:
```bash
export HF_API_KEY="your_api_key_here"
```

### Running the Chatbot

```bash
python src/main.py
```

The chatbot will be available at http://localhost:7860

## Usage

### Chat Interface

The main chat interface allows you to:
- Ask questions in Norwegian
- Receive responses based on your uploaded documents
- Adjust temperature and other settings

### Document Upload

You can upload documents to provide context for the chatbot:
- Supported formats: PDF, TXT, HTML
- Documents are automatically processed and indexed
- The chatbot will use these documents to provide more accurate responses

### Embedding

You can embed the chatbot in your website using:
- iFrame embedding
- JavaScript widget
- Direct link

## Deployment

The chatbot is designed to be deployed to Hugging Face Spaces:

1. Create a new Space on Hugging Face
2. Upload the code to the Space
3. Set the HF_API_KEY secret in the Space settings
4. The Space will automatically build and deploy the chatbot

## Models

The chatbot can use various Norwegian language models:

- **NorMistral-7b-scratch**: A large Norwegian language model pretrained from scratch
- **Viking 7B**: A multilingual model for Nordic languages
- **NorskGPT**: A Norwegian language model based on Mistral or Llama 2

For embeddings, it uses:
- **NbAiLab/nb-sbert-base**: A Norwegian sentence embedding model

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgements

- [Hugging Face](https://huggingface.co/) for hosting the models and providing the Inference API
- [Gradio](https://gradio.app/) for the web interface framework
- The creators of the Norwegian language models used in this project

---

name: norwegian-rag-chatbot
title: Norwegian RAG Chatbot
emoji: 🇳🇴
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 4.0.0
app_file: src/main.py
pinned: true
license: mit
app.yaml
ADDED
@@ -0,0 +1,25 @@
sdk:
  base_image: python:3.10
  build_commands:
    - pip install -r requirements-ultra-light.txt
  python_packages:
    - gradio>=4.0.0
    - huggingface_hub>=0.19.0
    - requests>=2.31.0
    - numpy>=1.24.0
    - PyPDF2>=3.0.0
    - beautifulsoup4>=4.12.0

app:
  title: Norwegian RAG Chatbot
  emoji: 🇳🇴
  colorPrimary: "#00205B"
  colorSecondary: "#EF2B2D"
  pinned: true
  sdk: gradio
  python_version: "3.10"
  suggested_hardware: cpu-basic
  models:
    - norallm/normistral-7b-scratch
    - NbAiLab/nb-sbert-base
  spaces_server_url: https://api-inference.huggingface.co/models/
data/documents/.gitkeep
ADDED
File without changes
data/documents/test_document.txt
ADDED
@@ -0,0 +1,25 @@
# Norsk historie

Norge har en rik og fascinerende historie som strekker seg tilbake til vikingtiden. Vikingene var kjent for sine sjøreiser, handel og plyndring i store deler av Europa fra slutten av 700-tallet til midten av 1000-tallet.

## Middelalderen

I 1030 døde Olav Haraldsson (senere kjent som Olav den hellige) i slaget ved Stiklestad. Hans død markerte begynnelsen på kristendommens endelige gjennombrudd i Norge.

Norge ble forent til ett rike under Harald Hårfagre på 800-tallet. Etter vikingtiden fulgte en periode med borgerkrig før landet ble stabilisert under Håkon Håkonsson på 1200-tallet.

## Union med Danmark

Fra 1380 til 1814 var Norge i union med Danmark, en periode kjent som "dansketiden". Under denne perioden ble dansk det offisielle språket i administrasjon og litteratur, noe som hadde stor innflytelse på det norske språket.

## Grunnloven og union med Sverige

I 1814 fikk Norge sin egen grunnlov, signert på Eidsvoll 17. mai. Samme år ble Norge tvunget inn i en union med Sverige, som varte frem til 1905.

## Moderne Norge

Norge ble okkupert av Nazi-Tyskland under andre verdenskrig fra 1940 til 1945. Etter krigen opplevde landet rask økonomisk vekst.

Oppdagelsen av olje i Nordsjøen på slutten av 1960-tallet forvandlet Norge til en av verdens rikeste nasjoner per innbygger.

I dag er Norge kjent for sin velferdsstat, naturskjønnhet og høy levestandard.
data/processed/.gitkeep
ADDED
File without changes
design/chat_interface.md
ADDED
@@ -0,0 +1,256 @@
# Chat Interface Design

This document outlines the design for the chat interface of our Norwegian RAG-based chatbot. The interface will be implemented using Gradio and deployed on Hugging Face Spaces.

## Interface Requirements

### Functional Requirements

1. **Chat Interaction**:
   - Text input field for user queries
   - Response display area for chatbot answers
   - Support for multi-turn conversations
   - Message history display

2. **Document Management**:
   - Document upload functionality
   - Document list display
   - Status indicators for processing

3. **Configuration Options**:
   - Model selection (if multiple models are supported)
   - Language selection (Norwegian/English toggle)
   - Advanced parameter adjustment (optional)

4. **Embedding Functionality**:
   - Code snippet generation for embedding
   - Preview of the embedded widget
   - Copy-to-clipboard functionality

### Non-Functional Requirements

1. **Responsiveness**:
   - Mobile-friendly design
   - Adaptive layout for different screen sizes

2. **Performance**:
   - Efficient loading times
   - Progress indicators for long operations
   - Streaming responses for a better user experience

3. **Accessibility**:
   - WCAG 2.1 compliance
   - Keyboard navigation support
   - Screen reader compatibility

4. **Multilingual Support**:
   - Norwegian as the primary language
   - English as the secondary language
   - Language detection and switching

## UI Design

### Main Chat Interface

```
┌─────────────────────────────────────────────────────────────┐
│ Norwegian RAG Chatbot                            [🇳🇴/🇬🇧]    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                                                     │   │
│  │              Chat History Display                   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ Bot: Hei! Hvordan kan jeg hjelpe deg i dag? │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ User: Fortell meg om norsk historie.        │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ Bot: Norsk historie strekker seg...         │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Type your message...                       [Send]   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Clear Chat] [Settings] [Upload Documents] [Embed]         │
└─────────────────────────────────────────────────────────────┘
```

### Document Upload Interface

```
┌─────────────────────────────────────────────────────────────┐
│ Document Management                               [Close]   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  [Upload New Document]                                      │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                  Document List                      │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ norsk_historie.pdf                 [Remove] │   │   │
│  │  │ Status: Processed ✓                         │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │ vikinger.docx                      [Remove] │   │   │
│  │  │ Status: Processing... 75%                   │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Process All] [Remove All]                                 │
└─────────────────────────────────────────────────────────────┘
```

### Embed Code Interface

```
┌─────────────────────────────────────────────────────────────┐
│ Embed Chatbot                                     [Close]   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Embed Code (iFrame)                                 │   │
│  │                                                     │   │
│  │ <iframe src="https://huggingface.co/spaces/...      │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Copy to Clipboard]                                        │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Embed Code (JavaScript Widget)                      │   │
│  │                                                     │   │
│  │ <script src="https://huggingface.co/spaces/...      │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                             │
│  [Copy to Clipboard]                                        │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ Preview                                             │   │
│  │                                                     │   │
│  │                                                     │   │
│  └─────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
```

## Implementation with Gradio

Gradio is an ideal choice for implementing this interface due to its simplicity, Python integration, and native support on Hugging Face Spaces.

### Core Components

1. **Chat Interface**:
```python
with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label="Message")
    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # RAG processing logic here
        bot_message = get_rag_response(message)
        chat_history.append((message, bot_message))
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])
    clear.click(lambda: None, None, chatbot, queue=False)
```

2. **Document Upload**:
```python
with gr.Tab("Upload Documents"):
    file_output = gr.File()
    upload_button = gr.UploadButton("Click to Upload a File", file_types=[".pdf", ".docx", ".txt"])

    def upload_file(file):
        # Document processing logic here
        process_document(file.name)
        return file.name

    upload_button.upload(upload_file, upload_button, file_output)
```

3. **Embedding Code Generation**:
```python
with gr.Tab("Embed"):
    iframe_code = gr.Textbox(label="iFrame Embed Code")
    js_code = gr.Textbox(label="JavaScript Widget Code")

    def generate_embed_code():
        iframe = f'<iframe src="{SPACE_URL}" width="100%" height="500px"></iframe>'
        js = f'<script src="{SPACE_URL}/widget.js"></script>'
        return iframe, js

    embed_button = gr.Button("Generate Embed Code")
    embed_button.click(generate_embed_code, None, [iframe_code, js_code])
```

## Norwegian Language Support

1. **Interface Localization**:
   - Implement language-switching functionality
   - Store UI text in language-specific dictionaries
   - Apply translations based on the selected language (see the sketch after this list)

2. **Input Processing**:
   - Handle Norwegian special characters correctly
   - Implement Norwegian-specific text normalization

3. **Response Generation**:
   - Ensure proper formatting of Norwegian text
   - Handle Norwegian grammar and syntax correctly
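
A minimal sketch of the dictionary-based localization described above; `UI_TEXT` and its keys are illustrative placeholders, not existing project code:

```python
# Localization sketch: UI strings keyed by language code.
# UI_TEXT and the keys below are illustrative assumptions.
UI_TEXT = {
    "no": {"send": "Send", "clear": "Tøm samtale", "placeholder": "Skriv meldingen din..."},
    "en": {"send": "Send", "clear": "Clear chat", "placeholder": "Type your message..."},
}

def t(key: str, lang: str = "no") -> str:
    """Look up a UI string, falling back to Norwegian, then to the key itself."""
    return UI_TEXT.get(lang, UI_TEXT["no"]).get(key, key)
```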
## Responsive Design

1. **CSS Customization**:
```python
with gr.Blocks(css="""
    @media (max-width: 600px) {
        .container { padding: 5px; }
        .input-box { font-size: 14px; }
    }
""") as demo:
    pass  # Interface components
```

2. **Layout Adaptation**:
   - Use flexible layouts that adapt to screen size
   - Implement collapsible sections for mobile view
   - Ensure touch-friendly UI elements

## Deployment on Hugging Face Spaces

1. **Space Configuration**:
   - Create a `requirements.txt` file with all dependencies
   - Set up appropriate environment variables
   - Configure resource allocation

2. **Continuous Integration**:
   - Set up a GitHub repository for the project
   - Configure automatic deployment to Hugging Face Spaces
   - Implement version control for the interface

3. **Monitoring and Analytics**:
   - Add usage tracking
   - Implement error logging
   - Set up performance monitoring

## Next Steps

1. Implement the basic chat interface with Gradio
2. Add document upload and processing functionality
3. Create the embed code generation feature
4. Implement responsive design and language switching
5. Deploy to Hugging Face Spaces for testing
6. Gather feedback and iterate on the design
design/document_processing.md
ADDED
@@ -0,0 +1,170 @@
# Document Processing Pipeline Design

This document outlines the design for the document processing pipeline of our Norwegian RAG-based chatbot. The pipeline transforms raw documents into embeddings that can be efficiently retrieved during the chat process.

## Pipeline Overview

```
Raw Documents → Text Extraction → Text Chunking → Text Cleaning → Embedding Generation → Vector Storage
```

## Components

### 1. Text Extraction

**Purpose**: Extract plain text from various document formats.

**Supported Formats**:
- PDF (.pdf)
- Word documents (.docx, .doc)
- Text files (.txt)
- HTML (.html, .htm)
- Markdown (.md)

**Implementation**:
- Use PyPDF2 for PDF extraction
- Use python-docx for Word documents
- Use BeautifulSoup for HTML parsing
- Direct reading for text and Markdown files

### 2. Text Chunking

**Purpose**: Split documents into manageable chunks for more precise retrieval.

**Chunking Strategies**:
- Fixed-size chunks (512 tokens recommended for Norwegian text)
- Semantic chunking (split at paragraph or section boundaries)
- Overlapping chunks (100-token overlap recommended)

**Implementation** (see the sketch below):
- Use LangChain's text splitters
- Implement custom Norwegian-aware chunking logic
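
A minimal sketch of the LangChain option; note that `chunk_size`/`chunk_overlap` count characters here, not tokens, unless a token-based `length_function` is supplied:

```python
# Sketch: chunking with LangChain's RecursiveCharacterTextSplitter.
from langchain.text_splitter import RecursiveCharacterTextSplitter

raw_text = open("data/documents/test_document.txt", encoding="utf-8").read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
    # Prefer paragraph, then line, then sentence, then word boundaries.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(raw_text)
```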
### 3. Text Cleaning

**Purpose**: Normalize and clean text to improve embedding quality.

**Cleaning Operations**:
- Remove excessive whitespace
- Normalize Norwegian characters (æ, ø, å)
- Remove irrelevant content (headers, footers, page numbers)
- Handle special characters and symbols

**Implementation**:
- Custom text-cleaning functions
- Norwegian-specific normalization rules

### 4. Embedding Generation

**Purpose**: Generate vector representations of text chunks.

**Embedding Model**:
- Primary: NbAiLab/nb-sbert-base (768 dimensions)
- Alternative: FFI/SimCSE-NB-BERT-large

**Implementation** (sketched below):
- Use the sentence-transformers library
- Batch processing for efficiency
- Caching mechanism for frequently embedded chunks
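
A minimal batched-embedding sketch with sentence-transformers and the primary model named above (the example chunks are illustrative):

```python
# Sketch: batched embedding generation with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("NbAiLab/nb-sbert-base")
chunks = ["Norge har en rik historie.", "Oslo er hovedstaden i Norge."]
embeddings = model.encode(
    chunks,
    batch_size=32,               # batch for throughput
    normalize_embeddings=True,   # unit vectors, so inner product == cosine
)
# embeddings: numpy array of shape (len(chunks), 768)
```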
### 5. Vector Storage

**Purpose**: Store and index embeddings for efficient retrieval.

**Storage Options**:
- Primary: FAISS (Facebook AI Similarity Search)
- Alternative: Milvus (for larger deployments)

**Implementation** (see the sketch below):
- FAISS IndexFlatIP (inner product) for cosine similarity
- Metadata storage for mapping vectors to the original text
- Serialization for persistence
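
Continuing from the embedding sketch above, a minimal FAISS sketch; the index path under `data/processed/` is an assumption:

```python
# Sketch: FAISS IndexFlatIP over L2-normalized vectors, so inner product
# equals cosine similarity; the index is serialized for persistence.
import faiss
import numpy as np

vecs = np.asarray(embeddings, dtype="float32")  # from the embedding sketch above
faiss.normalize_L2(vecs)                        # no-op if already normalized
index = faiss.IndexFlatIP(vecs.shape[1])        # 768 for nb-sbert-base
index.add(vecs)
faiss.write_index(index, "data/processed/chunks.faiss")  # persistence

# Query: embed with the same model, normalize, then search.
query = model.encode(["Hva er hovedstaden i Norge?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)  # top-2 chunk indices and similarities
```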
## Processing Flow

1. **Document Ingestion**:
   - Accept documents via the upload interface
   - Store original documents in a document store
   - Extract document metadata (title, date, source)

2. **Processing Pipeline Execution**:
   - Process documents through the pipeline components
   - Track processing status and errors
   - Generate unique IDs for each chunk

3. **Index Management**:
   - Create and update vector indices
   - Implement versioning for indices
   - Provide reindexing capabilities

## Norwegian Language Considerations

- **Character Encoding**: Ensure proper handling of Norwegian characters (UTF-8)
- **Tokenization**: Use tokenizers that properly handle Norwegian word structures
- **Stopwords**: Implement Norwegian stopword filtering for improved retrieval
- **Stemming/Lemmatization**: Consider Norwegian-specific stemming or lemmatization

## Implementation Plan

1. Create the document processor class structure
2. Implement text extraction for different formats
3. Develop chunking strategies optimized for Norwegian
4. Build text cleaning and normalization functions
5. Integrate with the embedding model
6. Set up vector storage and retrieval mechanisms
7. Create a unified API for the entire pipeline

## Code Structure

```python
# Example structure for the document processing pipeline

class DocumentProcessor:
    def __init__(self, embedding_model, vector_store):
        self.embedding_model = embedding_model
        self.vector_store = vector_store

    def process_document(self, document_path):
        # Extract text based on document type
        raw_text = self._extract_text(document_path)

        # Split text into chunks
        chunks = self._chunk_text(raw_text)

        # Clean and normalize text chunks
        cleaned_chunks = [self._clean_text(chunk) for chunk in chunks]

        # Generate embeddings
        embeddings = self._generate_embeddings(cleaned_chunks)

        # Store in vector database
        self._store_embeddings(embeddings, cleaned_chunks)

    def _extract_text(self, document_path):
        # Implementation for different document types
        pass

    def _chunk_text(self, text):
        # Implementation of the chunking strategy
        pass

    def _clean_text(self, text):
        # Text normalization and cleaning
        pass

    def _generate_embeddings(self, chunks):
        # Use the embedding model to generate vectors
        pass

    def _store_embeddings(self, embeddings, chunks):
        # Store in the vector database with metadata
        pass
```

## Next Steps

1. Implement the document processor class
2. Create test documents in Norwegian
3. Evaluate chunking strategies for Norwegian text
4. Benchmark embedding generation performance
5. Test retrieval accuracy with Norwegian queries
design/rag_architecture.md
ADDED
@@ -0,0 +1,197 @@
# RAG Architecture for a Norwegian Chatbot

## Overview

This document outlines the architecture for a Retrieval-Augmented Generation (RAG) chatbot optimized for Norwegian, designed to be hosted on Hugging Face. The architecture leverages open-source models with strong Norwegian language support and integrates with Hugging Face's infrastructure for seamless deployment.

## System Components

### 1. Language Model (LLM)

Based on our research, we recommend one of the following models:

**Primary Option: NorMistral-7b-scratch**
- Strong Norwegian language support
- Apache 2.0 license (allows commercial use)
- 7B parameters (a reasonable size for deployment)
- Good performance on Norwegian language tasks
- Available on Hugging Face

**Alternative Option: Viking 7B**
- Specifically designed for Nordic languages
- Apache 2.0 license
- 4K context length
- Good multilingual capabilities (useful if the chatbot needs to handle some English queries)

**Fallback Option: NorskGPT-Mistral**
- Specifically designed for Norwegian
- Note: non-commercial license (CC BY-NC-SA 4.0)

### 2. Embedding Model

**Recommended: NbAiLab/nb-sbert-base**
- Specifically trained for Norwegian
- 768-dimensional embeddings
- Good performance on sentence similarity tasks
- Works well with both Norwegian and English content
- Apache 2.0 license
- High download count on Hugging Face (41,370 last month)

### 3. Vector Database

**Recommended: FAISS**
- Lightweight and efficient
- Easy integration with Hugging Face
- Can be packaged with the application
- Works well for moderate-sized document collections

**Alternative: Milvus**
- More scalable for larger document collections
- Well-documented integration with Hugging Face
- Better for production deployments with large document bases

### 4. Document Processing Pipeline

1. **Text Extraction**: Extract text from various document formats (PDF, DOCX, TXT)
2. **Text Chunking**: Split documents into manageable chunks (recommended chunk size: 512 tokens)
3. **Text Cleaning**: Remove irrelevant content, normalize text
4. **Embedding Generation**: Generate embeddings using NbAiLab/nb-sbert-base
5. **Vector Storage**: Store embeddings in a FAISS index

### 5. Retrieval Mechanism

1. **Query Processing**: Process the user query
2. **Query Embedding**: Generate an embedding for the query using the same embedding model
3. **Similarity Search**: Find the most relevant document chunks using cosine similarity
4. **Context Assembly**: Assemble retrieved chunks into context for the LLM

### 6. Generation Component

1. **Prompt Construction**: Construct a prompt with the retrieved context and user query
2. **LLM Inference**: Generate a response using the LLM
3. **Response Post-processing**: Format and clean the response (a sketch of sections 5-6 follows)
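
A condensed, illustrative sketch of the retrieval and generation components together; it assumes a populated FAISS `index`, the matching `chunks` list, a SentenceTransformer `embed_model`, and an `llm` callable wrapping LLM inference (all names are placeholders):

```python
# Sketch of retrieval (section 5) + generation (section 6).
# `index`, `chunks`, `embed_model`, and `llm` are assumed to exist.
import numpy as np

def rag_answer(query: str, k: int = 5) -> str:
    # 5.2-5.3: embed the query and search by cosine similarity
    q = embed_model.encode([query], normalize_embeddings=True).astype("float32")
    scores, ids = index.search(q, k)
    # 5.4: assemble retrieved chunks into a context block
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    # 6.1-6.2: construct the prompt and run LLM inference
    prompt = (
        "Du er en hjelpsom assistent som svarer på norsk.\n\n"
        f"KONTEKST:\n{context}\n\nSPØRSMÅL:\n{query}\n\nSVAR:\n"
    )
    return llm(prompt)
```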
### 7. Chat Interface

1. **Frontend**: Lightweight, responsive web interface
2. **API Layer**: RESTful API for communication between frontend and backend
3. **Session Management**: Maintain conversation history

## Hugging Face Integration

### Deployment Options

1. **Hugging Face Spaces**:
   - Deploy the entire application as a Gradio or Streamlit app
   - Provides a public URL for access
   - Supports Git-based deployment

2. **Model Hosting**:
   - Host the fine-tuned LLM on the Hugging Face Model Hub
   - Use the Hugging Face Inference API for model inference

3. **Datasets**:
   - Store and version document collections on Hugging Face Datasets

### Implementation Approach

1. **Gradio Interface**:
   - Create a Gradio app for the chat interface
   - Deploy to Hugging Face Spaces

2. **Backend Processing**:
   - Use the Hugging Face Transformers and Sentence-Transformers libraries
   - Implement the document processing pipeline
   - Set up FAISS for vector storage and retrieval

3. **Model Integration**:
   - Load models from the Hugging Face Model Hub
   - Implement caching for better performance

## Technical Architecture Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                       Hugging Face Spaces                       │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                         Web Interface                           │
│                                                                 │
│  ┌─────────────┐                              ┌────────────┐    │
│  │   Gradio    │                              │  Session   │    │
│  │  Interface  │◄─────────────────────────────┤  Manager   │    │
│  └─────────────┘                              └────────────┘    │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Backend Processing                        │
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐  │
│  │   Query     │    │  Retrieval  │    │     Generation      │  │
│  │ Processing  │───►│   Engine    │───►│       Engine        │  │
│  └─────────────┘    └─────────────┘    └─────────────────────┘  │
│                            │                      ▲             │
│                            ▼                      │             │
│                     ┌─────────────┐               │             │
│                     │    FAISS    │               │             │
│                     │   Vector    │               │             │
│                     │    Store    │               │             │
│                     └─────────────┘               │             │
│                            ▲                      │             │
│                            │                      │             │
│  ┌─────────────────────────┴──────────────────────┴──────────┐  │
│  │                   Document Processor                      │  │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Hugging Face Model Hub                      │
│                                                                 │
│  ┌─────────────────┐            ┌───────────────────┐           │
│  │    NbAiLab/     │            │    NorMistral-    │           │
│  │  nb-sbert-base  │            │    7b-scratch     │           │
│  │  (Embeddings)   │            │       (LLM)       │           │
│  └─────────────────┘            └───────────────────┘           │
└─────────────────────────────────────────────────────────────────┘
```

## Implementation Considerations

### 1. Performance Optimization

- **Model Quantization**: Use GGUF- or GPTQ-quantized versions of the LLM to reduce memory requirements
- **Batch Processing**: Implement batch processing for document embedding generation
- **Caching**: Cache frequent queries and responses
- **Progressive Loading**: Implement progressive loading for large document collections

### 2. Norwegian Language Optimization

- **Tokenization**: Ensure proper tokenization of Norwegian-specific characters and word structures
- **Text Normalization**: Implement Norwegian-specific text normalization (handling of "æ", "ø", "å")
- **Stopword Removal**: Use a Norwegian stopword list for improved retrieval

### 3. Embedding Functionality

- **iFrame Integration**: Provide code snippets for embedding the chatbot in iframes
- **JavaScript Widget**: Create a JavaScript widget for easy integration into any website
- **API Access**: Provide API endpoints for programmatic access

### 4. Security and Privacy

- **Data Handling**: Implement proper data handling practices
- **User Authentication**: Add optional user authentication for personalized experiences
- **Rate Limiting**: Implement rate limiting to prevent abuse

## Next Steps

1. Set up the development environment
2. Implement the document processing pipeline
3. Integrate the LLM and embedding models
4. Create the chat interface
5. Develop the embedding functionality
6. Deploy to Hugging Face
7. Test and optimize the solution
prepare_deployment.sh
ADDED
@@ -0,0 +1,37 @@
#!/bin/bash
# Create empty directories for data storage
mkdir -p /home/ubuntu/chatbot_project/data/documents
mkdir -p /home/ubuntu/chatbot_project/data/processed
touch /home/ubuntu/chatbot_project/data/documents/.gitkeep
touch /home/ubuntu/chatbot_project/data/processed/.gitkeep

# Create a simple test document
cat > /home/ubuntu/chatbot_project/data/documents/test_document.txt << 'EOL'
# Norsk historie

Norge har en rik og fascinerende historie som strekker seg tilbake til vikingtiden. Vikingene var kjent for sine sjøreiser, handel og plyndring i store deler av Europa fra slutten av 700-tallet til midten av 1000-tallet.

## Middelalderen

I 1030 døde Olav Haraldsson (senere kjent som Olav den hellige) i slaget ved Stiklestad. Hans død markerte begynnelsen på kristendommens endelige gjennombrudd i Norge.

Norge ble forent til ett rike under Harald Hårfagre på 800-tallet. Etter vikingtiden fulgte en periode med borgerkrig før landet ble stabilisert under Håkon Håkonsson på 1200-tallet.

## Union med Danmark

Fra 1380 til 1814 var Norge i union med Danmark, en periode kjent som "dansketiden". Under denne perioden ble dansk det offisielle språket i administrasjon og litteratur, noe som hadde stor innflytelse på det norske språket.

## Grunnloven og union med Sverige

I 1814 fikk Norge sin egen grunnlov, signert på Eidsvoll 17. mai. Samme år ble Norge tvunget inn i en union med Sverige, som varte frem til 1905.

## Moderne Norge

Norge ble okkupert av Nazi-Tyskland under andre verdenskrig fra 1940 til 1945. Etter krigen opplevde landet rask økonomisk vekst.

Oppdagelsen av olje i Nordsjøen på slutten av 1960-tallet forvandlet Norge til en av verdens rikeste nasjoner per innbygger.

I dag er Norge kjent for sin velferdsstat, naturskjønnhet og høy levestandard.
EOL

echo "Deployment files prepared successfully"
requirements-minimal.txt
ADDED
@@ -0,0 +1,21 @@
# Core dependencies - minimal version
transformers>=4.36.0
sentence-transformers>=2.2.2
torch>=2.0.0
gradio>=4.0.0
huggingface_hub>=0.19.0

# Document processing - essential only
PyPDF2>=3.0.0
beautifulsoup4>=4.12.0

# Vector database - lightweight option
faiss-cpu>=1.7.4

# Utilities - minimal set
numpy>=1.24.0
tqdm>=4.66.0
requests>=2.31.0

# Norwegian language support
nltk>=3.8.0
requirements-ultra-light.txt
ADDED
@@ -0,0 +1,7 @@
# Core dependencies - ultra lightweight
requests>=2.31.0
gradio>=4.0.0
huggingface_hub>=0.19.0
numpy>=1.24.0
PyPDF2>=3.0.0
beautifulsoup4>=4.12.0
requirements.txt
CHANGED
@@ -1 +1,25 @@
-
# Core dependencies
transformers>=4.36.0
sentence-transformers>=2.2.2
torch>=2.0.0
gradio>=4.0.0
huggingface_hub>=0.19.0

# Document processing
PyPDF2>=3.0.0
python-docx>=0.8.11
beautifulsoup4>=4.12.0
markdown>=3.5.0

# Vector database
faiss-cpu>=1.7.4
langchain>=0.1.0

# Utilities
numpy>=1.24.0
pandas>=2.0.0
tqdm>=4.66.0
requests>=2.31.0

# Norwegian language support
nltk>=3.8.0
research/norwegian_llm_research.md
ADDED
@@ -0,0 +1,81 @@
# Norwegian LLM and Embedding Models Research

## Open-Source LLMs with Norwegian Language Support

### 1. NorMistral-7b-scratch
- **Description**: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (six repetitions of open Norwegian texts).
- **Architecture**: Based on the Mistral architecture with 7 billion parameters
- **Context Length**: 2k tokens
- **Performance**:
  - Perplexity on the NCC validation set: 7.43
  - Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- **License**: Apache 2.0
- **Hugging Face**: https://huggingface.co/norallm/normistral-7b-scratch
- **Notes**: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo

### 2. Viking 7B
- **Description**: The first multilingual large language model for all Nordic languages (including Norwegian)
- **Architecture**: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
- **Context Length**: 4k tokens
- **Performance**: Best-in-class performance in all Nordic languages without compromising English performance
- **License**: Apache 2.0
- **Notes**:
  - Developed by Silo AI and the University of Turku's research group TurkuNLP
  - Also available in larger sizes (13B and 33B parameters)
  - Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, Swedish, and programming languages

### 3. NorskGPT
- **Description**: A Norwegian large language model made for Norwegian society
- **Versions**:
  - NorskGPT-Mistral: 7B dense transformer with an 8K context window, based on Mistral 7B
  - NorskGPT-LLAMA2: 7B and 13B parameter models with 4K context length, based on LLAMA2
- **License**: CC BY-NC-SA 4.0 (non-commercial)
- **Website**: https://www.norskgpt.com/norskgpt-llm

## Embedding Models for Norwegian

### 1. NbAiLab/nb-sbert-base
- **Description**: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- **Architecture**: Based on nb-bert-base
- **Vector Dimensions**: 768
- **Performance**:
  - Cosine similarity: Pearson 0.8275, Spearman 0.8245
- **License**: Apache 2.0
- **Hugging Face**: https://huggingface.co/NbAiLab/nb-sbert-base
- **Use Cases**:
  - Sentence similarity
  - Semantic search
  - Few-shot classification (with SetFit)
  - Keyword extraction (with KeyBERT)
  - Topic modeling (with BERTopic)
- **Notes**: Works well with both Norwegian and English, making it ideal for bilingual applications (see the similarity sketch below)
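
A minimal sketch illustrating this bilingual behaviour; the example sentence pair is ours, not from the model card:

```python
# Sketch: cross-lingual sentence similarity with nb-sbert-base.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")
no = model.encode("Hvordan ble Norge samlet til ett rike?", convert_to_tensor=True)
en = model.encode("How was Norway unified into one kingdom?", convert_to_tensor=True)
print(util.cos_sim(no, en).item())  # high score for this bilingual paraphrase pair
```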
### 2. FFI/SimCSE-NB-BERT-large
- **Description**: A Norwegian sentence embedding model trained using the SimCSE methodology
- **Hugging Face**: https://huggingface.co/FFI/SimCSE-NB-BERT-large

## Vector Database Options for Hugging Face RAG Integration

### 1. Milvus
- **Integration**: Well-documented integration with Hugging Face for RAG pipelines
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hf_and_milvus

### 2. MongoDB
- **Integration**: Can be used with Hugging Face models for RAG systems
- **Reference**: https://huggingface.co/learn/cookbook/en/rag_with_hugging_face_gemma_mongodb

### 3. MyScale
- **Integration**: Supports building RAG applications with Hugging Face embedding models
- **Reference**: https://medium.com/@myscale/building-a-rag-application-in-10-min-with-claude-3-and-hugging-face-10caea4ea293

### 4. FAISS (Facebook AI Similarity Search)
- **Integration**: Lightweight vector index that works well with Hugging Face
- **Notes**: Can be used with `autofaiss` for quick experimentation

## Hugging Face RAG Implementation Options

1. **Transformers Library**: Provides access to pre-trained models
2. **Sentence Transformers**: For text embeddings
3. **Datasets**: For managing and processing data
4. **LangChain Integration**: For advanced RAG pipelines
5. **Spaces**: For deploying and sharing the application
src/api/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
API integration module for Norwegian RAG chatbot.
"""
src/api/config.py
ADDED
@@ -0,0 +1,61 @@
"""
Configuration for Hugging Face API integration.
Contains model IDs, API endpoints, and other configuration parameters.
"""

# Norwegian LLM options
LLM_MODELS = {
    "normistral": {
        "model_id": "norallm/normistral-7b-scratch",
        "description": "NorMistral 7B - Norwegian language model based on Mistral architecture"
    },
    "viking": {
        "model_id": "silo-ai/viking-7b",
        "description": "Viking 7B - Multilingual model for Nordic languages"
    },
    "norskgpt": {
        "model_id": "NbAiLab/NorskGPT",
        "description": "NorskGPT - Norwegian language model"
    }
}

# Default LLM model
DEFAULT_LLM_MODEL = "normistral"

# Norwegian embedding models
EMBEDDING_MODELS = {
    "nb-sbert": {
        "model_id": "NbAiLab/nb-sbert-base",
        "description": "NB-SBERT-BASE - Norwegian sentence embedding model"
    },
    "simcse": {
        "model_id": "FFI/SimCSE-NB-BERT-large",
        "description": "SimCSE-NB-BERT-large - Norwegian sentence embedding model"
    }
}

# Default embedding model
DEFAULT_EMBEDDING_MODEL = "nb-sbert"

# Hugging Face API endpoints
HF_API_ENDPOINTS = {
    "inference": "https://api-inference.huggingface.co/models/",
    "feature-extraction": "https://api-inference.huggingface.co/pipeline/feature-extraction/"
}

# API request parameters
API_PARAMS = {
    "max_length": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.1
}

# Document processing parameters
CHUNK_SIZE = 512
CHUNK_OVERLAP = 100

# RAG parameters
MAX_CHUNKS_TO_RETRIEVE = 5
SIMILARITY_THRESHOLD = 0.75
src/api/huggingface_api.py
ADDED
@@ -0,0 +1,213 @@
"""
Hugging Face API integration for Norwegian RAG chatbot.
Provides functions to interact with the Hugging Face Inference API for both LLM and embedding models.
"""

import os
import json
import time
import requests
from typing import Dict, List, Optional, Union, Any

from .config import (
    LLM_MODELS,
    DEFAULT_LLM_MODEL,
    EMBEDDING_MODELS,
    DEFAULT_EMBEDDING_MODEL,
    HF_API_ENDPOINTS,
    API_PARAMS
)

class HuggingFaceAPI:
    """
    Client for interacting with the Hugging Face Inference API.
    Supports both text generation (LLM) and embedding generation.
    """

    def __init__(
        self,
        api_key: Optional[str] = None,
        llm_model: str = DEFAULT_LLM_MODEL,
        embedding_model: str = DEFAULT_EMBEDDING_MODEL
    ):
        """
        Initialize the Hugging Face API client.

        Args:
            api_key: Hugging Face API key (optional, can use HF_API_KEY env var)
            llm_model: LLM model identifier from config
            embedding_model: Embedding model identifier from config
        """
        self.api_key = api_key or os.environ.get("HF_API_KEY", "")

        # Set up model IDs, falling back to the defaults for unknown keys
        self.llm_model_id = LLM_MODELS[llm_model]["model_id"] if llm_model in LLM_MODELS else LLM_MODELS[DEFAULT_LLM_MODEL]["model_id"]
        self.embedding_model_id = EMBEDDING_MODELS[embedding_model]["model_id"] if embedding_model in EMBEDDING_MODELS else EMBEDDING_MODELS[DEFAULT_EMBEDDING_MODEL]["model_id"]

        # Set up headers
        self.headers = {"Authorization": f"Bearer {self.api_key}"}
        if not self.api_key:
            print("Warning: No API key provided. API calls may be rate limited.")
            self.headers = {}

    def generate_text(
        self,
        prompt: str,
        max_length: int = API_PARAMS["max_length"],
        temperature: float = API_PARAMS["temperature"],
        top_p: float = API_PARAMS["top_p"],
        top_k: int = API_PARAMS["top_k"],
        repetition_penalty: float = API_PARAMS["repetition_penalty"],
        wait_for_model: bool = True
    ) -> str:
        """
        Generate text using the LLM model.

        Args:
            prompt: Input text prompt
            max_length: Maximum length of generated text
            temperature: Sampling temperature
            top_p: Top-p sampling parameter
            top_k: Top-k sampling parameter
            repetition_penalty: Penalty for repetition
            wait_for_model: Whether to wait for the model to load

        Returns:
            Generated text response
        """
        payload = {
            "inputs": prompt,
            "parameters": {
                "max_length": max_length,
                "temperature": temperature,
                "top_p": top_p,
                "top_k": top_k,
                "repetition_penalty": repetition_penalty
            }
        }

        api_url = f"{HF_API_ENDPOINTS['inference']}{self.llm_model_id}"

        # Make API request
        response = self._make_api_request(api_url, payload, wait_for_model)

        # Parse response
        if isinstance(response, list) and len(response) > 0:
            if "generated_text" in response[0]:
                return response[0]["generated_text"]
            return response[0].get("text", "")
        elif isinstance(response, dict):
            return response.get("generated_text", "")

        # Fallback
        return str(response)

    def generate_embeddings(
        self,
        texts: Union[str, List[str]],
        wait_for_model: bool = True
    ) -> List[List[float]]:
        """
        Generate embeddings for text using the embedding model.

        Args:
            texts: Single text or list of texts to embed
            wait_for_model: Whether to wait for the model to load

        Returns:
            List of embedding vectors
        """
        # Ensure texts is a list
        if isinstance(texts, str):
            texts = [texts]

        payload = {
            "inputs": texts,
        }

        api_url = f"{HF_API_ENDPOINTS['feature-extraction']}{self.embedding_model_id}"

        # Make API request
        response = self._make_api_request(api_url, payload, wait_for_model)

        # Return embeddings
        return response

    def _make_api_request(
        self,
        api_url: str,
        payload: Dict[str, Any],
        wait_for_model: bool = True,
        max_retries: int = 5,
        retry_delay: int = 1
    ) -> Any:
        """
        Make a request to the Hugging Face API with retry logic.

        Args:
            api_url: API endpoint URL
            payload: Request payload
            wait_for_model: Whether to wait for the model to load
            max_retries: Maximum number of retries
            retry_delay: Delay between retries in seconds

        Returns:
            API response
        """
        for attempt in range(max_retries):
            try:
                response = requests.post(api_url, headers=self.headers, json=payload)

                # Check if the model is still loading
                if response.status_code == 503 and wait_for_model:
                    # Model is loading; wait and retry
                    estimated_time = json.loads(response.content.decode("utf-8")).get("estimated_time", 20)
                    print(f"Model is loading. Waiting {estimated_time} seconds...")
                    time.sleep(estimated_time)
                    continue

                # Check for other errors
                if response.status_code != 200:
                    print(f"API request failed with status code {response.status_code}: {response.text}")
                    if attempt < max_retries - 1:
                        time.sleep(retry_delay * (2 ** attempt))  # Exponential backoff
                        continue
                    return {"error": response.text}

                return response.json()

            except Exception as e:
                print(f"API request failed: {str(e)}")
                if attempt < max_retries - 1:
                    time.sleep(retry_delay * (2 ** attempt))  # Exponential backoff
                    continue
                return {"error": str(e)}

        return {"error": "Max retries exceeded"}


# Example RAG prompt template for Norwegian
def create_rag_prompt(query: str, context: List[str]) -> str:
    """
    Create a RAG prompt with retrieved context for the LLM.

    Args:
        query: User query
        context: List of retrieved document chunks

    Returns:
        Formatted prompt with context
    """
    context_text = "\n\n".join([f"Dokument {i+1}:\n{chunk}" for i, chunk in enumerate(context)])

    prompt = f"""Du er en hjelpsom assistent som svarer på norsk. Bruk følgende kontekst for å svare på spørsmålet.

KONTEKST:
{context_text}

SPØRSMÅL:
{query}

SVAR:
"""
    return prompt
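
A short usage sketch of this client (not part of the file above); it assumes `HF_API_KEY` is exported and the illustrative context string is ours:

```python
# Usage sketch for HuggingFaceAPI; assumes HF_API_KEY is set in the environment.
from src.api.huggingface_api import HuggingFaceAPI, create_rag_prompt

api = HuggingFaceAPI()  # defaults: normistral LLM, nb-sbert embeddings
context = ["Norge har en rik historie som strekker seg tilbake til vikingtiden."]
prompt = create_rag_prompt("Fortell meg om norsk historie.", context)
print(api.generate_text(prompt))
```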
src/document_processing/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
Document processing module for Norwegian RAG chatbot.
"""
src/document_processing/chunker.py
ADDED
@@ -0,0 +1,262 @@
"""
Text chunking module for Norwegian RAG chatbot.
Splits documents into manageable chunks for embedding and retrieval.
"""

import re
import unicodedata
from typing import List, Optional, Tuple

from ..api.config import CHUNK_SIZE, CHUNK_OVERLAP


class TextChunker:
    """
    Splits documents into manageable chunks for embedding and retrieval.
    Supports different chunking strategies optimized for Norwegian text.
    """

    @staticmethod
    def chunk_text(
        text: str,
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP,
        strategy: str = "paragraph"
    ) -> List[str]:
        """
        Split text into chunks using the specified strategy.

        Args:
            text: Text to split into chunks
            chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks
            strategy: Chunking strategy ('fixed', 'paragraph', or 'sentence')

        Returns:
            List of text chunks
        """
        if not text:
            return []

        if strategy == "fixed":
            return TextChunker.fixed_size_chunks(text, chunk_size, chunk_overlap)
        elif strategy == "paragraph":
            return TextChunker.paragraph_chunks(text, chunk_size, chunk_overlap)
        elif strategy == "sentence":
            return TextChunker.sentence_chunks(text, chunk_size, chunk_overlap)
        else:
            raise ValueError(f"Unknown chunking strategy: {strategy}")

    @staticmethod
    def fixed_size_chunks(
        text: str,
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP
    ) -> List[str]:
        """
        Split text into fixed-size chunks with overlap.

        Args:
            text: Text to split into chunks
            chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks

        Returns:
            List of text chunks
        """
        if not text:
            return []

        chunks = []
        start = 0
        text_length = len(text)

        while start < text_length:
            end = min(start + chunk_size, text_length)

            # If this is not the first chunk and we're not at the end,
            # try to find a good breaking point (whitespace)
            if start > 0 and end < text_length:
                # Look for the last whitespace within the chunk
                last_whitespace = text.rfind(' ', start, end)
                if last_whitespace != -1:
                    end = last_whitespace + 1  # Include the space

            # Add the chunk
            chunks.append(text[start:end].strip())

            # Move the start position for the next chunk, considering overlap
            start = end - chunk_overlap if end < text_length else text_length

        return chunks

    @staticmethod
    def paragraph_chunks(
        text: str,
        max_chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP
    ) -> List[str]:
        """
        Split text into chunks based on paragraphs.

        Args:
            text: Text to split into chunks
            max_chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks

        Returns:
            List of text chunks
        """
        if not text:
            return []

        # Split text into paragraphs
        paragraphs = re.split(r'\n\s*\n', text)
        paragraphs = [p.strip() for p in paragraphs if p.strip()]

        chunks = []
        current_chunk = []
        current_size = 0

        for paragraph in paragraphs:
            paragraph_size = len(paragraph)

            # If adding this paragraph would exceed the max chunk size and we already have content,
            # save the current chunk and start a new one
            if current_size + paragraph_size > max_chunk_size and current_chunk:
                chunks.append('\n\n'.join(current_chunk))

                # For overlap, keep some paragraphs from the previous chunk
                overlap_size = 0
                overlap_paragraphs = []

                # Add paragraphs from the end until we reach the desired overlap
                for p in reversed(current_chunk):
                    if overlap_size + len(p) <= chunk_overlap:
                        overlap_paragraphs.insert(0, p)
                        overlap_size += len(p)
                    else:
                        break

                current_chunk = overlap_paragraphs
                current_size = overlap_size

            # If the paragraph itself is larger than the max chunk size, split it further
            if paragraph_size > max_chunk_size:
                # First, add the current chunk if it's not empty
                if current_chunk:
                    chunks.append('\n\n'.join(current_chunk))
                    current_chunk = []
                    current_size = 0

                # Then split the large paragraph into fixed-size chunks
                paragraph_chunks = TextChunker.fixed_size_chunks(paragraph, max_chunk_size, chunk_overlap)
                chunks.extend(paragraph_chunks)
            else:
                # Add the paragraph to the current chunk
                current_chunk.append(paragraph)
                current_size += paragraph_size

        # Add the last chunk if it's not empty
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))

        return chunks

    @staticmethod
    def sentence_chunks(
        text: str,
        max_chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP
    ) -> List[str]:
        """
        Split text into chunks based on sentences.

        Args:
            text: Text to split into chunks
            max_chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks

        Returns:
            List of text chunks
        """
        if not text:
            return []

        # Norwegian-aware sentence splitting
        # This pattern handles common Norwegian sentence endings
        sentence_pattern = r'(?<=[.!?])\s+(?=[A-ZÆØÅ])'
        sentences = re.split(sentence_pattern, text)
        sentences = [s.strip() for s in sentences if s.strip()]

        chunks = []
        current_chunk = []
        current_size = 0

        for sentence in sentences:
            sentence_size = len(sentence)

            # If adding this sentence would exceed the max chunk size and we already have content,
            # save the current chunk and start a new one
            if current_size + sentence_size > max_chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))

                # For overlap, keep some sentences from the previous chunk
                overlap_size = 0
                overlap_sentences = []

                # Add sentences from the end until we reach the desired overlap
                for s in reversed(current_chunk):
                    if overlap_size + len(s) <= chunk_overlap:
                        overlap_sentences.insert(0, s)
                        overlap_size += len(s)
                    else:
                        break

                current_chunk = overlap_sentences
                current_size = overlap_size

            # If the sentence itself is larger than the max chunk size, split it further
            if sentence_size > max_chunk_size:
                # First, add the current chunk if it's not empty
                if current_chunk:
                    chunks.append(' '.join(current_chunk))
                    current_chunk = []
                    current_size = 0

                # Then split the large sentence into fixed-size chunks
                sentence_chunks = TextChunker.fixed_size_chunks(sentence, max_chunk_size, chunk_overlap)
                chunks.extend(sentence_chunks)
            else:
                # Add the sentence to the current chunk
                current_chunk.append(sentence)
                current_size += sentence_size

        # Add the last chunk if it's not empty
        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks

    @staticmethod
    def clean_chunk(chunk: str) -> str:
        """
        Clean a text chunk by removing excessive whitespace and normalizing.

        Args:
            chunk: Text chunk to clean

        Returns:
            Cleaned text chunk
        """
        if not chunk:
            return ""

        # Replace multiple whitespace with a single space
        cleaned = re.sub(r'\s+', ' ', chunk)

        # Normalize to NFC so composed and decomposed Unicode forms of the
        # Norwegian characters æ, ø, å are handled consistently
        cleaned = unicodedata.normalize('NFC', cleaned)

        return cleaned.strip()
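A quick, self-contained sketch of the chunker on a toy text; the explicit size and overlap values here are illustrative overrides of the `CHUNK_SIZE`/`CHUNK_OVERLAP` defaults from `src/api/config.py`.

```python
from src.document_processing.chunker import TextChunker

text = (
    "Første avsnitt handler om Oslo.\n\n"
    "Andre avsnitt handler om Bergen.\n\n"
    "Tredje avsnitt handler om Trondheim."
)

# Paragraph strategy: whole paragraphs are packed into chunks up to the limit.
for chunk in TextChunker.chunk_text(text, chunk_size=70, chunk_overlap=10, strategy="paragraph"):
    print(repr(TextChunker.clean_chunk(chunk)))
```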
src/document_processing/extractor.py
ADDED
@@ -0,0 +1,167 @@
"""
Text extraction module for Norwegian RAG chatbot.
Extracts text from various document formats.
"""

import os
import PyPDF2
from typing import List, Optional
from bs4 import BeautifulSoup


class TextExtractor:
    """
    Extracts text from various document formats.
    Currently supports:
    - PDF (.pdf)
    - Text files (.txt)
    - HTML (.html, .htm)
    """

    @staticmethod
    def extract_from_file(file_path: str) -> str:
        """
        Extract text from a file based on its extension.

        Args:
            file_path: Path to the document file

        Returns:
            Extracted text content
        """
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"File not found: {file_path}")

        file_extension = os.path.splitext(file_path)[1].lower()

        if file_extension == '.pdf':
            return TextExtractor.extract_from_pdf(file_path)
        elif file_extension == '.txt':
            return TextExtractor.extract_from_text(file_path)
        elif file_extension in ['.html', '.htm']:
            return TextExtractor.extract_from_html(file_path)
        else:
            raise ValueError(f"Unsupported file format: {file_extension}")

    @staticmethod
    def extract_from_pdf(file_path: str) -> str:
        """
        Extract text from a PDF file.

        Args:
            file_path: Path to the PDF file

        Returns:
            Extracted text content
        """
        text = ""
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page_num in range(len(pdf_reader.pages)):
                    page = pdf_reader.pages[page_num]
                    text += page.extract_text() + "\n\n"
        except Exception as e:
            print(f"Error extracting text from PDF {file_path}: {str(e)}")
            return ""

        return text

    @staticmethod
    def extract_from_text(file_path: str) -> str:
        """
        Extract text from a plain text file.

        Args:
            file_path: Path to the text file

        Returns:
            Extracted text content
        """
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                return file.read()
        except UnicodeDecodeError:
            # Try with different encoding if UTF-8 fails
            try:
                with open(file_path, 'r', encoding='latin-1') as file:
                    return file.read()
            except Exception as e:
                print(f"Error extracting text from file {file_path}: {str(e)}")
                return ""
        except Exception as e:
            print(f"Error extracting text from file {file_path}: {str(e)}")
            return ""

    @staticmethod
    def extract_from_html(file_path: str) -> str:
        """
        Extract text from an HTML file.

        Args:
            file_path: Path to the HTML file

        Returns:
            Extracted text content
        """
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                html_content = file.read()
                soup = BeautifulSoup(html_content, 'html.parser')

                # Remove script and style elements
                for script in soup(["script", "style"]):
                    script.extract()

                # Get text
                text = soup.get_text()

                # Break into lines and remove leading and trailing space on each
                lines = (line.strip() for line in text.splitlines())

                # Break multi-headlines into a line each
                chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

                # Drop blank lines
                text = '\n'.join(chunk for chunk in chunks if chunk)

                return text
        except Exception as e:
            print(f"Error extracting text from HTML {file_path}: {str(e)}")
            return ""

    @staticmethod
    def extract_from_url(url: str) -> str:
        """
        Extract text from a web URL.

        Args:
            url: Web URL to extract text from

        Returns:
            Extracted text content
        """
        try:
            import requests
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove script and style elements
            for script in soup(["script", "style"]):
                script.extract()

            # Get text
            text = soup.get_text()

            # Break into lines and remove leading and trailing space on each
            lines = (line.strip() for line in text.splitlines())

            # Break multi-headlines into a line each
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))

            # Drop blank lines
            text = '\n'.join(chunk for chunk in chunks if chunk)

            return text
        except Exception as e:
            print(f"Error extracting text from URL {url}: {str(e)}")
            return ""
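A minimal sketch of the extractor's dispatch-by-extension behaviour, using the test document bundled with this upload (`data/documents/test_document.txt`); the relative path assumes you run from the repository root.

```python
from src.document_processing.extractor import TextExtractor

# A .txt file dispatches to extract_from_text; .pdf and .html have their own paths.
text = TextExtractor.extract_from_file("data/documents/test_document.txt")
print(text[:200])
```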
src/document_processing/processor.py
ADDED
@@ -0,0 +1,306 @@
"""
Document processor module for Norwegian RAG chatbot.
Orchestrates the document processing pipeline with remote embeddings.
"""

import os
import json
import numpy as np
from typing import List, Dict, Any, Optional, Tuple, Union
from datetime import datetime

from .extractor import TextExtractor
from .chunker import TextChunker
from ..api.huggingface_api import HuggingFaceAPI
from ..api.config import CHUNK_SIZE, CHUNK_OVERLAP


class DocumentProcessor:
    """
    Orchestrates the document processing pipeline:
    1. Extract text from documents
    2. Split text into chunks
    3. Generate embeddings using remote API
    4. Store processed documents and embeddings
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
        documents_dir: str = "/home/ubuntu/chatbot_project/data/documents",
        processed_dir: str = "/home/ubuntu/chatbot_project/data/processed",
        chunk_size: int = CHUNK_SIZE,
        chunk_overlap: int = CHUNK_OVERLAP,
        chunking_strategy: str = "paragraph"
    ):
        """
        Initialize the document processor.

        Args:
            api_client: HuggingFaceAPI client for generating embeddings
            documents_dir: Directory for storing original documents
            processed_dir: Directory for storing processed documents and embeddings
            chunk_size: Maximum size of each chunk
            chunk_overlap: Overlap between consecutive chunks
            chunking_strategy: Strategy for chunking text ('fixed', 'paragraph', or 'sentence')
        """
        self.api_client = api_client or HuggingFaceAPI()
        self.documents_dir = documents_dir
        self.processed_dir = processed_dir
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.chunking_strategy = chunking_strategy

        # Ensure directories exist
        os.makedirs(self.documents_dir, exist_ok=True)
        os.makedirs(self.processed_dir, exist_ok=True)

        # Initialize document index
        self.document_index_path = os.path.join(self.processed_dir, "document_index.json")
        self.document_index = self._load_document_index()

    def process_document(
        self,
        file_path: str,
        document_id: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None
    ) -> str:
        """
        Process a document through the entire pipeline.

        Args:
            file_path: Path to the document file
            document_id: Optional custom document ID
            metadata: Optional metadata for the document

        Returns:
            Document ID
        """
        # Generate document ID if not provided
        if document_id is None:
            document_id = f"doc_{datetime.now().strftime('%Y%m%d%H%M%S')}_{os.path.basename(file_path)}"

        # Extract text from document
        text = TextExtractor.extract_from_file(file_path)
        if not text:
            raise ValueError(f"Failed to extract text from {file_path}")

        # Split text into chunks
        chunks = TextChunker.chunk_text(
            text,
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            strategy=self.chunking_strategy
        )

        # Clean chunks
        chunks = [TextChunker.clean_chunk(chunk) for chunk in chunks]

        # Generate embeddings using remote API
        embeddings = self.api_client.generate_embeddings(chunks)

        # Prepare metadata
        if metadata is None:
            metadata = {}

        metadata.update({
            "filename": os.path.basename(file_path),
            "processed_date": datetime.now().isoformat(),
            "chunk_count": len(chunks),
            "chunking_strategy": self.chunking_strategy,
            "embedding_model": self.api_client.embedding_model_id
        })

        # Save processed document
        self._save_processed_document(document_id, chunks, embeddings, metadata)

        # Update document index
        self._update_document_index(document_id, metadata)

        return document_id

    def process_text(
        self,
        text: str,
        document_id: Optional[str] = None,
        metadata: Optional[Dict[str, Any]] = None
    ) -> str:
        """
        Process text directly through the pipeline.

        Args:
            text: Text content to process
            document_id: Optional custom document ID
            metadata: Optional metadata for the document

        Returns:
            Document ID
        """
        # Generate document ID if not provided
        if document_id is None:
            document_id = f"text_{datetime.now().strftime('%Y%m%d%H%M%S')}"

        # Split text into chunks
        chunks = TextChunker.chunk_text(
            text,
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            strategy=self.chunking_strategy
        )

        # Clean chunks
        chunks = [TextChunker.clean_chunk(chunk) for chunk in chunks]

        # Generate embeddings using remote API
        embeddings = self.api_client.generate_embeddings(chunks)

        # Prepare metadata
        if metadata is None:
            metadata = {}

        metadata.update({
            "source": "direct_text",
            "processed_date": datetime.now().isoformat(),
            "chunk_count": len(chunks),
            "chunking_strategy": self.chunking_strategy,
            "embedding_model": self.api_client.embedding_model_id
        })

        # Save processed document
        self._save_processed_document(document_id, chunks, embeddings, metadata)

        # Update document index
        self._update_document_index(document_id, metadata)

        return document_id

    def get_document_chunks(self, document_id: str) -> List[str]:
        """
        Get all chunks for a document.

        Args:
            document_id: Document ID

        Returns:
            List of text chunks
        """
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if not os.path.exists(document_path):
            raise FileNotFoundError(f"Document not found: {document_id}")

        with open(document_path, 'r', encoding='utf-8') as f:
            document_data = json.load(f)

        return document_data.get("chunks", [])

    def get_document_embeddings(self, document_id: str) -> List[List[float]]:
        """
        Get all embeddings for a document.

        Args:
            document_id: Document ID

        Returns:
            List of embedding vectors
        """
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if not os.path.exists(document_path):
            raise FileNotFoundError(f"Document not found: {document_id}")

        with open(document_path, 'r', encoding='utf-8') as f:
            document_data = json.load(f)

        return document_data.get("embeddings", [])

    def get_all_documents(self) -> Dict[str, Dict[str, Any]]:
        """
        Get all documents in the index.

        Returns:
            Dictionary of document IDs to metadata
        """
        return self.document_index

    def delete_document(self, document_id: str) -> bool:
        """
        Delete a document and its processed data.

        Args:
            document_id: Document ID

        Returns:
            True if successful, False otherwise
        """
        if document_id not in self.document_index:
            return False

        # Remove from index
        del self.document_index[document_id]
        self._save_document_index()

        # Delete processed file
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if os.path.exists(document_path):
            os.remove(document_path)

        return True

    def _save_processed_document(
        self,
        document_id: str,
        chunks: List[str],
        embeddings: List[List[float]],
        metadata: Dict[str, Any]
    ) -> None:
        """
        Save processed document data.

        Args:
            document_id: Document ID
            chunks: List of text chunks
            embeddings: List of embedding vectors
            metadata: Document metadata
        """
        document_data = {
            "document_id": document_id,
            "metadata": metadata,
            "chunks": chunks,
            "embeddings": embeddings
        }

        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        with open(document_path, 'w', encoding='utf-8') as f:
            json.dump(document_data, f, ensure_ascii=False, indent=2)

    def _load_document_index(self) -> Dict[str, Dict[str, Any]]:
        """
        Load the document index from disk.

        Returns:
            Dictionary of document IDs to metadata
        """
        if os.path.exists(self.document_index_path):
            try:
                with open(self.document_index_path, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except Exception as e:
                print(f"Error loading document index: {str(e)}")

        return {}

    def _save_document_index(self) -> None:
        """
        Save the document index to disk.
        """
        with open(self.document_index_path, 'w', encoding='utf-8') as f:
            json.dump(self.document_index, f, ensure_ascii=False, indent=2)

    def _update_document_index(self, document_id: str, metadata: Dict[str, Any]) -> None:
        """
        Update the document index with a new or updated document.

        Args:
            document_id: Document ID
            metadata: Document metadata
        """
        self.document_index[document_id] = metadata
        self._save_document_index()
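For context, a sketch of the full pipeline for one file. Note the hard-coded `/home/ubuntu/...` defaults above: the relative overrides here are assumptions for running outside that environment, and a valid `HF_API_KEY` is assumed.

```python
from src.api.huggingface_api import HuggingFaceAPI
from src.document_processing.processor import DocumentProcessor

processor = DocumentProcessor(
    api_client=HuggingFaceAPI(),
    documents_dir="data/documents",   # overrides the /home/ubuntu defaults
    processed_dir="data/processed",
)
doc_id = processor.process_document("data/documents/test_document.txt")
print(doc_id, "->", len(processor.get_document_chunks(doc_id)), "chunks stored")
```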
src/main.py
ADDED
@@ -0,0 +1,60 @@
"""
Main application entry point for Norwegian RAG chatbot.
"""

import os
import argparse
from typing import Dict, Any, Optional

from src.api.huggingface_api import HuggingFaceAPI
from src.document_processing.processor import DocumentProcessor
from src.rag.retriever import Retriever
from src.rag.generator import Generator
from src.web.app import ChatbotApp
from src.web.embed import EmbedGenerator, create_embed_html_file


def main():
    """
    Main entry point for the Norwegian RAG chatbot application.
    """
    # Parse command line arguments
    parser = argparse.ArgumentParser(description="Norwegian RAG Chatbot")
    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to run the server on")
    parser.add_argument("--port", type=int, default=7860, help="Port to run the server on")
    parser.add_argument("--share", action="store_true", help="Create a public link for sharing")
    parser.add_argument("--debug", action="store_true", help="Enable debug mode")
    args = parser.parse_args()

    # Initialize API client
    api_key = os.environ.get("HF_API_KEY", "")
    api_client = HuggingFaceAPI(api_key=api_key)

    # Initialize components
    document_processor = DocumentProcessor(api_client=api_client)
    retriever = Retriever(api_client=api_client)
    generator = Generator(api_client=api_client)

    # Create app
    app = ChatbotApp(
        api_client=api_client,
        document_processor=document_processor,
        retriever=retriever,
        generator=generator,
        title="Norwegian RAG Chatbot",
        description="En chatbot basert på Retrieval-Augmented Generation (RAG) for norsk språk."
    )

    # Create embedding example
    embed_generator = EmbedGenerator()
    create_embed_html_file(embed_generator)

    # Launch app
    app.launch(
        server_name=args.host,
        server_port=args.port,
        share=args.share,
        debug=args.debug
    )


if __name__ == "__main__":
    main()
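Since this entry point imports everything through the `src.` package, it is presumably launched as a module from the repository root, e.g. `python -m src.main --port 7860` with `HF_API_KEY` exported first; the exact invocation is an assumption, as the upload does not document it.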
src/project_structure.md
ADDED
@@ -0,0 +1,79 @@
# Norwegian RAG Chatbot Project Structure

## Overview
This document outlines the project structure for our lightweight Norwegian RAG chatbot implementation that uses Hugging Face's Inference API instead of running models locally.

## Directory Structure
```
chatbot_project/
├── design/                      # Design documents
│   ├── rag_architecture.md
│   ├── document_processing.md
│   └── chat_interface.md
├── research/                    # Research findings
│   └── norwegian_llm_research.md
├── src/                         # Source code
│   ├── api/                     # API integration
│   │   ├── __init__.py
│   │   ├── huggingface_api.py   # HF Inference API integration
│   │   └── config.py            # API configuration
│   ├── document_processing/     # Document processing
│   │   ├── __init__.py
│   │   ├── extractor.py         # Text extraction from documents
│   │   ├── chunker.py           # Text chunking
│   │   └── processor.py         # Main document processor
│   ├── rag/                     # RAG implementation
│   │   ├── __init__.py
│   │   ├── retriever.py         # Document retrieval
│   │   └── generator.py         # Response generation
│   ├── web/                     # Web interface
│   │   ├── __init__.py
│   │   ├── app.py               # Gradio app
│   │   └── embed.py             # Embedding functionality
│   ├── utils/                   # Utilities
│   │   ├── __init__.py
│   │   └── helpers.py           # Helper functions
│   └── main.py                  # Main application entry point
├── data/                        # Data storage
│   ├── documents/               # Original documents
│   └── processed/               # Processed documents and embeddings
├── tests/                       # Tests
│   ├── test_api.py
│   ├── test_document_processing.py
│   └── test_rag.py
├── venv/                        # Virtual environment
├── requirements-ultra-light.txt # Lightweight dependencies
├── requirements.txt             # Original requirements (for reference)
└── README.md                    # Project documentation
```

## Key Components

### 1. API Integration (`src/api/`)
- `huggingface_api.py`: Integration with Hugging Face Inference API for both LLM and embedding models
- `config.py`: Configuration for API endpoints, model IDs, and API keys

### 2. Document Processing (`src/document_processing/`)
- `extractor.py`: Extract text from various document formats
- `chunker.py`: Split documents into manageable chunks
- `processor.py`: Orchestrate the document processing pipeline

### 3. RAG Implementation (`src/rag/`)
- `retriever.py`: Retrieve relevant document chunks based on query
- `generator.py`: Generate responses using retrieved context

### 4. Web Interface (`src/web/`)
- `app.py`: Gradio web interface for the chatbot
- `embed.py`: Generate embedding code for website integration

### 5. Main Application (`src/main.py`)
- Entry point for the application
- Orchestrates the different components

## Implementation Approach

1. **Remote Model Execution**: Use Hugging Face's Inference API for both LLM and embedding models
2. **Lightweight Document Processing**: Process documents locally but use remote APIs for embedding generation
3. **Simple Vector Storage**: Store embeddings in a simple file-based format rather than a dedicated vector database (see the record sketch after this file)
4. **Gradio Interface**: Create a simple but effective chat interface using Gradio
5. **Hugging Face Spaces Deployment**: Deploy the final solution to Hugging Face Spaces
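To make point 3 of the implementation approach concrete, here is a sketch of the per-document JSON record that `DocumentProcessor._save_processed_document` writes to `data/processed/`; the field names follow that method, while every value below is an illustrative placeholder.

```python
import json

# Illustrative record only; shapes mirror _save_processed_document.
record = {
    "document_id": "doc_20250101120000_test_document.txt",
    "metadata": {
        "filename": "test_document.txt",
        "processed_date": "2025-01-01T12:00:00",
        "chunk_count": 2,
        "chunking_strategy": "paragraph",
        "embedding_model": "<embedding-model-id>",  # placeholder, see src/api/config.py
    },
    "chunks": ["Første tekstbit.", "Andre tekstbit."],
    "embeddings": [[0.012, -0.034], [0.051, 0.007]],  # truncated vectors
}
print(json.dumps(record, ensure_ascii=False, indent=2))
```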
src/rag/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
RAG module for Norwegian chatbot.
"""
src/rag/generator.py
ADDED
@@ -0,0 +1,87 @@
"""
Generator module for Norwegian RAG chatbot.
Generates responses using retrieved context and LLM.
"""

from typing import List, Dict, Any, Optional

from ..api.huggingface_api import HuggingFaceAPI, create_rag_prompt


class Generator:
    """
    Generates responses using retrieved context and LLM.
    Uses Hugging Face Inference API for text generation.
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
    ):
        """
        Initialize the generator.

        Args:
            api_client: HuggingFaceAPI client for text generation
        """
        self.api_client = api_client or HuggingFaceAPI()

    def generate(
        self,
        query: str,
        retrieved_chunks: List[Dict[str, Any]],
        temperature: float = 0.7
    ) -> str:
        """
        Generate a response using retrieved context.

        Args:
            query: User query
            retrieved_chunks: List of retrieved chunks with metadata
            temperature: Temperature for text generation

        Returns:
            Generated response
        """
        # Extract text from retrieved chunks
        context_texts = [chunk["chunk_text"] for chunk in retrieved_chunks]

        # If no context is retrieved, generate a response without context
        if not context_texts:
            return self._generate_without_context(query, temperature)

        # Create RAG prompt
        prompt = create_rag_prompt(query, context_texts)

        # Generate response
        response = self.api_client.generate_text(
            prompt=prompt,
            temperature=temperature
        )

        return response

    def _generate_without_context(self, query: str, temperature: float = 0.7) -> str:
        """
        Generate a response without context when no relevant chunks are found.

        Args:
            query: User query
            temperature: Temperature for text generation

        Returns:
            Generated response
        """
        prompt = f"""Du er en hjelpsom assistent som svarer på norsk. Svar på følgende spørsmål så godt du kan.

SPØRSMÅL:
{query}

SVAR:
"""

        response = self.api_client.generate_text(
            prompt=prompt,
            temperature=temperature
        )

        return response
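A sketch of the generator driven with hand-made chunks, so both code paths can be seen without a populated index; the chunk dict mirrors the shape produced by `Retriever.retrieve` later in this commit, and a live API key is assumed.

```python
from src.rag.generator import Generator

generator = Generator()

# RAG path: chunks present, so create_rag_prompt builds the context prompt.
chunks = [{
    "document_id": "doc_x", "chunk_index": 0,
    "chunk_text": "Oslo er hovedstaden i Norge.",
    "similarity": 0.91, "metadata": {},
}]
print(generator.generate("Hva er hovedstaden i Norge?", chunks, temperature=0.3))

# Fallback path: no chunks, so _generate_without_context is used.
print(generator.generate("Hva er hovedstaden i Norge?", [], temperature=0.3))
```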
src/rag/retriever.py
ADDED
@@ -0,0 +1,163 @@
"""
Retriever module for Norwegian RAG chatbot.
Retrieves relevant document chunks based on query embeddings.
"""

import os
import json
import numpy as np
from typing import List, Dict, Any, Optional, Tuple, Union

from ..api.huggingface_api import HuggingFaceAPI
from ..api.config import MAX_CHUNKS_TO_RETRIEVE, SIMILARITY_THRESHOLD


class Retriever:
    """
    Retrieves relevant document chunks based on query embeddings.
    Uses cosine similarity to find the most relevant chunks.
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
        processed_dir: str = "/home/ubuntu/chatbot_project/data/processed",
        max_chunks: int = MAX_CHUNKS_TO_RETRIEVE,
        similarity_threshold: float = SIMILARITY_THRESHOLD
    ):
        """
        Initialize the retriever.

        Args:
            api_client: HuggingFaceAPI client for generating embeddings
            processed_dir: Directory containing processed documents
            max_chunks: Maximum number of chunks to retrieve
            similarity_threshold: Minimum similarity score for retrieval
        """
        self.api_client = api_client or HuggingFaceAPI()
        self.processed_dir = processed_dir
        self.max_chunks = max_chunks
        self.similarity_threshold = similarity_threshold

        # Load document index
        self.document_index_path = os.path.join(self.processed_dir, "document_index.json")
        self.document_index = self._load_document_index()

    def retrieve(self, query: str) -> List[Dict[str, Any]]:
        """
        Retrieve relevant document chunks for a query.

        Args:
            query: User query

        Returns:
            List of retrieved chunks with metadata
        """
        # Generate embedding for the query
        query_embedding = self.api_client.generate_embeddings(query)[0]

        # Find relevant chunks across all documents
        all_results = []

        for doc_id in self.document_index:
            try:
                # Load document data
                doc_results = self._retrieve_from_document(doc_id, query_embedding)
                all_results.extend(doc_results)
            except Exception as e:
                print(f"Error retrieving from document {doc_id}: {str(e)}")

        # Sort all results by similarity score
        all_results.sort(key=lambda x: x["similarity"], reverse=True)

        # Return top results above threshold
        return [
            result for result in all_results[:self.max_chunks]
            if result["similarity"] >= self.similarity_threshold
        ]

    def _retrieve_from_document(
        self,
        document_id: str,
        query_embedding: List[float]
    ) -> List[Dict[str, Any]]:
        """
        Retrieve relevant chunks from a specific document.

        Args:
            document_id: Document ID
            query_embedding: Query embedding vector

        Returns:
            List of retrieved chunks with metadata
        """
        document_path = os.path.join(self.processed_dir, f"{document_id}.json")
        if not os.path.exists(document_path):
            return []

        # Load document data
        with open(document_path, 'r', encoding='utf-8') as f:
            document_data = json.load(f)

        chunks = document_data.get("chunks", [])
        embeddings = document_data.get("embeddings", [])
        metadata = document_data.get("metadata", {})

        if not chunks or not embeddings or len(chunks) != len(embeddings):
            return []

        # Calculate similarity scores
        results = []
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            similarity = self._cosine_similarity(query_embedding, embedding)

            results.append({
                "document_id": document_id,
                "chunk_index": i,
                "chunk_text": chunk,
                "similarity": similarity,
                "metadata": metadata
            })

        # Sort by similarity
        results.sort(key=lambda x: x["similarity"], reverse=True)

        return results[:self.max_chunks]

    def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """
        Calculate cosine similarity between two vectors.

        Args:
            vec1: First vector
            vec2: Second vector

        Returns:
            Cosine similarity score
        """
        vec1 = np.array(vec1)
        vec2 = np.array(vec2)

        dot_product = np.dot(vec1, vec2)
        norm1 = np.linalg.norm(vec1)
        norm2 = np.linalg.norm(vec2)

        if norm1 == 0 or norm2 == 0:
            return 0.0

        return dot_product / (norm1 * norm2)

    def _load_document_index(self) -> Dict[str, Dict[str, Any]]:
        """
        Load the document index from disk.

        Returns:
            Dictionary of document IDs to metadata
        """
        if os.path.exists(self.document_index_path):
            try:
                with open(self.document_index_path, 'r', encoding='utf-8') as f:
                    return json.load(f)
            except Exception as e:
                print(f"Error loading document index: {str(e)}")

        return {}
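Putting retrieval and generation together, the core query path is roughly the following sketch; it assumes documents have already been processed into the given `processed_dir` (the relative path is an assumption).

```python
from src.rag.retriever import Retriever
from src.rag.generator import Generator

retriever = Retriever(processed_dir="data/processed")
generator = Generator()

query = "Hva er hovedstaden i Norge?"
hits = retriever.retrieve(query)   # ranked by cosine similarity, thresholded
print(generator.generate(query, hits))
```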
src/web/__init__.py
ADDED
@@ -0,0 +1,3 @@
"""
Web interface module for Norwegian RAG chatbot.
"""
src/web/app.py
ADDED
@@ -0,0 +1,301 @@
"""
Gradio app for Norwegian RAG chatbot.
Provides a web interface for interacting with the chatbot.
"""

import os
import gradio as gr
import tempfile
from typing import List, Dict, Any, Tuple, Optional

from ..api.huggingface_api import HuggingFaceAPI
from ..document_processing.processor import DocumentProcessor
from ..rag.retriever import Retriever
from ..rag.generator import Generator


class ChatbotApp:
    """
    Gradio app for Norwegian RAG chatbot.
    """

    def __init__(
        self,
        api_client: Optional[HuggingFaceAPI] = None,
        document_processor: Optional[DocumentProcessor] = None,
        retriever: Optional[Retriever] = None,
        generator: Optional[Generator] = None,
        title: str = "Norwegian RAG Chatbot",
        description: str = "En chatbot basert på Retrieval-Augmented Generation (RAG) for norsk språk."
    ):
        """
        Initialize the chatbot app.

        Args:
            api_client: HuggingFaceAPI client
            document_processor: Document processor
            retriever: Retriever for finding relevant chunks
            generator: Generator for creating responses
            title: App title
            description: App description
        """
        # Initialize components
        self.api_client = api_client or HuggingFaceAPI()
        self.document_processor = document_processor or DocumentProcessor(api_client=self.api_client)
        self.retriever = retriever or Retriever(api_client=self.api_client)
        self.generator = generator or Generator(api_client=self.api_client)

        # App settings
        self.title = title
        self.description = description

        # Initialize Gradio app
        self.app = self._build_interface()

    def _build_interface(self) -> gr.Blocks:
        """
        Build the Gradio interface.

        Returns:
            Gradio Blocks interface
        """
        with gr.Blocks(title=self.title) as app:
            gr.Markdown(f"# {self.title}")
            gr.Markdown(self.description)

            with gr.Tabs():
                # Chat tab
                with gr.Tab("Chat"):
                    chatbot = gr.Chatbot(height=500)

                    with gr.Row():
                        msg = gr.Textbox(
                            placeholder="Skriv din melding her...",
                            show_label=False,
                            scale=9
                        )
                        submit_btn = gr.Button("Send", scale=1)

                    with gr.Accordion("Avanserte innstillinger", open=False):
                        temperature = gr.Slider(
                            minimum=0.1,
                            maximum=1.0,
                            value=0.7,
                            step=0.1,
                            label="Temperatur"
                        )

                    clear_btn = gr.Button("Tøm chat")

                    # Set up event handlers
                    submit_btn.click(
                        fn=self._respond,
                        inputs=[msg, chatbot, temperature],
                        outputs=[msg, chatbot]
                    )

                    msg.submit(
                        fn=self._respond,
                        inputs=[msg, chatbot, temperature],
                        outputs=[msg, chatbot]
                    )

                    clear_btn.click(
                        fn=lambda: None,
                        inputs=None,
                        outputs=chatbot,
                        queue=False
                    )

                # Document upload tab
                with gr.Tab("Last opp dokumenter"):
                    with gr.Row():
                        with gr.Column(scale=2):
                            file_output = gr.File(label="Opplastede dokumenter")
                            upload_button = gr.UploadButton(
                                "Klikk for å laste opp dokument",
                                file_types=["pdf", "txt", "html"],
                                file_count="multiple"
                            )

                        with gr.Column(scale=3):
                            documents_list = gr.Dataframe(
                                headers=["Dokument ID", "Filnavn", "Dato", "Chunks"],
                                label="Dokumentliste",
                                interactive=False
                            )

                    process_status = gr.Textbox(label="Status", interactive=False)
                    refresh_btn = gr.Button("Oppdater dokumentliste")

                    # Set up event handlers
                    upload_button.upload(
                        fn=self._process_uploaded_files,
                        inputs=[upload_button],
                        outputs=[process_status, documents_list]
                    )

                    refresh_btn.click(
                        fn=self._get_documents_list,
                        inputs=None,
                        outputs=[documents_list]
                    )

                # Embed tab
                with gr.Tab("Integrer"):
                    gr.Markdown("## Integrer chatboten på din nettside")

                    with gr.Row():
                        with gr.Column():
                            gr.Markdown("### iFrame-kode")
                            iframe_code = gr.Code(
                                label="iFrame",
                                language="html",
                                value='<iframe src="https://huggingface.co/spaces/username/norwegian-rag-chatbot" width="100%" height="500px"></iframe>'
                            )

                        with gr.Column():
                            gr.Markdown("### JavaScript Widget")
                            js_code = gr.Code(
                                label="JavaScript",
                                language="html",
                                value='<script src="https://huggingface.co/spaces/username/norwegian-rag-chatbot/widget.js"></script>'
                            )

                    gr.Markdown("### Forhåndsvisning")
                    gr.Markdown("*Forhåndsvisning vil være tilgjengelig etter at chatboten er distribuert til Hugging Face Spaces.*")

            gr.Markdown("---")
            gr.Markdown("Bygget med [Hugging Face](https://huggingface.co/) og [Gradio](https://gradio.app/)")

        return app

    def _respond(
        self,
        message: str,
        chat_history: List[Tuple[str, str]],
        temperature: float
    ) -> Tuple[str, List[Tuple[str, str]]]:
        """
        Generate a response to the user message.

        Args:
            message: User message
            chat_history: Chat history
            temperature: Temperature for text generation

        Returns:
            Empty message and updated chat history
        """
        if not message:
            return "", chat_history

        # Add user message to chat history
        chat_history.append((message, None))

        try:
            # Retrieve relevant chunks
            retrieved_chunks = self.retriever.retrieve(message)

            # Generate response
            response = self.generator.generate(
                query=message,
                retrieved_chunks=retrieved_chunks,
                temperature=temperature
            )

            # Update chat history with response
            chat_history[-1] = (message, response)
        except Exception as e:
            # Handle errors
            error_message = f"Beklager, det oppstod en feil: {str(e)}"
            chat_history[-1] = (message, error_message)

        return "", chat_history

    def _process_uploaded_files(
        self,
        files: List[tempfile._TemporaryFileWrapper]
    ) -> Tuple[str, List[List[str]]]:
        """
        Process uploaded files.

        Args:
            files: List of uploaded files

        Returns:
            Status message and updated documents list
        """
        if not files:
            return "Ingen filer lastet opp.", self._get_documents_list()

        processed_files = []

        for file in files:
            try:
                # Process the document
                document_id = self.document_processor.process_document(file.name)
                processed_files.append(os.path.basename(file.name))
            except Exception as e:
                return f"Feil ved behandling av {os.path.basename(file.name)}: {str(e)}", self._get_documents_list()

        if len(processed_files) == 1:
            status = f"Fil behandlet: {processed_files[0]}"
        else:
            status = f"{len(processed_files)} filer behandlet: {', '.join(processed_files)}"

        return status, self._get_documents_list()

    def _get_documents_list(self) -> List[List[str]]:
        """
        Get list of processed documents.

        Returns:
            List of document information
        """
        documents = self.document_processor.get_all_documents()

        # Format for dataframe
        documents_list = []
        for doc_id, metadata in documents.items():
            filename = metadata.get("filename", "N/A")
            processed_date = metadata.get("processed_date", "N/A")
            chunk_count = metadata.get("chunk_count", 0)

            documents_list.append([doc_id, filename, processed_date, chunk_count])

        return documents_list

    def launch(self, **kwargs):
        """
        Launch the Gradio app.

        Args:
            **kwargs: Additional arguments for gr.launch()
        """
        self.app.launch(**kwargs)


def create_app():
    """
    Create and configure the chatbot app.

    Returns:
        Configured ChatbotApp instance
    """
    # Initialize API client
    api_client = HuggingFaceAPI()

    # Initialize components
    document_processor = DocumentProcessor(api_client=api_client)
    retriever = Retriever(api_client=api_client)
    generator = Generator(api_client=api_client)

    # Create app
    app = ChatbotApp(
        api_client=api_client,
        document_processor=document_processor,
        retriever=retriever,
        generator=generator
    )

    return app
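As a usage note, the `create_app()` factory above allows launching the interface directly, without going through `src/main.py`; the host/port values here are illustrative.

```python
from src.web.app import create_app

create_app().launch(server_name="0.0.0.0", server_port=7860)
```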
src/web/embed.py
ADDED
@@ -0,0 +1,211 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
"""
Website embedding functionality for the Norwegian RAG chatbot.
Provides utilities for embedding the chat interface in external websites
(distinct from the text embeddings used for retrieval).
"""

import html
from typing import Dict, Optional


class EmbedGenerator:
    """
    Generates embed code for integrating the chatbot into external websites.
    """

    def __init__(
        self,
        space_name: Optional[str] = None,
        username: Optional[str] = None,
        height: int = 500,
        width: str = "100%"
    ):
        """
        Initialize the embed generator.

        Args:
            space_name: Hugging Face Space name
            username: Hugging Face username
            height: Default iframe height in pixels
            width: Default iframe width (pixels or percentage)
        """
        self.space_name = space_name or "norwegian-rag-chatbot"
        self.username = username or "username"
        self.height = height
        self.width = width

    def get_iframe_code(
        self,
        height: Optional[int] = None,
        width: Optional[str] = None
    ) -> str:
        """
        Generate iframe embed code.

        Args:
            height: Optional custom height in pixels
            width: Optional custom width

        Returns:
            HTML iframe code
        """
        h = height or self.height
        w = width or self.width

        return f'<iframe src="https://huggingface.co/spaces/{self.username}/{self.space_name}" width="{w}" height="{h}" frameborder="0"></iframe>'

    def get_javascript_widget_code(self) -> str:
        """
        Generate JavaScript widget embed code.

        Returns:
            HTML script tag for the widget
        """
        return f'<script src="https://huggingface.co/spaces/{self.username}/{self.space_name}/widget.js"></script>'

    def get_direct_url(self) -> str:
        """
        Get the direct URL to the Hugging Face Space.

        Returns:
            URL to the Hugging Face Space
        """
        return f"https://huggingface.co/spaces/{self.username}/{self.space_name}"

    def get_embed_options(self) -> Dict[str, str]:
        """
        Get all embedding options.

        Returns:
            Dictionary mapping option names ("iframe", "javascript", "url") to snippets
        """
        return {
            "iframe": self.get_iframe_code(),
            "javascript": self.get_javascript_widget_code(),
            "url": self.get_direct_url()
        }

    def update_space_info(self, username: str, space_name: str) -> None:
        """
        Update Hugging Face Space information.

        Args:
            username: Hugging Face username
            space_name: Hugging Face Space name
        """
        self.username = username
        self.space_name = space_name


def create_embed_html_file(
    embed_generator: EmbedGenerator,
    output_path: str = "embed_example.html"
) -> str:
    """
    Create an HTML file with embedding examples.

    Args:
        embed_generator: EmbedGenerator instance
        output_path: Path to save the HTML file

    Returns:
        Path to the created HTML file
    """
    # Escape the snippets shown inside <pre> blocks so the browser displays
    # them as text instead of rendering them.
    iframe_snippet = html.escape(embed_generator.get_iframe_code())
    widget_snippet = html.escape(embed_generator.get_javascript_widget_code())
    custom_snippet = html.escape(embed_generator.get_iframe_code(height=600, width="80%"))

    html_content = f"""<!DOCTYPE html>
<html lang="no">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Norwegian RAG Chatbot - Embedding Examples</title>
    <style>
        body {{
            font-family: Arial, sans-serif;
            line-height: 1.6;
            max-width: 800px;
            margin: 0 auto;
            padding: 20px;
        }}
        h1, h2, h3 {{
            color: #2c3e50;
        }}
        .code-block {{
            background-color: #f8f9fa;
            border: 1px solid #ddd;
            border-radius: 4px;
            padding: 15px;
            margin: 15px 0;
            overflow-x: auto;
        }}
        .example {{
            margin: 30px 0;
            padding: 20px;
            border: 1px solid #eee;
            border-radius: 5px;
        }}
    </style>
</head>
<body>
    <h1>Norwegian RAG Chatbot - Embedding Examples</h1>

    <p>
        This page demonstrates how to embed the Norwegian RAG Chatbot into your website.
        There are multiple ways to integrate the chatbot, depending on your needs.
    </p>

    <h2>Option 1: iFrame Embedding</h2>
    <p>
        The simplest way to embed the chatbot is using an iFrame. Copy and paste the following code into your HTML:
    </p>
    <div class="code-block">
        <pre>{iframe_snippet}</pre>
    </div>

    <div class="example">
        <h3>Example:</h3>
        {embed_generator.get_iframe_code()}
    </div>

    <h2>Option 2: JavaScript Widget</h2>
    <p>
        For a more integrated experience, you can use the JavaScript widget. Copy and paste the following code into your HTML:
    </p>
    <div class="code-block">
        <pre>{widget_snippet}</pre>
    </div>

    <div class="example">
        <h3>Example:</h3>
        <p>The widget will appear below once the page is hosted on a web server:</p>
        <!-- Widget will be inserted here when the script runs -->
    </div>

    <h2>Option 3: Direct Link</h2>
    <p>
        You can also provide a direct link to the chatbot:
    </p>
    <div class="code-block">
        <pre>{embed_generator.get_direct_url()}</pre>
    </div>

    <h2>Customization</h2>
    <p>
        You can customize the appearance of the embedded chatbot by adjusting the iFrame dimensions:
    </p>
    <div class="code-block">
        <pre>{custom_snippet}</pre>
    </div>

    <footer>
        <p>
            <small>
                Created with <a href="https://huggingface.co/" target="_blank">Hugging Face</a> and
                <a href="https://gradio.app/" target="_blank">Gradio</a>.
            </small>
        </p>
    </footer>
</body>
</html>
"""

    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(html_content)

    return output_path
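
For illustration, a short sketch of how `EmbedGenerator` and `create_embed_html_file` might be exercised once a Space is published. The `myuser` username is a placeholder; no such Space is assumed to exist, and this snippet is not part of the commit.

```python
# Sketch: generate embed snippets for a hypothetical published Space.
from src.web.embed import EmbedGenerator, create_embed_html_file

generator = EmbedGenerator(username="myuser", space_name="norwegian-rag-chatbot")

# Individual snippets
print(generator.get_iframe_code(height=600, width="80%"))
print(generator.get_direct_url())

# Or write a self-contained demo page covering all three options
path = create_embed_html_file(generator, output_path="embed_example.html")
print(f"Embedding examples written to {path}")
```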
todo.md
ADDED
@@ -0,0 +1,26 @@
# Norwegian RAG Chatbot Project Todo

## Research Phase
- [x] Research open-source LLMs with good Norwegian language support
- [x] Evaluate embedding models for Norwegian text
- [x] Research vector database options for RAG implementation
- [x] Document findings and select best options

## Design Phase
- [x] Design RAG architecture
- [x] Plan document processing pipeline
- [x] Design chat interface
- [x] Plan embedding functionality

## Implementation Phase
- [ ] Set up development environment
- [ ] Implement document processing and embedding
- [ ] Integrate LLM
- [ ] Create chat interface
- [ ] Develop embedding functionality

## Testing and Finalization
- [ ] Test with Norwegian content
- [ ] Optimize performance
- [ ] Document usage and integration
- [ ] Finalize solution