Norwegian LLM and Embedding Models Research
Open-Source LLMs with Norwegian Language Support
1. NorMistral-7b-scratch
- Description: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (using six repetitions of open Norwegian texts).
- Architecture: Based on Mistral architecture with 7 billion parameters
- Context Length: 2k tokens
- Performance:
- Perplexity on NCC validation set: 7.43
- Good performance on reading comprehension, sentiment analysis, and machine translation tasks
- License: Apache-2.0
- Hugging Face: https://huggingface.co/norallm/normistral-7b-scratch
- Notes: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo
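As a base (non-instruction-tuned) model, NorMistral-7b-scratch can be loaded with the standard transformers causal-LM API. A minimal sketch, assuming the model id from the link above; the prompt and generation parameters are illustrative, and the ~14 GB download is guarded behind the main block:

```python
# Sketch: plain text continuation with NorMistral-7b-scratch.
# The model id comes from the Hugging Face link above; generation
# settings are illustrative assumptions, not a recommended recipe.
MODEL_ID = "norallm/normistral-7b-scratch"

def build_prompt(text: str) -> str:
    """Plain continuation prompt; the base model is not instruction-tuned."""
    return text.strip() + "\n"

if __name__ == "__main__":
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(build_prompt("Oslo er hovedstaden i"),
                       return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```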
2. Viking 7B
- Description: The first multilingual large language model for all Nordic languages (including Norwegian)
- Architecture: Similar to Llama 2, with flash attention, rotary embeddings, grouped query attention
- Context Length: 4k tokens
- Performance: Reported by its developers as best-in-class across the Nordic languages without compromising English performance
- License: Apache 2.0
- Notes:
- Developed by Silo AI and University of Turku's research group TurkuNLP
- Also available in larger sizes (13B and 33B parameters)
- Trained on 2 trillion tokens including Danish, English, Finnish, Icelandic, Norwegian, Swedish and programming languages
3. NorskGPT
- Description: A family of Norwegian large language models developed for Norwegian society
- Versions:
- NorskGPT-Mistral: 7B dense transformer with an 8K context window, based on Mistral 7B
- NorskGPT-LLAMA2: 7B and 13B parameter models with a 4K context length, based on Llama 2
- License: cc-by-nc-sa-4.0 (non-commercial)
- Website: https://www.norskgpt.com/norskgpt-llm
Embedding Models for Norwegian
1. NbAiLab/nb-sbert-base
- Description: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
- Architecture: Based on nb-bert-base
- Vector Dimensions: 768
- Performance:
- Cosine Similarity: Pearson 0.8275, Spearman 0.8245
- License: apache-2.0
- Hugging Face: https://huggingface.co/NbAiLab/nb-sbert-base
- Use Cases:
- Sentence similarity
- Semantic search
- Few-shot classification (with SetFit)
- Keyword extraction (with KeyBERT)
- Topic modeling (with BERTopic)
- Notes: Works well with both Norwegian and English, making it ideal for bilingual applications
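The use cases above all start from the same step: encoding sentences into 768-dim vectors and comparing them by cosine similarity. A minimal sketch, assuming the sentence-transformers library; the example sentences are illustrative, and the model download is guarded behind the main block:

```python
# Sketch: Norwegian/English sentence similarity with NbAiLab/nb-sbert-base
# (768-dim embeddings). The sentence pair below is an illustrative example.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("NbAiLab/nb-sbert-base")
    emb = model.encode([
        "Dette er en norsk setning.",    # Norwegian
        "This is a Norwegian sentence.", # English paraphrase
    ])
    # Cross-lingual pairs like this should score high, per the note above.
    print(cosine_similarity(emb[0], emb[1]))
```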
2. FFI/SimCSE-NB-BERT-large
Vector Database Options for Hugging Face RAG Integration
1. Milvus
2. MongoDB
3. MyScale
4. FAISS (Facebook AI Similarity Search)
- Integration: A lightweight similarity-search library (not a standalone database server) that integrates well with the Hugging Face ecosystem
- Notes: Can be used with autofaiss for quick experimentation
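For embeddings like nb-sbert-base's, the usual FAISS pattern is to L2-normalize the vectors and use an inner-product index, so that inner product equals cosine similarity. A minimal sketch with random stand-in vectors (dimension matches nb-sbert-base; the FAISS calls are guarded behind the main block):

```python
# Sketch: exact nearest-neighbour search with FAISS over 768-dim vectors.
# Random vectors stand in for real sentence embeddings.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=1, keepdims=True)

if __name__ == "__main__":
    import faiss

    dim, n = 768, 1000
    rng = np.random.default_rng(0)
    corpus = normalize(rng.standard_normal((n, dim)).astype("float32"))

    index = faiss.IndexFlatIP(dim)  # inner product == cosine on unit vectors
    index.add(corpus)

    scores, ids = index.search(corpus[:1], 5)  # query with the first vector
    print(ids[0])  # nearest hit should be the query itself (id 0)
```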
Hugging Face RAG Implementation Options
- Transformers Library: Provides access to pre-trained models
- Sentence Transformers: For text embeddings
- Datasets: For managing and processing data
- LangChain Integration: For advanced RAG pipelines
- Spaces: For deploying and sharing the application
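The components above combine into a basic RAG loop: embed the corpus, retrieve the top-k passages for a query, and stuff them into a prompt for the generator. A hedged sketch of the retrieval and prompt-assembly steps; in practice the embeddings would come from nb-sbert-base and the answer from an LLM such as NorMistral, and the prompt wording here is an illustrative choice:

```python
# Sketch: minimal RAG retrieval + prompt assembly. Assumes embeddings are
# unit-normalized (as in the FAISS example above); function names and the
# Norwegian prompt template are illustrative assumptions.
import numpy as np

def top_k(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int) -> list[int]:
    """Indices of the k most similar corpus rows (unit-normalized inputs)."""
    scores = corpus_emb @ query_emb
    return list(np.argsort(-scores)[:k])

def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Assemble a context-stuffed prompt for the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Kontekst:\n{context}\n\nSpørsmål: {question}\nSvar:"
```

A LangChain pipeline automates the same steps (embed, retrieve, prompt, generate) behind a retriever/chain abstraction.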