Norwegian LLM and Embedding Models Research

Open-Source LLMs with Norwegian Language Support

1. NorMistral-7b-scratch

  • Description: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (using six repetitions of open Norwegian texts).
  • Architecture: Based on Mistral architecture with 7 billion parameters
  • Context Length: 2k tokens
  • Performance:
    • Perplexity on NCC validation set: 7.43
    • Good performance on reading comprehension, sentiment analysis, and machine translation tasks
  • License: Apache-2.0
  • Hugging Face: https://huggingface.co/norallm/normistral-7b-scratch
  • Notes: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo; a minimal loading sketch follows this entry
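
For quick experimentation, the model loads like any other causal LM on the Hub. Below is a minimal sketch, assuming the transformers and accelerate libraries are installed; the bfloat16 dtype, device placement, and the example prompt are illustrative assumptions, not requirements from the model card.

```python
# Minimal sketch: load NorMistral-7b-scratch and complete a Norwegian prompt.
# The repo id comes from the Hugging Face link above; dtype/device settings
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-scratch"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 on supported hardware
    device_map="auto",           # requires `accelerate`; places layers on available devices
)

prompt = "Oslo er hovedstaden i Norge. Byen er kjent for"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model with no instruction tuning, it continues text rather than following instructions; prompts should be phrased as completions.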

2. Viking 7B

  • Description: Billed by its developers as the first multilingual large language model covering all Nordic languages (including Norwegian)
  • Architecture: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
  • Context Length: 4k tokens
  • Performance: Reported by its developers as best-in-class across the Nordic languages without compromising English performance
  • License: Apache 2.0
  • Notes:
    • Developed by Silo AI and University of Turku's research group TurkuNLP
    • Also available in larger sizes (13B and 33B parameters)
    • Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, and Swedish, as well as programming languages

3. NorskGPT

  • Description: A family of Norwegian large language models developed for Norwegian society
  • Versions:
    • NorskGPT-Mistral: 7B dense transformer with 8K context window, based on Mistral 7B
    • NorskGPT-LLAMA2: 7B and 13B parameter models with 4K context length, based on LLAMA2
  • License: cc-by-nc-sa-4.0 (non-commercial)
  • Website: https://www.norskgpt.com/norskgpt-llm

Embedding Models for Norwegian

1. NbAiLab/nb-sbert-base

  • Description: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
  • Architecture: Based on nb-bert-base
  • Vector Dimensions: 768
  • Performance:
    • Cosine Similarity: Pearson 0.8275, Spearman 0.8245
  • License: apache-2.0
  • Hugging Face: https://huggingface.co/NbAiLab/nb-sbert-base
  • Use Cases:
    • Sentence similarity
    • Semantic search
    • Few-shot classification (with SetFit)
    • Keyword extraction (with KeyBERT)
    • Topic modeling (with BERTopic)
  • Notes: Works well with both Norwegian and English, making it well suited to bilingual applications; see the usage sketch below
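
Because it is a standard SentenceTransformers model, bilingual similarity scoring takes only a few lines. A minimal sketch, assuming sentence-transformers is installed; the example sentences are invented for illustration.

```python
# Minimal sketch: encode Norwegian/English sentences with nb-sbert-base and
# compare them with cosine similarity (the metric reported on the model card).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en testsetning.",      # Norwegian: "This is a test sentence."
    "This is a test sentence.",
    "Oslo er hovedstaden i Norge.",  # unrelated sentence for contrast
]

embeddings = model.encode(sentences)           # shape: (3, 768)
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities
print(scores[0])  # the Norwegian/English pair should score far above the unrelated one
```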

2. FFI/SimCSE-NB-BERT-large

  • Description: A Norwegian sentence-embedding model trained with the SimCSE contrastive objective on top of NB-BERT-large, published by the Norwegian Defence Research Establishment (FFI)
  • Hugging Face: https://huggingface.co/FFI/SimCSE-NB-BERT-large

Vector Database Options for Hugging Face RAG Integration

1. Milvus

  • Integration: Open-source vector database purpose-built for similarity search; commonly paired with Hugging Face embeddings through the pymilvus client or LangChain

2. MongoDB

  • Integration: General-purpose document database whose Atlas Vector Search feature adds vector indexing alongside existing application data

3. MyScale

  • Integration: SQL vector database built on ClickHouse, combining standard SQL queries with vector search

4. FAISS (Facebook AI Similarity Search)

  • Integration: Lightweight similarity-search library (not a standalone database) that works well with Hugging Face embeddings
  • Notes: Can be used with autofaiss for quick experimentation; see the indexing sketch below
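
A minimal indexing sketch follows, assuming faiss-cpu and sentence-transformers are installed and reusing nb-sbert-base from the embedding section; the documents and the choice of IndexFlatIP (exact inner-product search, equivalent to cosine similarity after L2 normalization) are illustrative assumptions.

```python
# Minimal sketch: build an in-memory FAISS index over nb-sbert-base embeddings
# and run a semantic search. Install with `pip install faiss-cpu`.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")

documents = [
    "Oslo er hovedstaden i Norge.",
    "Bergen er kjent for regn og fjell.",
    "FAISS er et bibliotek for likhetssøk.",
]

doc_vectors = encoder.encode(documents).astype(np.float32)
faiss.normalize_L2(doc_vectors)                  # in-place; makes inner product == cosine

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # exact search over 768-dim vectors
index.add(doc_vectors)

query = encoder.encode(["Hva er hovedstaden i Norge?"]).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)             # top-2 nearest documents
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[i]}")
```

For larger corpora, autofaiss can select and tune an approximate index automatically instead of the exact IndexFlatIP used here.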

Hugging Face RAG Implementation Options

  1. Transformers Library: Provides access to pre-trained models
  2. Sentence Transformers: For text embeddings
  3. Datasets: For managing and processing data
  4. LangChain Integration: For advanced RAG pipelines
  5. Spaces: For deploying and sharing the application
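
Putting these pieces together, the sketch below wires nb-sbert-base retrieval through a FAISS index into a prompt for NorMistral. It is a minimal sketch under stated assumptions: the prompt template, the two example documents, and the generation settings are invented for illustration, and since NorMistral-7b-scratch is a base model the answer is produced by plain completion.

```python
# Minimal end-to-end RAG sketch: retrieve with nb-sbert-base + FAISS, then
# generate with NorMistral. Prompt format and documents are illustrative.
import faiss
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Retrieval side: embed a tiny corpus and index it.
encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
documents = [
    "Oslo er hovedstaden i Norge og har omtrent 700 000 innbyggere.",
    "Bergen ligger på Vestlandet og er kjent for Bryggen.",
]
doc_vectors = encoder.encode(documents).astype(np.float32)
faiss.normalize_L2(doc_vectors)
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

# Generation side: a Norwegian base LM used as a completer.
model_id = "norallm/normistral-7b-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer(question: str, k: int = 1) -> str:
    """Retrieve the top-k documents and let the LM complete an answer."""
    q = encoder.encode([question]).astype(np.float32)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    context = "\n".join(documents[i] for i in ids[0])
    prompt = f"Kontekst: {context}\nSpørsmål: {question}\nSvar:"
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(inputs.input_ids, max_new_tokens=60, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(answer("Hva er hovedstaden i Norge?"))
```

The same structure maps onto LangChain (item 4): the encoder becomes an Embeddings object, the FAISS index a VectorStore, and the generate call an LLM wrapper.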