Norwegian LLM and Embedding Models Research

Open-Source LLMs with Norwegian Language Support

1. NorMistral-7b-scratch

  • Description: A large Norwegian language model pretrained from scratch on 260 billion subword tokens (using six repetitions of open Norwegian texts).
  • Architecture: Based on Mistral architecture with 7 billion parameters
  • Context Length: 2k tokens
  • Performance:
    • Perplexity on NCC validation set: 7.43
    • Good performance on reading comprehension, sentiment analysis, and machine translation tasks
  • License: Apache-2.0
  • Hugging Face: https://huggingface.co/norallm/normistral-7b-scratch
  • Notes: Part of the NORA.LLM family developed by the Language Technology Group at the University of Oslo; a minimal loading sketch follows this entry
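
For quick experimentation, the model loads like any other causal LM on the Hub. Below is a minimal sketch, assuming the transformers and accelerate libraries are installed; the bfloat16 dtype, device placement, and the example prompt are illustrative assumptions, not requirements from the model card.

```python
# Minimal sketch: load NorMistral-7b-scratch and complete a Norwegian prompt.
# The repo id comes from the Hugging Face link above; dtype/device settings
# are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-scratch"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32 on supported hardware
    device_map="auto",           # requires `accelerate`; places layers on available devices
)

prompt = "Oslo er hovedstaden i Norge. Byen er kjent for"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base model with no instruction tuning, it continues text rather than following instructions; prompts should be phrased as completions.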

2. Viking 7B

  • Description: Billed by its developers as the first multilingual large language model covering all Nordic languages (including Norwegian)
  • Architecture: Similar to Llama 2, with flash attention, rotary embeddings, and grouped-query attention
  • Context Length: 4k tokens
  • Performance: Reported by its developers as best-in-class across the Nordic languages without compromising English performance
  • License: Apache 2.0
  • Notes:
    • Developed by Silo AI and University of Turku's research group TurkuNLP
    • Also available in larger sizes (13B and 33B parameters)
    • Trained on 2 trillion tokens covering Danish, English, Finnish, Icelandic, Norwegian, and Swedish, as well as programming languages

3. NorskGPT

  • Description: A family of Norwegian large language models developed for Norwegian society
  • Versions:
    • NorskGPT-Mistral: 7B dense transformer with 8K context window, based on Mistral 7B
    • NorskGPT-LLAMA2: 7B and 13B parameter models with 4K context length, based on LLAMA2
  • License: cc-by-nc-sa-4.0 (non-commercial)
  • Website: https://www.norskgpt.com/norskgpt-llm

Embedding Models for Norwegian

1. NbAiLab/nb-sbert-base

  • Description: A SentenceTransformers model trained on a machine-translated version of the MNLI dataset
  • Architecture: Based on nb-bert-base
  • Vector Dimensions: 768
  • Performance:
    • Cosine Similarity: Pearson 0.8275, Spearman 0.8245
  • License: apache-2.0
  • Hugging Face: https://huggingface.co/NbAiLab/nb-sbert-base
  • Use Cases:
    • Sentence similarity
    • Semantic search
    • Few-shot classification (with SetFit)
    • Keyword extraction (with KeyBERT)
    • Topic modeling (with BERTopic)
  • Notes: Works well with both Norwegian and English, making it well suited to bilingual applications; see the usage sketch below
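
Because it is a standard SentenceTransformers model, bilingual similarity scoring takes only a few lines. A minimal sketch, assuming sentence-transformers is installed; the example sentences are invented for illustration.

```python
# Minimal sketch: encode Norwegian/English sentences with nb-sbert-base and
# compare them with cosine similarity (the metric reported on the model card).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("NbAiLab/nb-sbert-base")

sentences = [
    "Dette er en testsetning.",      # Norwegian: "This is a test sentence."
    "This is a test sentence.",
    "Oslo er hovedstaden i Norge.",  # unrelated sentence for contrast
]

embeddings = model.encode(sentences)           # shape: (3, 768)
scores = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities
print(scores[0])  # the Norwegian/English pair should score far above the unrelated one
```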

2. FFI/SimCSE-NB-BERT-large

  • Description: A Norwegian sentence-embedding model trained with the SimCSE contrastive objective on top of NB-BERT-large, published by the Norwegian Defence Research Establishment (FFI)
  • Hugging Face: https://huggingface.co/FFI/SimCSE-NB-BERT-large

Vector Database Options for Hugging Face RAG Integration

1. Milvus

  • Integration: Open-source vector database purpose-built for similarity search; commonly paired with Hugging Face embeddings through the pymilvus client or LangChain

2. MongoDB

  • Integration: General-purpose document database whose Atlas Vector Search feature adds vector indexing alongside existing application data

3. MyScale

  • Integration: SQL vector database built on ClickHouse, combining standard SQL queries with vector search

4. FAISS (Facebook AI Similarity Search)

  • Integration: Lightweight similarity-search library (not a standalone database) that works well with Hugging Face embeddings
  • Notes: Can be used with autofaiss for quick experimentation; see the indexing sketch below
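
A minimal indexing sketch follows, assuming faiss-cpu and sentence-transformers are installed and reusing nb-sbert-base from the embedding section; the documents and the choice of IndexFlatIP (exact inner-product search, equivalent to cosine similarity after L2 normalization) are illustrative assumptions.

```python
# Minimal sketch: build an in-memory FAISS index over nb-sbert-base embeddings
# and run a semantic search. Install with `pip install faiss-cpu`.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("NbAiLab/nb-sbert-base")

documents = [
    "Oslo er hovedstaden i Norge.",
    "Bergen er kjent for regn og fjell.",
    "FAISS er et bibliotek for likhetssøk.",
]

doc_vectors = encoder.encode(documents).astype(np.float32)
faiss.normalize_L2(doc_vectors)                  # in-place; makes inner product == cosine

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # exact search over 768-dim vectors
index.add(doc_vectors)

query = encoder.encode(["Hva er hovedstaden i Norge?"]).astype(np.float32)
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)             # top-2 nearest documents
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {documents[i]}")
```

For larger corpora, autofaiss can select and tune an approximate index automatically instead of the exact IndexFlatIP used here.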

Hugging Face RAG Implementation Options

  1. Transformers Library: Provides access to pre-trained models
  2. Sentence Transformers: For text embeddings
  3. Datasets: For managing and processing data
  4. LangChain Integration: For advanced RAG pipelines
  5. Spaces: For deploying and sharing the application
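
Putting these pieces together, the sketch below wires nb-sbert-base retrieval through a FAISS index into a prompt for NorMistral. It is a minimal sketch under stated assumptions: the prompt template, the two example documents, and the generation settings are invented for illustration, and since NorMistral-7b-scratch is a base model the answer is produced by plain completion.

```python
# Minimal end-to-end RAG sketch: retrieve with nb-sbert-base + FAISS, then
# generate with NorMistral. Prompt format and documents are illustrative.
import faiss
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Retrieval side: embed a tiny corpus and index it.
encoder = SentenceTransformer("NbAiLab/nb-sbert-base")
documents = [
    "Oslo er hovedstaden i Norge og har omtrent 700 000 innbyggere.",
    "Bergen ligger på Vestlandet og er kjent for Bryggen.",
]
doc_vectors = encoder.encode(documents).astype(np.float32)
faiss.normalize_L2(doc_vectors)
index = faiss.IndexFlatIP(doc_vectors.shape[1])
index.add(doc_vectors)

# Generation side: a Norwegian base LM used as a completer.
model_id = "norallm/normistral-7b-scratch"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def answer(question: str, k: int = 1) -> str:
    """Retrieve the top-k documents and let the LM complete an answer."""
    q = encoder.encode([question]).astype(np.float32)
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    context = "\n".join(documents[i] for i in ids[0])
    prompt = f"Kontekst: {context}\nSpørsmål: {question}\nSvar:"
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(inputs.input_ids, max_new_tokens=60, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

print(answer("Hva er hovedstaden i Norge?"))
```

The same structure maps onto LangChain (item 4): the encoder becomes an Embeddings object, the FAISS index a VectorStore, and the generate call an LLM wrapper.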