---
license: apache-2.0
language: en
library_name: pytorch
tags:
- text-generation
- custom-model
- educational
- from-scratch
- llama-inspired
pipeline_tag: text-generation
---

LLAMA-3-From-Scartch (Custom ~221M Model)

Model Description

This repository contains a ~221 million parameter decoder-only Transformer language model built "from scratch". The architecture is inspired by models like Llama but significantly scaled down and adapted during development to fit limited hardware (training was demonstrated on a single GPU with ~4 GB of VRAM).

This model is primarily an educational project demonstrating the implementation of core Transformer components (RMSNorm, RoPE, Attention, FFN with SwiGLU, Weight Tying) and a basic training pipeline (including checkpointing, LR scheduling, AMP, and validation).
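To make the component list below concrete, here is a generic, self-contained sketch of the two less familiar pieces, RMSNorm and the SwiGLU feed-forward block, as they are usually formulated in Llama-style models. This is an illustrative sketch only; the actual implementations live in model_architecture.py and may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features, no mean-centering."""
    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms_inv = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms_inv

class SwiGLUFeedForward(nn.Module):
    """Llama-style FFN: down_proj(silu(gate_proj(x)) * up_proj(x))."""
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```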

Key Architectural Features:

  • Parameters: ~221 Million (Weight Tied)
  • Layers: 12 Transformer Blocks
  • Hidden Size: 768
  • Attention: 12 Heads (Multi-Head Attention, head_dim=64)
  • FFN Intermediate Size: 2048 (using SwiGLU activation)
  • Normalization: RMSNorm
  • Positional Embeddings: Rotary Positional Embeddings (RoPE)
  • Weight Tying: Input embeddings and output projection layer share weights.
  • Tokenizer: deepseek-ai/DeepSeek-R1 (Vocab Size: 128,000) - Note: requires trust_remote_code=True when loading the tokenizer
  • Context Length: Configured with max_position_embeddings=4096; the demo training run used sequence_length=256.
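
Put together, these hyperparameters correspond roughly to a configuration object like the one below. The field names are hypothetical; the authoritative definitions are in model_architecture.py.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Hypothetical field names -- see model_architecture.py for the real ones.
    vocab_size: int = 128_000            # DeepSeek-R1 tokenizer vocabulary
    hidden_size: int = 768               # model width
    num_layers: int = 12                 # Transformer blocks
    num_heads: int = 12                  # attention heads (head_dim = 768 // 12 = 64)
    intermediate_size: int = 2048        # SwiGLU FFN inner dimension
    max_position_embeddings: int = 4096  # RoPE positions supported
    norm_eps: float = 1e-5               # RMSNorm epsilon (assumed value)
    tie_word_embeddings: bool = True     # output head shares the input embedding weights
```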

Training:

  • Dataset: Primarily trained on WikiText-2 (wikitext-2-raw-v1, ~2M tokens) for demonstration purposes. The tokenized versions (wikitext2_tokens_128k.pt, wikitext2_val_tokens_128k.pt) are included in the repository.
  • Procedure: Trained on a single GPU using PyTorch with AMP (float16), AdamW optimizer, and a Cosine learning rate schedule with warmup. The provided checkpoints (step_600.pt, step_800.pt, potentially others) represent states after very limited training.
  • Performance: Due to the extremely limited training data and duration, the model exhibits basic pattern learning but lacks coherence, factual accuracy, and instruction-following capabilities. The training and validation loss decreased but remained high. See loss_plot_*.png for visualization.
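
For reference, the procedure described above corresponds roughly to a loop like the following. This is a hedged sketch, not the actual training script: the LlamaStyleModel / ModelConfig names, the checkpoint layout, the batch size, and the learning-rate values are assumptions; only the AMP / AdamW / cosine-with-warmup structure and the wikitext2_tokens_128k.pt file come from this card.

```python
import math
import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
from model_architecture import ModelConfig, LlamaStyleModel  # hypothetical names

def lr_lambda(step, warmup_steps=100, total_steps=800, min_ratio=0.1):
    # Linear warmup followed by cosine decay.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_ratio + 0.5 * (1 - min_ratio) * (1 + math.cos(math.pi * progress))

tokens = torch.load("wikitext2_tokens_128k.pt")   # pre-tokenized WikiText-2 (assumed 1-D LongTensor)
model = LlamaStyleModel(ModelConfig()).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)  # assumed values
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = GradScaler()
seq_len, batch_size = 256, 8                       # batch size is an assumption

for step in range(1, 801):
    # Sample random contiguous windows from the token stream.
    idx = torch.randint(0, tokens.numel() - seq_len - 1, (batch_size,)).tolist()
    x = torch.stack([tokens[i : i + seq_len] for i in idx]).cuda()
    y = torch.stack([tokens[i + 1 : i + seq_len + 1] for i in idx]).cuda()

    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):            # AMP in float16
        logits = model(x)                          # (batch, seq_len, vocab) assumed
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()

    if step % 200 == 0:                            # periodic checkpointing
        torch.save({"model": model.state_dict(), "step": step}, f"step_{step}.pt")
```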

Intended Use:

  • Educational purposes: Studying Transformer architecture implementation and training basics.
  • Experimentation: Serving as a base for further training or architectural modifications.
  • Not suitable for production or reliable text generation.

How to Use

Important: This model requires the custom Python code (model_architecture.py) from this repository. It cannot be loaded directly using AutoModelForCausalLM.

  1. Clone the repository:
    git clone https://huggingface.co/DrNerd/LLAMA-3-From-Scartch
    cd LLAMA-3-From-Scartch
    # Ensure LFS files are downloaded (if needed)
    # git lfs pull
    
  2. Install dependencies:
    pip install torch transformers datasets matplotlib tqdm # Add others if needed
    
  3. Run Inference (using inference.py): The inference.py script loads a checkpoint and runs generation. By default it looks for step_1200.pt (or the latest checkpoint it can find); if that file is not present, edit the script to point to step_800.pt or step_600.pt.
    # Make sure step_600.pt or step_800.pt exists in the directory
    # Edit inference.py to point to the desired checkpoint file
    python inference.py
    
    Alternatively, adapt the loading logic from inference.py into your own script.
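
If you do adapt the loading logic into your own script, the overall flow looks roughly like the sketch below. The class names imported from model_architecture.py, the checkpoint key ("model"), and the greedy decoding loop are assumptions for illustration; only the tokenizer (with trust_remote_code=True) and the checkpoint file names come from this card.

```python
import torch
from transformers import AutoTokenizer
from model_architecture import ModelConfig, LlamaStyleModel  # hypothetical names; see the repo file

# The DeepSeek-R1 tokenizer requires trust_remote_code=True (see Model Description above).
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = LlamaStyleModel(ModelConfig()).to(device).eval()

checkpoint = torch.load("step_800.pt", map_location=device)
model.load_state_dict(checkpoint["model"])          # the "model" key is an assumption

prompt = "The history of science"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Greedy decoding sketch; inference.py may use sampling instead.
with torch.no_grad():
    for _ in range(50):
        logits = model(input_ids)                   # (1, seq_len, vocab) assumed
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)

print(tokenizer.decode(input_ids[0], skip_special_tokens=True))
```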