---
license: apache-2.0
language: en
library_name: pytorch
tags:
- text-generation
- custom-model
- educational
- from-scratch
- llama-inspired
pipeline_tag: text-generation
---
# LLAMA-3-From-Scartch (Custom ~221M Model)
## Model Description
This repository contains a ~221 million parameter decoder-only Transformer language model built "from scratch". The architecture is inspired by models like Llama but significantly scaled down and adapted during development to fit limited hardware (training was demonstrated on a single GPU with ~4 GB of VRAM).
This model is primarily an educational project demonstrating the implementation of core Transformer components (RMSNorm, RoPE, Attention, FFN with SwiGLU, Weight Tying) and a basic training pipeline (including checkpointing, LR scheduling, AMP, and validation).
**Key Architectural Features:**
- Parameters: ~221 Million (Weight Tied)
- Layers: 12 Transformer Blocks
- Hidden Size: 768
- Attention: 12 Heads (Multi-Head Attention, `head_dim=64`)
- FFN Intermediate Size: 2048 (using SwiGLU activation)
- Normalization: RMSNorm
- Positional Embeddings: Rotary Positional Embeddings (RoPE)
- Weight Tying: Input embeddings and output projection layer share weights.
- Tokenizer: `deepseek-ai/DeepSeek-R1` (Vocab Size: 128,000). Note: requires `trust_remote_code=True`.
- Context Length: Trained with `max_position_embeddings=4096`; demo training used `sequence_length=256`.
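Two of the listed components, RMSNorm and the SwiGLU feed-forward block, are small enough to sketch directly. The snippet below is a minimal, generic PyTorch reference using the hidden and intermediate sizes from the list above; the class and attribute names (`RMSNorm`, `SwiGLUFFN`, `gate_proj`, etc.) are illustrative and may not match the actual code in `model_architecture.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale features by their RMS; no mean centering, no bias."""
    def __init__(self, dim: int = 768, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFFN(nn.Module):
    """Feed-forward block with SwiGLU gating: down(silu(gate(x)) * up(x))."""
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check with the listed hidden size.
x = torch.randn(1, 16, 768)
print(SwiGLUFFN()(RMSNorm()(x)).shape)  # torch.Size([1, 16, 768])
```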
**Training:**
- Dataset: Primarily trained on WikiText-2 (`wikitext-2-raw-v1`, ~2M tokens) for demonstration purposes. The tokenized versions (`wikitext2_tokens_128k.pt`, `wikitext2_val_tokens_128k.pt`) are included in the repository.
- Procedure: Trained on a single GPU using PyTorch with AMP (float16), the AdamW optimizer, and a cosine learning rate schedule with warmup. The provided checkpoints (`step_600.pt`, `step_800.pt`, potentially others) represent states after very limited training.
- Performance: Due to the extremely limited training data and duration, the model exhibits basic pattern learning but lacks coherence, factual accuracy, and instruction-following capabilities. Training and validation loss decreased but remained high; see `loss_plot_*.png` for visualization.
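As a concrete illustration of the schedule mentioned above, a minimal cosine learning-rate function with linear warmup is sketched below. The step counts and learning rates are placeholder values for the example, not the hyperparameters used to produce the released checkpoints.

```python
import math

# Placeholder schedule hyperparameters; the repository's training script may use different values.
MAX_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 1000

def cosine_lr_with_warmup(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

# Inspect the schedule at a few steps (the value would be written into optimizer.param_groups[i]["lr"]).
for s in (0, 50, 100, 600, 1000):
    print(s, round(cosine_lr_with_warmup(s), 6))
```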
**Intended Use:**
- Educational purposes: Studying Transformer architecture implementation and training basics.
- Experimentation: Serving as a base for further training or architectural modifications.
- Not suitable for production or reliable text generation.
## How to Use
**Important:** This model requires the custom Python code (`model_architecture.py`) from this repository. It cannot be loaded directly with `AutoModelForCausalLM`.
- Clone the repository:

  ```bash
  git clone https://huggingface.co/DrNerd/LLAMA-3-From-Scartch
  cd LLAMA-3-From-Scartch
  # Ensure LFS files are downloaded (if needed)
  # git lfs pull
  ```
- Install dependencies:

  ```bash
  pip install torch transformers datasets matplotlib tqdm  # add others if needed
  ```
- Run inference (using `inference.py`): The `inference.py` script loads a checkpoint (it defaults to `step_1200.pt`, falling back to the latest checkpoint if that file is not found; edit the script to point to `step_800.pt` or `step_600.pt`) and runs generation.

  ```bash
  # Make sure step_600.pt or step_800.pt exists in the directory
  # Edit inference.py to point to the desired checkpoint file
  python inference.py
  ```

  Alternatively, adapt the loading logic from `inference.py` into your own script, as sketched below.
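For reference, the sketch below shows what adapting that loading logic into a standalone script might look like. The model class name (`TransformerLM`), its constructor arguments, and the checkpoint dictionary layout are assumptions; consult `model_architecture.py` and `inference.py` for the actual names.

```python
import torch
from transformers import AutoTokenizer

# Assumed class name; check model_architecture.py for the real one.
from model_architecture import TransformerLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TransformerLM()  # constructor arguments depend on the repository's code
ckpt = torch.load("step_800.pt", map_location=device)
state_dict = ckpt.get("model_state_dict", ckpt)  # the checkpoint may wrap the weights in a dict
model.load_state_dict(state_dict)
model.to(device).eval()

# Minimal greedy decoding loop; assumes the model maps token ids to [batch, seq, vocab] logits.
prompt = "The history of"
ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    for _ in range(50):
        logits = model(ids)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0], skip_special_tokens=True))
```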