---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

Random-Llama-Small

Model Overview

Random-Llama-Small is a randomly initialized transformer-based language model with approximately 2 billion parameters, built using the LLaMA architecture. It is designed for research purposes, providing a starting point for pretraining or fine-tuning on custom datasets. The model uses the tokenizer from HuggingFaceTB/SmolLM2-1.7B-Instruct and is configured for causal language modeling. As a randomly initialized model, it produces incoherent outputs until trained, making it ideal for researchers studying transformer training dynamics or developing custom language models.


Key Details

  • Architecture: LLaMA (Causal Language Model)
  • Parameters: ~2B
  • Hidden Size: 2304
  • Layers: 22
  • Attention Heads: 36 (with 9 key-value heads for grouped-query attention)
  • Intermediate Size: 9216
  • Vocabulary Size: 128256
  • Tokenizer: Imported from HuggingFaceTB/SmolLM2-1.7B-Instruct
  • Precision: bfloat16
  • Max Context Length: 131,072 tokens (with RoPE scaling)
  • License: MIT
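
These values can be verified against the repository's configuration, for example with a short AutoConfig sketch (the repository id reflex-ai/random-llama-small is taken from the inference example below):

from transformers import AutoConfig

# Repository id taken from the inference example below.
config = AutoConfig.from_pretrained("reflex-ai/random-llama-small")

print(config.hidden_size)              # 2304
print(config.num_hidden_layers)        # 22
print(config.num_attention_heads)      # 36
print(config.num_key_value_heads)      # 9
print(config.intermediate_size)        # 9216
print(config.vocab_size)               # 128256
print(config.max_position_embeddings)  # 131072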

LLaMA Architecture

LLaMA, developed by Meta AI, is a family of efficient transformer-based models designed with research in mind. Random-Llama-Small follows this architecture and incorporates several of its key features:

Core Components

  • Decoder-Only Transformer: Predicts the next token in a sequence based on prior tokens, suitable for autoregressive tasks like text generation.
  • Grouped-Query Attention (GQA): 36 attention heads with only 9 key-value heads, improving efficiency and reducing memory/compute cost.
  • Rotary Position Embeddings (RoPE): Embeds positional information with scaling, enabling a context length of up to 131,072 tokens.
  • SwiGLU Activation: The feed-forward network uses SiLU (Swish)-gated linear units (SwiGLU) for improved expressiveness.
  • RMSNorm: Root Mean Square Layer Normalization replaces LayerNorm for stability and faster convergence.
  • Tied Embeddings: Input and output embeddings share weights (tie_word_embeddings=True), reducing parameter count by ~295M.
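
The ~2B total and the ~295M saved by tying the embeddings follow directly from the dimensions above. A rough back-of-the-envelope sketch, using only the values listed in this card:

# Dimensions from this model card.
hidden, layers, heads, kv_heads = 2304, 22, 36, 9
ffn, vocab = 9216, 128256
head_dim = hidden // heads  # 64

# Attention: Q and O projections are hidden x hidden; K and V are reduced by GQA (9 of 36 heads).
attn = 2 * hidden * hidden + 2 * hidden * (kv_heads * head_dim)
# SwiGLU feed-forward network: gate, up, and down projections.
mlp = 3 * hidden * ffn
per_layer = attn + mlp  # RMSNorm weights are negligible here

embedding = vocab * hidden  # counted once because input and output embeddings are tied
total = layers * per_layer + embedding
print(f"embedding: {embedding / 1e6:.1f}M")  # ~295.5M saved by tying
print(f"total:     {total / 1e9:.2f}B")      # ~1.99B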

Benefits of LLaMA Architecture

  • Efficiency: High throughput, low memory use.
  • Scalability: Works well across model sizes.
  • Flexibility: Long-context support and task adaptability.
  • Research-Friendly: Great for exploring attention, positional encoding, and training dynamics.

Random-Llama-Small Specifics

This model uses random weights and:

  • Has ~2B parameters across 22 layers.
  • Uses a 2304 hidden size and 9216 FFN size.
  • Uses a 128,256-token vocabulary and bfloat16 precision.
  • Supports extended context lengths of 131,072 tokens.
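
Because the published weights are random, a comparable starting point can also be created locally from a config rather than downloading the checkpoint. A minimal sketch (RoPE scaling omitted for brevity; the resulting initialization will not be identical to the published weights):

import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Config matching the headline values in this card (RoPE scaling omitted for brevity).
config = LlamaConfig(
    hidden_size=2304,
    num_hidden_layers=22,
    num_attention_heads=36,
    num_key_value_heads=9,
    intermediate_size=9216,
    vocab_size=128256,
    max_position_embeddings=131072,
    tie_word_embeddings=True,
)

# Building the model from a config gives freshly initialized (random) weights.
model = LlamaForCausalLM(config).to(torch.bfloat16)
print(f"{model.num_parameters() / 1e9:.2f}B parameters")  # ~1.99B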

Intended Use

  • Research on transformer dynamics, optimization, or architectural changes.
  • Baseline for pretraining or task-specific fine-tuning.
  • Experimentation with scaling laws or custom architectures.

Out-of-Scope Use

  • Not for direct production deployment.
  • Not suitable for tasks needing coherence or accuracy without training.

Usage

Requirements

  • transformers >= 4.45.0
  • torch >= 2.0
  • GPU with ≥ 6GB VRAM (24GB+ for training)
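
A quick environment check before loading the model, as a small sketch that assumes only the requirements above:

import torch
import transformers

# Confirm the library versions listed above.
print("transformers", transformers.__version__)  # expect >= 4.45.0
print("torch", torch.__version__)                # expect >= 2.0

# ~2B parameters in bfloat16 is roughly 4 GB of weights, hence the 6GB VRAM guideline.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; the model will run (slowly) on CPU.")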

Inference Example

# Use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="reflex-ai/random-llama-small")
print(pipe(messages))

Note: Outputs will be random and incoherent due to the model’s untrained state.
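
For lower-level control than the pipeline, the model can also be loaded directly and sampled with generate(). A minimal sketch along standard transformers lines (again, expect incoherent output from the untrained weights):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "reflex-ai/random-llama-small"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in bfloat16, the precision the weights are stored in.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

inputs = tokenizer("Who are you?", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))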


Training Example

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, LlamaForCausalLM, AutoTokenizer

# Load the randomly initialized model and its tokenizer.
model = LlamaForCausalLM.from_pretrained("reflex-ai/random-llama-small")
tokenizer = AutoTokenizer.from_pretrained("reflex-ai/random-llama-small")

# Ensure a padding token exists so the data collator can batch examples.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

training_args = TrainingArguments(
    output_dir="./random_llama_small_finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    bf16=True,  # matches the bfloat16 checkpoint; use fp16=True on GPUs without bfloat16 support
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=your_dataset,  # a tokenized dataset you provide (see the sketch below)
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
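
The your_dataset placeholder above must be a tokenized dataset. One way to build it, sketched with the datasets library, reusing the tokenizer loaded above, and a hypothetical local train.txt file (the file name is an assumption, not part of this repository):

from datasets import load_dataset

# Hypothetical local corpus; replace train.txt with your own data.
raw = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    # Truncate well below the 131,072-token maximum; full-length sequences are rarely practical for fine-tuning.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

your_dataset = raw["train"].map(tokenize, batched=True, remove_columns=["text"])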

Limitations

  • Random Initialization: Needs significant training to be useful.
  • Resource Intensive: High computational cost.
  • No Pretraining Data: Users must provide their own.
  • Tokenizer Constraint: The imported SmolLM2 tokenizer may not suit every domain or language.

Benefits and Potential

  • Customizability: A blank slate for full control of objectives and data.
  • Research Insights: Ideal for understanding early-stage LLM behavior.
  • Scalable Baseline: Balances size and research feasibility.
  • Extended Context: Useful for long-form tasks post-training.

Model Configuration

{
  "architectures": ["LlamaForCausalLM"],
  "hidden_size": 2304,
  "num_hidden_layers": 22,
  "num_attention_heads": 36,
  "num_key_value_heads": 9,
  "intermediate_size": 9216,
  "vocab_size": 128256,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "high_freq_factor": 4.0,
    "low_freq_factor": 1.0,
    "original_max_position_embeddings": 8192,
    "rope_type": "llama3"
  },
  "torch_dtype": "bfloat16",
  "tie_word_embeddings": true
}

Ethical Considerations

  • Untrained Safety: The untrained model produces only incoherent output and poses no immediate safety risk, but data selection and safety practices matter once training begins.
  • Environmental Impact: Large-scale training consumes energy; optimize and use green compute.
  • Accessibility: Resource requirements may limit use by smaller research teams.

Contact

For questions or issues, please open an issue on the Hugging Face repository.

Model card created on April 20, 2025.