---
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---
Phi-4 Unsloth Optimized Training
This space is dedicated to training Microsoft's Phi-4 model using Unsloth optimizations for enhanced performance and efficiency. The training process utilizes 4-bit quantization and advanced memory optimizations.
Installation
This Hugging Face Space automatically installs dependencies from requirements.txt. The following packages are included:
Installation Process
For clearer dependency management, the installation is split into multiple files:
Base Dependencies (requirements-base.txt):
- Core packages like torch, transformers, accelerate, etc.
- Install with:
pip install -r requirements-base.txt
Standard Dependencies (requirements.txt):
- References base requirements and adds additional packages
- Install with:
pip install -r requirements.txt
Flash Attention (requirements-flash.txt) (Optional):
- For faster attention computation
- Install with:
pip install -r requirements-flash.txt --no-build-isolation
Using this staged approach helps prevent dependency conflicts and installation issues.
Essential Dependencies
- unsloth (>=2024.3): Required for optimized 4-bit training
- peft (>=0.9.0): Required for parameter-efficient fine-tuning
- transformers (>=4.36.0): For model architecture and tokenization
- einops: Required by Unsloth for tensor manipulation
- sentencepiece: Required for tokenization
Optional Dependencies
- flash-attn: Optional for faster attention computation (not included by default as it can cause build issues)
Features
- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing
Configuration Files
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
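For reference, these configuration files are plain JSON and can be inspected directly. The snippet below is a minimal sketch; the keys actually consumed by the training script may differ.

```python
import json

# Read two of the configuration files listed above (key names are illustrative).
with open("transformers_config.json") as f:
    model_config = json.load(f)
with open("dataset_config.json") as f:
    dataset_config = json.load(f)

print(model_config.get("model_name"))
print(sorted(dataset_config.keys()))
```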
Training Process
The training utilizes the following optimizations:
- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation
Dataset
Training uses the cognitive dataset with:
- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching
Hardware Requirements
- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Phase 1: Domain Adaptation (Unsupervised)
This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: George-API/phi-4-research-assistant.
Overview
Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
Files
Core Training Files
- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages
Analysis & Utilities
- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)
Setup
Environment Setup:

```bash
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
```

Environment Variables: Create a `.env` file with:

```
HUGGINGFACE_TOKEN=your_token_here
```

Verify Setup:

```bash
python check_tokenization.py  # Ensures tokenizer works
```
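If the token needs to be picked up from `.env` at runtime, it can be loaded along these lines. This is a minimal sketch that assumes the `python-dotenv` package; the actual scripts may read the environment differently.

```python
import os

from dotenv import load_dotenv      # assumption: python-dotenv is installed
from huggingface_hub import login

load_dotenv()                                  # reads .env from the working directory
login(token=os.environ["HUGGINGFACE_TOKEN"])   # authenticates pushes to the Hub
```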
How It Works
- Data Loading: Loads pre-tokenized data from the Hugging Face dataset
- Sequential Processing: Processes data in order, maintaining the integrity of research papers
- Efficient Training: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
- Checkpointing: Saves regular checkpoints and pushes to Hub
- Monitoring: Logs detailed metrics and statistics during training
- Model Publishing: Pushes the trained model to Hugging Face Hub
Key Features
Memory-Efficient Training
The training setup is optimized for A10G GPUs:
- Uses pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed precision training
- Optimized batch sizes for maximum throughput
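A minimal sketch of loading the pre-quantized model with Unsloth and enabling gradient checkpointing is shown below. The LoRA rank and target modules here are illustrative assumptions; the values actually used by `run_transformers_training.py` come from its configuration files.

```python
from unsloth import FastLanguageModel

# Load the pre-quantized 4-bit Phi-4 model (no additional quantization step).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # let Unsloth choose bfloat16 on supported GPUs
    load_in_4bit=True,
)

# Attach LoRA adapters and enable Unsloth's gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # illustrative LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```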
Sequential Processing
The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
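This ordering can be reproduced with standard `datasets` and PyTorch primitives, as in the sketch below (the `id` column name and toy values are assumptions):

```python
from datasets import Dataset
from torch.utils.data import DataLoader, SequentialSampler

# Toy stand-in for the pre-tokenized dataset.
dataset = Dataset.from_dict({
    "id": [2, 1, 3],
    "input_ids": [[101, 7], [101, 5], [101, 9]],
})

# Restore paper/chunk order, then iterate without shuffling.
dataset = dataset.sort("id").with_format("torch")
loader = DataLoader(
    dataset,
    batch_size=2,
    sampler=SequentialSampler(dataset),
)

for batch in loader:
    print(batch["id"])
```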
Data Collator
The `SimpleDataCollator` class:
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
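An illustrative stand-in for such a collator is sketched below; the real `SimpleDataCollator` lives in `run_transformers_training.py` and additionally logs statistics and handles malformed entries.

```python
import torch

class SimpleDataCollatorSketch:
    """Keeps pre-tokenized input_ids unchanged, pads to the longest sequence
    in the batch, and mirrors inputs into labels for causal LM training."""

    def __init__(self, pad_token_id=0):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = list(f["input_ids"])
            pad = max_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append([1] * len(ids) + [0] * pad)
            labels.append(ids + [-100] * pad)  # -100 positions are ignored by the loss
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```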
Checkpointing
The training process saves checkpoints:
- Every 200 steps
- Pushes to Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
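In `transformers` terms, this behaviour maps onto standard `TrainingArguments` options, roughly as follows (a sketch; the values in `transformers_config.json` are authoritative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-4-research-assistant",
    save_strategy="steps",
    save_steps=200,                 # checkpoint every 200 steps
    save_total_limit=5,             # keep up to 5 recent checkpoints
    push_to_hub=True,               # push on every save
    hub_model_id="George-API/phi-4-research-assistant",
    hub_strategy="every_save",      # assumption: the exact strategy may differ
)

# After an interruption, training resumes from the latest checkpoint with:
# trainer.train(resume_from_checkpoint=True)
```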
Hardware Requirements
This training setup is optimized for:
- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB System RAM
- CUDA 11.8 or higher
Memory breakdown per GPU:
- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB
Configuration
Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5
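For illustration, the listed values can be mirrored in Python to check the effective batch size arithmetic (key names are assumed to match `transformers_config.json`):

```python
# Values listed above; key names are assumed to match transformers_config.json.
config = {
    "model_name": "unsloth/phi-4-unsloth-bnb-4bit",
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "max_seq_length": 2048,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "neftune_noise_alpha": 5,
}

# Effective batch size = per-device batch * accumulation steps * number of GPUs.
num_gpus = 2
effective_batch_size = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * num_gpus
)
assert effective_batch_size == 128
```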
The configuration is optimized for:
- Maximum memory efficiency with pre-quantized model
- Stable training with cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates
Running Domain Adaptation
To start domain adaptation:
python run_transformers_training.py
The script will:
- Load the pre-quantized model and dataset
- Apply optimized training parameters
- Process the data sequentially
- Train the model for 3 epochs
- Save and push checkpoints to Hub regularly
Using the Model
After training, you can use the domain-adapted model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Chat Format Example
Phi-4 works best with its native chat template:
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])  # the assistant's reply
```
Expected Outcomes
After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2
Next Steps
After completing domain adaptation:
- Evaluate the model's performance on cognitive science texts
- Proceed to Phase 2 (Supervised Fine-Tuning)
- Use TensorBoard to analyze training metrics