metadata
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit

Phi-4 Unsloth Optimized Training

This Space is dedicated to training Microsoft's Phi-4 model with Unsloth optimizations for improved performance and efficiency. Training uses 4-bit quantization and advanced memory optimizations.

Installation

This Hugging Face Space automatically installs dependencies from requirements.txt. The key packages are listed under Essential Dependencies below.

Installation Process

For clearer dependency management, the installation is split into multiple files:

  1. Base Dependencies (requirements-base.txt):

    • Core packages like torch, transformers, accelerate, etc.
    • Install with: pip install -r requirements-base.txt
  2. Standard Dependencies (requirements.txt):

    • References base requirements and adds additional packages
    • Install with: pip install -r requirements.txt
  3. Flash Attention (requirements-flash.txt) (Optional):

    • For faster attention computation
    • Install with: pip install -r requirements-flash.txt --no-build-isolation

Using this staged approach helps prevent dependency conflicts and installation issues.
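
For reference, requirements.txt layers on top of the base file using pip's -r include syntax. A minimal sketch of that layering is shown below; the package split is illustrative, and the authoritative lists live in the repository files.

# Sketch of requirements.txt layering (illustrative; see the actual files in the repo)
-r requirements-base.txt
unsloth>=2024.3
peft>=0.9.0
einops
sentencepiece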

Essential Dependencies

  • unsloth (>=2024.3): Required for optimized 4-bit training
  • peft (>=0.9.0): Required for parameter-efficient fine-tuning
  • transformers (>=4.36.0): For model architecture and tokenization
  • einops: Required by Unsloth for tensor manipulation
  • sentencepiece: Required for tokenization

Optional Dependencies

  • flash-attn: Optional for faster attention computation (not included by default as it can cause build issues)

Features

  • 4-bit quantization using Unsloth
  • Optimized training pipeline
  • Cognitive dataset integration
  • Advanced memory management
  • Gradient checkpointing
  • Sequential data processing

Configuration Files

  • transformers_config.json: Model and training parameters
  • hardware_config.json: Hardware-specific optimizations
  • dataset_config.json: Dataset processing settings
  • requirements.txt: Required dependencies

Training Process

The training utilizes the following optimizations:

  • Unsloth's 4-bit quantization
  • Custom chat templates for Phi-4
  • Paper-order preservation
  • Efficient memory usage
  • Gradient accumulation

Dataset

Training uses the cognitive dataset with:

  • Maintained paper order
  • Proper metadata handling
  • Optimized sequence length
  • Efficient batching

Hardware Requirements

  • GPU: A10G or better
  • VRAM: 24GB minimum
  • RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: George-API/phi-4-research-assistant.

Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

Files

Core Training Files

  • run_transformers_training.py: Main script for domain adaptation
  • transformers_config.json: Model and training parameters
  • hardware_config.json: Hardware-specific optimizations
  • dataset_config.json: Dataset loading and processing settings
  • requirements.txt: Required Python packages

Analysis & Utilities

  • check_tokenization.py: Script to analyze token distributions (see the sketch after this list)
  • update_space.py: Hugging Face Space update utility
  • .env: Environment variables (API tokens, etc.)
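
As an illustration, a token-distribution check over the pre-tokenized dataset can be as simple as the sketch below. The dataset ID and the "input_ids" column name are placeholders; the actual check_tokenization.py may differ.

from datasets import load_dataset
import numpy as np

# Placeholder dataset ID; substitute the actual pre-tokenized cognitive dataset.
dataset = load_dataset("your-org/your-pretokenized-dataset", split="train")

# Pre-tokenized entries are assumed to carry an "input_ids" column.
lengths = np.array([len(ids) for ids in dataset["input_ids"]])
print(f"chunks: {len(lengths)}")
print(f"mean tokens per chunk: {lengths.mean():.0f}")
print(f"max tokens per chunk: {lengths.max()}")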

Setup

  1. Environment Setup:

    python -m venv venv
    source venv/bin/activate  # or `venv\Scripts\activate` on Windows
    pip install -r requirements.txt
    
  2. Environment Variables: Create .env file with:

    HUGGINGFACE_TOKEN=your_token_here
    
  3. Verify Setup:

    python check_tokenization.py  # Ensures tokenizer works
    

How It Works

  1. Data Loading: Loads pre-tokenized data from the Hugging Face dataset
  2. Sequential Processing: Processes data in order, maintaining the integrity of research papers
  3. Efficient Training: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
  4. Checkpointing: Saves regular checkpoints and pushes to Hub
  5. Monitoring: Logs detailed metrics and statistics during training
  6. Model Publishing: Pushes the trained model to Hugging Face Hub

Key Features

Memory-Efficient Training

The training setup is optimized for A10G GPUs:

  • Uses pre-quantized 4-bit model (no additional quantization needed)
  • Gradient checkpointing for memory efficiency
  • Flash attention for faster training
  • bfloat16 mixed precision training
  • Optimized batch sizes for maximum throughput
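
A minimal sketch of loading the pre-quantized model with these optimizations via Unsloth's FastLanguageModel is shown below; the LoRA rank and target modules are illustrative placeholders, and the project's actual values come from its configuration files.

from unsloth import FastLanguageModel

# Load the pre-quantized 4-bit Phi-4 model; no extra quantization step is needed.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,  # matches max_seq_length in transformers_config.json
    load_in_4bit=True,
    dtype=None,           # lets Unsloth select bfloat16 on supported GPUs
)

# Attach a PEFT adapter and enable Unsloth's memory-efficient gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # illustrative LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)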

Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:

  • Sorting the dataset by ID
  • Using a SequentialSampler to maintain order
  • Processing chunks sequentially (average 1,673 tokens per chunk)
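
A rough illustration of how that ordering can be enforced is given below; the "id" sort key and batch size are assumptions, and the actual logic lives in run_transformers_training.py.

from torch.utils.data import DataLoader, SequentialSampler

def build_sequential_loader(dataset, data_collator, batch_size=16):
    # Sort so that chunks belonging to the same research paper stay adjacent.
    dataset = dataset.sort("id")
    # SequentialSampler yields indices in order, so examples are never shuffled.
    return DataLoader(
        dataset,
        batch_size=batch_size,
        sampler=SequentialSampler(dataset),
        collate_fn=data_collator,
    )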

Data Collator

The SimpleDataCollator class:

  • Preserves pre-tokenized data format
  • Processes each entry independently
  • Provides detailed logging of processing statistics
  • Handles errors gracefully
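
A simplified sketch of what such a collator can look like is shown below; the real SimpleDataCollator in run_transformers_training.py additionally handles logging and error cases.

import torch

class SimpleDataCollator:
    """Batch pre-tokenized examples, padding to the longest sequence."""

    def __init__(self, pad_token_id):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        # Each feature is assumed to already carry a pre-tokenized "input_ids" list.
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask = [], []
        for f in features:
            ids = list(f["input_ids"])
            pad_len = max_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad_len)
            attention_mask.append([1] * len(ids) + [0] * pad_len)
        batch = {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
        }
        # For causal LM training, labels mirror input_ids; padded positions are ignored via -100.
        labels = batch["input_ids"].clone()
        labels[batch["attention_mask"] == 0] = -100
        batch["labels"] = labels
        return batch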

Checkpointing

The training process handles checkpointing as follows:

  • Saves a checkpoint every 200 steps
  • Pushes each checkpoint to the Hub on save
  • Keeps up to 5 recent checkpoints
  • Automatically resumes from the latest checkpoint if interrupted
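
In Hugging Face TrainingArguments terms this roughly corresponds to the sketch below; the output directory is a placeholder, and the actual values are read from transformers_config.json.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",  # illustrative path
    save_strategy="steps",
    save_steps=200,            # checkpoint every 200 steps
    save_total_limit=5,        # keep up to 5 recent checkpoints
    push_to_hub=True,          # push each saved checkpoint to the Hub
)

# Passing resume_from_checkpoint=True to trainer.train() resumes from the
# latest checkpoint automatically if training was interrupted.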

Hardware Requirements

This training setup is optimized for:

  • 2x NVIDIA A10G GPUs (24GB VRAM each)
  • 92GB System RAM
  • CUDA 11.8 or higher

Memory breakdown per GPU:

  • Model (4-bit): ~3.5GB
  • Optimizer states: ~1GB
  • Batch memory: ~2GB
  • Peak usage: 18-20GB
  • Safe headroom: 4-6GB

Configuration

Key parameters in transformers_config.json:

  • model_name: unsloth/phi-4-unsloth-bnb-4bit
  • learning_rate: 2e-5
  • num_train_epochs: 3
  • per_device_train_batch_size: 16
  • gradient_accumulation_steps: 4
  • effective_batch_size: 128 (16 * 4 * 2 GPUs)
  • max_seq_length: 2048
  • lr_scheduler_type: "cosine"
  • warmup_ratio: 0.03
  • neftune_noise_alpha: 5
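
Assembled as JSON, these values would look roughly like the sketch below; the exact layout of transformers_config.json may differ, and effective_batch_size is derived from the other values rather than set directly.

{
  "model_name": "unsloth/phi-4-unsloth-bnb-4bit",
  "learning_rate": 2e-5,
  "num_train_epochs": 3,
  "per_device_train_batch_size": 16,
  "gradient_accumulation_steps": 4,
  "max_seq_length": 2048,
  "lr_scheduler_type": "cosine",
  "warmup_ratio": 0.03,
  "neftune_noise_alpha": 5
}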

The configuration is optimized for:

  • Maximum memory efficiency with pre-quantized model
  • Stable training with cosine learning rate schedule
  • Effective gradient updates with accumulation
  • Regular checkpointing and Hub updates

Running Domain Adaptation

To start domain adaptation:

python run_transformers_training.py

The script will:

  1. Load the pre-quantized model and dataset
  2. Apply optimized training parameters
  3. Process the data sequentially
  4. Train the model for 3 epochs
  5. Save and push checkpoints to Hub regularly

Using the Model

After training, you can use the domain-adapted model:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)  # keep inputs on the model's device
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Chat Format Example

Phi-4 works best with its native chat template:

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = generator(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])  # print only the assistant's reply

Expected Outcomes

After domain adaptation, the model should:

  • Have a better understanding of cognitive science terminology
  • Show improved performance on domain-specific tasks
  • Be ready for supervised fine-tuning in Phase 2

Next Steps

After completing domain adaptation:

  1. Evaluate the model's performance on cognitive science texts
  2. Proceed to Phase 2 (Supervised Fine-Tuning)
  3. Use TensorBoard to analyze training metrics
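
For example, if training metrics are logged to a local directory (the path below is an assumption; use the run's actual logging directory), TensorBoard can be pointed at it directly:

tensorboard --logdir ./logs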