---
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---
Phi-4 Unsloth Optimized Training
This space is dedicated to training Microsoft's Phi-4 model using Unsloth optimizations for enhanced performance and efficiency. The training process utilizes 4-bit quantization and advanced memory optimizations.
Installation
This Hugging Face Space automatically installs dependencies from requirements.txt. The following packages are included:
Installation Process
For clearer dependency management, the installation is split into multiple files:
Base Dependencies (requirements-base.txt):
- Core packages like torch, transformers, accelerate, etc.
- Install with:
pip install -r requirements-base.txt
Standard Dependencies (requirements.txt):
- References base requirements and adds additional packages
- Install with:
pip install -r requirements.txt
Flash Attention (requirements-flash.txt) (Optional):
- For faster attention computation
- Install with:
pip install -r requirements-flash.txt --no-build-isolation
Using this staged approach helps prevent dependency conflicts and installation issues.
Essential Dependencies
- unsloth (>=2024.3): Required for optimized 4-bit training
- peft (>=0.9.0): Required for parameter-efficient fine-tuning
- transformers (>=4.36.0): For model architecture and tokenization
- einops: Required by Unsloth for tensor manipulation
- sentencepiece: Required for tokenization
Optional Dependencies
- flash-attn: Optional for faster attention computation (not included by default as it can cause build issues)
Features
- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing
Configuration Files
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
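For reference, these configuration files are plain JSON and can be inspected directly. The snippet below is a minimal sketch; the keys actually consumed by the training script may differ.

```python
import json

# Read two of the configuration files listed above (key names are illustrative).
with open("transformers_config.json") as f:
    model_config = json.load(f)
with open("dataset_config.json") as f:
    dataset_config = json.load(f)

print(model_config.get("model_name"))
print(sorted(dataset_config.keys()))
```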
Training Process
The training utilizes the following optimizations:
- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation
Dataset
Training uses the cognitive dataset with:
- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching
Hardware Requirements
- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
Phase 1: Domain Adaptation (Unsupervised)
This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: George-API/phi-4-research-assistant.
Overview
Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
Files
Core Training Files
- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages
Analysis & Utilities
- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)
Setup
Environment Setup:

```bash
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
```

Environment Variables: Create a `.env` file with:

```
HUGGINGFACE_TOKEN=your_token_here
```

Verify Setup:

```bash
python check_tokenization.py  # Ensures tokenizer works
```
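If the token needs to be picked up from `.env` at runtime, it can be loaded along these lines. This is a minimal sketch that assumes the `python-dotenv` package; the actual scripts may read the environment differently.

```python
import os

from dotenv import load_dotenv      # assumption: python-dotenv is installed
from huggingface_hub import login

load_dotenv()                                  # reads .env from the working directory
login(token=os.environ["HUGGINGFACE_TOKEN"])   # authenticates pushes to the Hub
```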
How It Works
- Data Loading: Loads pre-tokenized data from the Hugging Face dataset
- Sequential Processing: Processes data in order, maintaining the integrity of research papers
- Efficient Training: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
- Checkpointing: Saves regular checkpoints and pushes to Hub
- Monitoring: Logs detailed metrics and statistics during training
- Model Publishing: Pushes the trained model to Hugging Face Hub
Key Features
Memory-Efficient Training
The training setup is optimized for A10G GPUs:
- Uses pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed precision training
- Optimized batch sizes for maximum throughput
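A minimal sketch of loading the pre-quantized model with Unsloth and enabling gradient checkpointing is shown below. The LoRA rank and target modules here are illustrative assumptions; the values actually used by `run_transformers_training.py` come from its configuration files.

```python
from unsloth import FastLanguageModel

# Load the pre-quantized 4-bit Phi-4 model (no additional quantization step).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # let Unsloth choose bfloat16 on supported GPUs
    load_in_4bit=True,
)

# Attach LoRA adapters and enable Unsloth's gradient checkpointing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # illustrative LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
```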
Sequential Processing
The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
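This ordering can be reproduced with standard `datasets` and PyTorch primitives, as in the sketch below (the `id` column name and toy values are assumptions):

```python
from datasets import Dataset
from torch.utils.data import DataLoader, SequentialSampler

# Toy stand-in for the pre-tokenized dataset.
dataset = Dataset.from_dict({
    "id": [2, 1, 3],
    "input_ids": [[101, 7], [101, 5], [101, 9]],
})

# Restore paper/chunk order, then iterate without shuffling.
dataset = dataset.sort("id").with_format("torch")
loader = DataLoader(
    dataset,
    batch_size=2,
    sampler=SequentialSampler(dataset),
)

for batch in loader:
    print(batch["id"])
```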
Data Collator
The `SimpleDataCollator` class:
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
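An illustrative stand-in for such a collator is sketched below; the real `SimpleDataCollator` lives in `run_transformers_training.py` and additionally logs statistics and handles malformed entries.

```python
import torch

class SimpleDataCollatorSketch:
    """Keeps pre-tokenized input_ids unchanged, pads to the longest sequence
    in the batch, and mirrors inputs into labels for causal LM training."""

    def __init__(self, pad_token_id=0):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = list(f["input_ids"])
            pad = max_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append([1] * len(ids) + [0] * pad)
            labels.append(ids + [-100] * pad)  # -100 positions are ignored by the loss
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```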
Checkpointing
The training process saves checkpoints:
- Every 200 steps
- Pushes to Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
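In `transformers` terms, this behaviour maps onto standard `TrainingArguments` options, roughly as follows (a sketch; the values in `transformers_config.json` are authoritative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi-4-research-assistant",
    save_strategy="steps",
    save_steps=200,                 # checkpoint every 200 steps
    save_total_limit=5,             # keep up to 5 recent checkpoints
    push_to_hub=True,               # push on every save
    hub_model_id="George-API/phi-4-research-assistant",
    hub_strategy="every_save",      # assumption: the exact strategy may differ
)

# After an interruption, training resumes from the latest checkpoint with:
# trainer.train(resume_from_checkpoint=True)
```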
Hardware Requirements
This training setup is optimized for:
- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB System RAM
- CUDA 11.8 or higher
Memory breakdown per GPU:
- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB
Configuration
Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5
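For illustration, the listed values can be mirrored in Python to check the effective batch size arithmetic (key names are assumed to match `transformers_config.json`):

```python
# Values listed above; key names are assumed to match transformers_config.json.
config = {
    "model_name": "unsloth/phi-4-unsloth-bnb-4bit",
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "max_seq_length": 2048,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "neftune_noise_alpha": 5,
}

# Effective batch size = per-device batch * accumulation steps * number of GPUs.
num_gpus = 2
effective_batch_size = (
    config["per_device_train_batch_size"]
    * config["gradient_accumulation_steps"]
    * num_gpus
)
assert effective_batch_size == 128
```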
The configuration is optimized for:
- Maximum memory efficiency with pre-quantized model
- Stable training with cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates
Running Domain Adaptation
To start domain adaptation:
python run_transformers_training.py
The script will:
- Load the pre-quantized model and dataset
- Apply optimized training parameters
- Process the data sequentially
- Train the model for 3 epochs
- Save and push checkpoints to Hub regularly
Using the Model
After training, you can use the domain-adapted model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Chat Format Example
Phi-4 works best with its native chat template:
```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])  # the assistant's reply
```
Expected Outcomes
After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2
Next Steps
After completing domain adaptation:
- Evaluate the model's performance on cognitive science texts
- Proceed to Phase 2 (Supervised Fine-Tuning)
- Use TensorBoard to analyze training metrics