---
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---

# Phi-4 Unsloth Optimized Training

This Space is dedicated to training Microsoft's Phi-4 model using Unsloth optimizations for enhanced performance and efficiency. The training process uses 4-bit quantization and advanced memory optimizations.

## Installation

This Hugging Face Space installs its dependencies automatically from requirements.txt. The required packages are described below.

### Installation Process

For clearer dependency management, the installation is split into multiple files:

1. **Base Dependencies (requirements-base.txt)**:
   - Core packages like torch, transformers, accelerate, etc.
   - Install with: `pip install -r requirements-base.txt`

2. **Standard Dependencies (requirements.txt)**:
   - References the base requirements and adds additional packages
   - Install with: `pip install -r requirements.txt`

3. **Flash Attention (requirements-flash.txt)** (optional):
   - For faster attention computation
   - Install with: `pip install -r requirements-flash.txt --no-build-isolation`

Using this staged approach helps prevent dependency conflicts and installation issues.
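
As an illustration of the staged layout, requirements.txt can pull in the base file with pip's `-r` include syntax. The package names below are examples only, not the Space's exact pin list:

```
# requirements.txt (illustrative)
-r requirements-base.txt
einops
sentencepiece
```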

### Essential Dependencies

- **unsloth** (>=2024.3): Required for optimized 4-bit training
- **peft** (>=0.9.0): Required for parameter-efficient fine-tuning
- **transformers** (>=4.36.0): For model architecture and tokenization
- **einops**: Required by Unsloth for tensor manipulation
- **sentencepiece**: Required for tokenization

### Optional Dependencies

- **flash-attn**: Faster attention computation (not included by default because it can cause build issues)

## Features

- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies

## Training Process

The training uses the following optimizations:

- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4 (see the sketch below)
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation
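
As a minimal sketch of what a Phi-4 chat template produces, the tokenizer's built-in template can be applied to a message list. The model ID and messages here are illustrative; the training script's own template handling may differ:

```python
from transformers import AutoTokenizer

# Load the tokenizer that ships with the pre-quantized Phi-4 model
tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4-unsloth-bnb-4bit")

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Summarize the key idea of this paper chunk."},
]

# Render the messages into Phi-4's chat format without tokenizing
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
```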

## Dataset

Training uses the cognitive dataset with:

- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching

## Hardware Requirements

- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

### Core Training Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages

### Analysis & Utilities

- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)

## Setup

1. **Environment Setup**:
   ```bash
   python -m venv venv
   source venv/bin/activate  # or `venv\Scripts\activate` on Windows
   pip install -r requirements.txt
   ```

2. **Environment Variables**:
   Create a `.env` file with (see the loading sketch after this list):
   ```
   HUGGINGFACE_TOKEN=your_token_here
   ```

3. **Verify Setup**:
   ```bash
   python check_tokenization.py  # Ensures the tokenizer works
   ```
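
For step 2, one common way to pick up the token at runtime is python-dotenv. This is a minimal sketch assuming python-dotenv is installed; the training script may read the variable differently:

```python
import os

from dotenv import load_dotenv        # assumes python-dotenv is available
from huggingface_hub import login

load_dotenv()                          # reads HUGGINGFACE_TOKEN from the local .env file
login(token=os.environ["HUGGINGFACE_TOKEN"])  # authenticate for Hub pushes
```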

## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses the pre-quantized Unsloth 4-bit model for memory-efficient, faster training
4. **Checkpointing**: Saves regular checkpoints and pushes them to the Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to the Hugging Face Hub

## Key Features

### Memory-Efficient Training

The training setup is optimized for A10G GPUs:

- Uses the pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed-precision training
- Optimized batch sizes for maximum throughput
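
A minimal sketch of loading the pre-quantized model with Unsloth, assuming the standard `FastLanguageModel` API; the exact arguments used by `run_transformers_training.py` may differ:

```python
from unsloth import FastLanguageModel

# Load the already-quantized 4-bit Phi-4 checkpoint
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,   # matches max_seq_length in transformers_config.json
    dtype=None,            # let Unsloth select bfloat16 on supported GPUs
    load_in_4bit=True,     # checkpoint is pre-quantized, so no extra quantization happens
)
```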

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:

- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
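
A minimal sketch of this order-preserving setup; the dataset ID and the `"id"` column name are placeholders for illustration, not the project's actual values:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader, SequentialSampler

# Placeholder dataset ID; replace with the project's pre-tokenized dataset
dataset = load_dataset("your-org/cognitive-dataset", split="train")

# Sort by ID so chunks from the same paper stay adjacent
dataset = dataset.sort("id")

loader = DataLoader(
    dataset,
    batch_size=16,
    sampler=SequentialSampler(dataset),  # iterate in index order: paper order preserved
    collate_fn=list,                     # real training uses the SimpleDataCollator below
)
```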

### Data Collator

The `SimpleDataCollator` class:

- Preserves the pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
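
A hedged sketch of what such a collator can look like: it keeps the pre-tokenized `input_ids` as-is and pads each batch to its longest entry. Field names and padding behavior are assumptions; the project's `SimpleDataCollator` may differ:

```python
import torch

class SimpleDataCollator:
    """Pads pre-tokenized examples to a common length and builds labels."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = list(f["input_ids"])
            pad_len = max_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad_len)
            attention_mask.append([1] * len(ids) + [0] * pad_len)
            labels.append(ids + [-100] * pad_len)  # ignore padding in the loss
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```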

### Checkpointing

The training process saves checkpoints:

- Every 200 steps
- Pushes to the Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
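
The checkpointing behavior above maps onto standard `transformers.TrainingArguments` roughly as follows; the output directory name is illustrative and other arguments are omitted:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="phi4-domain-adaptation",  # illustrative path
    save_strategy="steps",
    save_steps=200,             # checkpoint every 200 steps
    save_total_limit=5,         # keep up to 5 recent checkpoints
    push_to_hub=True,
    hub_strategy="every_save",  # push to the Hub on every save
)
```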

## Hardware Requirements

This training setup is optimized for:

- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB system RAM
- CUDA 11.8 or higher

Memory breakdown per GPU:

- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB

## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5

The configuration is optimized for:

- Maximum memory efficiency with the pre-quantized model
- Stable training with a cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates
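
For orientation, the key parameters above would appear in `transformers_config.json` roughly as sketched below; the surrounding structure of the file is an assumption, and only the values are taken from this README:

```python
import json

# Illustrative contents; the real file may nest these fields differently
config = {
    "model_name": "unsloth/phi-4-unsloth-bnb-4bit",
    "learning_rate": 2e-5,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 4,   # effective batch = 16 * 4 * 2 GPUs = 128
    "max_seq_length": 2048,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "neftune_noise_alpha": 5,
}
print(json.dumps(config, indent=2))
```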

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:

1. Load the pre-quantized model and dataset
2. Apply the optimized training parameters
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to the Hub regularly

## Using the Model

After training, you can use the domain-adapted model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Chat Format Example

Phi-4 works best with its native chat template:

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

## Expected Outcomes

After domain adaptation, the model should:

- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:

1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics