---
title: R1-Distill-LLama-8b Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---
# DeepSeek R1-Distill-Llama-8B Training
This space is dedicated to training the DeepSeek R1-Distill-LLama-8b model for cognitive science research. The training process utilizes advanced optimizations and efficient data processing techniques.
## Features
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing
## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
## Training Process
The training utilizes:
- Custom data processing pipeline
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation
## Dataset
Training uses the cognitive dataset (loaded as sketched after this list) with:
- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching
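
A minimal sketch of how the dataset might be loaded while preserving paper order; the `train` split and the `id` column name are assumptions, not taken from the configuration files:

```python
from datasets import load_dataset

# Load the pre-tokenized cognitive dataset from the Hugging Face Hub.
dataset = load_dataset("George-API/cognitive-data", split="train")

# Sort by ID so that chunks belonging to the same paper stay adjacent
# (the "id" column name is an assumption).
dataset = dataset.sort("id")
```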
## Hardware Requirements
- GPU: L4 or better
- VRAM: 24GB minimum
- RAM: 32GB recommended
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Phase 1: Domain Adaptation (Unsupervised)
This directory contains the code and configuration for domain adaptation of the DeepSeek-R1-Distill-Llama-8B model to the cognitive science domain. This phase produces our domain-adapted model: George-API/DeepSeek-Cognitive-Science.
## Overview
Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
Files
run_transformers_training.py
: Main script for domain adaptationtransformers_config.json
: Configuration parameters for training
## How It Works
- Data Loading: Loads pre-tokenized data from the Hugging Face dataset
- Sequential Processing: Processes data in order, maintaining the integrity of research papers
- Efficient Training: Uses 4-bit quantization and LoRA for memory-efficient training (see the sketch after this list)
- Checkpointing: Saves regular checkpoints to resume training if interrupted
- Monitoring: Logs detailed metrics and statistics during training
- Model Publishing: Pushes the trained model to Hugging Face Hub as George-API/DeepSeek-Cognitive-Science
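
A minimal sketch of the 4-bit quantization and LoRA setup, assuming standard `bitsandbytes` and `peft` usage; the LoRA hyperparameters (rank, alpha, target modules) shown here are assumptions and may differ from the values in `transformers_config.json`:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization keeps the 8B base model within a 24GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters: only a small set of low-rank matrices is trained.
lora_config = LoraConfig(
    r=16,                # rank (assumption)
    lora_alpha=32,       # scaling factor (assumption)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```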
## Key Features
### Sequential Processing
The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Overriding the default DataLoader to disable shuffling (sketched below)
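
A minimal sketch of the order-preserving data loading, assuming a Hugging Face `Trainer` subclass; the class name `SequentialTrainer` is illustrative, not the actual script's name:

```python
from torch.utils.data import DataLoader, SequentialSampler
from transformers import Trainer

class SequentialTrainer(Trainer):
    """Trainer that feeds examples in dataset order instead of shuffling."""

    def get_train_dataloader(self) -> DataLoader:
        # The default Trainer uses a random sampler; a SequentialSampler keeps
        # chunks from the same paper together (the dataset is pre-sorted by ID).
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=SequentialSampler(self.train_dataset),
            collate_fn=self.data_collator,
        )
```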
### Data Collator
The `SimpleDataCollator` class (sketched below):
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
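
A minimal sketch of what a collator with these properties could look like; the `input_ids` field name, padding behaviour, and logging counter are assumptions based on the description above, not the actual implementation:

```python
import torch

class SimpleDataCollator:
    """Batches pre-tokenized entries without re-tokenizing them."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id
        self.processed = 0  # running count for logging statistics

    def __call__(self, features):
        batch_ids = []
        for feature in features:
            try:
                batch_ids.append(list(feature["input_ids"]))
                self.processed += 1
            except KeyError:
                # Skip malformed entries instead of aborting the run.
                continue
        max_len = max(len(ids) for ids in batch_ids)
        input_ids = [ids + [self.pad_token_id] * (max_len - len(ids)) for ids in batch_ids]
        attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch_ids]
        labels = [ids + [-100] * (max_len - len(ids)) for ids in batch_ids]  # ignore padding in the loss
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```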
### Checkpointing
The training process saves checkpoints (see the sketch after this list):
- Every 100 steps (configurable)
- Automatically resumes from the latest checkpoint if interrupted
- Maintains up to 3 recent checkpoints
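
A minimal sketch of the corresponding checkpoint settings, assuming standard Hugging Face `TrainingArguments`; the output directory is a placeholder:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",    # placeholder path
    save_strategy="steps",
    save_steps=100,              # save every 100 steps (configurable)
    save_total_limit=3,          # keep only the 3 most recent checkpoints
    num_train_epochs=5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-5,
)

# Resuming picks up the latest checkpoint in output_dir if one exists:
# trainer.train(resume_from_checkpoint=True)
```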
## Configuration

Key parameters in `transformers_config.json` (a file sketch follows the list):

- `model_name`: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- `dataset_name`: George-API/cognitive-data
- `learning_rate`: 3e-5
- `num_train_epochs`: 5
- `per_device_train_batch_size`: 4
- `gradient_accumulation_steps`: 8
- `max_seq_length`: 2048
- `push_to_hub`: true
- `hub_model_id`: "DeepSeek-Cognitive-Science"
## Running Domain Adaptation
To start domain adaptation:
```bash
python run_transformers_training.py
```
The script will:
- Load the dataset and model
- Configure LoRA adapters
- Process the data sequentially
- Train the model for the specified number of epochs
- Save the resulting model and push it to Hugging Face Hub as George-API/DeepSeek-Cognitive-Science
## Using the Model
After training, you can use the domain-adapted model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/DeepSeek-Cognitive-Science"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Expected Outcomes
After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on cognitive science tasks
- Be ready for supervised fine-tuning in Phase 2
## Next Steps
After completing domain adaptation:
- Evaluate the model's performance on cognitive science texts
- Proceed to Phase 2 (Supervised Fine-Tuning) using the George-API/DeepSeek-Cognitive-Science model
- Use TensorBoard to analyze training metrics