---
title: R1-Distill-Llama-8b Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.17.0"
app_file: app.py
pinned: false
license: mit
---
# DeepSeek R1-Distill-Llama-8b Training

This space is dedicated to training the DeepSeek-R1-Distill-Llama-8B model for cognitive science research. The training pipeline uses memory-efficient optimizations such as gradient checkpointing together with sequential data processing.
## Features

- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
## Training Process

The training utilizes:

- Custom data processing pipeline
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation

## Dataset

Training uses the cognitive dataset with:

- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching
## Hardware Requirements

- GPU: L4 or better
- VRAM: 24GB minimum
- RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the DeepSeek-R1-Distill-Llama-8B model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Configuration parameters for training

## How It Works
1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses 4-bit quantization and LoRA for memory-efficient training (see the sketch after this list)
4. **Checkpointing**: Saves regular checkpoints so training can resume if interrupted
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to the Hugging Face Hub as [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science)
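
To illustrate step 3, a 4-bit quantization plus LoRA setup built with `bitsandbytes` and `peft` might look like the sketch below. This is a minimal sketch, not the exact code in `run_transformers_training.py`; the quantization settings and LoRA hyperparameters are assumed values.

```python
# Minimal sketch of 4-bit + LoRA setup (illustrative; hyperparameters are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights keep the 8B model within ~24GB VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # also enables gradient checkpointing hooks

lora_config = LoraConfig(
    r=16,                                   # assumed rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```

Because only the LoRA adapters receive gradients, this is what keeps an 8B model trainable within the 24GB VRAM budget noted above.
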
## Key Features | |
### Sequential Processing | |
The training script ensures that chunks from the same research paper are processed together by: | |
- Sorting the dataset by ID | |
- Using a SequentialSampler to maintain order | |
- Overriding the default DataLoader to disable shuffling | |
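
A minimal sketch of that combination, assuming the Hugging Face `Trainer` is used and that the dataset exposes an `id` column (both are assumptions here):

```python
# Sketch: keep chunks from the same paper adjacent and feed them in order.
# The _get_train_sampler override is a version-dependent private Trainer hook,
# shown here only to illustrate the idea of disabling shuffling.
from datasets import load_dataset
from torch.utils.data import SequentialSampler
from transformers import Trainer

dataset = load_dataset("George-API/cognitive-data", split="train")
dataset = dataset.sort("id")  # chunks from the same paper end up contiguous

class SequentialTrainer(Trainer):
    def _get_train_sampler(self):
        # The default Trainer uses a RandomSampler; return a SequentialSampler instead.
        return SequentialSampler(self.train_dataset)
```
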
### Data Collator

The `SimpleDataCollator` class:

- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
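
The real `SimpleDataCollator` lives in `run_transformers_training.py`; the sketch below only approximates that behaviour, and the field names, padding strategy, and statistics it tracks are assumptions.

```python
# Approximate sketch of a collator for pre-tokenized examples (not the actual class).
import logging
import torch

logger = logging.getLogger(__name__)

class SimpleDataCollator:
    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id
        self.stats = {"processed": 0, "skipped": 0}  # simple processing statistics

    def __call__(self, features):
        sequences = []
        for example in features:
            try:
                sequences.append(torch.tensor(example["input_ids"]))  # data is pre-tokenized
                self.stats["processed"] += 1
            except (KeyError, TypeError) as err:
                self.stats["skipped"] += 1
                logger.warning("Skipping malformed example: %s", err)
        input_ids = torch.nn.utils.rnn.pad_sequence(
            sequences, batch_first=True, padding_value=self.pad_token_id
        )
        attention_mask = (input_ids != self.pad_token_id).long()
        labels = input_ids.masked_fill(attention_mask == 0, -100)  # ignore padding in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```
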
### Checkpointing

The training process saves checkpoints:

- Every 100 steps (configurable)
- Automatically resumes from the latest checkpoint if interrupted
- Maintains up to 3 recent checkpoints
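
In `TrainingArguments` terms, this checkpointing policy corresponds roughly to the snippet below (the output directory is an assumed placeholder):

```python
# Checkpointing settings matching the description above (output_dir is assumed).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=100,          # save every 100 steps (configurable)
    save_total_limit=3,      # keep only the 3 most recent checkpoints
)

# To continue an interrupted run from the newest checkpoint:
# trainer.train(resume_from_checkpoint=True)
```
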
## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- `dataset_name`: George-API/cognitive-data
- `learning_rate`: 3e-5
- `num_train_epochs`: 5
- `per_device_train_batch_size`: 4
- `gradient_accumulation_steps`: 8
- `max_seq_length`: 2048
- `push_to_hub`: true
- `hub_model_id`: "DeepSeek-Cognitive-Science"
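
Reassembled as JSON, those parameters correspond roughly to the snippet below; any other fields present in the real `transformers_config.json` are omitted here.

```json
{
  "model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "dataset_name": "George-API/cognitive-data",
  "learning_rate": 3e-5,
  "num_train_epochs": 5,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 2048,
  "push_to_hub": true,
  "hub_model_id": "DeepSeek-Cognitive-Science"
}
```

With a per-device batch size of 4 and 8 gradient accumulation steps, the effective batch size is 32 sequences per device.
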
## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```
The script will:

1. Load the dataset and model
2. Configure LoRA adapters
3. Process the data sequentially
4. Train the model for the specified number of epochs
5. Save the resulting model and push it to Hugging Face Hub as [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science)
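
Tying the earlier sketches together, steps 2–5 amount to roughly the following. It reuses the `SequentialTrainer`, `SimpleDataCollator`, quantized LoRA `model`, `tokenizer`, sorted `dataset`, and `training_args` from the sketches above, so it is an outline rather than standalone code.

```python
# Outline only: builds on the objects defined in the earlier sketches.
trainer = SequentialTrainer(
    model=model,                    # 4-bit base model with LoRA adapters
    args=training_args,             # checkpointing (and hub) settings
    train_dataset=dataset,          # sorted by "id" so papers stay contiguous
    data_collator=SimpleDataCollator(
        pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id  # fall back to EOS if no pad token
    ),
)
trainer.train()                     # pass resume_from_checkpoint=True to continue a run
trainer.push_to_hub()               # publishes to the repo set by hub_model_id
```
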
## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/DeepSeek-Cognitive-Science"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
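
If the published repository turns out to contain only LoRA adapter weights rather than merged model weights (worth checking in the repo's files), the adapters would instead be attached to the base model with `peft`:

```python
# Alternative loading path if the Hub repo holds LoRA adapters (an assumption).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype="auto")
model = PeftModel.from_pretrained(base, "George-API/DeepSeek-Cognitive-Science")
tokenizer = AutoTokenizer.from_pretrained(base_name)
```
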
## Expected Outcomes

After domain adaptation, the model should:

- Have a better understanding of cognitive science terminology
- Show improved performance on cognitive science tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:

1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning) using the [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science) model
3. Use TensorBoard to analyze training metrics |
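
For step 3, a typical TensorBoard invocation is shown below; the log directory is an assumption and should match the `logging_dir` used during training.

```bash
# Point --logdir at wherever training wrote its event files (path assumed).
tensorboard --logdir ./logs
```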