---
title: R1-Distill-LLama-8b Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---

# DeepSeek R1-Distill-Llama-8B Training

This Space is dedicated to training the DeepSeek-R1-Distill-Llama-8B model for cognitive science research. The training pipeline combines memory-efficient optimizations (4-bit quantization, LoRA, gradient checkpointing) with sequential data processing.

## Features

- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies

## Training Process

The training utilizes:

- Custom data processing pipeline
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation

## Dataset

Training uses the cognitive dataset (George-API/cognitive-data) with:

- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching

## Hardware Requirements

- GPU: L4 or better
- VRAM: 24GB minimum
- RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the DeepSeek-R1-Distill-Llama-8B model to the cognitive science domain. This phase produces our domain-adapted model: George-API/DeepSeek-Cognitive-Science.

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Configuration parameters for training

## How It Works

1. Data Loading: Loads pre-tokenized data from the Hugging Face dataset
2. Sequential Processing: Processes data in order, maintaining the integrity of research papers
3. Efficient Training: Uses 4-bit quantization and LoRA for memory-efficient training (see the sketch after this list)
4. Checkpointing: Saves regular checkpoints to resume training if interrupted
5. Monitoring: Logs detailed metrics and statistics during training
6. Model Publishing: Pushes the trained model to Hugging Face Hub as George-API/DeepSeek-Cognitive-Science
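
As a rough sketch of step 3, the memory-efficient setup could be wired together with `transformers` and `peft` as shown below. The LoRA rank, alpha, dropout, and target modules are illustrative assumptions, not values taken from `transformers_config.json`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# 4-bit (NF4) quantization keeps the 8B base model inside a 24GB VRAM budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepares the quantized model for training; also enables gradient checkpointing by default.
model = prepare_model_for_kbit_training(model)

# LoRA adapters: only the small low-rank matrices are trained.
# The rank, alpha, dropout, and target modules below are illustrative assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Note that `prepare_model_for_kbit_training` turning on gradient checkpointing is what backs the memory-management features listed earlier.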

## Key Features

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:

- Sorting the dataset by ID
- Using a `SequentialSampler` to maintain order
- Overriding the default `DataLoader` to disable shuffling (see the sketch below)
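
A minimal sketch of the idea follows; the actual `run_transformers_training.py` may wire this differently, and the `id` sort key is an assumption:

```python
from datasets import load_dataset
from torch.utils.data import DataLoader, SequentialSampler
from transformers import Trainer


class SequentialTrainer(Trainer):
    """Trainer variant that never shuffles, so chunks of one paper stay in order."""

    def get_train_dataloader(self) -> DataLoader:
        # Replace the default (shuffling) sampler with a sequential one.
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=SequentialSampler(self.train_dataset),
            collate_fn=self.data_collator,
        )


# Sort the pre-tokenized dataset up front so chunks of the same paper
# are adjacent ("id" as the sort key is an assumption here).
dataset = load_dataset("George-API/cognitive-data", split="train")
dataset = dataset.sort("id")
```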

### Data Collator

The `SimpleDataCollator` class:

- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully (a minimal sketch follows)
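
The real class lives in `run_transformers_training.py`; the sketch below only illustrates the idea, and the field names, padding scheme, and logging cadence are assumptions:

```python
import logging

import torch

logger = logging.getLogger(__name__)


class SimpleDataCollator:
    """Sketch of a collator for pre-tokenized examples (details are assumptions)."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id
        self.processed = 0
        self.skipped = 0

    def __call__(self, features):
        sequences = []
        for feature in features:
            try:
                # Entries are already tokenized; no re-tokenization happens here.
                sequences.append(list(feature["input_ids"]))
                self.processed += 1
            except (KeyError, TypeError) as err:
                # Skip malformed entries instead of aborting the run.
                self.skipped += 1
                logger.warning("Skipping malformed example: %s", err)

        # Pad every sequence in the batch to the longest one.
        max_len = max(len(seq) for seq in sequences)
        input_ids = torch.full((len(sequences), max_len), self.pad_token_id, dtype=torch.long)
        attention_mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
        for i, seq in enumerate(sequences):
            input_ids[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
            attention_mask[i, : len(seq)] = 1

        # Standard causal-LM labels: ignore padded positions.
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100

        if self.processed and self.processed % 1000 == 0:
            logger.info("Processed %d examples, skipped %d", self.processed, self.skipped)

        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```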

### Checkpointing

During training, the script:

- Saves a checkpoint every 100 steps (configurable)
- Automatically resumes from the latest checkpoint if interrupted
- Keeps at most 3 recent checkpoints (see the sketch below)
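
In `TrainingArguments` terms, this behaviour corresponds roughly to the following sketch (the `output_dir` name is illustrative):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",      # where checkpoint-* folders are written (name is illustrative)
    save_strategy="steps",
    save_steps=100,            # checkpoint every 100 steps (configurable)
    save_total_limit=3,        # keep only the 3 most recent checkpoints
)

# Resuming picks up the newest checkpoint found in output_dir:
# trainer.train(resume_from_checkpoint=True)
```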

## Configuration

Key parameters in `transformers_config.json` (an illustrative layout follows the list):

- `model_name`: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- `dataset_name`: George-API/cognitive-data
- `learning_rate`: 3e-5
- `num_train_epochs`: 5
- `per_device_train_batch_size`: 4
- `gradient_accumulation_steps`: 8
- `max_seq_length`: 2048
- `push_to_hub`: true
- `hub_model_id`: "DeepSeek-Cognitive-Science"
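
Collected in one place, these settings might look like the file below; a flat layout is assumed for illustration, and the real `transformers_config.json` may nest or name keys differently:

```json
{
  "model_name": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
  "dataset_name": "George-API/cognitive-data",
  "learning_rate": 3e-5,
  "num_train_epochs": 5,
  "per_device_train_batch_size": 4,
  "gradient_accumulation_steps": 8,
  "max_seq_length": 2048,
  "push_to_hub": true,
  "hub_model_id": "DeepSeek-Cognitive-Science"
}
```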

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:

1. Load the dataset and model
2. Configure LoRA adapters
3. Process the data sequentially
4. Train the model for the specified number of epochs
5. Save the resulting model and push it to Hugging Face Hub as George-API/DeepSeek-Cognitive-Science (see the sketch below)
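
With `push_to_hub: true` the Trainer handles publishing automatically; a manual equivalent would look roughly like this, assuming `model` and `tokenizer` are the trained objects produced by the run:

```python
# Manual equivalent of the automatic Hub push (sketch only).
# Assumes `model` and `tokenizer` are the trained objects produced above.
model.push_to_hub("George-API/DeepSeek-Cognitive-Science")
tokenizer.push_to_hub("George-API/DeepSeek-Cognitive-Science")
```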

## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/DeepSeek-Cognitive-Science"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Expected Outcomes

After domain adaptation, the model should:

- Have a better understanding of cognitive science terminology
- Show improved performance on cognitive science tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:

1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning) using the George-API/DeepSeek-Cognitive-Science model
3. Use TensorBoard to analyze training metrics