---

title: R1-Distill-Llama-8B Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: "5.17.0"
app_file: app.py
pinned: false
license: mit
---


# DeepSeek-R1-Distill-Llama-8B Training

This Space trains the DeepSeek-R1-Distill-Llama-8B model for cognitive science research. The training pipeline relies on memory-efficient optimizations and order-preserving data processing.

## Features

- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
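
A minimal sketch of how a training script might read these files at startup (the `load_config` helper is illustrative; the actual loading code in `run_transformers_training.py` may differ):

```python
import json
from pathlib import Path

def load_config(path: str) -> dict:
    """Read one JSON configuration file into a plain dict."""
    return json.loads(Path(path).read_text())

# Illustrative usage: pull each config into memory before building the trainer.
model_config = load_config("transformers_config.json")
hardware_config = load_config("hardware_config.json")
dataset_config = load_config("dataset_config.json")
```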

## Training Process

The training process uses:
- Custom data processing pipeline
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation

## Dataset

Training uses the cognitive dataset with:
- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching

## Hardware Requirements

- GPU: L4 or better
- VRAM: 24GB minimum
- RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the DeepSeek-R1-Distill-Llama-8B model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Configuration parameters for training

## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses 4-bit quantization and LoRA for memory-efficient training (see the sketch after this list)
4. **Checkpointing**: Saves regular checkpoints to resume training if interrupted
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to Hugging Face Hub as [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science)
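
Step 3 combines 4-bit quantization with LoRA adapters. A minimal sketch of that setup using `transformers`, `bitsandbytes`, and `peft`; the specific LoRA rank, alpha, and target modules below are assumptions rather than the project's actual values:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Load the base model in 4-bit NF4 precision to fit within ~24 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable LoRA adapters; the frozen 4-bit base weights stay untouched.
lora_config = LoraConfig(
    r=16,             # rank (assumed value)
    lora_alpha=32,    # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module list
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```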

## Key Features

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together (see the sketch after this list) by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Overriding the default DataLoader to disable shuffling
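
A minimal sketch of this pattern, assuming a `Trainer` subclass (the `SequentialTrainer` name is illustrative; the actual script may wire this differently):

```python
from torch.utils.data import DataLoader, SequentialSampler
from transformers import Trainer

class SequentialTrainer(Trainer):
    """Trainer variant that feeds the dataset in its stored order (no shuffling)."""

    def get_train_dataloader(self) -> DataLoader:
        return DataLoader(
            self.train_dataset,
            batch_size=self.args.per_device_train_batch_size,
            sampler=SequentialSampler(self.train_dataset),  # keep paper chunks together
            collate_fn=self.data_collator,
        )

# The dataset itself would be sorted by ID beforehand, e.g.:
# dataset = dataset.sort("id")
```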

### Data Collator

The `SimpleDataCollator` class (sketched below):
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
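
A condensed sketch of what such a collator might look like; the field name `input_ids` and the padding behaviour are assumptions about the pre-tokenized format, not the project's exact implementation:

```python
import logging
import torch

logger = logging.getLogger(__name__)

class SimpleDataCollator:
    """Batches pre-tokenized examples without re-tokenizing or reordering them."""

    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id
        self.processed = 0   # running statistics for logging
        self.skipped = 0

    def __call__(self, features):
        sequences = []
        for feature in features:
            try:
                sequences.append(feature["input_ids"])   # data is already tokenized upstream
                self.processed += 1
            except KeyError:
                self.skipped += 1                        # drop malformed entries instead of crashing
                logger.warning("Skipped entry without input_ids (%d skipped so far)", self.skipped)
        max_len = max(len(seq) for seq in sequences)
        input_ids = torch.full((len(sequences), max_len), self.pad_token_id, dtype=torch.long)
        attention_mask = torch.zeros((len(sequences), max_len), dtype=torch.long)
        for i, seq in enumerate(sequences):
            input_ids[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
            attention_mask[i, : len(seq)] = 1
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100               # ignore padding in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```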

### Checkpointing

The training process saves checkpoints (see the sketch after this list):
- Every 100 steps (configurable)
- Automatically resumes from the latest checkpoint if interrupted
- Maintains up to 3 recent checkpoints
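
In `transformers` terms, this behaviour maps onto a few `TrainingArguments` fields plus `resume_from_checkpoint`; a minimal sketch, with an illustrative output directory:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints",   # illustrative path
    save_strategy="steps",
    save_steps=100,             # write a checkpoint every 100 steps
    save_total_limit=3,         # keep only the 3 most recent checkpoints
)

# Resuming later picks up the newest checkpoint in output_dir:
# trainer.train(resume_from_checkpoint=True)
```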

## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- `dataset_name`: George-API/cognitive-data
- `learning_rate`: 3e-5
- `num_train_epochs`: 5
- `per_device_train_batch_size`: 4
- `gradient_accumulation_steps`: 8
- `max_seq_length`: 2048
- `push_to_hub`: true
- `hub_model_id`: "DeepSeek-Cognitive-Science"
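
With these values, the effective batch size is `per_device_train_batch_size` × `gradient_accumulation_steps` = 4 × 8 = 32 sequences per optimizer step on a single GPU.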

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:
1. Load the dataset and model
2. Configure LoRA adapters
3. Process the data sequentially
4. Train the model for the specified number of epochs
5. Save the resulting model and push it to Hugging Face Hub as [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science)

## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/DeepSeek-Cognitive-Science"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
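
Because training uses LoRA, the Hub repository may hold adapter weights rather than a fully merged model. If direct loading fails, a sketch of the adapter-style loading path with `peft` (this assumes the repository stores adapters only):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
adapter_name = "George-API/DeepSeek-Cognitive-Science"

tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(base_model, adapter_name)  # applies the LoRA adapters
```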

## Expected Outcomes

After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on cognitive science tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:
1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning) using the [George-API/DeepSeek-Cognitive-Science](https://huggingface.co/George-API/DeepSeek-Cognitive-Science) model
3. Use TensorBoard to analyze training metrics