---
datasets:
- togethercomputer/RedPajama-Data-1T
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

## PDS-1.7B

[paper](https://arxiv.org/abs/2410.07064) | [code](https://github.com/microsoft/LMOps/tree/main/data_selection)

**PDS-1.7B** is a 1.7B-parameter model with the [Mistral](https://arxiv.org/abs/2310.06825) architecture, pre-trained from scratch on data selected from the CC split of [RedPajama](https://github.com/togethercomputer/RedPajama-Data) using the PDS framework.

This work investigates selecting high-quality pre-training data from massive corpora to enhance LMs' downstream capabilities.
We formulate data selection as a generalized Optimal Control problem, which can be solved theoretically by Pontryagin's Maximum Principle (PMP), yielding a set of necessary conditions that characterize the relationship between optimal data selection and LM training dynamics. Based on these theoretical results, we introduce PMP-based Data Selection (PDS), a framework that approximates optimal data selection by solving the PMP conditions.
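
As a rough sketch of what the Optimal Control formulation looks like, the schematic below is written in our own illustrative notation (not verbatim from the paper): $\gamma_n$ is a selection weight for example $x_n$, $\theta_t$ the model parameters at step $t$, $\eta$ the learning rate, $\ell$ the per-example training loss, and $J$ a downstream objective.

```latex
% Schematic control formulation of data selection (illustrative notation):
% choose data weights \gamma to minimize a downstream objective, subject to
% the gradient-descent training dynamics that \gamma induces.
\min_{\gamma}\; J(\theta_T)
\quad \text{s.t.} \quad
\theta_{t+1} \;=\; \theta_t \;-\; \eta \, \nabla_{\theta} \sum_{n} \gamma_n \, \ell(x_n, \theta_t),
\qquad t = 0, \dots, T-1.
% Applying Pontryagin's Maximum Principle to this problem yields necessary
% conditions linking the optimal \gamma to the training trajectory; PDS
% approximates the solution of these conditions to score and select data.
```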

Please refer to our [paper](https://arxiv.org/abs/2410.07064) for more details.

### Overview of the theory

<p align='left'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/624ac662102fcdff87be51b9/Hdw83Vsb305GRlsqB7c34.png" width="700">
</p>

### Overview of the PDS framework

<p align='left'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/624ac662102fcdff87be51b9/YPwluLyZGK7DACH1WqDUN.png" width="700">
</p>

### Evaluation

PDS-selected data improves the performance of language models pre-trained from scratch and saves pre-training computation. The improvement scales up to large model sizes.

<p align='left'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/624ac662102fcdff87be51b9/6undIr37d10qD73TDiPDK.png" width="600">
</p>

### Baseline

[Conventional Pre-training](https://huggingface.co/Data-Selection/BSL-1.7B)

### Sample Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Data-Selection/PDS-1.7B"

# Load the tokenizer and model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tokenize a prompt and generate a short continuation.
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
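
Equivalently, the model can be run through the standard `transformers` text-generation pipeline (the prompt and `max_new_tokens` value here are illustrative):

```python
from transformers import pipeline

# Build a text-generation pipeline around the model.
generator = pipeline("text-generation", model="Data-Selection/PDS-1.7B")

# Generate a short continuation of the prompt.
print(generator("Hello, my name is", max_new_tokens=32)[0]["generated_text"])
```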

### Citation

```bibtex
@article{gu2024data,
  title={Data Selection via Optimal Control for Language Models},
  author={Gu, Yuxian and Dong, Li and Wang, Hongning and Hao, Yaru and Dong, Qingxiu and Wei, Furu and Huang, Minlie},
  journal={arXiv preprint arXiv:2410.07064},
  year={2024}
}
```