Retentive Network: A Successor to Transformer for Large Language Models

Code release: https://github.com/microsoft/torchscale
July 2023: release the preprint "Retentive Network: A Successor to Transformer for Large Language Models"

Installation

To install:

pip install torchscale

Alternatively, you can develop it locally:

git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .

Getting Started

It takes only a few lines of code to create a RetNet model:

# Creating a RetNet model
>>> import torch
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder

>>> config = RetNetConfig(vocab_size=64000)
>>> retnet = RetNetDecoder(config)

>>> print(retnet)

Changelog

Nov 2023: improve stability via better initialization
Nov 2023: fix retention normalization in the commit
Oct 2023: improve stability as follows
- RMSNorm is used in the commit, so that the effect of LN_eps is eliminated
- The LN eps was changed from 1e-6 to 1e-5 in the commit
- The RetNet implementation integrates the initialization principle proposed in DeepNet, so neither --subln nor --deepnorm should be added
- Removing the layer bias also improves training stability
Aug 4, 2023: fix a bug in the chunkwise recurrent representation (commit)
Aug 4, 2023: improve the numerical precision of the recurrent representation as suggested by https://github.com/microsoft/torchscale/issues/47 (commit)
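
The changelog entries above refer to the recurrent and chunkwise recurrent representations of retention. As a rough illustration of why these forms are interchangeable, the sketch below computes a simplified retention in three ways and checks that the results agree. It is not the torchscale implementation: it assumes a single head with a scalar decay gamma, leaves out the rotation applied to queries and keys, group normalization, and any scaling, and all function names (retention_parallel, retention_recurrent, retention_chunkwise, max_abs_diff) are hypothetical. It uses plain Python lists so that it runs without any dependencies.

```python
import random


def retention_parallel(Q, K, V, gamma):
    """O[n] = sum over m <= n of gamma**(n - m) * (Q[n] . K[m]) * V[m]."""
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            w = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += w * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """S_n = gamma * S_{n-1} + K_n^T V_n, then O_n = Q_n S_n."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # running d_k x d_v state
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def retention_chunkwise(Q, K, V, gamma, chunk_size):
    """Parallel form inside each chunk, recurrent state across chunks."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # summary of all earlier chunks
    out = []
    for start in range(0, len(Q), chunk_size):
        q_c = Q[start:start + chunk_size]
        k_c = K[start:start + chunk_size]
        v_c = V[start:start + chunk_size]
        B = len(q_c)  # the last chunk may be shorter
        chunk_out = retention_parallel(q_c, k_c, v_c, gamma)
        # Add the decayed contribution of earlier chunks to each position.
        for p in range(B):
            decay = gamma ** (p + 1)
            for j in range(d_v):
                chunk_out[p][j] += decay * sum(
                    q_c[p][i] * S[i][j] for i in range(d_k)
                )
        out.extend(chunk_out)
        # Carry the state forward to the end of this chunk.
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma ** B * S[i][j] + sum(
                    gamma ** (B - 1 - m) * k_c[m][i] * v_c[m][j] for m in range(B)
                )
    return out


def max_abs_diff(A, B):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


random.seed(0)
N, d = 6, 4  # toy sequence length and head dimension
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]

par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)
chk = retention_chunkwise(Q, K, V, gamma=0.9, chunk_size=4)

print("parallel vs recurrent:", max_abs_diff(par, rec))
print("parallel vs chunkwise:", max_abs_diff(par, chk))
```

In exact arithmetic the three forms give identical outputs; in floating point they differ only by rounding, which is why numerical precision matters when switching between representations at training and inference time.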

Citations

If you find this repository useful, please consider citing our work:

@article{retnet,
  author    = {Yutao Sun and Li Dong and Shaohan Huang and Shuming Ma and Yuqing Xia and Jilong Xue and Jianyong Wang and Furu Wei},
  title     = {Retentive Network: A Successor to {Transformer} for Large Language Models},
  journal   = {ArXiv},
  volume    = {abs/2307.08621},
  year      = {2023}
}