Retentive Network: A Successor to Transformer for Large Language Models

MIT License

Installation

To install:

pip install torchscale

Alternatively, install it from source for local development:

git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .

Getting Started

It takes only a few lines of code to create a RetNet model:

# Creating a RetNet model
>>> import torch
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder

>>> config = RetNetConfig(vocab_size=64000)
>>> retnet = RetNetDecoder(config)

>>> print(retnet)

Changelog

  • Nov 2023: improve stability via better initialization
  • Nov 2023: fix retention normalization (see the commit)
  • Oct 2023: improve stability as follows:
    • RMSNorm is used (see the commit), so that the effects of LN_eps are eliminated
    • The LN eps was changed from 1e-6 to 1e-5 (see the commit)
    • The RetNet implementation already integrates the initialization principle proposed in DeepNet, so the arguments --subln and --deepnorm should not be added
    • Removing layer bias also improves training stability
  • Aug 4, 2023: fix a bug in the chunkwise recurrent representation (commit)
  • Aug 4, 2023: improve the numerical precision of the recurrent representation, as suggested in https://github.com/microsoft/torchscale/issues/47 (commit)
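
Several changelog entries refer to the recurrent representation of retention. As a rough illustration of what that means, the sketch below computes single-head retention in two ways, the parallel form and the recurrent form, and checks that they agree. It is plain Python and is not torchscale code: the function names `retention_parallel` and `retention_recurrent` are hypothetical, and it assumes a single scalar decay `gamma` while omitting the scaling, rotation, and normalization steps that a real implementation applies.

```python
# Illustrative only: single-head retention with a scalar decay gamma.
# These helper names are hypothetical and are not part of torchscale.

def retention_parallel(q, k, v, gamma):
    """Parallel form: o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        o = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            score = sum(a * b for a, b in zip(q[n], k[m])) * gamma ** (n - m)
            for j in range(d_v):
                o[j] += score * v[m][j]
        out.append(o)
    return out


def retention_recurrent(q, k, v, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # S_0 is all zeros
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        # Decay the previous state and add the outer product k_n^T v_n.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        # Read out with the current query.
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k))
                    for j in range(d_v)])
    return out


# Toy inputs: 3 tokens, key/query dimension 2, value dimension 1.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
gamma = 0.5

parallel_out = retention_parallel(q, k, v, gamma)
recurrent_out = retention_recurrent(q, k, v, gamma)

# The two forms should match up to floating-point error.
max_diff = max(abs(a - b)
               for row_p, row_r in zip(parallel_out, recurrent_out)
               for a, b in zip(row_p, row_r))
print("max difference between forms:", max_diff)
```

The recurrent form carries a fixed-size state from token to token, which is why its numerical precision matters over long sequences; the parallel form recomputes every pairwise term and is the natural reference to compare against.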

Citations

If you find this repository useful, please consider citing our work:

@article{retnet,
  author    = {Yutao Sun and Li Dong and Shaohan Huang and Shuming Ma and Yuqing Xia and Jilong Xue and Jianyong Wang and Furu Wei},
  title     = {Retentive Network: A Successor to {Transformer} for Large Language Models},
  journal   = {ArXiv},
  volume    = {abs/2307.08621},
  year      = {2023}
}