Retentive Network: A Successor to Transformer for Large Language Models
- Code release: https://github.com/microsoft/torchscale
- July 2023: released the preprint Retentive Network: A Successor to Transformer for Large Language Models
Installation
To install:
pip install torchscale
Alternatively, you can develop it locally:
git clone https://github.com/microsoft/torchscale.git
cd torchscale
pip install -e .
Getting Started
It takes only a few lines of code to create a RetNet model:
# Creating a RetNet model
>>> import torch
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder
>>> config = RetNetConfig(vocab_size=64000)
>>> retnet = RetNetDecoder(config)
>>> print(retnet)
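For a forward pass, the decoder also needs a token embedding. The snippet below is a minimal sketch rather than part of the original README: it assumes that RetNetDecoder accepts an embed_tokens module in its constructor, that the config exposes the hidden size as decoder_embed_dim, and that calling the model on a batch of token ids returns a (logits, extras) tuple, following the torchscale decoder interface; consult the repository for the authoritative signatures.

# Running a forward pass (sketch; the constructor and forward arguments are assumptions)
>>> import torch
>>> from torchscale.architecture.config import RetNetConfig
>>> from torchscale.architecture.retnet import RetNetDecoder
>>> config = RetNetConfig(vocab_size=64000)
>>> embed_tokens = torch.nn.Embedding(config.vocab_size, config.decoder_embed_dim)
>>> retnet = RetNetDecoder(config, embed_tokens=embed_tokens)
>>> tokens = torch.randint(0, config.vocab_size, (2, 16))  # (batch size, sequence length)
>>> logits, _ = retnet(tokens)
>>> print(logits.shape)  # expected: torch.Size([2, 16, 64000])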
Changelog
- Nov 2023: improve stability via better initialization
- Nov 2023: fix retention normalization (commit)
- Oct 2023: improve stability as follows
  - RMSNorm is used (commit), so that the effect of the LayerNorm epsilon (LN_eps) is eliminated; see the RMSNorm sketch after this changelog
  - The LayerNorm epsilon was changed from 1e-6 to 1e-5 (commit)
  - The initialization principle proposed in DeepNet is integrated into the RetNet implementation, so the arguments --subln or --deepnorm should not be added
  - Removing layer bias also improves training stability
- Aug 4, 2023: fix a bug in the chunkwise recurrent representation (commit)
- Aug 4, 2023: improve the numerical precision of the recurrent representation, as suggested in https://github.com/microsoft/torchscale/issues/47 (commit)
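To illustrate the Oct 2023 RMSNorm change referenced above: RMSNorm scales activations by their root mean square without subtracting the mean, so the epsilon only guards against division by zero rather than shaping the normalization statistics. The snippet below is a minimal illustrative sketch, not torchscale's implementation; the class name, signature, and default epsilon are chosen only for illustration.

# Minimal RMSNorm sketch (illustrative only, not torchscale's implementation)
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps                                      # guards against division by zero
        self.weight = torch.nn.Parameter(torch.ones(dim))   # learned gain, no bias

    def forward(self, x):
        # Normalize by the root mean square over the last dimension;
        # unlike LayerNorm, no mean is subtracted and no bias is added.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight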
Citations
If you find this repository useful, please consider citing our work:
@article{retnet,
  author  = {Yutao Sun and Li Dong and Shaohan Huang and Shuming Ma and Yuqing Xia and Jilong Xue and Jianyong Wang and Furu Wei},
  title   = {Retentive Network: A Successor to {Transformer} for Large Language Models},
  journal = {ArXiv},
  volume  = {abs/2307.08621},
  year    = {2023}
}