---
library_name: transformers
tags: []
---
|
|
|
# tinyllamas_92M |
|
|
|
tinyllamas_92M is a small (~92M-parameter) Llama-style language model trained from scratch on the TinyStories dataset with the [karpathy/llama2.c](https://github.com/karpathy/llama2.c) training code, using a custom 8192-token tokenizer and a 256-token context window.
|
## Model Details |
|
```python
# model architecture
max_seq_len = 256   # context window in tokens
vocab_size = 8192   # custom tokenizer vocabulary size
dim = 768           # hidden / embedding dimension
n_layers = 12       # number of transformer blocks
n_heads = 12        # attention heads per layer
n_kv_heads = 12     # key/value heads (equal to n_heads, i.e. no grouped-query attention)
```
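
For convenience, here is a minimal sketch of how these dimensions would map onto a `transformers` `LlamaConfig`, assuming the checkpoint has been converted to the Hugging Face Llama format; the `intermediate_size` is an assumption based on llama2.c's default FFN sizing and is not stated in the config above.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Hypothetical HF-format equivalent of the llama2.c model args above
config = LlamaConfig(
    vocab_size=8192,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=12,
    max_position_embeddings=256,
    intermediate_size=2048,  # assumed: llama2.c's default FFN sizing (2/3 * 4 * dim)
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```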
|
|
|
### Training Data |
|
- https://huggingface.co/datasets/roneneldan/TinyStories |
|
- Tokenized using: https://github.com/karpathy/llama2.c?tab=readme-ov-file#custom-tokenizers |
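
Under the hood, the llama2.c custom-tokenizer flow trains a SentencePiece BPE model on the corpus. A simplified, hypothetical sketch with the 8192-token vocabulary used here (the input path is a placeholder; the actual run used llama2.c's `tinystories.py` helper):

```python
import sentencepiece as spm

# Simplified stand-in for the llama2.c custom-tokenizer step
spm.SentencePieceTrainer.train(
    input="tinystories_corpus.txt",  # placeholder: plain-text dump of TinyStories
    model_prefix="tok8192",          # writes tok8192.model / tok8192.vocab
    model_type="bpe",
    vocab_size=8192,
)
```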
|
|
|
|
|
### Training Hyperparameters
|
|
|
```python
batch_size = 64  # if gradient_accumulation_steps > 1, this is the micro-batch size
dropout = 0.0
# adamw optimizer
gradient_accumulation_steps = 8  # used to simulate larger batch sizes
learning_rate = 1e-3  # max learning rate
max_iters = 34000  # total number of training iterations
weight_decay = 3e-4
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0  # clip gradients at this value, or disable if == 0.0
# learning rate decay settings
decay_lr = True  # whether to decay the learning rate
warmup_iters = 1000  # how many steps to warm up for
```
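
The `decay_lr` / `warmup_iters` settings correspond to llama2.c's linear-warmup plus cosine-decay schedule. A minimal sketch, assuming the decay window spans all of `max_iters` and the learning rate decays to 0 (llama2.c's defaults; not stated explicitly above):

```python
import math

learning_rate = 1e-3
warmup_iters = 1000
lr_decay_iters = 34000  # assumed equal to max_iters
min_lr = 0.0            # assumed final learning rate

def get_lr(it: int) -> float:
    # 1) linear warmup for the first warmup_iters steps
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    # 2) after the decay window, hold at min_lr
    if it > lr_decay_iters:
        return min_lr
    # 3) cosine decay from learning_rate down to min_lr in between
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)
```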
|
|
|
### Results |
|
Trained on 4× V100 GPUs.

```
Run summary:
iter          34000
loss/train    0.8704
loss/val      0.9966
tokens        983040000
```
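
(A validation loss of 0.9966 corresponds to a perplexity of roughly exp(0.9966) ≈ 2.71 on TinyStories.)

Once the weights are available on the Hub in `transformers` format, generation would look roughly like this; the repo id below is a placeholder, not the actual model path:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-namespace>/tinyllamas_92M"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Once upon a time there was a little"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.8, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```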