mtasic85 committed on
Commit
323d75d
·
1 Parent(s): 09143cd

prepare datasets

Files changed (2)
  1. README.md +5 -35
  2. scripts/pretrain-core-model-0.yaml +4 -3
README.md CHANGED
@@ -44,7 +44,7 @@ tags:
44
  - reason
45
  ---
46
 
47
- # tangled-alpha-0.2-core
48
 
49
  ![logo](./misc/logo.jpg)
50
 
@@ -53,44 +53,14 @@ time python -B prepare_core_datasets.py
53
  ```
54
 
55
  ```
56
- Progress: 100%|████████| 220/220 [23:15<00:00, 6.34s/it]
57
- Workers are finished.██| 220/220 [23:15<00:00, 6.34s/it]
58
- Finished data processing!
59
- i=0, block_size=8192, chunk_size=16384000, len(dataset)=893355, len(dataset) * block_size=7318364160
60
- Total number of tokens in the optimized dataset '../core-data-0-8192-2000' is 7318364160
61
  ```
62
 
63
  ```bash
64
- CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain-core-model.yaml
65
  ```
66
 
67
  ```
68
- Seed set to 23
69
- Time to instantiate model: 0.32 seconds.
70
- Total parameters: 217,088,512
71
- Verifying settings ...
72
- Measured TFLOPs: 3548.40
73
-
74
- Epoch 1 | iter 256 step 1 | loss train: 11.716, val: n/a | iter time: 1735.26 ms (step) remaining time: 4 days, 11:06:29
75
- Epoch 1 | iter 512 step 2 | loss train: 11.534, val: n/a | iter time: 1102.77 ms (step) remaining time: 4 days, 2:31:30
76
- Epoch 1 | iter 768 step 3 | loss train: 11.356, val: n/a | iter time: 1095.87 ms (step) remaining time: 3 days, 23:44:12
77
- Epoch 1 | iter 1024 step 4 | loss train: 11.162, val: n/a | iter time: 1099.92 ms (step) remaining time: 3 days, 22:18:27
78
- Epoch 1 | iter 1280 step 5 | loss train: 11.018, val: n/a | iter time: 1096.45 ms (step) remaining time: 3 days, 21:24:35
79
- Epoch 1 | iter 1536 step 6 | loss train: 10.901, val: n/a | iter time: 1093.65 ms (step) remaining time: 3 days, 20:48:11
80
- Epoch 1 | iter 1792 step 7 | loss train: 10.850, val: n/a | iter time: 1100.16 ms (step) remaining time: 3 days, 20:22:00
81
- Epoch 1 | iter 2048 step 8 | loss train: 10.780, val: n/a | iter time: 1092.67 ms (step) remaining time: 3 days, 20:01:57
82
- Epoch 1 | iter 2304 step 9 | loss train: 10.692, val: n/a | iter time: 1095.77 ms (step) remaining time: 3 days, 19:45:57
83
- Epoch 1 | iter 2560 step 10 | loss train: 10.678, val: n/a | iter time: 1092.12 ms (step) remaining time: 3 days, 19:32:43
84
- Epoch 1 | iter 2816 step 11 | loss train: 10.619, val: n/a | iter time: 1094.44 ms (step) remaining time: 3 days, 19:21:32
85
- Epoch 1 | iter 3072 step 12 | loss train: 10.588, val: n/a | iter time: 1102.51 ms (step) remaining time: 3 days, 19:12:30
86
- Epoch 1 | iter 3328 step 13 | loss train: 10.514, val: n/a | iter time: 1095.57 ms (step) remaining time: 3 days, 19:04:07
87
- Epoch 1 | iter 3584 step 14 | loss train: 10.472, val: n/a | iter time: 1104.00 ms (step) remaining time: 3 days, 18:56:56
88
- Epoch 1 | iter 3840 step 15 | loss train: 10.431, val: n/a | iter time: 1096.00 ms (step) remaining time: 3 days, 18:50:21
89
- Epoch 1 | iter 4096 step 16 | loss train: 10.392, val: n/a | iter time: 1098.34 ms (step) remaining time: 3 days, 18:44:25
90
- Epoch 1 | iter 4352 step 17 | loss train: 10.360, val: n/a | iter time: 1106.53 ms (step) remaining time: 3 days, 18:38:58
91
- Epoch 1 | iter 4608 step 18 | loss train: 10.329, val: n/a | iter time: 1084.95 ms (step) remaining time: 3 days, 18:33:58
92
- Epoch 1 | iter 4864 step 19 | loss train: 10.296, val: n/a | iter time: 1096.22 ms (step) remaining time: 3 days, 18:29:12
93
- Epoch 1 | iter 5120 step 20 | loss train: 10.236, val: n/a | iter time: 1093.39 ms (step) remaining time: 3 days, 18:24:51
94
  # ...
95
  ```
96
 
@@ -103,11 +73,11 @@ mv wandb wandb-pretrain-core
103
  Chat with model:
104
 
105
  ```bash
106
- CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt chat ../out/pretrain-core/final
107
  ```
108
 
109
  ```bash
110
- CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True time litgpt evaluate --tasks 'leaderboard' --out_dir '../evaluate/pretrain-core/leaderboard/' --batch_size 1 --dtype 'bfloat16' '../out/pretrain-core/final'
111
  ```
112
 
113
  ```
 
44
  - reason
45
  ---
46
 
47
+ # tangled-alpha-0.3-core
48
 
49
  ![logo](./misc/logo.jpg)
50
 
 
53
  ```
54
 
55
  ```
56
+ # ...
57
  ```
58
 
59
  ```bash
60
+ CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain-core-model-0.yaml
61
  ```
62
 
63
  ```
64
  # ...
65
  ```
66
 
 
73
  Chat with model:
74
 
75
  ```bash
76
+ CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt chat ../out/pretrain-core-0/final
77
  ```
78
 
79
  ```bash
80
+ CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True time litgpt evaluate --tasks 'leaderboard' --out_dir '../evaluate/pretrain-core-0/leaderboard/' --batch_size 1 --dtype 'bfloat16' '../out/pretrain-core-0/final'
81
  ```
82
 
83
  ```
scripts/pretrain-core-model-0.yaml CHANGED
@@ -25,7 +25,7 @@ model_config:
25
 
26
  # Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
27
  # /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
28
- out_dir: "../out/pretrain-core/"
29
 
30
  # The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
31
  # precision: bf16-mixed
@@ -60,6 +60,7 @@ train:
60
  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 512)
61
  global_batch_size: 512
62
  # global_batch_size: 256
 
63
 
64
  # Number of samples per data-parallel rank (type: int, default: 4)
65
  micro_batch_size: 4
@@ -67,7 +68,7 @@ train:
67
  # micro_batch_size: 1
68
 
69
  # Number of iterations with learning rate warmup active (type: int, default: 2000)
70
- lr_warmup_steps: 200
71
 
72
  # Number of epochs to train on (type: Optional[int], default: null)
73
  epochs:
@@ -93,7 +94,7 @@ train:
93
  # Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
94
  eval:
95
  # Number of optimizer steps between evaluation calls (type: int, default: 1000)
96
- interval: 50
97
 
98
  # Number of tokens to generate (type: Optional[int], default: null)
99
  max_new_tokens:
 
25
 
26
  # Directory in which to save checkpoints and logs. If running in a Lightning Studio Job, look for it in
27
  # /teamspace/jobs/<job-name>/share. (type: <class 'Path'>, default: out/pretrain)
28
+ out_dir: "../out/pretrain-core-0/"
29
 
30
  # The precision to use for pretraining. Possible choices: "bf16-true", "bf16-mixed", "32-true". (type: Optional[str], default: null)
31
  # precision: bf16-mixed
 
60
  # Number of samples between optimizer steps across data-parallel ranks (type: int, default: 512)
61
  global_batch_size: 512
62
  # global_batch_size: 256
63
+ # global_batch_size: 128
64
 
65
  # Number of samples per data-parallel rank (type: int, default: 4)
66
  micro_batch_size: 4
 
68
  # micro_batch_size: 1
69
 
70
  # Number of iterations with learning rate warmup active (type: int, default: 2000)
71
+ lr_warmup_steps: 500
72
 
73
  # Number of epochs to train on (type: Optional[int], default: null)
74
  epochs:
 
94
  # Evaluation-related arguments. See ``litgpt.args.EvalArgs`` for details
95
  eval:
96
  # Number of optimizer steps between evaluation calls (type: int, default: 1000)
97
+ interval: 100
98
 
99
  # Number of tokens to generate (type: Optional[int], default: null)
100
  max_new_tokens:
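
As a rough illustration of how the batch settings in `scripts/pretrain-core-model-0.yaml` relate to each other, the sketch below works through the arithmetic implied by the config comments. It is not litgpt's own code. `devices = 1` is an assumption based on `CUDA_VISIBLE_DEVICES=0` in the README commands, and the helper name `accumulation_iters` is hypothetical.

```python
# Values copied from scripts/pretrain-core-model-0.yaml as of this commit.
global_batch_size = 512  # samples between optimizer steps, across all ranks
micro_batch_size = 4     # samples per data-parallel rank per forward/backward pass
eval_interval = 100      # optimizer steps between evaluation calls

# Assumption: a single data-parallel rank (the commands set CUDA_VISIBLE_DEVICES=0).
devices = 1


def accumulation_iters(global_batch_size: int, micro_batch_size: int, devices: int) -> int:
    """Hypothetical helper: micro-batches each rank processes per optimizer step."""
    per_rank_batch = global_batch_size // devices
    if per_rank_batch % micro_batch_size != 0:
        raise ValueError("per-rank batch size must be divisible by micro_batch_size")
    return per_rank_batch // micro_batch_size


iters_per_step = accumulation_iters(global_batch_size, micro_batch_size, devices)
samples_between_evals = eval_interval * global_batch_size

print(f"micro-batches per optimizer step: {iters_per_step}")
print(f"samples between evaluations: {samples_between_evals}")
```

Under these assumptions, raising `eval.interval` from 50 to 100 doubles the number of samples seen between evaluation calls, while the commented-out `global_batch_size` alternatives (256, 128) would reduce the number of micro-batches accumulated per step proportionally.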