# CUDA

Install the latest version of **CUDA** that matches the major version of your **PyTorch** build  
For example, CUDA 11.8 can be used with PyTorch compiled for CUDA 11.7, but CUDA 12.0 *cannot*

- <https://developer.nvidia.com/cuda-downloads>
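To see both sides of that version match, compare the CUDA version `torch` was built against with the installed toolkit (a quick check; assumes `nvcc` is on your `PATH`):

> python -c 'import torch; print(torch.version.cuda)'  
> nvcc --version | grep release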

Install the latest version of **cuDNN** compatible with the chosen CUDA version

- <https://developer.nvidia.com/rdp/cudnn-download>

Currently the best options are **CUDA 11.8** with **cuDNN 8.7**  
Note that **CUDA 12** is not yet supported by stable PyTorch releases; nightly builds are available for CUDA 12.1 (see below)  

## PyTorch

*Note*: Uninstall `torch` and `triton` before attempting any new installs

> pip uninstall torch torchvision torchaudio triton -y

### Stable

**PyTorch 2.0.0** compiled with **CUDA 11.8**:

> pip install torch torchaudio torchvision triton --force-reinstall --extra-index-url https://download.pytorch.org/whl/cu118  
> pip show torch  
> 2.0.0
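
A quick functional check (a minimal one-liner; assumes a CUDA-capable GPU is present):

> python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'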

### Nightly

**PyTorch 2.1-nightly** compiled with **CUDA 12.1**:

> pip install --pre torch triton torchvision torchaudio --force-reinstall --extra-index-url https://download.pytorch.org/whl/nightly/cu121  
> pip show torch  
> 2.1.0.dev20230305+cu121

### From source

Read <https://github.com/pytorch/pytorch#from-source>  
Note: the **PyTorch** build process relies heavily on **Anaconda**

### Monkey-patching

Torch bundles its own copy of `cuDNN`, which is great for simplicity,  
but not so great if your performance is 50% of what's expected  

First make sure that your `cuDNN` is installed correctly and `ldconfig` can find it  
Then remove the bundled `cuDNN` from the `torch` package:

> rm ~/.local/lib/python3.10/site-packages/torch/lib/libcudnn*

Now check whether the correct `cuDNN` libraries are found:

> sudo ldconfig  
> ldconfig -p | grep cudnn

If not, modify `LD_LIBRARY_PATH` to include the `cuDNN` libraries and repeat the `ldconfig` commands

> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
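
Finally, verify which `cuDNN` version `torch` actually loads (a quick check; prints e.g. `8700` for cuDNN 8.7):

> python -c 'import torch; print(torch.backends.cudnn.version())'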

## SDP cross-attention optimization  

**SDP** (scaled dot product) attention is built into **PyTorch 2.0** and is the recommended cross-attention optimization when using it  
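
No extra packages are needed; `torch.nn.functional.scaled_dot_product_attention` automatically picks the fastest available backend (a minimal sketch; tensor shapes are `(batch, heads, seq_len, head_dim)` and values are random placeholders):

    import torch
    import torch.nn.functional as F

    # random q/k/v just to exercise the kernel
    q = torch.randn(1, 8, 64, 64, device='cuda', dtype=torch.float16)
    k = torch.randn(1, 8, 64, 64, device='cuda', dtype=torch.float16)
    v = torch.randn(1, 8, 64, 64, device='cuda', dtype=torch.float16)

    # dispatches to flash / memory-efficient / math backend automatically
    out = F.scaled_dot_product_attention(q, k, v)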

## Xformers cross-attention optimization

`xformers` is a library of optimized attention kernels for PyTorch  
Highly recommended for a significant performance boost when using **PyTorch 1.x**  
Not required when using **PyTorch 2.0**  

### xFormers Stable

When using the release version of **PyTorch 1.13.1**, simply install `xformers` from **PyPI**:

> pip install -U xformers

### xFormers From Source

Otherwise, the build process takes a bit longer...

Set your environment so `xformers` can be optimized for *your* GPU

> python -c 'import torch; print(torch.cuda.get_device_capability())'
> (8, 6)
> export TORCH_CUDA_ARCH_LIST="8.6"

Rebuild `xformers`

> sudo apt install pybind11-dev
> pip install ninja setuptools pybind11 
> pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers  

This compiles `xformers` for your system, which is preferred over using a pre-built wheel

Check functionality using:

> python -m xformers.info

Make sure that all fields marked with `memory_efficient` are set to `available`  
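
Once installed, the kernels can also be called directly (a minimal sketch; tensors are shaped `(batch, seq_len, heads, head_dim)` and are random placeholders):

    import torch
    from xformers.ops import memory_efficient_attention

    q = torch.randn(1, 64, 8, 64, device='cuda', dtype=torch.float16)
    k = torch.randn(1, 64, 8, 64, device='cuda', dtype=torch.float16)
    v = torch.randn(1, 64, 8, 64, device='cuda', dtype=torch.float16)

    # same math as standard attention, but with much lower peak memory
    out = memory_efficient_attention(q, k, v)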

## Triton

### Triton Stable

There are separate `torchtriton` and `triton` packages, as well as different sources for `triton`  
To avoid confusion, uninstall any existing `triton` packages before installing `torch`, and install `triton` in the same `pip` command as `torch`

### Triton From Source

The default `triton` package is good enough for a fully functional system,  
unless you want to experiment further with the torch `dynamo` just-in-time compiler,  
in which case you may need to build & install the <https://github.com/openai/triton> package from source
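
For reference, `dynamo` and its default `inductor` backend, which generates `triton` kernels, are driven through `torch.compile` (a minimal sketch, **Torch 2.0+** only; the model is a toy placeholder):

    import torch

    model = torch.nn.Linear(64, 64).cuda()
    # inductor traces the model and emits triton kernels under the hood
    compiled = torch.compile(model, backend='inductor')
    out = compiled(torch.randn(8, 64, device='cuda'))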

## Accelerate

Running in **FP16** mode with the **Dynamo** accelerator is recommended  
But...**Dynamo** is only supported with **Torch 2.0**!  
Otherwise, run without **Dynamo**

> pip install accelerate
> accelerate config

    In which compute environment are you running? This machine
    Which type of machine are you using? No distributed training
    Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]: no
    Do you wish to optimize your script with torch dynamo?[yes/NO]: yes
    Which dynamo backend would you like to use? inductor <- only if using torch 2.0+, otherwise no
    Do you want to use DeepSpeed? [yes/NO]: no
    What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]: all
    Do you wish to use FP16 or BF16 (mixed precision)? fp16

> accelerate test
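
Once configured, a training script only needs a few `accelerate` calls (a minimal sketch; the model, optimizer, and data below are toy placeholders for your own objects):

    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()  # picks up settings saved by `accelerate config`

    # toy model and data just to illustrate the API; replace with your own
    model = torch.nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randn(64, 1))
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

    # moves everything to the right device and wraps for mixed precision
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    for batch, target in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(batch), target)
        accelerator.backward(loss)  # replaces loss.backward()
        optimizer.step()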

## Python

Stable **PyTorch** releases are **NOT** compatible with Python 3.11, use 3.10 instead  
Nightly builds offer experimental Python 3.11 support (see below)

Install as usual, but it is also possible to build from source

### Build

You can also build `python` itself from source

Download from <https://www.python.org/downloads/source/>

Configure:
> export CFLAGS="-march=native -O3 -pipe -Wno-unused-value -Wno-empty-body -DNDEBUG"  
> ./configure --prefix /usr --enable-optimizations --with-lto --enable-loadable-sqlite-extensions  
> time make -j32  

Check: 
> ./python --version  
> ./python -c 'import sysconfig; print(sysconfig.get_config_var("PY_CFLAGS"))'  

Do side-by-side install:
> sudo make altinstall  
> sudo update-alternatives --install /bin/python3 python3 /bin/python3.11 100  
> sudo update-alternatives --list python3  

Switch to new `python`:

> sudo update-alternatives --config python3  
> python -m pip install --upgrade pip  
> python -m pip uninstall torch torchvision torchaudio triton pytorch_triton -y  
> python -m pip install --pre torch triton torchaudio torchvision --extra-index-url https://download.pytorch.org/whl/nightly/cu118 --force-reinstall  
> python -c 'import torch; print(torch.__path__, torch.__version__)'  

## nVidia CUDA

### Windows WSL2

Requirements:
- Latest version of Windows: CUDA support is not included in RTM builds  
  Note: Insider builds are no longer required as CUDA support is present in Beta builds  
- Updated WSL kernel: `wsl --update`, minimum **4.19.121**, recommended **5.15.74**  
- Updated nVidia drivers: minimum **460**, recommended **510**  

Links:
- [nVidia install docs](https://docs.nvidia.com/cuda/wsl-user-guide/index.html)
- [Ubuntu install docs](https://ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2)
- [CUDA download](https://developer.nvidia.com/cuda-downloads)

### Install

Install both `CUDA` and `cuDNN`  
- Note: Do not install drivers if running in a VM, leave the host drivers as-is  

The driver version can be higher than the runtime version, but not the opposite  
- Example: driver 510 supports CUDA 12 and is compatible with CUDA 11.6

Install using either:
- Add nVidia repository and install using `apt`
- Download installer and install manually  

### Check

Check that CUDA is detected and which versions are installed:

> apt list cuda*

The list is long, but the minimum packages are:

    cuda/now 11.6.1-1
    cuda-11-6/now 11.6.1-1
    cuda-cccl-11-6/now 11.6.55-1
    cuda-command-line-tools-11-6/now 11.6.1-1
    cuda-compiler-11-6/now 11.6.1-1
    cuda-cudart-11-6/now 11.6.55-1
    cuda-cupti-11-6/now 11.6.112-1
    cuda-libraries-11-6/now 11.6.1-1
    cuda-nvcc-11-6/now 11.6.112-1
    cuda-runtime-11-6/now 11.6.1-1
    cuda-toolkit-11-6/now 11.6.1-1
    cuda-tools-11-6/now 11.6.1-1

> apt list libcudnn*

    libcudnn8/now 8.3.2.44-1+cuda11.5

> nvidia-smi  

    NVIDIA-SMI 510.85.02 Driver Version: 526.98 CUDA Version: 12.0

> head /usr/local/cuda/version.json  

    "cuda" : {
      "name" : "CUDA SDK",
      "version" : "11.6.1"
    },

### NVCC

Test:

> git clone https://github.com/NVIDIA/cuda-samples

Edit `Makefile` as needed to specify compute level and run `make`

> ./Samples/1_Utilities/deviceQuery/deviceQuery

    Device 0: "NVIDIA GeForce RTX 3060"
      CUDA Driver Version / Runtime Version          12.0 / 11.6
      CUDA Capability Major/Minor version number:    8.6
      Total amount of global memory:                 12288 MBytes (12884377600 bytes)
      (028) Multiprocessors, (128) CUDA Cores/MP:    3584 CUDA Cores
      GPU Max Clock rate:                            1777 MHz (1.78 GHz)
      Memory Clock rate:                             7501 Mhz
      Memory Bus Width:                              192-bit
      ...

## Stable Diffusion

Stable-Diffusion requires `CUDA` compute level **SM86**, so CUDA versions older than 11 are insufficient
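
To confirm that your installed `torch` build actually includes **SM86** kernels (a quick check):

> python -c 'import torch; print(torch.cuda.get_arch_list())'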

## TensorFlow

Install:

> pip3 install tensorflow  

TensorFlow links dynamically against the CUDA libraries, so as long as the major version matches it should work (e.g. TensorFlow 2.10 uses CUDA 11.x)  
Mixing different major versions between TensorFlow and CUDA does not work
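
For a quick sanity check before running the full info script below (a minimal one-liner):

> python -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'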

Check:  

> wget https://raw.githubusercontent.com/vladmandic/tfjs-utils/main/src/tfinfo.py  
> python tfinfo.py

    sysconfig: [
      ('cpu_compiler', '/dt9/usr/bin/gcc'),
      ('cuda_compute_capabilities', ['sm_35', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']),
      ('cuda_version', '11.2'),
      ('cudnn_version', '8'),
      ('is_cuda_build', True),
      ('is_rocm_build', False),
      ('is_tensorrt_build', True)
    ]
    gpu device: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU') {
      'compute_capability': (8, 6),
      'device_name': 'NVIDIA GeForce RTX 3060'
    }
    logical device: LogicalDevice(name='/device:GPU:0', device_type='GPU')

## PyTorch

Install **PyTorch** linked to *exact* major/minor version of **CUDA**:

> pip3 uninstall torch torchvision torchaudio  
> pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116  

Note that `cu116` at the end refers to `CUDA` **11.6**, which should match the `CUDA` installation on your system  

Check:  

> wget https://raw.githubusercontent.com/vladmandic/tfjs-utils/main/src/torchinfo.py
> python torchinfo.py  

    torch version: 1.12.1+cu116
    cuda available: True
    cuda version: 11.6
    cuda arch list: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
    device: NVIDIA GeForce RTX 3060

## XFormers

Download

> git clone https://github.com/facebookresearch/xformers.git
> cd xformers
> git submodule update --init --recursive

Compile

> export FORCE_CUDA="1"
> export TORCH_CUDA_ARCH_LIST=8.6
> pip install ninja pyre-extensions einops
> python setup.py build develop
> python setup.py bdist_wheel

Install

> pip install dist/*
> python -m xformers.info