{
"cells": [
{
"cell_type": "markdown",
"id": "e69e9896-4be5-4706-9b49-cb772d02e8d4",
"metadata": {},
"source": [
"# Swin Transformers as a special sparsity pattern\n",
"\n",
"In this notebook, we will show how the recently-introduced [Swin Transformers](https://arxiv.org/abs/2103.14030) can be cast\n",
"as a sparse transformer with a particular sparsity pattern.\n",
"\n",
"\n",
"Swin Transformers is a hierarchical Transformer whose representation is computed with shifted windows.\n",
"The shifted windowing scheme brings efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection\n",
"\n",
"\n",
"\n",
"\n",
"In this notebook, we will cover:\n",
"- what type of self-attention is needed to replicate a Swin Transformer\n",
"- we will show how one can modify their pre-trained Swin Transformer to use the sparse kernels from xformers instead of hand writing the Swin Transformer self-attention by hand.\n",
"\n",
"Let's start with a few imports. In this notebook, the vanilla Swin Transformer will be taken from `timm`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "1beec17c-cdec-4c54-afca-61423c1aab58",
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import copy\n",
"import torch\n",
"from torch import nn\n",
"from torch.utils import benchmark\n",
"\n",
"import xformers.components.attention.attention_patterns as AP\n",
"from xformers.components.attention.core import scaled_dot_product_attention\n",
"from xformers.components.attention._sputnik_sparse import SparseCS\n",
"\n",
"import timm\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"id": "9072a48e-ba89-4093-ae7e-22706602f11e",
"metadata": {},
"source": [
"## What sparsity pattern does Swin Transformer correspond to?\n",
"\n",
"In xformers, we provide for reference a default implementation of the attention pattern that corresponds to the Swin Transformer architecture.\n",
"\n",
"It can be found together with the other attention patterns in `xformers.components.attention.attention_patterns`.\n",
"\n",
"Let's try it out on the example case from above, on an image of size 8x8, and windows of size 4:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "57323b78-3b3b-457a-95d8-1f55fdcdddd6",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAfAAAAD6CAYAAABeQBU0AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjQuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Z1A+gAAAACXBIWXMAAAsTAAALEwEAmpwYAAAHZUlEQVR4nO3dsW4VRxiG4T0hSJELChQKIlmhoiQpkLkACt8sN3AkuACQkCyXR2lo0iQQiShWkmZSpALZu94Zjdef53lK2+vfxTm8GuTfsyulTABAlm+2/gEAgPUEHAACCTgABBJwAAgk4AAQSMABINC3a774+4f3ypPj+1WDDudHVc89fXZR9ZyZjObv6a/p3/LPbu1zc+/rXq+tude71zN86c/pj99LKY++/viqgD85vj+92x9X/QCnP/xc9dx+f1b1nJmM5m15U/Xc3Pu612tr7vXu9Qxfel1efbjs4/4LHQACCTgABBJwAAgk4AAQaNUvsQF3z+H86MpfHNv/elb9fed+GW3uc71mwl3jBA4AgQQcAAIJOAAEEnAACCTgABBIwAEgkIADQKBVe+Bz+6JLanc7W/Y6zew3E4BtOYEDQCABB4BAAg4AgQQcAAIJOAAEEnAACOQ6UeBKvdYba68a7TUTEjmBA0AgAQeAQAIOAIEEHAACCTgABBJwAAi0ao3s6bOLab8/qxp007eYmQnXM/e+bnltzT3ba93Le4GROIEDQCABB4BAAg4AgQQcAAIJOAAEEnAACOQ2Mhjc4fzoyvWrXuuNtStmLTPhrnECB4BAAg4AgQQcAAIJOAAEEnAACCTgABBIwAEg0Ko98Ll90SW1u50te51m9psJwLacwAEgkIADQCABB4BAAg4AgQQcAAIJOAAEcp0ocKVe6421V432mgmJnMABIJCAA0AgAQeAQAIOAIEEHAACCTgABFq1Rvb02cW0359VDbrpW8zMhG1t8bqcm7n0Huvx844yc2muf6P6cAIHgEACDgCBBBwAAgk4AAQScAAIJOAAEEjAASCQ60SBISztItf+LYaWq1Hvysylz/eaOToncAAIJOAAEEjAASCQgANAIAEHgEACDgCBVq2RHc6Pbvy6zJY1AjP7zQRgW07gABBIwAEgkIADQCABB4BAAg4AgQQcAAK5jQxgql+rnFvjbLnBK2nm0vfdYuYInMABIJCAA0AgAQeAQAIOAIEEHAACCTgABNqVUq79xc9/+q682x9XDbrpW8zMZDRvy5vpc/m0W/vcg93D8mL3ssePdKtssZI0ysyluf6NavO6vHpfSnn+9cedwAEgkIADQCABB4BAAg4AgQQcAAIJOAAEEnAACOQ6UWAIS7vItX+LoeUazbsyc+nzvWaOzgkcAAIJOAAEEnAACCTgABBIwAEgkIADQKBVa2SH86Mbvy6zZY3AzH4zAdiWEzgABBJwAAgk4AAQSMABIJCAA0AgAQeAQG4jA5jq1yrn1jhbbvBKmrn0fbeYOQIncAAIJOAAEEjAASCQgANAIAEHgEACDgCBdqWUa3/xg93D8mL3suOPw02qvcVsmvJuTxth5tvyZvpcPu3WPjfK+3qLlaRRZi7NHX3dq9Xr8up9KeX51x93AgeAQAIOAIEEHAACCTgABBJwAAgk4AAQSMABIJDrRIEhLO0i99jbH2Xm0ue3+DsMI3ACB4BAAg4AgQQcAAIJOAAEEnAACCTgABDIGtnAtrgqs2WumfNOTi+qngMyOYEDQCABB4BAAg4AgQQcAAIJOAAEEnAACGSNDGDqs/bXcoNX0syl77vFzBE4gQNAIAEHgEACDgCBBBwAAgk4AAQScAAIJOAAEMge+MC2uCqzZa6Z8w7lY9Vzo+i1Uzz33Cgzl+b2mjk6J3AACCTgABBIwAEgkIADQCABB4BAAg4AgayRAUNoWYOq/b6jzFz6/BZrnCNwAgeAQAIOAIEEHAACCTgABBJwAAgk4AAQyBrZwLa4aatlrpnzTk4vqp
4DMjmBA0AgAQeAQAIOAIEEHAACCTgABBJwAAhkjQxg6rP213KDV9LMpe+7xcwROIEDQCABB4BAAg4AgQQcAAIJOAAEEnAACCTgABDIHvjAtrgqs2WumfMO5WPVc7TZYhfZzP+17MPX2mLmVZzAASCQgANAIAEHgEACDgCBBBwAAgk4AASyRgZApLm1rV5rslvMvIoTOAAEEnAACCTgABBIwAEgkIADQCABB4BA1sgGtsVNWy1zzZx3cnpR9RyQyQkcAAIJOAAEEnAACCTgABBIwAEgkIADQCABB4BA9sABuHN6/Z2L2qtGW2bee3z5x53AASCQgANAIAEHgEACDgCBBBwAAgk4AASyRjawLa7KbJlr5rxD+Vj1HKSqXfdaMvfsFjOn6ZdLP+oEDgCBBBwAAgk4AAQScAAIJOAAEEjAASCQNTIAItWue7V83y1mXsUJHAACCTgABBJwAAgk4AAQSMABIJCAA0Aga2QD2+KmrZa5Zs47Ob2oeg7I5AQOAIEEHAACCTgABBJwAAgk4AAQSMABIJCAA0Age+AA3Dm9/s5F7VWjLTPvPb78407gABBIwAEgkIADQCABB4BAAg4AgQQcAALtSinX/+Ld7rdpmj70+3GABj+WUh6tfcj7Gm69S9/bqwIOANwO/gsdAAIJOAAEEnAACCTgABBIwAEgkIADQCABB4BAAg4AgQQcAAL9B2NyMknQIrs/AAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"H, W = 8, 8\n",
"window_size = 4\n",
"\n",
"mask = AP.swin_attention_pattern(H, W, window_size, shift_size=0)\n",
"mask_shifted = AP.swin_attention_pattern(H, W, window_size, shift_size=2)\n",
"\n",
"fig = plt.figure(figsize=(7, 14))\n",
"ax = fig.add_subplot(1, 2, 1)\n",
"ax.imshow(mask)\n",
"plt.xticks([])\n",
"plt.yticks([])\n",
"ax = fig.add_subplot(1, 2, 2)\n",
"ax.imshow(mask_shifted)\n",
"plt.xticks([])\n",
"plt.yticks([])\n",
"fig.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "26ca8db6-9897-4239-89fb-f48b1f0d75b8",
"metadata": {},
"source": [
"Now let's visualize the self-attention for every pixel in the image. Every sub-image corresponds to the self-attention for one pixel"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "10fe9b1e-a443-4c54-a141-d3d08969efe6",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plt.figure(figsize=(12, 12))\n",
"for i in range(H * W):\n",
" ax = fig.add_subplot(H, W, i + 1)\n",
" ax.imshow(mask[i].reshape(H, W))\n",
" ax.grid(color='k', linestyle='-', linewidth=1)\n",
" ax.set_xticks(torch.arange(0.5, W))\n",
" ax.set_yticks(torch.arange(0.5, H))\n",
" ax.set_xticklabels([])\n",
" ax.set_yticklabels([])\n",
"fig.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "747678ef-18d0-4e13-969b-da389f7e3479",
"metadata": {},
"source": [
"And for the shifted case"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "01274104-c52a-4740-85bf-67363300322b",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig = plt.figure(figsize=(12, 12))\n",
"for i in range(H * W):\n",
" ax = fig.add_subplot(H, W, i + 1)\n",
" ax.imshow(mask_shifted[i].reshape(H, W))\n",
" ax.grid(color='k', linestyle='-', linewidth=1)\n",
" ax.set_xticks(torch.arange(0.5, W))\n",
" ax.set_yticks(torch.arange(0.5, H))\n",
" ax.set_xticklabels([])\n",
" ax.set_yticklabels([])\n",
"fig.tight_layout()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "dd27ef92-1ba7-405a-9545-251f94f29461",
"metadata": {},
"source": [
"We can see that the self-attention maps correspond to the images in the paper (shown in the top of the notebook), illustrating that indeed a custom sparsity pattern is enough to reproduce Swin Transformer without having to ressort to implementing custom code.\n",
"\n",
"Plus, it is trivial to extend Swin Transformer with other attention patterns (such as local 2d, axial and more, see [the 2d attetnion patterns notebook](https://github.com/fairinternal/xformers/blob/main/docs/source/2d_attention_patterns.ipynb) for more examples."
]
},
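{
"cell_type": "markdown",
"id": "5c1d2e7a-9b34-4f60-8a1e-3d7f0b2c4e91",
"metadata": {},
"source": [
"To make the correspondence more concrete, here is a minimal pure-Python sketch of how such a mask could be built. It is not the xformers implementation: `window_region_ids` and `window_mask_sketch` are hypothetical helpers, and the sketch assumes that `H` and `W` are multiples of `window_size` and that two pixels may attend to each other exactly when they fall in the same (possibly shifted) window without wrapping around the image border.\n",
"\n",
"```python\n",
"def window_region_ids(H, W, window_size, shift_size=0):\n",
"    # Label every pixel of an H x W grid with the (row, column) index of its window.\n",
"    ids = []\n",
"    for h in range(H):\n",
"        for w in range(W):\n",
"            # Floor division puts the first `shift_size` rows / columns in a group\n",
"            # of their own (index -1), so they are never merged with the pixels\n",
"            # they would be placed next to by a cyclic shift.\n",
"            ids.append(((h - shift_size) // window_size, (w - shift_size) // window_size))\n",
"    return ids\n",
"\n",
"\n",
"def window_mask_sketch(H, W, window_size, shift_size=0):\n",
"    # (H*W) x (H*W) boolean mask: True where query and key share a window.\n",
"    ids = window_region_ids(H, W, window_size, shift_size)\n",
"    return [[q == k for k in ids] for q in ids]\n",
"\n",
"\n",
"mask = window_mask_sketch(8, 8, 4)\n",
"mask_shifted = window_mask_sketch(8, 8, 4, shift_size=2)\n",
"print(sum(mask[0]))          # pixel (0, 0) sees its full 4x4 window -> 16\n",
"print(sum(mask_shifted[0]))  # pixel (0, 0) now sits in a 2x2 corner region -> 4\n",
"```\n",
"\n",
"With a shift of 2, this sketch splits the 8x8 grid into row and column groups of sizes 2, 4 and 2, i.e. 9 regions in total, which is the structure the shifted plots above should display. Whether it agrees entry by entry with `AP.swin_attention_pattern` is an assumption worth checking before relying on it."
]
},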
{
"cell_type": "markdown",
"id": "699fd7bc-377b-4786-8dc7-732619ddb89e",
"metadata": {},
"source": [
"## Using Swin Transformers as a sparse Transformer in your model\n",
"\n",
"Now that we know that we can represent a Swin Transformer as a particular instantiation of a sparse Transformer, let's use xformers efficient sparse kernels to see\n",
"what type of speed / memory trade-offs we get by casting a Swin Transformer as a sparse Transformer.\n",
"\n",
"To facilitate benchmarking and memory profiling, let's define a function that takes a generic callable and executes it, measuring the execution time and the GPU memory"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b072ae40-9bf3-4ab8-9717-acf2e8ffe981",
"metadata": {},
"outputs": [],
"source": [
"def profile_model(fn, min_run_time=2):\n",
" torch.cuda.reset_peak_memory_stats()\n",
" torch.cuda.synchronize()\n",
" res = benchmark.Timer(\n",
" stmt='fn()',\n",
" globals={\"fn\": fn},\n",
" label=\"profile\",\n",
" sub_label=\"\",\n",
" description=\"\"\n",
" ).blocked_autorange(min_run_time=min_run_time)\n",
" torch.cuda.synchronize()\n",
" memory = torch.cuda.max_memory_allocated() / 2 ** 20\n",
" memory = f\"Memory used: {memory} MB\"\n",
" print(res)\n",
" print(memory)"
]
},
{
"cell_type": "markdown",
"id": "1edb40c1-98ce-4ae1-a486-42022f3b6b1b",
"metadata": {},
"source": [
"Now it comes the core of it. We will implement an `Attention` module following the same API and modules as timm's, but using our `scaled_dot_product_attention` function.\n",
"\n",
"Note the similarities between this implementation and the one from the [vision transformers notebook](https://github.com/fairinternal/xformers/blob/main/docs/source/vision_transformers.ipynb).\n",
"\n",
"Note that we are not implementing relative positional embedding for simplicity"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "8def591e-be74-489a-af6b-45f90e13aadc",
"metadata": {},
"outputs": [],
"source": [
"from timm.models.layers import Mlp, DropPath\n",
"\n",
"\n",
"# exact the same one as from https://github.com/fairinternal/xformers/blob/main/docs/source/vision_transformers.ipynb\n",
"class Attention(torch.nn.Module):\n",
" def __init__(\n",
" self,\n",
" dim,\n",
" num_heads=8,\n",
" qkv_bias=False,\n",
" attn_drop=0.0,\n",
" proj_drop=0.0,\n",
" attn_mask=None,\n",
" **kwargs\n",
" ):\n",
" super().__init__()\n",
" self.num_heads = num_heads\n",
"\n",
" self.qkv = torch.nn.Linear(dim, dim * 3, bias=qkv_bias)\n",
" self.attn_drop = torch.nn.Dropout(attn_drop)\n",
" self.proj = torch.nn.Linear(dim, dim)\n",
" self.proj_drop = torch.nn.Dropout(proj_drop)\n",
" self.attn_mask = attn_mask\n",
"\n",
" def forward(self, x):\n",
" B, N, C = x.shape\n",
" qkv = (\n",
" self.qkv(x)\n",
" .reshape(B, N, 3, self.num_heads, C // self.num_heads)\n",
" .permute(2, 0, 3, 1, 4)\n",
" )\n",
"\n",
" qkv = qkv.flatten(1, 2)\n",
"\n",
" q, k, v = qkv.unbind()\n",
" \n",
" x = scaled_dot_product_attention(q, k, v, self.attn_mask, dropout=self.attn_drop)\n",
" \n",
" x = x.reshape(B, self.num_heads, N, C // self.num_heads)\n",
"\n",
" x = x.transpose(1, 2).reshape(B, N, C)\n",
" x = self.proj(x)\n",
" x = self.proj_drop(x)\n",
" return x\n",
" \n",
"\n",
"# almost copy and paste from timm's implementation, but removing the unneeded elements\n",
"# as we don't need to perform the image partitioning anymore\n",
"# Note that we call our swin_attention_pattern in the constructor\n",
"# to generate the custom sparsity pattern\n",
"class SwinTransformerBlock(nn.Module):\n",
" r\"\"\" Swin Transformer Block.\n",
" Args:\n",
" dim (int): Number of input channels.\n",
" input_resolution (tuple[int]): Input resulotion.\n",
" num_heads (int): Number of attention heads.\n",
" window_size (int): Window size.\n",
" shift_size (int): Shift size for SW-MSA.\n",
" mlp_ratio (float): Ratio of mlp hidden dim to embedding dim.\n",
" qkv_bias (bool, optional): If True, add a learnable bias to query, key, value. Default: True\n",
" qk_scale (float | None, optional): Override default qk scale of head_dim ** -0.5 if set.\n",
" drop (float, optional): Dropout rate. Default: 0.0\n",
" attn_drop (float, optional): Attention dropout rate. Default: 0.0\n",
" drop_path (float, optional): Stochastic depth rate. Default: 0.0\n",
" act_layer (nn.Module, optional): Activation layer. Default: nn.GELU\n",
" norm_layer (nn.Module, optional): Normalization layer. Default: nn.LayerNorm\n",
" \"\"\"\n",
"\n",
" def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0,\n",
" mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., attn_drop=0., drop_path=0.,\n",
" act_layer=nn.GELU, norm_layer=nn.LayerNorm):\n",
" super().__init__()\n",
" self.dim = dim\n",
" self.input_resolution = input_resolution\n",
" self.num_heads = num_heads\n",
" self.window_size = window_size\n",
" self.shift_size = shift_size\n",
" self.mlp_ratio = mlp_ratio\n",
" if min(self.input_resolution) <= self.window_size:\n",
" # if window size is larger than input resolution, we don't partition windows\n",
" self.shift_size = 0\n",
" self.window_size = min(self.input_resolution)\n",
" assert 0 <= self.shift_size < self.window_size, \"shift_size must in 0-window_size\"\n",
" \n",
" # create swin_attention_pattern sparsity pattern\n",
" attn_mask = AP.swin_attention_pattern(input_resolution[0], input_resolution[1], window_size, shift_size=shift_size)\n",
" attn_mask = SparseCS(attn_mask, torch.device(\"cuda\"))\n",
"\n",
" self.norm1 = norm_layer(dim)\n",
" self.attn = Attention(\n",
" dim, window_size=(self.window_size, self.window_size), num_heads=num_heads,\n",
" qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop, proj_drop=drop,\n",
" attn_mask=attn_mask\n",
" )\n",
"\n",
" self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()\n",
" self.norm2 = norm_layer(dim)\n",
" mlp_hidden_dim = int(dim * mlp_ratio)\n",
" self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)\n",
"\n",
" def forward(self, x):\n",
" H, W = self.input_resolution\n",
" B, L, C = x.shape\n",
" assert L == H * W, \"input feature has wrong size\"\n",
"\n",
" shortcut = x\n",
" x = self.norm1(x)\n",
"\n",
" # W-MSA/SW-MSA\n",
" x = self.attn(x) # nW*B, window_size*window_size, C\n",
"\n",
" # FFN\n",
" x = shortcut + self.drop_path(x)\n",
" x = x + self.drop_path(self.mlp(self.norm2(x)))\n",
"\n",
" return x"
]
},
{
"cell_type": "markdown",
"id": "e55578e0-1f4f-4a51-ab6c-eaa526e28b79",
"metadata": {},
"source": [
"Let's write a function that given a model, will replace all instances of timm.models.swin_transformer.SwinTransformerBlock with our own implementation, which leverages `scaled_dot_product_attention` and `swin_attention_pattern` from xformers\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "81e9c20f-69fb-48cc-b4e6-debb404e2240",
"metadata": {},
"outputs": [],
"source": [
"def replace_attn_with_xformers_one(module):\n",
" module_output = module\n",
" if isinstance(module, timm.models.swin_transformer.SwinTransformerBlock):\n",
" \n",
" module_output = SwinTransformerBlock(module.dim, module.input_resolution, module.num_heads, module.window_size, module.shift_size, module.mlp_ratio)\n",
" module_output.drop_path = module.drop_path\n",
" module_output.norm1 = module.norm1\n",
" module_output.norm2 = module.norm2\n",
" module_output.mlp = module.mlp\n",
" \n",
" module_output.attn.qkv = module.attn.qkv\n",
" module_output.attn.attn_drop = module.attn.attn_drop\n",
" module_output.attn.proj = module.attn.proj\n",
" module_output.attn.proj_drop = module.attn.proj_drop\n",
" \n",
" module_output.train(module.training)\n",
" \n",
" else:\n",
"\n",
" for name, child in module.named_children():\n",
" module_output.add_module(name, replace_attn_with_xformers_one(child))\n",
" del module\n",
" return module_output"
]
},
{
"cell_type": "markdown",
"id": "12eaa142-ae39-4a8e-bad9-20b3738e38d2",
"metadata": {},
"source": [
"Now it's time to create our Swin Transformer. Nothing unusual here. Note that we will be keeping a copy of the model, which will be the model to use sparse self-attention"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "d04d2cab-f1ec-4151-b41b-0c3406f0d9f7",
"metadata": {},
"outputs": [],
"source": [
"model = timm.models.create_model(\"swin_base_patch4_window7_224\").cuda().eval()\n",
"\n",
"# zero relative positional embedding in original model as we don't implement it here\n",
"for n, p in model.named_parameters():\n",
" if \"relative_position_bias_table\" in n:\n",
" torch.nn.init.zeros_(p)\n",
"\n",
"model_sparse = copy.deepcopy(model)\n",
"model_sparse = replace_attn_with_xformers_one(model_sparse)"
]
},
{
"cell_type": "markdown",
"id": "42d3792a-c219-445e-bc24-e32d3d729a02",
"metadata": {},
"source": [
"Let's new create an input tensor verify if both the sparse and the baseline versions produce the same results"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "52c4fce7-2e5a-4214-8944-cd54832f90f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Median absolute difference: 3.70e-05\n",
"Max absolute difference: 2.51e-04\n"
]
}
],
"source": [
"i = torch.rand(32, 3, 224, 224).cuda()\n",
"\n",
"with torch.no_grad():\n",
" r0 = model(i)\n",
" r1 = model_sparse(i)\n",
"\n",
"diff = (r0 - r1).abs()\n",
" \n",
"print(f\"Median absolute difference: {diff.median().item():.2e}\")\n",
"print(f\"Max absolute difference: {diff.max().item():.2e}\")"
]
},
{
"cell_type": "markdown",
"id": "6f7d83f9-b21b-48de-89dd-c255fef30f98",
"metadata": {},
"source": [
"The results are almost the same. The reason why they are not equivalent up to float precision is because we currently assume that the number of non-zero elements in the sparse matrix is a multiple of 4, so up to 3 elements in the self-attention might be dropped in order to satisfy this constraint.\n",
"This constraint will be lifted in the future.\n",
"\n",
"Let's new benchmark both the sparse and the baseline versions"
]
},
{
"cell_type": "markdown",
"id": "8ba0279a-ad14-4295-bb37-5778f60650dc",
"metadata": {},
"source": [
"### Profiling the baseline (dense) model"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "909b23f2-2ac7-488e-9447-089496d37346",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Forward only\n",
"\n",
"profile\n",
" Median: 212.33 ms\n",
" IQR: 9.49 ms (210.94 to 220.43)\n",
" 10 measurements, 1 runs per measurement, 1 thread\n",
"Memory used: 1448.72509765625 MB\n",
"\n",
"Forward + backward\n",
"\n",
"profile\n",
" Median: 626.96 ms\n",
" IQR: 12.91 ms (623.15 to 636.06)\n",
" 4 measurements, 1 runs per measurement, 1 thread\n",
"Memory used: 8615.0703125 MB\n"
]
}
],
"source": [
"print(\"Forward only\")\n",
"with torch.no_grad():\n",
" profile_model(lambda : model(i))\n",
"print(\"\")\n",
"print(\"Forward + backward\")\n",
"profile_model(lambda : model(i).sum().backward())"
]
},
{
"cell_type": "markdown",
"id": "97bcc2c6-a2e7-403c-8b43-33c5f818a7c1",
"metadata": {},
"source": [
"### Profiling the sparse model"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "4e17e1aa-6970-4b6f-9adb-4e51b55cb4ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Forward only\n",
"\n",
"profile\n",
" Median: 208.51 ms\n",
" IQR: 1.29 ms (208.06 to 209.34)\n",
" 10 measurements, 1 runs per measurement, 1 thread\n",
"Memory used: 1636.5673828125 MB\n",
"\n",
"Forward + backward\n",
"\n",
"profile\n",
" Median: 607.60 ms\n",
" IQR: 9.11 ms (605.09 to 614.20)\n",
" 4 measurements, 1 runs per measurement, 1 thread\n",
"Memory used: 8770.02001953125 MB\n"
]
}
],
"source": [
"print(\"Forward only\")\n",
"with torch.no_grad():\n",
" profile_model(lambda : model_sparse(i))\n",
"print(\"\")\n",
"print(\"Forward + backward\")\n",
"profile_model(lambda : model_sparse(i).sum().backward())"
]
},
{
"cell_type": "markdown",
"id": "6cfa7340-8d7c-4be5-861b-2354324f50ca",
"metadata": {},
"source": [
"Those results indicate that the sparse model achieves the same speed as the manually-implemented dense version.\n",
"This is very encouraging, as with a generic sparse implementation we are able to achieve comparable speed versus the optimized dense implementation, while being substantially simpler to implement (specially on the windows shift optimizations, see [\\[1\\]](https://github.com/microsoft/Swin-Transformer/issues/52) and [\\[2\\]](https://github.com/microsoft/Swin-Transformer/issues/38) for examples).\n",
"\n",
"From the memory perspective, the sparse model uses slightly more memory, as it needs to keep the indices of the non-zero elements in memory, while in the baseline dense model the structure is encoded directly in the code. Note that we can further reduce the memory needs by re-using the same sparse pattern over multiple layers."
]
},
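{
"cell_type": "markdown",
"id": "8e4b6a10-2c5f-4d79-b3a8-6f1e9d0c7a52",
"metadata": {},
"source": [
"Re-using one sparse pattern across layers could look roughly like the sketch below, which caches patterns by their defining parameters. `build_pattern` is a hypothetical stand-in so that the example runs without a GPU; in this notebook it would be replaced by the `SparseCS(AP.swin_attention_pattern(...), torch.device(\"cuda\"))` call from the block constructor. The choice of cache key (resolution, window size and shift size) is an assumption about what fully determines the pattern.\n",
"\n",
"```python\n",
"from functools import lru_cache\n",
"\n",
"\n",
"def build_pattern(H, W, window_size, shift_size):\n",
"    # Hypothetical stand-in for an expensive pattern builder: a dense boolean\n",
"    # mask that is True where two pixels share a (possibly shifted) window.\n",
"    ids = [((h - shift_size) // window_size, (w - shift_size) // window_size)\n",
"           for h in range(H) for w in range(W)]\n",
"    return tuple(tuple(q == k for k in ids) for q in ids)\n",
"\n",
"\n",
"@lru_cache(maxsize=None)\n",
"def get_pattern(H, W, window_size, shift_size):\n",
"    # Blocks with the same resolution, window size and shift share one object.\n",
"    return build_pattern(H, W, window_size, shift_size)\n",
"\n",
"\n",
"a = get_pattern(14, 14, 7, 0)\n",
"b = get_pattern(14, 14, 7, 0)  # served from the cache\n",
"c = get_pattern(14, 14, 7, 3)  # different shift, so a new pattern is built\n",
"print(a is b, a is c)\n",
"print(get_pattern.cache_info().currsize)\n",
"```\n",
"\n",
"With such a cache, each stage would hold at most two patterns (unshifted and shifted) regardless of how many blocks it contains, so the index storage would no longer grow with depth."
]
},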
{
"cell_type": "markdown",
"id": "95caaf5c-53b4-493e-8a66-1ad09d1917f6",
"metadata": {},
"source": [
"# Wrapping up\n",
"\n",
"In this notebook, we've shown that Swin Transformers can be casted as a sparse transformer, and we've shown that a generic implementation based on the sparse kernels from `xformers` is able to match performances compared to the hand-crafted implementation.\n",
"\n",
"We hope that this will further illustrate the power of custom sparsity patterns, and we hope xformers will enable new research directions on large sequences.\n",
"\n",
"Do not hesitate to reach out if you have questions."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}