---
library_name: timm
license: mit
tags:
- image-classification
- timm
pipeline_tag: image-classification
---
# LSNet: See Large, Focus Small
Paper: https://arxiv.org/abs/2503.23135
Code: https://github.com/jameslahm/lsnet
## Usage
```python
import timm
import torch
from PIL import Image
import requests
from timm.data import resolve_data_config, create_transform
# Load the pretrained model from the Hugging Face Hub
model = timm.create_model(
    'hf_hub:jameslahm/lsnet_b',
    pretrained=True
)
model.eval()
# Load and transform image
# Example using a URL:
url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
# Convert to RGB in case the image has an alpha channel or a palette
img = Image.open(requests.get(url, stream=True).raw).convert('RGB')
# Build the preprocessing pipeline from the model's pretrained config
config = resolve_data_config({}, model=model)
transform = create_transform(**config)
input_tensor = transform(img).unsqueeze(0)  # transform and add batch dimension
# Make prediction
with torch.no_grad():
    output = model(input_tensor)
probabilities = torch.nn.functional.softmax(output[0], dim=0)
# Get top 5 predictions
top5_prob, top5_catid = torch.topk(probabilities, 5)
# Assuming you have imagenet labels list 'imagenet_labels'
# for i in range(top5_prob.size(0)):
#     print(imagenet_labels[top5_catid[i]], top5_prob[i].item())
```
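The commented-out lines above assume an `imagenet_labels` list. As a minimal, framework-free sketch of that last step, the helper below pairs the top-k probabilities with label strings. The name `decode_topk` and the toy labels are hypothetical; with the snippet above you would pass `probabilities.tolist()` and your own list of ImageNet class names.

```python
from typing import List, Sequence, Tuple


def decode_topk(probs: Sequence[float], labels: Sequence[str],
                k: int = 5) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, highest first."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    # Sort class indices by descending probability and keep the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [(labels[i], float(probs[i])) for i in order]


# Toy example with three hypothetical classes (not real model output).
toy_probs = [0.1, 0.7, 0.2]
toy_labels = ["cat", "beignet", "dog"]
for label, p in decode_topk(toy_probs, toy_labels, k=2):
    print(f"{label}: {p:.2f}")
```

The length check catches the common mistake of pairing the model output with a label file from a different class set.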
## Citation
If our code or models help your work, please cite our paper:
```bibtex
@misc{wang2025lsnetlargefocussmall,
title={LSNet: See Large, Focus Small},
author={Ao Wang and Hui Chen and Zijia Lin and Jungong Han and Guiguang Ding},
year={2025},
eprint={2503.23135},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.23135},
}
```
# [LSNet: See Large, Focus Small](https://arxiv.org/abs/2503.23135)
Official PyTorch implementation of **LSNet**. CVPR 2025.
<p align="center">
<img src="https://raw.githubusercontent.com/THU-MIG/lsnet/refs/heads/master/figures/throughput.svg" width=60%> <br>
Models are trained on ImageNet-1K, and throughput
is measured on an NVIDIA RTX 3090.
</p>
[LSNet: See Large, Focus Small](https://arxiv.org/abs/2503.23135).\
Ao Wang, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding\
[arXiv](https://arxiv.org/abs/2503.23135) | [Models](https://huggingface.co/jameslahm/lsnet/tree/main) | [Collection](https://huggingface.co/collections/jameslahm/lsnet-67ebec0ab4e220e7918d9565)
We introduce LSNet, a new family of lightweight vision models inspired by the dynamic heteroscale capability of the human visual system, i.e., "See Large, Focus Small". LSNet achieves state-of-the-art trade-offs between performance and efficiency across various vision tasks.
<details>
<summary>
<font size="+1">Abstract</font>
</summary>
Vision network designs, including Convolutional Neural Networks and Vision Transformers, have significantly advanced the field of computer vision. Yet, their complex computations pose challenges for practical deployments, particularly in real-time applications. To tackle this issue, researchers have explored various lightweight and efficient network designs. However, existing lightweight models predominantly leverage self-attention mechanisms and convolutions for token mixing. This dependence brings limitations in effectiveness and efficiency in the perception and aggregation processes of lightweight networks, hindering the balance between performance and efficiency under limited computational budgets. In this paper, we draw inspiration from the dynamic heteroscale vision ability inherent in the efficient human vision system and propose a "See Large, Focus Small" strategy for lightweight vision network design. We introduce LS (<b>L</b>arge-<b>S</b>mall) convolution, which combines large-kernel perception and small-kernel aggregation. It can efficiently capture a wide range of perceptual information and achieve precise feature aggregation for dynamic and complex visual representations, thus enabling proficient processing of visual information. Based on LS convolution, we present LSNet, a new family of lightweight models. Extensive experiments demonstrate that LSNet achieves superior performance and efficiency over existing lightweight networks in various vision tasks.
</details>
## Classification on ImageNet-1K
### Models
- \* denotes the results with distillation.
- Throughput is measured on an NVIDIA RTX 3090 using [speed.py](./speed.py).
| Model | Top-1 | Params | FLOPs | Throughput | Ckpt | Log |
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| LSNet-T | 74.9 / 76.1* | 11.4M | 0.3G | 14708 | [T](https://huggingface.co/jameslahm/lsnet/blob/main/lsnet_t.pth) / [T*](https://huggingface.co/jameslahm/lsnet/blob/main/lsnet_t_distill.pth) | [T](logs/lsnet_t.log) / [T*](logs/lsnet_t_distill.log) |
| LSNet-S | 77.8 / 79.0* | 16.1M | 0.5G | 9023 | [S](https://huggingface.co/jameslahm/lsnet/blob/main/lsnet_s.pth) / [S*](https://huggingface.co/jameslahm/lsnet/blob/main/lsnet_s_distill.pth) | [S](logs/lsnet_s.log) / [S*](logs/lsnet_s_distill.log) |
| LSNet-B | 80.3 / 81.6* | 23.2M | 1.3G | 3996 | [B](https://huggingface.co/jameslahm/lsnet/blob/main/lsnet_b.pth) / [B*](https://huggingface.co/jameslahm/lsnet/blob/main/lsnet_b_distill.pth) | [B](logs/lsnet_b.log) / [B*](logs/lsnet_b_distill.log) |
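For readers who want to sanity-check throughput on their own hardware, here is a minimal sketch of a timing loop. It is not the repository's `speed.py`; the function name `measure_throughput`, the batch size, and the iteration counts are assumptions, and it presumes the Throughput column is in images per second, which the table does not state.

```python
import time
from typing import Callable


def measure_throughput(step: Callable[[], None], batch_size: int,
                       warmup: int = 5, iters: int = 20) -> float:
    """Return images processed per second by `step`, which handles one batch."""
    # Warm-up iterations are excluded from timing (caches, lazy initialization).
    for _ in range(warmup):
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed


# Stand-in workload. With a real model, `step` would run one forward pass
# under torch.no_grad() and, on a GPU, call torch.cuda.synchronize() so that
# asynchronous kernels finish before the timer is read.
def dummy_step() -> None:
    sum(i * i for i in range(10_000))


print(f"{measure_throughput(dummy_step, batch_size=256):.0f} images/s")
```

Numbers obtained this way depend heavily on batch size, precision, and hardware, so they are only comparable to the table under matching settings.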
## ImageNet
### Prerequisites
A `conda` virtual environment is recommended.
```bash
conda create -n lsnet python=3.8
# Activate the environment so that pip installs into it
conda activate lsnet
pip install -r requirements.txt
```
### Data preparation
Download and extract ImageNet train and val images from http://image-net.org/. The training and validation data are expected to be in the `train` folder and `val` folder respectively:
```
|-- /path/to/imagenet/
    |-- train
    |-- val
```
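Before launching a long run, it can help to confirm that the dataset root matches this layout. The sketch below is one way to do so; `check_imagenet_layout` is a hypothetical helper, not part of the repository, and it only checks that both folders exist and counts their immediate entries.

```python
import tempfile
from pathlib import Path
from typing import Dict


def check_imagenet_layout(root: str) -> Dict[str, int]:
    """Verify that `root` has `train` and `val` folders; return entry counts."""
    counts = {}
    for split in ("train", "val"):
        split_dir = Path(root).expanduser() / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing folder: {split_dir}")
        # Count immediate children only; nested files are not inspected.
        counts[split] = sum(1 for _ in split_dir.iterdir())
    return counts


# Demonstration on a throwaway directory with a hypothetical class folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train" / "class_a").mkdir(parents=True)
    (Path(tmp) / "val").mkdir()
    print(check_imagenet_layout(tmp))
```

On a real dataset, call `check_imagenet_layout('/path/to/imagenet')` with your own path.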
### Training
To train LSNet-T on an 8-GPU machine:
```bash
python -m torch.distributed.launch --nproc_per_node=8 --master_port 12345 --use_env main.py --model lsnet_t --data-path ~/imagenet --dist-eval
# For training with distillation, please add `--distillation-type hard`
# For LSNet-B, please add `--weight-decay 0.05`
```
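The two comments in the command above describe optional flags. As a rough sketch, the hypothetical helper below assembles the same command and appends those flags when they apply. It uses only the flags shown above and assumes the `--model` value for LSNet-B is `lsnet_b`, by analogy with `lsnet_t`.

```python
from typing import List


def build_train_command(model: str, data_path: str, nproc: int = 8,
                        distill: bool = False) -> List[str]:
    """Assemble the training command as an argument list."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}", "--master_port", "12345", "--use_env",
        "main.py", "--model", model, "--data-path", data_path, "--dist-eval",
    ]
    if distill:
        # Training with distillation
        cmd += ["--distillation-type", "hard"]
    if model == "lsnet_b":
        # Assumed model name for LSNet-B, which uses a different weight decay
        cmd += ["--weight-decay", "0.05"]
    return cmd


print(" ".join(build_train_command("lsnet_b", "~/imagenet", distill=True)))
```

The printed string is meant to be pasted into a shell. If the list is passed to `subprocess.run` instead, expand `~` first with `os.path.expanduser`, since no shell is involved to do it.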
### Testing
```bash
python main.py --eval --model lsnet_t --resume ./pretrain/lsnet_t.pth --data-path ~/imagenet
```
Models can also be downloaded automatically from the 🤗 Hub, as shown below.
```python
import timm
# Choose one of: t, t_distill, s, s_distill, b, b_distill
variant = 't'
model = timm.create_model(
    f'hf_hub:jameslahm/lsnet_{variant}',
    pretrained=True
)
```
## Downstream Tasks
[Object Detection and Instance Segmentation](https://github.com/THU-MIG/lsnet/blob/master/detection/README.md)<br>
[Semantic Segmentation](https://github.com/THU-MIG/lsnet/blob/master/segmentation/README.md)<br>
[Robustness Evaluation](https://github.com/THU-MIG/lsnet/blob/master/README_robustness.md)
## Acknowledgement
The classification (ImageNet) code base is partly built on [EfficientViT](https://github.com/microsoft/Cream/tree/main/EfficientViT), [LeViT](https://github.com/facebookresearch/LeViT), [PoolFormer](https://github.com/sail-sg/poolformer) and [EfficientFormer](https://github.com/snap-research/EfficientFormer).
The detection and segmentation pipeline is from [MMCV](https://github.com/open-mmlab/mmcv) ([MMDetection](https://github.com/open-mmlab/mmdetection) and [MMSegmentation](https://github.com/open-mmlab/mmsegmentation)).
Thanks for the great implementations!