<div align="center">
# MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
<a href="https://arxiv.org/abs/2410.19115"><img src='https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv&logoColor=white' alt='arXiv'></a>
<a href='https://wangrc.site/MoGePage/'><img src='https://img.shields.io/badge/Project_Page-Website-green?logo=googlechrome&logoColor=white' alt='Project Page'></a>
<a href='https://huggingface.co/spaces/Ruicheng/MoGe'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Live_Demo-blue'></a>
</div>
<img src="./assets/overview_simplified.png" width="100%" alt="Method overview" align="center">
MoGe is a powerful model for recovering 3D geometry from monocular open-domain images. The model consists of a ViT encoder and a convolutional decoder. It directly predicts an affine-invariant point map as well as a mask that excludes regions with undefined geometry (e.g., sky), from which the camera shift, camera focal length and depth map can be further derived.
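For intuition, here is a minimal NumPy sketch (not the repository's implementation) of how the depth map and a single focal length could be recovered from a point map that is already in camera coordinates; in practice `MoGeModel.infer` performs this recovery, including the camera shift, for you. The function name and the centered-principal-point simplification are illustrative only.
```python
import numpy as np

def depth_and_focal_from_pointmap(points, mask):
    # points: (H, W, 3) point map in camera coordinates (x right, y down, z forward)
    # mask:   (H, W) boolean validity mask
    H, W = mask.shape
    depth = points[..., 2]  # depth is simply the z channel

    # Normalized pixel coordinates with an assumed centered principal point.
    u = (np.arange(W) + 0.5) / W - 0.5
    v = (np.arange(H) + 0.5) / H - 0.5
    u, v = np.meshgrid(u, v)

    # Pinhole projection u ≈ f * x/z, v ≈ f * y/z; solve for f by least squares
    # over the valid pixels.
    x_over_z = points[..., 0][mask] / depth[mask]
    y_over_z = points[..., 1][mask] / depth[mask]
    focal = (u[mask] * x_over_z + v[mask] * y_over_z).sum() / (
        (x_over_z ** 2 + y_over_z ** 2).sum()
    )
    return depth, focal
```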
***Check our [website](https://wangrc.site/MoGePage) for videos and interactive results!***
## Features
* **Accurately** estimate 3D geometry in point map or mesh format from a **single** image.
* Support various image resolutions and aspect ratios, ranging from **2:1** to **1:2**.
* Capable of producing an extensive depth range, with far-to-near distance ratios of up to **1000x**.
* **Fast** inference, typically **0.2s** for a single image on an A100 or RTX 3090 GPU.
## TODO List
- [x] Release inference code & ViT-Large model.
- [ ] Release ViT-Base and ViT-Giant models.
- [ ] Release evaluation and training code.
🌟*Updated on 2024/11/28* - [CHANGELOG](CHANGELOG.md):
* Added support for a user-provided camera FOV.
* Added a script for panorama images: [scripts/infer_panorama.py](scripts/infer_panorama.py).
## Usage
### Prerequisite
- Clone this repository.
```bash
git clone https://github.com/microsoft/MoGe.git
cd MoGe
```
- Python (>= 3.10) environment:
- torch (>= 2.0) and torchvision (compatible with the torch version).
- Other requirements:
```bash
pip install -r requirements.txt
```
MoGe should be compatible with most versions of the listed requirements. See `requirements.txt` for details if you have concerns.
### Pretrained model
The ViT-Large model has been uploaded to the Hugging Face Hub at [Ruicheng/moge-vitl](https://huggingface.co/Ruicheng/moge-vitl).
You may load the model via `MoGeModel.from_pretrained("Ruicheng/moge-vitl")` without downloading it manually.
If you prefer loading the model from a local file, download it from the Hugging Face Hub manually and load it via `MoGeModel.from_pretrained("PATH_TO_LOCAL_MODEL.pt")`.
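For example, the local download could use `huggingface_hub` directly (a sketch; the checkpoint filename below is an assumption, so check the file list of the model repository):
```python
from huggingface_hub import hf_hub_download
from moge.model import MoGeModel

# Download the checkpoint once, then load it from the local path afterwards.
# NOTE: "model.pt" is an assumed filename; use the one listed in Ruicheng/moge-vitl.
local_path = hf_hub_download(repo_id="Ruicheng/moge-vitl", filename="model.pt")
model = MoGeModel.from_pretrained(local_path)
```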
### Minimal example
Here is a minimal example for loading the model and inferring on a single image.
```python
import cv2
import torch
from moge.model import MoGeModel
device = torch.device("cuda")
# Load the model from huggingface hub (or load from local).
model = MoGeModel.from_pretrained("Ruicheng/moge-vitl").to(device)
# Read the input image and convert to tensor (3, H, W) and normalize to [0, 1]
input_image = cv2.cvtColor(cv2.imread("PATH_TO_IMAGE.jpg"), cv2.COLOR_BGR2RGB)
input_image = torch.tensor(input_image / 255, dtype=torch.float32, device=device).permute(2, 0, 1)
# Infer
output = model.infer(input_image)
# `output` has keys "points", "depth", "mask" and "intrinsics".
# The maps are the same size as the input image.
# {
# "points": (H, W, 3), # scale-invariant point map in OpenCV camera coordinate system (x right, y down, z forward)
# "depth": (H, W), # scale-invariant depth map
# "mask": (H, W), # a binary mask for valid pixels.
# "intrinsics": (3, 3), # normalized camera intrinsics
# }
# For more usage details, see the `MoGeModel.infer` docstring.
```
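As a follow-up, the outputs can be converted to NumPy and exported, e.g. as a colored point cloud. The sketch below is one possible way to do this, assuming `trimesh` is installed (it is already used by the repo's `--show` visualization); variable names continue from the snippet above.
```python
import numpy as np
import trimesh

# Move the outputs to CPU NumPy arrays.
points = output["points"].cpu().numpy()                  # (H, W, 3)
depth = output["depth"].cpu().numpy()                    # (H, W)
mask = output["mask"].cpu().numpy().astype(bool)         # (H, W)
colors = (input_image.permute(1, 2, 0).cpu().numpy() * 255).astype(np.uint8)

# Export the valid pixels as a colored point cloud.
trimesh.PointCloud(points[mask], colors=colors[mask]).export("points.ply")

# Keep the depth map around, masking out pixels with undefined geometry.
np.save("depth.npy", np.where(mask, depth, np.nan))
```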
### Using [scripts/app.py](scripts/app.py) for a web demo
Make sure that `gradio` is installed and then run the following command to start the web demo:
```bash
python scripts/app.py # --share for Gradio public sharing
```
The web demo is also available at our [Hugging Face space](https://huggingface.co/spaces/Ruicheng/MoGe).
### Using [scripts/infer.py](scripts/infer.py)
Run the script `scripts/infer.py` via the following command:
```bash
# Save the output [maps], [glb] and [ply] files
python scripts/infer.py --input IMAGES_FOLDER_OR_IMAGE_PATH --output OUTPUT_FOLDER --maps --glb --ply
# Show the result in a window (requires pyglet < 2.0, e.g. pip install pyglet==1.5.29)
python scripts/infer.py --input IMAGES_FOLDER_OR_IMAGE_PATH --output OUTPUT_FOLDER --show
```
For detailed options, run `python scripts/infer.py --help`:
```
Usage: infer.py [OPTIONS]

  Inference script for the MoGe model.

Options:
  --input PATH                Input image or folder path. "jpg" and "png" are
                              supported.
  --fov_x FLOAT               If camera parameters are known, set the
                              horizontal field of view in degrees. Otherwise,
                              MoGe will estimate it.
  --output PATH               Output folder path
  --pretrained TEXT           Pretrained model name or path. Default is
                              "Ruicheng/moge-vitl"
  --device TEXT               Device name (e.g. "cuda", "cuda:0", "cpu").
                              Default is "cuda"
  --resize INTEGER            Resize the image(s) & output maps to a specific
                              size. Default is None (no resizing).
  --resolution_level INTEGER  An integer [0-9] for the resolution level of
                              inference. The higher, the better but slower.
                              Default is 9. Note that it does not affect the
                              output resolution.
  --threshold FLOAT           Threshold for removing edges. Default is 0.03.
                              Smaller value removes more edges. "inf" means no
                              thresholding.
  --maps                      Whether to save the output maps and fov (image,
                              depth, mask, points, fov).
  --glb                       Whether to save the output as a .glb file. The
                              color will be saved as a texture.
  --ply                       Whether to save the output as a .ply file. The
                              color will be saved as vertex colors.
  --show                      Whether to show the output in a window. Note
                              that this requires pyglet<2 installed as
                              required by trimesh.
  --help                      Show this message and exit.
```
### Using [scripts/infer_panorama.py](scripts/infer_panorama.py) for 360° panorama images
> *NOTE: This is an experimental extension of MoGe.*
The script will split the 360-degree panorama image into multiple perspective views and infer on each view separately.
The output maps will be combined to produce a panorama depth map and point map.
Note that the panorama image must have spherical parameterization (e.g., environment maps or equirectangular images). Other formats must be converted to spherical format before using this script. Run `python scripts/infer_panorama.py --help` for detailed options.
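For reference, "spherical parameterization" means every pixel corresponds to a direction on the unit sphere. A rough sketch of that mapping for an equirectangular image is below; the axis convention is an assumption and the actual script may use a different one.
```python
import numpy as np

def equirectangular_directions(width, height):
    # Longitude spans [-pi, pi) left to right, latitude [pi/2, -pi/2] top to bottom.
    lon = (np.arange(width) + 0.5) / width * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)
    # Assumed convention: x right, y up, z forward.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # (height, width, 3) unit directions
```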
<div align="center">
<img src="./assets/panorama_pipeline.png" width="80%">
The photo is from [this URL](https://commons.wikimedia.org/wiki/Category:360%C2%B0_panoramas_with_equirectangular_projection#/media/File:Braunschweig_Sankt-%C3%84gidien_Panorama_02.jpg)
</div>
## License
MoGe code is released under the MIT license, except for the DINOv2 code in `moge/model/dinov2`, which is released by Meta AI under the Apache 2.0 license.
See [LICENSE](LICENSE) for more details.
## Citation
If you find our work useful in your research, please consider citing our paper:
```
@misc{wang2024moge,
title={MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision},
author={Wang, Ruicheng and Xu, Sicheng and Dai, Cassie and Xiang, Jianfeng and Deng, Yu and Tong, Xin and Yang, Jiaolong},
year={2024},
eprint={2410.19115},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2410.19115},
}
```