Commit 4c609ee ("final")
Committed by tayhan · Parent: 127ee62

This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
- .gitattributes +4 -0
- Dockerfile +6 -0
- LICENSE.md +14 -0
- LICENSE_Lavis.md +14 -0
- MiniGPT_4.pdf +3 -0
- PrepareVicuna.md +35 -0
- README.md +170 -10
- __pycache__/demo.cpython-39.pyc +0 -0
- api.py +107 -0
- dataset/README_1_STAGE.md +96 -0
- dataset/README_2_STAGE.md +19 -0
- dataset/convert_cc_sbu.py +20 -0
- dataset/convert_laion.py +20 -0
- dataset/download_cc_sbu.sh +6 -0
- dataset/download_laion.sh +6 -0
- demo.py +154 -0
- environment.yml +63 -0
- eval_configs/minigpt4_eval.yaml +25 -0
- examples/ad_1.png +3 -0
- examples/ad_2.png +3 -0
- examples/cook_1.png +3 -0
- examples/cook_2.png +3 -0
- examples/describe_1.png +3 -0
- examples/describe_2.png +3 -0
- examples/fact_1.png +3 -0
- examples/fact_2.png +3 -0
- examples/fix_1.png +3 -0
- examples/fix_2.png +3 -0
- examples/fun_1.png +3 -0
- examples/fun_2.png +3 -0
- examples/logo_1.png +3 -0
- examples/op_1.png +3 -0
- examples/op_2.png +3 -0
- examples/people_1.png +3 -0
- examples/people_2.png +3 -0
- examples/rhyme_1.png +3 -0
- examples/rhyme_2.png +3 -0
- examples/story_1.png +3 -0
- examples/story_2.png +3 -0
- examples/web_1.png +3 -0
- examples/wop_1.png +3 -0
- examples/wop_2.png +3 -0
- figs/examples/ad_1.png +3 -0
- figs/examples/ad_2.png +3 -0
- figs/examples/cook_1.png +3 -0
- figs/examples/cook_2.png +3 -0
- figs/examples/describe_1.png +3 -0
- figs/examples/describe_2.png +3 -0
- figs/examples/fact_1.png +3 -0
- figs/examples/fact_2.png +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+. filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
+figs/*.png filter=lfs diff=lfs merge=lfs -text
Dockerfile
ADDED
@@ -0,0 +1,6 @@
+FROM conda/miniconda3
+RUN apt-get install -y git
+RUN git clone https://github.com/Vision-CAIR/MiniGPT-4.git
+WORKDIR /MiniGPT-4
+RUN conda env create -f environment.yml
+RUN conda activate minigpt4
LICENSE.md
ADDED
@@ -0,0 +1,14 @@
+BSD 3-Clause License
+
+Copyright 2023 Deyao Zhu
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
LICENSE_Lavis.md
ADDED
@@ -0,0 +1,14 @@
+BSD 3-Clause License
+
+Copyright (c) 2022 Salesforce, Inc.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
MiniGPT_4.pdf
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e6d3843b238d5cceb7fd1f6d07582196e18fafa3ed02b65d9fbf089532819d1c
+size 6616060
PrepareVicuna.md
ADDED
@@ -0,0 +1,35 @@
+## How to Prepare Vicuna Weight
+Vicuna is an open-source LLaMA-based LLM whose performance is close to ChatGPT.
+We currently use the v0 version of Vicuna-13B.
+
+To prepare Vicuna's weight, first download Vicuna's **delta** weight from [https://huggingface.co/lmsys/vicuna-13b-delta-v0](https://huggingface.co/lmsys/vicuna-13b-delta-v0).
+If you have git-lfs installed (https://git-lfs.com), this can be done by
+
+```
+git lfs install
+git clone https://huggingface.co/lmsys/vicuna-13b-delta-v0  # more powerful, needs at least 24G GPU memory
+# or
+git clone https://huggingface.co/lmsys/vicuna-7b-delta-v0  # smaller, needs 12G GPU memory
+```
+
+Note that this is not directly the working weight, but the difference between the working weight and the original weight of LLaMA-13B. (Due to LLaMA's rules, we cannot distribute the LLaMA weights.)
+
+Then, you need to obtain the original LLaMA-7B or LLaMA-13B weights in the HuggingFace format,
+either following the instructions provided by HuggingFace
+[here](https://huggingface.co/docs/transformers/main/model_doc/llama) or from the Internet.
+
+When these two weights are ready, we can use tools from Vicuna's team to create the real working weight.
+First, install their library that is compatible with v0 Vicuna:
+
+```
+pip install git+https://github.com/lm-sys/[email protected]
+```
+
+Then, run the following command to create the final working weight:
+
+```
+python -m fastchat.model.apply_delta --base /path/to/llama-13bOR7b-hf/  --target /path/to/save/working/vicuna/weight/  --delta /path/to/vicuna-13bOR7b-delta-v0/
+```
+
+Now you are good to go!
+
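Once `apply_delta` finishes, a quick way to confirm the merged folder is usable is to load it with `transformers` (already pinned in environment.yml) and generate a few tokens. This is an editor-added sketch, not part of this commit; the path is the placeholder from the command above, and loading a 13B model this way needs a large amount of memory.

```python
# Editor-added sanity check (not part of this commit): load the merged Vicuna weights
# produced by fastchat.model.apply_delta and generate a short completion.
from transformers import LlamaForCausalLM, LlamaTokenizer

weights_dir = "/path/to/save/working/vicuna/weight/"  # placeholder path from the command above

tokenizer = LlamaTokenizer.from_pretrained(weights_dir)
model = LlamaForCausalLM.from_pretrained(weights_dir)  # add device_map / torch_dtype to fit your hardware

inputs = tokenizer("Hello, I am Vicuna and", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```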
README.md
CHANGED
@@ -1,10 +1,170 @@
-
-
-
-
-
-
-
-
-
-
+# MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models
+[Deyao Zhu](https://tsutikgiau.github.io/)* (On Job Market!), [Jun Chen](https://junchen14.github.io/)* (On Job Market!), [Xiaoqian Shen](https://xiaoqian-shen.github.io), [Xiang Li](https://xiangli.ac.cn), and [Mohamed Elhoseiny](https://www.mohamed-elhoseiny.com/). *Equal Contribution
+
+**King Abdullah University of Science and Technology**
+
+<a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2304.10592'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/spaces/Vision-CAIR/minigpt4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue'></a> <a href='https://huggingface.co/Vision-CAIR/MiniGPT-4'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a> [](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing) [](https://www.youtube.com/watch?v=__tftoxpBAw&feature=youtu.be)
+
+
+## News
+We now provide a pretrained MiniGPT-4 aligned with Vicuna-7B! The demo GPU memory consumption can now be as low as 12GB.
+
+
+## Online Demo
+
+Click the image to chat with MiniGPT-4 around your images
+[](https://minigpt-4.github.io)
+
+
+## Examples
+|   |   |
+:-------------------------:|:-------------------------:
+ | 
+ | 
+
+More examples can be found on the [project page](https://minigpt-4.github.io).
+
+
+
+## Introduction
+- MiniGPT-4 aligns a frozen visual encoder from BLIP-2 with a frozen LLM, Vicuna, using just one projection layer.
+- We train MiniGPT-4 in two stages. The first traditional pretraining stage is trained on roughly 5 million aligned image-text pairs in 10 hours using 4 A100s. After the first stage, Vicuna is able to understand the image, but its generation ability is heavily impacted.
+- To address this issue and improve usability, we propose a novel way to create high-quality image-text pairs with the model itself and ChatGPT together. Based on this, we then create a small (3500 pairs in total) yet high-quality dataset.
+- The second finetuning stage is trained on this dataset in a conversation template to significantly improve its generation reliability and overall usability. To our surprise, this stage is computationally efficient and takes only around 7 minutes on a single A100.
+- MiniGPT-4 yields many emerging vision-language capabilities similar to those demonstrated in GPT-4.
+
+
+
+
+## Getting Started
+### Installation
+
+**1. Prepare the code and the environment**
+
+Git clone our repository, create a Python environment, and activate it via the following commands
+
+```bash
+git clone https://github.com/Vision-CAIR/MiniGPT-4.git
+cd MiniGPT-4
+conda env create -f environment.yml
+conda activate minigpt4
+```
+
+
+**2. Prepare the pretrained Vicuna weights**
+
+The current version of MiniGPT-4 is built on the v0 version of Vicuna-13B.
+Please refer to our instructions [here](PrepareVicuna.md)
+to prepare the Vicuna weights.
+The final weights should be in a single folder with a structure similar to the following:
+
+```
+vicuna_weights
+├── config.json
+├── generation_config.json
+├── pytorch_model.bin.index.json
+├── pytorch_model-00001-of-00003.bin
+...
+```
+
+Then, set the path to the Vicuna weights in the model config file
+[here](minigpt4/configs/models/minigpt4.yaml#L16) at Line 16.
+
+**3. Prepare the pretrained MiniGPT-4 checkpoint**
+
+Download the pretrained checkpoint according to the Vicuna model you prepared.
+
+| Checkpoint Aligned with Vicuna 13B | Checkpoint Aligned with Vicuna 7B |
+:------------------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------:
+[Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link) | [Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing)
+
+
+Then, set the path to the pretrained checkpoint in the evaluation config file
+[eval_configs/minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml#L10) at Line 11.
+
+
+
+### Launching Demo Locally
+
+Try out our demo [demo.py](demo.py) on your local machine by running
+
+```
+python demo.py --cfg-path eval_configs/minigpt4_eval.yaml  --gpu-id 0
+```
+
+To save GPU memory, Vicuna loads in 8 bit by default, with a beam search width of 1.
+This configuration requires about 23G of GPU memory for Vicuna 13B and 11.5G for Vicuna 7B.
+For more powerful GPUs, you can run the model
+in 16 bit by setting low_resource to False in the config file
+[minigpt4_eval.yaml](eval_configs/minigpt4_eval.yaml) and using a larger beam search width.
+
+Thanks to [@WangRongsheng](https://github.com/WangRongsheng), you can also run our code on [Colab](https://colab.research.google.com/drive/1OK4kYsZphwt5DXchKkzMBjYF6jnkqh4R?usp=sharing)
+
+
+### Training
+The training of MiniGPT-4 contains two alignment stages.
+
+**1. First pretraining stage**
+
+In the first pretraining stage, the model is trained using image-text pairs from the Laion and CC datasets
+to align the vision and language models. To download and prepare the datasets, please check
+our [first stage dataset preparation instructions](dataset/README_1_STAGE.md).
+After the first stage, the visual features are mapped and can be understood by the language
+model.
+To launch the first stage training, run the following command. In our experiments, we use 4 A100s.
+You can change the save path in the config file
+[train_configs/minigpt4_stage1_pretrain.yaml](train_configs/minigpt4_stage1_pretrain.yaml)
+
+```bash
+torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage1_pretrain.yaml
+```
+
+A MiniGPT-4 checkpoint with only stage one training can be downloaded
+[here (13B)](https://drive.google.com/file/d/1u9FRRBB3VovP1HxCAlpD9Lw4t4P6-Yq8/view?usp=share_link) or [here (7B)](https://drive.google.com/file/d/1HihQtCEXUyBM1i9DQbaK934wW3TZi-h5/view?usp=share_link).
+Compared to the model after stage two, this checkpoint frequently generates incomplete and repeated sentences.
+
+
+**2. Second finetuning stage**
+
+In the second stage, we use a small, high-quality image-text pair dataset created by ourselves
+and convert it to a conversation format to further align MiniGPT-4.
+To download and prepare our second stage dataset, please check our
+[second stage dataset preparation instructions](dataset/README_2_STAGE.md).
+To launch the second stage alignment,
+first specify the path to the checkpoint file trained in stage 1 in
+[train_configs/minigpt4_stage2_finetune.yaml](train_configs/minigpt4_stage2_finetune.yaml).
+You can also specify the output path there.
+Then, run the following command. In our experiments, we use 1 A100.
+
+```bash
+torchrun --nproc-per-node NUM_GPU train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
+```
+
+After the second stage alignment, MiniGPT-4 is able to talk about the image coherently and in a user-friendly way.
+
+
+
+
+## Acknowledgement
+
++ [BLIP2](https://huggingface.co/docs/transformers/main/model_doc/blip-2) The model architecture of MiniGPT-4 follows BLIP-2. Don't forget to check out this great open-source work if you don't know it already!
++ [Lavis](https://github.com/salesforce/LAVIS) This repository is built upon Lavis!
++ [Vicuna](https://github.com/lm-sys/FastChat) The fantastic language ability of Vicuna with only 13B parameters is just amazing. And it is open-source!
+
+
+If you're using MiniGPT-4 in your research or applications, please cite using this BibTeX:
+```bibtex
+@article{zhu2023minigpt,
+  title={MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models},
+  author={Zhu, Deyao and Chen, Jun and Shen, Xiaoqian and Li, Xiang and Elhoseiny, Mohamed},
+  journal={arXiv preprint arXiv:2304.10592},
+  year={2023}
+}
+```
+
+
+## License
+This repository is under the [BSD 3-Clause License](LICENSE.md).
+Many codes are based on [Lavis](https://github.com/salesforce/LAVIS) with
+BSD 3-Clause License [here](LICENSE_Lavis.md).
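The demo section above quotes rough GPU memory needs (about 23G for Vicuna 13B and 11.5G for Vicuna 7B in 8-bit mode). A small editor-added pre-flight check using only PyTorch calls can confirm the visible GPU has enough total memory before launching demo.py; the 12 GB threshold below is an assumption matching the 7B figure.

```python
# Editor-added pre-flight check (not part of this commit): report total memory of the
# GPU that demo.py will use, so an out-of-memory failure is caught before model loading.
import torch

GPU_ID = 0                 # matches the --gpu-id argument of demo.py
MIN_REQUIRED_GB = 12.0     # assumed floor for the Vicuna-7B 8-bit demo; ~23 GB for 13B

if not torch.cuda.is_available():
    print("No CUDA device visible; the demo expects a GPU.")
else:
    props = torch.cuda.get_device_properties(GPU_ID)
    total_gb = props.total_memory / 1024 ** 3
    status = "ok" if total_gb >= MIN_REQUIRED_GB else "probably too small"
    print(f"GPU {GPU_ID}: {props.name}, {total_gb:.1f} GB total ({status})")
```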
__pycache__/demo.cpython-39.pyc
ADDED
Binary file (114 Bytes)
api.py
ADDED
@@ -0,0 +1,107 @@
+import argparse
+import os
+import random
+from flask import Flask, redirect, url_for, request
+
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+import gradio as gr
+
+from minigpt4.common.config import Config
+from minigpt4.common.dist_utils import get_rank
+from minigpt4.common.registry import registry
+from minigpt4.conversation.conversation import Chat, CONV_VISION
+
+# imports modules for registration
+from minigpt4.datasets.builders import *
+from minigpt4.models import *
+from minigpt4.processors import *
+from minigpt4.runners import *
+from minigpt4.tasks import *
+from PIL import Image
+import requests
+
+
+from huggingface_hub import login
+login("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")
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Demo")
+    parser.add_argument("--cfg-path", required=True, help="path to configuration file.")
+    parser.add_argument("--gpu-id", type=int, default=0, help="specify the gpu to load the model.")
+    parser.add_argument(
+        "--options",
+        nargs="+",
+        help="override some settings in the used config, the key-value pair "
+             "in xxx=yyy format will be merged into config file (deprecate), "
+             "change to --cfg-options instead.",
+    )
+    args = parser.parse_args()
+    return args
+
+
+def setup_seeds(config):
+    seed = config.run_cfg.seed + get_rank()
+
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+
+    cudnn.benchmark = False
+    cudnn.deterministic = True
+
+
+# ========================================
+# Model Initialization
+# ========================================
+
+print('Initializing Chat')
+args = parse_args()
+cfg = Config(args)
+
+model_config = cfg.model_cfg
+model_config.device_8bit = args.gpu_id
+model_cls = registry.get_model_class(model_config.arch)
+model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
+
+vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
+vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
+chat = Chat(model, vis_processor, device='cuda:{}'.format(args.gpu_id))
+print('Initialization Finished')
+
+
+# Example requests:
+# curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d "user_message=Response in json format with keys image_description, name, objects, object_name, object_color. " http://127.0.0.1:5000
+# curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d "user_message=describe the image" http://127.0.0.1:5000
+
+app = Flask(__name__)
+app.config["DEBUG"] = False
+
+
+@app.route('/', methods=['POST', 'GET'])
+def home():
+    user_message = request.form['user_message']
+    image = Image.open(requests.get(request.form['image'], stream=True).raw)
+
+    print(user_message)
+    chat_state = CONV_VISION.copy()
+    chat_state.messages = []
+    img_list = []
+    llm_message = chat.upload_img(image, chat_state, img_list)
+    chat.ask(user_message, chat_state)
+
+    llm_message = chat.answer(conv=chat_state,
+                              img_list=img_list,
+                              num_beams=5,
+                              temperature=1,
+                              max_new_tokens=600,
+                              max_length=2000)[0]
+    return llm_message
+
+app.run(host='0.0.0.0')
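api.py exposes a single Flask route that reads two form fields, `user_message` and `image` (an image URL that the server fetches itself), and returns the model's reply as plain text. The curl commands in the file show one way to call it; below is an equivalent editor-added Python client sketch (the image URL is a placeholder).

```python
# Editor-added client sketch for the Flask endpoint defined in api.py above.
# It mirrors the curl examples: both fields are sent as form data, and the
# response body is the text returned by chat.answer().
import requests

API_URL = "http://127.0.0.1:5000/"          # api.py serves on 0.0.0.0:5000 by default
IMAGE_URL = "https://example.com/cat.jpg"   # placeholder; api.py downloads this URL itself

resp = requests.post(
    API_URL,
    data={
        "user_message": "describe the image",
        "image": IMAGE_URL,
    },
    timeout=300,  # generation with num_beams=5 and up to 600 new tokens can take a while
)
resp.raise_for_status()
print(resp.text)
```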
dataset/README_1_STAGE.md
ADDED
@@ -0,0 +1,96 @@
+## Download the filtered Conceptual Captions, SBU, LAION datasets
+
+### Pre-training datasets download:
+We use the filtered synthetic captions prepared by BLIP. For more details about the dataset, please refer to [BLIP](https://github.com/salesforce/BLIP).
+
+It requires ~2.3TB to store the LAION and CC3M+CC12M+SBU datasets.
+
+Image source | Filtered synthetic caption by ViT-L
+--- | :---:
+CC3M+CC12M+SBU | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/ccs_synthetic_filtered_large.json">Download</a>
+LAION115M | <a href="https://storage.googleapis.com/sfr-vision-language-research/BLIP/datasets/laion_synthetic_filtered_large.json">Download</a>
+
+This will download two json files
+```
+ccs_synthetic_filtered_large.json
+laion_synthetic_filtered_large.json
+```
+
+## Prepare the data step-by-step
+
+
+### Set up the dataset folder and move the annotation files to the data storage folder
+```
+export MINIGPT4_DATASET=/YOUR/PATH/FOR/LARGE/DATASET/
+mkdir ${MINIGPT4_DATASET}/cc_sbu
+mkdir ${MINIGPT4_DATASET}/laion
+mv ccs_synthetic_filtered_large.json ${MINIGPT4_DATASET}/cc_sbu
+mv laion_synthetic_filtered_large.json ${MINIGPT4_DATASET}/laion
+```
+
+### Copy the conversion scripts to the data storage folder
+```
+cp convert_cc_sbu.py ${MINIGPT4_DATASET}/cc_sbu
+cp download_cc_sbu.sh ${MINIGPT4_DATASET}/cc_sbu
+cp convert_laion.py ${MINIGPT4_DATASET}/laion
+cp download_laion.sh ${MINIGPT4_DATASET}/laion
+```
+
+
+### Convert the laion and cc_sbu annotation file format to the img2dataset format
+```
+cd ${MINIGPT4_DATASET}/cc_sbu
+python convert_cc_sbu.py
+
+cd ${MINIGPT4_DATASET}/laion
+python convert_laion.py
+```
+
+### Download the datasets with img2dataset
+```
+cd ${MINIGPT4_DATASET}/cc_sbu
+sh download_cc_sbu.sh
+cd ${MINIGPT4_DATASET}/laion
+sh download_laion.sh
+```
+
+
+The final dataset structure
+
+```
+.
+├── ${MINIGPT4_DATASET}
+│   ├── cc_sbu
+│   │   ├── convert_cc_sbu.py
+│   │   ├── download_cc_sbu.sh
+│   │   ├── ccs_synthetic_filtered_large.json
+│   │   ├── ccs_synthetic_filtered_large.tsv
+│   │   └── cc_sbu_dataset
+│   │       ├── 00000.tar
+│   │       ├── 00000.parquet
+│   │       ...
+│   ├── laion
+│   │   ├── convert_laion.py
+│   │   ├── download_laion.sh
+│   │   ├── laion_synthetic_filtered_large.json
+│   │   ├── laion_synthetic_filtered_large.tsv
+│   │   └── laion_dataset
+│   │       ├── 00000.tar
+│   │       ├── 00000.parquet
+│   │       ...
+...
+```
+
+
+## Set up the dataset configuration files
+
+Then, set up the LAION dataset loading path in
+[here](../minigpt4/configs/datasets/laion/defaults.yaml#L5) at Line 5 as
+${MINIGPT4_DATASET}/laion/laion_dataset/{00000..10488}.tar
+
+and the Conceptual Caption and SBU datasets loading path in
+[here](../minigpt4/configs/datasets/cc_sbu/defaults.yaml#L5) at Line 5 as
+${MINIGPT4_DATASET}/cc_sbu/cc_sbu_dataset/{00000..01255}.tar
+
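The config paths above use bash-style brace ranges such as {00000..10488}.tar, and the correct upper bound depends on how many shards img2dataset actually produced on your machine. A small editor-added helper (hypothetical, not in the repo) prints the range to paste into the dataset config files:

```python
# Editor-added helper (not part of this commit): print the {first..last}.tar brace range
# for the shards that img2dataset actually wrote, to paste into the dataset config files.
import glob
import os

def shard_range(dataset_dir: str) -> str:
    shards = sorted(glob.glob(os.path.join(dataset_dir, "*.tar")))
    if not shards:
        raise SystemExit(f"no .tar shards found in {dataset_dir}")
    first = os.path.splitext(os.path.basename(shards[0]))[0]
    last = os.path.splitext(os.path.basename(shards[-1]))[0]
    return f"{dataset_dir}/{{{first}..{last}}}.tar"

root = os.environ["MINIGPT4_DATASET"]  # same variable exported in the steps above
print(shard_range(os.path.join(root, "laion", "laion_dataset")))
print(shard_range(os.path.join(root, "cc_sbu", "cc_sbu_dataset")))
```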
dataset/README_2_STAGE.md
ADDED
@@ -0,0 +1,19 @@
+## Second Stage Data Preparation
+
+Our second stage dataset can be downloaded from
+[here](https://drive.google.com/file/d/1nJXhoEcy3KTExr17I7BXqY5Y9Lx_-n-9/view?usp=share_link)
+After extraction, you will get a data folder with the following structure:
+
+```
+cc_sbu_align
+├── filter_cap.json
+└── image
+    ├── 2.jpg
+    ├── 3.jpg
+    ...
+```
+
+Put the folder at any path you want.
+Then, set up the dataset path in the dataset config file
+[here](../minigpt4/configs/datasets/cc_sbu/align.yaml#L5) at Line 5.
+
dataset/convert_cc_sbu.py
ADDED
@@ -0,0 +1,20 @@
+import json
+import csv
+
+# specify input and output file paths
+input_file = 'ccs_synthetic_filtered_large.json'
+output_file = 'ccs_synthetic_filtered_large.tsv'
+
+# load JSON data from input file
+with open(input_file, 'r') as f:
+    data = json.load(f)
+
+# extract header and data from JSON
+header = data[0].keys()
+rows = [x.values() for x in data]
+
+# write data to TSV file
+with open(output_file, 'w') as f:
+    writer = csv.writer(f, delimiter='\t')
+    writer.writerow(header)
+    writer.writerows(rows)
dataset/convert_laion.py
ADDED
@@ -0,0 +1,20 @@
+import json
+import csv
+
+# specify input and output file paths
+input_file = 'laion_synthetic_filtered_large.json'
+output_file = 'laion_synthetic_filtered_large.tsv'
+
+# load JSON data from input file
+with open(input_file, 'r') as f:
+    data = json.load(f)
+
+# extract header and data from JSON
+header = data[0].keys()
+rows = [x.values() for x in data]
+
+# write data to TSV file
+with open(output_file, 'w') as f:
+    writer = csv.writer(f, delimiter='\t')
+    writer.writerow(header)
+    writer.writerows(rows)
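convert_cc_sbu.py and convert_laion.py are identical except for the hard-coded file names. If the duplication ever becomes a maintenance issue, the logic could be factored into one shared function; this is an editor-added sketch, not part of the commit:

```python
# Editor-added refactoring sketch (not in this commit): the shared JSON-to-TSV logic
# from convert_cc_sbu.py and convert_laion.py as a single reusable function.
import csv
import json

def json_to_tsv(input_file: str, output_file: str) -> None:
    """Convert a BLIP-style caption JSON list into the TSV expected by img2dataset."""
    with open(input_file, "r") as f:
        data = json.load(f)
    header = data[0].keys()
    rows = [x.values() for x in data]
    with open(output_file, "w") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)

if __name__ == "__main__":
    json_to_tsv("ccs_synthetic_filtered_large.json", "ccs_synthetic_filtered_large.tsv")
    json_to_tsv("laion_synthetic_filtered_large.json", "laion_synthetic_filtered_large.tsv")
```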
dataset/download_cc_sbu.sh
ADDED
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+img2dataset --url_list ccs_synthetic_filtered_large.tsv --input_format "tsv"\
+ --url_col "url" --caption_col "caption" --output_format webdataset\
+ --output_folder cc_sbu_dataset --processes_count 16 --thread_count 128 --image_size 256 \
+ --enable_wandb True
dataset/download_laion.sh
ADDED
@@ -0,0 +1,6 @@
+#!/bin/bash
+
+img2dataset --url_list laion_synthetic_filtered_large.tsv --input_format "tsv"\
+ --url_col "url" --caption_col "caption" --output_format webdataset\
+ --output_folder laion_dataset --processes_count 16 --thread_count 128 --image_size 256 \
+ --enable_wandb True
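Both download scripts drive the img2dataset CLI. If it is more convenient to launch the download from Python (for example from a notebook), img2dataset also ships a download() function whose keyword arguments mirror the CLI flags; the sketch below is editor-added and assumes that API, so check it against the img2dataset version you install.

```python
# Editor-added alternative (not in this commit): the same LAION download driven from
# Python via img2dataset's download() function, assuming its kwargs mirror the CLI flags.
from img2dataset import download

download(
    url_list="laion_synthetic_filtered_large.tsv",
    input_format="tsv",
    url_col="url",
    caption_col="caption",
    output_format="webdataset",
    output_folder="laion_dataset",
    processes_count=16,
    thread_count=128,
    image_size=256,
    enable_wandb=True,
)
```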
demo.py
ADDED
@@ -0,0 +1,154 @@
+import argparse
+import os
+import random
+import flask
+
+import numpy as np
+import torch
+import torch.backends.cudnn as cudnn
+import gradio as gr
+
+from minigpt4.common.config import Config
+from minigpt4.common.dist_utils import get_rank
+from minigpt4.common.registry import registry
+from minigpt4.conversation.conversation import Chat, CONV_VISION
+
+# imports modules for registration
+from minigpt4.datasets.builders import *
+from minigpt4.models import *
+from minigpt4.processors import *
+from minigpt4.runners import *
+from minigpt4.tasks import *
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Demo")
+    parser.add_argument("--cfg-path", required=True, help="path to configuration file.")
+    parser.add_argument("--gpu-id", type=int, default=0, help="specify the gpu to load the model.")
+    parser.add_argument(
+        "--options",
+        nargs="+",
+        help="override some settings in the used config, the key-value pair "
+             "in xxx=yyy format will be merged into config file (deprecate), "
+             "change to --cfg-options instead.",
+    )
+    args = parser.parse_args()
+    return args
+
+
+def setup_seeds(config):
+    seed = config.run_cfg.seed + get_rank()
+
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+
+    cudnn.benchmark = False
+    cudnn.deterministic = True
+
+
+# ========================================
+# Model Initialization
+# ========================================
+
+print('Initializing Chat')
+args = parse_args()
+cfg = Config(args)
+
+model_config = cfg.model_cfg
+model_config.device_8bit = args.gpu_id
+model_cls = registry.get_model_class(model_config.arch)
+model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
+
+vis_processor_cfg = cfg.datasets_cfg.cc_sbu_align.vis_processor.train
+vis_processor = registry.get_processor_class(vis_processor_cfg.name).from_config(vis_processor_cfg)
+chat = Chat(model, vis_processor, device='cuda:{}'.format(args.gpu_id))
+print('Initialization Finished')
+
+# ========================================
+# Gradio Setting
+# ========================================
+
+def gradio_reset(chat_state, img_list):
+    if chat_state is not None:
+        chat_state.messages = []
+    if img_list is not None:
+        img_list = []
+    return None, gr.update(value=None, interactive=True), gr.update(placeholder='Please upload your image first', interactive=False), gr.update(value="Upload & Start Chat", interactive=True), chat_state, img_list
+
+def upload_img(gr_img, text_input, chat_state):
+    if gr_img is None:
+        return None, None, gr.update(interactive=True), chat_state, None
+    chat_state = CONV_VISION.copy()
+    img_list = []
+    llm_message = chat.upload_img(gr_img, chat_state, img_list)
+    return gr.update(interactive=False), gr.update(interactive=True, placeholder='Type and press Enter'), gr.update(value="Start Chatting", interactive=False), chat_state, img_list
+
+def gradio_ask(user_message, chatbot, chat_state):
+    if len(user_message) == 0:
+        return gr.update(interactive=True, placeholder='Input should not be empty!'), chatbot, chat_state
+    chat.ask(user_message, chat_state)
+    chatbot = chatbot + [[user_message, None]]
+    return '', chatbot, chat_state
+
+
+def gradio_answer(chatbot, chat_state, img_list, num_beams, temperature):
+    llm_message = chat.answer(conv=chat_state,
+                              img_list=img_list,
+                              num_beams=num_beams,
+                              temperature=temperature,
+                              max_new_tokens=300,
+                              max_length=2000)[0]
+    chatbot[-1][1] = llm_message
+    return chatbot, chat_state, img_list
+
+title = """<h1 align="center">Demo of MiniGPT-4</h1>"""
+description = """<h3>This is the demo of MiniGPT-4. Upload your images and start chatting!</h3>"""
+article = """<p><a href='https://minigpt-4.github.io'><img src='https://img.shields.io/badge/Project-Page-Green'></a></p><p><a href='https://github.com/Vision-CAIR/MiniGPT-4'><img src='https://img.shields.io/badge/Github-Code-blue'></a></p><p><a href='https://raw.githubusercontent.com/Vision-CAIR/MiniGPT-4/main/MiniGPT_4.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a></p>
+"""
+
+#TODO show examples below
+
+with gr.Blocks() as demo:
+    gr.Markdown(title)
+    gr.Markdown(description)
+    gr.Markdown(article)
+
+    with gr.Row():
+        with gr.Column(scale=0.5):
+            image = gr.Image(type="pil")
+            upload_button = gr.Button(value="Upload & Start Chat", interactive=True, variant="primary")
+            clear = gr.Button("Restart")
+
+            num_beams = gr.Slider(
+                minimum=1,
+                maximum=10,
+                value=1,
+                step=1,
+                interactive=True,
+                label="beam search numbers)",
+            )
+
+            temperature = gr.Slider(
+                minimum=0.1,
+                maximum=2.0,
+                value=1.0,
+                step=0.1,
+                interactive=True,
+                label="Temperature",
+            )
+
+        with gr.Column():
+            chat_state = gr.State()
+            img_list = gr.State()
+            chatbot = gr.Chatbot(label='MiniGPT-4')
+            text_input = gr.Textbox(label='User', placeholder='Please upload your image first', interactive=False)
+
+    upload_button.click(upload_img, [image, text_input, chat_state], [image, text_input, upload_button, chat_state, img_list])
+
+    text_input.submit(gradio_ask, [text_input, chatbot, chat_state], [text_input, chatbot, chat_state]).then(
+        gradio_answer, [chatbot, chat_state, img_list, num_beams, temperature], [chatbot, chat_state, img_list]
+    )
+    clear.click(gradio_reset, [chat_state, img_list], [chatbot, image, text_input, upload_button, chat_state, img_list], queue=False)
+
+demo.launch(share=True, enable_queue=True)
environment.yml
ADDED
@@ -0,0 +1,63 @@
+name: minigpt4
+channels:
+  - pytorch
+  - defaults
+  - anaconda
+dependencies:
+  - python=3.9
+  - cudatoolkit
+  - pip
+  - pytorch=1.12.1
+  - pytorch-mutex=1.0=cuda
+  - torchaudio=0.12.1
+  - torchvision=0.13.1
+  - pip:
+    - accelerate==0.16.0
+    - aiohttp==3.8.4
+    - aiosignal==1.3.1
+    - async-timeout==4.0.2
+    - attrs==22.2.0
+    - bitsandbytes==0.37.0
+    - cchardet==2.1.7
+    - chardet==5.1.0
+    - contourpy==1.0.7
+    - cycler==0.11.0
+    - filelock==3.9.0
+    - fonttools==4.38.0
+    - frozenlist==1.3.3
+    - huggingface-hub==0.13.4
+    - importlib-resources==5.12.0
+    - kiwisolver==1.4.4
+    - matplotlib==3.7.0
+    - multidict==6.0.4
+    - openai==0.27.0
+    - packaging==23.0
+    - psutil==5.9.4
+    - pycocotools==2.0.6
+    - pyparsing==3.0.9
+    - python-dateutil==2.8.2
+    - pyyaml==6.0
+    - regex==2022.10.31
+    - tokenizers==0.13.2
+    - tqdm==4.64.1
+    - transformers==4.28.0
+    - timm==0.6.13
+    - spacy==3.5.1
+    - webdataset==0.2.48
+    - scikit-learn==1.2.2
+    - scipy==1.10.1
+    - yarl==1.8.2
+    - zipp==3.14.0
+    - omegaconf==2.3.0
+    - opencv-python==4.7.0.72
+    - iopath==0.1.10
+    - decord==0.6.0
+    - tenacity==8.2.2
+    - peft
+    - pycocoevalcap
+    - sentence-transformers
+    - umap-learn
+    - notebook
+    - gradio==3.24.1
+    - gradio-client==0.0.8
+    - wandb
eval_configs/minigpt4_eval.yaml
ADDED
@@ -0,0 +1,25 @@
+model:
+  arch: mini_gpt4
+  model_type: pretrain_vicuna
+  freeze_vit: True
+  freeze_qformer: True
+  max_txt_len: 160
+  end_sym: "###"
+  low_resource: False
+  prompt_path: "prompts/alignment.txt"
+  prompt_template: '###Human: {} ###Assistant: '
+  ckpt: '/app/MiniGPT-4/pretrained_minigpt4.pth'
+
+
+datasets:
+  cc_sbu_align:
+    vis_processor:
+      train:
+        name: "blip2_image_eval"
+        image_size: 224
+    text_processor:
+      train:
+        name: "blip_caption"
+
+run:
+  task: image_text_pretrain
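The `ckpt` path in this config is specific to this Space's container layout (/app/MiniGPT-4/...). A quick editor-added check with PyYAML (already pinned in environment.yml) prints the paths the demo will try to load, so a stale checkpoint or prompt path is caught before starting demo.py:

```python
# Editor-added sanity check (not part of this commit): print the checkpoint and prompt
# paths from the eval config and warn if they do not exist on this machine.
import os
import yaml

with open("eval_configs/minigpt4_eval.yaml") as f:
    cfg = yaml.safe_load(f)

for key in ("ckpt", "prompt_path"):
    path = cfg["model"][key]
    exists = "found" if os.path.exists(path) else "MISSING"
    print(f"{key}: {path} ({exists})")
```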
The remaining entries shown in this truncated view are example images stored with Git LFS (all ADDED):

- examples/ad_1.png, examples/ad_2.png
- examples/cook_1.png, examples/cook_2.png
- examples/describe_1.png, examples/describe_2.png
- examples/fact_1.png, examples/fact_2.png
- examples/fix_1.png, examples/fix_2.png
- examples/fun_1.png, examples/fun_2.png
- examples/logo_1.png
- examples/op_1.png, examples/op_2.png
- examples/people_1.png, examples/people_2.png
- examples/rhyme_1.png, examples/rhyme_2.png
- examples/story_1.png, examples/story_2.png
- examples/web_1.png
- examples/wop_1.png, examples/wop_2.png
- figs/examples/ad_1.png, figs/examples/ad_2.png
- figs/examples/cook_1.png, figs/examples/cook_2.png
- figs/examples/describe_1.png, figs/examples/describe_2.png
- figs/examples/fact_1.png, figs/examples/fact_2.png