---
title: Image Description with Qwen-VL
emoji: 🖼️
colorFrom: indigo
colorTo: purple
sdk: docker
sdk_version: 3.0.0
app_file: app.py
pinned: false
---

# Image Description Application with Qwen-VL

This application uses the Qwen-VL-Chat vision-language model to generate detailed descriptions of images. By default it describes the image in the `data_temp` folder, and it can also analyze any uploaded image.

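The choice between an uploaded image and the `data_temp` default can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the helper name `resolve_image`, the accepted suffixes, and the "first file in sorted order" rule are all assumptions.

```python
from pathlib import Path
from typing import Optional

# Assumed set of image extensions; the real application may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def resolve_image(uploaded_path: Optional[str], default_dir: str = "data_temp") -> Path:
    """Return the uploaded image if one was given, else the first image in default_dir."""
    if uploaded_path:
        path = Path(uploaded_path)
        if not path.is_file():
            raise FileNotFoundError(f"Uploaded image not found: {path}")
        return path

    # Fall back to the default folder, picking the first image by file name.
    candidates = sorted(
        p for p in Path(default_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if not candidates:
        raise FileNotFoundError(f"No image files found in {default_dir}/")
    return candidates[0]
```

Calling `resolve_image(None)` uses the default folder, while `resolve_image("photo.jpg")` uses the given file.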

## Features

- Loads an image from the `data_temp` folder or via upload
- Generates three types of descriptions:
  - Basic description (brief overview)
  - Detailed analysis (comprehensive description)
  - Technical analysis (assessment of technical aspects)
- Displays the image (optional)
- Uses 8-bit quantization for efficient model loading
- Provides a user-friendly Gradio UI

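To see why 8-bit quantization matters for loading, a back-of-the-envelope estimate of weight memory helps. The sketch below counts weights only (no activations or framework overhead), and the parameter count is an illustrative round number, not the actual size of Qwen-VL-Chat.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Illustrative round number only, NOT the real parameter count of the model.
example_params = 7e9

for bits in (16, 8):
    print(f"{bits}-bit weights: ~{weight_memory_gb(example_params, bits):.1f} GB")
# 16-bit weights: ~14.0 GB
# 8-bit weights: ~7.0 GB
```

Halving the bits per weight halves the weight footprint, which is what brings a model of this class within reach of smaller GPUs.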

## Requirements

- Python 3.8 or higher
- PyTorch
- Transformers (version 4.35.2+)
- Pillow
- Matplotlib
- Accelerate
- Bitsandbytes
- Safetensors
- Gradio for the web interface

## Hardware Requirements

This application uses a vision-language model, which requires:

- A CUDA-capable GPU with at least 8 GB VRAM
- 8 GB or more of system RAM

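A quick way to check a machine against the GPU requirement is to query PyTorch. The sketch below is not part of the repository; the helper names and the 0.5 GiB tolerance (cards marketed as 8 GB often report slightly less) are assumptions.

```python
def has_enough_vram(total_bytes: int, required_gib: float = 8.0,
                    tolerance_gib: float = 0.5) -> bool:
    """True if the reported VRAM is within tolerance of the required amount."""
    return total_bytes / 2**30 >= required_gib - tolerance_gib


def check_gpu() -> str:
    """Describe the first CUDA device, or explain why none is usable."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA-capable GPU detected"
    props = torch.cuda.get_device_properties(0)
    verdict = "OK" if has_enough_vram(props.total_memory) else "below the 8 GB guideline"
    return f"{props.name}: {props.total_memory / 2**30:.1f} GiB VRAM ({verdict})"
```

Running `print(check_gpu())` before starting the app reports the device name and memory, or the reason no GPU was found.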

## Deployment Options

### 1. Hugging Face Spaces (Recommended)

This repository is ready to be deployed on Hugging Face Spaces.

**Steps:**

1. Create a new Space on [Hugging Face Spaces](https://huggingface.co/spaces)
2. Select "Docker" as the Space SDK
3. Link this GitHub repository
4. Select a GPU (T4 or better is recommended)
5. Create the Space

The application will automatically deploy with the Gradio UI frontend.

### 2. AWS SageMaker

For production deployment on AWS SageMaker:

1. Package the application using the provided Dockerfile
2. Upload the Docker image to Amazon ECR
3. Create a SageMaker Model using the ECR image
4. Deploy an endpoint with an instance type like ml.g4dn.xlarge
5. Set up API Gateway for HTTP access (optional)

Detailed AWS instructions can be found in the `docs/aws_deployment.md` file.

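Steps 3 and 4 can be sketched with the `boto3` SageMaker client. This is an illustration under stated assumptions, not the procedure from `docs/aws_deployment.md`: the resource naming scheme, the single-instance variant, and the helper names are hypothetical, and the image URI and role ARN must come from your own account.

```python
def build_sagemaker_requests(name, image_uri, role_arn, instance_type="ml.g4dn.xlarge"):
    """Build the keyword arguments for the three SageMaker API calls."""
    config_name = f"{name}-config"  # hypothetical naming scheme
    model = {
        "ModelName": name,
        "PrimaryContainer": {"Image": image_uri},  # the image pushed to ECR
        "ExecutionRoleArn": role_arn,
    }
    endpoint_config = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    }
    endpoint = {"EndpointName": f"{name}-endpoint", "EndpointConfigName": config_name}
    return model, endpoint_config, endpoint


def deploy(name, image_uri, role_arn):
    """Create the model, endpoint configuration, and endpoint (needs AWS credentials)."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("sagemaker")
    model, endpoint_config, endpoint = build_sagemaker_requests(name, image_uri, role_arn)
    client.create_model(**model)
    client.create_endpoint_config(**endpoint_config)
    client.create_endpoint(**endpoint)
```

SageMaker real-time endpoints expect the container to answer `GET /ping` and `POST /invocations` on port 8080, so the image may need a small serving wrapper if the Dockerfile only starts the Gradio UI.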

### 3. Azure Machine Learning

For Azure deployment:

1. Create an Azure ML workspace
2. Register the model on Azure ML
3. Create an inference configuration
4. Deploy to AKS or ACI with a GPU-enabled instance

Detailed Azure instructions can be found in the `docs/azure_deployment.md` file.

## How It Works

The application uses Qwen-VL-Chat, a multimodal model that takes an image together with a text prompt and answers in natural language.

The script:

1. Processes the image with three different prompts:
   - "Describe this image briefly in a single paragraph."
   - "Analyze this image in detail. Describe the main elements, any text visible, the colors, and the overall composition."
   - "What can you tell me about the technical aspects of this image?"
2. Uses 8-bit quantization to reduce memory requirements
3. Formats and displays the results

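The prompt loop and the quantized load can be sketched as below. This is a hedged outline rather than the code in `app.py`: it assumes the `Qwen/Qwen-VL-Chat` checkpoint from the Hugging Face Hub, whose remote code provides `tokenizer.from_list_format` and `model.chat`, and it assumes 8-bit loading goes through `BitsAndBytesConfig`. The function names and result labels are illustrative.

```python
# The three prompts listed above, keyed by the label used when reporting results.
PROMPTS = {
    "Basic Description": "Describe this image briefly in a single paragraph.",
    "Detailed Description": (
        "Analyze this image in detail. Describe the main elements, any text "
        "visible, the colors, and the overall composition."
    ),
    "Technical Analysis": "What can you tell me about the technical aspects of this image?",
}


def load_model(model_id="Qwen/Qwen-VL-Chat"):
    """Load tokenizer and model with 8-bit weights (needs a CUDA GPU and bitsandbytes)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        trust_remote_code=True,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    ).eval()
    return model, tokenizer


def describe_image(model, tokenizer, image_path):
    """Ask each prompt about the image in a fresh conversation; return label -> answer."""
    results = {}
    for label, prompt in PROMPTS.items():
        query = tokenizer.from_list_format([{"image": image_path}, {"text": prompt}])
        response, _history = model.chat(tokenizer, query=query, history=None)
        results[label] = response
    return results
```

Passing `history=None` for every prompt keeps the three answers independent, so the detailed analysis is not influenced by the brief one.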

## Repository Structure

- `app.py` - Gradio UI for web interface
- `Dockerfile` - For containerized deployment
- `requirements.txt` - Python dependencies
- `data_temp/` - Sample images for testing

## Local Development

1. Install the required packages:

   ```
   pip install -r requirements.txt
   ```

2. Run the Gradio UI:

   ```
   python app.py
   ```

3. Visit `http://localhost:7860` in your browser

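For orientation, a Gradio front end serving on port 7860 might look roughly like the sketch below. It is not the contents of `app.py`: the `describe` function is a placeholder standing in for the model call, and the component labels are assumptions.

```python
def describe(image_path):
    """Placeholder for the model call; returns (basic, detailed, technical) strings."""
    if image_path is None:
        message = "Please upload an image."
        return message, message, message
    return (
        f"[basic description of {image_path}]",
        f"[detailed analysis of {image_path}]",
        f"[technical analysis of {image_path}]",
    )


def build_demo():
    """Build the interface: one image input, three text outputs."""
    import gradio as gr  # imported lazily so describe() can be tested without Gradio

    return gr.Interface(
        fn=describe,
        inputs=gr.Image(type="filepath", label="Image"),
        outputs=[
            gr.Textbox(label="Basic Description"),
            gr.Textbox(label="Detailed Description"),
            gr.Textbox(label="Technical Analysis"),
        ],
        title="Image Description with Qwen-VL",
    )


def main():
    # Bind to all interfaces so the UI is reachable from outside a container.
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```

Calling `main()` starts the server; `type="filepath"` makes Gradio hand the function a path on disk rather than a pixel array.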

## Example Output

```
Processing image: data_temp/page_2.png
Loading model...
Generating descriptions...

==== Image Description Results (Qwen-VL) ====

Basic Description:
The image shows a webpage or document with text content organized in multiple columns.

Detailed Description:
The image displays a structured document or webpage with multiple sections of text organized in a grid layout. The content appears to be technical or educational in nature, with what looks like headings and paragraphs of text. The color scheme is primarily black text on a white background, creating a clean, professional appearance. There appear to be multiple columns of information, possibly representing different topics or categories. The layout suggests this might be documentation, a reference guide, or an educational resource related to technical content.

Technical Analysis:
This appears to be a screenshot of a digital document or webpage. The image quality is good with clear text rendering, suggesting it was captured at an appropriate resolution. The image uses a standard document layout with what appears to be a grid or multi-column structure. The screenshot has been taken of what seems to be a text-heavy interface with minimal graphics, consistent with technical documentation or reference materials.
```

Note: Actual descriptions will vary based on the specific image content and may be more detailed than this example.

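
A report in this layout can be produced with a small formatter. The sketch assumes the descriptions are held in a dict that maps each section label to its text; the function name `format_results` is hypothetical.

```python
def format_results(results, model_name="Qwen-VL"):
    """Render label -> text pairs as a banner followed by one block per section."""
    lines = [f"==== Image Description Results ({model_name}) ====", ""]
    for label, text in results.items():
        # Each section is a "Label:" line, the text, and a blank separator line.
        lines += [f"{label}:", text.strip(), ""]
    return "\n".join(lines).rstrip() + "\n"


print(format_results({
    "Basic Description": "The image shows a webpage or document with text content.",
}))
```

Dicts keep insertion order in Python 3.7+, so the sections appear in the order the descriptions were generated.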