Rishi Desai committed on
Commit
aa6982c
·
1 Parent(s): 0f8d917

readme + req

Files changed (2)
  1. README.md +107 -1
  2. requirements.txt +3 -5
README.md CHANGED
@@ -1 +1,107 @@
- # AutoCaptioner
+ # AutoCaptioner
+ A tool that automatically:
+ * generates detailed image captions for training higher-quality LoRAs, and
+ * optimizes your prompts during inference.
+
+ <div style="text-align: center;">
+ <img src="examples/caption_example.gif" alt="Captioning Example" width="600"/>
+ </div>
+
+ ## What is AutoCaptioner?
+
+ AutoCaptioner creates detailed, principled image captions for your LoRA dataset. These captions can be used to:
+ - Train more expressive LoRAs on Flux or SDXL
+ - Simplify inference through prompt optimization
+ - Save time compared to manual captioning, without skipping captions entirely
+
18
+
19
+ ### Prerequisites
20
+ - Python 3.11 or higher
21
+ - [Together API](https://together.ai/) account and API key
22
+
23
+
+ ### Setup
+
+ 1. Create and activate a virtual environment, then install the dependencies:
+ ```bash
+ python -m venv venv
+ source venv/bin/activate
+ python -m pip install -r requirements.txt
+ ```
+
+ 2. Set your Together API key in the `TOGETHER_API_KEY` environment variable.
+
+ 3. Run inference on a directory of images:
+
+ ```bash
+ python main.py --input examples/ --output output/
+ ```
+
+ <details>
+ <summary>Arguments</summary>
+
+ - `--input` (str): Directory containing images to caption.
+ - `--output` (str): Directory to save images and captions (defaults to the input directory).
+ - `--fix_outfit` (flag): Indicate that the character wears a single outfit (for consistent descriptions).
+ - `--batch_images` (flag): Process images in batches by category.
+ </details>
+
+
+ ## Gradio Web Interface
+
+ Launch a web interface for captioning and prompt optimization:
+ ```bash
+ python demo.py
+ ```
+
+ ### Features
+
+ - High-accuracy image captioning with detailed contextual descriptions
+ - Consistent character descriptions when using the `--fix_outfit` flag
+ - Batch processing for large image collections
+ - Optimized for AI model training datasets
+ - Web interface for easy use
+
+ ## How It Works
+
+ AutoCaptioner uses the Llama-4-Maverick model through the Together AI platform to:
+ 1. Analyze the visual content of your images
+ 2. Generate detailed, structured captions
+ 3. Save the captions as text files alongside your images
+
+ ## Notes
+ - Images are processed individually in standard mode
+ - For large collections, batch processing by category is recommended
+ - Each caption is saved as a .txt file with the same name as the image
+
+ ### Troubleshooting
+
+ - **API errors**: Ensure your Together API key is set correctly
+ - **Unsupported formats**: Only .png, .jpg, .jpeg, and .webp files are supported
+ - **Memory issues**: For very large images, try processing in smaller batches
+
+ ### Examples
+
+ <table>
+ <tr>
+ <th>Original Image</th>
+ <th>Generated Caption</th>
+ </tr>
+ <tr>
+ <td><img src="examples/landscape.jpg" alt="Landscape" width="300"/></td>
+ <td>"tr1gger photorealistic, mountain landscape, hiking gear and outdoor clothing, standing at viewpoint, awestruck, snow-capped peaks and valley, golden sunset light, wide-angle view"</td>
+ </tr>
+ <tr>
+ <td><img src="examples/city.jpg" alt="City" width="300"/></td>
+ <td>"tr1gger photorealistic, urban architecture, business suit with briefcase, walking on sidewalk, focused expression, downtown skyscrapers, evening city lights, street-level view"</td>
+ </tr>
+ <tr>
+ <td><img src="examples/food.jpg" alt="Food" width="300"/></td>
+ <td>"tr1gger photorealistic, culinary presentation, chef's uniform and hat, plating dish, concentrated, modern restaurant kitchen, soft studio lighting, overhead camera angle"</td>
+ </tr>
+ </table>
+
+ ## License
+
+ [MIT License](LICENSE)
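
The README states that only .png, .jpg, .jpeg, and .webp files are supported and that each caption is saved as a .txt file with the same name as the image. A minimal sketch of that file-handling convention might look like the following. It is not taken from the repository: the function names and the `placeholder_caption` stand-in are hypothetical, the model call is deliberately left out, and only the caption files are written (the README says the output directory also receives the images).

```python
from pathlib import Path
from typing import Callable

# Extensions listed under Troubleshooting in the README.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def find_images(input_dir: Path) -> list[Path]:
    """Return the supported image files in input_dir, sorted by path."""
    return sorted(
        p for p in input_dir.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def write_captions(
    input_dir: Path,
    output_dir: Path,
    caption_fn: Callable[[Path], str],
) -> list[Path]:
    """Write one .txt caption per image, named after the image's stem."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in find_images(input_dir):
        caption_path = output_dir / f"{image_path.stem}.txt"
        caption_path.write_text(caption_fn(image_path), encoding="utf-8")
        written.append(caption_path)
    return written


def placeholder_caption(image_path: Path) -> str:
    # Hypothetical stand-in for the model call; not the project's captioner.
    return f"tr1gger placeholder caption for {image_path.name}"
```

Passing the captioning step in as `caption_fn` keeps the file handling testable without an API key; a real run would supply a function that calls the model instead of `placeholder_caption`.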
requirements.txt CHANGED
@@ -1,5 +1,3 @@
- gradio==4.44.1
- Pillow==10.0.0
- pydantic>=2.0.0
- together
- fastapi>=0.100.0
+ gradio
+ pillow
+ together
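
Because this change removes the version pins, the versions that pip resolves can differ between installs. One way to record what an environment actually contains is sketched below, using the standard library's `importlib.metadata`. The `installed_versions` helper and the `REQUIREMENTS` list are illustrative names and are not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Package names from the updated requirements.txt.
REQUIREMENTS = ["gradio", "pillow", "together"]


def installed_versions(packages: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if missing."""
    found: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIREMENTS).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this after `python -m pip install -r requirements.txt` shows which versions were picked up, which can help when comparing against the previously pinned `gradio==4.44.1` and `Pillow==10.0.0`.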