---
title: Markit GOT OCR
emoji: 📄
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.14.0
app_file: app.py
build_script: build.sh
startup_script: setup.sh
pinned: false
hf_oauth: true
---

# Document to Markdown Converter

A Hugging Face Space that converts various document formats to Markdown, now with MarkItDown integration!

## Features

- Convert PDFs, Office documents, images, and more to Markdown
- Multiple parser options:
  - MarkItDown: For comprehensive document conversion
  - GOT-OCR: For image-based OCR with LaTeX support
  - Gemini Flash: For AI-powered text extraction
- Download converted documents as Markdown files
- Clean, responsive UI

## Using MarkItDown

This app integrates [Microsoft's MarkItDown](https://github.com/microsoft/markitdown) library, which supports a wide range of file formats:

- PDF
- PowerPoint (PPTX)
- Word (DOCX)
- Excel (XLSX)
- Images (JPG, PNG)
- Audio files (with transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files
- YouTube URLs
- EPubs
- And more!

## Environment Variables

You can enhance the functionality by setting these environment variables:

- `OPENAI_API_KEY`: Enables AI-based image descriptions in MarkItDown
- `GOOGLE_API_KEY`: Used for Gemini Flash parser and LaTeX to Markdown conversion
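
As a rough illustration of how these variables gate optional features, the sketch below reports which features are enabled in the current environment. The variable names come from the list above; the `OPTIONAL_KEYS` mapping and the `enabled_features` helper are hypothetical and not part of the app's code.

```python
import os

# Hypothetical mapping of environment variables to the features they unlock.
# The variable names come from the README; the labels are illustrative.
OPTIONAL_KEYS = {
    "OPENAI_API_KEY": "AI-based image descriptions in MarkItDown",
    "GOOGLE_API_KEY": "Gemini Flash parser and LaTeX-to-Markdown conversion",
}


def enabled_features(env=None):
    """Return descriptions of the features whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return [label for key, label in OPTIONAL_KEYS.items() if env.get(key, "").strip()]


if __name__ == "__main__":
    for feature in enabled_features():
        print(f"enabled: {feature}")
```

Running a check like this at startup makes it clear up front which parsers will work, instead of failing at conversion time.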

## Usage

1. Select a file to upload
2. Choose "MarkItDown" as the parser
3. Select "Standard Conversion"
4. Click "Convert"
5. View the Markdown output and download the converted file

## Local Development

1. Clone the repository
2. Create a `.env` file based on `.env.example`
3. Install dependencies:
   ```
   pip install -r requirements.txt
   ```
4. Run the application:
   ```
   python app.py
   ```

## Credits

- [MarkItDown](https://github.com/microsoft/markitdown) by Microsoft
- [GOT-OCR](https://github.com/stepfun-ai/GOT-OCR-2.0) for image-based OCR
- [Gradio](https://gradio.app/) for the UI framework

# Markit: Document to Markdown Converter

[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/Ansemin101/Markit)

**Author: Anse Min** | [GitHub](https://github.com/ansemin) | [LinkedIn](https://www.linkedin.com/in/ansemin/)

## Project Links
- **GitHub Repository**: [github.com/ansemin/Markit_HF](https://github.com/ansemin/Markit_HF)
- **Hugging Face Space**: [huggingface.co/spaces/Ansemin101/Markit](https://huggingface.co/spaces/Ansemin101/Markit)

## Overview
Markit converts documents (PDF, DOCX, images, and more) to Markdown. It combines several parsing engines and OCR methods to extract text and produce clean, readable Markdown.

## Key Features
- **Multiple Document Formats**: Convert PDFs, Word documents, images, and other document formats
- **Versatile Output Formats**: Export to Markdown, JSON, plain text, or document tags format
- **Advanced Parsing Engines**:
  - **PyPdfium**: Fast PDF parsing using the PDFium engine
  - **Docling**: Advanced document structure analysis
  - **Gemini Flash**: AI-powered conversion using Google's Gemini API
  - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
- **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
- **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
- **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
- **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments

## System Architecture
The application is built with a modular architecture:
- **Core Engine**: Handles document conversion and processing workflows
- **Parser Registry**: Central registry for all document parsers
- **UI Layer**: Gradio-based web interface
- **Service Layer**: Handles AI chat functionality and external services integration
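
To make the parser registry idea concrete, here is a minimal sketch of a name-to-parser registry. The `ParserRegistry` class, its methods, and the `Echo` parser are hypothetical names chosen for illustration; the project's real interfaces in `src/parsers/` may look quite different.

```python
from typing import Callable, Dict, List

# Hypothetical signature: a parser takes a file path and returns Markdown text.
ParserFn = Callable[[str], str]


class ParserRegistry:
    """Maps parser names (as shown in a UI dropdown) to parser callables."""

    def __init__(self) -> None:
        self._parsers: Dict[str, ParserFn] = {}

    def register(self, name: str) -> Callable[[ParserFn], ParserFn]:
        """Decorator that records a parser function under `name`."""
        def decorator(fn: ParserFn) -> ParserFn:
            if name in self._parsers:
                raise ValueError(f"parser already registered: {name}")
            self._parsers[name] = fn
            return fn
        return decorator

    def names(self) -> List[str]:
        """Return the registered parser names in sorted order."""
        return sorted(self._parsers)

    def convert(self, name: str, path: str) -> str:
        """Look up the parser by name and run it on `path`."""
        try:
            parser = self._parsers[name]
        except KeyError:
            raise KeyError(f"unknown parser: {name}") from None
        return parser(path)


registry = ParserRegistry()


@registry.register("Echo")
def echo_parser(path: str) -> str:
    # Placeholder parser used only to demonstrate the flow.
    return f"# {path}\n"
```

With this shape, the UI layer only needs `registry.names()` to populate a dropdown, and the core engine calls `registry.convert(name, path)` without knowing which parser is behind the name.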

## Installation

### For Local Development
1. Clone the repository
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Install Tesseract OCR (required for OCR functionality):
   - Windows: Download and install from [GitHub](https://github.com/UB-Mannheim/tesseract/wiki)
   - Linux: `sudo apt-get install tesseract-ocr libtesseract-dev`
   - macOS: `brew install tesseract`

4. Run the application:
   ```bash
   python app.py
   ```

### API Keys Setup

#### Gemini Flash Parser
To use the Gemini Flash parser, you need to:
1. Install the Google Generative AI client: `pip install google-genai`
2. Set the API key environment variable:
   ```bash
   # On Windows
   set GOOGLE_API_KEY=your_api_key_here
   
   # On Linux/Mac
   export GOOGLE_API_KEY=your_api_key_here
   ```
3. Alternatively, create a `.env` file in the project root with:
   ```
   GOOGLE_API_KEY=your_api_key_here
   ```
4. Get your Gemini API key from [Google AI Studio](https://aistudio.google.com/app/apikey)
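
For reference, the sketch below shows one way a `.env` file of this form could be read using only the standard library. The `load_env_file` helper is hypothetical; the app may rely on a dedicated library such as `python-dotenv` instead, and this version only handles plain `KEY=VALUE` lines.

```python
import os


def load_env_file(path=".env", environ=None):
    """Copy KEY=VALUE pairs from `path` into `environ`.

    Keys that are already set are left untouched, so values exported in the
    shell take precedence over the file. Returns the pairs that were added.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    try:
        with open(path, encoding="utf-8") as handle:
            lines = handle.read().splitlines()
    except FileNotFoundError:
        return loaded  # no .env file: rely on variables already set in the shell
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        if key and key not in environ:
            environ[key] = value
            loaded[key] = value
    return loaded
```

Calling `load_env_file()` before the parsers are created would make `GOOGLE_API_KEY` available through `os.environ` in the same way as the shell commands above.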

#### GOT-OCR Parser
The GOT-OCR parser requires:
1. CUDA-capable GPU with sufficient memory
2. The following dependencies will be installed automatically:
   ```bash
   torch
   torchvision
   git+https://github.com/huggingface/transformers.git@main  # Latest transformers from GitHub
   accelerate
   verovio
   numpy==1.26.3  # Specific version required
   opencv-python
   ```
3. Note that GOT-OCR only supports JPG and PNG image formats
4. In HF Spaces, the integration with ZeroGPU is automatic and optimized for Stateless GPU environments
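
Because GOT-OCR accepts only JPG and PNG input, it can help to reject other files before any GPU work starts. The following is a minimal sketch; the helper name is hypothetical, and treating `.jpeg` as JPG is an assumption of this example.

```python
from pathlib import Path

# Extensions accepted for GOT-OCR according to the notes above (JPG and PNG).
# Treating ".jpeg" as JPG is an assumption of this sketch.
GOT_OCR_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def is_got_ocr_compatible(file_path):
    """Return True when the file extension is one GOT-OCR can handle."""
    return Path(file_path).suffix.lower() in GOT_OCR_EXTENSIONS
```

The check looks only at the file extension, not at the file contents, so a mislabeled file would still pass.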

## Deploying to Hugging Face Spaces

### Environment Configuration
1. Go to your Space settings: `https://huggingface.co/spaces/YOUR_USERNAME/YOUR_SPACE_NAME/settings`
2. Add the following repository secrets:
   - Name: `GOOGLE_API_KEY`
   - Value: Your Gemini API key

### Space Configuration
Ensure your Hugging Face Space configuration includes:
```yaml
build:
  dockerfile: Dockerfile
  python_version: "3.10" 
  system_packages:
    - "tesseract-ocr"
    - "libtesseract-dev"
```

## How to Use

### Document Conversion
1. Upload your document using the file uploader
2. Select a parser provider:
   - **PyPdfium**: Best for standard PDFs with selectable text
   - **Docling**: Best for complex document layouts
   - **Gemini Flash**: Best for AI-powered conversions (requires API key)
   - **GOT-OCR**: Best for high-quality OCR on images (JPG/PNG only)
3. Choose an OCR option based on your selected parser:
   - **None**: No OCR processing (for documents with selectable text)
   - **Tesseract**: Basic OCR using Tesseract
   - **Advanced**: Enhanced OCR with layout preservation (available with specific parsers)
   - **Plain Text**: For GOT-OCR, extracts raw text without formatting
   - **Formatted Text**: For GOT-OCR, preserves formatting and converts to Markdown
4. Select your desired output format:
   - **Markdown**: Clean, readable markdown format
   - **JSON**: Structured data representation
   - **Text**: Plain text extraction
   - **Document Tags**: XML-like structure tags
5. Click "Convert" to process your document
6. Navigate through pages using the navigation buttons for multi-page documents
7. Download the converted content in your selected format

## Troubleshooting

### OCR Issues
- Ensure Tesseract is properly installed and in your system PATH
- Check that the `TESSDATA_PREFIX` environment variable is set correctly
- Verify language files are available in the tessdata directory
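
The three checks above can be scripted. This sketch uses only the standard library; the `tesseract_diagnostics` helper is hypothetical, and it assumes `TESSDATA_PREFIX` points directly at the directory containing the `.traineddata` files, which may not match every Tesseract version.

```python
import os
import shutil


def tesseract_diagnostics(environ=None):
    """Collect basic facts about the local Tesseract setup."""
    environ = os.environ if environ is None else environ
    # Location of the tesseract executable on PATH, or None if not found.
    binary = shutil.which("tesseract", path=environ.get("PATH"))
    tessdata = environ.get("TESSDATA_PREFIX")
    languages = []
    # Assumption: TESSDATA_PREFIX is the folder holding *.traineddata files.
    if tessdata and os.path.isdir(tessdata):
        languages = sorted(
            name[: -len(".traineddata")]
            for name in os.listdir(tessdata)
            if name.endswith(".traineddata")
        )
    return {"binary": binary, "tessdata_prefix": tessdata, "languages": languages}


if __name__ == "__main__":
    for key, value in tesseract_diagnostics().items():
        print(f"{key}: {value}")
```

A `binary` of `None` means Tesseract is not on the PATH; an empty `languages` list suggests the tessdata directory is missing or misconfigured.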

### Gemini Flash Parser Issues
- Confirm your API key is set correctly as an environment variable
- Check for API usage limits or restrictions
- Verify the document format is supported by the Gemini API

### GOT-OCR Parser Issues
- Ensure you have a CUDA-capable GPU with sufficient memory
- Verify that all required dependencies are installed correctly
- Remember that GOT-OCR only supports JPG and PNG image formats
- If you encounter CUDA out-of-memory errors, try using a smaller image
- In Hugging Face Spaces with Stateless GPU, ensure the `spaces` module is imported before any CUDA initialization
- If you see errors about "CUDA must not be initialized in the main process", verify the import order in your `app.py`
- If you encounter "cannot pickle '_thread.lock' object" errors, this indicates thread locks are being passed to the GPU function
- The GOT-OCR parser has been optimized for ZeroGPU in Stateless GPU environments with proper serialization handling
- For local development, the parser will fall back to CPU processing if GPU is not available
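
The CPU fallback for local development can be sketched as a simple device check. This is an illustration rather than the parser's actual logic; `pick_device` is a hypothetical helper, and it is meant for local runs, not for code executing in the main process of a ZeroGPU Space.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"GOT-OCR would run on: {pick_device()}")
```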

### General Issues
- Check the console logs for error messages
- Ensure all dependencies are installed correctly
- For large documents, try processing fewer pages at a time

## Development Guide

### Project Structure

```
markit/
├── app.py                  # Main application entry point
├── setup.sh                # Setup script
├── build.sh                # Build script
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
├── .env                    # Environment variables
├── .gitignore              # Git ignore file
├── .gitattributes          # Git attributes file
├── src/                    # Source code
│   ├── __init__.py         # Package initialization
│   ├── main.py             # Main module
│   ├── core/               # Core functionality
│   │   ├── __init__.py     # Package initialization
│   │   ├── converter.py    # Document conversion logic
│   │   └── parser_factory.py # Parser factory
│   ├── parsers/            # Parser implementations
│   │   ├── __init__.py     # Package initialization
│   │   ├── parser_interface.py # Parser interface
│   │   ├── parser_registry.py # Parser registry
│   │   ├── docling_parser.py # Docling parser
│   │   ├── got_ocr_parser.py # GOT-OCR parser for images
│   │   └── pypdfium_parser.py # PyPDFium parser
│   ├── ui/                 # User interface
│   │   ├── __init__.py     # Package initialization
│   │   └── ui.py           # Gradio UI implementation
│   └── services/           # External services
│       └── __init__.py     # Package initialization
└── tests/                  # Tests
    └── __init__.py         # Package initialization
```

### ZeroGPU Integration Notes

When developing for Hugging Face Spaces with Stateless GPU:

1. Always import the `spaces` module before any CUDA initialization
2. Place all CUDA operations inside functions decorated with `@spaces.GPU()`
3. Ensure only picklable objects are passed to GPU-decorated functions
4. Use wrapper functions to filter out unpicklable objects like thread locks
5. For advanced use cases, consider implementing fallback mechanisms for serialization errors
6. **Add `hf_oauth: true` to your Space's README.md metadata** to mitigate GPU quota limitations
7. Sign in with your Hugging Face account when using the app to utilize your personal GPU quota
8. For heavier GPU usage, a Hugging Face Pro subscription provides a larger ZeroGPU quota
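
Points 1 to 4 can be combined into one small pattern. The sketch below is an assumption-laden illustration, not the app's implementation: `picklable_only`, `run_ocr`, and `convert` are hypothetical names, and the `ImportError` fallback exists only so the example also runs outside a Space.

```python
import pickle

try:
    import spaces  # must be imported before anything that initializes CUDA (e.g. torch)
    gpu = spaces.GPU()
except ImportError:
    # Local fallback (assumption of this sketch): run the function unchanged.
    def gpu(fn):
        return fn


def picklable_only(options):
    """Return a copy of `options` keeping only values that pickle cleanly."""
    safe = {}
    for key, value in options.items():
        try:
            pickle.dumps(value)
        except Exception:
            continue  # e.g. thread locks or open handles cannot cross processes
        safe[key] = value
    return safe


@gpu
def run_ocr(image_path, options):
    # CUDA work (model loading, inference) would go here.
    return f"processed {image_path} with {sorted(options)}"


def convert(image_path, **options):
    """Wrapper that strips unpicklable arguments before the GPU call."""
    return run_ocr(image_path, picklable_only(options))
```

Silently dropping unpicklable values keeps the call from failing, but it can also hide a missing argument, so logging the dropped keys is worth considering in real code.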

> **Note**: If you're implementing a Space with ZeroGPU on your own, you may encounter quota limitations ("GPU task aborted" errors). These can be mitigated by:
> - Adding `hf_oauth: true` to your Space's metadata (as shown in this Space)
> - Having users sign in with their Hugging Face accounts
> - Upgrading to a Hugging Face Pro subscription for a larger GPU quota