AnseMin committed on
Commit 5910e0d · 1 Parent(s): 4cac30a

Working version 1: GOT OCR works with latex output

Files changed (1)
  1. README.md +29 -10
README.md CHANGED
@@ -31,10 +31,11 @@ Markit is a powerful tool that converts various document formats (PDF, DOCX, ima
31
  - **PyPdfium**: Fast PDF parsing using the PDFium engine
32
  - **Docling**: Advanced document structure analysis
33
  - **Gemini Flash**: AI-powered conversion using Google's Gemini API
34
- - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only)
35
  - **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
36
  - **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
37
  - **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
 
38
 
39
  ## System Architecture
40
  The application is built with a modular architecture:
@@ -85,14 +86,16 @@ The GOT-OCR parser requires:
85
  1. CUDA-capable GPU with sufficient memory
86
  2. The following dependencies will be installed automatically:
87
  ```bash
88
- torch>=2.0.1
89
- torchvision>=0.15.2
90
- transformers>=4.37.2,<4.48.0 # Specific version range required
91
- tiktoken>=0.6.0
92
- verovio>=4.3.1
93
- accelerate>=0.28.0
 
94
  ```
95
  3. Note that GOT-OCR only supports JPG and PNG image formats
 
96
 
97
  ## Deploying to Hugging Face Spaces
98
 
@@ -126,6 +129,8 @@ build:
126
  - **None**: No OCR processing (for documents with selectable text)
127
  - **Tesseract**: Basic OCR using Tesseract
128
  - **Advanced**: Enhanced OCR with layout preservation (available with specific parsers)
 
 
129
  4. Select your desired output format:
130
  - **Markdown**: Clean, readable markdown format
131
  - **JSON**: Structured data representation
@@ -152,8 +157,11 @@ build:
152
  - Verify that all required dependencies are installed correctly
153
  - Remember that GOT-OCR only supports JPG and PNG image formats
154
  - If you encounter CUDA out-of-memory errors, try using a smaller image
155
- - GOT-OCR requires transformers version <4.48.0 due to API changes in newer versions
156
- - If you see errors about 'get_max_length', downgrade transformers to version 4.47.0
 
 
 
157
 
158
  ### General Issues
159
  - Check the console logs for error messages
@@ -186,6 +194,7 @@ markit/
186
  │ │ ├── parser_interface.py # Parser interface
187
  │ │ ├── parser_registry.py # Parser registry
188
  │ │ ├── docling_parser.py # Docling parser
 
189
  │ │ └── pypdfium_parser.py # PyPDFium parser
190
  │ ├── ui/ # User interface
191
  │ │ ├── __init__.py # Package initialization
@@ -194,4 +203,14 @@ markit/
194
  │ └── __init__.py # Package initialization
195
  └── tests/ # Tests
196
  └── __init__.py # Package initialization
197
- ```
 
 
 
 
 
 
 
 
 
 
 
31
  - **PyPdfium**: Fast PDF parsing using the PDFium engine
32
  - **Docling**: Advanced document structure analysis
33
  - **Gemini Flash**: AI-powered conversion using Google's Gemini API
34
+ - **GOT-OCR**: State-of-the-art OCR model for images (JPG/PNG only) with plain text and formatted text options
35
  - **OCR Integration**: Extract text from images and scanned documents using Tesseract OCR
36
  - **Interactive UI**: User-friendly Gradio interface with page navigation for large documents
37
  - **AI-Powered Chat**: Interact with your documents using AI to ask questions about content
38
+ - **ZeroGPU Support**: Optimized for Hugging Face Spaces with Stateless GPU environments
39
 
40
  ## System Architecture
41
  The application is built with a modular architecture:
 
86
  1. CUDA-capable GPU with sufficient memory
87
  2. The following dependencies will be installed automatically:
88
  ```bash
89
+ torch
90
+ torchvision
91
+ git+https://github.com/huggingface/transformers.git@main # Latest transformers from GitHub
92
+ accelerate
93
+ verovio
94
+ numpy==1.26.3 # Specific version required
95
+ opencv-python
96
  ```
97
  3. Note that GOT-OCR only supports JPG and PNG image formats
98
+ 4. In HF Spaces, the integration with ZeroGPU is automatic and optimized for Stateless GPU environments
99
 
100
  ## Deploying to Hugging Face Spaces
101
 
 
129
  - **None**: No OCR processing (for documents with selectable text)
130
  - **Tesseract**: Basic OCR using Tesseract
131
  - **Advanced**: Enhanced OCR with layout preservation (available with specific parsers)
132
+ - **Plain Text**: For GOT-OCR, extracts raw text without formatting
133
+ - **Formatted Text**: For GOT-OCR, preserves formatting and converts to Markdown
134
  4. Select your desired output format:
135
  - **Markdown**: Clean, readable markdown format
136
  - **JSON**: Structured data representation
 
157
  - Verify that all required dependencies are installed correctly
158
  - Remember that GOT-OCR only supports JPG and PNG image formats
159
  - If you encounter CUDA out-of-memory errors, try using a smaller image
160
+ - In Hugging Face Spaces with Stateless GPU, ensure the `spaces` module is imported before any CUDA initialization
161
+ - If you see errors about "CUDA must not be initialized in the main process", verify the import order in your app.py
162
+ - If you encounter "cannot pickle '_thread.lock' object" errors, this indicates thread locks are being passed to the GPU function
163
+ - The GOT-OCR parser has been optimized for ZeroGPU in Stateless GPU environments with proper serialization handling
164
+ - For local development, the parser will fall back to CPU processing if GPU is not available
165
 
166
  ### General Issues
167
  - Check the console logs for error messages
 
194
  │ │ ├── parser_interface.py # Parser interface
195
  │ │ ├── parser_registry.py # Parser registry
196
  │ │ ├── docling_parser.py # Docling parser
197
+ │ │ ├── got_ocr_parser.py # GOT-OCR parser for images
198
  │ │ └── pypdfium_parser.py # PyPDFium parser
199
  │ ├── ui/ # User interface
200
  │ │ ├── __init__.py # Package initialization
 
203
  │ └── __init__.py # Package initialization
204
  └── tests/ # Tests
205
  └── __init__.py # Package initialization
206
+ ```
207
+
208
+ ### ZeroGPU Integration Notes
209
+
210
+ When developing for Hugging Face Spaces with Stateless GPU:
211
+
212
+ 1. Always import the `spaces` module before any CUDA initialization
213
+ 2. Place all CUDA operations inside functions decorated with `@spaces.GPU()`
214
+ 3. Ensure only picklable objects are passed to GPU-decorated functions
215
+ 4. Use wrapper functions to filter out unpicklable objects like thread locks
216
+ 5. For advanced use cases, consider implementing fallback mechanisms for serialization errors
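
The ZeroGPU notes added in this commit can be illustrated with a minimal sketch. It is not the code in `got_ocr_parser.py`: the names `is_picklable`, `picklable_kwargs`, `run_ocr`, and `run_ocr_safe` are hypothetical, the function body is a placeholder for real model inference, and the no-op fallback decorator is an assumption made so the example also runs where the `spaces` package is not installed. The sketch uses the bare `@spaces.GPU` form of the decorator.

```python
import pickle
import threading

try:
    # Import `spaces` first, before torch or anything else that may initialize CUDA.
    import spaces
    gpu = spaces.GPU
except ImportError:
    # Assumption for local development: fall back to a no-op decorator.
    def gpu(fn):
        return fn


def is_picklable(obj):
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def picklable_kwargs(kwargs):
    """Drop keyword arguments (e.g. thread locks) that cannot be pickled."""
    return {key: value for key, value in kwargs.items() if is_picklable(value)}


@gpu
def run_ocr(image_path, **options):
    # Placeholder body (hypothetical): model loading and all CUDA work
    # would live here, inside the GPU-decorated function.
    return f"processed {image_path} with {sorted(options)}"


def run_ocr_safe(image_path, **options):
    """Wrapper that filters unpicklable arguments before the GPU call."""
    return run_ocr(image_path, **picklable_kwargs(options))


if __name__ == "__main__":
    # The lock is silently dropped; only `ocr_type` reaches `run_ocr`.
    print(run_ocr_safe("page.png", ocr_type="format", lock=threading.Lock()))
```

Silently dropping arguments keeps the sketch short. A real wrapper might prefer to log or raise when it discards something, so that a missing option does not go unnoticed.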