historical-ocr / improvements.md
milwright's picture
fix cline
2d01495
# Historical OCR Application Improvements
Based on a thorough code review of the Historical OCR application, I've identified several areas for improvement to reduce technical debt and enhance the application's functionality, maintainability, and performance.
## 1. Code Organization and Structure
### 1.1 Modularize Large Functions
Several functions in the codebase are excessively long and handle multiple responsibilities:
- **Issue**: `process_file()` in ocr_processing.py is over 400 lines and handles file validation, preprocessing, OCR processing, and result formatting.
- **Solution**: Break down into smaller, focused functions:
```python
def process_file(uploaded_file, options):
# Validate and prepare file
file_info = validate_and_prepare_file(uploaded_file)
# Apply preprocessing based on document type
preprocessed_file = preprocess_document(file_info, options)
# Perform OCR processing
ocr_result = perform_ocr(preprocessed_file, options)
# Format and enhance results
return format_and_enhance_results(ocr_result, file_info)
```
### 1.2 Consistent Error Handling
Error handling approaches vary across modules:
- **Issue**: Some functions use try/except blocks with detailed logging, while others return error dictionaries or raise exceptions.
- **Solution**: Implement a consistent error handling strategy:
```python
class OCRError(Exception):
def __init__(self, message, error_code=None, details=None):
self.message = message
self.error_code = error_code
self.details = details
super().__init__(self.message)
def handle_error(func):
@functools.wraps(func)
def wrapper(*args, **kwargs):
try:
return func(*args, **kwargs)
except OCRError as e:
logger.error(f"OCR Error: {e.message} (Code: {e.error_code})")
return {"error": e.message, "error_code": e.error_code, "details": e.details}
except Exception as e:
logger.error(f"Unexpected error: {str(e)}")
return {"error": "An unexpected error occurred", "details": str(e)}
return wrapper
```
## 2. API Integration and Performance
### 2.1 API Client Optimization
The Mistral API client initialization and usage can be improved:
- **Issue**: The client is initialized for each request and error handling is duplicated.
- **Solution**: Create a singleton API client with centralized error handling:
```python
class MistralClient:
_instance = None
@classmethod
def get_instance(cls, api_key=None):
if cls._instance is None:
cls._instance = cls(api_key)
return cls._instance
def __init__(self, api_key=None):
self.api_key = api_key or os.environ.get("MISTRAL_API_KEY", "")
self.client = Mistral(api_key=self.api_key)
def process_ocr(self, document, **kwargs):
try:
return self.client.ocr.process(document=document, **kwargs)
except Exception as e:
# Centralized error handling
return self._handle_api_error(e)
```
### 2.2 Caching Strategy
The current caching approach can be improved:
- **Issue**: Cache keys don't always account for all relevant parameters, and TTL is fixed at 24 hours.
- **Solution**: Implement a more sophisticated caching strategy:
```python
def generate_cache_key(file_content, options):
# Create a comprehensive hash of all relevant parameters
options_str = json.dumps(options, sort_keys=True)
content_hash = hashlib.md5(file_content).hexdigest()
return f"{content_hash}_{hashlib.md5(options_str.encode()).hexdigest()}"
# Adaptive TTL based on document type
def get_cache_ttl(document_type):
ttl_map = {
"handwritten": 48 * 3600, # 48 hours for handwritten docs
"newspaper": 24 * 3600, # 24 hours for newspapers
"standard": 12 * 3600 # 12 hours for standard docs
}
return ttl_map.get(document_type, 24 * 3600)
```
## 3. State Management
### 3.1 Streamlit Session State
The application uses a complex state management approach:
- **Issue**: Many session state variables with unclear relationships and reset logic.
- **Solution**: Implement a more structured state management approach:
```python
class DocumentState:
def __init__(self):
self.document = None
self.original_bytes = None
self.name = None
self.mime_type = None
self.is_sample = False
self.processed = False
self.temp_files = []
def reset(self):
# Clean up temp files
for temp_file in self.temp_files:
if os.path.exists(temp_file):
os.unlink(temp_file)
# Reset state
self.__init__()
# Initialize in session state
if 'document_state' not in st.session_state:
st.session_state.document_state = DocumentState()
```
### 3.2 Result History Management
The current approach to managing result history can be improved:
- **Issue**: Results are stored directly in session state with limited management.
- **Solution**: Create a dedicated class for result history:
```python
class ResultHistory:
def __init__(self, max_results=20):
self.results = []
self.max_results = max_results
def add_result(self, result):
# Add timestamp and ensure result is serializable
result = self._prepare_result(result)
self.results.insert(0, result)
# Trim to max size
if len(self.results) > self.max_results:
self.results = self.results[:self.max_results]
def _prepare_result(self, result):
# Add timestamp and ensure result is serializable
result = result.copy()
result['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# Ensure result is serializable
return json.loads(json.dumps(result, default=str))
```
## 4. Image Processing Pipeline
### 4.1 Preprocessing Configuration
The preprocessing configuration can be improved:
- **Issue**: Preprocessing options are scattered across different parts of the code.
- **Solution**: Create a centralized preprocessing configuration:
```python
PREPROCESSING_CONFIGS = {
"standard": {
"grayscale": True,
"denoise": True,
"contrast": 5,
"deskew": True
},
"handwritten": {
"grayscale": True,
"denoise": True,
"contrast": 10,
"deskew": True,
"adaptive_threshold": {
"block_size": 21,
"constant": 5
}
},
"newspaper": {
"grayscale": True,
"denoise": True,
"contrast": 5,
"deskew": True,
"column_detection": True
}
}
```
### 4.2 Image Segmentation
The image segmentation approach can be improved:
- **Issue**: Segmentation is optional and not well-integrated with the preprocessing pipeline.
- **Solution**: Make segmentation a standard part of the preprocessing pipeline for certain document types:
```python
def preprocess_document(file_info, options):
# Apply basic preprocessing
preprocessed_file = apply_basic_preprocessing(file_info, options)
# Apply segmentation for specific document types
if options["document_type"] in ["newspaper", "book", "multi_column"]:
return apply_segmentation(preprocessed_file, options)
return preprocessed_file
```
## 5. User Experience Enhancements
### 5.1 Progressive Loading
Improve the user experience during processing:
- **Issue**: The UI can appear frozen during long-running operations.
- **Solution**: Implement progressive loading and feedback:
```python
def process_with_feedback(file, options, progress_callback):
# Update progress at each step
progress_callback(10, "Validating document...")
file_info = validate_and_prepare_file(file)
progress_callback(30, "Preprocessing document...")
preprocessed_file = preprocess_document(file_info, options)
progress_callback(50, "Performing OCR...")
ocr_result = perform_ocr(preprocessed_file, options)
progress_callback(80, "Enhancing results...")
final_result = format_and_enhance_results(ocr_result, file_info)
progress_callback(100, "Complete!")
return final_result
```
### 5.2 Result Visualization
Enhance the visualization of OCR results:
- **Issue**: Results are displayed in a basic format with limited visualization.
- **Solution**: Implement enhanced visualization options:
```python
def display_enhanced_results(result):
# Create tabs for different views
tabs = st.tabs(["Text", "Annotated", "Side-by-Side", "JSON"])
with tabs[0]:
# Display formatted text
st.markdown(format_ocr_text(result["ocr_contents"]["raw_text"]))
with tabs[1]:
# Display annotated image with bounding boxes
display_annotated_image(result)
with tabs[2]:
# Display side-by-side comparison
col1, col2 = st.columns(2)
with col1:
st.image(result["original_image"])
with col2:
st.markdown(format_ocr_text(result["ocr_contents"]["raw_text"]))
with tabs[3]:
# Display raw JSON
st.json(result)
```
## 6. Testing and Reliability
### 6.1 Automated Testing
Implement comprehensive testing:
- **Issue**: Limited or no automated testing.
- **Solution**: Implement unit and integration tests:
```python
# Unit test for preprocessing
def test_preprocess_image():
# Test with various document types
for doc_type in ["standard", "handwritten", "newspaper"]:
# Load test image
with open(f"test_data/{doc_type}_sample.jpg", "rb") as f:
image_bytes = f.read()
# Apply preprocessing
options = {"document_type": doc_type, "grayscale": True, "denoise": True}
result = preprocess_image(image_bytes, options)
# Assert result is not None and different from original
assert result is not None
assert result != image_bytes
```
### 6.2 Error Recovery
Implement better error recovery mechanisms:
- **Issue**: Errors in one part of the pipeline can cause the entire process to fail.
- **Solution**: Implement graceful degradation:
```python
def process_with_fallbacks(file, options):
try:
# Try full processing pipeline
return full_processing_pipeline(file, options)
except OCRError as e:
logger.warning(f"Full pipeline failed: {e.message}. Trying simplified pipeline.")
try:
# Try simplified pipeline
return simplified_processing_pipeline(file, options)
except Exception as e2:
logger.error(f"Simplified pipeline failed: {str(e2)}. Falling back to basic OCR.")
# Fall back to basic OCR
return basic_ocr_only(file)
```
## 7. Documentation and Maintainability
### 7.1 Code Documentation
Improve code documentation:
- **Issue**: Inconsistent documentation across modules.
- **Solution**: Implement consistent docstring format and add module-level documentation:
```python
"""
OCR Processing Module
This module handles the core OCR processing functionality, including:
- File validation and preparation
- Image preprocessing
- OCR processing with Mistral AI
- Result formatting and enhancement
The main entry point is the `process_file` function.
"""
def process_file(file, options):
"""
Process a file with OCR.
Args:
file: The file to process (UploadedFile or bytes)
options: Dictionary of processing options
- document_type: Type of document (standard, handwritten, etc.)
- preprocessing: Dictionary of preprocessing options
- use_vision: Whether to use vision model
Returns:
Dictionary containing OCR results and metadata
Raises:
OCRError: If OCR processing fails
"""
# Implementation
```
### 7.2 Configuration Management
Improve configuration management:
- **Issue**: Configuration is scattered across multiple files.
- **Solution**: Implement a centralized configuration system:
```python
"""
Configuration Module
This module provides a centralized configuration system for the application.
"""
import os
import yaml
from pathlib import Path
class Config:
_instance = None
@classmethod
def get_instance(cls):
if cls._instance is None:
cls._instance = cls()
return cls._instance
def __init__(self):
self.config = {}
self.load_config()
def load_config(self):
# Load from config file
config_path = Path(__file__).parent / "config.yaml"
if config_path.exists():
with open(config_path, "r") as f:
self.config = yaml.safe_load(f)
# Override with environment variables
for key, value in os.environ.items():
if key.startswith("OCR_"):
config_key = key[4:].lower()
self.config[config_key] = value
def get(self, key, default=None):
return self.config.get(key, default)
```
## 8. Security Enhancements
### 8.1 API Key Management
Improve API key management:
- **Issue**: API keys are stored in environment variables with limited validation.
- **Solution**: Implement secure API key management:
```python
def get_api_key():
# Try to get from secure storage first
api_key = get_from_secure_storage("mistral_api_key")
# Fall back to environment variable
if not api_key:
api_key = os.environ.get("MISTRAL_API_KEY", "")
# Validate key format
if api_key and not re.match(r'^[A-Za-z0-9_-]{30,}$', api_key):
logger.warning("API key format appears invalid")
return api_key
```
### 8.2 Input Validation
Improve input validation:
- **Issue**: Limited validation of user inputs.
- **Solution**: Implement comprehensive input validation:
```python
def validate_file(file):
# Check file size
if len(file.getvalue()) > MAX_FILE_SIZE:
raise OCRError("File too large", "FILE_TOO_LARGE")
# Check file type
file_type = get_file_type(file)
if file_type not in ALLOWED_FILE_TYPES:
raise OCRError(f"Unsupported file type: {file_type}", "UNSUPPORTED_FILE_TYPE")
# Check for malicious content
if is_potentially_malicious(file):
raise OCRError("File appears to be malicious", "SECURITY_RISK")
return file_type
```
## 9. Performance Optimizations
### 9.1 Parallel Processing
Implement parallel processing for multi-page documents:
- **Issue**: Pages are processed sequentially, which can be slow for large documents.
- **Solution**: Implement parallel processing:
```python
def process_pdf_pages(pdf_path, options):
# Extract pages
pages = extract_pdf_pages(pdf_path)
# Process pages in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
future_to_page = {executor.submit(process_page, page, options): i
for i, page in enumerate(pages)}
results = []
for future in concurrent.futures.as_completed(future_to_page):
page_idx = future_to_page[future]
try:
result = future.result()
results.append((page_idx, result))
except Exception as e:
logger.error(f"Error processing page {page_idx}: {str(e)}")
# Sort results by page index
results.sort(key=lambda x: x[0])
# Combine results
return combine_page_results([r[1] for r in results])
```
### 9.2 Resource Management
Improve resource management:
- **Issue**: Temporary files are not always cleaned up properly.
- **Solution**: Implement better resource management:
```python
class TempFileManager:
def __init__(self):
self.temp_files = []
def create_temp_file(self, content, suffix=".tmp"):
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
tmp.write(content)
self.temp_files.append(tmp.name)
return tmp.name
def cleanup(self):
for temp_file in self.temp_files:
try:
if os.path.exists(temp_file):
os.unlink(temp_file)
except Exception as e:
logger.warning(f"Failed to remove temp file {temp_file}: {str(e)}")
self.temp_files = []
def __enter__(self):
return self
def __exit__(self, exc_type, exc_val, exc_tb):
self.cleanup()
```
## 10. Extensibility
### 10.1 Plugin System
Implement a plugin system for extensibility:
- **Issue**: Adding new document types or processing methods requires code changes.
- **Solution**: Implement a plugin system:
```python
class OCRPlugin:
def __init__(self, name, description):
self.name = name
self.description = description
def can_handle(self, file_info):
"""Return True if this plugin can handle the file"""
raise NotImplementedError
def process(self, file_info, options):
"""Process the file and return results"""
raise NotImplementedError
# Example plugin
class HandwrittenDocumentPlugin(OCRPlugin):
def __init__(self):
super().__init__("handwritten", "Handwritten document processor")
def can_handle(self, file_info):
# Check if this is a handwritten document
return file_info.get("document_type") == "handwritten"
def process(self, file_info, options):
# Specialized processing for handwritten documents
# ...
```
### 10.2 API Abstraction
Create an abstraction layer for the OCR API:
- **Issue**: The application is tightly coupled to the Mistral AI API.
- **Solution**: Implement an abstraction layer:
```python
class OCRProvider:
def process_image(self, image_path, options):
"""Process an image and return OCR results"""
raise NotImplementedError
def process_pdf(self, pdf_path, options):
"""Process a PDF and return OCR results"""
raise NotImplementedError
class MistralOCRProvider(OCRProvider):
def __init__(self, api_key=None):
self.client = MistralClient.get_instance(api_key)
def process_image(self, image_path, options):
# Implementation using Mistral API
def process_pdf(self, pdf_path, options):
# Implementation using Mistral API
# Factory function to get the appropriate provider
def get_ocr_provider(provider_name="mistral"):
if provider_name == "mistral":
return MistralOCRProvider()
# Add more providers as needed
raise ValueError(f"Unknown OCR provider: {provider_name}")
```
## Implementation Priority