Spaces:

milwright
/

historical-ocr

Running

File size: 19,850 Bytes

2d01495

# Historical OCR Application Improvements

Based on a thorough code review of the Historical OCR application, I've identified several areas for improvement to reduce technical debt and enhance the application's functionality, maintainability, and performance.

## 1. Code Organization and Structure

### 1.1 Modularize Large Functions
Several functions in the codebase are excessively long and handle multiple responsibilities:

- **Issue**: `process_file()` in ocr_processing.py is over 400 lines and handles file validation, preprocessing, OCR processing, and result formatting.
- **Solution**: Break down into smaller, focused functions:
  ```python
  def process_file(uploaded_file, options):
      # Validate and prepare file
      file_info = validate_and_prepare_file(uploaded_file)
      
      # Apply preprocessing based on document type
      preprocessed_file = preprocess_document(file_info, options)
      
      # Perform OCR processing
      ocr_result = perform_ocr(preprocessed_file, options)
      
      # Format and enhance results
      return format_and_enhance_results(ocr_result, file_info)
  ```

### 1.2 Consistent Error Handling
Error handling approaches vary across modules:

- **Issue**: Some functions use try/except blocks with detailed logging, while others return error dictionaries or raise exceptions.
- **Solution**: Implement a consistent error handling strategy:
  ```python
  class OCRError(Exception):
      def __init__(self, message, error_code=None, details=None):
          self.message = message
          self.error_code = error_code
          self.details = details
          super().__init__(self.message)
          
  def handle_error(func):
      @functools.wraps(func)
      def wrapper(*args, **kwargs):
          try:
              return func(*args, **kwargs)
          except OCRError as e:
              logger.error(f"OCR Error: {e.message} (Code: {e.error_code})")
              return {"error": e.message, "error_code": e.error_code, "details": e.details}
          except Exception as e:
              logger.error(f"Unexpected error: {str(e)}")
              return {"error": "An unexpected error occurred", "details": str(e)}
      return wrapper
  ```

## 2. API Integration and Performance

### 2.1 API Client Optimization
The Mistral API client initialization and usage can be improved:

- **Issue**: The client is initialized for each request and error handling is duplicated.
- **Solution**: Create a singleton API client with centralized error handling:
  ```python
  class MistralClient:
      _instance = None
      
      @classmethod
      def get_instance(cls, api_key=None):
          if cls._instance is None:
              cls._instance = cls(api_key)
          return cls._instance
          
      def __init__(self, api_key=None):
          self.api_key = api_key or os.environ.get("MISTRAL_API_KEY", "")
          self.client = Mistral(api_key=self.api_key)
          
      def process_ocr(self, document, **kwargs):
          try:
              return self.client.ocr.process(document=document, **kwargs)
          except Exception as e:
              # Centralized error handling
              return self._handle_api_error(e)
  ```

### 2.2 Caching Strategy
The current caching approach can be improved:

- **Issue**: Cache keys don't always account for all relevant parameters, and TTL is fixed at 24 hours.
- **Solution**: Implement a more sophisticated caching strategy:
  ```python
  def generate_cache_key(file_content, options):
      # Create a comprehensive hash of all relevant parameters
      options_str = json.dumps(options, sort_keys=True)
      content_hash = hashlib.md5(file_content).hexdigest()
      return f"{content_hash}_{hashlib.md5(options_str.encode()).hexdigest()}"
      
  # Adaptive TTL based on document type
  def get_cache_ttl(document_type):
      ttl_map = {
          "handwritten": 48 * 3600,  # 48 hours for handwritten docs
          "newspaper": 24 * 3600,    # 24 hours for newspapers
          "standard": 12 * 3600      # 12 hours for standard docs
      }
      return ttl_map.get(document_type, 24 * 3600)
  ```

## 3. State Management

### 3.1 Streamlit Session State
The application uses a complex state management approach:

- **Issue**: Many session state variables with unclear relationships and reset logic.
- **Solution**: Implement a more structured state management approach:
  ```python
  class DocumentState:
      def __init__(self):
          self.document = None
          self.original_bytes = None
          self.name = None
          self.mime_type = None
          self.is_sample = False
          self.processed = False
          self.temp_files = []
          
      def reset(self):
          # Clean up temp files
          for temp_file in self.temp_files:
              if os.path.exists(temp_file):
                  os.unlink(temp_file)
          
          # Reset state
          self.__init__()
          
  # Initialize in session state
  if 'document_state' not in st.session_state:
      st.session_state.document_state = DocumentState()
  ```

### 3.2 Result History Management
The current approach to managing result history can be improved:

- **Issue**: Results are stored directly in session state with limited management.
- **Solution**: Create a dedicated class for result history:
  ```python
  class ResultHistory:
      def __init__(self, max_results=20):
          self.results = []
          self.max_results = max_results
          
      def add_result(self, result):
          # Add timestamp and ensure result is serializable
          result = self._prepare_result(result)
          self.results.insert(0, result)
          
          # Trim to max size
          if len(self.results) > self.max_results:
              self.results = self.results[:self.max_results]
              
      def _prepare_result(self, result):
          # Add timestamp and ensure result is serializable
          result = result.copy()
          result['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
          
          # Ensure result is serializable
          return json.loads(json.dumps(result, default=str))
  ```

## 4. Image Processing Pipeline

### 4.1 Preprocessing Configuration
The preprocessing configuration can be improved:

- **Issue**: Preprocessing options are scattered across different parts of the code.
- **Solution**: Create a centralized preprocessing configuration:
  ```python
  PREPROCESSING_CONFIGS = {
      "standard": {
          "grayscale": True,
          "denoise": True,
          "contrast": 5,
          "deskew": True
      },
      "handwritten": {
          "grayscale": True,
          "denoise": True,
          "contrast": 10,
          "deskew": True,
          "adaptive_threshold": {
              "block_size": 21,
              "constant": 5
          }
      },
      "newspaper": {
          "grayscale": True,
          "denoise": True,
          "contrast": 5,
          "deskew": True,
          "column_detection": True
      }
  }
  ```

### 4.2 Image Segmentation
The image segmentation approach can be improved:

- **Issue**: Segmentation is optional and not well-integrated with the preprocessing pipeline.
- **Solution**: Make segmentation a standard part of the preprocessing pipeline for certain document types:
  ```python
  def preprocess_document(file_info, options):
      # Apply basic preprocessing
      preprocessed_file = apply_basic_preprocessing(file_info, options)
      
      # Apply segmentation for specific document types
      if options["document_type"] in ["newspaper", "book", "multi_column"]:
          return apply_segmentation(preprocessed_file, options)
      
      return preprocessed_file
  ```

## 5. User Experience Enhancements

### 5.1 Progressive Loading
Improve the user experience during processing:

- **Issue**: The UI can appear frozen during long-running operations.
- **Solution**: Implement progressive loading and feedback:
  ```python
  def process_with_feedback(file, options, progress_callback):
      # Update progress at each step
      progress_callback(10, "Validating document...")
      file_info = validate_and_prepare_file(file)
      
      progress_callback(30, "Preprocessing document...")
      preprocessed_file = preprocess_document(file_info, options)
      
      progress_callback(50, "Performing OCR...")
      ocr_result = perform_ocr(preprocessed_file, options)
      
      progress_callback(80, "Enhancing results...")
      final_result = format_and_enhance_results(ocr_result, file_info)
      
      progress_callback(100, "Complete!")
      return final_result
  ```

### 5.2 Result Visualization
Enhance the visualization of OCR results:

- **Issue**: Results are displayed in a basic format with limited visualization.
- **Solution**: Implement enhanced visualization options:
  ```python
  def display_enhanced_results(result):
      # Create tabs for different views
      tabs = st.tabs(["Text", "Annotated", "Side-by-Side", "JSON"])
      
      with tabs[0]:
          # Display formatted text
          st.markdown(format_ocr_text(result["ocr_contents"]["raw_text"]))
          
      with tabs[1]:
          # Display annotated image with bounding boxes
          display_annotated_image(result)
          
      with tabs[2]:
          # Display side-by-side comparison
          col1, col2 = st.columns(2)
          with col1:
              st.image(result["original_image"])
          with col2:
              st.markdown(format_ocr_text(result["ocr_contents"]["raw_text"]))
              
      with tabs[3]:
          # Display raw JSON
          st.json(result)
  ```

## 6. Testing and Reliability

### 6.1 Automated Testing
Implement comprehensive testing:

- **Issue**: Limited or no automated testing.
- **Solution**: Implement unit and integration tests:
  ```python
  # Unit test for preprocessing
  def test_preprocess_image():
      # Test with various document types
      for doc_type in ["standard", "handwritten", "newspaper"]:
          # Load test image
          with open(f"test_data/{doc_type}_sample.jpg", "rb") as f:
              image_bytes = f.read()
              
          # Apply preprocessing
          options = {"document_type": doc_type, "grayscale": True, "denoise": True}
          result = preprocess_image(image_bytes, options)
          
          # Assert result is not None and different from original
          assert result is not None
          assert result != image_bytes
  ```

### 6.2 Error Recovery
Implement better error recovery mechanisms:

- **Issue**: Errors in one part of the pipeline can cause the entire process to fail.
- **Solution**: Implement graceful degradation:
  ```python
  def process_with_fallbacks(file, options):
      try:
          # Try full processing pipeline
          return full_processing_pipeline(file, options)
      except OCRError as e:
          logger.warning(f"Full pipeline failed: {e.message}. Trying simplified pipeline.")
          try:
              # Try simplified pipeline
              return simplified_processing_pipeline(file, options)
          except Exception as e2:
              logger.error(f"Simplified pipeline failed: {str(e2)}. Falling back to basic OCR.")
              # Fall back to basic OCR
              return basic_ocr_only(file)
  ```

## 7. Documentation and Maintainability

### 7.1 Code Documentation
Improve code documentation:

- **Issue**: Inconsistent documentation across modules.
- **Solution**: Implement consistent docstring format and add module-level documentation:
  ```python
  """
  OCR Processing Module
  
  This module handles the core OCR processing functionality, including:
  - File validation and preparation
  - Image preprocessing
  - OCR processing with Mistral AI
  - Result formatting and enhancement
  
  The main entry point is the `process_file` function.
  """
  
  def process_file(file, options):
      """
      Process a file with OCR.
      
      Args:
          file: The file to process (UploadedFile or bytes)
          options: Dictionary of processing options
              - document_type: Type of document (standard, handwritten, etc.)
              - preprocessing: Dictionary of preprocessing options
              - use_vision: Whether to use vision model
              
      Returns:
          Dictionary containing OCR results and metadata
          
      Raises:
          OCRError: If OCR processing fails
      """
      # Implementation
  ```

### 7.2 Configuration Management
Improve configuration management:

- **Issue**: Configuration is scattered across multiple files.
- **Solution**: Implement a centralized configuration system:
  ```python
  """
  Configuration Module
  
  This module provides a centralized configuration system for the application.
  """
  
  import os
  import yaml
  from pathlib import Path
  
  class Config:
      _instance = None
      
      @classmethod
      def get_instance(cls):
          if cls._instance is None:
              cls._instance = cls()
          return cls._instance
          
      def __init__(self):
          self.config = {}
          self.load_config()
          
      def load_config(self):
          # Load from config file
          config_path = Path(__file__).parent / "config.yaml"
          if config_path.exists():
              with open(config_path, "r") as f:
                  self.config = yaml.safe_load(f)
                  
          # Override with environment variables
          for key, value in os.environ.items():
              if key.startswith("OCR_"):
                  config_key = key[4:].lower()
                  self.config[config_key] = value
                  
      def get(self, key, default=None):
          return self.config.get(key, default)
  ```

## 8. Security Enhancements

### 8.1 API Key Management
Improve API key management:

- **Issue**: API keys are stored in environment variables with limited validation.
- **Solution**: Implement secure API key management:
  ```python
  def get_api_key():
      # Try to get from secure storage first
      api_key = get_from_secure_storage("mistral_api_key")
      
      # Fall back to environment variable
      if not api_key:
          api_key = os.environ.get("MISTRAL_API_KEY", "")
          
      # Validate key format
      if api_key and not re.match(r'^[A-Za-z0-9_-]{30,}$', api_key):
          logger.warning("API key format appears invalid")
          
      return api_key
  ```

### 8.2 Input Validation
Improve input validation:

- **Issue**: Limited validation of user inputs.
- **Solution**: Implement comprehensive input validation:
  ```python
  def validate_file(file):
      # Check file size
      if len(file.getvalue()) > MAX_FILE_SIZE:
          raise OCRError("File too large", "FILE_TOO_LARGE")
          
      # Check file type
      file_type = get_file_type(file)
      if file_type not in ALLOWED_FILE_TYPES:
          raise OCRError(f"Unsupported file type: {file_type}", "UNSUPPORTED_FILE_TYPE")
          
      # Check for malicious content
      if is_potentially_malicious(file):
          raise OCRError("File appears to be malicious", "SECURITY_RISK")
          
      return file_type
  ```

## 9. Performance Optimizations

### 9.1 Parallel Processing
Implement parallel processing for multi-page documents:

- **Issue**: Pages are processed sequentially, which can be slow for large documents.
- **Solution**: Implement parallel processing:
  ```python
  def process_pdf_pages(pdf_path, options):
      # Extract pages
      pages = extract_pdf_pages(pdf_path)
      
      # Process pages in parallel
      with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
          future_to_page = {executor.submit(process_page, page, options): i 
                           for i, page in enumerate(pages)}
          
          results = []
          for future in concurrent.futures.as_completed(future_to_page):
              page_idx = future_to_page[future]
              try:
                  result = future.result()
                  results.append((page_idx, result))
              except Exception as e:
                  logger.error(f"Error processing page {page_idx}: {str(e)}")
                  
      # Sort results by page index
      results.sort(key=lambda x: x[0])
      
      # Combine results
      return combine_page_results([r[1] for r in results])
  ```

### 9.2 Resource Management
Improve resource management:

- **Issue**: Temporary files are not always cleaned up properly.
- **Solution**: Implement better resource management:
  ```python
  class TempFileManager:
      def __init__(self):
          self.temp_files = []
          
      def create_temp_file(self, content, suffix=".tmp"):
          with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
              tmp.write(content)
              self.temp_files.append(tmp.name)
              return tmp.name
              
      def cleanup(self):
          for temp_file in self.temp_files:
              try:
                  if os.path.exists(temp_file):
                      os.unlink(temp_file)
              except Exception as e:
                  logger.warning(f"Failed to remove temp file {temp_file}: {str(e)}")
          self.temp_files = []
          
      def __enter__(self):
          return self
          
      def __exit__(self, exc_type, exc_val, exc_tb):
          self.cleanup()
  ```

## 10. Extensibility

### 10.1 Plugin System
Implement a plugin system for extensibility:

- **Issue**: Adding new document types or processing methods requires code changes.
- **Solution**: Implement a plugin system:
  ```python
  class OCRPlugin:
      def __init__(self, name, description):
          self.name = name
          self.description = description
          
      def can_handle(self, file_info):
          """Return True if this plugin can handle the file"""
          raise NotImplementedError
          
      def process(self, file_info, options):
          """Process the file and return results"""
          raise NotImplementedError
          
  # Example plugin
  class HandwrittenDocumentPlugin(OCRPlugin):
      def __init__(self):
          super().__init__("handwritten", "Handwritten document processor")
          
      def can_handle(self, file_info):
          # Check if this is a handwritten document
          return file_info.get("document_type") == "handwritten"
          
      def process(self, file_info, options):
          # Specialized processing for handwritten documents
          # ...
  ```

### 10.2 API Abstraction
Create an abstraction layer for the OCR API:

- **Issue**: The application is tightly coupled to the Mistral AI API.
- **Solution**: Implement an abstraction layer:
  ```python
  class OCRProvider:
      def process_image(self, image_path, options):
          """Process an image and return OCR results"""
          raise NotImplementedError
          
      def process_pdf(self, pdf_path, options):
          """Process a PDF and return OCR results"""
          raise NotImplementedError
          
  class MistralOCRProvider(OCRProvider):
      def __init__(self, api_key=None):
          self.client = MistralClient.get_instance(api_key)
          
      def process_image(self, image_path, options):
          # Implementation using Mistral API
          
      def process_pdf(self, pdf_path, options):
          # Implementation using Mistral API
          
  # Factory function to get the appropriate provider
  def get_ocr_provider(provider_name="mistral"):
      if provider_name == "mistral":
          return MistralOCRProvider()
      # Add more providers as needed
      raise ValueError(f"Unknown OCR provider: {provider_name}")
  ```

## Implementation Priority