krishnavadithya committed
Commit aacdfd5 · verified · 1 Parent(s): a37298c

Upload folder using huggingface_hub

.DS_Store ADDED
Binary file (6.15 kB).
 
.gitignore ADDED
@@ -0,0 +1,16 @@
+ .env
+ result/
+ *.pdf
+ *.xlsx
+ *.xls
+ *.doc
+ *.docx
+ expiry_invoice/
+ ignore_code/
+ test.ipynb
+ __pycache__/
+ content/
+ .gradio/
+ *.json
+ invoiceprocessing/
+ invoiceprocessing/*
README.md CHANGED
@@ -1,12 +1,86 @@
---
- title: Expiryprocess
- emoji: 🦀
- colorFrom: red
- colorTo: red
+ title: expiryprocess
+ app_file: gradio_app.py
sdk: gradio
sdk_version: 5.20.1
- app_file: app.py
- pinned: false
---
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Invoice Processing System with Gradio UI
+
+ This system processes invoice files (PDF, Excel, Word, Text) and extracts structured data using a combination of OCR, regex patterns, and LLM-based extraction. The extracted data can be downloaded as CSV.
+
+ ## Features
+
+ - **Multiple File Formats**: Supports PDF, Excel (.xlsx, .xls), Word (.doc, .docx), and Text (.txt) files
+ - **Document Conversion**: Automatically converts Word and Text files to PDF for processing
+ - **LLM-Enhanced Extraction**: Uses Google's Gemini models for improved extraction accuracy (requires an API key)
+ - **Web Interface**: Easy-to-use Gradio UI for uploading files and downloading results
+ - **CSV Export**: Download extracted data as CSV for further analysis
+
+ ## Installation
+
+ 1. Clone this repository:
+ ```bash
+ git clone <repository-url>
+ cd invoice-processing-system
+ ```
+
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 3. Set up environment variables:
+ - Create a `.env` file in the project root
+ - Add your Gemini API key for LLM processing. The extraction modules read `GEMINI_API_KEY` and the app's availability check reads `GOOGLE_API_KEY`, so set both to the same key:
+ ```
+ GOOGLE_API_KEY=your_api_key_here
+ GEMINI_API_KEY=your_api_key_here
+ ```
+
+ ## Usage
+
+ ### Web Interface (Gradio UI)
+
+ 1. Start the Gradio web interface:
+ ```bash
+ python gradio_app.py
+ ```
+
+ 2. Open your browser and navigate to the URL shown in the terminal (typically http://127.0.0.1:7860)
+
+ 3. Upload an invoice file using the file upload button
+
+ 4. Click "Process Invoice" to extract data from the file
+
+ 5. View the extracted data in the table and download it as CSV using the download button
+
+ ### Command Line Interface
+
+ You can also process a file from the command line:
+
+ ```bash
+ # Process a PDF invoice
+ python process_invoice.py --file_path path/to/invoice.pdf
+
+ # Process an Excel invoice
+ python process_invoice.py --file_path path/to/invoice.xlsx
+
+ # Process a Word document
+ python process_invoice.py --file_path path/to/invoice.docx
+ ```
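+
+ Each run writes a JSON file to the `result/` directory, named after the input file. The field names below match the extraction models in `process/`; the values are illustrative only:
+
+ ```json
+ {
+   "headers": ["Product Name", "Batch Number", "Expiry Date", "MRP", "Quantity"],
+   "items": [
+     {
+       "product_name": "Sample Tablet 500mg",
+       "batch_number": "B1234",
+       "expiry_date": "08/26",
+       "mrp": "120.00",
+       "quantity": 10
+     }
+   ]
+ }
+ ```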
+
+ ## Requirements
+
+ - Python 3.8+
+ - Google Gemini API key (for LLM-based extraction)
+ - LibreOffice (for converting .doc/.docx files to PDF)
+ - Tesseract OCR (for PDF processing)
+
+ ## Troubleshooting
+
+ - **LLM Processing Not Available**: Ensure your API key is set correctly in the `.env` file (both `GOOGLE_API_KEY` and `GEMINI_API_KEY`)
+ - **PDF Conversion Issues**: Make sure LibreOffice is installed and accessible in your PATH
+ - **OCR Quality Issues**: Ensure Tesseract OCR is properly installed and configured
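+
+ ## Programmatic Use
+
+ Both the Gradio app and the CLI delegate to `process_file` in `process_invoice.py`, so you can call it from your own scripts. A minimal sketch (run from the repository root with the environment variables above set; the path is a placeholder):
+
+ ```python
+ import json
+ from process_invoice import process_file
+
+ # Processes the invoice and writes result/<input-name>.json, returning its path
+ json_path = process_file("path/to/invoice.pdf")
+
+ with open(json_path, "r", encoding="utf-8") as f:
+     data = json.load(f)
+ print(f"Extracted {len(data['items'])} items")
+ ```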
+
+ ## License
+
+ [MIT License](LICENSE)
gradio_app.py ADDED
@@ -0,0 +1,221 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Gradio web interface for invoice processing system.
4
+ This UI allows users to upload invoice files (PDF, DOCX, TXT, etc.) and download the results as CSV.
5
+ """
6
+
7
+ import os
8
+ import sys
9
+ import csv
10
+ import tempfile
11
+ import logging
12
+ import pandas as pd
13
+ from pathlib import Path
14
+ from typing import Dict, List, Optional, Tuple, Union
15
+
16
+ import gradio as gr
17
+ from gradio_pdf import PDF # Import the enhanced PDF component
18
+ from dotenv import load_dotenv
19
+
20
+ # Import the invoice processing functionality
21
+ from process_invoice import process_file, setup_google_client
22
+ # Load environment variables
23
+ load_dotenv()
24
+
25
+ # Configure logging
26
+ logging.basicConfig(
27
+ level=logging.INFO,
28
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
29
+ )
30
+ logger = logging.getLogger(__name__)
31
+
32
+ # Check if Google API is available
33
+ GOOGLE_API_AVAILABLE = setup_google_client() is not None
34
+
35
+ def convert_to_csv(invoice_data: Dict) -> str:
36
+ """
37
+ Convert invoice data to CSV format.
38
+
39
+ Args:
40
+ invoice_data: Dictionary containing invoice data
41
+
42
+ Returns:
43
+ Path to the generated CSV file
44
+ """
45
+ # Create a temporary file for the CSV
46
+ fd, temp_csv_path = tempfile.mkstemp(suffix='.csv')
47
+ os.close(fd)
48
+
49
+ # Extract items from invoice data
50
+ items = invoice_data.get('items', [])
51
+
52
+ if not items:
53
+ logger.warning("No items found in invoice data")
54
+ return temp_csv_path
55
+
56
+ # Get all unique keys from all items to use as headers
57
+ all_keys = set()
58
+ for item in items:
59
+ all_keys.update(item.keys())
60
+
61
+ # Write to CSV
62
+ with open(temp_csv_path, 'w', newline='', encoding='utf-8') as csvfile:
63
+ writer = csv.DictWriter(csvfile, fieldnames=sorted(all_keys))
64
+ writer.writeheader()
65
+ writer.writerows(items)
66
+
67
+ logger.info(f"CSV file created at {temp_csv_path}")
68
+ return temp_csv_path
69
+
70
+ def process_invoice_file(
71
+ file_obj: tempfile._TemporaryFileWrapper,
72
+ use_llm: bool = True
73
+ ) -> Tuple[Dict, str, str, Optional[str], Optional[str]]:
74
+ """
75
+ Process an uploaded invoice file and return the results.
76
+
77
+ Args:
78
+ file_obj: The uploaded file object
79
+ use_llm: Whether to use LLM for processing
80
+
81
+ Returns:
82
+ Tuple containing:
83
+ - Dictionary of extracted data
84
+ - HTML table for display
85
+ - Status message
86
+ - Path to CSV file (or None if processing failed)
87
+ - Path to PDF file for display (or None if not a PDF)
88
+ """
89
+ if not file_obj:
90
+ return {}, "", "No file uploaded", None, None
91
+
92
+ # Get the file extension
93
+ file_path = file_obj.name
94
+ file_ext = os.path.splitext(file_path)[1].lower()
95
+
96
+ # Check if file format is supported
97
+ supported_formats = ['.pdf', '.xlsx', '.xls', '.doc', '.docx', '.txt']
98
+ if file_ext not in supported_formats:
99
+ return {}, "", f"Unsupported file format: {file_ext}. Supported formats: {', '.join(supported_formats)}", None, None
100
+
101
+ # Process the file
102
+ logger.info(f"Processing file: {file_path}")
103
+
104
+ # Create a temporary directory for JSON output
105
+ result_dir = Path("result")
106
+ result_dir.mkdir(exist_ok=True)
107
+
108
+ # For PDF display
109
+ pdf_path = file_path
110
+
111
+ # If the file is not a PDF, convert it to PDF for display
112
+ if file_ext != '.pdf':
113
+ temp_pdf = None
114
+ try:
115
+ if file_ext in ['.xlsx', '.xls']:
116
+ from src.excel_to_pdf import excel_to_pdf, convert_xls_to_xlsx
117
+ if file_ext == '.xls':
118
+ xlsx_path = convert_xls_to_xlsx(file_path, tempfile.NamedTemporaryFile(delete=False, suffix='.xlsx').name)
119
+ temp_pdf = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf').name
120
+ pdf_path = excel_to_pdf(xlsx_path, pdf_path=temp_pdf)
121
+ else:
122
+ temp_pdf = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf').name
123
+ pdf_path = excel_to_pdf(file_path, pdf_path=temp_pdf)
124
+ elif file_ext in ['.doc', '.docx']:
125
+ from src.docx_to_pdf import docx_to_pdf
126
+ temp_pdf = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf').name
127
+ pdf_path = docx_to_pdf(file_path, temp_pdf)
128
+ elif file_ext == '.txt':
129
+ from src.txt_to_pdf import txt_to_pdf
130
+ temp_pdf = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf').name
131
+ pdf_path = txt_to_pdf(file_path, temp_pdf)
132
+
133
+ logger.info(f"Converted {file_ext} file to PDF: {pdf_path}")
134
+ except Exception as e:
135
+ logger.error(f"Error converting file to PDF: {str(e)}")
136
+ pdf_path = None
137
+
138
+ json_path = process_file(file_path)
139
+
140
+ # Try to read the JSON file that was created
141
+ if os.path.exists(json_path):
142
+ import json
143
+ with open(json_path, 'r', encoding='utf-8') as f:
144
+ invoice_data = json.load(f)
145
+ else:
146
+ return {}, "", "Failed to process file. No output data found.", None, pdf_path
147
+
148
+ # Create a DataFrame for display
149
+ items = invoice_data.get('items', [])
150
+ if 'error' in invoice_data and invoice_data['error']:
151
+ html_table = f"<p class='error' style='color: red; font-weight: bold;'>{invoice_data['error']}</p>"
152
+ status = f"Error: {invoice_data['error']}"
153
+ # Still create CSV with any available items
154
+ csv_path = convert_to_csv(invoice_data)
155
+ return invoice_data, html_table, status, csv_path, pdf_path
156
+ elif items:
157
+ df = pd.DataFrame(items)
158
+ html_table = df.to_html(classes='table table-striped')
159
+ status = f"Successfully processed {len(items)} items from {os.path.basename(file_path)}"
160
+ # Convert to CSV
161
+ csv_path = convert_to_csv(invoice_data)
162
+ else:
163
+ html_table = "<p>No items found in the invoice</p>"
164
+ status = "No items extracted from the file"
165
+ # Create empty CSV
166
+ csv_path = convert_to_csv({"items": []})
167
+
168
+ return invoice_data, html_table, status, csv_path, pdf_path
169
+
170
+
171
+ def create_ui() -> gr.Blocks:
172
+ """Create and return the Gradio UI."""
173
+ with gr.Blocks(title="Invoice Processing System") as app:
174
+ gr.Markdown("# Invoice Processing System")
175
+ gr.Markdown("Upload an invoice file (PDF, Excel, Word, or Text) to extract and download the data as CSV.")
176
+
177
+ with gr.Row():
178
+ with gr.Column(scale=1):
179
+ file_input = gr.File(label="Upload Invoice File")
180
+ process_button = gr.Button("Process Invoice", variant="primary")
181
+ status_output = gr.Textbox(label="Status", interactive=False)
182
+ csv_output = gr.File(label="Download CSV", interactive=False)
183
+
184
+ with gr.Column(scale=2):
185
+ with gr.Tabs():
186
+ with gr.TabItem("Extracted Data"):
187
+ results_html = gr.HTML(label="Extracted Data")
188
+ with gr.TabItem("PDF View"):
189
+ # Use the enhanced PDF component from gradio_pdf
190
+ pdf_viewer = PDF(label="Invoice PDF", interactive=False)
191
+
192
+ # Define the process flow
193
+ process_button.click(
194
+ fn=process_invoice_file,
195
+ inputs=[file_input],
196
+ outputs=[gr.State(), results_html, status_output, csv_output, pdf_viewer]
197
+ )
198
+
199
+ # Add examples if available
200
+ example_dir = Path("examples")
201
+ if example_dir.exists():
202
+ example_files = list(example_dir.glob("*.pdf")) + list(example_dir.glob("*.xlsx"))
203
+ if example_files:
204
+ gr.Examples(
205
+ examples=[[str(f)] for f in example_files],
206
+ inputs=[file_input]
207
+ )
208
+
209
+ return app
210
+
211
+ def main():
212
+ """Main function to launch the Gradio app."""
213
+ app = create_ui()
214
+ app.launch(
215
+ server_name="0.0.0.0", # Make accessible from other computers
216
+ share=True, # Create a public link
217
+ inbrowser=True # Open in browser
218
+ )
219
+
220
+ if __name__ == "__main__":
221
+ main()
install_requirement.sh ADDED
@@ -0,0 +1,6 @@
+ sudo apt-get update
+ sudo apt-get install -y poppler-utils
+ sudo apt-get install -y tesseract-ocr
+ sudo apt-get install -y libreoffice
+ sudo apt-get install -y python3-pip
+ pip install -r requirements.txt
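A quick way to confirm the system packages installed by this script are on the PATH (illustrative commands, not part of the repository):

```bash
libreoffice --version
pdftoppm -v          # provided by poppler-utils, used by pdf2image
tesseract --version
```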
process/__init__.py ADDED
@@ -0,0 +1,2 @@
+ from .process_pdf_with_headers import InvoiceItem, InvoiceData
+ from .process_excel import process_excel_file  # Import the function for processing Excel files
process/process_excel.py ADDED
@@ -0,0 +1,215 @@
1
+ import pandas as pd
2
+ import os
3
+ import json
4
+ import re
5
+ import concurrent.futures
6
+ from dotenv import load_dotenv
7
+ from google import genai
8
+ from typing import List, Dict, Any, Optional, Tuple
9
+ import logging
10
+ from pathlib import Path
11
+
12
+ # Configure logging
13
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
14
+ logger = logging.getLogger(__name__)
15
+
16
+
17
+ def setup_environment() -> None:
18
+ """
19
+ Load environment variables from the .env file (the Gemini client itself is configured in get_gemini_client).
20
+
21
+ Returns:
22
+ None
23
+ """
24
+ load_dotenv()
25
+
26
+
27
+ def get_gemini_client() -> genai.Client:
28
+ """
29
+ Initialize and return a Gemini API client.
30
+
31
+ Returns:
32
+ genai.Client: Configured Gemini client
33
+ """
34
+ api_key = os.getenv("GEMINI_API_KEY")
35
+ if not api_key:
36
+ raise ValueError("GEMINI_API_KEY environment variable not set")
37
+ return genai.Client(api_key=api_key)
38
+
39
+
40
+ def process_chunk(chunk_info: Tuple[int, pd.DataFrame, int, int], client: genai.Client) -> List[Dict[str, Any]]:
41
+ """
42
+ Process a single chunk of data using Gemini API.
43
+
44
+ Args:
45
+ chunk_info: Tuple containing (chunk_index, dataframe_chunk, start_index, end_index)
46
+ client: Gemini API client
47
+
48
+ Returns:
49
+ List of extracted items from the chunk
50
+ """
51
+ i, chunk_df, start_idx, end_idx = chunk_info
52
+
53
+ # Create a structured extraction prompt for the specific chunk
54
+ extraction_prompt = f"""
55
+ Extract product information from rows {start_idx} to {end_idx-1} in this Excel data.
56
+
57
+ For each product row, extract:
58
+ 1. Product name
59
+ 2. Batch number
60
+ 3. Expiry date (MM/YY format)
61
+ 4. MRP (Maximum Retail Price)
62
+ 5. Quantity (as integer)
63
+
64
+ Return ONLY a JSON array of objects, one for each product, with these properties:
65
+ [
66
+ {{
67
+ "product_name": "...",
68
+ "batch_number": "...",
69
+ "expiry_date": "...",
70
+ "mrp": "...",
71
+ "quantity": ...
72
+ }},
73
+ ...
74
+ ]
75
+
76
+ Use null for any value you cannot extract. Return ONLY the JSON array.
77
+ """
78
+
79
+ chunk_items = []
80
+
81
+ # Process chunk
82
+ try:
83
+ chunk_response = client.models.generate_content(
84
+ model="gemini-2.0-flash",
85
+ contents=[extraction_prompt, chunk_df.to_string()],
86
+ config={
87
+ 'response_mime_type': 'application/json',
88
+ 'temperature': 0.1,
89
+ 'max_output_tokens': 8192,
90
+ }
91
+ )
92
+
93
+ # Extract items
94
+ chunk_text = chunk_response.text
95
+ # Fix common JSON issues
96
+ chunk_text = re.sub(r'[\n\r\t]', '', chunk_text)
97
+ chunk_text = re.sub(r',\s*]', ']', chunk_text)
98
+
99
+ # Extract JSON array
100
+ match = re.search(r'\[(.*)\]', chunk_text, re.DOTALL)
101
+ if match:
102
+ try:
103
+ chunk_items = json.loads('[' + match.group(1) + ']')
104
+ logger.info(f"Successfully processed chunk {i+1} with {len(chunk_items)} items")
105
+ except json.JSONDecodeError:
106
+ logger.error(f"Error parsing JSON in chunk {i+1}")
107
+
108
+ except Exception as e:
109
+ logger.error(f"Error processing chunk {i+1}: {str(e)}")
110
+
111
+ return chunk_items
112
+
113
+
114
+ def prepare_chunks(df: pd.DataFrame, chunk_size: int) -> List[Tuple[int, pd.DataFrame, int, int]]:
115
+ """
116
+ Prepare dataframe chunks for processing.
117
+
118
+ Args:
119
+ df: Input dataframe
120
+ chunk_size: Size of each chunk
121
+
122
+ Returns:
123
+ List of chunk information tuples
124
+ """
125
+ num_chunks = (len(df) + chunk_size - 1) // chunk_size
126
+ chunks_to_process = []
127
+
128
+ for i in range(num_chunks):
129
+ start_idx = i * chunk_size
130
+ end_idx = min((i + 1) * chunk_size, len(df))
131
+ chunk_df = df.iloc[start_idx:end_idx]
132
+ chunks_to_process.append((i, chunk_df, start_idx, end_idx))
133
+
134
+ return chunks_to_process
135
+
136
+
137
+ def process_excel_file(file_path: str, output_path: str, chunk_size: int = 20, max_workers: int = 2) -> Dict[str, Any]:
138
+ """
139
+ Process an Excel file to extract product information using Gemini API.
140
+
141
+ Args:
142
+ file_path: Path to the Excel file
143
+ output_path: Path to save the extracted data
144
+ chunk_size: Size of each chunk for processing
145
+ max_workers: Maximum number of parallel workers
146
+
147
+ Returns:
148
+ Dict containing the extraction results
149
+ """
150
+ # Setup environment
151
+ setup_environment()
152
+ client = get_gemini_client()
153
+
154
+ # Read Excel file
155
+ logger.info(f"Reading Excel file: {file_path}")
156
+ df = pd.read_excel(file_path)
157
+
158
+ # Prepare chunks for processing
159
+ chunks_to_process = prepare_chunks(df, chunk_size)
160
+ num_chunks = len(chunks_to_process)
161
+
162
+ # Process chunks in parallel
163
+ logger.info(f"Processing {num_chunks} chunks with {max_workers} workers")
164
+ all_items = []
165
+
166
+ with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
167
+ # Pass client to each process_chunk call
168
+ results = list(executor.map(
169
+ lambda chunk: process_chunk(chunk, client),
170
+ chunks_to_process
171
+ ))
172
+
173
+ # Combine results
174
+ for chunk_items in results:
175
+ all_items.extend(chunk_items)
176
+
177
+ # Create final result
178
+ final_result = {
179
+ "items": all_items,
180
+ "extraction_status": "COMPLETE" if all_items else "INCOMPLETE",
181
+ "total_items": len(all_items)
182
+ }
183
+
184
+ # Save the final result
185
+ with open(output_path, "w") as f:
186
+ json.dump(final_result, f, indent=2)
187
+
188
+ logger.info(f"Extraction complete. Total items extracted: {len(all_items)}")
189
+ return final_result
190
+
191
+
192
+ def main() -> None:
193
+ """
194
+ Main function to run the Excel processing script.
195
+ """
196
+ input_file = 'expiry_invoice/SAC01000975.xls'
197
+ output_file = "extracted_invoice_data.json"
198
+
199
+ # Ensure the output directory exists
200
+ output_path = Path(output_file)
201
+ output_path.parent.mkdir(parents=True, exist_ok=True)
202
+
203
+ # Process the Excel file
204
+ result = process_excel_file(
205
+ file_path=input_file,
206
+ output_path=output_file,
207
+ chunk_size=20,
208
+ max_workers=2
209
+ )
210
+
211
+ print(f"Extraction complete. Total items extracted: {result['total_items']}")
212
+
213
+
214
+ if __name__ == "__main__":
215
+ main()
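The chunking above slices the spreadsheet into fixed-size row windows before each window is sent to Gemini. A minimal sketch of how `prepare_chunks` splits a DataFrame (assumes the repository root is on `PYTHONPATH` and the dependencies in `requirements.txt` are installed):

```python
import pandas as pd

from process.process_excel import prepare_chunks

df = pd.DataFrame({"product": range(45)})  # 45 rows with chunk_size=20 -> 3 chunks
for i, chunk_df, start, end in prepare_chunks(df, chunk_size=20):
    print(f"chunk {i}: rows {start}..{end - 1} ({len(chunk_df)} rows)")
# chunk 0: rows 0..19 (20 rows)
# chunk 1: rows 20..39 (20 rows)
# chunk 2: rows 40..44 (5 rows)
```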
process/process_pdf_with_headers.py ADDED
@@ -0,0 +1,251 @@
1
+ from google import genai
2
+ from pydantic import BaseModel, Field
3
+ from typing import List, Optional, Dict, Tuple
4
+ import pdf2image
5
+ import os
6
+ from pathlib import Path
7
+ import concurrent.futures
8
+ from dataclasses import dataclass
9
+ from functools import partial
10
+ import logging
11
+ from PIL import Image
12
+ from dotenv import load_dotenv
13
+
14
+
15
+ load_dotenv()
16
+
17
+
18
+ class InvoiceItem(BaseModel):
19
+ """Represents a single item in an invoice."""
20
+ product_name: str = Field(description="The name of the product")
21
+ batch_number: str = Field(description="The batch number of the product")
22
+ expiry_date: str = Field(description="The expiry date (format: MM/YY)")
23
+ mrp: str = Field(description="Maximum Retail Price")
24
+ quantity: int = Field(description="Product quantity")
25
+
26
+ class InvoiceData(BaseModel):
27
+ """Represents the complete invoice data including headers."""
28
+ headers: List[str] = Field(
29
+ description="Column headers from the invoice table",
30
+ default_factory=list
31
+ )
32
+ items: List[InvoiceItem] = Field(
33
+ description="List of extracted invoice items",
34
+ default_factory=list
35
+ )
36
+
37
+ class HeaderExtraction(BaseModel):
38
+ """Model for extracting headers separately."""
39
+ headers: List[str] = Field(
40
+ description="The column headers found in the invoice table"
41
+ )
42
+
43
+ @dataclass
44
+ class PageData:
45
+ """Container for page processing data."""
46
+ idx: int
47
+ image_path: str
48
+ headers: List[str]
49
+ items: List[InvoiceItem]
50
+
51
+ def extract_headers(client: genai.Client, image_path: str, model_id: str) -> List[str]:
52
+ """
53
+ Extract column headers from the first page of the invoice.
54
+
55
+ Args:
56
+ client: The Gemini API client
57
+ image_path: Path to the image file
58
+ model_id: The model ID to use for extraction
59
+
60
+ Returns:
61
+ List of column headers
62
+ """
63
+ header_prompt = """
64
+ Extract only the column headers from this invoice table.
65
+ Return them exactly as they appear, maintaining their order from left to right.
66
+ Only extract the headers, not any data from the rows.
67
+ """
68
+
69
+ image_file = client.files.upload(
70
+ file=image_path,
71
+ config={'display_name': 'invoice_header_page'}
72
+ )
73
+
74
+ response = client.models.generate_content(
75
+ model=model_id,
76
+ contents=[header_prompt, image_file],
77
+ config={
78
+ 'response_mime_type': 'application/json',
79
+ 'response_schema': HeaderExtraction
80
+ }
81
+ )
82
+
83
+ return response.parsed.headers if response.parsed else []
84
+
85
+ def setup_client() -> genai.Client:
86
+ """Create and return a Gemini API client."""
87
+ return genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
88
+
89
+ def save_image(image: Image.Image, temp_dir: Path, idx: int) -> str:
90
+ """
91
+ Save a single page image to disk.
92
+
93
+ Args:
94
+ image: The PDF page image (PIL Image)
95
+ temp_dir: Directory to save the image
96
+ idx: Page index
97
+
98
+ Returns:
99
+ Path to the saved image
100
+ """
101
+ image_path = str(temp_dir / f"page_{idx+1}.jpg")
102
+ image.save(image_path, "JPEG")
103
+ return image_path
104
+
105
+ def process_single_page(
106
+ page_data: Tuple[int, Image.Image, Path, List[str], genai.Client, str]
107
+ ) -> PageData:
108
+ """
109
+ Process a single page of the PDF.
110
+
111
+ Args:
112
+ page_data: Tuple containing (page_index, page_image, temp_dir, headers, client, model_id)
113
+
114
+ Returns:
115
+ PageData object containing extracted information
116
+ """
117
+ idx, image, temp_dir, headers, client, model_id = page_data
118
+
119
+ # Save image
120
+ image_path = save_image(image, temp_dir, idx)
121
+
122
+ # First page: extract headers
123
+ if idx == 0:
124
+ headers = extract_headers(client, image_path, model_id)
125
+ prompt = """
126
+ Extract product details from this invoice table.
127
+ Use the exact column headers you see in the table.
128
+ """
129
+ else:
130
+ headers_str = ", ".join(headers)
131
+ prompt = f"""
132
+ Extract product details from this invoice table.
133
+ This is page {idx + 1} of the same invoice.
134
+ Use these column headers: {headers_str}
135
+ Ensure the extracted data aligns with these columns in order.
136
+ """
137
+
138
+ # Process image
139
+ image_file = client.files.upload(
140
+ file=image_path,
141
+ config={'display_name': f'invoice_page_{idx+1}'}
142
+ )
143
+
144
+ response = client.models.generate_content(
145
+ model=model_id,
146
+ contents=[prompt, image_file],
147
+ config={
148
+ 'response_mime_type': 'application/json',
149
+ 'response_schema': InvoiceData
150
+ }
151
+ )
152
+
153
+ items = response.parsed.items if response.parsed and response.parsed.items else []
154
+ return PageData(idx=idx, image_path=image_path, headers=headers, items=items)
155
+
156
+ def process_pdf_with_headers(pdf_path: str, max_workers: int = 3) -> InvoiceData:
157
+ """
158
+ Process a PDF invoice while preserving column header context using parallel processing.
159
+
160
+ Args:
161
+ pdf_path: Path to the PDF file
162
+ max_workers: Maximum number of concurrent workers
163
+
164
+ Returns:
165
+ InvoiceData object containing headers and extracted items
166
+ """
167
+ # Convert PDF pages to images
168
+ images = pdf2image.convert_from_path(pdf_path)
169
+
170
+ # Create temp directory
171
+ temp_dir = Path("content/temp")
172
+ temp_dir.mkdir(parents=True, exist_ok=True)
173
+
174
+ # Initialize shared resources
175
+ client = setup_client()
176
+ model_id = "gemini-2.0-flash"
177
+ headers: List[str] = []
178
+
179
+ # Prepare data for parallel processing
180
+ page_data = []
181
+
182
+ try:
183
+ # Process first page separately to get headers
184
+ first_page = process_single_page((0, images[0], temp_dir, headers, client, model_id))
185
+ headers = first_page.headers
186
+ all_items = first_page.items
187
+
188
+ # Prepare remaining pages for parallel processing
189
+ remaining_pages = [
190
+ (i, img, temp_dir, headers, client, model_id)
191
+ for i, img in enumerate(images[1:], start=1)
192
+ ]
193
+
194
+ # Process remaining pages in parallel
195
+ with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
196
+ future_to_page = {
197
+ executor.submit(process_single_page, page): page[0]
198
+ for page in remaining_pages
199
+ }
200
+
201
+ # Collect results as they complete
202
+ for future in concurrent.futures.as_completed(future_to_page):
203
+ page_idx = future_to_page[future]
204
+ try:
205
+ page_result = future.result()
206
+ all_items.extend(page_result.items)
207
+ except Exception as e:
208
+ logging.error(f"Error processing page {page_idx}: {str(e)}")
209
+
210
+ finally:
211
+ # Cleanup temporary files
212
+ for file in temp_dir.glob("*.jpg"):
213
+ try:
214
+ file.unlink()
215
+ except Exception as e:
216
+ logging.warning(f"Failed to delete temporary file {file}: {str(e)}")
217
+
218
+ return InvoiceData(headers=headers, items=all_items)
219
+
220
+ def main():
221
+ """Main function to demonstrate usage."""
222
+ # Configure logging
223
+ logging.basicConfig(
224
+ level=logging.INFO,
225
+ format='%(asctime)s - %(levelname)s - %(message)s'
226
+ )
227
+
228
+ try:
229
+ invoice_data = process_pdf_with_headers(
230
+ "/Users/krishnaadithya/Desktop/dev/invoice_processing_2.0/pdf_only/expiry_invoice/DR REDDYS PE 1194.pdf",
231
+ max_workers=3 # Adjust based on your system and API limits
232
+ )
233
+
234
+ # Print headers
235
+ print("Column Headers:", ", ".join(invoice_data.headers))
236
+ print("\nExtracted Items:")
237
+
238
+ # Print results
239
+ for item in invoice_data.items:
240
+ print(f"Product: {item.product_name}")
241
+ print(f"Batch: {item.batch_number}")
242
+ print(f"Expiry: {item.expiry_date}")
243
+ print(f"MRP: {item.mrp}")
244
+ print(f"Quantity: {item.quantity}")
245
+ print("-" * 50)
246
+
247
+ except Exception as e:
248
+ logging.error(f"Error processing invoice: {str(e)}")
249
+
250
+ if __name__ == "__main__":
251
+ main()
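For reference, the Pydantic models defined above serialize directly to the dictionaries that `process_invoice.py` later writes to JSON. A minimal sketch (assumes Pydantic v2, which provides `model_dump`; the values are made up):

```python
from process.process_pdf_with_headers import InvoiceData, InvoiceItem

item = InvoiceItem(product_name="Sample Tablet 500mg", batch_number="B1234",
                   expiry_date="08/26", mrp="120.00", quantity=10)
data = InvoiceData(headers=["Product", "Batch", "Expiry", "MRP", "Qty"], items=[item])
print(data.model_dump())  # {'headers': [...], 'items': [{'product_name': ..., ...}]}
```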
process_invoice.py ADDED
@@ -0,0 +1,219 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Unified invoice processing script that handles both PDF and Excel files.
4
+ """
5
+
6
+ import os
7
+ import sys
8
+ # Add the project root directory to the Python path
9
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
10
+ import json
11
+ import logging
12
+ from typing import Optional
13
+ from pathlib import Path
14
+ import argparse
15
+ import tempfile
16
+ from dotenv import load_dotenv
17
+
18
+ # Import document processing functions
19
+ from process.process_pdf_with_headers import process_pdf_with_headers
20
+ from process.process_excel import process_excel_file
21
+ from src.excel_to_pdf import excel_to_pdf, convert_xls_to_xlsx
22
+ from src.docx_to_pdf import docx_to_pdf
23
+ from src.txt_to_pdf import txt_to_pdf
24
+
25
+ # Load environment variables from .env file if it exists
26
+ load_dotenv()
27
+
28
+ # Configure logging
29
+ logging.basicConfig(
30
+ level=logging.INFO,
31
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
32
+ )
33
+ logger = logging.getLogger(__name__)
34
+
35
+ def setup_google_client():
36
+ """Set up and return the Google Generative AI client."""
37
+ try:
38
+ from google import genai
39
+ api_key = os.environ.get("GOOGLE_API_KEY")
40
+ if not api_key:
41
+ logger.warning("GOOGLE_API_KEY environment variable not set. PDF processing with LLM will not be available.")
42
+ return None
43
+
44
+ return genai.Client(api_key=api_key)
45
+ except ImportError:
46
+ logger.warning("google-generativeai package not installed. PDF processing with LLM will not be available.")
47
+ return None
48
+ except Exception as e:
49
+ logger.error(f"Error setting up Google client: {str(e)}")
50
+ return None
51
+
52
+ def save_to_json(invoice_data, input_file_path: str) -> str:
53
+ """
54
+ Save the invoice data to a JSON file in the 'result' directory.
55
+
56
+ Args:
57
+ invoice_data: The invoice data to save (can be a dictionary or an object)
58
+ input_file_path: The path to the input file
59
+
60
+ Returns:
61
+ The path to the saved JSON file
62
+ """
63
+ # Create result directory if it doesn't exist
64
+ result_dir = "result"
65
+ os.makedirs(result_dir, exist_ok=True)
66
+
67
+ # Get the base filename without extension
68
+ base_filename = os.path.splitext(os.path.basename(input_file_path))[0]
69
+
70
+ # Create the output JSON file path
71
+ output_file_path = os.path.join(result_dir, f"{base_filename}.json")
72
+
73
+ # Convert invoice data to JSON-serializable format
74
+ # Check if invoice_data is a dictionary or an object
75
+ if isinstance(invoice_data, dict):
76
+ # It's already a dictionary, just ensure items are serializable
77
+ json_data = invoice_data
78
+ else:
79
+ # It's an object, convert to dictionary
80
+ json_data = {
81
+ "headers": invoice_data.headers if hasattr(invoice_data, 'headers') else [],
82
+ "items": [item.model_dump() if hasattr(item, 'model_dump') else item.dict()
83
+ for item in invoice_data.items]
84
+ }
85
+
86
+ # Write to JSON file
87
+ with open(output_file_path, 'w', encoding='utf-8') as f:
88
+ json.dump(json_data, f, indent=2, ensure_ascii=False)
89
+
90
+ logger.info(f"Saved invoice data to {output_file_path}")
91
+ return output_file_path
92
+
93
+ def process_file(file_path: str) -> Optional[str]:
94
+ """
95
+ Process an invoice file (PDF, Excel, or Document), print the extracted data, and return the path to the saved JSON file.
96
+
97
+ Args:
98
+ file_path: Path to the invoice file
99
+ """
100
+ file_path = os.path.abspath(file_path)
101
+ if not os.path.exists(file_path):
102
+ logger.error(f"File not found: {file_path}")
103
+ return
104
+
105
+ file_ext = os.path.splitext(file_path)[1].lower()
106
+
107
+ llm_client = setup_google_client()
108
+
109
+ temp_pdf_path = tempfile.NamedTemporaryFile(delete=False, suffix='.pdf').name
110
+
111
+
112
+ if file_ext in ['.xlsx', '.xls']:
113
+ # Process Excel file
114
+ # For .xls files, convert to .xlsx format first
115
+ if file_ext == '.xls':
116
+ xlsx_path = convert_xls_to_xlsx(file_path)
117
+ file_path = xlsx_path
118
+
119
+ # Create output JSON path
120
+ output_json_path = os.path.join("result", f"{os.path.splitext(os.path.basename(file_path))[0]}.json")
121
+
122
+ result = process_excel_file(
123
+ file_path=file_path,
124
+ output_path=output_json_path,
125
+ chunk_size=20,
126
+ max_workers=2
127
+ )
128
+
129
+ # Create the expected invoice_data format
130
+ invoice_data = {
131
+ "headers": ["Product Name", "Batch Number", "Expiry Date", "MRP", "Quantity"],
132
+ "items": result["items"]
133
+ }
134
+
135
+
136
+ elif file_ext == '.pdf':
137
+
138
+ try:
139
+ logger.info(f"Processing PDF file with header context: {file_path}")
140
+
141
+ # Process the PDF using process_pdf_with_headers
142
+ invoice_data_obj = process_pdf_with_headers(file_path)
143
+
144
+ # Convert the InvoiceData object to the format expected by the rest of the code
145
+ invoice_data = {
146
+ "headers": invoice_data_obj.headers,
147
+ "items": [item.model_dump() if hasattr(item, 'model_dump') else item.dict() for item in invoice_data_obj.items]
148
+ }
149
+
150
+ except Exception as e:
151
+ logger.error(f"Error processing PDF with headers: {str(e)}")
152
+
153
+ elif file_ext in ['.doc', '.docx', '.txt']:
154
+ # Process Document file by first converting to PDF
155
+ # Ensure the required modules are imported
156
+ if file_ext == '.txt':
157
+ temp_pdf_path = txt_to_pdf(file_path, temp_pdf_path)
158
+ logger.info(f"Converted text file to PDF: {temp_pdf_path}")
159
+ elif file_ext in ['.doc', '.docx']:
160
+ temp_pdf_path = docx_to_pdf(file_path, temp_pdf_path)
161
+ logger.info(f"Converted document file to PDF: {temp_pdf_path}")
162
+
163
+ invoice_data_obj = process_pdf_with_headers(temp_pdf_path)
164
+
165
+ # Convert the InvoiceData object to the format expected by the rest of the code
166
+ invoice_data = {
167
+ "headers": invoice_data_obj.headers,
168
+ "items": [item.model_dump() if hasattr(item, 'model_dump') else item.dict() for item in invoice_data_obj.items]
169
+ }
170
+
171
+ else:
172
+ logger.error(f"Unsupported file format: {file_ext}")
173
+ logger.error("Supported formats: .pdf, .xlsx, .xls, .doc, .docx, .txt")
174
+ return
175
+
176
+ json_path = save_to_json(invoice_data, file_path)
177
+ print(f"Results saved to: {json_path}")
178
+
179
+ # Print results
180
+ if isinstance(invoice_data, dict):
181
+ # It's a dictionary
182
+ items_count = len(invoice_data.get('items', []))
183
+ items = invoice_data.get('items', [])
184
+ print(f"\nExtracted {items_count} items from {file_path}:")
185
+ for i, item in enumerate(items, 1):
186
+ print(f"\nItem {i}:")
187
+ print(f" Product: {item.get('product_name', 'N/A')}")
188
+ print(f" Batch Number: {item.get('batch_number', 'N/A')}")
189
+ print(f" Expiry: {item.get('expiry_date', 'N/A')}")
190
+ print(f" MRP: {item.get('mrp', 'N/A')}")
191
+ print(f" Quantity: {item.get('quantity', 'N/A')}")
192
+ else:
193
+ # It's an object (likely a Pydantic model)
194
+ items_count = len(invoice_data.items) if hasattr(invoice_data, 'items') else 0
195
+ print(f"\nExtracted {items_count} items from {file_path}:")
196
+ for i, item in enumerate(invoice_data.items if hasattr(invoice_data, 'items') else [], 1):
197
+ print(f"\nItem {i}:")
198
+ print(f" Product: {getattr(item, 'product_name', 'N/A')}")
199
+ print(f" Batch Number: {getattr(item, 'batch_number', 'N/A')}")
200
+ print(f" Expiry: {getattr(item, 'expiry_date', 'N/A')}")
201
+ print(f" MRP: {getattr(item, 'mrp', 'N/A')}")
202
+ print(f" Quantity: {getattr(item, 'quantity', 'N/A')}")
203
+ return json_path
204
+
205
+ def main():
206
+ """Main function to parse arguments and process files."""
207
+ parser = argparse.ArgumentParser(description="Process invoice files (PDF, Excel, Word, or Text)")
208
+ parser.add_argument("--file_path", help="Path to the invoice file")
209
+
210
+ args = parser.parse_args()
211
+
212
+ try:
213
+ process_file(args.file_path)
214
+ except Exception as e:
215
+ logger.error(f"Error processing file: {str(e)}")
216
+ sys.exit(1)
217
+
218
+ if __name__ == "__main__":
219
+ main()
requirements.txt ADDED
@@ -0,0 +1,30 @@
+ # Core dependencies
+ python-dotenv
+ pandas
+ numpy
+ pydantic
+
+ # PDF processing
+ pdf2image
+ PyMuPDF
+ pytesseract
+ Pillow
+
+ # Document processing
+ python-docx
+ reportlab
+ aspose-words
+
+ # Excel processing
+ openpyxl
+ xlrd
+ pyexcel
+ pyexcel-xls
+ pyexcel-xlsx
+
+ # LLM integration
+ google-genai
+
+ # Web UI
+ gradio
+ gradio_pdf
src/__init__.py ADDED
@@ -0,0 +1,4 @@
+ # Import functions to make them available at the package level
+ from .excel_to_pdf import excel_to_pdf, convert_xls_to_xlsx
+ from .docx_to_pdf import docx_to_pdf
+ from .txt_to_pdf import txt_to_pdf
src/docx_to_pdf.py ADDED
@@ -0,0 +1,35 @@
+ import os
+ import subprocess
+ import aspose.words as aw
+
+ def docx_to_pdf(input_file, output_file=None):
+     """Convert a .doc/.docx file to PDF using headless LibreOffice."""
+     input_path = os.path.abspath(input_file)
+     output_dir = os.path.dirname(input_path)  # LibreOffice writes the PDF next to the input file
+
+     # Run LibreOffice command
+     command = ["libreoffice", "--headless", "--convert-to", "pdf", input_path, "--outdir", output_dir]
+     subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+
+     # LibreOffice names the output <input name>.pdf inside output_dir
+     generated = os.path.join(output_dir, os.path.splitext(os.path.basename(input_file))[0] + ".pdf")
+     if output_file is None:
+         output_file = generated
+     elif os.path.exists(generated) and os.path.abspath(output_file) != os.path.abspath(generated):
+         # Move the converted file to the caller-supplied path so the returned path actually exists
+         os.replace(generated, output_file)
+     return output_file
+
+
+ def docx_to_pdf_(input_file, output_file=None):
+     """Alternative converter using Aspose.Words (no LibreOffice required)."""
+     input_path = os.path.abspath(input_file)
+     output_dir = os.path.dirname(input_path)  # Save in the same directory
+
+     # Load the .doc/.docx file
+     doc = aw.Document(input_path)
+
+     if output_file is None:
+         output_file = os.path.join(output_dir, os.path.splitext(os.path.basename(input_file))[0] + ".pdf")
+
+     doc.save(output_file)
+     return output_file
src/excel_to_pdf.py ADDED
@@ -0,0 +1,246 @@
1
+
2
+
3
+ import os
4
+ import math
5
+ from openpyxl import load_workbook
6
+ from reportlab.lib import colors
7
+ from reportlab.lib.pagesizes import letter, A4, A3, landscape, portrait
8
+ from reportlab.platypus import SimpleDocTemplate, Table, TableStyle, Paragraph, Spacer, PageBreak
9
+ from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
10
+ from reportlab.lib.enums import TA_LEFT, TA_CENTER
11
+ from reportlab.lib.units import inch
12
+ import pyexcel as p
13
+
14
+
15
+ def convert_xls_to_xlsx(xls_path, xlsx_path=None):
16
+ """Convert the old .xls file to .xlsx format"""
17
+ if xlsx_path is None:
18
+ xlsx_path = os.path.splitext(xls_path)[0] + '.xlsx'
19
+ p.save_book_as(file_name=xls_path, dest_file_name=xlsx_path)
20
+ return xlsx_path
21
+
22
+
23
+ def determine_page_format(num_columns, max_column_width=None):
24
+ """
25
+ Determine the optimal page size and orientation based on table dimensions.
26
+
27
+ Args:
28
+ num_columns (int): Number of columns in the table.
29
+ max_column_width (float, optional): Maximum column width if available.
30
+
31
+ Returns:
32
+ tuple: (pagesize, orientation function)
33
+ """
34
+ # Define thresholds for decision making
35
+ if num_columns <= 5:
36
+ # Few columns, likely to fit on portrait A4
37
+ return A4, portrait
38
+ elif num_columns <= 8:
39
+ # Medium number of columns, use landscape A4
40
+ return A4, landscape
41
+ elif num_columns <= 12:
42
+ # Many columns, use portrait A3
43
+ return A3, portrait
44
+ else:
45
+ # Lots of columns, use landscape A3
46
+ return A3, landscape
47
+
48
+
49
+ def is_effectively_empty(value):
50
+ """
51
+ Return True if the cell value is considered empty.
52
+
53
+ Empty means:
54
+ - The value is None.
55
+ - The value is a float and math.isnan(value) is True.
56
+ - The value is a string that is empty (after stripping whitespace).
57
+ """
58
+ if value is None:
59
+ return True
60
+ if isinstance(value, float) and math.isnan(value):
61
+ return True
62
+ if isinstance(value, str) and not value.strip():
63
+ return True
64
+ return False
65
+
66
+
67
+ def excel_to_pdf(excel_path, pdf_path=None, sheet_name=None, max_rows_per_table=50):
68
+ """
69
+ Convert Excel file to PDF with adaptive page size based on content,
70
+ removing columns that contain only NaN (or empty) values.
71
+
72
+ Args:
73
+ excel_path (str): Path to the Excel file.
74
+ pdf_path (str, optional): Path for the output PDF file.
75
+ sheet_name (str, optional): Name of the sheet to convert.
76
+ max_rows_per_table (int): Maximum rows per table before splitting.
77
+
78
+ Returns:
79
+ str: Path to the created PDF file.
80
+ """
81
+ if excel_path.endswith('.xls'):
82
+ excel_path = convert_xls_to_xlsx(excel_path)
83
+
84
+ if pdf_path is None:
85
+ pdf_path = os.path.splitext(excel_path)[0] + '.pdf'
86
+
87
+ # Load Excel file
88
+ wb = load_workbook(excel_path)
89
+ sheets = [sheet_name] if sheet_name else wb.sheetnames
90
+
91
+ # Create paragraph styles for cell content
92
+ styles = getSampleStyleSheet()
93
+ header_style = ParagraphStyle(
94
+ name='HeaderStyle',
95
+ parent=styles['Normal'],
96
+ fontName='Helvetica-Bold',
97
+ fontSize=9,
98
+ alignment=TA_CENTER,
99
+ textColor=colors.white,
100
+ leading=12
101
+ )
102
+ cell_style = ParagraphStyle(
103
+ name='CellStyle',
104
+ parent=styles['Normal'],
105
+ fontName='Helvetica',
106
+ fontSize=8,
107
+ alignment=TA_LEFT,
108
+ leading=10 # Line spacing
109
+ )
110
+
111
+ elements = []
112
+
113
+ # Determine the effective maximum number of columns among all sheets (after filtering out empty ones)
114
+ global_effective_max_columns = 0
115
+ for sh in sheets:
116
+ sheet = wb[sh]
117
+ effective_cols = 0
118
+ for col in range(1, sheet.max_column + 1):
119
+ # Check if any cell in the column is non-empty
120
+ for row in range(1, sheet.max_row + 1):
121
+ if not is_effectively_empty(sheet.cell(row=row, column=col).value):
122
+ effective_cols += 1
123
+ break
124
+ global_effective_max_columns = max(global_effective_max_columns, effective_cols)
125
+
126
+ # Determine optimal page format based on effective column count
127
+ pagesize, orientation_func = determine_page_format(global_effective_max_columns)
128
+
129
+ # Create the document with determined format
130
+ doc = SimpleDocTemplate(
131
+ pdf_path,
132
+ pagesize=orientation_func(pagesize),
133
+ leftMargin=10,
134
+ rightMargin=10,
135
+ topMargin=15,
136
+ bottomMargin=15
137
+ )
138
+
139
+ # Process each sheet
140
+ for sheet_idx, current_sheet in enumerate(sheets):
141
+ sheet = wb[current_sheet]
142
+
143
+ # Determine which columns to keep (those with at least one non-empty cell)
144
+ columns_to_keep = []
145
+ for col in range(1, sheet.max_column + 1):
146
+ for row in range(1, sheet.max_row + 1):
147
+ if not is_effectively_empty(sheet.cell(row=row, column=col).value):
148
+ columns_to_keep.append(col)
149
+ break
150
+
151
+ # If no columns have valid data, skip this sheet.
152
+ if not columns_to_keep:
153
+ continue
154
+
155
+ # Calculate appropriate column widths (only for kept columns)
156
+ max_col_width = 130 # Maximum column width in points
157
+ min_col_width = 40 # Minimum column width in points
158
+ if pagesize == A3:
159
+ max_col_width = 150 # Allow wider columns on A3
160
+
161
+ col_widths = []
162
+ for col in columns_to_keep:
163
+ max_length = 0
164
+ # Sample first 100 rows for efficiency
165
+ for row in range(1, min(100, sheet.max_row) + 1):
166
+ cell = sheet.cell(row=row, column=col)
167
+ if cell.value:
168
+ content_length = len(str(cell.value))
169
+ # Cap the length for width calculation at 30 characters
170
+ max_length = max(max_length, min(content_length, 30))
171
+ # Adjust multiplier based on page format (narrower columns for A4, wider for A3)
172
+ multiplier = 5.5 if pagesize == A4 else 6.0
173
+ width = min(max(min_col_width, max_length * multiplier), max_col_width)
174
+ col_widths.append(width)
175
+
176
+ # Build the header row from the kept columns
177
+ header_row = []
178
+ # Using row 1 as header (or adjust if your header is in another row)
179
+ for col in columns_to_keep:
180
+ cell_value = sheet.cell(row=1, column=col).value
181
+ header_row.append(Paragraph(str(cell_value or ""), header_style))
182
+
183
+ # Process data rows in chunks to avoid huge tables that might get chopped
184
+ row_count = sheet.max_row
185
+ # Start after header row
186
+ start_row = 2
187
+ while start_row <= row_count:
188
+ end_row = min(start_row + max_rows_per_table - 1, row_count)
189
+
190
+ # Create data for this chunk, starting with the header row
191
+ chunk_data = [header_row]
192
+ for row_idx in range(start_row, end_row + 1):
193
+ data_row = []
194
+ for col in columns_to_keep:
195
+ cell = sheet.cell(row=row_idx, column=col)
196
+ cell_value = cell.value or ""
197
+ data_row.append(Paragraph(str(cell_value), cell_style))
198
+ chunk_data.append(data_row)
199
+
200
+ # Create table for this chunk
201
+ table = Table(chunk_data, colWidths=col_widths, repeatRows=1)
202
+
203
+ # Style the table
204
+ table_style = TableStyle([
205
+ # Header styling
206
+ ('BACKGROUND', (0, 0), (-1, 0), colors.darkblue),
207
+ ('TEXTCOLOR', (0, 0), (-1, 0), colors.white),
208
+ ('ALIGN', (0, 0), (-1, 0), 'CENTER'),
209
+
210
+ # Grid
211
+ ('GRID', (0, 0), (-1, -1), 0.5, colors.grey),
212
+ ('VALIGN', (0, 0), (-1, -1), 'TOP'),
213
+
214
+ # Row background colors
215
+ ('ROWBACKGROUNDS', (0, 1), (-1, -1), [colors.white, colors.lightgrey]),
216
+
217
+ # Cell padding
218
+ ('LEFTPADDING', (0, 0), (-1, -1), 3),
219
+ ('RIGHTPADDING', (0, 0), (-1, -1), 3),
220
+ ('TOPPADDING', (0, 0), (-1, -1), 3),
221
+ ('BOTTOMPADDING', (0, 0), (-1, -1), 3)
222
+ ])
223
+
224
+ table.setStyle(table_style)
225
+ table.hAlign = 'LEFT'
226
+ table.spaceBefore = 5
227
+ table.spaceAfter = 15
228
+
229
+ elements.append(table)
230
+
231
+ # Uncomment below if you wish to add a continuation note when splitting tables
232
+ # if end_row < row_count:
233
+ # continuation = Paragraph(f"Table continues... (Rows {start_row}-{end_row} of {row_count})", styles['Italic'])
234
+ # elements.append(continuation)
235
+ # elements.append(Spacer(1, 0.2 * inch))
236
+
237
+ start_row = end_row + 1
238
+
239
+ # Add page break between sheets (except for the last sheet)
240
+ if sheet_idx < len(sheets) - 1:
241
+ elements.append(PageBreak())
242
+
243
+ # Build PDF
244
+ doc.build(elements)
245
+
246
+ return pdf_path
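The page-format heuristic above keys off the number of non-empty columns. A small illustrative check of `determine_page_format` (assumes the repository root is on `PYTHONPATH`):

```python
from reportlab.lib.pagesizes import A4, landscape, portrait

from src.excel_to_pdf import determine_page_format

for cols in (4, 7, 10, 15):
    size, orient = determine_page_format(cols)
    label = "A4" if size == A4 else "A3"
    mode = "portrait" if orient is portrait else "landscape"
    print(f"{cols} columns -> {label} {mode}")
# 4 -> A4 portrait, 7 -> A4 landscape, 10 -> A3 portrait, 15 -> A3 landscape
```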
src/txt_to_pdf.py ADDED
@@ -0,0 +1,43 @@
1
+ import os
2
+ from reportlab.lib.pagesizes import letter
3
+ from reportlab.pdfgen import canvas
4
+
5
+
6
+ def txt_to_pdf(input_txt, output_pdf=None):
7
+ if output_pdf is None:
8
+ output_pdf = os.path.splitext(input_txt)[0] + '.pdf'
9
+
10
+ # Read the text file without modifying spacing
11
+ with open(input_txt, "r", encoding="utf-8") as file:
12
+ lines = file.readlines()
13
+
14
+ c = canvas.Canvas(output_pdf, pagesize=letter)
15
+ width, height = letter
16
+ left_margin = 10
17
+ top_margin = 10
18
+ bottom_margin = 10
19
+ line_height = 10 # Adjust based on desired spacing
20
+
21
+ # Use a text object for more control
22
+ text_object = c.beginText(left_margin, height - top_margin)
23
+ text_object.setFont("Courier", 8) # Use a monospaced font to keep spacing intact
24
+
25
+ for line in lines:
26
+ # Remove the newline, preserving other whitespace
27
+ # And skip the line if it's empty (after stripping all whitespace)
28
+ if not line.strip():
29
+ continue
30
+ line = line.rstrip("\n")
31
+ text_object.textLine(line)
32
+
33
+ # Check if we have reached the bottom margin
34
+ if text_object.getY() < bottom_margin:
35
+ c.drawText(text_object)
36
+ c.showPage()
37
+ text_object = c.beginText(left_margin, height - top_margin)
38
+ text_object.setFont("Courier", 8)
39
+
40
+ # Draw any remaining text
41
+ c.drawText(text_object)
42
+ c.save()
43
+ return output_pdf