metadata
title: expiryprocess
app_file: gradio_app.py
sdk: gradio
sdk_version: 5.20.1

Invoice Processing System with Gradio UI

This system processes invoice files (PDF, Excel, Word, Text) and extracts structured data using a combination of OCR, regex patterns, and LLM-based extraction. The extracted data can be downloaded as CSV.
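As a rough illustration of the regex stage mentioned above, the sketch below pulls two fields out of plain text. The field names and patterns are assumptions made for this example; the patterns actually used by the project are not shown in this README.

```python
import re

# Hypothetical patterns for illustration only; not taken from this repository.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"\bTotal\s*:?\s*\$?\s*([0-9][0-9,]*\.?[0-9]*)", re.IGNORECASE
    ),
}


def extract_fields(text: str) -> dict:
    """Return the first match for each pattern, or None when a field is absent."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

Fields that come back as `None` are the natural candidates to hand to the optional LLM step.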

Features

  • Multiple File Formats: Supports PDF, Excel (.xlsx, .xls), Word (.doc, .docx), and Text (.txt) files
  • Document Conversion: Automatically converts Word and Text files to PDF for processing
  • LLM-Enhanced Extraction: Uses Google's Generative AI for improved extraction accuracy (optional)
  • Web Interface: Easy-to-use Gradio UI for uploading files and downloading results
  • CSV Export: Download extracted data as CSV for further analysis
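One way to picture the format handling listed above is a lookup from file extension to processing route. This is a minimal sketch, not the project's code: the route labels (`pdf`, `excel`, `convert_to_pdf`) and the function name `classify_file` are hypothetical.

```python
from pathlib import Path

# Extensions come from the Features list; the route labels are hypothetical.
ROUTES = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".doc": "convert_to_pdf",
    ".docx": "convert_to_pdf",
    ".txt": "convert_to_pdf",
}


def classify_file(path: str) -> str:
    """Return the processing route for a file based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return ROUTES[suffix]
```

Rejecting unknown extensions up front keeps unsupported uploads from reaching the OCR or conversion steps.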

Installation

  1. Clone this repository:

    git clone <repository-url>
    cd invoice-processing-system
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up environment variables:

    • Create a .env file in the project root
    • Add your Google API key for LLM processing:
      GOOGLE_API_KEY=your_api_key_here
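To check that the key is picked up, something like the following standard-library sketch may help. It assumes a simple `KEY=VALUE` file format; the helper names `read_env_file` and `llm_available` are hypothetical, and the project itself may load the file differently (for example with a dotenv library).

```python
import os
from pathlib import Path


def read_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def llm_available(env_path: str = ".env") -> bool:
    """True when GOOGLE_API_KEY is set to something other than the placeholder."""
    key = os.environ.get("GOOGLE_API_KEY") or read_env_file(env_path).get(
        "GOOGLE_API_KEY", ""
    )
    return bool(key) and key != "your_api_key_here"
```

Running `llm_available()` from the project root returns `False` if the file is missing or still contains the placeholder value.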
      

Usage

Web Interface (Gradio UI)

  1. Start the Gradio web interface:

    python gradio_app.py
    
  2. Open your browser and navigate to the URL shown in the terminal (typically http://127.0.0.1:7860)

  3. Upload an invoice file using the file upload button

  4. Click "Process Invoice" to extract data from the file

  5. View the extracted data in the table and download as CSV using the download button
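The CSV offered for download in step 5 could be produced along these lines. This is a sketch using only the standard library; the function name `rows_to_csv` and the example field names are assumptions, since the README does not list the extracted columns.

```python
import csv
import io


def rows_to_csv(rows: list) -> str:
    """Serialize a list of dicts to CSV text, using the union of keys as header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys are written as empty cells
    return buffer.getvalue()
```

Collecting the header from every row means a field found in only some invoices still gets its own column.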

Command Line Interface

Invoices can also be processed from the command line:

    # Process a file with default settings (using LLM if available)
    python process_invoice.py path/to/invoice.pdf

    # Process without using LLM
    python process_invoice.py path/to/invoice.xlsx --no-llm

    # Process without saving JSON output
    python process_invoice.py path/to/invoice.docx --no-json
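For readers who want to see how flags like these are typically wired up, here is a minimal `argparse` sketch. Only the flag names `--no-llm` and `--no-json` come from the commands above; the parser layout, help text, and the function name `build_parser` are assumptions and may differ from `process_invoice.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser accepting one file path plus the two opt-out flags."""
    parser = argparse.ArgumentParser(
        description="Extract structured data from an invoice file."
    )
    parser.add_argument("path", help="Invoice file to process")
    parser.add_argument(
        "--no-llm", action="store_true", help="Skip LLM-based extraction"
    )
    parser.add_argument(
        "--no-json", action="store_true", help="Do not save JSON output"
    )
    return parser
```

With `store_true`, both flags default to `False`, so the LLM and JSON steps stay enabled unless explicitly switched off.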

Requirements

  • Python 3.8+
  • Google API key (optional; only needed for LLM-enhanced extraction)
  • LibreOffice (for converting .doc/.docx files to PDF)
  • Tesseract OCR (for PDF processing)
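The LibreOffice requirement exists because Word files are converted to PDF before processing. A headless conversion is commonly invoked as shown in the sketch below. The helper names are hypothetical, and the executable may be called `soffice` or `libreoffice` depending on the platform; how this project actually calls LibreOffice is not shown in this README.

```python
from pathlib import Path


def libreoffice_pdf_command(source: str, out_dir: str, binary: str = "soffice") -> list:
    """Build the argument list for a headless LibreOffice conversion to PDF."""
    return [binary, "--headless", "--convert-to", "pdf", "--outdir", out_dir, source]


def expected_pdf_path(source: str, out_dir: str) -> Path:
    """LibreOffice writes <stem>.pdf into the output directory."""
    return Path(out_dir) / (Path(source).stem + ".pdf")
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building it separately makes the command easy to log when a conversion fails.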

Troubleshooting

  • LLM Processing Not Available: Ensure your Google API key is correctly set in the .env file
  • PDF Conversion Issues: Make sure LibreOffice is installed and accessible in your PATH
  • OCR Quality Issues: Ensure Tesseract OCR is properly installed and configured
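A quick way to rule out the PATH problems described above is to look the executables up with `shutil.which`. This is a diagnostic sketch: the candidate executable names are assumptions that vary by platform, and `check_tools` is a hypothetical helper, not part of this repository.

```python
import shutil

# Candidate executable names; which one exists depends on platform and installer.
TOOLS = {
    "LibreOffice": ("soffice", "libreoffice"),
    "Tesseract OCR": ("tesseract",),
}


def locate(names):
    """Return the full path of the first executable found on PATH, else None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_tools(tools=TOOLS) -> dict:
    """Map each tool label to its resolved path, or None when it is missing."""
    return {label: locate(names) for label, names in tools.items()}
```

Printing `check_tools()` shows at a glance which tool resolves to `None` and therefore needs installing or adding to PATH.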

License

MIT License