---
title: expiryprocess
app_file: gradio_app.py
sdk: gradio
sdk_version: 5.20.1
---
# Invoice Processing System with Gradio UI

This system processes invoice files (PDF, Excel, Word, Text) and extracts structured data using a combination of OCR, regex patterns, and LLM-based extraction. The extracted data can be downloaded as CSV.
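
As a rough illustration of the regex stage feeding the CSV export, the sketch below pulls three fields out of already-extracted invoice text and serializes them. The field names, patterns, and sample text are all assumptions made for this example and are not taken from the project's code.

```python
import csv
import io
import re

# Hypothetical field patterns; real invoices vary widely in layout.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"Total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return one dict per invoice; fields with no match are left empty."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else ""
    return fields


def to_csv(rows):
    """Serialize a list of field dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample text standing in for OCR output.
sample = "Invoice No: INV-1042\nDate: 2024-03-15\nTotal: $1,250.00"
print(to_csv([extract_fields(sample)]))
```

Fields that the patterns miss stay empty, which is where an LLM pass could plausibly fill the gaps.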

## Features

- **Multiple File Formats**: Supports PDF, Excel (.xlsx, .xls), Word (.doc, .docx), and Text (.txt) files
- **Document Conversion**: Automatically converts Word and Text files to PDF for processing
- **LLM-Enhanced Extraction**: Uses Google's Generative AI for improved extraction accuracy (optional)
- **Web Interface**: Easy-to-use Gradio UI for uploading files and downloading results
- **CSV Export**: Download extracted data as CSV for further analysis
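
One way to picture the format handling listed above is a lookup from file extension to processing route. This is only a sketch: the route names and the `route_for` helper are hypothetical, and the project may organize this differently.

```python
from pathlib import Path

# Extensions from the feature list, mapped to hypothetical route names.
ROUTES = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".doc": "convert_to_pdf",
    ".docx": "convert_to_pdf",
    ".txt": "convert_to_pdf",
}


def route_for(path):
    """Pick a processing route from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return ROUTES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


print(route_for("invoices/march.DOCX"))
```

Rejecting unknown extensions early gives the user a clear error before any OCR or conversion work starts.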

## Installation

1. Clone this repository:
   ```bash
   git clone <repository-url>
   cd invoice-processing-system
   ```

2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:
   - Create a `.env` file in the project root
   - Add your Google API key for LLM processing:
     ```
     GOOGLE_API_KEY=your_api_key_here
     ```
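
To check that the key is picked up, a minimal standard-library sketch like the following can help. It assumes a plain `KEY=VALUE` file format; the helper names `load_env_file` and `llm_available` are hypothetical, and the application itself may load the file differently (for example through a dotenv library).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read simple KEY=VALUE lines into os.environ without overriding existing values."""
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def llm_available():
    """LLM extraction is treated as available only when the key is non-empty."""
    return bool(os.environ.get("GOOGLE_API_KEY"))


load_env_file()
print("LLM extraction enabled" if llm_available() else "LLM extraction disabled (no GOOGLE_API_KEY)")
```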

## Usage

### Web Interface (Gradio UI)

1. Start the Gradio web interface:
   ```bash
   python gradio_app.py
   ```

2. Open your browser and navigate to the URL shown in the terminal (typically `http://127.0.0.1:7860`)

3. Upload an invoice file using the file upload button

4. Click "Process Invoice" to extract data from the file

5. View the extracted data in the table and download as CSV using the download button

### Command Line Interface

Invoices can also be processed from the command line:

```bash
# Process a file with default settings (using LLM if available)
python process_invoice.py path/to/invoice.pdf

# Process without using LLM
python process_invoice.py path/to/invoice.xlsx --no-llm

# Process without saving JSON output
python process_invoice.py path/to/invoice.docx --no-json
```
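
For readers curious how flags like these are usually wired up, here is a minimal `argparse` sketch. It mirrors the flag names shown above, but it is an assumption about the structure of `process_invoice.py`, not a copy of it.

```python
import argparse


def build_parser():
    """Build a parser with one positional path and two opt-out switches."""
    parser = argparse.ArgumentParser(
        description="Extract structured data from an invoice file."
    )
    parser.add_argument("path", help="Path to the invoice file")
    parser.add_argument("--no-llm", action="store_true", help="Skip LLM-based extraction")
    parser.add_argument("--no-json", action="store_true", help="Do not save JSON output")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["path/to/invoice.xlsx", "--no-llm"])
print(args.path, args.no_llm, args.no_json)
```

Both switches default to `False`, so the LLM and JSON output stay on unless explicitly disabled, matching the default behaviour described in the comments above.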

## Requirements

- Python 3.8+
- Google API key (optional; needed only for LLM-enhanced extraction)
- LibreOffice (for converting .doc/.docx files to PDF)
- Tesseract OCR (for PDF processing)

## Troubleshooting

- **LLM Processing Not Available**: Ensure your Google API key is correctly set in the `.env` file
- **PDF Conversion Issues**: Make sure LibreOffice is installed and accessible in your PATH
- **OCR Quality Issues**: Ensure Tesseract OCR is properly installed and configured
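
A quick way to rule out PATH problems is to look up the external tools directly. The sketch below uses `shutil.which`; the executable names are assumptions (LibreOffice is commonly exposed as `soffice` or `libreoffice`, Tesseract as `tesseract`) and may differ on your system.

```python
import shutil

# Assumed executable names; adjust for your platform if needed.
TOOLS = {
    "LibreOffice": ("soffice", "libreoffice"),
    "Tesseract OCR": ("tesseract",),
}


def find_tools():
    """Map each tool to the first matching executable path, or None if absent."""
    found = {}
    for name, candidates in TOOLS.items():
        found[name] = next((p for p in map(shutil.which, candidates) if p), None)
    return found


for name, location in find_tools().items():
    print(f"{name}: {location or 'not found on PATH'}")
```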

## License

[MIT License](LICENSE)