---
title: OCR + LLM
emoji: 🔎
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
short_description: Technical Assessment
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# OCR LLM Classifier

This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model for classifying the extracted text as "Spam" or "Not Spam."

## Features
- Extract text from images using OCR.
- Classify extracted text as either "Spam" or "Not Spam."

## How It Works
1. **OCR**: The app uses one of the three OCR methods to extract text from the uploaded image:
   - **PaddleOCR**
   - **EasyOCR**
   - **KerasOCR**
   
2. **Classification**: The extracted text is passed to a pre-trained DistilBERT model that classifies the text as either "Spam" or "Not Spam."
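
The two-stage flow above can be sketched as a small dispatch function. This is a minimal illustration only: `fake_ocr`, `fake_classifier`, `OCR_BACKENDS`, and `run_pipeline` are hypothetical names, and the stubs stand in for the real OCR engines and the DistilBERT model.

```python
from typing import Callable, Dict, Tuple


def fake_ocr(image_path: str) -> str:
    """Stand-in OCR backend (assumption): returns fixed text for any image."""
    return "WIN a FREE prize now"


# Map each method name to a callable that turns an image path into text.
# The real PaddleOCR, EasyOCR, and KerasOCR wrappers would replace the stubs.
OCR_BACKENDS: Dict[str, Callable[[str], str]] = {
    "PaddleOCR": fake_ocr,
    "EasyOCR": fake_ocr,
    "KerasOCR": fake_ocr,
}


def fake_classifier(text: str) -> str:
    """Placeholder keyword rule, NOT the DistilBERT model."""
    return "Spam" if "free" in text.lower() else "Not Spam"


def run_pipeline(
    method: str,
    image_path: str,
    classify: Callable[[str], str] = fake_classifier,
) -> Tuple[str, str]:
    """Run OCR with the chosen method, then classify the extracted text."""
    if method not in OCR_BACKENDS:
        raise ValueError(f"Unknown OCR method: {method}")
    text = OCR_BACKENDS[method](image_path)
    return text, classify(text)


print(run_pipeline("PaddleOCR", "sample.png"))
```

Keeping the OCR backends behind a single lookup table means a new engine can be added without touching the classification step.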


## Installation

To get started with this project, follow these steps:

### 1. Clone the Repository
```bash
git clone https://github.com/yourusername/ocr-llm-test.git
cd ocr-llm-test
```

### 2. Install Dependencies
You can install the required dependencies using pip:

```bash
pip install -r requirements.txt
```

### 3. Run the App
To run the Gradio interface locally, execute:

```bash
python app.py
```

Once the app is running, it will be accessible through your web browser at [http://localhost:7860](http://localhost:7860).

## API Documentation

### 1. API Endpoint

The main endpoint for this API is `/predict`.

### 2. API Call Example

#### Install the Python Client
If you don't already have it installed, run the following command:

```bash
pip install gradio_client
```

#### Make an API Call

```python
from gradio_client import Client, handle_file

client = Client("winamnd/ocr-llm-test")
result = client.predict(
    method="PaddleOCR",
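    # Use /resolve/ (raw file) rather than /blob/ (HTML page) so the client downloads the image itself.
    img=handle_file('https://huggingface.co/spaces/winamnd/ocr-llm-test/resolve/main/sample_images/sample2.png'),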
    api_name="/predict"
)
print(result)
```

### 3. Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `method` | `Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR']` | OCR method used for text extraction. Defaults to `"PaddleOCR"`. |
| `img` | `dict` | The input image, supplied as a URL, a local path, or a base64-encoded image. With the Python client, wrap a path or URL in `handle_file`. |

#### Image Input Details
The `img` dictionary contains the following fields:

- **path**: Path to a local file.
- **url**: Publicly available URL for the image.
- **size**: The size of the image (in bytes).
- **orig_name**: Original filename.
- **mime_type**: MIME type of the image.
- **is_stream**: Always set to False.
- **meta**: Metadata.

### 4. Returns
The API returns a tuple with two elements:

- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam").

---

# Chosen LLM and Justification

I have chosen **DistilBERT** as the foundational LLM for text classification because it is lightweight, efficient, and performs well on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language-understanding performance. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.

Reference: [Sanh et al., 2019, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter"](https://arxiv.org/pdf/1910.01108)


## Steps for Fine-Tuning or Prompt Engineering

### Data Preparation:
- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split data into training and validation sets.
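
As a rough illustration of the cleaning and splitting steps, the sketch below uses only the standard library. The helper names `clean` and `stratified_split`, the 80/20 ratio, and the toy samples are assumptions for the example; tokenization and padding are typically left to the model's tokenizer and are not shown here.

```python
import random
from collections import defaultdict


def clean(text):
    """Collapse whitespace and lowercase; a deliberately light cleaning step."""
    return " ".join(text.split()).lower()


def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split (text, label) pairs into train/validation sets, keeping label ratios."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((clean(text), label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_val = max(1, round(len(items) * val_fraction))
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Toy data (made up for illustration): 5 spam and 5 non-spam messages.
samples = [(f"Spam message {i}", "Spam") for i in range(5)]
samples += [(f"Normal message {i}", "Not Spam") for i in range(5)]

train_set, val_set = stratified_split(samples)
print(len(train_set), len(val_set))
```

Splitting per label keeps the spam ratio the same in both sets, which matters when spam is much rarer than legitimate text.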

### Fine-Tuning DistilBERT:
1. Load the pre-trained DistilBERT model.
2. Add a classification head (a fully connected layer) on top of DistilBERT for binary classification.
3. Apply transfer learning by fine-tuning the model on the spam dataset.
4. Train with cross-entropy loss, optimized with AdamW.
5. Evaluate performance using precision, recall, and F1-score.
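
The evaluation step can be illustrated without any ML dependencies. The sketch below computes precision, recall, and F1 for the "Spam" class from lists of true and predicted labels; `binary_metrics` is a hypothetical helper and the labels are made up for the example, not results from this project.

```python
def binary_metrics(y_true, y_pred, positive="Spam"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative labels only.
y_true = ["Spam", "Spam", "Spam", "Not Spam", "Not Spam"]
y_pred = ["Spam", "Spam", "Not Spam", "Spam", "Not Spam"]
print(binary_metrics(y_true, y_pred))
```

For spam filtering, precision reflects how often a "Spam" label is correct, while recall reflects how much spam is actually caught, so both are worth tracking alongside F1.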


## Integration with OCR Output

- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The **Gradio** UI updates in real time to display the extracted text along with the classification label.
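
One way the storage step might look is sketched below. The record layout (`text` and `label` fields) and the `append_result` helper are assumptions for illustration; the actual schema of `ocr_results.json` and `ocr_results.csv` is not specified here. The demo writes to a temporary directory so it does not touch real result files.

```python
import csv
import json
import os
import tempfile


def append_result(text, label, json_path="ocr_results.json", csv_path="ocr_results.csv"):
    """Append one OCR + classification record to both the JSON and CSV files."""
    record = {"text": text, "label": label}

    # JSON: load the existing list (if any), append, and rewrite the file.
    records = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            records = json.load(f)
    records.append(record)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # CSV: append a row, writing the header only when the file is new.
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        if is_new:
            writer.writeheader()
        writer.writerow(record)
    return record


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    jp = os.path.join(tmp, "ocr_results.json")
    cp = os.path.join(tmp, "ocr_results.csv")
    append_result("WIN a FREE prize", "Spam", jp, cp)
    append_result("Meeting at 10am", "Not Spam", jp, cp)
    with open(jp, encoding="utf-8") as f:
        saved = json.load(f)
    with open(cp, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

print(len(saved), len(rows))
```

Rewriting the whole JSON file on every call is simple but scales poorly; for larger volumes, a line-delimited format or a small database would likely be a better fit.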


## Security and Evaluation Strategies

### Security Measures:
- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed.
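
Two of these measures can be sketched with the standard library. The `sanitize_text` helper and `SlidingWindowLimiter` class below are hypothetical, and the limits (5,000 characters, a per-client request window) are assumed values. Stripping control characters and capping length is only one small part of input sanitization, not a complete defense against injection.

```python
import time
from collections import defaultdict, deque


def sanitize_text(text, max_len=5000):
    """Drop non-printable characters (keeping newlines and tabs) and cap the length."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return cleaned[:max_len]


class SlidingWindowLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits = defaultdict(deque)

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Discard timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
print([limiter.allow("client-a") for _ in range(3)])
```

In a deployed app, `client_id` would typically be derived from the request (for example an IP address or API key), and the limiter state would need shared storage if more than one process serves requests.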

### Evaluation Strategies:
- Perform cross-validation to assess model robustness.
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms for users to report misclassifications and improve the model.
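
For the cross-validation point, the index bookkeeping can be sketched as follows. `k_fold_indices` is a hypothetical helper, and the choice of 5 folds and 10 samples is an assumption for the example; the model training and scoring that would run inside the loop are omitted.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


splits = list(k_fold_indices(10, k=5))
for train_idx, val_idx in splits:
    # A real run would fine-tune on train_idx and score on val_idx here.
    print(len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so averaging the per-fold metrics gives a more stable estimate than a single train/validation split.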