metadata

title: OCR + LLM
emoji: 🔎
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
short_description: Technical Assessment

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

OCR LLM Classifier

This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model for classifying the extracted text as "Spam" or "Not Spam."

Features

Extract text from images using OCR.
Classify extracted text as either "Spam" or "Not Spam."

How It Works

OCR: The app uses one of the three OCR methods to extract text from the uploaded image:
- PaddleOCR
- EasyOCR
- KerasOCR
Classification: The extracted text is passed to a pre-trained DistilBERT model that classifies the text as either "Spam" or "Not Spam."
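
The two stages above can be sketched as a small pipeline. This is only an illustration: `run_pipeline`, `OcrFn`, and `ClassifyFn` are hypothetical names that do not come from the app, and the stand-in functions below take the place of the real OCR engines and the DistilBERT model so that the sketch runs without extra dependencies.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type aliases: an OCR backend maps an image path to text,
# and a classifier maps text to a label.
OcrFn = Callable[[str], str]
ClassifyFn = Callable[[str], str]


def run_pipeline(
    image_path: str,
    method: str,
    ocr_backends: Dict[str, OcrFn],
    classify: ClassifyFn,
) -> Tuple[str, str]:
    """Extract text with the chosen OCR backend, then classify it."""
    if method not in ocr_backends:
        raise ValueError(f"Unknown OCR method: {method!r}")
    text = ocr_backends[method](image_path).strip()
    label = classify(text)
    return text, label


# Stand-ins (assumptions) so the sketch runs without OCR or ML libraries.
def fake_ocr(image_path: str) -> str:
    return "  WIN A FREE PRIZE NOW  "


def fake_classifier(text: str) -> str:
    return "Spam" if "free" in text.lower() else "Not Spam"


backends = {"PaddleOCR": fake_ocr, "EasyOCR": fake_ocr, "KerasOCR": fake_ocr}
result = run_pipeline("sample.png", "PaddleOCR", backends, fake_classifier)
print(result)
```

Passing the backends in as a dictionary keeps the method selection in one place, so supporting another OCR engine would only mean registering one more entry.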

Installation

To get started with this project, follow these steps:

1. Clone the Repository

git clone https://github.com/yourusername/ocr-llm-test.git
cd ocr-llm-test

2. Install Dependencies

You can install the required dependencies using pip:

pip install -r requirements.txt

3. Run the App

To run the Gradio interface locally, execute:

python app.py

Once the app is running, it will be accessible through your web browser at http://localhost:7860.

API Documentation

1. API Endpoint

The main endpoint for this API is /predict.

2. API Call Example

Install the Python Client

If you don't already have it installed, run the following command:

pip install gradio_client

Make an API Call

from gradio_client import Client, handle_file

client = Client("winamnd/ocr-llm-test")
result = client.predict(
    method="PaddleOCR",
    img=handle_file('https://huggingface.co/spaces/winamnd/ocr-llm-test/resolve/main/sample_images/sample2.png'),
    api_name="/predict"
)
print(result)

3. Parameters

| Parameter | Type | Description |
|---|---|---|
| `method` | `Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR']` | Choose the OCR method to be used for text extraction. Default is "PaddleOCR". |
| `img` | `dict` | The image input, which can be provided as a URL, path, or base64-encoded image. |

Image Input Details

path: Path to a local file.
url: Publicly available URL for the image.
size: The size of the image (in bytes).
orig_name: Original filename.
mime_type: MIME type of the image.
is_stream: Always set to False.
meta: Metadata.
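
To make the shape of this dictionary concrete, here is a minimal sketch that fills in the fields listed above for a local file. `ImageInput` and `describe_local_image` are hypothetical names used only for illustration, and leaving `meta` empty is an assumption. In practice `handle_file` from `gradio_client` prepares the image input for you, so this is not something you would normally write by hand.

```python
import mimetypes
import os
from typing import Any, Dict, Optional, TypedDict


class ImageInput(TypedDict):
    """Hypothetical type mirroring the fields listed above."""

    path: Optional[str]
    url: Optional[str]
    size: Optional[int]
    orig_name: Optional[str]
    mime_type: Optional[str]
    is_stream: bool
    meta: Dict[str, Any]


def describe_local_image(path: str) -> ImageInput:
    """Build the image dictionary for a file on disk."""
    # Size is only known if the file actually exists.
    size = os.path.getsize(path) if os.path.exists(path) else None
    # Guess the MIME type from the file extension.
    mime_type, _ = mimetypes.guess_type(path)
    return ImageInput(
        path=path,
        url=None,
        size=size,
        orig_name=os.path.basename(path),
        mime_type=mime_type,
        is_stream=False,  # the field list above says this is always False
        meta={},  # assumption: no extra metadata
    )


info = describe_local_image("sample_images/sample2.png")
print(info)
```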

4. Returns

The API returns a tuple with two elements:

Extracted Text (str): The text extracted from the image.
Spam Classification (str): The classification result ("Spam" or "Not Spam").
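
A short sketch of how a caller might unpack and check that tuple. `Prediction` and `parse_prediction` are hypothetical helpers, not part of the API, and the example tuple stands in for a real response from `client.predict`.

```python
from typing import NamedTuple, Sequence


class Prediction(NamedTuple):
    text: str
    label: str


def parse_prediction(result: Sequence[str]) -> Prediction:
    """Validate the (extracted text, classification) pair described above."""
    if len(result) != 2:
        raise ValueError(f"Expected 2 elements, got {len(result)}")
    text, label = result
    if label not in ("Spam", "Not Spam"):
        raise ValueError(f"Unexpected label: {label!r}")
    return Prediction(text=text, label=label)


# Stand-in for the value returned by client.predict(...).
example = parse_prediction(("Congratulations, you won!", "Spam"))
print(example.text)
print(example.label)
```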

Chosen LLM and Justification

I have chosen DistilBERT as the foundational LLM for text classification because of its efficiency, lightweight architecture, and strong performance on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.

Steps for Fine-Tuning or Prompt Engineering

Data Preparation:

Gather a dataset of spam and non-spam text samples.
Preprocess the text (cleaning, tokenization, and padding).
Split data into training and validation sets.
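
The cleaning and splitting steps could look roughly like the sketch below, which uses only the standard library. The function names, the 80/20 split, and the label encoding (1 for spam, 0 for not spam) are assumptions made for illustration. Tokenization and padding are left out here because they are normally handled by the model's tokenizer.

```python
import random
import re
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label), assuming 1 = spam and 0 = not spam


def clean_text(text: str) -> str:
    """Replace control characters, collapse whitespace, and lowercase."""
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def train_val_split(
    samples: List[Sample], val_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle and split per class so both sets keep the label balance."""
    rng = random.Random(seed)
    train: List[Sample] = []
    val: List[Sample] = []
    for label in sorted({lbl for _, lbl in samples}):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


raw = [(f"  Free PRIZE {i}\n", 1) for i in range(10)]
raw += [(f"Meeting at {i} pm\t", 0) for i in range(10)]
cleaned = [(clean_text(text), label) for text, label in raw]
train_set, val_set = train_val_split(cleaned)
print(len(train_set), len(val_set))
```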

Fine-Tuning DistilBERT:

Load the pre-trained DistilBERT model.
Apply transfer learning by training the model on the spam dataset.
Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
Implement cross-entropy loss and optimize with AdamW.
Evaluate performance using precision, recall, and F1-score.
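
The fine-tuning steps above might be sketched as follows. This is not the project's training script: the checkpoint name `distilbert-base-uncased`, the hyperparameters, and the function names are all assumptions. The heavy imports sit inside `fine_tune` so that the metric helper can be used without `torch` or `transformers` installed.

```python
from typing import Dict, List, Sequence

MODEL_NAME = "distilbert-base-uncased"  # assumed checkpoint


def fine_tune(texts: List[str], labels: List[int], epochs: int = 1,
              lr: float = 2e-5, batch_size: int = 16):
    """Train DistilBERT with a 2-class head using cross-entropy and AdamW."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    # num_labels=2 adds a classification head for binary classification.
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for start in range(0, len(texts), batch_size):
            batch_texts = texts[start:start + batch_size]
            batch_labels = torch.tensor(labels[start:start + batch_size])
            enc = tokenizer(batch_texts, padding=True, truncation=True,
                            return_tensors="pt")
            # With integer labels the model returns a cross-entropy loss.
            outputs = model(**enc, labels=batch_labels)
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model, tokenizer


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = binary_metrics([1, 1, 0, 0], [1, 0, 1, 0])
print(scores)
```

In this toy example there is one true positive, one false positive, and one false negative, so precision, recall, and F1 all come out at 0.5.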

Integration with OCR Output

Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
The classification result is appended to the OCR output and stored in ocr_results.json and ocr_results.csv.
The system updates the UI in real time via Gradio to display the extracted text along with the classification label.
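
As an illustration of the storage step, the sketch below appends one result to both files. Only the file names come from this README. The record layout (a `text` and a `label` field) and the helper name `append_result` are assumptions.

```python
import csv
import json
import os
from typing import Dict, List


def append_result(
    text: str,
    label: str,
    json_path: str = "ocr_results.json",
    csv_path: str = "ocr_results.csv",
) -> Dict[str, str]:
    """Append one OCR + classification result to the JSON and CSV files."""
    record = {"text": text, "label": label}

    # JSON: load the existing list (if any), append, and rewrite the file.
    records: List[Dict[str, str]] = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            records = json.load(f)
    records.append(record)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # CSV: write the header only when the file is new or empty.
    needs_header = (
        not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    )
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        if needs_header:
            writer.writeheader()
        writer.writerow(record)
    return record
```

Rewriting the whole JSON file on every call is simple but scales poorly. For larger volumes, a line-delimited format or a small database would be a reasonable alternative.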

Security and Evaluation Strategies

Security Measures:

Sanitize input data to prevent injection attacks.
Implement rate limiting to prevent abuse of the API.
Store results securely, ensuring sensitive data is not exposed.
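
Two of these measures can be sketched with the standard library alone. Everything here is illustrative: the names, the length cap, and the limits are assumptions, and a deployed service would more likely enforce rate limits at the server or proxy layer. The CSV helper targets formula injection, which is relevant because results are written to a CSV file.

```python
import re
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

MAX_TEXT_LENGTH = 10_000  # assumed cap on stored text


def sanitize_text(text: str) -> str:
    """Drop control characters (except tab/newline/CR) and cap the length."""
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text)
    return text[:MAX_TEXT_LENGTH]


def neutralize_csv_cell(value: str) -> str:
    """Prefix cells that a spreadsheet could interpret as a formula."""
    if value.startswith(("=", "+", "-", "@")):
        return "'" + value
    return value


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=60)
decisions = [limiter.allow("client-a", now=t) for t in (0, 1, 2, 61)]
print(decisions)
```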

Evaluation Strategies:

Perform cross-validation to assess model robustness.
Continuously monitor classification accuracy on new incoming data.
Implement feedback mechanisms for users to report misclassifications and improve the model.
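
For the cross-validation step, a minimal sketch of how the folds could be generated is shown below. `k_fold_indices` is a hypothetical helper, and it assumes the samples were shuffled beforehand. A library such as scikit-learn offers equivalent, more complete utilities.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each sample lands in exactly one validation fold, so averaging the per-fold precision, recall, and F1 gives an estimate that does not depend on a single train/validation split.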