title: OCR + LLM
emoji: 🔎
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
short_description: Technical Assessment

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

OCR LLM Classifier

This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model for classifying the extracted text as "Spam" or "Not Spam."

Features

  • Extract text from images using OCR.
  • Classify extracted text as either "Spam" or "Not Spam."

How It Works

  1. OCR: The app uses one of the three OCR methods to extract text from the uploaded image:

    • PaddleOCR
    • EasyOCR
    • KerasOCR
  2. Classification: The extracted text is passed to a pre-trained DistilBERT model that classifies the text as either "Spam" or "Not Spam."
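The two steps above can be sketched as a small dispatcher. This is only an illustration, not the app's actual code: `run_pipeline`, `demo_backends`, and `demo_classifier` are hypothetical names, and the backends are stand-ins so the sketch runs without any OCR library or model installed.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type aliases: an OCR backend maps an image path to text,
# and a classifier maps text to a label.
OcrBackend = Callable[[str], str]
Classifier = Callable[[str], str]


def run_pipeline(
    image_path: str,
    method: str,
    backends: Dict[str, OcrBackend],
    classify: Classifier,
) -> Tuple[str, str]:
    """Run the selected OCR backend, then classify the extracted text."""
    if method not in backends:
        raise ValueError(f"Unknown OCR method: {method!r}")
    text = backends[method](image_path).strip()
    label = classify(text)
    return text, label


# Stand-ins only: a real app would wrap PaddleOCR, EasyOCR, or KerasOCR
# and a fine-tuned DistilBERT model behind the same two interfaces.
demo_backends: Dict[str, OcrBackend] = {
    "PaddleOCR": lambda path: "WIN A FREE PRIZE NOW ",
    "EasyOCR": lambda path: "Meeting moved to 3 pm",
}


def demo_classifier(text: str) -> str:
    """Toy keyword rule used in place of the DistilBERT classifier."""
    return "Spam" if "free prize" in text.lower() else "Not Spam"


print(run_pipeline("sample.png", "PaddleOCR", demo_backends, demo_classifier))
```

Keeping the OCR backends behind a common callable interface means adding or removing a method only changes the dictionary, not the pipeline function.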

Installation

To get started with this project, follow these steps:

1. Clone the Repository

git clone https://github.com/yourusername/ocr-llm-test.git
cd ocr-llm-test

2. Install Dependencies

You can install the required dependencies using pip:

pip install -r requirements.txt

3. Run the App

To run the Gradio interface locally, execute:

python app.py

Once the app is running, open http://localhost:7860 (Gradio's default port) in your web browser.

API Documentation

1. API Endpoint

The main endpoint for this API is /predict.

2. API Call Example

Install the Python Client

If you don't already have it installed, run the following command:

pip install gradio_client

Make an API Call

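from gradio_client import Client, handle_file

# Connect to the hosted Space.
client = Client("winamnd/ocr-llm-test")

# Use the /resolve/ URL so the raw image is downloaded; /blob/ returns an HTML page.
result = client.predict(
    method="PaddleOCR",
    img=handle_file("https://huggingface.co/spaces/winamnd/ocr-llm-test/resolve/main/sample_images/sample2.png"),
    api_name="/predict",
)
print(result)  # (extracted text, spam classification)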

3. Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| method | Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR'] | The OCR method to use for text extraction. Default is "PaddleOCR". |
| img | dict | The image input, which can be provided as a URL, a path, or a base64-encoded image. |

Image Input Details

  • path: Path to a local file.
  • url: Publicly available URL for the image.
  • size: The size of the image (in bytes).
  • orig_name: Original filename.
  • mime_type: MIME type of the image.
  • is_stream: Always set to False.
  • meta: Metadata.
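As a rough illustration of the fields listed above, the following sketch builds such a dict for a local file using only the standard library. `build_file_data` is a hypothetical helper, and the `meta` value shown is an assumption about what the Gradio client sends; in practice `handle_file` from `gradio_client` prepares the input for you.

```python
import mimetypes
import os
from typing import Any, Dict


def build_file_data(path: str) -> Dict[str, Any]:
    """Build a dict with the image input fields for a local file."""
    mime_type, _ = mimetypes.guess_type(path)
    return {
        "path": path,
        "url": None,  # only set when the image is referenced by URL
        "size": os.path.getsize(path) if os.path.exists(path) else None,
        "orig_name": os.path.basename(path),
        "mime_type": mime_type,
        "is_stream": False,  # always False for this API
        "meta": {"_type": "gradio.FileData"},  # assumed marker value
    }
```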

4. Returns

The API returns a tuple with two elements:

  • Extracted Text (str): The text extracted from the image.
  • Spam Classification (str): The classification result ("Spam" or "Not Spam").
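A caller might unpack and sanity-check the two-element result as follows. This is a minimal sketch: `unpack_result` and `VALID_LABELS` are hypothetical names, and the example uses a hand-written result in place of a live API call.

```python
from typing import Sequence, Tuple

VALID_LABELS = {"Spam", "Not Spam"}


def unpack_result(result: Sequence[str]) -> Tuple[str, str]:
    """Split the two-element API result into (extracted_text, label)."""
    if len(result) != 2:
        raise ValueError(f"Expected 2 elements, got {len(result)}")
    extracted_text, label = result
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return extracted_text, label


# Hand-written result standing in for the value returned by client.predict().
text, label = unpack_result(("Congratulations, you won!", "Spam"))
print(f"{label}: {text}")
```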

Chosen LLM and Justification

I chose DistilBERT as the foundational LLM for text classification because it is efficient, lightweight, and performs well on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications (reference).

Steps for Fine-Tuning or Prompt Engineering

Data Preparation:

  • Gather a dataset of spam and non-spam text samples.
  • Preprocess the text (cleaning, tokenization, and padding).
  • Split data into training and validation sets.
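The cleaning and splitting steps could look roughly like this. It is a sketch under assumptions: `clean_text` and `train_val_split` are hypothetical helpers, the 80/20 split is an arbitrary choice, and tokenization and padding are left to the model's tokenizer rather than shown here.

```python
import random
import re
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label) with 1 = spam, 0 = not spam


def clean_text(text: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def train_val_split(
    samples: List[Sample], val_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle deterministically and split into (train, validation)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


raw = [("  WIN a FREE\nprize ", 1), ("See you at  lunch", 0)] * 5
data = [(clean_text(text), label) for text, label in raw]
train, val = train_val_split(data)
print(len(train), len(val))
```

Fixing the seed keeps the split reproducible, so evaluation numbers stay comparable between training runs.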

Fine-Tuning DistilBERT:

  1. Load the pre-trained DistilBERT model.
  2. Apply transfer learning by training the model on the spam dataset.
  3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
  4. Use cross-entropy loss and optimize with AdamW.
  5. Evaluate performance using precision, recall, and F1-score.
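For the evaluation step, precision, recall, and F1 can be computed directly from predicted and true labels. The sketch below is a plain-Python illustration, assuming labels are encoded as 1 for spam and 0 for not spam; `binary_metrics` is a hypothetical helper, and a library such as scikit-learn could be used instead.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, and F1 for the positive (spam = 1) class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Fall back to 0.0 when a denominator is zero (no predicted or true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

For spam filtering, precision matters when false alarms are costly (legitimate text flagged as spam), while recall matters when missed spam is the bigger concern; F1 balances the two.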

Integration with OCR Output

  • Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
  • The classification result is appended to the OCR output and stored in ocr_results.json and ocr_results.csv.
  • The system updates the UI in real-time via Gradio to display extracted text along with the classification label.
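Storing each result in both files might be done along these lines. This is a sketch, not the app's actual implementation: `append_result` is a hypothetical helper and the three-field record layout in `FIELDS` is an assumption; only the file names come from the description above.

```python
import csv
import json
import os
from typing import Dict, List

FIELDS = ["method", "text", "label"]  # assumed record layout


def append_result(
    record: Dict[str, str],
    json_path: str = "ocr_results.json",
    csv_path: str = "ocr_results.csv",
) -> None:
    """Append one OCR + classification record to the JSON and CSV files."""
    # JSON: load the existing list (if any), append, and rewrite the file.
    results: List[Dict[str, str]] = []
    if os.path.exists(json_path):
        with open(json_path, encoding="utf-8") as f:
            results = json.load(f)
    results.append(record)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)

    # CSV: append one row, writing the header only when the file is new.
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(record)
```

Rewriting the whole JSON file on every call is simple but grows slower as results accumulate; a line-delimited format or a small database would scale better.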

Security and Evaluation Strategies

Security Measures:

  • Sanitize input data to prevent injection attacks.
  • Implement rate limiting to prevent abuse of the API.
  • Store results securely, ensuring sensitive data is not exposed.
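One way to approach the rate-limiting measure is a per-client sliding window. The class below is a minimal in-memory sketch under assumptions: `SlidingWindowRateLimiter` is a hypothetical name, the limits are arbitrary, and a deployed service would more likely rely on a reverse proxy or a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowRateLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        """Record the request and return True if the client is under its limit."""
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_requests=5, window_seconds=60.0)
print(limiter.allow("client-1"))
```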

Evaluation Strategies:

  • Perform cross-validation to assess model robustness.
  • Continuously monitor classification accuracy on new incoming data.
  • Implement feedback mechanisms for users to report misclassifications and improve the model.
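The cross-validation strategy can be illustrated with a plain k-fold index generator. This is a sketch only: `k_fold_indices` is a hypothetical helper, it does not stratify by class, and the training and scoring of the model on each fold are left out.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5)):
    print(fold, len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once, so averaging the per-fold metrics gives a steadier estimate than a single train/validation split. With an imbalanced spam dataset, a stratified variant would be the safer choice.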