---
title: OCR + LLM
emoji: 🔎
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
short_description: Technical Assessment
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# OCR LLM Classifier

This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model for classifying the extracted text as "Spam" or "Not Spam."

## Features

- Extract text from images using OCR.
- Classify extracted text as either "Spam" or "Not Spam."
## How It Works

1. **OCR**: The app uses one of three OCR methods to extract text from the uploaded image:
   - **PaddleOCR**
   - **EasyOCR**
   - **KerasOCR**
2. **Classification**: The extracted text is passed to a pre-trained DistilBERT model that classifies it as either "Spam" or "Not Spam" (sketched below).
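
The two stages can be sketched end to end as follows. This is only an illustrative sketch: it assumes EasyOCR for extraction and a fine-tuned DistilBERT spam checkpoint at a hypothetical local path `./spam-distilbert`; the actual `app.py` may wire the steps differently.

```python
# Illustrative two-stage pipeline (not the exact app.py implementation).
# Assumes: EasyOCR for extraction, and a fine-tuned DistilBERT spam classifier
# saved at the hypothetical local path "./spam-distilbert".
import easyocr
from transformers import pipeline

# Stage 1: OCR - extract text from the image (detail=0 returns plain strings)
reader = easyocr.Reader(["en"])
lines = reader.readtext("sample_images/sample2.png", detail=0)
text = " ".join(lines)

# Stage 2: classification - run the extracted text through DistilBERT
classifier = pipeline("text-classification", model="./spam-distilbert")
label = classifier(text)[0]["label"]  # e.g. "Spam" or "Not Spam"
print(text, "->", label)
```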
## Installation

To get started with this project, follow these steps:

### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/ocr-llm-test.git
cd ocr-llm-test
```

### 2. Install Dependencies

You can install the required dependencies using pip:

```bash
pip install -r requirements.txt
```

### 3. Run the App

To run the Gradio interface locally, execute:

```bash
python app.py
```

Once the app is running, it is accessible in your web browser at [http://localhost:7860](http://localhost:7860).
## API Documentation

### 1. API Endpoint

The main endpoint for this API is `/predict`.

### 2. API Call Example

#### Install the Python Client

If you don't already have it installed, run the following command:

```bash
pip install gradio_client
```
#### Make an API Call

```python
from gradio_client import Client, handle_file

client = Client("winamnd/ocr-llm-test")
result = client.predict(
    method="PaddleOCR",
    img=handle_file('https://huggingface.co/spaces/winamnd/ocr-llm-test/blob/main/sample_images/sample2.png'),
    api_name="/predict"
)
print(result)
```
### 3. Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `method` | `Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR']` | The OCR method to use for text extraction. Default: `"PaddleOCR"`. |
| `img` | `dict` | The image input, which can be provided as a URL, a file path, or a base64-encoded image. |
#### Image Input Details

- **path**: Path to a local file.
- **url**: Publicly available URL for the image.
- **size**: Size of the image in bytes.
- **orig_name**: Original filename.
- **mime_type**: MIME type of the image.
- **is_stream**: Always `False`.
- **meta**: Metadata.
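
For local files, `handle_file` prepares this payload for you, so you rarely need to build the dictionary by hand. A variant of the call above using a local image (the file path here is only an example) could look like:

```python
from gradio_client import Client, handle_file

client = Client("winamnd/ocr-llm-test")

# handle_file accepts a local path as well as a URL and prepares the
# file payload described above before the request is sent.
result = client.predict(
    method="EasyOCR",
    img=handle_file("sample_images/sample2.png"),  # example local file
    api_name="/predict",
)
print(result)
```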
### 4. Returns

The API returns a tuple with two elements:

- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam").
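
Continuing the call example above, the tuple can be unpacked directly in the Python client:

```python
extracted_text, classification = result  # result is a (text, label) tuple
print("Extracted text:", extracted_text)
print("Classification:", classification)
```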
---
# Chosen LLM and Justification

I have chosen **DistilBERT** as the foundational LLM for text classification due to its efficiency, lightweight architecture, and strong performance on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that retains about 97% of BERT's performance while being 60% faster and requiring significantly fewer computational resources. This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.

Reference: [Sanh et al., 2019, *DistilBERT, a distilled version of BERT*](https://arxiv.org/pdf/1910.01108)
## Steps for Fine-Tuning or Prompt Engineering

### Data Preparation:

- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split the data into training and validation sets (see the sketch below).
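
A minimal data-preparation sketch, assuming the dataset lives in a CSV with `text` and `label` columns (the file name and column names are illustrative, not part of this repository):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

# Hypothetical dataset file with columns: text, label (0 = not spam, 1 = spam)
df = pd.read_csv("spam_dataset.csv")
df["text"] = df["text"].str.strip().str.lower()  # basic cleaning

train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["text"].tolist(), df["label"].tolist(), test_size=0.2, random_state=42
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_enc = tokenizer(train_texts, truncation=True, padding=True)
val_enc = tokenizer(val_texts, truncation=True, padding=True)
```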
### Fine-Tuning DistilBERT:

1. Load the pre-trained DistilBERT model.
2. Apply transfer learning by training the model on the spam dataset.
3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
4. Implement cross-entropy loss and optimize with AdamW.
5. Evaluate performance using precision, recall, and F1-score (see the sketch below).
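
Continuing from the data-preparation sketch, these steps can be expressed with the Hugging Face `Trainer` (hyperparameters are illustrative). `DistilBertForSequenceClassification` adds the classification head and uses cross-entropy loss internally, and `Trainer` optimizes with AdamW by default:

```python
import numpy as np
import torch
from torch.utils.data import Dataset
from sklearn.metrics import precision_recall_fscore_support
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments

class SpamDataset(Dataset):
    """Wraps the tokenized encodings and labels produced above."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    p, r, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"precision": p, "recall": r, "f1": f1}

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
args = TrainingArguments(
    output_dir="spam-distilbert",      # checkpoint directory (illustrative)
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=SpamDataset(train_enc, train_labels),
    eval_dataset=SpamDataset(val_enc, val_labels),
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # precision, recall, F1 on the validation set
```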
## Integration with OCR Output

- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The system updates the UI in real time via **Gradio** to display the extracted text along with the classification label (a persistence sketch follows this list).
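
A sketch of how each result could be appended to `ocr_results.json` and `ocr_results.csv`; the exact record schema used by `app.py` is an assumption here:

```python
import csv
import json
import os

def save_result(image_name, extracted_text, label,
                json_path="ocr_results.json", csv_path="ocr_results.csv"):
    """Append one OCR + classification record to the JSON and CSV result files."""
    # Record schema is illustrative; app.py may store different fields.
    record = {"image": image_name, "text": extracted_text, "label": label}

    # JSON: stored as a list of records, rewritten on each call
    records = []
    if os.path.exists(json_path):
        with open(json_path) as f:
            records = json.load(f)
    records.append(record)
    with open(json_path, "w") as f:
        json.dump(records, f, indent=2)

    # CSV: appended row by row, header written only once
    write_header = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=record.keys())
        if write_header:
            writer.writeheader()
        writer.writerow(record)
```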
## Security and Evaluation Strategies

### Security Measures:

- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed.
### Evaluation Strategies:

- Perform cross-validation to assess model robustness (see the sketch below).
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms so users can report misclassifications and help improve the model.
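
A cross-validation sketch over the same dataset; `fine_tune_and_score` is a hypothetical helper that would wrap the fine-tuning and metric computation shown earlier for a single train/validation split:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

texts = df["text"].to_numpy()    # from the data-preparation sketch above
labels = df["label"].to_numpy()

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=42).split(texts, labels):
    # fine_tune_and_score is hypothetical: fine-tune on the train fold, return F1 on the val fold
    f1 = fine_tune_and_score(texts[train_idx], labels[train_idx],
                             texts[val_idx], labels[val_idx])
    scores.append(f1)

print("Mean F1 across folds:", np.mean(scores))
```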