---
title: OCR + LLM
emoji: 🔎
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
short_description: Technical Assessment
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# OCR LLM Classifier

This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model for classifying the extracted text as "Spam" or "Not Spam."

## Features
- Extract text from images using OCR.
- Classify extracted text as either "Spam" or "Not Spam."

## How It Works
1. **OCR**: The app uses one of the three OCR methods to extract text from the uploaded image:
   - **PaddleOCR**
   - **EasyOCR**
   - **KerasOCR**
   
2. **Classification**: The extracted text is passed to a pre-trained DistilBERT model that classifies the text as either "Spam" or "Not Spam."
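
The two stages above can be sketched as one function that receives the OCR step and the classifier as arguments. This is a minimal illustration, not the code in `app.py`: `run_pipeline`, `fake_ocr`, and `fake_classifier` are hypothetical names, and the stand-ins replace the real OCR engines and the DistilBERT model so the example runs without extra dependencies. Returning "Not Spam" for an image with no text is also an assumption.

```python
from typing import Callable, Tuple

# Hypothetical type aliases; the real app may be structured differently.
OcrFn = Callable[[str], str]        # image path -> extracted text
ClassifyFn = Callable[[str], str]   # text -> "Spam" or "Not Spam"


def run_pipeline(image_path: str, ocr_fn: OcrFn, classify_fn: ClassifyFn) -> Tuple[str, str]:
    """Run OCR on the image, then classify the extracted text."""
    text = ocr_fn(image_path).strip()
    if not text:
        # Assumption: with no extracted text there is nothing to flag as spam.
        return "", "Not Spam"
    return text, classify_fn(text)


# Stand-ins so the sketch runs without any OCR engine or model installed.
def fake_ocr(image_path: str) -> str:
    return "WIN a FREE prize now!!!"


def fake_classifier(text: str) -> str:
    return "Spam" if "free" in text.lower() else "Not Spam"


text, label = run_pipeline("sample.png", fake_ocr, fake_classifier)
print(text, "->", label)
```

Passing the two steps in as callables keeps the flow identical whichever OCR method is selected.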


## Installation

To get started with this project, follow these steps:

### 1. Clone the Repository
```bash
git clone https://github.com/yourusername/ocr-llm-test.git
cd ocr-llm-test
```

### 2. Install Dependencies
You can install the required dependencies using pip:

```bash
pip install -r requirements.txt
```

### 3. Run the App
To run the Gradio interface locally, execute:

```bash
python app.py
```

Once the app is running, it will be accessible through your web browser at [http://localhost:7860](http://localhost:7860).

## API Documentation

### 1. API Endpoint

The main endpoint for this API is `/predict`.

### 2. API Call Example

#### Install the Python Client
If you don't already have it installed, run the following command:

```bash
pip install gradio_client
```

#### Make an API Call

```python
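from gradio_client import Client, handle_file

# Connect to the hosted Space.
client = Client("winamnd/ocr-llm-test")

# Use /resolve/ (not /blob/) so the URL returns the raw image rather than an HTML page.
result = client.predict(
    method="PaddleOCR",
    img=handle_file('https://huggingface.co/spaces/winamnd/ocr-llm-test/resolve/main/sample_images/sample2.png'),
    api_name="/predict"
)
print(result)  # a tuple: (extracted text, spam classification)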
```

### 3. Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `method` | `Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR']` | The OCR method used for text extraction. Defaults to `"PaddleOCR"`. |
| `img` | `dict` | The image to process, given as a URL, a local path, or a base64-encoded image. |

#### Image Input Details
The `img` dict has the following fields:
- **path**: Path to a local file.
- **url**: Publicly available URL for the image.
- **size**: The size of the image (in bytes).
- **orig_name**: Original filename.
- **mime_type**: MIME type of the image.
- **is_stream**: Always set to False.
- **meta**: Metadata.
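
To make the field list concrete, here is a rough sketch of a helper that fills in such a dict from a local path or a URL using only the standard library. `build_image_payload` is a hypothetical name, and the `meta` value is an assumption about how Gradio tags file payloads. In practice `handle_file` from `gradio_client` prepares the input for you, so this is only meant to show the shape of the data.

```python
import mimetypes
import os
from typing import Any, Dict, Optional


def build_image_payload(path: Optional[str] = None, url: Optional[str] = None) -> Dict[str, Any]:
    """Build a dict with the fields listed above from a local path or a URL."""
    if (path is None) == (url is None):
        raise ValueError("Provide exactly one of path or url")
    source = path if path is not None else url
    # guess_type works on both file names and URLs; it returns None if unknown.
    mime_type, _ = mimetypes.guess_type(source)
    # The size is only known for local files that exist.
    size = os.path.getsize(path) if path is not None and os.path.exists(path) else None
    return {
        "path": path,
        "url": url,
        "size": size,
        "orig_name": os.path.basename(source),
        "mime_type": mime_type,
        "is_stream": False,
        # Assumption: Gradio marks file payloads with this type tag.
        "meta": {"_type": "gradio.FileData"},
    }


payload = build_image_payload(url="https://example.com/images/sample2.png")
print(payload["orig_name"], payload["mime_type"])
```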

### 4. Returns
The API returns a tuple with two elements:

- **Extracted Text (`str`)**: The text extracted from the image.
- **Spam Classification (`str`)**: The classification result ("Spam" or "Not Spam").
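
As a small usage sketch, the returned pair can be unpacked and checked before it is used elsewhere. `parse_prediction` and `VALID_LABELS` are hypothetical names, and the hard-coded tuple stands in for the value returned by `client.predict(...)`.

```python
from typing import Dict, Sequence

VALID_LABELS = {"Spam", "Not Spam"}


def parse_prediction(result: Sequence[str]) -> Dict[str, str]:
    """Unpack the (text, label) pair returned by the endpoint into a dict."""
    if len(result) != 2:
        raise ValueError(f"Expected 2 elements, got {len(result)}")
    text, label = result
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return {"text": text, "label": label}


# Hard-coded value standing in for the output of client.predict(...).
print(parse_prediction(("Limited offer, click here", "Spam")))
```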

---

# Chosen LLM and Justification

I have chosen **DistilBERT** as the foundational LLM for text classification because it is efficient, lightweight, and performs well on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT's language understanding performance ([Sanh et al., 2019](https://arxiv.org/pdf/1910.01108)). This makes it well suited to classifying extracted text as spam or not spam in real-time OCR applications.


## Steps for Fine-Tuning or Prompt Engineering

### Data Preparation
- Gather a dataset of spam and non-spam text samples.
- Preprocess the text (cleaning, tokenization, and padding).
- Split data into training and validation sets.
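
A minimal sketch of the cleaning and splitting steps, using only the standard library. The function names, the 80/20 split, and the label convention (1 = spam, 0 = not spam) are assumptions for illustration. Tokenization and padding are left to the model's tokenizer and are not shown here.

```python
import random
import re
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label), assuming 1 = spam and 0 = not spam


def clean_text(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace left by OCR."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def stratified_split(samples: List[Sample], val_fraction: float = 0.2,
                     seed: int = 42) -> Tuple[List[Sample], List[Sample]]:
    """Split into train and validation sets, keeping the class ratio similar."""
    rng = random.Random(seed)
    train: List[Sample] = []
    val: List[Sample] = []
    for label in sorted({lbl for _, lbl in samples}):
        group = [s for s in samples if s[1] == label]
        rng.shuffle(group)
        # Keep at least one validation sample per class.
        n_val = max(1, round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


data = [(clean_text("  WIN   a\nPrize "), 1), (clean_text("Meeting at 10am"), 0)] * 10
train_set, val_set = stratified_split(data)
print(len(train_set), len(val_set))
```

Splitting per class matters when spam is much rarer than non-spam, since a plain random split could leave the validation set with very few spam samples.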

### Fine-Tuning DistilBERT
1. Load the pre-trained DistilBERT model.
2. Apply transfer learning by training the model on the spam dataset.
3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
4. Implement cross-entropy loss and optimize with AdamW.
5. Evaluate performance using precision, recall, and F1-score.
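
The steps above could look roughly like the sketch below, assuming `torch` and `transformers` are installed. The `distilbert-base-uncased` checkpoint, the hyperparameters, and the function names are assumptions, not the project's actual training code. For brevity it trains on one full batch per epoch; a real run would use mini-batches. The metric helper is plain Python and treats spam (label 1) as the positive class.

```python
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1 with spam (label 1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def fine_tune(texts: List[str], labels: List[int], epochs: int = 3, lr: float = 2e-5):
    """Fine-tune DistilBERT for binary classification (steps 1 to 4)."""
    # Imported lazily so the metric helper above works without torch installed.
    import torch
    from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

    name = "distilbert-base-uncased"  # assumed checkpoint
    tokenizer = DistilBertTokenizerFast.from_pretrained(name)
    # num_labels=2 puts a freshly initialised classification head on top.
    model = DistilBertForSequenceClassification.from_pretrained(name, num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    targets = torch.tensor(labels)

    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        # When labels are passed, the model returns a cross-entropy loss.
        loss = model(**batch, labels=targets).loss
        loss.backward()
        optimizer.step()
    return tokenizer, model


print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
```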


## Integration with OCR Output

- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
- The Gradio UI updates in real time to display the extracted text along with the classification label.
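
One way the storage step could work is sketched below with the standard `json` and `csv` modules. The file names come from the description above; the `save_result` helper and the column names in `FIELDS` are hypothetical, and the real files may use a different layout.

```python
import csv
import json
import os
from typing import Dict, List

FIELDS = ["method", "text", "label"]  # hypothetical column names


def save_result(record: Dict[str, str], json_path: str = "ocr_results.json",
                csv_path: str = "ocr_results.csv") -> None:
    """Append one OCR + classification record to both output files."""
    # JSON: load the existing list (if any), append, and rewrite the file.
    records: List[Dict[str, str]] = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            records = json.load(f)
    records.append(record)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # CSV: append a row, writing the header only when the file is new.
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)
```

A call such as `save_result({"method": "EasyOCR", "text": "hello", "label": "Not Spam"})` would then add one entry to each file.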


## Security and Evaluation Strategies

### Security Measures
- Sanitize input data to prevent injection attacks.
- Implement rate limiting to prevent abuse of the API.
- Store results securely, ensuring sensitive data is not exposed.
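
The first two measures can be illustrated with a short standard-library sketch. Everything here is hypothetical: the length cap, the `sanitize_text` helper, and the sliding-window limiter are one possible approach, not what the app currently implements.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional

MAX_TEXT_LENGTH = 5000  # hypothetical cap on text passed to the classifier


def sanitize_text(text: str) -> str:
    """Drop control characters (keeping newlines and tabs) and cap the length."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return cleaned[:MAX_TEXT_LENGTH]


class SlidingWindowRateLimiter:
    """Allow at most max_calls per client within any window of `window` seconds."""

    def __init__(self, max_calls: int = 10, window: float = 60.0) -> None:
        self.max_calls = max_calls
        self.window = window
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[client_id]
        # Forget calls that have fallen out of the window.
        while calls and now - calls[0] >= self.window:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_calls=2, window=10.0)
print([limiter.allow("client-a", now=t) for t in (0.0, 1.0, 2.0)])
```

The optional `now` argument exists only to make the limiter easy to test; callers would normally omit it.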

### Evaluation Strategies
- Perform cross-validation to assess model robustness.
- Continuously monitor classification accuracy on new incoming data.
- Implement feedback mechanisms for users to report misclassifications and improve the model.
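
As a rough illustration of the first two strategies, the sketch below generates k-fold train/validation index splits and tracks accuracy over the most recent user-confirmed predictions. The names `k_fold_indices` and `RollingAccuracy` are hypothetical. The folds are contiguous, so the data should be shuffled beforehand.

```python
from collections import deque
from typing import Deque, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


class RollingAccuracy:
    """Track accuracy over the most recent `window` user-confirmed predictions."""

    def __init__(self, window: int = 100) -> None:
        self._hits: Deque[bool] = deque(maxlen=window)

    def record(self, predicted: str, actual: str) -> None:
        self._hits.append(predicted == actual)

    def accuracy(self) -> float:
        return sum(self._hits) / len(self._hits) if self._hits else 0.0


for train_idx, val_idx in k_fold_indices(10, k=3):
    print(len(train_idx), len(val_idx))
```

Each user report of a misclassification could feed `RollingAccuracy.record`, which gives a simple signal for when the model may need retraining.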