	---
	title: OCR + LLM
	emoji: 🔎
	colorFrom: pink
	colorTo: gray
	sdk: gradio
	sdk_version: 5.16.0
	app_file: app.py
	pinned: false
	short_description: Technical Assessment
	---

	Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

	# OCR LLM Classifier

	This project provides a simple interface for Optical Character Recognition (OCR) and spam classification using deep learning models. It supports three OCR methods (PaddleOCR, EasyOCR, and KerasOCR) and uses a DistilBERT model for classifying the extracted text as "Spam" or "Not Spam."

	## Features
	- Extract text from images using OCR.
	- Classify extracted text as either "Spam" or "Not Spam."

	## How It Works
	1. OCR: The app uses one of the three OCR methods to extract text from the uploaded image:
	   - PaddleOCR
	   - EasyOCR
	   - KerasOCR
	2. Classification: The extracted text is passed to a pre-trained DistilBERT model that classifies the text as either "Spam" or "Not Spam."
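The two steps above can be sketched as a small dispatch function. This is only an illustration of the flow, not the code in `app.py`: `fake_ocr`, `fake_classifier`, `OCR_METHODS`, and `run_pipeline` are hypothetical names, and the stand-in functions return canned values instead of calling a real OCR engine or the DistilBERT model.

```python
from typing import Callable, Dict, Tuple


# Hypothetical stand-in: a real app would call PaddleOCR, EasyOCR, or KerasOCR here.
def fake_ocr(image_path: str) -> str:
    return "WIN a FREE prize now"


# Hypothetical stand-in for the DistilBERT classifier (a keyword rule, not a model).
def fake_classifier(text: str) -> str:
    return "Spam" if "free" in text.lower() else "Not Spam"


# Map each method name to the function that extracts text from an image.
OCR_METHODS: Dict[str, Callable[[str], str]] = {
    "PaddleOCR": fake_ocr,
    "EasyOCR": fake_ocr,
    "KerasOCR": fake_ocr,
}


def run_pipeline(
    method: str,
    image_path: str,
    classify: Callable[[str], str] = fake_classifier,
) -> Tuple[str, str]:
    """Step 1: extract text with the chosen OCR method. Step 2: classify it."""
    if method not in OCR_METHODS:
        raise ValueError(f"Unknown OCR method: {method!r}")
    text = OCR_METHODS[method](image_path)
    return text, classify(text)


print(run_pipeline("PaddleOCR", "sample.png"))
```

Keeping the OCR engines behind a name-to-function mapping means a new engine can be added without touching the classification step.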


	## Installation

	To get started with this project, follow these steps:

	### 1. Clone the Repository
	```bash
	git clone https://github.com/yourusername/ocr-llm-test.git
	cd ocr-llm-test
	```

	### 2. Install Dependencies
	You can install the required dependencies using pip:

	```bash
	pip install -r requirements.txt
	```

	### 3. Run the App
	To run the Gradio interface locally, execute:

	```bash
	python app.py
	```

	Once the app is running, it will be accessible through your web browser at [http://localhost:7860](http://localhost:7860).

	## API Documentation

	### 1. API Endpoint

	The main endpoint for this API is `/predict`.

	### 2. API Call Example

	#### Install the Python Client
	If you don't already have it installed, run the following command:

	```bash
	pip install gradio_client
	```

	#### Make an API Call

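	```python
	from gradio_client import Client, handle_file

	# Connect to the hosted Space.
	client = Client("winamnd/ocr-llm-test")

	# Use the /resolve/ URL so the raw image is fetched, not the HTML file-viewer page.
	result = client.predict(
	    method="PaddleOCR",
	    img=handle_file("https://huggingface.co/spaces/winamnd/ocr-llm-test/resolve/main/sample_images/sample2.png"),
	    api_name="/predict",
	)
	print(result)  # tuple: (extracted text, spam classification)
	```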

	### 3. Parameters

	| Parameter | Type | Description |
	|-----------|------|-------------|
	| `method` | `Literal['PaddleOCR', 'EasyOCR', 'KerasOCR', 'TesseractOCR']` | The OCR method used for text extraction. Default is `"PaddleOCR"`. |
	| `img` | `dict` | The image input, which can be provided as a URL, a path, or a base64-encoded image. |

	#### Image Input Details
	- `path`: Path to a local file.
	- `url`: Publicly available URL for the image.
	- `size`: The size of the image (in bytes).
	- `orig_name`: Original filename.
	- `mime_type`: MIME type of the image.
	- `is_stream`: Always set to `False`.
	- `meta`: Metadata.
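As a rough illustration of the fields listed above, the sketch below builds such a dict for a local file using only the standard library. `build_image_payload` is a hypothetical helper, and the value used for `meta` is an assumption. In the API call example, `handle_file` is what prepares the image input, so this is only meant to show what each field holds.

```python
import mimetypes
import os
import tempfile
from typing import Any, Dict


def build_image_payload(path: str) -> Dict[str, Any]:
    """Build a dict with the documented image fields for a local file."""
    mime_type, _ = mimetypes.guess_type(path)  # guessed from the file extension
    return {
        "path": path,
        "url": None,  # not set for a local file
        "size": os.path.getsize(path),  # size in bytes
        "orig_name": os.path.basename(path),
        "mime_type": mime_type,
        "is_stream": False,  # documented as always False
        "meta": {"_type": "gradio.FileData"},  # assumed metadata value
    }


# Demo with a throwaway file that contains only the 8-byte PNG signature.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "sample.png")
    with open(demo_path, "wb") as f:
        f.write(b"\x89PNG\r\n\x1a\n")
    payload = build_image_payload(demo_path)
    print(payload["orig_name"], payload["size"], payload["mime_type"])
```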

	### 4. Returns
	The API returns a tuple with two elements:

	- Extracted Text (`str`): The text extracted from the image.
	- Spam Classification (`str`): The classification result ("Spam" or "Not Spam").
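A caller will usually want to unpack the returned tuple right away. The helper below is a minimal sketch of that step. `unpack_result` is a hypothetical name, and the check against the two labels assumes the API returns exactly the strings documented above.

```python
from typing import Dict, Sequence, Union

VALID_LABELS = {"Spam", "Not Spam"}


def unpack_result(result: Sequence[str]) -> Dict[str, Union[str, bool]]:
    """Split the (extracted text, classification) tuple into named fields."""
    text, label = result  # raises ValueError if there are not exactly two elements
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected classification label: {label!r}")
    return {"text": text, "label": label, "is_spam": label == "Spam"}


print(unpack_result(("Claim your prize today", "Spam")))
```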

	---

	# Chosen LLM and Justification

	I chose DistilBERT as the foundational LLM for text classification because of its efficiency, lightweight architecture, and strong performance on natural language processing (NLP) tasks. DistilBERT is a distilled version of BERT that is 40% smaller and 60% faster while retaining 97% of BERT’s language understanding performance. This makes it well suited to classifying OCR-extracted text as spam or not spam in real time.
	Reference: [Sanh et al., 2019, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter"](https://arxiv.org/pdf/1910.01108)


	## Steps for Fine-Tuning or Prompt Engineering

	### Data Preparation:
	- Gather a dataset of spam and non-spam text samples.
	- Preprocess the text (cleaning, tokenization, and padding).
	- Split data into training and validation sets.
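The cleaning and splitting steps can be sketched with the standard library alone. The cleaning rules, the 80/20 split, and the names `clean_text` and `train_val_split` are assumptions for illustration. Tokenization and padding are not shown, since the DistilBERT tokenizer normally handles them.

```python
import random
import re
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label) with 1 = spam, 0 = not spam


def clean_text(text: str) -> str:
    """Drop non-printable characters, collapse whitespace, and lowercase."""
    kept = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    return re.sub(r"\s+", " ", kept).strip().lower()


def train_val_split(
    samples: List[Sample], val_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle deterministically, then hold out a fraction for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Toy data: alternate spam and non-spam examples.
raw = [(f"  Message\t{i}\n", i % 2) for i in range(10)]
cleaned = [(clean_text(text), label) for text, label in raw]
train, val = train_val_split(cleaned)
print(len(train), len(val))
```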

	### Fine-Tuning DistilBERT:
	1. Load the pre-trained DistilBERT model.
	2. Apply transfer learning by training the model on the spam dataset.
	3. Use a classification head (fully connected layer) on top of DistilBERT for binary classification.
	4. Implement cross-entropy loss and optimize with AdamW.
	5. Evaluate performance using precision, recall, and F1-score.
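For the evaluation step, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean, where "Spam" is treated as the positive class. The function below is a small sketch of that computation. `precision_recall_f1` is a hypothetical helper, and the labels are made-up example data, not results from this project.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1


# Made-up labels: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```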


	## Integration with OCR Output

	- Once text is extracted using OCR (PaddleOCR, EasyOCR, or KerasOCR), it is passed to the DistilBERT model for classification.
	- The classification result is appended to the OCR output and stored in `ocr_results.json` and `ocr_results.csv`.
	- Gradio updates the UI in real time to display the extracted text along with its classification label.
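One way the storage step could look is sketched below. The record layout (a JSON list of `{"text", "label"}` objects and a CSV file with the same two columns) is an assumption, as is the helper name `append_result`. The actual files written by the app may be structured differently.

```python
import csv
import json
import os
import tempfile
from typing import Dict

FIELDS = ["text", "label"]  # assumed column layout


def append_result(
    text: str,
    label: str,
    json_path: str = "ocr_results.json",
    csv_path: str = "ocr_results.csv",
) -> Dict[str, str]:
    """Append one OCR + classification record to the JSON and CSV files."""
    record = {"text": text, "label": label}

    # JSON: load the existing list (if any), append, and rewrite the file.
    records = []
    if os.path.exists(json_path):
        with open(json_path, "r", encoding="utf-8") as f:
            records = json.load(f)
    records.append(record)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # CSV: append a row, writing the header only when the file is new or empty.
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)
    return record


# Demo in a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    jp = os.path.join(tmp_dir, "ocr_results.json")
    cp = os.path.join(tmp_dir, "ocr_results.csv")
    append_result("Claim your prize", "Spam", jp, cp)
    append_result("Lunch at noon?", "Not Spam", jp, cp)
    with open(jp, encoding="utf-8") as f:
        print(len(json.load(f)))
```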


	## Security and Evaluation Strategies

	### Security Measures:
	- Sanitize input data to prevent injection attacks.
	- Implement rate limiting to prevent abuse of the API.
	- Store results securely, ensuring sensitive data is not exposed.
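The first two measures can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not what the Space actually runs: `sanitize_text` and `SlidingWindowRateLimiter` are hypothetical names, the length cap and limits are arbitrary, and a deployed service would more likely enforce rate limits in a shared store or at a proxy.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


def sanitize_text(text: str, max_length: int = 5000) -> str:
    """Strip control characters (except newline and tab) and cap the length."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return cleaned[:max_length]


class SlidingWindowRateLimiter:
    """Allow at most `max_calls` per client within `window_seconds`."""

    def __init__(
        self,
        max_calls: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock
        self._calls: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self._clock()
        calls = self._calls[client_id]
        # Forget timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


limiter = SlidingWindowRateLimiter(max_calls=2, window_seconds=60)
print([limiter.allow("client-a") for _ in range(3)])
```

The injectable `clock` argument exists so the limiter can be exercised in tests without sleeping.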

	### Evaluation Strategies:
	- Perform cross-validation to assess model robustness.
	- Continuously monitor classification accuracy on new incoming data.
	- Implement feedback mechanisms for users to report misclassifications and improve the model.
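Cross-validation can be illustrated with a small k-fold index generator: the data is shuffled once, split into k folds, and each fold serves as the validation set exactly once. `k_fold_indices` is a hypothetical helper written for this sketch. A real project would more likely use a library implementation.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, val_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


for fold_number, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(fold_number, len(train_idx), len(val_idx))
```

Training and scoring the classifier once per fold, then averaging the scores, gives a more stable estimate than a single train/validation split.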