	.. _algorithm_ocr:

	=============================================
	OCR (Optical Character Recognition) Algorithm
	=============================================

	Introduction
	====================

	OCR (Optical Character Recognition) identifies the positions and contents of all text blocks in an image.


	Model Usage
	====================

	With the environment properly set up, run the OCR script ``scripts/ocr.py``:

	.. code:: shell

	   $ python scripts/ocr.py --config configs/ocr.yaml


	Model Configuration
	--------------------

	.. code:: yaml

	   inputs: assets/demo/ocr
	   outputs: outputs/ocr
	   visualize: True
	   tasks:
	     ocr:
	       model: ocr_ppocr
	       model_config:
	         lang: ch
	         show_log: True
	         det_model_dir: models/OCR/PaddleOCR/det/ch_PP-OCRv4_det
	         rec_model_dir: models/OCR/PaddleOCR/rec/ch_PP-OCRv4_rec
	         det_db_box_thresh: 0.3

	- ``inputs`` / ``outputs``: The input path and the output path, respectively.
	- ``visualize``: Whether to visualize the model results. Visualized results are saved in the ``outputs`` directory.
	- ``tasks``: The task type. Currently only an OCR task is included.
	- ``model``: The specific model type. Currently only the PaddleOCR model is available.
	- ``model_config``: The model configuration.
	- ``lang``: The language. The default, ``ch``, supports both English and Chinese.
	- ``show_log``: Whether to print running logs.
	- ``det_model_dir``: The path of PaddleOCR's detection model. If the specified path does not exist, the model weights are automatically downloaded to that path.
	- ``rec_model_dir``: The path of PaddleOCR's recognition model. If the specified path does not exist, the model weights are automatically downloaded to that path.
	- ``det_db_box_thresh``: Confidence filter threshold. Bounding boxes whose confidence is lower than the threshold are discarded.
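
	The effect of ``det_db_box_thresh`` can be illustrated with a minimal sketch. PaddleOCR applies this filter internally, so the ``TextBox`` type and ``filter_boxes`` function below are hypothetical names used only for illustration, not part of PDF-Extract-Kit or PaddleOCR.

	.. code:: python

	   from typing import List, NamedTuple


	   class TextBox(NamedTuple):
	       """Hypothetical container for one detected text region."""
	       points: List[List[float]]  # four [x, y] corners of the box
	       score: float               # detection confidence in [0, 1]


	   def filter_boxes(boxes: List[TextBox], box_thresh: float = 0.3) -> List[TextBox]:
	       """Keep boxes whose confidence is not lower than the threshold."""
	       return [box for box in boxes if box.score >= box_thresh]


	   candidates = [
	       TextBox([[0, 0], [10, 0], [10, 5], [0, 5]], 0.92),
	       TextBox([[0, 8], [10, 8], [10, 12], [0, 12]], 0.25),
	   ]
	   kept = filter_boxes(candidates, box_thresh=0.3)
	   print(len(kept))  # only the first box passes the 0.3 threshold

	Raising the threshold keeps fewer, more confident boxes; lowering it keeps more boxes at the risk of spurious detections.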


	Diverse Input Support
	---------------------

	The OCR script in PDF-Extract-Kit accepts several kinds of input: a single image or PDF file, or a directory of image and PDF files.
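
	One way such input handling could work is sketched below. This is not the actual PDF-Extract-Kit implementation: the ``collect_inputs`` function and the ``SUPPORTED_SUFFIXES`` set are assumptions made for illustration, and the real script may accept other file types.

	.. code:: python

	   from pathlib import Path
	   from typing import List

	   # Assumed set of supported extensions; the real script may differ.
	   SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf"}


	   def collect_inputs(path: str) -> List[Path]:
	       """Return the supported files named by ``path`` (a file or a directory)."""
	       target = Path(path)
	       if target.is_file():
	           # A single image or PDF: return it if its type is supported.
	           return [target] if target.suffix.lower() in SUPPORTED_SUFFIXES else []
	       if target.is_dir():
	           # A directory: gather its supported files in a stable order.
	           return sorted(
	               p for p in target.iterdir()
	               if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
	           )
	       raise FileNotFoundError(f"Input path does not exist: {path}")

	With the demo configuration, ``collect_inputs("assets/demo/ocr")`` would list the image and PDF files found directly in that directory.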


	Viewing Visualization Results
	-----------------------------

	When the ``visualize`` option in the config file is set to ``True``, visualization results will be saved in the ``outputs`` directory.

	.. note::

	   Visualization facilitates the analysis of model results. However, for large-scale tasks, it is recommended to disable visualization (set ``visualize`` to ``False``) to reduce memory and disk usage.
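
	For a large batch run, the configuration can be adjusted before launching. The sketch below mirrors the YAML configuration from this page as a plain Python dictionary and turns visualization off; the ``for_batch_run`` helper is a hypothetical name for illustration, and editing ``configs/ocr.yaml`` directly achieves the same result.

	.. code:: python

	   import copy

	   # Mirrors the YAML configuration shown on this page.
	   base_config = {
	       "inputs": "assets/demo/ocr",
	       "outputs": "outputs/ocr",
	       "visualize": True,
	       "tasks": {
	           "ocr": {
	               "model": "ocr_ppocr",
	               "model_config": {
	                   "lang": "ch",
	                   "show_log": True,
	                   "det_model_dir": "models/OCR/PaddleOCR/det/ch_PP-OCRv4_det",
	                   "rec_model_dir": "models/OCR/PaddleOCR/rec/ch_PP-OCRv4_rec",
	                   "det_db_box_thresh": 0.3,
	               },
	           }
	       },
	   }


	   def for_batch_run(config: dict) -> dict:
	       """Return a copy of ``config`` with visualization turned off."""
	       batch_config = copy.deepcopy(config)  # leave the original untouched
	       batch_config["visualize"] = False
	       return batch_config


	   batch_config = for_batch_run(base_config)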