
..  _algorithm_ocr:

=============================================
OCR (Optical Character Recognition) Algorithm
=============================================

Introduction
====================

OCR (Optical Character Recognition) identifies the positions and contents of all text blocks in an image.


Model Usage
====================

Once the environment is set up, run the OCR script ``scripts/ocr.py`` with a configuration file:

.. code:: shell

   $ python scripts/ocr.py --config configs/ocr.yaml


Model Configuration
--------------------

.. code:: yaml

   inputs: assets/demo/ocr
   outputs: outputs/ocr
   visualize: True
   tasks:
      ocr:
         model: ocr_ppocr
         model_config:
            lang: ch
            show_log: True
            det_model_dir: models/OCR/PaddleOCR/det/ch_PP-OCRv4_det
            rec_model_dir: models/OCR/PaddleOCR/rec/ch_PP-OCRv4_rec
            det_db_box_thresh: 0.3

- inputs/outputs: The input path and the output path, respectively.
- visualize: Whether to visualize the model results. Visualized results are saved in the outputs directory.
- tasks: The task type. Currently only an OCR task is included.
- model: The specific model type. Currently only the PaddleOCR model is available.
- model_config: The model configuration.

  - lang: The language. The default, ``ch``, supports both English and Chinese.
  - show_log: Whether to print running logs.
  - det_model_dir: Path to PaddleOCR's detection model. If the specified path does not exist, the model weights are automatically downloaded to that path.
  - rec_model_dir: Path to PaddleOCR's recognition model. If the specified path does not exist, the model weights are automatically downloaded to that path.
  - det_db_box_thresh: Confidence filter threshold. Bounding boxes whose confidence is lower than the threshold are discarded.
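
As a rough illustration of what ``det_db_box_thresh`` does, the sketch below filters a list of detections by confidence. The ``filter_boxes`` helper and the ``(box, score)`` tuple layout are assumptions made for this example; they are not part of PaddleOCR or PDF-Extract-Kit, which apply the threshold internally.

.. code:: python

   from typing import List, Sequence, Tuple

   # Hypothetical layout: each detection is (box, score), where box is a list
   # of four (x, y) corner points and score is the detector's confidence.
   Box = Sequence[Tuple[float, float]]
   Detection = Tuple[Box, float]


   def filter_boxes(detections: List[Detection], box_thresh: float = 0.3) -> List[Detection]:
       """Drop detections whose confidence is lower than ``box_thresh``."""
       return [(box, score) for box, score in detections if score >= box_thresh]


   detections = [
       ([(0, 0), (10, 0), (10, 5), (0, 5)], 0.92),
       ([(0, 8), (10, 8), (10, 12), (0, 12)], 0.12),
       ([(0, 15), (10, 15), (10, 20), (0, 20)], 0.30),
   ]
   kept = filter_boxes(detections, box_thresh=0.3)
   print(len(kept))  # the 0.12 box falls below the threshold and is dropped

Raising the threshold in this sketch keeps fewer, more confident boxes; lowering it keeps more boxes at the cost of more false detections.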


Diverse Input Support
---------------------

The OCR script in PDF-Extract-Kit accepts several kinds of input: a single image or PDF file, or a directory of image and PDF files.
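
One way to picture this input handling is the sketch below, which resolves an ``inputs`` value to a list of files. The ``collect_inputs`` helper and the ``SUPPORTED_SUFFIXES`` set are hypothetical names introduced for illustration; the kit's actual loader and its list of accepted extensions may differ.

.. code:: python

   from pathlib import Path
   from typing import List

   # Assumed set of extensions for this example only.
   SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".pdf"}


   def collect_inputs(inputs: str) -> List[Path]:
       """Return the files to process for a single-file or directory input."""
       path = Path(inputs)
       if path.is_file():
           # A single image or PDF: keep it only if the extension is supported.
           return [path] if path.suffix.lower() in SUPPORTED_SUFFIXES else []
       if path.is_dir():
           # A directory: collect supported files directly inside it, sorted.
           return sorted(
               p for p in path.iterdir()
               if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
           )
       raise FileNotFoundError(f"Input path does not exist: {inputs}")

Under these assumptions, calling ``collect_inputs("assets/demo/ocr")`` would return the image and PDF files directly inside that directory, while passing a path to one file would return a one-element list.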


Viewing Visualization Results
-----------------------------

When the ``visualize`` option in the config file is set to ``True``, visualization results will be saved in the ``outputs`` directory.

.. note::

   Visualization makes model results easier to analyze. For large-scale tasks, however, it is recommended to disable it (set ``visualize`` to ``False``) to reduce memory and disk usage.