.. _algorithm_layout_detection:

==========================
Layout Detection Algorithm
==========================

Introduction
=================

Layout detection is a fundamental task in document content extraction, aiming to locate different types of regions on a page, such as images, tables, text, and headings, to facilitate high-quality content extraction. For text and heading regions, OCR models can be used for text recognition, while table regions can be converted using table recognition models.

Model Usage
=================

Layout detection supports the following models:

.. list-table::
   :header-rows: 1

   * - Model
     - Description
     - Characteristics
     - Model weight
     - Config file
   * - DocLayout-YOLO
     - Improved version of YOLO-v10: (1) diverse pre-training data is generated to improve generalization across multiple document types; (2) the model architecture is improved to better perceive scale-varying instances. See DocLayout-YOLO for details.
     - Speed: Fast, Accuracy: High
     - doclayout_yolo_ft.pt
     - layout_detection.yaml
   * - YOLO-v10
     - Base YOLO-v10 model
     - Speed: Fast, Accuracy: Moderate
     - yolov10l_ft.pt
     - layout_detection_yolo.yaml
   * - LayoutLMv3
     - Base LayoutLMv3 model
     - Speed: Slow, Accuracy: High
     - layoutlmv3_ft
     - layout_detection_layoutlmv3.yaml

Once the environment is set up, you can perform layout detection by running ``scripts/layout_detection.py`` directly.

**Run demo**

.. code:: shell

   $ python scripts/layout_detection.py --config configs/layout_detection.yaml


Model Configuration
-------------------

**1. DocLayout-YOLO / YOLO-v10**

.. code:: yaml

   inputs: assets/demo/layout_detection
   outputs: outputs/layout_detection
   tasks:
     layout_detection:
       model: layout_detection_yolo
       model_config:
         img_size: 1024
         conf_thres: 0.25
         iou_thres: 0.45
         model_path: path/to/doclayout_yolo_model
         visualize: True


- ``inputs`` / ``outputs``: the input file path and the directory where visualization output is written.
- ``tasks``: the task type; currently only the layout detection task is included.
- ``model``: the model type, e.g. ``layout_detection_yolo``.
- ``model_config``: the model configuration.
- ``img_size``: the size of the image's long edge; the short edge is scaled proportionally. The default long edge is 1024.
- ``conf_thres``: the confidence threshold; only targets scoring above it are detected.
- ``iou_thres``: the IoU threshold; targets whose overlap exceeds it are removed.
- ``model_path``: path to the model weights.
- ``visualize``: whether to visualize the model results; visualized results are saved in the ``outputs`` directory.

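
To make the three numeric settings concrete, here is a minimal, generic sketch of how a long-edge size, a confidence threshold, and an IoU threshold are commonly applied in object detection. It is not the code used by PDF-Extract-Kit or by the YOLO models; the function names ``scaled_size``, ``iou``, and ``filter_detections`` are hypothetical, class labels are ignored, and the overlap step is a plain greedy suppression chosen for illustration.

```python
# Illustrative only: not PDF-Extract-Kit's or YOLO's actual implementation.
# All function names here are hypothetical.


def scaled_size(width, height, img_size=1024):
    """Scale an image so its long edge equals img_size, keeping the aspect ratio."""
    scale = img_size / max(width, height)
    return round(width * scale), round(height * scale)


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, conf_thres=0.25, iou_thres=0.45):
    """Filter a list of (box, score) pairs by confidence, then by overlap."""
    # Step 1: keep only detections scoring above the confidence threshold.
    candidates = [d for d in detections if d[1] > conf_thres]
    # Step 2: visit detections from highest to lowest score and drop any box
    # that overlaps an already kept box by more than iou_thres.
    candidates.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_thres for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

For example, ``scaled_size(2000, 1000)`` gives ``(1024, 512)``. With the default thresholds, a box scoring 0.1 is discarded in the first step, and a box scoring 0.8 that overlaps a 0.9 box with an IoU of about 0.68 is discarded in the second.
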

**2. LayoutLMv3**

.. note::

   LayoutLMv3 cannot run directly by default. Follow the steps below to enable it:

   1. **Detectron2 Environment Setup**

      The prebuilt wheels below are tagged ``cp310``, i.e. built for CPython 3.10. Install the one that matches your platform:

      .. code-block:: bash

         # For Linux
         pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-linux_x86_64.whl

         # For macOS
         pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-macosx_10_9_universal2.whl

         # For Windows
         pip install https://wheels-1251341229.cos.ap-shanghai.myqcloud.com/assets/whl/detectron2/detectron2-0.6-cp310-cp310-win_amd64.whl

   2. **Enable LayoutLMv3 Registration Code**

      Uncomment the two LayoutLMv3 lines in the registration code:

      - line 2
      - line 8

      After uncommenting, the registration code should look like this:

      .. code-block:: python

         from pdf_extract_kit.tasks.layout_detection.models.yolo import LayoutDetectionYOLO
         # Uncommented: import of the LayoutLMv3 model class
         from pdf_extract_kit.tasks.layout_detection.models.layoutlmv3 import LayoutDetectionLayoutlmv3
         from pdf_extract_kit.registry.registry import MODEL_REGISTRY


         __all__ = [
             "LayoutDetectionYOLO",
             # Uncommented: export of the LayoutLMv3 model class
             "LayoutDetectionLayoutlmv3",
         ]


.. code:: yaml

   inputs: assets/demo/layout_detection
   outputs: outputs/layout_detection
   tasks:
     layout_detection:
       model: layout_detection_layoutlmv3
       model_config:
         model_path: path/to/layoutlmv3_model


- ``inputs`` / ``outputs``: the input file path and the directory where visualization output is written.
- ``tasks``: the task type; currently only the layout detection task is included.
- ``model``: the model type, e.g. ``layout_detection_layoutlmv3``.
- ``model_config``: the model configuration.
- ``model_path``: path to the model weights.

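
Before switching ``model`` to ``layout_detection_layoutlmv3``, it can help to confirm that Detectron2 is importable in the active environment. The following is a minimal sketch using only the standard library; ``module_available`` and ``check_layoutlmv3_prerequisites`` are hypothetical helper names, not part of PDF-Extract-Kit.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


def check_layoutlmv3_prerequisites(module_name="detectron2"):
    """Hypothetical helper: return a list of human-readable problems (empty if none)."""
    problems = []
    if not module_available(module_name):
        version = "{}.{}".format(*sys.version_info[:2])
        problems.append(
            f"{module_name} is not importable (running Python {version}; "
            "the prebuilt wheels listed in this section are tagged cp310)"
        )
    return problems


if __name__ == "__main__":
    issues = check_layoutlmv3_prerequisites()
    if issues:
        for issue in issues:
            print("Problem:", issue)
    else:
        print("detectron2 found; LayoutLMv3 prerequisites look satisfied.")
```

This only checks that the package can be located. It does not verify that the registration lines have been uncommented or that the weights at ``model_path`` exist.
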

Diverse Input Support
---------------------

The layout detection script in PDF-Extract-Kit supports input formats such as a ``single image``, a ``directory containing only image files``, a ``single PDF file``, and a ``directory containing only PDF files``.

.. note::

   Set ``inputs`` in ``configs/layout_detection.yaml`` according to your actual data format:

   - Single image: path/to/image
   - Image directory: path/to/images
   - Single PDF file: path/to/pdf
   - PDF directory: path/to/pdfs

.. note::

   When using PDF as input, you need to change ``predict_images`` to ``predict_pdfs`` in ``layout_detection.py``.

   .. code:: python

      # for image detection
      detection_results = model_layout_detection.predict_images(input_data, result_path)

   Change to:

   .. code:: python

      # for pdf detection
      detection_results = model_layout_detection.predict_pdfs(input_data, result_path)


Viewing Visualization Results
-----------------------------

When ``visualize`` is set to ``True`` in the config file, the visualization results will be saved in the ``outputs`` directory.

.. note::

   Visualization is helpful for analyzing model results, but for large-scale tasks, it is recommended to turn off visualization (set ``visualize`` to ``False``) to reduce memory and disk usage.