==================================
Code Implementation
==================================

The core code of the PDF-Extract-Kit project is implemented in the `pdf_extract_kit` directory, which contains the following modules:

- configs: Configuration files for specific modules, such as `pdf_extract_kit/configs/unimernet.yaml`. If a module's configuration is simple, it is recommended to define it instead under `model_config` in the `yaml` files in `repo_root/configs`, where users can modify it more easily.

- dataset: A custom `ImageDataset` class used for loading and preprocessing image data. It supports various input types and can perform unified preprocessing operations on images (such as resizing, converting to tensors, etc.) to accelerate subsequent model inference.

- evaluation: A module for evaluating model results, supporting evaluations for various task types such as `layout detection`, `formula detection`, `formula recognition`, etc., allowing users to fairly compare different tasks and models.

- registry: A generic `Registry` class that provides functions for registering, retrieving, and listing items. Users can use it to create different kinds of registries, such as task registries and model registries.

- tasks: The core task module contains many different types of tasks, such as `layout detection`, `formula detection`, `formula recognition`, etc. Users typically only need to add code here to add new tasks and models.

.. note::
    Thanks to this modular design, users generally only need to implement a new task class and its corresponding model under `tasks`, and then register them in `registry`. In most cases the task is already defined, so only the model needs to be implemented.
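
To make the registration mechanism concrete, here is a minimal sketch of a generic registry offering the three operations described above (register, retrieve, list). It is an illustration only, not the project's actual `Registry` implementation: the method names `get` and `list_items`, the class `DemoModel`, and the key `demo_model` are assumptions.

```python
class Registry:
    """Minimal name -> object registry (illustrative sketch)."""

    def __init__(self, name):
        self.name = name
        self._items = {}

    def register(self, key):
        """Return a decorator that stores the decorated object under `key`."""
        def decorator(obj):
            if key in self._items:
                raise KeyError(f"'{key}' is already registered in {self.name}")
            self._items[key] = obj
            return obj
        return decorator

    def get(self, key):
        """Look up a registered object by name."""
        if key not in self._items:
            raise KeyError(f"'{key}' is not registered in {self.name}")
        return self._items[key]

    def list_items(self):
        """Return the registered names in sorted order."""
        return sorted(self._items)


# Hypothetical registry and model, for demonstration only
MODEL_REGISTRY = Registry("models")


@MODEL_REGISTRY.register("demo_model")
class DemoModel:
    pass
```

Because the decorator returns the class unchanged, registration has no effect on how the class is used elsewhere; it only makes the class discoverable by name.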

Below, we use a YOLO-based `layout detection` model as an example to show how to add new tasks and models.

Task Definition and Registration
================================



First, we add a `layout_detection` directory under `tasks`, and then add a `task.py` file in that directory to define the layout detection task class, as follows:

.. code-block:: python

    from pdf_extract_kit.registry.registry import TASK_REGISTRY
    from pdf_extract_kit.tasks.base_task import BaseTask

    @TASK_REGISTRY.register("layout_detection")
    class LayoutDetectionTask(BaseTask):
        def __init__(self, model):
            super().__init__(model)

        def predict_images(self, input_data, result_path):
            """
            Predict layouts in images.

            Args:
                input_data (str): Path to a single image file or a directory containing image files.
                result_path (str): Path to save the prediction results.

            Returns:
                list: List of prediction results.
            """
            images = self.load_images(input_data)
            # Perform detection
            return self.model.predict(images, result_path)

        def predict_pdfs(self, input_data, result_path):
            """
            Predict layouts in PDF files.

            Args:
                input_data (str): Path to a single PDF file or a directory containing PDF files.
                result_path (str): Path to save the prediction results.

            Returns:
                list: List of prediction results.
            """
            pdf_images = self.load_pdf_images(input_data)
            # Perform detection
            return self.model.predict(list(pdf_images.values()), result_path, list(pdf_images.keys()))

As you can see, the task definition includes the following key points:

* Use the `@TASK_REGISTRY.register("layout_detection")` syntax to directly register the layout task class under `TASK_REGISTRY`.
* The `__init__` function takes `model` as an argument and passes it to the `BaseTask` parent class, so the inference functions can reach it as `self.model`.
* Implement inference functions. Considering that layout detection usually processes images and PDF files, two functions `predict_images` and `predict_pdfs` are provided for users to choose flexibly.
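
The division of labour between task and model can be sketched without the real dependencies. In the snippet below, `StubBaseTask`, `EchoModel`, and `DemoLayoutTask` are hypothetical stand-ins that are not part of PDF-Extract-Kit, and the behaviour of `load_images` is an assumption. The point is only that the task loads its inputs and forwards them to `self.model.predict`, so any object exposing a compatible `predict` method can be plugged in.

```python
class StubBaseTask:
    """Hypothetical stand-in for BaseTask: it only stores the model."""

    def __init__(self, model):
        self.model = model

    def load_images(self, input_data):
        # Assumption: the real loader expands a file or directory into a list.
        # Here a single path is simply wrapped in a list.
        return [input_data] if isinstance(input_data, str) else list(input_data)


class EchoModel:
    """Hypothetical model whose predict() returns one record per image."""

    def predict(self, images, result_path, image_ids=None):
        return [{"image": img, "saved_to": result_path} for img in images]


class DemoLayoutTask(StubBaseTask):
    def predict_images(self, input_data, result_path):
        images = self.load_images(input_data)
        # The task itself does no detection; it delegates to the model
        return self.model.predict(images, result_path)


task = DemoLayoutTask(EchoModel())
results = task.predict_images("page_1.png", "outputs/demo")
```

Keeping the task this thin is what allows several models to be registered for the same task without changing the task class.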

Model Definition and Registration
=================================



Next, we implement the specific model by creating a `models` directory under the `layout_detection` task directory and adding a `yolo.py` file that defines the YOLO model, as follows:

.. code-block:: python

    import os
    import cv2
    import torch
    from ultralytics import YOLO
    from pdf_extract_kit.registry import MODEL_REGISTRY
    from pdf_extract_kit.utils.visualization import visualize_bbox

    @MODEL_REGISTRY.register('layout_detection_yolo')
    class LayoutDetectionYOLO:
        def __init__(self, config):
            """
            Initialize the LayoutDetectionYOLO class.

            Args:
                config (dict): Configuration dictionary containing model parameters.
            """
            # Mapping from class IDs to class names
            self.id_to_names = {
                0: 'title', 
                1: 'plain text',
                2: 'abandon', 
                3: 'figure', 
                4: 'figure_caption', 
                5: 'table', 
                6: 'table_caption', 
                7: 'table_footnote', 
                8: 'isolate_formula', 
                9: 'formula_caption'
            }

            # Load the YOLO model from the specified path
            self.model = YOLO(config['model_path'])

            # Set model parameters
            self.img_size = config.get('img_size', 1280)
            self.pdf_dpi = config.get('pdf_dpi', 200)
            self.conf_thres = config.get('conf_thres', 0.25)
            self.iou_thres = config.get('iou_thres', 0.45)
            self.visualize = config.get('visualize', False)
            self.device = config.get('device', 'cuda' if torch.cuda.is_available() else 'cpu')
            self.batch_size = config.get('batch_size', 1)

        def predict(self, images, result_path, image_ids=None):
            """
            Predict layouts in images.

            Args:
                images (list): List of images to be predicted.
                result_path (str): Path to save the prediction results.
                image_ids (list, optional): List of image IDs corresponding to the images.

            Returns:
                list: List of prediction results.
            """
            results = []
            for idx, image in enumerate(images):
                result = self.model.predict(image, imgsz=self.img_size, conf=self.conf_thres, iou=self.iou_thres, verbose=False)[0]
                if self.visualize:
                    # Create the output directory if it does not exist yet
                    os.makedirs(result_path, exist_ok=True)
                    boxes = result.boxes.xyxy
                    classes = result.boxes.cls
                    vis_result = visualize_bbox(image, boxes, classes, self.id_to_names)

                    # Determine the base name of the image
                    if image_ids:
                        base_name = image_ids[idx]
                    else:
                        base_name = os.path.basename(image)
                    
                    result_name = f"{base_name}_MFD.png"
                    
                    # Save the visualized result                
                    cv2.imwrite(os.path.join(result_path, result_name), vis_result)
                results.append(result)
            return results

As you can see, the model definition includes the following key points:

* Use the `@MODEL_REGISTRY.register('layout_detection_yolo')` syntax to directly register the YOLO layout model under `MODEL_REGISTRY`.
* The initialization function needs to implement:
    + The `id_to_names` category mapping for visualization.
    + Model parameter configuration.
    + Model initialization.
* The inference function `predict` needs to handle different input types: it accepts a list of image paths or of `PIL.Image` objects, so users can run inference directly on image files or on in-memory images.
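
One detail is worth spelling out: in the listing above, the output file name falls back to `os.path.basename(image)`, which only works when `image` is a path. A minimal sketch of a naming helper that also copes with in-memory images follows. The function `resolve_base_name` and the `image_{idx}` fallback are hypothetical and not part of PDF-Extract-Kit; unlike the listing, this sketch also strips the file extension.

```python
import os


def resolve_base_name(image, idx, image_ids=None):
    """Pick a file-name stem for the visualization of `image`."""
    if image_ids:
        # Explicit IDs (for example PDF page identifiers) take precedence
        return str(image_ids[idx])
    if isinstance(image, (str, os.PathLike)):
        # Strip the directory and the extension from a path-like input
        return os.path.splitext(os.path.basename(os.fspath(image)))[0]
    # In-memory image without an ID: fall back to its position in the batch
    return f"image_{idx}"
```

With such a helper, `predict` could build `result_name` the same way for paths, for PDF pages identified through `image_ids`, and for image objects.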

After implementing the above class definition, add `LayoutDetectionYOLO` to the `__all__` in `__init__.py` under the `layout_detection` task.

.. code-block:: python

    from pdf_extract_kit.tasks.layout_detection.models.yolo import LayoutDetectionYOLO
    from pdf_extract_kit.registry.registry import MODEL_REGISTRY

    __all__ = [
        "LayoutDetectionYOLO",
    ]

.. note:: 
    For the same task, we support multiple models. Users can choose which one to use based on evaluation results, considering model `accuracy`, `speed`, and `scenario adaptability`.

After implementing the task and the model, you can add a script `layout_detection.py` under `repo_root/scripts` to run them.

Example Script
==============

.. code-block:: python

    import os
    import sys
    import os.path as osp
    import argparse

    sys.path.append(osp.join(os.path.dirname(os.path.abspath(__file__)), '..'))
    from pdf_extract_kit.utils.config_loader import load_config, initialize_tasks_and_models
    import pdf_extract_kit.tasks  # Ensure all task modules are imported

    TASK_NAME = 'layout_detection'

    def parse_args():
        parser = argparse.ArgumentParser(description="Run a task with a given configuration file.")
        parser.add_argument('--config', type=str, required=True, help='Path to the configuration file.')
        return parser.parse_args()

    def main(config_path):
        config = load_config(config_path)
        task_instances = initialize_tasks_and_models(config)

        # get input and output path from config
        input_data = config.get('inputs', None)
        result_path = config.get('outputs', osp.join('outputs', TASK_NAME))

        # layout_detection_task
        model_layout_detection = task_instances[TASK_NAME]

        # for image detection
        detection_results = model_layout_detection.predict_images(input_data, result_path)

        # for pdf detection
        # detection_results = model_layout_detection.predict_pdfs(input_data, result_path)

        # print(detection_results)
        print(f'The predicted results can be found at {result_path}')

    if __name__ == "__main__":
        args = parse_args()
        main(args.config)
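
How a configuration file might drive this script can be sketched as follows. This is a standard-library-only illustration under stated assumptions: the `tasks` / `model` layout of the config, the helper `build_tasks`, and the names `DemoLayoutModel`, `DemoLayoutTask`, and `demo_layout_model` are hypothetical, the paths are placeholders, and the real `initialize_tasks_and_models` may work differently. Only the `inputs`, `outputs`, and `model_config` keys are taken from the text above.

```python
TASKS = {}
MODELS = {}


def register(table, name):
    """Decorator that records a class in `table` under `name`."""
    def decorator(cls):
        table[name] = cls
        return cls
    return decorator


@register(MODELS, "demo_layout_model")
class DemoLayoutModel:
    def __init__(self, config):
        self.img_size = config.get("img_size", 1280)


@register(TASKS, "layout_detection")
class DemoLayoutTask:
    def __init__(self, model):
        self.model = model


def build_tasks(config):
    """Instantiate each configured task together with its model."""
    instances = {}
    for task_name, task_cfg in config.get("tasks", {}).items():
        model_cls = MODELS[task_cfg["model"]]
        model = model_cls(task_cfg.get("model_config", {}))
        instances[task_name] = TASKS[task_name](model)
    return instances


# Assumed config layout, written as the dict a YAML loader would return
config = {
    "inputs": "path/to/images",  # placeholder
    "outputs": "outputs/layout_detection",
    "tasks": {
        "layout_detection": {
            "model": "demo_layout_model",
            "model_config": {"img_size": 1024},
        }
    },
}

task_instances = build_tasks(config)
```

In this arrangement, switching to another registered model for the same task would only require changing the `model` entry in the config, not the script.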

Support Type Extension
======================

Batch Processing Extension
==========================