Spaces:
Build error
Build error
================================== | |
Code Implementation | |
================================== | |
The core code of the PDF-Extract-Kit project is implemented in the `pdf_extract_kit` directory, which contains the following modules: | |
- configs: Configuration files for specific modules, such as `pdf_extract_kit/configs/unimernet.yaml`. If the configuration is simple, it is recommended to define it in the `yaml` file's `model_config` in `repo_root/configs` for easier user modification. | |
- dataset: A custom `ImageDataset` class used for loading and preprocessing image data. It supports various input types and can perform unified preprocessing operations on images (such as resizing, converting to tensors, etc.) to accelerate subsequent model inference. | |
- evaluation: A module for evaluating model results, supporting evaluations for various task types such as `layout detection`, `formula detection`, `formula recognition`, etc., allowing users to fairly compare different tasks and models. | |
- registry: The `Registry` class is a generic registry class that provides functions for registering, retrieving, and listing registered items. Users can use this class to create different types of registries, such as task registries, model registries, etc. | |
- tasks: The core task module contains many different types of tasks, such as `layout detection`, `formula detection`, `formula recognition`, etc. Users typically only need to add code here to add new tasks and models. | |
.. note:: | |
Based on the above modular design, users generally only need to implement their new task class and corresponding model in `tasks` to extend new modules (in most cases, only the corresponding model needs to be implemented, as the task is already defined), and then register it in `registry`. | |
Below we take adding a YOLO-based `layout detection` model as an example to introduce how to add new tasks and models. | |
Task Definition and Registration | |
============== | |
First, we add a `layout_detection` directory under `tasks`, and then add a `task.py` file in that directory to define the layout detection task class, as follows: | |
.. code-block:: python | |
from pdf_extract_kit.registry.registry import TASK_REGISTRY | |
from pdf_extract_kit.tasks.base_task import BaseTask | |
@TASK_REGISTRY.register("layout_detection") | |
class LayoutDetectionTask(BaseTask): | |
def __init__(self, model): | |
super().__init__(model) | |
def predict_images(self, input_data, result_path): | |
""" | |
Predict layouts in images. | |
Args: | |
input_data (str): Path to a single image file or a directory containing image files. | |
result_path (str): Path to save the prediction results. | |
Returns: | |
list: List of prediction results. | |
""" | |
images = self.load_images(input_data) | |
# Perform detection | |
return self.model.predict(images, result_path) | |
def predict_pdfs(self, input_data, result_path): | |
""" | |
Predict layouts in PDF files. | |
Args: | |
input_data (str): Path to a single PDF file or a directory containing PDF files. | |
result_path (str): Path to save the prediction results. | |
Returns: | |
list: List of prediction results. | |
""" | |
pdf_images = self.load_pdf_images(input_data) | |
# Perform detection | |
return self.model.predict(list(pdf_images.values()), result_path, list(pdf_images.keys())) | |
As you can see, the task definition includes the following key points: | |
* Use the `@TASK_REGISTRY.register("layout_detection")` syntax to directly register the layout task class under `TASK_REGISTRY`. | |
* The `__init__` initialization function takes `model` as an argument, specifically referring to the `BaseTask` class. | |
* Implement inference functions. Considering that layout detection usually processes images and PDF files, two functions `predict_images` and `predict_pdfs` are provided for users to choose flexibly. | |
Model Definition and Registration | |
============== | |
Next, we implement the specific model by creating a `models` directory under `task` and adding `yolo.py` for YOLO model definition, as follows: | |
.. code-block:: python | |
import os | |
import cv2 | |
import torch | |
from torch.utils.data import DataLoader, Dataset | |
from ultralytics import YOLO | |
from pdf_extract_kit.registry import MODEL_REGISTRY | |
from pdf_extract_kit.utils.visualization import visualize_bbox | |
from pdf_extract_kit.dataset.dataset import ImageDataset | |
import torchvision.transforms as transforms | |
@MODEL_REGISTRY.register('layout_detection_yolo') | |
class LayoutDetectionYOLO: | |
def __init__(self, config): | |
""" | |
Initialize the LayoutDetectionYOLO class. | |
Args: | |
config (dict): Configuration dictionary containing model parameters. | |
""" | |
# Mapping from class IDs to class names | |
self.id_to_names = { | |
0: 'title', | |
1: 'plain text', | |
2: 'abandon', | |
3: 'figure', | |
4: 'figure_caption', | |
5: 'table', | |
6: 'table_caption', | |
7: 'table_footnote', | |
8: 'isolate_formula', | |
9: 'formula_caption' | |
} | |
# Load the YOLO model from the specified path | |
self.model = YOLO(config['model_path']) | |
# Set model parameters | |
self.img_size = config.get('img_size', 1280) | |
self.pdf_dpi = config.get('pdf_dpi', 200) | |
self.conf_thres = config.get('conf_thres', 0.25) | |
self.iou_thres = config.get('iou_thres', 0.45) | |
self.visualize = config.get('visualize', False) | |
self.device = config.get('device', 'cuda' if torch.cuda.is_available() else 'cpu') | |
self.batch_size = config.get('batch_size', 1) | |
def predict(self, images, result_path, image_ids=None): | |
""" | |
Predict layouts in images. | |
Args: | |
images (list): List of images to be predicted. | |
result_path (str): Path to save the prediction results. | |
image_ids (list, optional): List of image IDs corresponding to the images. | |
Returns: | |
list: List of prediction results. | |
""" | |
results = [] | |
for idx, image in enumerate(images): | |
result = self.model.predict(image, imgsz=self.img_size, conf=self.conf_thres, iou=self.iou_thres, verbose=False)[0] | |
if self.visualize: | |
if not os.path.exists(result_path): | |
os.makedirs(result_path) | |
boxes = result.__dict__['boxes'].xyxy | |
classes = result.__dict__['boxes'].cls | |
vis_result = visualize_bbox(image, boxes, classes, self.id_to_names) | |
# Determine the base name of the image | |
if image_ids: | |
base_name = image_ids[idx] | |
else: | |
base_name = os.path.basename(image) | |
result_name = f"{base_name}_MFD.png" | |
# Save the visualized result | |
cv2.imwrite(os.path.join(result_path, result_name), vis_result) | |
results.append(result) | |
return results | |
As you can see, the model definition includes the following key points: | |
* Use the `@MODEL_REGISTRY.register('layout_detection_yolo')` syntax to directly register the YOLO layout model under `MODEL_REGISTRY`. | |
* The initialization function needs to implement: | |
+ The `id_to_names` category mapping for visualization. | |
+ Model parameter configuration. | |
+ Model initialization. | |
* The model inference function needs to implement various types of model inference: it supports image lists and `PIL.Image` class, allowing users to perform inference directly based on image paths or image streams. | |
After implementing the above class definition, add `LayoutDetectionYOLO` to the `__all__` in `__init__.py` under the `layout_detection` task. | |
.. code-block:: python | |
from pdf_extract_kit.tasks.layout_detection.models.yolo import LayoutDetectionYOLO | |
from pdf_extract_kit.registry.registry import MODEL_REGISTRY | |
__all__ = [ | |
"LayoutDetectionYOLO", | |
] | |
.. note:: | |
For the same task, we support multiple models. Users can choose which one to use based on evaluation results, considering model `accuracy`, `speed`, and `scenario adaptability`. | |
After implementing the tasks and models, you can add a script program `layout_detection.py` under `repo_root/scripts`. | |
Example Script | |
============== | |
.. code-block:: python | |
import os | |
import sys | |
import os.path as osp | |
import argparse | |
sys.path.append(osp.join(os.path.dirname(os.path.abspath(__file__)), '..')) | |
from pdf_extract_kit.utils.config_loader import load_config, initialize_tasks_and_models | |
import pdf_extract_kit.tasks # Ensure all task modules are imported | |
TASK_NAME = 'layout_detection' | |
def parse_args(): | |
parser = argparse.ArgumentParser(description="Run a task with a given configuration file.") | |
parser.add_argument('--config', type=str, required=True, help='Path to the configuration file.') | |
return parser.parse_args() | |
def main(config_path): | |
config = load_config(config_path) | |
task_instances = initialize_tasks_and_models(config) | |
# get input and output path from config | |
input_data = config.get('inputs', None) | |
result_path = config.get('outputs', 'outputs'+'/'+TASK_NAME) | |
# layout_detection_task | |
model_layout_detection = task_instances[TASK_NAME] | |
# for image detection | |
detection_results = model_layout_detection.predict_images(input_data, result_path) | |
# for pdf detection | |
# detection_results = model_layout_detection.predict_pdfs(input_data, result_path) | |
# print(detection_results) | |
print(f'The predicted results can be found at {result_path}') | |
if __name__ == "__main__": | |
args = parse_args() | |
main(args.config) | |
Support Type Extension | |
============== | |
Batch Processing Extension | |
============== |