Commit 93b4c8a
Parent(s): 8953ca0

Improved efficiency of review page navigation, especially for large documents. Updated user guide.
Files changed:
- .dockerignore +3 -1
- .gitignore +3 -1
- README.md +224 -56
- app.py +6 -6
- tools/config.py +2 -2
- tools/file_conversion.py +605 -285
- tools/file_redaction.py +5 -30
- tools/redaction_review.py +518 -249
.dockerignore
CHANGED
@@ -17,4 +17,6 @@ dist/*
 build_deps/*
 logs/*
 config/*
-user_guide/*
+user_guide/*
+cdk/*
+web/*
.gitignore
CHANGED
@@ -18,4 +18,6 @@ build_deps/*
 logs/*
 config/*
 doc_redaction_amplify_app/*
-user_guide/*
+user_guide/*
+cdk/*
+web/*
README.md
CHANGED
NOTE: The app is not 100% accurate, and it will miss some personal information.

# USER GUIDE

## Experiment with the test (public) version of the app

You can test out many of the features described in this user guide at the [public test version of the app](https://huggingface.co/spaces/seanpedrickcase/document_redaction), which is free. AWS functions (e.g. Textract, Comprehend) are not enabled (unless you have valid API keys).

## Chat over this user guide

You can now [speak with a chat bot about this user guide](https://huggingface.co/spaces/seanpedrickcase/Light-PDF-Web-QA-Chatbot) (beta!).

## Table of contents

- [Example data files](#example-data-files)
...
- [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)

See the [advanced user guide here](#advanced-user-guide):
- [Merging redaction review files](#merging-redaction-review-files)
- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
- [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
- [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
- [Using the AWS Textract document API](#using-the-aws-textract-document-api)
- [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

## Example data files

Please try these example files to follow along with this guide:
- [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
- [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
- [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
- [Dummy case note data](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv)
## Basic redaction

The document redaction app can detect personally-identifiable information (PII) in documents. Documents can be redacted directly, or suggested redactions can be reviewed and modified using a graphical user interface. Basic document redaction can be performed quickly using the default options.

Download the example PDFs above to your computer. Open up the redaction app with the link provided by email.



### Upload files to the app

The 'Redact PDFs/images' tab currently accepts PDFs and image files (JPG, PNG) for redaction. Click on the 'Drop files here or Click to Upload' area of the screen, and select one of the three different [example files](#example-data-files) (they should all be stored in the same folder if you want them to be redacted at the same time).

### Text extraction

First, select one of the three text extraction options:
- **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
- **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
- **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.

### Optional - select signature extraction

If you chose the AWS Textract service above, you can choose whether you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.



### PII redaction method

If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
- **'Only extract text - (no redaction)'** - Choose this if you are only interested in getting the text out of the document for further processing (e.g. to find duplicate pages, or to review text on the Review redactions page).
- **'Local'** - This uses the spacy package to rapidly detect PII in extracted text. This method is often sufficient if you are just interested in redacting specific terms defined in a custom list.
- **'AWS Comprehend'** - This method calls an AWS service to provide more accurate identification of PII in extracted text.
### Optional - costs and time estimation

If the option is enabled (by your system admin, in the config file), you will see a cost and time estimate for the redaction process. 'Existing Textract output file found' will be checked automatically if previous Textract text extraction files exist in the output folder, or have been [previously uploaded by the user](#aws-textract-outputs) (saving time and money for redaction).



### Optional - cost code selection

If the option is enabled (by your system admin, in the config file), you may be prompted to select a cost code before continuing with the redaction task.



The relevant cost code can be found either by: 1. Using the search bar above the data table to find relevant cost codes, then clicking on the relevant row, or 2. Typing it directly into the dropdown to the right, where it should filter as you type.
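For admins: the cost and cost-code behaviour above is driven by environment variables read in tools/config.py (visible in the diff further down this commit page). Below is a minimal sketch of how you might enable it before launching the app; the variable names come from tools/config.py, but the values and the cost-codes CSV layout for your deployment are assumptions to adapt.

```python
# Example environment setup (sketch) - set these before starting app.py.
# Variable names match tools/config.py in this commit; values are illustrative only.
import os

os.environ["SHOW_COSTS"] = "True"             # show the cost/time estimate panel
os.environ["GET_COST_CODES"] = "True"         # prompt the user to pick a cost code
os.environ["COST_CODES_PATH"] = "config/COST_CENTRES.csv"  # local CSV of cost codes (assumed location)
os.environ["ENFORCE_COST_CODES"] = "True"     # make choosing a cost code compulsory
os.environ["DEFAULT_COST_CODE"] = ""          # optionally pre-select a code
```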
### Optional - Submit whole documents to Textract API

If this option is enabled (by your system admin, in the config file), you will have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than using the 'Redact document' process described here). This feature is described in more detail in the [advanced user guide](#using-the-aws-textract-document-api).



### Redact the document
Click 'Redact document'. After loading in the document, the app should be able to process about 30 pages per minute (depending on the redaction methods chosen above). When ready, you should see a message saying that processing is complete, with output files appearing in the bottom right.

### Redaction outputs



- **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
- **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
- **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
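If you prefer to search the '...ocr_results.csv' output programmatically rather than in Excel, a minimal pandas sketch is below. The 'text' and 'page' column names are assumptions about the file layout, so check the header row of your own output first.

```python
# Sketch: search an '...ocr_results.csv' output for a term of interest.
# Column names ('page', 'text') are assumed; inspect your own file's header first.
import pandas as pd

ocr_df = pd.read_csv("example_document_ocr_results.csv")  # hypothetical filename

matches = ocr_df[ocr_df["text"].str.contains("invoice", case=False, na=False)]
print(matches[["page", "text"]])
```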
### Additional AWS Textract outputs

If you have used the AWS Textract option for extracting text, you may also see a '..._textract.json' file. This file contains all the relevant extracted text information that comes from the AWS Textract service. You can keep this file and upload it at a later date alongside your input document, which will enable you to skip calling AWS Textract every single time you want to do a redaction task, as follows:



### Downloading output files from previous redaction tasks

If you are logged in via AWS Cognito and you lose your app page for some reason (e.g. from a crash, reloading), it is possible to recover your previous output files, provided the server has not been shut down since you redacted the document. Go to 'Redaction settings', then scroll to the bottom to see 'View all output files from this session'.



### Basic redaction summary

We have covered redacting documents with the default redaction options. The '...redacted.pdf' file output may be enough for your purposes. But it is very likely that you will need to customise your redaction options, which we will cover below.

...
Using the above approaches to allow, deny, and full page redaction lists will give you an output [like this](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/Partnership-Agreement-Toolkit_0_0_redacted.pdf).

#### Adding to the loaded allow, deny, and whole page lists in-app

If you open the accordion below the allow list options called 'Manually modify custom allow...', you should be able to see a few tables with options to add new rows:



If a table is empty, you can add a new row by clicking on the '+' item below the table header. If there is existing data, you may need to click on the three dots to the right and select 'Add row below'. Type the item you wish to keep/remove in the cell, and then (important) press Enter to add this new item to the allow/deny/whole page list. Your output tables should look something like below.


### Redacting additional types of personal information

You may want to redact additional types of information beyond the defaults, or you may not be interested in the default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates as well?

...

## Handwriting and signature redaction

The file [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf) is provided as an example document to test AWS Textract + redaction with a document that has signatures in it. If you have access to AWS Textract in the app, try removing all entity types from redaction on the Redaction settings by clicking the big X to the right of 'Entities to redact'.

To ensure that handwriting and signatures are enabled (they are enabled by default), on the front screen go to 'AWS Textract signature detection' to enable/disable the following options:



...
## Reviewing and modifying suggested redactions

Sometimes the app will suggest redactions that are incorrect, or will miss personal information entirely. The app allows you to review and modify suggested redactions to compensate for this. You can do this on the 'Review redactions' tab.

We will go through ways to review suggested redactions with an example. On the first tab, 'PDFs/images', upload the ['Example of files sent to a professor before applying.pdf'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf) file. Let's stick with the 'Local model - selectable text' option, and click 'Redact document'. Once the outputs are created, go to the 'Review redactions' tab.

On the 'Review redactions' tab you have a visual interface that allows you to inspect and modify redactions suggested by the app. There are quite a few options to look at, so we'll go from top to bottom.



### Uploading documents for review

The top area has a file upload box where you can upload the original, unredacted PDF alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the 'Review PDF...' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.

Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.



You can upload the three review files in the box (unredacted document, '..._review_file.csv' and '..._ocr_output.csv' file) before clicking 'Review PDF...', as in the image below:



**NOTE:** ensure you upload the ***unredacted*** document here and not the redacted version, otherwise you will be checking over a document that already has redaction boxes applied!
### Page navigation

You can change the page viewed either by clicking 'Previous page' or 'Next page', or by typing a specific page number in the 'Current page' box and pressing Enter on your keyboard. Each time you switch page, it will save redactions you have made on the page you are moving from, so you will not lose changes you have made.

You can also navigate to different pages by clicking on rows in the tables under 'Search suggested redactions' to the right, or 'search all extracted text' (if enabled) beneath that.

### The document viewer pane

On the selected page, each redaction is highlighted with a box next to its suggested redaction label (e.g. person, email).



There are a number of different options for adding and modifying redaction boxes and pages on the document viewer pane. To zoom in and out of the page, use your mouse wheel. To move around the page while zoomed, you need to be in modify mode. Scroll to the bottom of the document viewer to see the relevant controls. You should see a box icon, a hand icon, and two arrows pointing counter-clockwise and clockwise.



Click on the hand icon to go into modify mode. This will allow you to move around the page when zoomed in by clicking and holding on the document viewer. To rotate the page, you can click on either of the round arrow buttons to turn in that direction.

**NOTE:** When you switch page, the viewer will stay in your selected orientation, so if it looks strange, just rotate the page again and hopefully it will look correct!
#### Modify existing redactions (hand icon)

After clicking on the hand icon, the interface allows you to modify existing redaction boxes. When in this mode, you can click and hold on an existing box to move it.



Click on one of the small boxes at the edges to change the size of the box. To delete a box, click on it to highlight it, then press delete on your keyboard. Alternatively, double click on a box and click 'Remove' on the box that appears.



#### Add new redaction boxes (box icon)

To change to 'add redaction boxes' mode, scroll to the bottom of the page. Click on the box icon, and your cursor will change into a crosshair. Now you can add new redaction boxes where you wish. A popup will appear when you create a new box so you can select a label and colour for the new box.

#### 'Locking in' new redaction box format

It is possible to lock in a chosen format for new redaction boxes so that you don't have the popup appearing each time. When you make a new box, select the options for your 'locked' format, and then click on the lock icon on the left side of the popup, which should turn blue.



You can now add new redaction boxes without a popup appearing. If you want to change or 'unlock' your chosen box format, you can click on the new icon that has appeared at the bottom of the document viewer pane that looks a little like a gift tag. You can then change the defaults, or click on the lock icon again to 'unlock' the new box format - then popups will appear again each time you create a new box.


### Apply redactions to PDF and Save changes on current page

Once you have reviewed all the redactions in your document and you are happy with the outputs, you can click 'Apply revised redactions to PDF' to create a new '_redacted.pdf' output alongside a new '_review_file.csv' output.

If you are working on a page and haven't saved for a while, you can click 'Save changes on current page to file' to ensure that they are saved to an updated 'review_file.csv' output.



### Selecting and removing redaction boxes using the 'Search suggested redactions' table

The table shows a list of all the suggested redactions in the document alongside the page, label, and text (if available).



If you click on one of the rows in this table, you will be taken to the page of the redaction. Clicking on a redaction row on the same page *should* change the colour of the redaction box to blue to help you locate it in the document viewer (just in the app, not in redaction output PDFs).


You can choose a specific entity type to see which pages the entity is present on. If you want to go to the page specified in the table, you can click on a cell in the table and the review page will be changed to that page.

To filter the 'Search suggested redactions' table you can:
1. Click on one of the dropdowns (Redaction category, Page, Text), and select an option, or
2. Write text in the 'Filter' box just above the table. Click the blue box to apply the filter to the table.

Once you have filtered the table, you have a few options underneath on what you can do with the filtered rows:

- Click the 'Exclude specific row from redactions' button to remove only the redaction from the last row you clicked on from the document.
- Click the 'Exclude all items in table from redactions' button to remove all redactions visible in the table from the document.

**NOTE**: After excluding redactions using either of the above options, click the 'Reset filters' button below to ensure that the dropdowns and table return to showing all remaining redactions in the document.

If you made a mistake, click the 'Undo last element removal' button to restore the Search suggested redactions table to its previous state (this can only undo the last action).

### Navigating through the document using the 'Search all extracted text' table

The 'Search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).

You can navigate through the document using this table. When you click on a row, the document viewer pane to the left will change to the selected page.



You can search through the extracted text by using the search bar just above the table, which should filter as you type. To apply the filter and 'cut' the table, click on the blue tick inside the box next to your search term. To return the table to its original content, click the 'Reset OCR output table filter' button below the table.


# ADVANCED USER GUIDE

This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.

## Table of contents

- [Merging redaction review files](#merging-redaction-review-files)
- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
- [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
- [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
- [Using the AWS Textract document API](#using-the-aws-textract-document-api)
- [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

## Merging redaction review files

Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly, especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file; a rough pandas sketch of the same idea is shown below.
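The app performs this merge for you, but if you ever need to combine review files outside the app, something like the following would work. This is an editor's sketch of the idea, not the app's own merge code; the column names follow the review file format described in this guide, so check your own files before relying on them.

```python
# Sketch: combine several '..._review_file.csv' outputs for the same document into one.
# Not the app's internal merge logic - just an illustration of the idea.
import pandas as pd

review_files = [
    "Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv",
    "Partnership-Agreement-Toolkit_0_0.pdf_review_file_textract.csv",  # hypothetical second file
]

merged = pd.concat([pd.read_csv(f) for f in review_files], ignore_index=True)

# Drop exact duplicates (the same box suggested by more than one run), then sort for review
merged = merged.drop_duplicates(subset=["page", "xmin", "ymin", "xmax", "ymax", "label"])
merged = merged.sort_values(["page", "ymin", "xmin"])

merged.to_csv("Partnership-Agreement-Toolkit_0_0.pdf_review_file_merged.csv", index=False)
```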
...


## Using the AWS Textract document API

This option can be enabled by your system admin, in the config file (the 'SHOW_BULK_TEXTRACT_CALL_OPTIONS' environment variable, and subsequent variables). Using this, you will have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than using the 'Redact document' process described above).

### Starting a new Textract API job

To use this feature, first upload a document file in the file input box [in the usual way](#upload-files-to-the-app) on the first tab of the app. Under 'AWS Textract signature detection' you can select whether or not you would like to analyse signatures (with a [cost implication](#optional---select-signature-extraction)).

Then, open the section under the heading 'Submit whole document to AWS Textract API...'.



Click 'Analyse document with AWS Textract API call'. After a few seconds, the job should be submitted to the AWS Textract service. The box 'Job ID to check status' should now have an ID filled in. If it is not already filled with previous jobs (up to seven days old), the table should have a row added with details of the new API job.

Click the button underneath, 'Check status of Textract job and download', to see progress on the job. Processing will continue in the background until the job is ready, so it is worth periodically clicking this button to see if the outputs are ready. In testing, and as a rough estimate, it seems like this process takes about five seconds per page; however, this has not been tested with very large documents. Once ready, the '_textract.json' output should appear below.
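Under the hood, this workflow corresponds to AWS Textract's asynchronous document-analysis API. The sketch below shows that underlying API directly with boto3; it is not the app's own code, and the region, bucket and file names are placeholders.

```python
# Sketch: start an asynchronous Textract job and poll for completion with boto3.
# This illustrates the AWS API the feature is built on, not the app's implementation.
# Bucket/key/region are placeholders; signature detection uses the SIGNATURES feature type.
import time
import boto3

textract = boto3.client("textract", region_name="eu-west-2")  # example region

start = textract.start_document_analysis(
    DocumentLocation={"S3Object": {"Bucket": "my-input-bucket", "Name": "Partnership-Agreement-Toolkit_0_0.pdf"}},
    FeatureTypes=["SIGNATURES"],
)
job_id = start["JobId"]
print("Submitted job:", job_id)

# Poll until the job finishes (roughly what 'Check status of Textract job and download' does)
while True:
    status = textract.get_document_analysis(JobId=job_id, MaxResults=1)["JobStatus"]
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(10)

print("Job finished with status:", status)
```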
### Textract API job outputs

The '_textract.json' output can be used to speed up further redaction tasks as [described previously](#optional---costs-and-time-estimation); the 'Existing Textract output file found' flag should now be ticked.



You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the button 'Convert Textract job outputs to OCR results'. You can then use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
## Using AWS Textract and Comprehend when not running in an AWS environment

AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.

...

The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.

Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
## Modifying and merging redaction review files

You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).

As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.

### Modifying existing redaction review files
If you open up a 'review_file' csv output using a spreadsheet software program such as Microsoft Excel you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.



The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select Delete from the menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number and so we wanted to change the label. Simply click on the relevant label cell and change it, let's say to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.

How about if we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).

Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
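The same edits can be scripted if you prefer pandas to Excel. The sketch below assumes the review file has 'label', 'color', 'ymin' and 'ymax' columns as described above; the colour column name and the exact label strings are assumptions, so check your own file's header before running.

```python
# Sketch of the same edits done with pandas instead of Excel.
# Column names and label values are assumed from the review file format described above.
import pandas as pd

df = pd.read_csv("Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv")

# 1. Drop the spurious 'et' person row (here simply the first row)
df = df.iloc[1:].copy()

# 2. Relabel the 'phone number' entry (exact label text is an assumption)
df.loc[df["label"] == "PHONE_NUMBER", "label"] = "SECURITY_NUMBER"

# 3. Make the 'email address' box pure blue - an RGB triple on a 0-255 scale
df.loc[df["label"] == "EMAIL_ADDRESS", "color"] = "(0, 0, 255)"

# 4. Enlarge a slightly-too-small box by 5 units top and bottom
mask = df["label"] == "EMAIL_ADDRESS"
df.loc[mask, "ymin"] -= 5
df.loc[mask, "ymax"] += 5

df.to_csv("Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv", index=False)
```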
I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder that the original was found. Let's upload this file to the app along with the original pdf to see how the redactions look now.



We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
app.py
CHANGED
@@ -4,11 +4,11 @@ import pandas as pd
 import gradio as gr
 from gradio_image_annotation import image_annotator
 
-from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_BULK_TEXTRACT_CALL_OPTIONS, TEXTRACT_BULK_ANALYSIS_BUCKET, TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH
+from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_BULK_TEXTRACT_CALL_OPTIONS, TEXTRACT_BULK_ANALYSIS_BUCKET, TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH, COGNITO_AUTH
 from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select
 from tools.aws_functions import upload_file_to_s3, download_file_from_s3
 from tools.file_redaction import choose_and_run_redactor
-from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
+from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
 from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
 from tools.data_anonymise import anonymise_data_files
 from tools.auth import authenticate_user

@@ -572,9 +572,9 @@
     text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])
 
     # Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
-    recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[…
-        success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour…
-        success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes,…
+    recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[selected_entity_dataframe_row]).\
+        success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour], outputs=[review_file_state, selected_entity_id, selected_entity_colour]).\
+        success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes, all_image_annotations_state, annotator, selected_entity_dataframe_row, input_folder_textbox, doc_full_file_name_textbox], outputs=[annotator, all_image_annotations_state, annotate_current_page, page_sizes, review_file_state, annotate_previous_page])
 
     reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
         success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])

@@ -733,7 +733,7 @@
 if __name__ == "__main__":
     if RUN_DIRECT_MODE == "0":
 
-        if …
+        if COGNITO_AUTH == "1":
             app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
         else:
             app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
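The change above wires row selection in the recogniser entity dataframe into a chain of follow-up steps. For readers unfamiliar with this Gradio pattern, here is a minimal, self-contained sketch of `.select()` followed by `.success()` chaining; it is illustrative only and uses made-up component and function names, not the app's.

```python
# Minimal sketch (not the app's code) of the Gradio event-chaining pattern used above:
# selecting a row in a Dataframe triggers a callback, and .success() runs follow-up
# steps only if the previous step completed without an error.
import gradio as gr
import pandas as pd

def row_selected(df: pd.DataFrame, evt: gr.SelectData) -> pd.DataFrame:
    # evt.index holds [row, column] of the clicked cell
    return df.iloc[[evt.index[0]]]

def jump_to_page(selected_row: pd.DataFrame) -> str:
    return f"Jumping to page {int(selected_row.iloc[0]['page'])}"

with gr.Blocks() as demo:
    data = pd.DataFrame({"page": [1, 2, 3], "label": ["PERSON", "EMAIL", "DATE"]})
    table = gr.Dataframe(value=data, interactive=False)
    selected_row = gr.Dataframe(visible=False)
    status = gr.Markdown()

    # Chained events, mirroring the .select(...).success(...) structure in app.py
    table.select(row_selected, inputs=[table], outputs=[selected_row]).\
        success(jump_to_page, inputs=[selected_row], outputs=[status])

if __name__ == "__main__":
    demo.launch()
```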
tools/config.py
CHANGED
@@ -237,7 +237,7 @@ else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'
 
 SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
 
-GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', '…
+GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')
 
 DEFAULT_COST_CODE = get_or_create_env_var('DEFAULT_COST_CODE', '')
 
@@ -246,7 +246,7 @@ COST_CODES_PATH = get_or_create_env_var('COST_CODES_PATH', '') # 'config/COST_CE…
 S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '') # COST_CENTRES.csv # This is a path within the DOCUMENT_REDACTION_BUCKET
 
 if COST_CODES_PATH: OUTPUT_COST_CODES_PATH = COST_CODES_PATH
-else: OUTPUT_COST_CODES_PATH = '…
+else: OUTPUT_COST_CODES_PATH = ''
 
 ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
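For context, `get_or_create_env_var` is the repository's small helper for reading a setting with a fallback default. Its implementation is not shown in this diff; a sketch consistent with how it is called above might look like this, though the real helper in tools/config.py may differ in detail.

```python
# Sketch only - the real helper lives in tools/config.py and may differ.
import os

def get_or_create_env_var(var_name: str, default_value: str) -> str:
    """Return the environment variable if set; otherwise set it to the default and return that."""
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value
        value = default_value
    return value

# Usage matching the lines above
GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')
```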
tools/file_conversion.py
CHANGED
@@ -21,6 +21,7 @@ from PIL import Image
 from scipy.spatial import cKDTree
 import random
 import string
 
 IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
 
@@ -617,11 +618,10 @@ def prepare_image_or_pdf(
 
         elif file_extension in ['.csv']:
             if '_review_file' in file_path_without_ext:
-                #print("file_path:", file_path)
                 review_file_csv = read_file(file_path)
                 all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
                 json_from_csv = True
-                print("Converted CSV review file to image annotation object")
             elif '_ocr_output' in file_path_without_ext:
                 all_line_level_ocr_results_df = read_file(file_path)
                 json_from_csv = False
@@ -850,121 +850,246 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
 
     return result
 
-def divide_coordinates_by_page_sizes(…
-    coord_cols = [xmin, xmax, ymin, ymax]
-    for col in coord_cols:
-        review_file_df.loc[:, col] = pd.to_numeric(review_file_df[col], errors="coerce")
-
-    review_file_df_orig = review_file_df.copy().loc[(review_file_df[xmin] <= 1) & (review_file_df[xmax] <= 1) & (review_file_df[ymin] <= 1) & (review_file_df[ymax] <= 1),:]
-
-    #print("review_file_df_orig:", review_file_df_orig)
-
-    review_file_df_div = review_file_df.loc[(review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) & (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1),:]
-
-    #print("review_file_df_div:", review_file_df_div)
-
-    review_file_df_div.loc[:, "page"] = pd.to_numeric(review_file_df_div["page"], errors="coerce")
-
-    …
 else:
-    …
-    # Only sort if the DataFrame is not empty and contains the required columns
-    required_sort_columns = {"page", xmin, ymin}
-    if not review_file_df_out.empty and required_sort_columns.issubset(review_file_df_out.columns):
-        review_file_df_out.sort_values(["page", ymin, xmin], inplace=True)
-
-    …
-
-def multiply_coordinates_by_page_sizes(review_file_df: pd.DataFrame, page_sizes_df: pd.DataFrame, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"):
-
-    if xmin in review_file_df.columns and not review_file_df.empty:
-
-        review_file_df_orig = review_file_df.loc[
-            (review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) &
-            (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1), :].copy()
-
-        …
-
-        review_file_df_na = review_file_df.loc[review_file_df["image_width"].isna()].copy()
-
-        …
-
-        # Multiply coordinates by image sizes
-        review_file_df_not_na[xmin] *= review_file_df_not_na["image_width"]
-        review_file_df_not_na[xmax] *= review_file_df_not_na["image_width"]
-        review_file_df_not_na[ymin] *= review_file_df_not_na["image_height"]
-        review_file_df_not_na[ymax] *= review_file_df_not_na["image_height"]
-
-        …
-
-    if dfs_to_concat: # Ensure there's at least one non-empty DataFrame
-        review_file_df = pd.concat(dfs_to_concat)
-    else:
-        review_file_df = pd.DataFrame() # Return an empty DataFrame instead of raising an error
-
-    required_sort_columns = {"page", "xmin", "ymin"}
-    if not review_file_df.empty and required_sort_columns.issubset(review_file_df.columns):
-        review_file_df.sort_values(["page", "xmin", "ymin"], inplace=True)
-
-    …
 
 def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
     '''
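The two helpers touched in this hunk convert review-file coordinates between a normalised 0-1 scale and absolute pixel positions on each page image. Based only on the fragments visible above, a simplified sketch of the 'multiply' direction is shown here; it is an editor's illustration, not the code that replaced these lines.

```python
# Simplified sketch of scaling normalised review-file coordinates (0-1) up to pixel
# coordinates using each page's image size. Illustrative only; the real
# multiply_coordinates_by_page_sizes in tools/file_conversion.py handles more edge cases.
import pandas as pd

def multiply_coordinates_by_page_sizes_sketch(review_file_df: pd.DataFrame) -> pd.DataFrame:
    if review_file_df.empty:
        return pd.DataFrame()

    # Rows with known image sizes can be scaled; rows without are passed through unchanged
    has_size = review_file_df["image_width"].notna()
    scaled = review_file_df.loc[has_size].copy()
    passthrough = review_file_df.loc[~has_size].copy()

    for col, dim in [("xmin", "image_width"), ("xmax", "image_width"),
                     ("ymin", "image_height"), ("ymax", "image_height")]:
        scaled[col] = scaled[col] * scaled[dim]

    out = pd.concat([scaled, passthrough])
    if not out.empty and {"page", "xmin", "ymin"}.issubset(out.columns):
        out = out.sort_values(["page", "xmin", "ymin"])
    return out
```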
@@ -1018,7 +1143,6 @@ def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
 
     return merged_df
 
-
 def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
     '''
     Match text from one dataframe to another based on proximity matching of coordinates across all pages.
 
@@ -1142,12 +1266,12 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
     # prevents this from being necessary.
 
     # 7. Ensure essential columns exist and set column order
-    essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id"]
     for col in essential_box_cols:
         if col not in final_df.columns:
             final_df[col] = pd.NA # Add column with NA if it wasn't present in any box
 
-    base_cols = ["image"…
     extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
     final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
 
@@ -1185,7 +1309,8 @@ def create_annotation_dicts_from_annotation_df(
     available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
 
     if 'text' in all_image_annotations_df.columns:
-        all_image_annotations_df…
 
     if not available_cols:
         print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
 
@@ -1226,85 +1351,84 @@ def create_annotation_dicts_from_annotation_df(
 
     return result
 
-def convert_annotation_json_to_review_df(…
     '''
     Convert the annotation json data to a dataframe format.
     Add on any text from the initial review_file dataframe by joining based on 'id' if available
     in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
     '''
 
     # 1. Convert annotations to DataFrame
-    # Ensure convert_annotation_data_to_dataframe populates the 'id' column
-    # if 'id' exists in the dictionaries within all_annotations.
     review_file_df = convert_annotation_data_to_dataframe(all_annotations)
 
-    # Only keep rows in review_df where there are coordinates
 
     # Exit early if the initial conversion results in an empty DataFrame
     if review_file_df.empty:
         # Define standard columns for an empty return DataFrame
-        …
 
     if not page_sizes_df.empty:
-        …
-        print("review_file_df after coord divide:", review_file_df)
-
-        # Also apply to redaction_decision_output if it's not empty and has page numbers
-        if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
-            redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce')
-            # Drop rows with invalid pages before division
-            redaction_decision_output.dropna(subset=['page'], inplace=True)
-            redaction_decision_output['page'] = redaction_decision_output['page'].astype(int)
-            redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
-            print("redaction_decision_output after coord divide:", redaction_decision_output)
-        else:
-            print("Warning: Page sizes DataFrame became empty after processing, skipping coordinate division.")
|
1287 |
|
1288 |
|
1289 |
# 3. Join additional data from redaction_decision_output if provided
|
|
|
|
|
1290 |
if not redaction_decision_output.empty:
|
1291 |
-
# ---
|
1292 |
-
|
1293 |
-
|
1294 |
-
|
|
|
|
|
1295 |
|
1296 |
if id_col_exists_in_review and id_col_exists_in_redaction:
|
1297 |
#print("Attempting to join data based on 'id' column.")
|
1298 |
try:
|
1299 |
-
# Ensure 'id' columns are of
|
1300 |
review_file_df['id'] = review_file_df['id'].astype(str)
|
1301 |
-
# Make a copy to avoid
|
|
|
1302 |
redaction_copy = redaction_decision_output.copy()
|
1303 |
redaction_copy['id'] = redaction_copy['id'].astype(str)
|
1304 |
|
1305 |
-
# Select columns to merge from redaction output.
|
1306 |
-
# Primarily interested in 'text', but keep 'id' for the merge key.
|
1307 |
-
# Add other columns from redaction_copy if needed.
|
1308 |
cols_to_merge = ['id']
|
1309 |
if 'text' in redaction_copy.columns:
|
1310 |
cols_to_merge.append('text')
|
@@ -1312,83 +1436,128 @@ def convert_annotation_json_to_review_df(all_annotations: List[dict],
|
|
1312 |
print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
|
1313 |
|
1314 |
# Perform a left merge to keep all annotations and add matching text
|
1315 |
-
#
|
1316 |
-
|
|
|
|
|
1317 |
merged_df = pd.merge(
|
1318 |
review_file_df,
|
1319 |
redaction_copy[cols_to_merge],
|
1320 |
on='id',
|
1321 |
how='left',
|
1322 |
-
suffixes=('',
|
1323 |
)
|
1324 |
|
1325 |
-
# Update the
|
1326 |
-
|
1327 |
-
|
1328 |
-
|
1329 |
-
|
1330 |
-
|
1331 |
-
#
|
1332 |
-
merged_df
|
1333 |
-
|
1334 |
-
|
1335 |
-
|
1336 |
|
1337 |
-
|
1338 |
-
final_cols = original_cols
|
1339 |
-
if 'text' not in final_cols and 'text' in merged_df.columns:
|
1340 |
-
final_cols.append('text') # Make sure text column is kept if newly added
|
1341 |
-
# Reorder/select columns if necessary, ensuring 'id' is kept
|
1342 |
-
review_file_df = merged_df[[col for col in final_cols if col in merged_df.columns] + (['id'] if 'id' not in final_cols else [])]
|
1343 |
|
|
|
1344 |
|
1345 |
-
#print("Successfully
|
1346 |
-
joined_by_id = True
|
1347 |
|
1348 |
except Exception as e:
|
1349 |
-
print(f"Error during 'id'-based merge: {e}.
|
1350 |
-
# Fall through to proximity match below
|
1351 |
-
|
1352 |
-
# --- Fallback to proximity match ---
|
1353 |
-
|
1354 |
-
|
1355 |
-
|
1356 |
-
|
1357 |
-
|
1358 |
-
|
1359 |
-
|
1360 |
-
|
1361 |
-
|
1362 |
-
|
1363 |
-
|
1364 |
-
|
1365 |
-
|
1366 |
-
|
1367 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1368 |
if 'id' in review_file_df.columns:
|
1369 |
-
|
|
|
|
|
1370 |
|
1371 |
-
|
|
|
1372 |
if col not in review_file_df.columns:
|
1373 |
-
#
|
1374 |
-
#
|
1375 |
-
review_file_df[col] = ''
|
1376 |
|
1377 |
# Select and order the final set of columns
|
1378 |
-
|
|
|
|
|
1379 |
|
1380 |
# 5. Final processing and sorting
|
1381 |
-
#
|
1382 |
if 'color' in review_file_df.columns:
|
1383 |
-
|
|
|
|
|
1384 |
|
1385 |
# Sort the results
|
1386 |
-
sort_columns = ['page', 'ymin', 'xmin', 'label']
|
1387 |
# Ensure sort columns exist before sorting
|
|
|
1388 |
valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
|
1389 |
-
if valid_sort_columns:
|
1390 |
-
|
1391 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1392 |
return review_file_df
|
1393 |
|
1394 |
def fill_missing_box_ids(data_input: dict) -> dict:
|
@@ -1472,20 +1641,18 @@ def fill_missing_box_ids(data_input: dict) -> dict:
|
|
1472 |
|
1473 |
def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
|
1474 |
"""
|
1475 |
-
Generates unique alphanumeric IDs for rows in a DataFrame column
|
1476 |
-
where the value is missing (NaN, None) or an empty string.
|
1477 |
|
1478 |
Args:
|
1479 |
df (pd.DataFrame): The input Pandas DataFrame.
|
1480 |
column_name (str): The name of the column to check and fill (defaults to 'id').
|
1481 |
This column will be added if it doesn't exist.
|
1482 |
length (int): The desired length of the generated IDs (defaults to 12).
|
1483 |
-
Cannot exceed the limits that guarantee uniqueness based
|
1484 |
-
on the number of IDs needed and character set size.
|
1485 |
|
1486 |
Returns:
|
1487 |
pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
|
1488 |
-
Note: The function modifies the DataFrame in
|
1489 |
"""
|
1490 |
|
1491 |
# --- Input Validation ---
|
@@ -1497,43 +1664,59 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
|
|
1497 |
raise ValueError("'length' must be a positive integer.")
|
1498 |
|
1499 |
# --- Ensure Column Exists ---
|
|
|
1500 |
if column_name not in df.columns:
|
1501 |
print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
|
1502 |
-
|
|
|
|
|
|
|
|
|
|
|
1503 |
|
1504 |
# --- Identify Rows Needing IDs ---
|
1505 |
-
# Check for NaN, None,
|
1506 |
-
|
1507 |
-
|
1508 |
-
|
1509 |
-
|
1510 |
-
|
1511 |
-
|
1512 |
-
|
1513 |
-
|
1514 |
-
|
1515 |
-
|
1516 |
-
|
1517 |
-
|
1518 |
-
is_missing_or_empty = df[column_name].isna()
|
1519 |
|
1520 |
rows_to_fill_index = df.index[is_missing_or_empty]
|
1521 |
num_needed = len(rows_to_fill_index)
|
1522 |
|
1523 |
if num_needed == 0:
|
1524 |
-
#
|
|
|
|
|
|
|
|
|
|
|
|
|
1525 |
return df
|
1526 |
|
1527 |
print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
|
1528 |
|
1529 |
# --- Get Existing IDs to Ensure Uniqueness ---
|
1530 |
-
|
1531 |
-
|
1532 |
-
|
1533 |
-
|
1534 |
-
|
1535 |
-
|
1536 |
-
|
|
|
|
|
|
|
|
|
|
|
1537 |
|
1538 |
|
1539 |
# --- Generate Unique IDs ---
|
@@ -1543,93 +1726,230 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
|
|
1543 |
|
1544 |
max_possible_ids = len(character_set) ** length
|
1545 |
if num_needed > max_possible_ids:
|
1546 |
-
|
1547 |
-
|
|
|
|
|
1548 |
|
1549 |
#print(f"Generating {num_needed} unique IDs of length {length}...")
|
1550 |
for i in range(num_needed):
|
1551 |
attempts = 0
|
1552 |
while True:
|
1553 |
candidate_id = ''.join(random.choices(character_set, k=length))
|
1554 |
-
# Check against *all* existing IDs and *newly* generated ones
|
1555 |
if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
|
1556 |
generated_ids_set.add(candidate_id)
|
1557 |
new_ids_list.append(candidate_id)
|
1558 |
break # Found a unique ID
|
1559 |
attempts += 1
|
1560 |
-
if attempts >
|
1561 |
-
|
1562 |
|
1563 |
-
# Optional progress update
|
1564 |
-
if (i + 1) % 1000 == 0:
|
1565 |
-
|
1566 |
|
1567 |
|
1568 |
# --- Assign New IDs ---
|
1569 |
# Use the previously identified index to assign the new IDs correctly
|
|
|
|
|
|
|
|
|
1570 |
df.loc[rows_to_fill_index, column_name] = new_ids_list
|
1571 |
-
|
|
|
|
|
|
|
1572 |
|
1573 |
-
# The DataFrame 'df' has been modified in place
|
1574 |
return df
|
1575 |
|
1576 |
-
def convert_review_df_to_annotation_json(
|
1577 |
-
|
1578 |
-
|
1579 |
-
'''
|
1580 |
-
|
1581 |
-
|
1582 |
-
|
1583 |
-
|
1584 |
-
for col in float_cols:
|
1585 |
-
review_file_df.loc[:, col] = pd.to_numeric(review_file_df.loc[:, col], errors='coerce')
|
1586 |
-
|
1587 |
-
# Convert relative co-ordinates into image coordinates for the image annotation output object
|
1588 |
-
if page_sizes:
|
1589 |
-
page_sizes_df = pd.DataFrame(page_sizes)
|
1590 |
-
page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
|
1591 |
|
1592 |
-
|
1593 |
-
|
1594 |
-
review_file_df = fill_missing_ids(review_file_df)
|
1595 |
|
1596 |
-
|
1597 |
-
review_file_df
|
1598 |
-
|
1599 |
-
|
1600 |
-
|
1601 |
-
|
1602 |
-
|
1603 |
-
# If colours are saved as list, convert to tuple
|
1604 |
-
review_file_df.loc[:, "color"] = review_file_df.loc[:,"color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
|
1605 |
|
1606 |
-
|
1607 |
-
|
|
|
|
|
|
|
1608 |
|
1609 |
-
#
|
1610 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1611 |
|
1612 |
-
|
1613 |
-
|
1614 |
-
reported_page_number = int(page_no + 1)
|
1615 |
|
1616 |
-
if reported_page_number in review_file_df["page"].values:
|
1617 |
|
1618 |
-
|
1619 |
-
|
1620 |
-
|
1621 |
-
|
1622 |
-
|
1623 |
-
|
1624 |
-
|
1625 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1626 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1627 |
else:
|
1628 |
-
|
1629 |
-
|
1630 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1631 |
|
1632 |
-
|
1633 |
-
|
|
|
1634 |
|
1635 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
21 |
from scipy.spatial import cKDTree
|
22 |
import random
|
23 |
import string
|
24 |
+
import warnings # To warn about potential type changes
|
25 |
|
26 |
IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
|
27 |
|
|
|
618 |
|
619 |
elif file_extension in ['.csv']:
|
620 |
if '_review_file' in file_path_without_ext:
|
|
|
621 |
review_file_csv = read_file(file_path)
|
622 |
all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
|
623 |
json_from_csv = True
|
624 |
+
#print("Converted CSV review file to image annotation object")
|
625 |
elif '_ocr_output' in file_path_without_ext:
|
626 |
all_line_level_ocr_results_df = read_file(file_path)
|
627 |
json_from_csv = False
|
|
|
850 |
|
851 |
return result
|
852 |
|
853 |
+
def divide_coordinates_by_page_sizes(
|
854 |
+
review_file_df: pd.DataFrame,
|
855 |
+
page_sizes_df: pd.DataFrame,
|
856 |
+
xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
|
857 |
+
) -> pd.DataFrame:
|
858 |
+
"""
|
859 |
+
Optimized function to convert absolute image coordinates (>1) to relative coordinates (<=1).
|
|
860 |
|
861 |
+
Identifies rows with absolute coordinates, merges page size information,
|
862 |
+
divides coordinates by dimensions, and combines with already-relative rows.
|
863 |
|
864 |
+
Args:
|
865 |
+
review_file_df: Input DataFrame with potentially mixed coordinate systems.
|
866 |
+
page_sizes_df: DataFrame with page dimensions ('page', 'image_width',
|
867 |
+
'image_height', 'mediabox_width', 'mediabox_height').
|
868 |
+
xmin, xmax, ymin, ymax: Names of the coordinate columns.
|
869 |
|
870 |
+
Returns:
|
871 |
+
DataFrame with coordinates converted to relative system, sorted.
|
872 |
+
"""
|
873 |
+
if review_file_df.empty or xmin not in review_file_df.columns:
|
874 |
+
return review_file_df # Return early if empty or key column missing
|
875 |
|
876 |
+
# --- Initial Type Conversion ---
|
877 |
+
coord_cols = [xmin, xmax, ymin, ymax]
|
878 |
+
cols_to_convert = coord_cols + ["page"]
|
879 |
+
temp_df = review_file_df.copy() # Work on a copy initially
|
880 |
|
881 |
+
for col in cols_to_convert:
|
882 |
+
if col in temp_df.columns:
|
883 |
+
temp_df[col] = pd.to_numeric(temp_df[col], errors="coerce")
|
884 |
+
else:
|
885 |
+
# If essential 'page' or coord column missing, cannot proceed meaningfully
|
886 |
+
if col == 'page' or col in coord_cols:
|
887 |
+
print(f"Warning: Required column '{col}' not found in review_file_df. Returning original DataFrame.")
|
888 |
+
return review_file_df
|
889 |
+
|
890 |
+
# --- Identify Absolute Coordinates ---
|
891 |
+
# Create mask for rows where *all* coordinates are potentially absolute (> 1)
|
892 |
+
# Handle potential NaNs introduced by to_numeric - treat NaN as not absolute.
|
893 |
+
is_absolute_mask = (
|
894 |
+
(temp_df[xmin] > 1) & (temp_df[xmin].notna()) &
|
895 |
+
(temp_df[xmax] > 1) & (temp_df[xmax].notna()) &
|
896 |
+
(temp_df[ymin] > 1) & (temp_df[ymin].notna()) &
|
897 |
+
(temp_df[ymax] > 1) & (temp_df[ymax].notna())
|
898 |
+
)
|
899 |
|
900 |
+
# --- Separate DataFrames ---
|
901 |
+
df_rel = temp_df[~is_absolute_mask] # Rows already relative or with NaN/mixed coords
|
902 |
+
df_abs = temp_df[is_absolute_mask].copy() # Absolute rows - COPY here to allow modifications
|
903 |
+
|
904 |
+
# --- Process Absolute Coordinates ---
|
905 |
+
if not df_abs.empty:
|
906 |
+
# Merge page sizes if necessary
|
907 |
+
if "image_width" not in df_abs.columns and not page_sizes_df.empty:
|
908 |
+
ps_df_copy = page_sizes_df.copy() # Work on a copy of page sizes
|
909 |
+
|
910 |
+
# Ensure page is numeric for merge key matching
|
911 |
+
ps_df_copy['page'] = pd.to_numeric(ps_df_copy['page'], errors='coerce')
|
912 |
+
|
913 |
+
# Columns to merge from page_sizes
|
914 |
+
merge_cols = ['page', 'image_width', 'image_height', 'mediabox_width', 'mediabox_height']
|
915 |
+
available_merge_cols = [col for col in merge_cols if col in ps_df_copy.columns]
|
916 |
+
|
917 |
+
# Prepare dimension columns in the copy
|
918 |
+
for col in ['image_width', 'image_height', 'mediabox_width', 'mediabox_height']:
|
919 |
+
if col in ps_df_copy.columns:
|
920 |
+
# Replace "<NA>" string if present
|
921 |
+
if ps_df_copy[col].dtype == 'object':
|
922 |
+
ps_df_copy[col] = ps_df_copy[col].replace("<NA>", pd.NA)
|
923 |
+
# Convert to numeric
|
924 |
+
ps_df_copy[col] = pd.to_numeric(ps_df_copy[col], errors='coerce')
|
925 |
+
|
926 |
+
# Perform the merge
|
927 |
+
if 'page' in available_merge_cols: # Check if page exists for merging
|
928 |
+
df_abs = df_abs.merge(
|
929 |
+
ps_df_copy[available_merge_cols],
|
930 |
+
on="page",
|
931 |
+
how="left"
|
932 |
+
)
|
933 |
+
else:
|
934 |
+
print("Warning: 'page' column not found in page_sizes_df. Cannot merge dimensions.")
|
935 |
+
|
936 |
+
|
937 |
+
# Fallback to mediabox dimensions if image dimensions are missing
|
938 |
+
if "image_width" in df_abs.columns and "mediabox_width" in df_abs.columns:
|
939 |
+
# Check if image_width mostly missing - use .isna().all() or check percentage
|
940 |
+
if df_abs["image_width"].isna().all():
|
941 |
+
print("Falling back to mediabox dimensions as image_width is entirely missing.")
|
942 |
+
df_abs["image_width"] = df_abs["image_width"].fillna(df_abs["mediabox_width"])
|
943 |
+
df_abs["image_height"] = df_abs["image_height"].fillna(df_abs["mediabox_height"])
|
944 |
+
else:
|
945 |
+
# Optional: Fill only missing image dims if some exist?
|
946 |
+
# df_abs["image_width"].fillna(df_abs["mediabox_width"], inplace=True)
|
947 |
+
# df_abs["image_height"].fillna(df_abs["mediabox_height"], inplace=True)
|
948 |
+
pass # Current logic only falls back if ALL image_width are NaN
|
949 |
+
|
950 |
+
# Ensure divisor columns are numeric before division
|
951 |
+
divisors_numeric = True
|
952 |
+
for col in ["image_width", "image_height"]:
|
953 |
+
if col in df_abs.columns:
|
954 |
+
df_abs[col] = pd.to_numeric(df_abs[col], errors='coerce')
|
955 |
+
else:
|
956 |
+
print(f"Warning: Dimension column '{col}' missing. Cannot perform division.")
|
957 |
+
divisors_numeric = False
|
958 |
+
|
959 |
+
|
960 |
+
# Perform division if dimensions are available and numeric
|
961 |
+
if divisors_numeric and "image_width" in df_abs.columns and "image_height" in df_abs.columns:
|
962 |
+
# Use np.errstate to suppress warnings about division by zero or NaN if desired
|
963 |
+
with np.errstate(divide='ignore', invalid='ignore'):
|
964 |
+
df_abs[xmin] = df_abs[xmin] / df_abs["image_width"]
|
965 |
+
df_abs[xmax] = df_abs[xmax] / df_abs["image_width"]
|
966 |
+
df_abs[ymin] = df_abs[ymin] / df_abs["image_height"]
|
967 |
+
df_abs[ymax] = df_abs[ymax] / df_abs["image_height"]
|
968 |
+
# Replace potential infinities with NaN (optional, depending on desired outcome)
|
969 |
+
df_abs.replace([np.inf, -np.inf], np.nan, inplace=True)
|
970 |
else:
|
971 |
+
print("Skipping coordinate division due to missing or non-numeric dimension columns.")
|
972 |
|
973 |
|
974 |
+
# --- Combine Relative and Processed Absolute DataFrames ---
|
975 |
+
dfs_to_concat = [df for df in [df_rel, df_abs] if not df.empty]
|
976 |
|
977 |
+
if dfs_to_concat:
|
978 |
+
final_df = pd.concat(dfs_to_concat, ignore_index=True)
|
979 |
+
else:
|
980 |
+
# If both splits were empty, return an empty DF with original columns
|
981 |
+
print("Warning: Both relative and absolute splits resulted in empty DataFrames.")
|
982 |
+
final_df = pd.DataFrame(columns=review_file_df.columns)
|
983 |
|
|
|
984 |
|
985 |
+
# --- Final Sort ---
|
986 |
+
required_sort_columns = {"page", xmin, ymin}
|
987 |
+
if not final_df.empty and required_sort_columns.issubset(final_df.columns):
|
988 |
+
# Ensure sort columns are numeric before sorting
|
989 |
+
final_df['page'] = pd.to_numeric(final_df['page'], errors='coerce')
|
990 |
+
final_df[ymin] = pd.to_numeric(final_df[ymin], errors='coerce')
|
991 |
+
final_df[xmin] = pd.to_numeric(final_df[xmin], errors='coerce')
|
992 |
+
# Sort by page, ymin, xmin (note order compared to multiply function)
|
993 |
+
final_df.sort_values(["page", ymin, xmin], inplace=True, na_position='last')
|
994 |
|
|
|
995 |
|
996 |
+
# --- Clean Up Columns ---
|
997 |
+
# Correctly drop columns and reassign the result
|
998 |
+
cols_to_drop = ["image_width", "image_height", "mediabox_width", "mediabox_height"]
|
999 |
+
final_df = final_df.drop(columns=cols_to_drop, errors="ignore")
|
1000 |
|
1001 |
+
return final_df
|
1002 |
|
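For reference, a minimal usage sketch of the new divide helper (illustrative data only; the import path is an assumption based on this file living at tools/file_conversion.py):

```python
import pandas as pd

from tools.file_conversion import divide_coordinates_by_page_sizes  # assumed import path

# One absolute (pixel) box and one already-relative box on the same page
review = pd.DataFrame({
    "page": [1, 1],
    "xmin": [100.0, 0.10], "ymin": [200.0, 0.20],
    "xmax": [300.0, 0.30], "ymax": [400.0, 0.25],
    "label": ["PERSON", "EMAIL"],
})
page_sizes = pd.DataFrame({
    "page": [1],
    "image_width": [1000], "image_height": [1400],
    "mediabox_width": [595], "mediabox_height": [842],
})

relative = divide_coordinates_by_page_sizes(review, page_sizes)
# The pixel row is divided by the page's image size (100/1000 -> 0.1, 200/1400 -> ~0.143);
# the already-relative row passes through unchanged, and the merged dimension columns are dropped.
print(relative[["page", "xmin", "ymin", "xmax", "ymax"]])
```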
1003 |
+
def multiply_coordinates_by_page_sizes(
|
1004 |
+
review_file_df: pd.DataFrame,
|
1005 |
+
page_sizes_df: pd.DataFrame,
|
1006 |
+
xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
|
1007 |
+
):
|
1008 |
+
"""
|
1009 |
+
Optimized function to convert relative coordinates to absolute based on page sizes.
|
1010 |
|
1011 |
+
Separates relative (<=1) and absolute (>1) coordinates, merges page sizes
|
1012 |
+
for relative coordinates, calculates absolute pixel values, and recombines.
|
1013 |
+
"""
|
1014 |
+
if review_file_df.empty or xmin not in review_file_df.columns:
|
1015 |
+
return review_file_df # Return early if empty or key column missing
|
1016 |
+
|
1017 |
+
coord_cols = [xmin, xmax, ymin, ymax]
|
1018 |
+
# Initial type conversion for coordinates and page
|
1019 |
+
for col in coord_cols + ["page"]:
|
1020 |
+
if col in review_file_df.columns:
|
1021 |
+
# Use astype for potentially faster conversion if confident,
|
1022 |
+
# but to_numeric is safer for mixed types/errors
|
1023 |
+
review_file_df[col] = pd.to_numeric(review_file_df[col], errors="coerce")
|
1024 |
+
|
1025 |
+
# --- Identify relative coordinates ---
|
1026 |
+
# Create mask for rows where *all* coordinates are potentially relative (<= 1)
|
1027 |
+
# Handle potential NaNs introduced by to_numeric - treat NaN as not relative here.
|
1028 |
+
is_relative_mask = (
|
1029 |
+
(review_file_df[xmin].le(1) & review_file_df[xmin].notna()) &
|
1030 |
+
(review_file_df[xmax].le(1) & review_file_df[xmax].notna()) &
|
1031 |
+
(review_file_df[ymin].le(1) & review_file_df[ymin].notna()) &
|
1032 |
+
(review_file_df[ymax].le(1) & review_file_df[ymax].notna())
|
1033 |
+
)
|
1034 |
|
1035 |
+
# Separate DataFrames (minimal copies)
|
1036 |
+
df_abs = review_file_df[~is_relative_mask].copy() # Keep absolute rows separately
|
1037 |
+
df_rel = review_file_df[is_relative_mask].copy() # Work only with relative rows
|
1038 |
+
|
1039 |
+
if df_rel.empty:
|
1040 |
+
# If no relative coordinates, just sort and return absolute ones (if any)
|
1041 |
+
if not df_abs.empty and {"page", xmin, ymin}.issubset(df_abs.columns):
|
1042 |
+
df_abs.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
|
1043 |
+
return df_abs
|
1044 |
+
|
1045 |
+
# --- Process relative coordinates ---
|
1046 |
+
if "image_width" not in df_rel.columns and not page_sizes_df.empty:
|
1047 |
+
# Prepare page_sizes_df for merge
|
1048 |
+
page_sizes_df = page_sizes_df.copy() # Avoid modifying original page_sizes_df
|
1049 |
+
page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
|
1050 |
+
# Ensure proper NA handling for image dimensions
|
1051 |
+
page_sizes_df[['image_width', 'image_height']] = page_sizes_df[['image_width','image_height']].replace("<NA>", pd.NA)
|
1052 |
+
page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
|
1053 |
+
page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
|
1054 |
+
|
1055 |
+
# Merge page sizes
|
1056 |
+
df_rel = df_rel.merge(
|
1057 |
+
page_sizes_df[['page', 'image_width', 'image_height']],
|
1058 |
+
on="page",
|
1059 |
+
how="left"
|
1060 |
+
)
|
1061 |
|
1062 |
+
# Multiply coordinates where image dimensions are available
|
1063 |
+
if "image_width" in df_rel.columns:
|
1064 |
+
# Create mask for rows in df_rel that have valid image dimensions
|
1065 |
+
has_size_mask = df_rel["image_width"].notna() & df_rel["image_height"].notna()
|
1066 |
|
1067 |
+
# Apply multiplication using .loc and the mask (vectorized and efficient)
|
1068 |
+
# Ensure columns are numeric before multiplication (might be redundant if types are good)
|
1069 |
+
# df_rel.loc[has_size_mask, coord_cols + ['image_width', 'image_height']] = df_rel.loc[has_size_mask, coord_cols + ['image_width', 'image_height']].apply(pd.to_numeric, errors='coerce')
|
|
|
1070 |
|
1071 |
+
df_rel.loc[has_size_mask, xmin] *= df_rel.loc[has_size_mask, "image_width"]
|
1072 |
+
df_rel.loc[has_size_mask, xmax] *= df_rel.loc[has_size_mask, "image_width"]
|
1073 |
+
df_rel.loc[has_size_mask, ymin] *= df_rel.loc[has_size_mask, "image_height"]
|
1074 |
+
df_rel.loc[has_size_mask, ymax] *= df_rel.loc[has_size_mask, "image_height"]
|
1075 |
|
1076 |
|
1077 |
+
# --- Combine absolute and processed relative DataFrames ---
|
1078 |
+
# Use list comprehension to handle potentially empty DataFrames
|
1079 |
+
dfs_to_concat = [df for df in [df_abs, df_rel] if not df.empty]
|
1080 |
|
1081 |
+
if not dfs_to_concat:
|
1082 |
+
return pd.DataFrame() # Return empty if both are empty
|
1083 |
|
1084 |
+
final_df = pd.concat(dfs_to_concat, ignore_index=True) # ignore_index is good practice after filtering/concat
|
1085 |
|
1086 |
+
# --- Final Sort ---
|
1087 |
+
required_sort_columns = {"page", xmin, ymin}
|
1088 |
+
if not final_df.empty and required_sort_columns.issubset(final_df.columns):
|
1089 |
+
# Handle potential NaNs in sort columns gracefully
|
1090 |
+
final_df.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
|
1091 |
|
1092 |
+
return final_df
|
1093 |
|
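And the inverse direction, converting relative review-file coordinates back into pixel values for display (again a sketch with made-up numbers; the import path is an assumption):

```python
import pandas as pd

from tools.file_conversion import multiply_coordinates_by_page_sizes  # assumed import path

review = pd.DataFrame({
    "page": [1],
    "xmin": [0.10], "ymin": [0.20], "xmax": [0.30], "ymax": [0.25],
    "label": ["PERSON"],
})
page_sizes = pd.DataFrame({
    "page": [1], "image_width": [1000], "image_height": [1400],
})

absolute = multiply_coordinates_by_page_sizes(review, page_sizes)
# xmin/xmax are scaled by image_width and ymin/ymax by image_height,
# so the single box becomes roughly (100, 280, 300, 350) in pixel coordinates.
print(absolute[["xmin", "ymin", "xmax", "ymax"]])
```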
1094 |
def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
|
1095 |
'''
|
|
|
1143 |
|
1144 |
return merged_df
|
1145 |
|
|
|
1146 |
def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
|
1147 |
'''
|
1148 |
Match text from one dataframe to another based on proximity matching of coordinates across all pages.
|
|
|
1266 |
# prevents this from being necessary.
|
1267 |
|
1268 |
# 7. Ensure essential columns exist and set column order
|
1269 |
+
essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id", "label"]
|
1270 |
for col in essential_box_cols:
|
1271 |
if col not in final_df.columns:
|
1272 |
final_df[col] = pd.NA # Add column with NA if it wasn't present in any box
|
1273 |
|
1274 |
+
base_cols = ["image"]
|
1275 |
extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
|
1276 |
final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
|
1277 |
|
|
|
1309 |
available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
|
1310 |
|
1311 |
if 'text' in all_image_annotations_df.columns:
|
1312 |
+
all_image_annotations_df['text'] = all_image_annotations_df['text'].fillna('')
|
1313 |
+
#all_image_annotations_df.loc[all_image_annotations_df['text'].isnull(), 'text'] = ''
|
1314 |
|
1315 |
if not available_cols:
|
1316 |
print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
|
|
|
1351 |
|
1352 |
return result
|
1353 |
|
1354 |
+
def convert_annotation_json_to_review_df(
|
1355 |
+
all_annotations: List[dict],
|
1356 |
+
redaction_decision_output: pd.DataFrame = pd.DataFrame(),
|
1357 |
+
page_sizes: List[dict] = [],
|
1358 |
+
do_proximity_match: bool = True
|
1359 |
+
) -> pd.DataFrame:
|
1360 |
'''
|
1361 |
Convert the annotation json data to a dataframe format.
|
1362 |
Add on any text from the initial review_file dataframe by joining based on 'id' if available
|
1363 |
in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
|
1364 |
+
|
1365 |
+
Refactored for improved efficiency, prioritizing ID-based join and conditionally applying
|
1366 |
+
coordinate division and proximity matching.
|
1367 |
'''
|
1368 |
|
1369 |
# 1. Convert annotations to DataFrame
|
|
|
|
|
|
|
1370 |
review_file_df = convert_annotation_data_to_dataframe(all_annotations)
|
1371 |
|
1372 |
+
# Only keep rows in review_df where there are coordinates (assuming xmin is representative)
|
1373 |
+
# Use .notna() for robustness with potential None or NaN values
|
1374 |
+
review_file_df.dropna(subset=['xmin', 'ymin', 'xmax', 'ymax'], how='any', inplace=True)
|
1375 |
|
1376 |
# Exit early if the initial conversion results in an empty DataFrame
|
1377 |
if review_file_df.empty:
|
1378 |
# Define standard columns for an empty return DataFrame
|
1379 |
+
# Ensure 'id' is included if it was potentially expected based on input structure
|
1380 |
+
# We don't know the columns from convert_annotation_data_to_dataframe without seeing it,
|
1381 |
+
# but let's assume a standard set and add 'id' if it appeared.
|
1382 |
+
standard_cols = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text"]
|
1383 |
+
if 'id' in review_file_df.columns:
|
1384 |
+
standard_cols.append('id')
|
1385 |
+
return pd.DataFrame(columns=standard_cols)
|
1386 |
+
|
1387 |
+
# Ensure 'id' column exists for logic flow, even if empty
|
1388 |
+
if 'id' not in review_file_df.columns:
|
1389 |
+
review_file_df['id'] = ''
|
1390 |
+
# Do the same for redaction_decision_output if it's not empty
|
1391 |
+
if not redaction_decision_output.empty and 'id' not in redaction_decision_output.columns:
|
1392 |
+
redaction_decision_output['id'] = ''
|
1393 |
|
1394 |
|
1395 |
+
# 2. Process page sizes if provided - needed potentially for coordinate division later
|
1396 |
+
# Process this once upfront if the data is available
|
1397 |
+
page_sizes_df = pd.DataFrame() # Initialize as empty
|
1398 |
+
if page_sizes:
|
1399 |
+
page_sizes_df = pd.DataFrame(page_sizes)
|
1400 |
if not page_sizes_df.empty:
|
1401 |
+
# Safely convert page column to numeric and then int
|
1402 |
+
page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
|
1403 |
+
page_sizes_df.dropna(subset=["page"], inplace=True)
|
1404 |
+
if not page_sizes_df.empty: # Check again after dropping NaNs
|
1405 |
+
page_sizes_df["page"] = page_sizes_df["page"].astype(int)
|
1406 |
+
else:
|
1407 |
+
print("Warning: Page sizes DataFrame became empty after processing, coordinate division will be skipped.")
|
1408 |
|
1409 |
|
1410 |
# 3. Join additional data from redaction_decision_output if provided
|
1411 |
+
text_added_successfully = False # Flag to track if text was added by any method
|
1412 |
+
|
1413 |
if not redaction_decision_output.empty:
|
1414 |
+
# --- Attempt to join data based on 'id' column first ---
|
1415 |
+
|
1416 |
+
# Check if 'id' columns are present and have non-null values in *both* dataframes
|
1417 |
+
id_col_exists_in_review = 'id' in review_file_df.columns and not review_file_df['id'].isnull().all() and not (review_file_df['id'] == '').all()
|
1418 |
+
id_col_exists_in_redaction = 'id' in redaction_decision_output.columns and not redaction_decision_output['id'].isnull().all() and not (redaction_decision_output['id'] == '').all()
|
1419 |
+
|
1420 |
|
1421 |
if id_col_exists_in_review and id_col_exists_in_redaction:
|
1422 |
#print("Attempting to join data based on 'id' column.")
|
1423 |
try:
|
1424 |
+
# Ensure 'id' columns are of string type for robust merging
|
1425 |
review_file_df['id'] = review_file_df['id'].astype(str)
|
1426 |
+
# Make a copy if needed, but try to avoid if redaction_decision_output isn't modified later
|
1427 |
+
# Let's use a copy for safety as in the original code
|
1428 |
redaction_copy = redaction_decision_output.copy()
|
1429 |
redaction_copy['id'] = redaction_copy['id'].astype(str)
|
1430 |
|
1431 |
+
# Select columns to merge from redaction output. Prioritize 'text'.
|
|
|
|
|
1432 |
cols_to_merge = ['id']
|
1433 |
if 'text' in redaction_copy.columns:
|
1434 |
cols_to_merge.append('text')
|
|
|
1436 |
print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
|
1437 |
|
1438 |
# Perform a left merge to keep all annotations and add matching text
|
1439 |
+
# Use a suffix for the text column from the right DataFrame
|
1440 |
+
original_text_col_exists = 'text' in review_file_df.columns
|
1441 |
+
merge_suffix = '_redaction' if original_text_col_exists else ''
|
1442 |
+
|
1443 |
merged_df = pd.merge(
|
1444 |
review_file_df,
|
1445 |
redaction_copy[cols_to_merge],
|
1446 |
on='id',
|
1447 |
how='left',
|
1448 |
+
suffixes=('', merge_suffix)
|
1449 |
)
|
1450 |
|
1451 |
+
# Update the 'text' column if a new one was brought in
|
1452 |
+
if 'text' + merge_suffix in merged_df.columns:
|
1453 |
+
redaction_text_col = 'text' + merge_suffix
|
1454 |
+
if original_text_col_exists:
|
1455 |
+
# Combine: Use text from redaction where available, otherwise keep original
|
1456 |
+
merged_df['text'] = merged_df[redaction_text_col].combine_first(merged_df['text'])
|
1457 |
+
# Drop the temporary column
|
1458 |
+
merged_df = merged_df.drop(columns=[redaction_text_col])
|
1459 |
+
else:
|
1460 |
+
# Redaction output had text, but review_file_df didn't. Rename the new column.
|
1461 |
+
merged_df = merged_df.rename(columns={redaction_text_col: 'text'})
|
1462 |
|
1463 |
+
text_added_successfully = True # Indicate text was potentially added
|
1464 |
|
1465 |
+
review_file_df = merged_df # Update the main DataFrame
|
1466 |
|
1467 |
+
#print("Successfully attempted to join data using 'id'.") # Note: Text might not have been in redaction data
|
|
|
1468 |
|
1469 |
except Exception as e:
|
1470 |
+
print(f"Error during 'id'-based merge: {e}. Checking for proximity match fallback.")
|
1471 |
+
# Fall through to proximity match logic below
|
1472 |
+
|
1473 |
+
# --- Fallback to proximity match if ID join wasn't possible/successful and enabled ---
|
1474 |
+
# Note: If id_col_exists_in_review or id_col_exists_in_redaction was False,
|
1475 |
+
# the block above was skipped, and we naturally fall here.
|
1476 |
+
# If an error occurred in the try block, joined_by_id would implicitly be False
|
1477 |
+
# because text_added_successfully wasn't set to True.
|
1478 |
+
|
1479 |
+
# Only attempt proximity match if text wasn't added by ID join and proximity is requested
|
1480 |
+
if not text_added_successfully and do_proximity_match:
|
1481 |
+
print("Attempting proximity match to add text data.")
|
1482 |
+
|
1483 |
+
# Ensure 'page' columns are numeric before coordinate division and proximity match
|
1484 |
+
# (Assuming divide_coordinates_by_page_sizes and do_proximity_match_all_pages_for_text need this)
|
1485 |
+
if 'page' in review_file_df.columns:
|
1486 |
+
review_file_df['page'] = pd.to_numeric(review_file_df['page'], errors='coerce').fillna(-1).astype(int) # Use -1 for NaN pages
|
1487 |
+
review_file_df = review_file_df[review_file_df['page'] != -1] # Drop rows where page conversion failed
|
1488 |
+
if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
|
1489 |
+
redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce').fillna(-1).astype(int)
|
1490 |
+
redaction_decision_output = redaction_decision_output[redaction_decision_output['page'] != -1]
|
1491 |
+
|
1492 |
+
# Perform coordinate division IF page_sizes were processed and DataFrame is not empty
|
1493 |
+
if not page_sizes_df.empty:
|
1494 |
+
# Apply coordinate division *before* proximity match
|
1495 |
+
review_file_df = divide_coordinates_by_page_sizes(review_file_df, page_sizes_df)
|
1496 |
+
if not redaction_decision_output.empty:
|
1497 |
+
redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
|
1498 |
+
|
1499 |
+
# Now perform the proximity match
|
1500 |
+
# Note: Potential DataFrame copies happen inside do_proximity_match based on its implementation
|
1501 |
+
if not redaction_decision_output.empty:
|
1502 |
+
try:
|
1503 |
+
review_file_df = do_proximity_match_all_pages_for_text(
|
1504 |
+
df1=review_file_df, # Pass directly, avoid caller copy if possible by modifying function signature
|
1505 |
+
df2=redaction_decision_output # Pass directly
|
1506 |
+
)
|
1507 |
+
# Assuming do_proximity_match_all_pages_for_text adds/updates the 'text' column
|
1508 |
+
if 'text' in review_file_df.columns:
|
1509 |
+
text_added_successfully = True
|
1510 |
+
print("Proximity match completed.")
|
1511 |
+
except Exception as e:
|
1512 |
+
print(f"Error during proximity match: {e}. Text data may not be added.")
|
1513 |
+
|
1514 |
+
elif not text_added_successfully and not do_proximity_match:
|
1515 |
+
print("Skipping joining text data (ID join not possible/failed, proximity match disabled).")
|
1516 |
+
|
1517 |
+
# 4. Ensure required columns exist and are ordered
|
1518 |
+
# Define base required columns. 'id' and 'text' are conditionally added.
|
1519 |
+
required_columns_base = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax"]
|
1520 |
+
final_columns = required_columns_base[:] # Start with base columns
|
1521 |
+
|
1522 |
+
# Add 'id' and 'text' if they exist in the DataFrame at this point
|
1523 |
if 'id' in review_file_df.columns:
|
1524 |
+
final_columns.append('id')
|
1525 |
+
if 'text' in review_file_df.columns:
|
1526 |
+
final_columns.append('text') # Add text column if it was created/merged
|
1527 |
|
1528 |
+
# Add any missing required columns with a default value (e.g., blank string)
|
1529 |
+
for col in final_columns:
|
1530 |
if col not in review_file_df.columns:
|
1531 |
+
# Use appropriate default based on expected type, '' for text/id, np.nan for coords?
|
1532 |
+
# Sticking to '' as in original for simplicity, but consider data types.
|
1533 |
+
review_file_df[col] = '' # Or np.nan for numerical, but coords already checked by dropna
|
1534 |
|
1535 |
# Select and order the final set of columns
|
1536 |
+
# Ensure all selected columns actually exist after adding defaults
|
1537 |
+
review_file_df = review_file_df[[col for col in final_columns if col in review_file_df.columns]]
|
1538 |
+
|
1539 |
|
1540 |
# 5. Final processing and sorting
|
1541 |
+
# Convert colours from list to tuple if necessary - apply is okay here unless lists are vast
|
1542 |
if 'color' in review_file_df.columns:
|
1543 |
+
# Check if the column actually contains lists before applying lambda
|
1544 |
+
if review_file_df['color'].apply(lambda x: isinstance(x, list)).any():
|
1545 |
+
review_file_df["color"] = review_file_df["color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
|
1546 |
|
1547 |
# Sort the results
|
|
|
1548 |
# Ensure sort columns exist before sorting
|
1549 |
+
sort_columns = ['page', 'ymin', 'xmin', 'label']
|
1550 |
valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
|
1551 |
+
if valid_sort_columns and not review_file_df.empty: # Only sort non-empty df
|
1552 |
+
# Convert potential numeric sort columns to appropriate types if necessary
|
1553 |
+
# (e.g., 'page', 'ymin', 'xmin') to ensure correct sorting.
|
1554 |
+
# dropna(subset=[...], inplace=True) earlier should handle NaNs in coords.
|
1555 |
+
# page conversion already done before proximity match.
|
1556 |
+
try:
|
1557 |
+
review_file_df = review_file_df.sort_values(valid_sort_columns)
|
1558 |
+
except TypeError as e:
|
1559 |
+
print(f"Warning: Could not sort DataFrame due to type error in sort columns: {e}")
|
1560 |
+
# Proceed without sorting
|
1561 |
return review_file_df
|
1562 |
|
1563 |
def fill_missing_box_ids(data_input: dict) -> dict:
|
|
|
1641 |
|
1642 |
def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
|
1643 |
"""
|
1644 |
+
Optimized: Generates unique alphanumeric IDs for rows in a DataFrame column
|
1645 |
+
where the value is missing (NaN, None) or an empty/whitespace string.
|
1646 |
|
1647 |
Args:
|
1648 |
df (pd.DataFrame): The input Pandas DataFrame.
|
1649 |
column_name (str): The name of the column to check and fill (defaults to 'id').
|
1650 |
This column will be added if it doesn't exist.
|
1651 |
length (int): The desired length of the generated IDs (defaults to 12).
|
|
|
|
|
1652 |
|
1653 |
Returns:
|
1654 |
pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
|
1655 |
+
Note: The function modifies the DataFrame directly (in-place).
|
1656 |
"""
|
1657 |
|
1658 |
# --- Input Validation ---
|
|
|
1664 |
raise ValueError("'length' must be a positive integer.")
|
1665 |
|
1666 |
# --- Ensure Column Exists ---
|
1667 |
+
original_dtype = None
|
1668 |
if column_name not in df.columns:
|
1669 |
print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
|
1670 |
+
# Initialize with None (which Pandas often treats as NaN but allows object dtype)
|
1671 |
+
df[column_name] = None
|
1672 |
+
# Set original_dtype to object so it likely becomes string later
|
1673 |
+
original_dtype = object
|
1674 |
+
else:
|
1675 |
+
original_dtype = df[column_name].dtype
|
1676 |
|
1677 |
# --- Identify Rows Needing IDs ---
|
1678 |
+
# 1. Check for actual null values (NaN, None, NaT)
|
1679 |
+
is_null = df[column_name].isna()
|
1680 |
+
|
1681 |
+
# 2. Check for empty or whitespace-only strings AFTER converting potential values to string
|
1682 |
+
# Only apply string checks on rows that are *not* null to avoid errors/warnings
|
1683 |
+
# Fill NaN temporarily for string operations, then check length or equality
|
1684 |
+
is_empty_str = pd.Series(False, index=df.index) # Default to False
|
1685 |
+
if not is_null.all(): # Only check strings if there are non-null values
|
1686 |
+
temp_str_col = df.loc[~is_null, column_name].astype(str).str.strip()
|
1687 |
+
is_empty_str.loc[~is_null] = (temp_str_col == '')
|
1688 |
+
|
1689 |
+
# Combine the conditions
|
1690 |
+
is_missing_or_empty = is_null | is_empty_str
|
|
|
1691 |
|
1692 |
rows_to_fill_index = df.index[is_missing_or_empty]
|
1693 |
num_needed = len(rows_to_fill_index)
|
1694 |
|
1695 |
if num_needed == 0:
|
1696 |
+
# Ensure final column type is consistent if nothing was done
|
1697 |
+
if pd.api.types.is_object_dtype(original_dtype) or pd.api.types.is_string_dtype(original_dtype):
|
1698 |
+
pass # Likely already object or string
|
1699 |
+
else:
|
1700 |
+
# If original was numeric/etc., but might contain strings now? Unlikely here.
|
1701 |
+
pass # Or convert to object: df[column_name] = df[column_name].astype(object)
|
1702 |
+
# print(f"No missing or empty values found requiring IDs in column '{column_name}'.")
|
1703 |
return df
|
1704 |
|
1705 |
print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
|
1706 |
|
1707 |
# --- Get Existing IDs to Ensure Uniqueness ---
|
1708 |
+
# Consider only rows that are *not* missing/empty
|
1709 |
+
valid_rows = df.loc[~is_missing_or_empty, column_name]
|
1710 |
+
# Drop any remaining nulls (shouldn't be any based on mask, but belts and braces)
|
1711 |
+
valid_rows = valid_rows.dropna()
|
1712 |
+
# Convert to string *only* if not already string/object, then filter out empty strings again
|
1713 |
+
if not pd.api.types.is_object_dtype(valid_rows.dtype) and not pd.api.types.is_string_dtype(valid_rows.dtype):
|
1714 |
+
existing_ids = set(valid_rows.astype(str).str.strip())
|
1715 |
+
else: # Already string or object, just strip and convert to set
|
1716 |
+
existing_ids = set(valid_rows.astype(str).str.strip()) # astype(str) handles mixed types in object column
|
1717 |
+
|
1718 |
+
# Remove empty string from existing IDs if it's there after stripping
|
1719 |
+
existing_ids.discard('')
|
1720 |
|
1721 |
|
1722 |
# --- Generate Unique IDs ---
|
|
|
1726 |
|
1727 |
max_possible_ids = len(character_set) ** length
|
1728 |
if num_needed > max_possible_ids:
|
1729 |
+
raise ValueError(f"Cannot generate {num_needed} unique IDs with length {length}. Maximum possible is {max_possible_ids}.")
|
1730 |
+
|
1731 |
+
# Pre-calculate safety break limit
|
1732 |
+
max_attempts_per_id = max(1000, num_needed * 10) # Adjust multiplier as needed
|
1733 |
|
1734 |
#print(f"Generating {num_needed} unique IDs of length {length}...")
|
1735 |
for i in range(num_needed):
|
1736 |
attempts = 0
|
1737 |
while True:
|
1738 |
candidate_id = ''.join(random.choices(character_set, k=length))
|
1739 |
+
# Check against *all* known existing IDs and *newly* generated ones
|
1740 |
if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
|
1741 |
generated_ids_set.add(candidate_id)
|
1742 |
new_ids_list.append(candidate_id)
|
1743 |
break # Found a unique ID
|
1744 |
attempts += 1
|
1745 |
+
if attempts > max_attempts_per_id : # Safety break
|
1746 |
+
raise RuntimeError(f"Failed to generate a unique ID after {attempts} attempts. Check length, character set, or density of existing IDs.")
|
1747 |
|
1748 |
+
# Optional progress update
|
1749 |
+
# if (i + 1) % 1000 == 0:
|
1750 |
+
# print(f"Generated {i+1}/{num_needed} IDs...")
|
1751 |
|
1752 |
|
1753 |
# --- Assign New IDs ---
|
1754 |
# Use the previously identified index to assign the new IDs correctly
|
1755 |
+
# Assigning string IDs might change the column's dtype to 'object'
|
1756 |
+
if not pd.api.types.is_object_dtype(original_dtype) and not pd.api.types.is_string_dtype(original_dtype):
|
1757 |
+
warnings.warn(f"Column '{column_name}' dtype might change from '{original_dtype}' to 'object' due to string ID assignment.", UserWarning)
|
1758 |
+
|
1759 |
df.loc[rows_to_fill_index, column_name] = new_ids_list
|
1760 |
+
print(f"Successfully assigned {len(new_ids_list)} new unique IDs to column '{column_name}'.")
|
1761 |
+
|
1762 |
+
# Optional: Convert the entire column to string type at the end for consistency
|
1763 |
+
# df[column_name] = df[column_name].astype(str)
|
1764 |
|
|
|
1765 |
return df
|
1766 |
|
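A small sketch of how the revised fill_missing_ids behaves on a frame with a mix of valid, empty and missing IDs (hypothetical data; import path assumed):

```python
import pandas as pd

from tools.file_conversion import fill_missing_ids  # assumed import path

df = pd.DataFrame({
    "label": ["PERSON", "EMAIL", "PHONE"],
    "id": ["abc123def456", "", None],  # one valid ID, one empty string, one missing value
})

df = fill_missing_ids(df)
# The empty string and the None are replaced with fresh 12-character alphanumeric IDs that
# cannot collide with the existing "abc123def456"; the valid ID is left untouched.
print(df["id"].tolist())
```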
1767 |
+
def convert_review_df_to_annotation_json(
|
1768 |
+
review_file_df: pd.DataFrame,
|
1769 |
+
image_paths: List[str], # List of image file paths
|
1770 |
+
page_sizes: List[Dict], # List of dicts like [{'page': 1, 'image_path': '...', 'image_width': W, 'image_height': H}, ...]
|
1771 |
+
xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax" # Coordinate column names
|
1772 |
+
) -> List[Dict]:
|
1773 |
+
"""
|
1774 |
+
Optimized function to convert review DataFrame to Gradio Annotation JSON format.
|
1775 |
|
1776 |
+
Ensures absolute coordinates, handles missing IDs, deduplicates based on key fields,
|
1777 |
+
selects final columns, and structures data per image/page based on page_sizes.
|
|
|
1778 |
|
1779 |
+
Args:
|
1780 |
+
review_file_df: Input DataFrame with annotation data.
|
1781 |
+
image_paths: List of image file paths (Note: currently unused if page_sizes provides paths).
|
1782 |
+
page_sizes: REQUIRED list of dictionaries, each containing 'page',
|
1783 |
+
'image_path', 'image_width', and 'image_height'. Defines
|
1784 |
+
output structure and dimensions for coordinate conversion.
|
1785 |
+
xmin, xmax, ymin, ymax: Names of the coordinate columns.
|
|
|
|
|
1786 |
|
1787 |
+
Returns:
|
1788 |
+
List of dictionaries suitable for Gradio Annotation output, one dict per image/page.
|
1789 |
+
"""
|
1790 |
+
if not page_sizes:
|
1791 |
+
raise ValueError("page_sizes argument is required and cannot be empty.")
|
1792 |
|
1793 |
+
# --- Prepare Page Sizes DataFrame ---
|
1794 |
+
try:
|
1795 |
+
page_sizes_df = pd.DataFrame(page_sizes)
|
1796 |
+
required_ps_cols = {'page', 'image_path', 'image_width', 'image_height'}
|
1797 |
+
if not required_ps_cols.issubset(page_sizes_df.columns):
|
1798 |
+
missing = required_ps_cols - set(page_sizes_df.columns)
|
1799 |
+
raise ValueError(f"page_sizes is missing required keys: {missing}")
|
1800 |
+
# Convert page sizes columns to appropriate numeric types early
|
1801 |
+
page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
|
1802 |
+
page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
|
1803 |
+
page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
|
1804 |
+
# Use nullable Int64 for page number consistency
|
1805 |
+
page_sizes_df['page'] = page_sizes_df['page'].astype('Int64')
|
1806 |
|
1807 |
+
except Exception as e:
|
1808 |
+
raise ValueError(f"Error processing page_sizes: {e}") from e
|
|
|
1809 |
|
|
|
1810 |
|
1811 |
+
# Handle empty input DataFrame gracefully
|
1812 |
+
if review_file_df.empty:
|
1813 |
+
print("Input review_file_df is empty. Proceeding to generate JSON structure with empty boxes.")
|
1814 |
+
# Ensure essential columns exist even if empty for later steps
|
1815 |
+
for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
|
1816 |
+
if col not in review_file_df.columns:
|
1817 |
+
review_file_df[col] = pd.NA
|
1818 |
+
else:
|
1819 |
+
# --- Coordinate Conversion (if needed) ---
|
1820 |
+
coord_cols_to_check = [c for c in [xmin, xmax, ymin, ymax] if c in review_file_df.columns]
|
1821 |
+
needs_multiplication = False
|
1822 |
+
if coord_cols_to_check:
|
1823 |
+
temp_df_numeric = review_file_df[coord_cols_to_check].apply(pd.to_numeric, errors='coerce')
|
1824 |
+
if temp_df_numeric.le(1).any().any(): # Check if any numeric coord <= 1 exists
|
1825 |
+
needs_multiplication = True
|
1826 |
+
|
1827 |
+
if needs_multiplication:
|
1828 |
+
#print("Relative coordinates detected or suspected, running multiplication...")
|
1829 |
+
review_file_df = multiply_coordinates_by_page_sizes(
|
1830 |
+
review_file_df.copy(), # Pass a copy to avoid modifying original outside function
|
1831 |
+
page_sizes_df,
|
1832 |
+
xmin, xmax, ymin, ymax
|
1833 |
+
)
|
1834 |
+
else:
|
1835 |
+
#print("No relative coordinates detected or required columns missing, skipping multiplication.")
|
1836 |
+
# Still ensure essential coordinate/page columns are numeric if they exist
|
1837 |
+
cols_to_convert = [c for c in [xmin, xmax, ymin, ymax, "page"] if c in review_file_df.columns]
|
1838 |
+
for col in cols_to_convert:
|
1839 |
+
review_file_df[col] = pd.to_numeric(review_file_df[col], errors='coerce')
|
1840 |
|
1841 |
+
# Handle potential case where multiplication returns an empty DF
|
1842 |
+
if review_file_df.empty:
|
1843 |
+
print("DataFrame became empty after coordinate processing.")
|
1844 |
+
# Re-add essential columns if they were lost
|
1845 |
+
for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
|
1846 |
+
if col not in review_file_df.columns:
|
1847 |
+
review_file_df[col] = pd.NA
|
1848 |
+
|
1849 |
+
# --- Fill Missing IDs ---
|
1850 |
+
review_file_df = fill_missing_ids(review_file_df.copy()) # Pass a copy
|
1851 |
+
|
1852 |
+
# --- Deduplicate Based on Key Fields ---
|
1853 |
+
base_dedupe_cols = ["page", xmin, ymin, xmax, ymax, "label", "id"]
|
1854 |
+
# Identify which deduplication columns actually exist in the DataFrame
|
1855 |
+
cols_for_dedupe = [col for col in base_dedupe_cols if col in review_file_df.columns]
|
1856 |
+
# Add 'image' column for deduplication IF it exists (matches original logic intent)
|
1857 |
+
if "image" in review_file_df.columns:
|
1858 |
+
cols_for_dedupe.append("image")
|
1859 |
+
|
1860 |
+
# Ensure placeholder columns exist if they are needed for deduplication
|
1861 |
+
# (e.g., 'label', 'id' should be present after fill_missing_ids)
|
1862 |
+
for col in ['label', 'id']:
|
1863 |
+
if col in cols_for_dedupe and col not in review_file_df.columns:
|
1864 |
+
# This might indicate an issue in fill_missing_ids or prior steps
|
1865 |
+
print(f"Warning: Column '{col}' needed for dedupe but not found. Adding NA.")
|
1866 |
+
review_file_df[col] = "" # Add default empty string
|
1867 |
+
|
1868 |
+
if cols_for_dedupe: # Only attempt dedupe if we have columns to check
|
1869 |
+
#print(f"Deduplicating based on columns: {cols_for_dedupe}")
|
1870 |
+
# Convert relevant columns to string before dedupe to avoid type issues with mixed data (optional, depends on data)
|
1871 |
+
# for col in cols_for_dedupe:
|
1872 |
+
# review_file_df[col] = review_file_df[col].astype(str)
|
1873 |
+
review_file_df = review_file_df.drop_duplicates(subset=cols_for_dedupe)
|
1874 |
else:
|
1875 |
+
print("Skipping deduplication: No valid columns found to deduplicate by.")
|
1876 |
+
|
1877 |
+
|
1878 |
+
# --- Select and Prepare Final Output Columns ---
|
1879 |
+
required_final_cols = ["page", "label", "color", xmin, ymin, xmax, ymax, "id", "text"]
|
1880 |
+
# Identify which of the desired final columns exist in the (now potentially deduplicated) DataFrame
|
1881 |
+
available_final_cols = [col for col in required_final_cols if col in review_file_df.columns]
|
1882 |
+
|
1883 |
+
    # Ensure essential output columns exist, adding defaults if missing AFTER deduplication
    for col in required_final_cols:
        if col not in review_file_df.columns:
            print(f"Adding missing final column '{col}' with default value.")
            if col in ['label', 'id', 'text']:
                review_file_df[col] = "" # Default empty string
            elif col == 'color':
                review_file_df[col] = None # Default None or a default color tuple
            else: # page, coordinates
                review_file_df[col] = pd.NA # Default NA for numeric/page
            available_final_cols.append(col) # Add to list of available columns

    # Select only the final desired columns in the correct order
    review_file_df = review_file_df[available_final_cols]

    # --- Final Formatting ---
    if not review_file_df.empty:
        # Convert list colors to tuples (important for some downstream uses)
        if 'color' in review_file_df.columns:
            review_file_df['color'] = review_file_df['color'].apply(
                lambda x: tuple(x) if isinstance(x, list) else x
            )
        # Ensure page column is nullable integer type for reliable grouping
        if 'page' in review_file_df.columns:
            review_file_df['page'] = review_file_df['page'].astype('Int64')

    # --- Group Annotations by Page ---
    if 'page' in review_file_df.columns:
        grouped_annotations = review_file_df.groupby('page')
        group_keys = set(grouped_annotations.groups.keys()) # Use set for faster lookups
    else:
        # Cannot group if page column is missing
        print("Error: 'page' column missing, cannot group annotations.")
        grouped_annotations = None
        group_keys = set()

    # --- Build JSON Structure ---
    json_data = []
    output_cols_for_boxes = [col for col in ["label", "color", xmin, ymin, xmax, ymax, "id", "text"] if col in review_file_df.columns]

    # Iterate through page_sizes_df to define the structure (one entry per image path)
    for _, row in page_sizes_df.iterrows():
        page_num = row['page'] # Already Int64
        pdf_image_path = row['image_path']
        annotation_boxes = [] # Default to empty list

        # Check if the page exists in the grouped annotations (using the faster set lookup)
        # Check pd.notna because page_num could be <NA> if conversion failed
        if pd.notna(page_num) and page_num in group_keys and grouped_annotations:
            try:
                page_group_df = grouped_annotations.get_group(page_num)
                # Convert the group to a list of dicts, selecting only the needed box properties
                # Handle potential NaN coordinates before conversion to JSON
                annotation_boxes = page_group_df[output_cols_for_boxes].replace({np.nan: None}).to_dict(orient='records')

                # Optional: round coordinates here if needed AFTER potential multiplication
                # for box in annotation_boxes:
                #     for coord in [xmin, ymin, xmax, ymax]:
                #         if coord in box and box[coord] is not None:
                #             box[coord] = round(float(box[coord]), 2) # Example: round to 2 decimals

            except KeyError:
                print(f"Warning: Group key {page_num} not found despite being in group_keys (should not happen).")
                annotation_boxes = [] # Keep empty

        # Append the structured data for this image/page
        json_data.append({
            "image": pdf_image_path,
            "boxes": annotation_boxes
        })

    return json_data
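For orientation, the loop above yields one entry per page image, pairing the image path with that page's (possibly empty) list of box dictionaries. A minimal sketch of the expected shape, with purely invented paths and values:

# Hypothetical illustration of the json_data structure built above (values are made up)
json_data = [
    {
        "image": "output/example_complaint_letter_1.png",
        "boxes": [
            {
                "label": "PERSON",
                "color": (0, 0, 0),
                "xmin": 0.12, "ymin": 0.30, "xmax": 0.35, "ymax": 0.33,
                "id": "a1b2c3d4",
                "text": "John Smith",
            }
        ],
    },
    {"image": "output/example_complaint_letter_2.png", "boxes": []},  # page with no redactions
]
print(json_data[0]["boxes"][0]["label"])  # "PERSON"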
tools/file_redaction.py
CHANGED
@@ -258,8 +258,7 @@ def choose_and_run_redactor(file_paths:List[str],


    # Call prepare_image_or_pdf only if needed
-   if prepare_images_flag is not None
-   #print("Calling preparation function. prepare_images_flag:", prepare_images_flag)
+   if prepare_images_flag is not None:
    out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df = prepare_image_or_pdf(
        file_paths_loop, text_extraction_method, 0, out_message, True,
        annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
@@ -333,7 +332,7 @@ def choose_and_run_redactor(file_paths:List[str],
    # Try to connect to AWS services directly only if RUN_AWS_FUNCTIONS environmental variable is 1, otherwise an environment variable or direct textbox input is needed.
    if pii_identification_method == aws_pii_detector:
        if aws_access_key_textbox and aws_secret_key_textbox:
-           print("Connecting to Comprehend using AWS access key and secret keys from
+           print("Connecting to Comprehend using AWS access key and secret keys from user input.")
            comprehend_client = boto3.client('comprehend',
                aws_access_key_id=aws_access_key_textbox,
                aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
@@ -356,7 +355,7 @@ def choose_and_run_redactor(file_paths:List[str],
    # Try to connect to AWS Textract Client if using that text extraction method
    if text_extraction_method == textract_option:
        if aws_access_key_textbox and aws_secret_key_textbox:
-           print("Connecting to Textract using AWS access key and secret keys from
+           print("Connecting to Textract using AWS access key and secret keys from user input.")
            textract_client = boto3.client('textract',
                aws_access_key_id=aws_access_key_textbox,
                aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
@@ -401,7 +400,7 @@ def choose_and_run_redactor(file_paths:List[str],
    is_a_pdf = is_pdf(file_path) == True
    if is_a_pdf == False and text_extraction_method == text_ocr_option:
        # If user has not submitted a pdf, assume it's an image
-       print("File is not a
+       print("File is not a PDF, assuming that image analysis needs to be used.")
        text_extraction_method = tesseract_ocr_option
    else:
        out_message = "No file selected"
@@ -862,17 +861,6 @@ def convert_pikepdf_annotations_to_result_annotation_box(page:Page, annot:dict,

    rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)

-   # if image or image_dimensions:
-   #     print("Dividing result by image coordinates")
-
-   #     image_x1, image_y1, image_x2, image_y2 = convert_pymupdf_to_image_coords(page, pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2, image, image_dimensions=image_dimensions)
-
-   #     img_annotation_box["xmin"] = image_x1
-   #     img_annotation_box["ymin"] = image_y1
-   #     img_annotation_box["xmax"] = image_x2
-   #     img_annotation_box["ymax"] = image_y2
-
-   # else:
    convert_df = pd.DataFrame({
        "page": [page_no],
        "xmin": [pymupdf_x1],
@@ -1016,9 +1004,6 @@ def redact_page_with_pymupdf(page:Page, page_annotations:dict, image:Image=None,

    img_annotation_box = fill_missing_box_ids(img_annotation_box)

-   #print("image_dimensions:", image_dimensions)
-   #print("annot:", annot)
-
    all_image_annotation_boxes.append(img_annotation_box)

    # Redact the annotations from the document
@@ -1285,8 +1270,6 @@ def redact_image_pdf(file_path:str,
    page_handwriting_recogniser_results = []
    page_break_return = False
    reported_page_number = str(page_no + 1)
-
-   #print("page_sizes_df for row:", page_sizes_df.loc[page_sizes_df["page"] == (page_no + 1)])

    # Try to find image location
    try:
@@ -1328,7 +1311,7 @@ def redact_image_pdf(file_path:str,

    # Step 1: Perform OCR. Either with Tesseract, or with AWS Textract

-   # If using Tesseract
+   # If using Tesseract
    if text_extraction_method == tesseract_ocr_option:
        #print("image_path:", image_path)
        #print("print(type(image_path)):", print(type(image_path)))
@@ -1449,7 +1432,6 @@ def redact_image_pdf(file_path:str,
    # Assume image_path is an image
    image = image_path

-   print("image:", image)

    fill = (0, 0, 0) # Fill colour for redactions
    draw = ImageDraw.Draw(image)
@@ -1631,8 +1613,6 @@ def get_text_container_characters(text_container:LTTextContainer):
        for line in text_container
        if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
        for char in line]
-
-   #print("Initial characters:", characters)

        return characters
    return []
@@ -1762,9 +1742,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
    analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
    analysed_bounding_boxes_df_new['page'] = page_num + 1

-   #analysed_bounding_boxes_df_new = fill_missing_ids(analysed_bounding_boxes_df_new)
-   analysed_bounding_boxes_df_new.to_csv("output/analysed_bounding_boxes_df_new_with_ids.csv")
-
    decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)

    return decision_process_table
@@ -1772,7 +1749,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
    pikepdf_redaction_annotations_on_page = []
    for analysed_bounding_box in analysed_bounding_boxes:
-       #print("analysed_bounding_box:", analysed_bounding_boxes)

        bounding_box = analysed_bounding_box["boundingBox"]
        annotation = Dictionary(
@@ -1997,7 +1973,6 @@ def redact_text_pdf(
        pass
        #print("Not redacting page:", page_no)

-   #print("page_image_annotations after page", reported_page_number, "are", page_image_annotations)

    # Join extracted text outputs for all lines together
    if not page_text_ocr_outputs.empty:
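The Comprehend and Textract hunks above prefer keys typed into the app's textboxes and otherwise fall back to the standard AWS credential chain. A rough, self-contained sketch of that connection pattern, assuming boto3 is installed; the region value and helper name here are illustrative, not the app's own:

import boto3

AWS_REGION = "eu-west-2"  # assumed region for illustration

def build_comprehend_client(access_key: str = "", secret_key: str = ""):
    # Prefer keys supplied by the user in the UI; otherwise rely on the
    # default credential chain (environment variables, instance role, etc.)
    if access_key and secret_key:
        return boto3.client("comprehend",
                            aws_access_key_id=access_key,
                            aws_secret_access_key=secret_key,
                            region_name=AWS_REGION)
    return boto3.client("comprehend", region_name=AWS_REGION)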
tools/redaction_review.py
CHANGED
@@ -6,12 +6,11 @@ import numpy as np
 from xml.etree.ElementTree import Element, SubElement, tostring, parse
 from xml.dom import minidom
 import uuid
-from typing import List
+from typing import List, Tuple
 from gradio_image_annotation import image_annotator
 from gradio_image_annotation.image_annotator import AnnotatedImageData
 from pymupdf import Document, Rect
 import pymupdf
-#from fitz
 from PIL import ImageDraw, Image

 from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
@@ -55,7 +54,6 @@ def update_zoom(current_zoom_level:int, annotate_current_page:int, decrease:bool

     return current_zoom_level, annotate_current_page

-
 def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
     '''
     Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
@@ -166,49 +164,205 @@ def update_recogniser_dataframes(page_image_annotator_object:AnnotatedImageData,

     return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop

-def undo_last_removal(backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base):
+def undo_last_removal(backup_review_state:pd.DataFrame, backup_image_annotations_state:list[dict], backup_recogniser_entity_dataframe_base:pd.DataFrame):
     return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base

-def update_annotator_page_from_review_df(
-    '''
-    Update the visible annotation object with the latest review file information
-    '''
-    if not review_df.empty:
-        if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
-        elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
-        else:
-            gradio_annotator_current_page_number = 0
-            page_num_reported = 1
-
-        page_max_reported = len(out_image_annotations_state)
-        if page_num_reported > page_max_reported: page_num_reported = page_max_reported

 def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
                                           selected_rows_df: pd.DataFrame,
@@ -216,7 +370,7 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
     page_sizes:List[dict],
     image_annotations_state:dict,
     recogniser_entity_dataframe_base:pd.DataFrame):
-    '''
+    '''
     Remove selected items from the review dataframe from the annotation object and review dataframe.
     '''
@@ -253,149 +407,267 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,

     return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base

-def update_annotator_object_and_filter_df(
-        text_dropdown_value:str="ALL",
-        recogniser_dataframe_base:gr.Dataframe=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}), type="pandas", headers=["page", "label", "text", "id"], show_fullscreen_button=True, wrap=True, show_search='filter', max_height=400, static_columns=[0,1,2,3]),
-        zoom:int=100,
-        review_df:pd.DataFrame=[],
-        page_sizes:List[dict]=[],
-        doc_full_file_name_textbox:str='',
-        input_folder:str=INPUT_FOLDER):
-    '''
-    Update a gradio_image_annotation object with new annotation data.
-    '''
-    zoom_str = str(zoom) + '%'
-
-    #print("all_image_annotations at start of update_annotator_object_and_filter_df[-1]:", all_image_annotations[-1])
-
-    if not gradio_annotator_current_page_number: gradio_annotator_current_page_number = 0
-
-    # Check bounding values for current page and page max
-    if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
-    elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
-    else:
-        gradio_annotator_current_page_number = 0
-        page_num_reported = 1
-
-    page_sizes_df = pd.DataFrame(page_sizes)
-
-        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
-        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
-        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
-
-    else:
-        if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
-            width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
-            height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
-        else:
-            image = Image.open(current_image_path)
-            width = image.width
-            height = image.height

     page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
     page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height

-    #print("all_image_annotations_df[-1] just before creating annotation dicts:", all_image_annotations_df.iloc[-1, :])
-
-    #print("all_image_annotations[-1] after creating annotation dicts:", all_image_annotations[-1])
-
-    current_page_image_annotator_object = all_image_annotations[page_num_reported_zero_indexed]
-
-    page_number_reported_gradio = gr.Number(label = "Current page", value=page_num_reported, precision=0)
-
-    ###
-    # If no data, present a blank page
-    if not all_image_annotations:
-        print("No all_image_annotation object found")
-        page_num_reported = 1
-
-            height=zoom_str,
-            width=zoom_str,
-            box_min_size=1,
-            box_selected_thickness=2,
-            handle_size=4,
-            sources=None,#["upload"],
-            show_clear_button=False,
-            show_share_button=False,
-            show_remove_button=False,
-            handles_cursor=True,
-            interactive=True,
-            use_default_label=True
-        )
-
-        return out_image_annotator, page_number_reported_gradio, page_number_reported_gradio, page_num_reported, recogniser_entities_dropdown_value, recogniser_dataframe_out_gr, recogniser_dataframe_modified, text_entities_drop, page_entities_drop, page_sizes, all_image_annotations
-
     else:
-        ### Present image_annotator outputs
         out_image_annotator = image_annotator(
             value = current_page_image_annotator_object,
             boxes_alpha=0.1,
             box_thickness=1,
-            label_list=recogniser_entities_list,
             label_colors=recogniser_colour_list,
             show_label=False,
             height=zoom_str,
@@ -408,41 +680,23 @@ def update_annotator_object_and_filter_df(
             show_share_button=False,
             show_remove_button=False,
             handles_cursor=True,
-            interactive=True
         )

-    return out_image_annotator,
-    page_zero_index = page - 1
-
-    if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
-        page_sizes_df = pd.DataFrame(page_sizes)
-        page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
-
-        # Check for matching pages
-        matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
-
-        if matching_paths.size > 0:
-            image_path = matching_paths[0]
-            page_image_annotator_object['image'] = image_path
-            all_image_annotations[page_zero_index]["image"] = image_path
-        else:
-            print(f"No image path found for page {page}.")
-
-    return page_image_annotator_object, all_image_annotations

 def update_all_page_annotation_object_based_on_previous_page(
     page_image_annotator_object:AnnotatedImageData,
@@ -459,12 +713,9 @@ def update_all_page_annotation_object_based_on_previous_page(
     previous_page_zero_index = previous_page -1

     if not current_page: current_page = 1
-
-    page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)
-
-    #print("page_image_annotator_object after replace_images in update_all_page_annotation_object:", page_image_annotator_object)
+
+    # This replaces the numpy array image object with the image file path
+    page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)

     if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
     else: all_image_annotations[previous_page_zero_index]["boxes"] = []
@@ -493,7 +744,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
     page_image_annotator_object = all_image_annotations[current_page - 1]

     # This replaces the numpy array image object with the image file path
-    page_image_annotator_object, all_image_annotations =
     page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]

     if not page_image_annotator_object:
@@ -529,7 +780,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
     # Check if all elements are integers in the range 0-255
     if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
         pass
-
     else:
         print(f"Invalid color values: {fill}. Defaulting to black.")
         fill = (0, 0, 0) # Default to black if invalid
@@ -553,7 +804,6 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
         doc = [image]

     elif file_extension in '.csv':
-        #print("This is a csv")
         pdf_doc = []

     # If working with pdfs
@@ -797,11 +1047,9 @@ def df_select_callback(df: pd.DataFrame, evt: gr.SelectData):

     row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})

-    return

 def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):
-
-    #print("evt.data:", evt._data)

     row_value_job_id = evt.row_value[0] # This is the page number value
     # row_value_label = evt.row_value[1] # This is the label number value
@@ -829,59 +1077,108 @@ def df_select_callback_ocr(df: pd.DataFrame, evt: gr.SelectData):

     return row_value_page, row_value_df

-def update_selected_review_df_row_colour(
     '''
     Update the colour of a single redaction box based on the values in a selection row
     '''
-    colour_tuple = str(tuple(colour))

     if "id" not in review_df.columns:

-    # Reset existing highlight colours
-    review_df.loc[review_df["id"]==previous_id, "color"] = review_df.loc[review_df["id"]==previous_id, "color"].apply(lambda _: previous_colour)
-    review_df.loc[review_df["color"].astype(str)==colour, "color"] = review_df.loc[review_df["color"].astype(str)==colour, "color"].apply(lambda _: '(0, 0, 0)')

     if not redaction_row_selection.empty and not review_df.empty:
         use_id = (
-            "id" in redaction_row_selection.columns
-            and "id" in review_df.columns
-            and not redaction_row_selection["id"].isnull().all()
             and not review_df["id"].isnull().all()
         )

-        selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]

-        if not filtered_reviews.empty:
-            previous_colour = str(filtered_reviews["color"].values[0])
-            previous_id = filtered_reviews["id"].values[0]
-            review_df.loc[review_df["_merge"]=="both", "color"] = review_df.loc[review_df["_merge"] == "both", "color"].apply(lambda _: colour)
     else:
-        previous_id =''

-    review_df.drop("_merge", axis=1, inplace=True)

-    # Ensure
-    #print("review_df after divide:", review_df)

-    review_df = review_df[["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]]

     return review_df, previous_id, previous_colour
@@ -988,8 +1285,6 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
     page_sizes_df = pd.DataFrame(page_sizes)

     # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
-    #print("Using pymupdf coordinates for conversion.")
-
     pages_are_images = False

     if "mediabox_width" not in review_file_df.columns:
@@ -1041,33 +1336,9 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
             raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
         else:
             print("Document cropboxes not found.")
-

     pdf_page_height = pymupdf_page.mediabox.height
-    pdf_page_width = pymupdf_page.mediabox.width
-
-    # Check if image dimensions for page exist in page_sizes_df
-    # image_dimensions = {}
-
-    # image_dimensions['image_width'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
-    # image_dimensions['image_height'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
-
-    # if pd.isna(image_dimensions['image_width']):
-    #     image_dimensions = {}
-
-    # image = image_paths[page_python_format]
-
-    # if image_dimensions:
-    #     image_page_width, image_page_height = image_dimensions["image_width"], image_dimensions["image_height"]
-    # if isinstance(image, str) and 'placeholder' not in image:
-    #     image = Image.open(image)
-    #     image_page_width, image_page_height = image.size
-    # else:
-    #     try:
-    #         image = Image.open(image)
-    #         image_page_width, image_page_height = image.size
-    #     except Exception as e:
-    #         print("Could not get image sizes due to:", e)

     # Create redaction annotation
     redact_annot = SubElement(annots, 'redact')
@@ -1345,8 +1616,6 @@ def convert_xfdf_to_dataframe(file_paths_list:List[str], pymupdf_doc, image_path
     # Optionally, you can add the image path or other relevant information
     df.loc[_, 'image'] = image_path

-    #print('row:', row)
-
     out_file_path = output_folder + file_path_name + "_review_file.csv"
     df.to_csv(out_file_path, index=None)
def update_annotator_page_from_review_df(
        review_df: pd.DataFrame,
        image_file_paths:List[str], # Note: This input doesn't seem used in the original logic flow after the first line was removed
        page_sizes:List[dict],
        current_image_annotations_state:List[str], # This should ideally be List[dict] based on its usage
        current_page_annotator:object, # Should be dict or a custom annotation object for one page
        selected_recogniser_entity_df_row:pd.DataFrame,
        input_folder:str,
        doc_full_file_name_textbox:str
    ) -> Tuple[object, List[dict], int, List[dict], pd.DataFrame, int]: # Correcting return types based on usage
    '''
    Update the visible annotation object and related objects with the latest review file information,
    optimizing by processing only the current page's data.
    '''
    # Assume current_image_annotations_state is List[dict] and current_page_annotator is dict
    out_image_annotations_state: List[dict] = list(current_image_annotations_state) # Make a copy to avoid modifying input in place
    out_current_page_annotator: dict = current_page_annotator

    # Get the target page number from the selected row
    # Safely access the page number, handling potential errors or empty DataFrame
    gradio_annotator_current_page_number: int = 0
    annotate_previous_page: int = 0 # Renaming for clarity if needed, matches original output
    if not selected_recogniser_entity_df_row.empty and 'page' in selected_recogniser_entity_df_row.columns:
        try:
            # Use .iloc[0] and .item() for robust scalar extraction
            gradio_annotator_current_page_number = int(selected_recogniser_entity_df_row['page'].iloc[0])
            annotate_previous_page = gradio_annotator_current_page_number # Store original page number
        except (IndexError, ValueError, TypeError):
            print("Warning: Could not extract valid page number from selected_recogniser_entity_df_row. Defaulting to page 0 (or 1).")
            gradio_annotator_current_page_number = 1 # Or 0 depending on 1-based vs 0-based indexing elsewhere

    # Ensure page number is valid and 1-based for external display/logic
    if gradio_annotator_current_page_number <= 0:
        gradio_annotator_current_page_number = 1

    page_max_reported = len(out_image_annotations_state)
    if gradio_annotator_current_page_number > page_max_reported:
        gradio_annotator_current_page_number = page_max_reported # Cap at max pages

    page_num_reported_zero_indexed = gradio_annotator_current_page_number - 1

    # Process page sizes DataFrame early, as it's needed for image path handling and potentially coordinate multiplication
    page_sizes_df = pd.DataFrame(page_sizes)
    if not page_sizes_df.empty:
        # Safely convert page column to numeric and then int
        page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
        page_sizes_df.dropna(subset=["page"], inplace=True)
        if not page_sizes_df.empty:
            page_sizes_df["page"] = page_sizes_df["page"].astype(int)
        else:
            print("Warning: Page sizes DataFrame became empty after processing.")

    # --- OPTIMIZATION: Process only the current page's data from review_df ---
    if not review_df.empty:
        # Filter review_df for the current page
        # Ensure 'page' column in review_df is comparable to page_num_reported
        if 'page' in review_df.columns:
            review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)

            current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']

            replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(doc_full_file_name_textbox, current_image_path, page_sizes_df, gradio_annotator_current_page_number, input_folder)

            # page_sizes_df has been changed - save back to page_sizes_object
            page_sizes = page_sizes_df.to_dict(orient='records')
            review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
            images_list = list(page_sizes_df["image_path"])
            images_list[page_num_reported_zero_indexed] = replaced_image_path
            out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path

            current_page_review_df = review_df[review_df['page'] == gradio_annotator_current_page_number].copy()
            current_page_review_df = multiply_coordinates_by_page_sizes(current_page_review_df, page_sizes_df)

        else:
            print(f"Warning: 'page' column not found in review_df. Cannot filter for page {gradio_annotator_current_page_number}. Skipping update from review_df.")
            current_page_review_df = pd.DataFrame() # Empty dataframe if filter fails

        if not current_page_review_df.empty:
            # Convert the current page's review data to annotation list format for *this page*

            current_page_annotations_list = []
            # Define expected annotation dict keys, including 'image', 'page', coords, 'label', 'text', 'color' etc.
            # Assuming review_df has compatible columns
            expected_annotation_keys = ['label', 'color', 'xmin', 'ymin', 'xmax', 'ymax', 'text', 'id'] # Add/remove as needed

            # Ensure necessary columns exist in current_page_review_df before converting rows
            for key in expected_annotation_keys:
                if key not in current_page_review_df.columns:
                    # Add missing column with default value
                    # Use np.nan for numeric, '' for string/object
                    default_value = np.nan if key in ['xmin', 'ymin', 'xmax', 'ymax'] else ''
                    current_page_review_df[key] = default_value

            # Convert filtered DataFrame rows to list of dicts
            # Using .to_dict(orient='records') is efficient for this
            current_page_annotations_list_raw = current_page_review_df[expected_annotation_keys].to_dict(orient='records')

            current_page_annotations_list = current_page_annotations_list_raw

            # Update the annotations state for the current page
            # Each entry in out_image_annotations_state seems to be a dict containing keys like 'image', 'page', 'annotations' (List[dict])
            # Need to update the 'annotations' list for the specific page.
            # Find the entry for the current page in the state
            page_state_entry_found = False
            for i, page_state_entry in enumerate(out_image_annotations_state):
                # Assuming page_state_entry has a 'page' key (1-based)

                match = re.search(r"(\d+)\.png$", page_state_entry['image'])
                if match: page_no = int(match.group(1))
                else: page_no = -1

                if 'image' in page_state_entry and page_no == page_num_reported_zero_indexed:
                    # Replace the annotations list for this page with the new list from review_df
                    out_image_annotations_state[i]['boxes'] = current_page_annotations_list

                    # Update the image path as well, based on review_df if available, or keep existing
                    # Assuming review_df has an 'image' column for this page
                    if 'image' in current_page_review_df.columns and not current_page_review_df.empty:
                        # Use the image path from the first row of the filtered review_df
                        out_image_annotations_state[i]['image'] = current_page_review_df['image'].iloc[0]
                    page_state_entry_found = True
                    break

            if not page_state_entry_found:
                # This scenario might happen if the current_image_annotations_state didn't initially contain
                # an entry for this page number. Depending on the application logic, you might need to
                # add a new entry here, but based on the original code's structure, it seems
                # out_image_annotations_state is pre-populated for all pages.
                print(f"Warning: Entry for page {gradio_annotator_current_page_number} not found in current_image_annotations_state. Cannot update page annotations.")

    # --- Image Path and Page Size Handling (already seems focused on current page, keep similar logic) ---
    # Get the image path for the current page from the updated state
    # Ensure the entry exists before accessing
    current_image_path = None
    if len(out_image_annotations_state) > page_num_reported_zero_indexed and 'image' in out_image_annotations_state[page_num_reported_zero_indexed]:
        current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']
    else:
        print(f"Warning: Could not get image path from state for page index {page_num_reported_zero_indexed}.")

    # Replace placeholder image with real image path if needed
    if current_image_path and not page_sizes_df.empty:
        try:
            replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
                doc_full_file_name_textbox, current_image_path, page_sizes_df,
                gradio_annotator_current_page_number, input_folder # Use 1-based page number
            )

            # Update state and review_df with the potentially replaced image path
            if len(out_image_annotations_state) > page_num_reported_zero_indexed:
                out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path

            if 'page' in review_df.columns and 'image' in review_df.columns:
                review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path

        except Exception as e:
            print(f"Error during image path replacement for page {gradio_annotator_current_page_number}: {e}")

    # Save back page_sizes_df to page_sizes list format
    if not page_sizes_df.empty:
        page_sizes = page_sizes_df.to_dict(orient='records')
    else:
        page_sizes = [] # Ensure page_sizes is a list if df is empty

    # --- Re-evaluate Coordinate Multiplication and Duplicate Removal ---
    # The original code multiplied coordinates for the *entire* document and removed duplicates
    # across the *entire* document *after* converting the full review_df to state.
    # With the optimized approach, we updated only one page's annotations in the state.

    # Let's assume remove_duplicate_images_with_blank_boxes expects the raw list of dicts state format:
    try:
        out_image_annotations_state = remove_duplicate_images_with_blank_boxes(out_image_annotations_state)
    except Exception as e:
        print(f"Error during duplicate removal: {e}. Proceeding without duplicate removal.")

    # Select the current page's annotation object from the (potentially updated) state
    if len(out_image_annotations_state) > page_num_reported_zero_indexed:
        out_current_page_annotator = out_image_annotations_state[page_num_reported_zero_indexed]
    else:
        print(f"Warning: Cannot select current page annotator object for index {page_num_reported_zero_indexed}.")
        out_current_page_annotator = {} # Or None, depending on expected output type

    # The original code returns gradio_annotator_current_page_number as the 3rd value,
    # which was potentially updated by bounding checks. Keep this.
    final_page_number_returned = gradio_annotator_current_page_number

    return (out_current_page_annotator,
            out_image_annotations_state,
            final_page_number_returned,
            page_sizes,
            review_df, # review_df might have its 'page' column type changed, keep it as is or revert if necessary
            annotate_previous_page) # The original page number from selected_recogniser_entity_df_row
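The function above makes review navigation faster by touching only the selected page rather than rebuilding the whole annotation state. A stripped-down sketch of that core step, filtering a review table to one page and reshaping its rows into annotator box dictionaries; the column names match the review file format, while the function name and values here are purely illustrative:

import pandas as pd

def boxes_for_page(review_df: pd.DataFrame, page_number: int) -> list[dict]:
    # Keep only the rows that belong to the requested (1-based) page
    page_df = review_df[pd.to_numeric(review_df["page"], errors="coerce") == page_number]
    # Reshape each row into the box dict shape used by the annotator state
    cols = ["label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]
    return page_df[cols].to_dict(orient="records")

example = pd.DataFrame({
    "page": [1, 1, 2],
    "label": ["PERSON", "EMAIL_ADDRESS", "PERSON"],
    "color": ["(0, 0, 0)"] * 3,
    "xmin": [0.1, 0.4, 0.2], "ymin": [0.2, 0.5, 0.3],
    "xmax": [0.3, 0.6, 0.4], "ymax": [0.25, 0.55, 0.35],
    "text": ["John Smith", "john@example.com", "Jane Doe"],
    "id": ["a1", "b2", "c3"],
})
print(boxes_for_page(example, 1))  # two boxes for page 1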
def replace_annotator_object_img_np_array_with_page_sizes_image_path(
|
411 |
+
all_image_annotations:List[dict],
|
412 |
+
page_image_annotator_object:AnnotatedImageData,
|
413 |
+
page_sizes:List[dict],
|
414 |
+
page:int):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
415 |
|
416 |
+
'''
|
417 |
+
Check if the image value in an AnnotatedImageData dict is a placeholder or np.array. If either of these, replace the value with the file path of the image that is hopefully already loaded into the app related to this page.
|
418 |
+
'''
|
419 |
|
420 |
+
page_zero_index = page - 1
|
421 |
+
|
422 |
+
if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
|
423 |
+
page_sizes_df = pd.DataFrame(page_sizes)
|
424 |
+
page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
|
425 |
|
426 |
+
# Check for matching pages
|
427 |
+
matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
|
428 |
|
429 |
+
if matching_paths.size > 0:
|
430 |
+
image_path = matching_paths[0]
|
431 |
+
page_image_annotator_object['image'] = image_path
|
432 |
+
all_image_annotations[page_zero_index]["image"] = image_path
|
433 |
+
else:
|
434 |
+
print(f"No image path found for page {page}.")
|
435 |
|
436 |
+
return page_image_annotator_object, all_image_annotations
|
|
|
437 |
|
438 |
+
def replace_placeholder_image_with_real_image(doc_full_file_name_textbox:str, current_image_path:str, page_sizes_df:pd.DataFrame, page_num_reported:int, input_folder:str):
|
439 |
+
''' If image path is still not valid, load in a new image an overwrite it. Then replace all items in the image annotation object for all pages based on the updated information.'''
|
440 |
+
page_num_reported_zero_indexed = page_num_reported - 1
|
441 |
|
442 |
+
if not os.path.exists(current_image_path):
|
443 |
|
444 |
+
page_num, replaced_image_path, width, height = process_single_page_for_image_conversion(doc_full_file_name_textbox, page_num_reported_zero_indexed, input_folder=input_folder)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
445 |
|
446 |
+
# Overwrite page_sizes values
|
447 |
page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
|
448 |
page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
|
449 |
+
page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
|
450 |
+
|
451 |
+
else:
|
452 |
+
if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
|
453 |
+
width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
|
454 |
+
height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
|
455 |
+
else:
|
456 |
+
image = Image.open(current_image_path)
|
457 |
+
width = image.width
|
458 |
+
height = image.height
|
459 |
|
460 |
+
page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
|
461 |
+
page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
|
462 |
|
463 |
+
page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = current_image_path
|
464 |
|
465 |
+
replaced_image_path = current_image_path
|
466 |
+
|
467 |
+
return replaced_image_path, page_sizes_df
|
468 |
|
469 |
+
def update_annotator_object_and_filter_df(
|
470 |
+
all_image_annotations:List[AnnotatedImageData],
|
471 |
+
gradio_annotator_current_page_number:int,
|
472 |
+
recogniser_entities_dropdown_value:str="ALL",
|
473 |
+
page_dropdown_value:str="ALL",
|
474 |
+
text_dropdown_value:str="ALL",
|
475 |
+
recogniser_dataframe_base:gr.Dataframe=None, # Simplified default
|
476 |
+
zoom:int=100,
|
477 |
+
review_df:pd.DataFrame=None, # Use None for default empty DataFrame
|
478 |
+
page_sizes:List[dict]=[],
|
479 |
+
doc_full_file_name_textbox:str='',
|
480 |
+
input_folder:str=INPUT_FOLDER
|
481 |
+
) -> Tuple[image_annotator, gr.Number, gr.Number, int, str, gr.Dataframe, pd.DataFrame, List[str], List[str], List[dict], List[AnnotatedImageData]]:
|
482 |
+
'''
|
483 |
+
Update a gradio_image_annotation object with new annotation data for the current page
|
484 |
+
and update filter dataframes, optimizing by processing only the current page's data for display.
|
485 |
+
'''
|
486 |
+
zoom_str = str(zoom) + '%'
|
487 |
+
|
488 |
+
# Handle default empty review_df and recogniser_dataframe_base
|
489 |
+
if review_df is None or not isinstance(review_df, pd.DataFrame):
|
490 |
+
review_df = pd.DataFrame(columns=["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"])
|
491 |
+
if recogniser_dataframe_base is None: # Create a simple default if None
|
492 |
+
recogniser_dataframe_base = gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}))
|
493 |
+
|
494 |
+
|
495 |
+
# Handle empty all_image_annotations state early
|
496 |
+
if not all_image_annotations:
|
497 |
+
print("No all_image_annotation object found")
|
498 |
+
# Return blank/default outputs
|
499 |
+
blank_annotator = gr.ImageAnnotator(
|
500 |
+
value = None, boxes_alpha=0.1, box_thickness=1, label_list=[], label_colors=[],
|
501 |
+
show_label=False, height=zoom_str, width=zoom_str, box_min_size=1,
|
502 |
+
box_selected_thickness=2, handle_size=4, sources=None,
|
503 |
+
show_clear_button=False, show_share_button=False, show_remove_button=False,
|
504 |
+
handles_cursor=True, interactive=True, use_default_label=True
|
505 |
+
)
|
506 |
+
blank_df_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
|
507 |
+
blank_df_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
|
508 |
|
509 |
+
return (blank_annotator, gr.Number(value=1), gr.Number(value=1), 1,
|
510 |
+
recogniser_entities_dropdown_value, blank_df_out_gr, blank_df_modified,
|
511 |
+
[], [], [], []) # Return empty lists/defaults for other outputs
|
512 |
+
|
513 |
+
# Validate and bound the current page number (1-based logic)
|
514 |
+
page_num_reported = max(1, gradio_annotator_current_page_number) # Minimum page is 1
|
515 |
+
page_max_reported = len(all_image_annotations)
|
516 |
+
if page_num_reported > page_max_reported:
|
517 |
+
page_num_reported = page_max_reported
|
518 |
|
519 |
+
page_num_reported_zero_indexed = page_num_reported - 1
|
520 |
+
annotate_previous_page = page_num_reported # Store the determined page number
|
521 |
|
522 |
+
# --- Process page sizes DataFrame ---
|
523 |
+
page_sizes_df = pd.DataFrame(page_sizes)
|
524 |
+
if not page_sizes_df.empty:
|
525 |
+
page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
|
526 |
+
page_sizes_df.dropna(subset=["page"], inplace=True)
|
527 |
+
if not page_sizes_df.empty:
|
528 |
+
page_sizes_df["page"] = page_sizes_df["page"].astype(int)
|
529 |
+
else:
|
530 |
+
print("Warning: Page sizes DataFrame became empty after processing.")
|
531 |
+
|
532 |
+
# --- Handle Image Path Replacement for the Current Page ---
|
533 |
+
# This modifies the specific page entry within all_image_annotations list
|
534 |
+
# Assuming replace_annotator_object_img_np_array_with_page_sizes_image_path
|
535 |
+
# correctly updates the image path within the list element.
|
536 |
+
if len(all_image_annotations) > page_num_reported_zero_indexed:
|
537 |
+
# Make a shallow copy of the list and deep copy the specific page dict before modification
|
538 |
+
# to avoid modifying the input list unexpectedly if it's used elsewhere.
|
539 |
+
# However, the original code modified the list in place, so we'll stick to that
|
540 |
+
# pattern but acknowledge it.
|
541 |
+
page_object_to_update = all_image_annotations[page_num_reported_zero_indexed]
|
542 |
+
|
543 |
+
# Use the helper function to replace the image path within the page object
|
544 |
+
# Note: This helper returns the potentially modified page_object and the full state.
|
545 |
+
# The full state return seems redundant if only page_object_to_update is modified.
|
546 |
+
# Let's call it and assume it correctly updates the item in the list.
|
547 |
+
updated_page_object, all_image_annotations_after_img_replace = replace_annotator_object_img_np_array_with_page_sizes_image_path(
|
548 |
+
all_image_annotations, page_object_to_update, page_sizes, page_num_reported)
|
549 |
+
|
550 |
+
# The original code immediately re-assigns all_image_annotations.
|
551 |
+
# We'll rely on the function modifying the list element in place or returning the updated list.
|
552 |
+
# Assuming it returns the updated list for robustness:
|
553 |
+
all_image_annotations = all_image_annotations_after_img_replace
|
554 |
+
|
555 |
+
|
556 |
+
# Now handle the actual image file path replacement using replace_placeholder_image_with_real_image
|
557 |
+
current_image_path = updated_page_object.get('image') # Get potentially updated image path
|
558 |
+
|
559 |
+
if current_image_path and not page_sizes_df.empty:
|
560 |
+
try:
|
561 |
+
replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
|
562 |
+
doc_full_file_name_textbox, current_image_path, page_sizes_df,
|
563 |
+
page_num_reported, input_folder=input_folder # Use 1-based page num
|
564 |
+
)
|
565 |
|
566 |
+
# Update the image path in the state and review_df for the current page
|
567 |
+
# Find the correct entry in all_image_annotations list again by index
|
568 |
+
if len(all_image_annotations) > page_num_reported_zero_indexed:
|
569 |
+
all_image_annotations[page_num_reported_zero_indexed]['image'] = replaced_image_path
|
570 |
|
571 |
+
# Update review_df's image path for this page
|
572 |
+
if 'page' in review_df.columns and 'image' in review_df.columns:
|
573 |
+
# Ensure review_df page column is numeric for filtering
|
574 |
+
review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)
|
575 |
+
review_df.loc[review_df["page"]==page_num_reported, 'image'] = replaced_image_path
|
576 |
|
|
|
577 |
|
578 |
+
except Exception as e:
|
579 |
+
print(f"Error during image path replacement for page {page_num_reported}: {e}")
|
580 |
+
else:
|
581 |
+
print(f"Warning: Page index {page_num_reported_zero_indexed} out of bounds for all_image_annotations list.")
|
582 |
|
|
|
583 |
|
584 |
+
# Save back page_sizes_df to page_sizes list format
|
585 |
+
if not page_sizes_df.empty:
|
586 |
+
page_sizes = page_sizes_df.to_dict(orient='records')
|
587 |
+
else:
|
588 |
+
page_sizes = [] # Ensure page_sizes is a list if df is empty
|
589 |
+
|
590 |
+
# --- OPTIMIZATION: Prepare data *only* for the current page for display ---
|
591 |
+
current_page_image_annotator_object = None
|
592 |
+
if len(all_image_annotations) > page_num_reported_zero_indexed:
|
593 |
+
page_data_for_display = all_image_annotations[page_num_reported_zero_indexed]
|
594 |
+
|
595 |
+
# Convert current page annotations list to DataFrame for coordinate multiplication IF needed
|
596 |
+
# Assuming coordinate multiplication IS needed for display if state stores relative coords
|
597 |
+
current_page_annotations_df = convert_annotation_data_to_dataframe([page_data_for_display])
|
598 |
+
|
599 |
+
|
600 |
+
if not current_page_annotations_df.empty and not page_sizes_df.empty:
|
601 |
+
# Multiply coordinates *only* for this page's DataFrame
|
602 |
+
try:
|
603 |
+
# Need the specific page's size for multiplication
|
604 |
+
page_size_row = page_sizes_df[page_sizes_df['page'] == page_num_reported]
|
605 |
+
if not page_size_row.empty:
|
606 |
+
current_page_annotations_df = multiply_coordinates_by_page_sizes(
|
607 |
+
current_page_annotations_df, page_size_row, # Pass only the row for the current page
|
608 |
+
xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
|
609 |
+
)
|
610 |
+
|
611 |
+
except Exception as e:
|
612 |
+
print(f"Warning: Error during coordinate multiplication for page {page_num_reported}: {e}. Using original coordinates.")
|
613 |
+
# If error, proceed with original coordinates or handle as needed
|
614 |
+
|
615 |
+
if "color" not in current_page_annotations_df.columns:
|
616 |
+
current_page_annotations_df['color'] = '(0, 0, 0)'
|
617 |
+
|
618 |
+
# Convert the processed DataFrame back to the list of dicts format for the annotator
|
619 |
+
processed_current_page_annotations_list = current_page_annotations_df[["xmin", "xmax", "ymin", "ymax", "label", "color", "text", "id"]].to_dict(orient='records')
|
620 |
+
|
621 |
+
# Construct the final object expected by the Gradio ImageAnnotator value parameter
|
622 |
+
current_page_image_annotator_object: AnnotatedImageData = {
|
623 |
+
'image': page_data_for_display.get('image'), # Use the (potentially updated) image path
|
624 |
+
'boxes': processed_current_page_annotations_list
|
625 |
+
}
|
626 |
|
627 |
+
# --- Update Dropdowns and Review DataFrame ---
|
628 |
+
# This external function still operates on potentially large DataFrames.
|
629 |
+
# It receives all_image_annotations and a copy of review_df.
|
630 |
+
try:
|
631 |
+
recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_modified, recogniser_entities_dropdown_value, text_entities_drop, page_entities_drop = update_recogniser_dataframes(
|
632 |
+
all_image_annotations, # Pass the updated full state
|
633 |
+
recogniser_dataframe_base,
|
634 |
+
recogniser_entities_dropdown_value,
|
635 |
+
text_dropdown_value,
|
636 |
+
page_dropdown_value,
|
637 |
+
review_df.copy(), # Keep the copy as per original function call
|
638 |
+
page_sizes # Pass updated page sizes
|
639 |
+
)
|
640 |
+
# Generate default black colors for labels if needed by image_annotator
|
641 |
+
recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
|
642 |
|
643 |
+
except Exception as e:
|
644 |
+
print(f"Error calling update_recogniser_dataframes: {e}. Returning empty/default filter data.")
|
645 |
+
recogniser_entities_list = []
|
646 |
+
recogniser_colour_list = []
|
647 |
+
recogniser_dataframe_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
|
648 |
+
recogniser_dataframe_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
|
649 |
+
text_entities_drop = []
|
650 |
+
page_entities_drop = []
|
651 |
|
|
|
652 |
|
653 |
+
# --- Final Output Components ---
|
654 |
+
page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0)
|
655 |
|
|
|
656 |
|
|
|
|
|
|
|
|
|
|
|
657 |
|
658 |
+
### Present image_annotator outputs
|
659 |
+
# Handle the case where current_page_image_annotator_object couldn't be prepared
|
660 |
+
if current_page_image_annotator_object is None:
|
661 |
+
# This should ideally be covered by the initial empty check for all_image_annotations,
|
662 |
+
# but as a safeguard:
|
663 |
+
print("Warning: Could not prepare annotator object for the current page.")
|
664 |
+
out_image_annotator = image_annotator(value=None, interactive=False) # Present blank/non-interactive
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
665 |
else:
|
|
|
666 |
out_image_annotator = image_annotator(
|
667 |
value = current_page_image_annotator_object,
|
668 |
boxes_alpha=0.1,
|
669 |
box_thickness=1,
|
670 |
+
label_list=recogniser_entities_list, # Use labels from update_recogniser_dataframes
|
671 |
label_colors=recogniser_colour_list,
|
672 |
show_label=False,
|
673 |
height=zoom_str,
|
|
|
680 |
show_share_button=False,
|
681 |
show_remove_button=False,
|
682 |
handles_cursor=True,
|
683 |
+
interactive=True # Keep interactive if data is present
|
684 |
)
|
685 |
|
686 |
+
# The original code returned page_number_reported_gradio twice;
|
687 |
+
# returning the Gradio component and the plain integer value.
|
688 |
+
# Let's match the output signature.
|
689 |
+
return (out_image_annotator,
|
690 |
+
page_number_reported_gradio_comp,
|
691 |
+
page_number_reported_gradio_comp, # Redundant, but matches original return signature
|
692 |
+
page_num_reported, # Plain integer value
|
693 |
+
recogniser_entities_dropdown_value,
|
694 |
+
recogniser_dataframe_out_gr,
|
695 |
+
recogniser_dataframe_modified,
|
696 |
+
text_entities_drop, # List of text entities for dropdown
|
697 |
+
page_entities_drop, # List of page numbers for dropdown
|
698 |
+
page_sizes, # Updated page_sizes list
|
699 |
+
all_image_annotations) # Return the updated full state
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
700 |
|
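The hunk above is the core of the navigation speed-up: rather than rebuilding annotation data for every page, only the currently viewed page is converted to a DataFrame, scaled to that page's size and handed to the annotator. Below is a minimal sketch of that idea, assuming boxes are stored with relative (0-1) coordinates and that each `page_sizes` entry carries `image_width` / `image_height` keys; the helper name is illustrative rather than the app's actual function.

```python
# Minimal sketch of per-page preparation for the annotator (illustrative, not the app's code).
# Assumptions: each page dict has 'image' and 'boxes' keys with relative (0-1) coordinates,
# and page_sizes entries have 'page', 'image_width' and 'image_height' keys.
import pandas as pd

def prepare_current_page_for_annotator(all_image_annotations: list[dict],
                                       page_sizes: list[dict],
                                       page_num: int) -> dict:
    """Scale one page's relative boxes to pixels and build the annotator value."""
    page_data = all_image_annotations[page_num - 1]
    boxes_df = pd.DataFrame(page_data.get("boxes", []))

    size_row = next((p for p in page_sizes if p["page"] == page_num), None)
    if size_row is not None and not boxes_df.empty:
        # Relative coordinates (0-1) multiplied by the page's pixel dimensions
        for col in ("xmin", "xmax"):
            boxes_df[col] = boxes_df[col] * size_row["image_width"]
        for col in ("ymin", "ymax"):
            boxes_df[col] = boxes_df[col] * size_row["image_height"]

    if "color" not in boxes_df.columns:
        boxes_df["color"] = "(0, 0, 0)"

    return {"image": page_data.get("image"),
            "boxes": boxes_df.to_dict(orient="records")}

# Example: build the annotator value for page 3 only, leaving every other page untouched
# annotator_value = prepare_current_page_for_annotator(all_image_annotations, page_sizes, 3)
```

The point of the design is that the cost of coordinate conversion now scales with the number of boxes on one page, not with the size of the whole document.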
@@ +701 @@
def update_all_page_annotation_object_based_on_previous_page(
    page_image_annotator_object:AnnotatedImageData,
@@ +713 @@
    previous_page_zero_index = previous_page -1

    if not current_page: current_page = 1
+
+    # This replaces the numpy array image object with the image file path
+    page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)

    if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
    else: all_image_annotations[previous_page_zero_index]["boxes"] = []
@@ +744 @@
    page_image_annotator_object = all_image_annotations[current_page - 1]

    # This replaces the numpy array image object with the image file path
+    page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, current_page)
    page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]

    if not page_image_annotator_object:
@@ +780 @@
        # Check if all elements are integers in the range 0-255
        if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
            pass
+
        else:
            print(f"Invalid color values: {fill}. Defaulting to black.")
            fill = (0, 0, 0) # Default to black if invalid
@@ +804 @@
        doc = [image]

    elif file_extension in '.csv':
        pdf_doc = []

    # If working with pdfs
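Both branches above call `replace_annotator_object_img_np_array_with_page_sizes_image_path`, which swaps an in-memory numpy image handed back by the annotator for the stored image file path, so the page state kept between navigation events stays small. A rough, hypothetical sketch of that substitution is below; the `image_path` key on the `page_sizes` entries is an assumption made for the example.

```python
# Illustrative sketch of the image-path substitution described above (not the app's code).
# Assumption: page_sizes entries carry 'page' and 'image_path' keys.
import numpy as np

def use_image_path_instead_of_array(page_annotation: dict,
                                    page_sizes: list[dict],
                                    page_num: int) -> dict:
    """Keep stored state small by referencing the image file rather than raw pixels."""
    if isinstance(page_annotation.get("image"), np.ndarray):
        size_row = next((p for p in page_sizes if p["page"] == page_num), None)
        if size_row and size_row.get("image_path"):
            page_annotation["image"] = size_row["image_path"]
    return page_annotation
```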
@@ +1048 @@
    row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})

+    return row_value_df

def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):

    row_value_job_id = evt.row_value[0] # This is the page number value
    # row_value_label = evt.row_value[1] # This is the label number value
@@ +1078 @@
    return row_value_page, row_value_df

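For context, callbacks like `df_select_callback_textract_api` are the usual Gradio pattern for reacting to a row click on a `gr.Dataframe`. The sketch below shows the assumed wiring; the component names and output textbox are illustrative, and `evt.row_value` (the full selected row) is only available in recent Gradio releases.

```python
# Minimal sketch of wiring a row-select callback to a Gradio Dataframe (illustrative).
import gradio as gr
import pandas as pd

def on_job_row_select(df: pd.DataFrame, evt: gr.SelectData):
    # First cell of the clicked row, e.g. a Textract job ID
    return str(evt.row_value[0])

with gr.Blocks() as demo:
    jobs_df = gr.Dataframe(pd.DataFrame({"job_id": ["abc123"], "status": ["done"]}))
    selected_job = gr.Textbox(label="Selected job ID")
    jobs_df.select(on_job_row_select, inputs=jobs_df, outputs=selected_job)

# demo.launch()
```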
+def update_selected_review_df_row_colour(
+    redaction_row_selection: pd.DataFrame,
+    review_df: pd.DataFrame,
+    previous_id: str = "",
+    previous_colour: str = '(0, 0, 0)',
+    colour: str = '(1, 0, 255)'
+) -> tuple[pd.DataFrame, str, str]:
    '''
    Update the colour of a single redaction box based on the values in a selection row
+    (Optimized Version)
    '''

+    # Ensure 'color' column exists, default to previous_colour if previous_id is provided
+    if "color" not in review_df.columns:
+        review_df["color"] = previous_colour if previous_id else '(0, 0, 0)'
+
+    # Ensure 'id' column exists
    if "id" not in review_df.columns:
+        # Assuming fill_missing_ids is a defined function that returns a DataFrame
+        # It's more efficient if this is handled outside if possible,
+        # or optimized internally.
+        print("Warning: 'id' column not found. Calling fill_missing_ids.")
+        review_df = fill_missing_ids(review_df) # Keep this if necessary, but note it can be slow
+
+    # --- Optimization 1 & 2: Reset existing highlight colours using vectorized assignment ---
+    # Reset the color of the previously highlighted row
+    if previous_id and previous_id in review_df["id"].values:
+        review_df.loc[review_df["id"] == previous_id, "color"] = previous_colour
+
+    # Reset the color of any row that currently has the highlight colour (handle cases where previous_id might not have been tracked correctly)
+    # Convert to string for comparison only if the dtype might be mixed or not purely string
+    # If 'color' is consistently string, the .astype(str) might be avoidable.
+    # Assuming color is consistently string format like '(R, G, B)'
+    review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'

    if not redaction_row_selection.empty and not review_df.empty:
        use_id = (
+            "id" in redaction_row_selection.columns
+            and "id" in review_df.columns
+            and not redaction_row_selection["id"].isnull().all()
            and not review_df["id"].isnull().all()
        )

+        selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]

+        # --- Optimization 3: Use inner merge directly ---
+        # Merge to find rows in review_df that match redaction_row_selection
+        merged_reviews = review_df.merge(
+            redaction_row_selection[selected_merge_cols],
+            on=selected_merge_cols,
+            how="inner" # Use inner join as we only care about matches
+        )

+        if not merged_reviews.empty:
+            # Assuming we only expect one match for highlighting a single row
+            # If multiple matches are possible and you want to highlight all,
+            # the logic for previous_id and previous_colour needs adjustment.
+            new_previous_colour = str(merged_reviews["color"].iloc[0])
+            new_previous_id = merged_reviews["id"].iloc[0]
+
+            # --- Optimization 1 & 2: Update color of the matched row using vectorized assignment ---
+
+            if use_id:
+                # Faster update if using unique 'id' as merge key
+                review_df.loc[review_df["id"].isin(merged_reviews["id"]), "color"] = colour
+            else:
+                # More general case using multiple columns - might be slower
+                # Create a temporary key for comparison
+                def create_merge_key(df, cols):
+                    return df[cols].astype(str).agg('_'.join, axis=1)
+
+                review_df_key = create_merge_key(review_df, selected_merge_cols)
+                merged_reviews_key = create_merge_key(merged_reviews, selected_merge_cols)
+
+                review_df.loc[review_df_key.isin(merged_reviews_key), "color"] = colour
+
+            previous_colour = new_previous_colour
+            previous_id = new_previous_id
+        else:
+            # No rows matched the selection
+            print("No reviews found matching selection criteria")
+            # The reset logic at the beginning already handles setting color to (0, 0, 0)
+            # if it was the highlight colour and didn't match.
+            # No specific action needed here for color reset beyond what's done initially.
+            previous_colour = '(0, 0, 0)' # Reset previous_colour as no row was highlighted
+            previous_id = '' # Reset previous_id

    else:
+        # If selection is empty, reset any existing highlights
+        review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'
+        previous_colour = '(0, 0, 0)'
+        previous_id = ''

+    # Ensure column order is maintained if necessary, though pandas generally preserves order
+    # Creating a new DataFrame here might involve copying data, consider if this is strictly needed.
+    if set(["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]).issubset(review_df.columns):
+        review_df = review_df[["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]]
+    else:
+        print("Warning: Not all expected columns are present in review_df for reordering.")

    return review_df, previous_id, previous_colour

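The key change in `update_selected_review_df_row_colour` is replacing per-row iteration with vectorized `.loc` assignments: clear any existing highlight in one step, then set the new one by matching the selected row's unique id (or a composite key). A toy illustration of that pattern, with made-up ids and colours:

```python
# Toy illustration of the vectorized highlight swap used above (values are invented).
import pandas as pd

review_df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "label": ["PERSON", "EMAIL", "PERSON"],
    "color": ["(0, 0, 0)", "(1, 0, 255)", "(0, 0, 0)"],  # b2 is currently highlighted
})

highlight = "(1, 0, 255)"

# Clear any existing highlight in a single vectorized assignment
review_df.loc[review_df["color"] == highlight, "color"] = "(0, 0, 0)"

# Highlight the newly selected row by its unique id
selected_id = "c3"
review_df.loc[review_df["id"] == selected_id, "color"] = highlight

print(review_df)
```

Because both assignments are boolean-mask operations, the cost stays low even when the review DataFrame holds thousands of suggested redactions, which is the scenario the commit message targets.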
@@ +1285 @@
    page_sizes_df = pd.DataFrame(page_sizes)

    # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
    pages_are_images = False

    if "mediabox_width" not in review_file_df.columns:
@@ +1336 @@
                raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
        else:
            print("Document cropboxes not found.")

        pdf_page_height = pymupdf_page.mediabox.height
+        pdf_page_width = pymupdf_page.mediabox.width

        # Create redaction annotation
        redact_annot = SubElement(annots, 'redact')
@@ +1616 @@
            # Optionally, you can add the image path or other relevant information
            df.loc[_, 'image'] = image_path

    out_file_path = output_folder + file_path_name + "_review_file.csv"
    df.to_csv(out_file_path, index=None)
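The `SubElement(annots, 'redact')` call above belongs to the Adobe export path, where each suggested redaction becomes a `<redact>` element in an XFDF tree built with `xml.etree`. The sketch below shows the general shape of that; the exact attribute set Adobe expects (`rect`, `page`, `title` and so on) is assumed here rather than taken from the app.

```python
# Rough sketch of writing a redaction annotation into an XFDF tree with xml.etree.
# The attribute names used on the <redact> element are assumptions for illustration.
from xml.etree.ElementTree import Element, SubElement, tostring

xfdf = Element("xfdf", xmlns="http://ns.adobe.com/xfdf/")
annots = SubElement(xfdf, "annots")

redact_annot = SubElement(annots, "redact")
redact_annot.set("page", "0")                        # zero-based page index
redact_annot.set("rect", "100.0,200.0,250.0,220.0")  # x1,y1,x2,y2 in PDF points
redact_annot.set("title", "PERSON")

print(tostring(xfdf, encoding="unicode"))
```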